# valarray again

• 12-07-2007, 06:24 PM
jonnin
valarray again
Ok this is a performance question again... say we have

valarray<double> va;
valarray<double> vb;
valarray<double> vc;

... load va and vb up with, say, 1 million random numbers each, and resize vc to have room for 1 million numbers, etc.

then do
vc = va*vb;
and time this.

also do

int i;
for(i = 0; i < 1000000; i++)
vc[i] = va[i]*vb[i];

and time it

and also do
double * a = &va[0];
double * b = &vb[0];
double * c = &vc[0];

int i;
for(i = 0; i < 1000000; i++)
c[i] = a[i]*b[i];

and time this.

I get radical differences in the amount of time used, slowest to fastest from top to bottom. Why? I thought valarray was supposed to be fast?! I am about to toss them in the bloat box with vectors =(
• 12-08-2007, 09:52 AM
vijayan
valarray does seem to give comparable performance (with compiler optimizations enabled). This is what I get (on an old P3 laptop) with g++ 4.2:
Code:

```#include <iostream>
#include <valarray>
#include <boost/random.hpp>
#include <ctime>

using namespace std ;
using namespace boost ;

int main()
{
  enum { N = 1024*1024 };

  valarray<double> a(N), b(N), c(N) ;

  mt19937 engine ;
  uniform_real<> distribution( 0.0, 100.0 ) ;
  variate_generator< mt19937&, uniform_real<> > rng( engine, distribution ) ;

  for( int i=0 ; i<8 ; ++i )
  {
    for( size_t i=0 ; i<N ; ++i ) { a[i] = rng() ; b[i] = rng() ; }

    double* a1 = &a[0], *b1 = &b[0], *c1 = &c[0] ;

    clock_t begin = clock() ;
    for( size_t i=0 ; i<N ; ++i ) c1[i] = a1[i] * b1[i] ;
    clock_t end = clock() ;
    cout << "double operator* " << double(end-begin) / CLOCKS_PER_SEC << " seconds\n" ;

    begin = clock() ;
    c = a*b ;
    end = clock() ;
    cout << "valarray operator* " << double(end-begin) / CLOCKS_PER_SEC << " seconds\n" ;

    begin = clock() ;
    for( size_t i=0 ; i<N ; ++i ) c[i] = a[i] * b[i] ;
    end = clock() ;
    cout << "valarray[i] operator* " << double(end-begin) / CLOCKS_PER_SEC << " seconds\n" ;

    cout << "------------------------------------------------------\n" ;
  }
}```
>c++ -O3 valarray.cc && ./a.out
double operator* 0.0859375 seconds
valarray operator* 0.078125 seconds
valarray[i] operator* 0.0859375 seconds
------------------------------------------------------
double operator* 0.0859375 seconds
valarray operator* 0.078125 seconds
valarray[i] operator* 0.0859375 seconds
------------------------------------------------------
double operator* 0.0859375 seconds
valarray operator* 0.0859375 seconds
valarray[i] operator* 0.078125 seconds
------------------------------------------------------
double operator* 0.0703125 seconds
valarray operator* 0.0859375 seconds
valarray[i] operator* 0.078125 seconds
------------------------------------------------------
double operator* 0.078125 seconds
valarray operator* 0.0859375 seconds
valarray[i] operator* 0.078125 seconds
------------------------------------------------------
double operator* 0.0859375 seconds
valarray operator* 0.0859375 seconds
valarray[i] operator* 0.078125 seconds
------------------------------------------------------
double operator* 0.078125 seconds
valarray operator* 0.0859375 seconds
valarray[i] operator* 0.0859375 seconds
------------------------------------------------------
double operator* 0.078125 seconds
valarray operator* 0.0859375 seconds
valarray[i] operator* 0.078125 seconds
------------------------------------------------------

> I am about to toss them in the bloat box with vectors =(
You are right that it is unlikely C++ standard library implementations pay much attention to optimizing valarray performance. The people who introduced valarray into the C++ standard library left the committee long before the standard was finished, and these classes were not designed very well. There are also other libraries which perform much better; for a C++ programmer involved in numeric processing, the best is perhaps Blitz++: http://www.oonumerics.org/blitz/
That said, I would still expect valarrays to perform better than vectors.
• 12-08-2007, 02:19 PM
Danny
valarray was not supposed to be fast; it was supposed to give the flexibility that built-in arrays don't offer. So the question is: what is your control group? Are you comparing valarray to built-in arrays? In that case you should also consider the initialization of built-in arrays, error handling (e.g., bounds checking) and, of course, resizing should that be needed.
I would consider using std::tr1::array<T> instead, but your arrays might be just too big to fit into the process' stack memory.
• 12-08-2007, 06:54 PM
jonnin
They were actually pre-allocated pointers compared to pre-allocated valarrays, size 90000 (300x300), which is a fairly typical size for some of my stuff. I do not do bounds checking or any matrix error checking; I simply cannot afford the penalty. The idea was to see if slice could outperform manually extracting columns in some algorithms. Currently I transpose to get columns out (it makes them rows, paying a lot of page faults up front instead of more of them at random later).
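For reference, the two column-extraction approaches being compared can be sketched roughly like this (a minimal sketch; the function names and sizes are illustrative, not from the original code):

```cpp
#include <valarray>
#include <cstddef>

// Hypothetical helper: extract column j of a row-major rows x cols matrix
// stored in a valarray, using std::slice(start, count, stride):
// start = j, count = rows, stride = cols.
std::valarray<double> column(const std::valarray<double>& m,
                             std::size_t j,
                             std::size_t rows, std::size_t cols)
{
    return m[std::slice(j, rows, cols)];
}

// The hand-rolled alternative: walk the storage through a raw pointer
// and copy every cols-th element into a caller-supplied buffer.
void column_manual(std::valarray<double>& m, std::size_t j,
                   std::size_t rows, std::size_t cols, double* out)
{
    const double* p = &m[0];
    for (std::size_t r = 0; r < rows; ++r)
        out[r] = p[r * cols + j];
}
```

Both produce the same column; which is faster depends entirely on how the implementation's slice machinery is compiled.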

Maybe it's a Visual Studio issue instead, if his g++ gets such similar answers. It takes me .5 seconds to do a 300x300 matrix multiply with C arrays; the valarray version takes 2 seconds. Of that, .5 seconds go to slicing out columns, 1 second to the multiplies, and another .5 seconds to the sum function and assignment to the final answer. Swapping the multiply to a loop instead of the valarray * operator cut .5 seconds off. The sum function in the valarray, on the other hand, was super fast, far better than a loop.

I made a hack slice class that found a pointer to the locked class members (stride, size and something) so I could re-use my slice variables, but this had no effect on the performance (it worked, but it's one of those things you only do when playing).

I do not generally have to resize anything once it is created; at most it is reshaped, and I can pre-allocate the largest size it will ever be. But I do have to deal with some ugly algorithms: pseudoinverse, the Riccati equation, Lyapunov or Sylvester equations, various eigenvalue problems and the like.

I am fairly happy with the current code but once in a while I poke at it, the subject came up in another post and I started what-iffing.

I will try it on cygwin & our gentoo compilers and see what happens. I had planned on that but this is a spare time thing.
• 12-09-2007, 08:10 PM
Danny
In that case I suspect that your valarray's allocator is the culprit. Although it's not exactly an easy task, I would try to plug in a different allocator which preallocates the entire matrix on the stack and see if that helps. I would also check how operator* is implemented (i.e., disassemble the code, not just read the source). I believe the standard valarray class can perform much better than what you're experiencing.
• 12-10-2007, 06:33 AM
Razee Marikar
Did you try turning on code optimizations? The Visual C++ compiler has optimizations turned off and debugging turned on by default. Try changing the compiler options. There are function call overheads associated with the class versions (vector and valarray), but with optimizations on, the difference should be small.
• 12-10-2007, 09:59 AM
jonnin
Yes, I played with the optimization settings; without them the valarray code takes 10 seconds, not 2, for the same code. The C array version went to .85 seconds.
• 12-10-2007, 11:56 AM
jonnin
cygwin's g++ gave .5 seconds, must be a setting or a problem in .net's implementation. Perhaps this is the same troubles I have had with vector and the like ... they have all had a sluggish feel in .net (which I am stuck with for now). I will dig around and see if I can unravel exactly where the issue is, starting with Danny's suggestions.
• 12-10-2007, 03:10 PM
Danny
Please keep us posted. I suspect that .NET gets in the way in certain ways because it has to store additional information about the code itself, to allow decompilation for instance. I would disassemble the code and compare it with the optimized and non-optimized output of Cygwin. Microsoft has never excelled at performance, I regret to say, especially when trying to stick its proprietary extensions into standard C++ code. You should also check whether the CRT uses Microsoft's "safe" versions of standard library functions: they silently replace fast C code with replacements which are purportedly safer but immensely slower.
• 12-10-2007, 03:41 PM
jonnin
I will certainly post any revelations. It will take me a bit to unravel the assembler outputs, most likely many days since this is on the side.

I tried the C array multiplier that was .5 seconds on .NET: it drops to .375 on Cygwin with the transpose removed (the transpose reduces page faults, but without it the algorithm matches the valarray one, for fairness). Cygwin with the transpose gets .2 seconds single-threaded, which is pretty sweet since it can easily be split across 2 processors; that would be .1 seconds for a 300x300 (untested, but I am 95% confident this would be the case).

I also tweaked the valarray version in .NET, getting it down to .6 seconds by bypassing several [] operators, simplifying one of the slices (one of the slices is across rows, so memcpy works), and combining the sum of products into a single loop rather than using the built-in tools. So it now is half valarray and half double-pointer code. That version was .48 seconds on Cygwin, not enough change to matter.

As best as I can tell from all this, in .NET the valarray [] operator is sluggish, slices can be sluggish (it takes as long to slice out a row as a column, which is *wrong*, because the columns incur page faults while the rows are just sequential locations!) and assignment operations are possibly sluggish (is it bounds checking??).

And the possibility looms that the optimization flags I use are just better for the C arrays or something; I will revisit those as well.

... to be continued
• 01-04-2008, 11:03 AM
jonnin
In general, it seems that the [] operator is behind the bulk of the sluggishness in Visual C++.

A double pointer (my old class) does an assignment like a[x] = y in 2 assembly operations: an fld and fstp pair.

The valarray<double> does the same thing in 12 operations, 8 of which are in an invoked subroutine, which means a jump (the pipeline and more might get hosed up because of this; I did not have time to look that deeply). 10 extra instructions do not *seem* that bad, but 3 of them are pushes and 5 are mov operations, so it uses 8 moderately slow instructions.

For a small number of assignments this is not noticeable, but for something like the matrix multiply, where you must access 3 (huge) arrays (result = a x b) in a tight loop, the penalty is very much a problem in real-time code.
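The tight loop in question is essentially a plain raw-pointer multiply; a minimal sketch of it (names and sizes are illustrative, not from the original code):

```cpp
#include <cstddef>

// result = a x b for n x n row-major matrices through raw pointers.
// The hot loop is just loads, a multiply-add into a local accumulator,
// and one store per output element, with no per-access operator[] call.
void matmul(const double* a, const double* b, double* result, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
        {
            double s = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                s += a[i * n + k] * b[k * n + j];
            result[i * n + j] = s;
        }
}
```

With n = 300 this body runs 27 million times, so even a few extra instructions per element access multiply into a visible fraction of a second.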

On a happy note, while messing around with it all I was able to tweak my double-pointer version to go faster for non-sparse matrices and also split it to exploit a multi-processor (multi-core) machine. I am going to stick with the double pointers for now; the main program uses about 20 external devices, and moving from compiler to compiler (that's a lot of vendor libraries to re-configure) will have to wait until everything is finished.
• 01-05-2008, 11:15 AM
Danny
Interesting! Where do the extra 10 instructions come from? I thought the compiler should be able to inline the [] operator call. Is the body of the overloaded [] operator that complex? Can you suggest a way to improve its implementation at the source (i.e., C++) level?
• 01-05-2008, 07:41 PM
jonnin
Right in the middle of the assembly is a line to
call bighugename // with a comment saying the name represents operator []

I can post the assembler later next week. Sure, it could be improved: one way is simply to take a double * to valarray[0] and use that instead -- the code reverts back to nearly the same as the straight double-pointer (fast) code. But I do not think it actually compiles the source for valarray and all; I think it links in a library call (which is probably where the call/jump in the assembler is taking us). I doubt it can be fixed in a legit way, only by some sort of hackery.

Even if it did inline the call, it would still be 10 extra data movement instructions that have nothing at all to do with moving a double into a memory location. I was not able to discover the purpose of the stack pushes (could not figure out where they got popped out & used).
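The pointer workaround described above can be sketched like this (a minimal sketch; the function name is illustrative, and it assumes the valarrays are not resized while the pointers are held):

```cpp
#include <valarray>
#include <cstddef>

// Take a double* to the first element once, then index through the raw
// pointer, so operator[] is invoked a handful of times instead of once
// per element inside the hot loop.
double dot(std::valarray<double>& a, std::valarray<double>& b)
{
    const std::size_t n = a.size();
    const double* pa = &a[0];   // one operator[] call instead of n
    const double* pb = &b[0];
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        s += pa[i] * pb[i];
    return s;
}
```

This keeps the valarray containers for storage and the library-wide operations (sum, etc.) while reverting the hot loops to plain pointer arithmetic.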
• 01-07-2008, 11:52 AM
Danny
So we can conclude that the sluggish performance is an artifact of a crappy implementation rather than an inherent limitation of the Standard, right? That's a relief!
• 01-08-2008, 08:43 AM
vijayan
You could try compiling with the /G7, /arch:SSE2 and /fp:fast compiler options.
The default setting for VC++ 8.0 is /fp:precise. This causes rounding to the declared precision to be performed every time an assignment is made or a floating-point value is returned, and may explain the extra instructions that are generated.
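A hypothetical command line combining those flags (the source file name is illustrative):

```shell
# optimize, tune for newer processors (/G7), generate SSE2 code, and use
# fast (less strict) floating-point semantics instead of /fp:precise
cl /O2 /G7 /arch:SSE2 /fp:fast matmul_bench.cpp
```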