-
I will take a look, but wouldn't those affect any doubles in the system, regardless of container used? Also these are copy instructions, not math, so there should be no rounding or anything at all involved (?!).
-
> but wouldn't those affect any doubles in the system, regardless of container used?
yes, i think they would affect all doubles. ( #pragma float_control would allow selective fast floating point operations. )
> Also these are copy instructions, not math, so there should be no rounding or anything at all involved (?!).
my guess (this is a guess) is that under fp:precise, the double value would be moved to a floating point register (where it would be held in register precision) and then the contents of the register would be moved to the target of the assignment. this could explicitly force the compiler to narrow the intermediate result to the target precision of the left-hand-side of the assignment, in this case to double.
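One way to check the guess that a plain copy leaves the value untouched is to compare bit patterns before and after an assignment. This is only a minimal sketch, assuming a 64-bit IEEE double; `bits_of` and `copy_preserves_bits` are made-up helper names, not part of any library.

```cpp
#include <cstdint>
#include <cstring>

// Return the raw 64-bit pattern of a double (assumes sizeof(double) == 8).
std::uint64_t bits_of(double d)
{
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch expects a 64-bit double");
    std::uint64_t u = 0;
    std::memcpy(&u, &d, sizeof u);
    return u;
}

// Copy src into a second double through a volatile store, so the compiler
// has to emit the assignment, then compare the two bit patterns.
bool copy_preserves_bits(double src)
{
    volatile double dst = src;
    return bits_of(dst) == bits_of(src);
}
```

For ordinary values such as 12.3 I would expect `copy_preserves_bits` to return true whichever floating point model is selected, since no arithmetic is involved; NaN payloads are a separate matter and are not covered here.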
Last edited by vijayan; 01-08-2008 at 01:00 PM.
-
This is just the data movement portion of the assembly. Compiled in release mode with sensible optimization set, but not the low precision (which would mess up my numerics for the real project anyway).
straight up c++ double pointer:
; 11 : va[1] = 1;
fld1
fstp QWORD PTR _va$[ebp+8]
Straight up valarray code:
push 1
lea ecx, DWORD PTR _va$[ebp]
call ??A?$valarray@N@std@@QAEAANI@Z ; std::valarray<double>::operator[]
fld1
fstp QWORD PTR [eax]
where the called function is this:
It looks like the second set of code after the first return may not execute, I did not notice that before.
??A?$valarray@N@std@@QAEAANI@Z PROC NEAR ; std::valarray<double>::operator[], COMDAT
; _this$ = ecx
; 273 : { // subscript mutable sequence
push ebp
mov ebp, esp
push ecx
mov DWORD PTR _this$[ebp], ecx
; 274 : return (_Myptr[_Off]);
mov eax, DWORD PTR _this$[ebp]
mov eax, DWORD PTR [eax]
mov ecx, DWORD PTR __Off$[ebp]
lea eax, DWORD PTR [eax+ecx*8]
; 275 : }
leave
ret 4
Valarray using a double pointer hack:
; 14 : pva[100] = 12.3;
mov eax, DWORD PTR _pva$[ebp]
fld QWORD PTR __real@402899999999999a
fstp QWORD PTR [eax+800]
But what seems to be killing us is that procedure call, the extra instructions do not help but the useless jump is brutal.
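For reference, the "double pointer hack" amounts to taking the address of the first element once and indexing through that pointer afterwards. A minimal sketch follows; `fill_subscript` and `fill_pointer` are hypothetical names used only for illustration, and the pointer version relies on valarray storage being contiguous.

```cpp
#include <cstddef>
#include <valarray>

// Fill through valarray::operator[]. If the compiler does not inline the
// operator, every iteration pays for a procedure call.
void fill_subscript(std::valarray<double>& va, double value)
{
    for (std::size_t i = 0; i < va.size(); ++i)
        va[i] = value;
}

// Fill the same storage through a raw pointer obtained once.
void fill_pointer(std::valarray<double>& va, double value)
{
    if (va.size() == 0)
        return;                       // &va[0] needs at least one element
    double* p = &va[0];               // elements are stored contiguously
    for (std::size_t i = 0, n = va.size(); i < n; ++i)
        p[i] = value;
}
```

Both functions leave the valarray in the same state; only the way each element is reached differs, which is what the listings above compare.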
-
Danny, Hack, can we disable the smiles again?
-
> But what seems to be killing us is that procedure call ...
you are absolutely right.
just to check it out, i generated the assembly code with the gcc (4.2) compiler (which had earlier given similar timings for valarrays and c style arrays). and there is hardly any difference in the machine code generated with the gnu toolset.
the microsoft compiler and/or their valarray implementation is the culprit for the dreadful performance.
here is code i tried with gcc 4.2.
Code:
#include <valarray>

void foo_valarray( std::valarray<double>& va, size_t i, double d )
{
    va[i] = d ;
}

void foo_c_array( double* ca, size_t i, double d )
{
    ca[i] = d ;
}

extern std::valarray<double> va ;
void bar_valarray()
{
    va[1] = 1.0 ;
}

extern double ca[] ;
void bar_c_array()
{
    ca[1] = 1.0 ;
}
c++ -c -O3 -fomit-frame-pointer -Wa,-a,-ad -march=pentium4 valarray2.cc > valarray2.asm
and this is the assembly code generated.
Code:
GAS LISTING /var/tmp//ccqTgL2o.s page 1
1 .file "valarray2.cc"
2 .text
3 .align 2
4 .globl _Z11foo_c_arrayPdjd
6 _Z11foo_c_arrayPdjd:
7 .LFB2240:
8 0000 DD44240C fldl 12(%esp)
9 0004 8B542408 movl 8(%esp), %edx
10 0008 8B442404 movl 4(%esp), %eax
11 000c DD1CD0 fstpl (%eax,%edx,8)
12 000f C3 ret
13 .LFE2240:
15 .globl __gxx_personality_v0
16 .align 2
17 .globl _Z11bar_c_arrayv
19 _Z11bar_c_arrayv:
20 .LFB2242:
21 0010 C7050800 movl $0, ca+8
21 00000000
21 0000
22 001a C7050C00 movl $1072693248, ca+12
22 00000000
22 F03F
23 0024 C3 ret
24 .LFE2242:
26 0025 90 .align 2
27 .globl _Z12bar_valarrayv
29 _Z12bar_valarrayv:
30 .LFB2241:
31 0026 A1040000 movl va+4, %eax
31 00
32 002b C7400800 movl $0, 8(%eax)
32 000000
33 0032 C7400C00 movl $1072693248, 12(%eax)
33 00F03F
34 0039 C3 ret
35 .LFE2241:
37 .align 2
38 .globl _Z12foo_valarrayRSt8valarrayIdEjd
40 _Z12foo_valarrayRSt8valarrayIdEjd:
41 .LFB2239:
42 003a 8B442404 movl 4(%esp), %eax
43 003e 8B5004 movl 4(%eax), %edx
44 0041 DD44240C fldl 12(%esp)
45 0045 8B442408 movl 8(%esp), %eax
46 0049 DD1CC2 fstpl (%edx,%eax,8)
47 004c C3 ret
48 .LFE2239:
50 .ident "GCC: (GNU) 4.2.3 20071024 (prerelease)"
GAS LISTING /var/tmp//ccqTgL2o.s page 2
DEFINED SYMBOLS
*ABS*:0000000000000000 valarray2.cc
/var/tmp//ccqTgL2o.s:6 .text:0000000000000000 _Z11foo_c_arrayPdjd
/var/tmp//ccqTgL2o.s:19 .text:0000000000000010 _Z11bar_c_arrayv
/var/tmp//ccqTgL2o.s:29 .text:0000000000000026 _Z12bar_valarrayv
/var/tmp//ccqTgL2o.s:40 .text:000000000000003a _Z12foo_valarrayRSt8valarrayIdEjd
UNDEFINED SYMBOLS
__gxx_personality_v0
ca
va
Last edited by vijayan; 01-09-2008 at 04:00 PM.
-
 Originally Posted by jonnin
Danny, Hack, can we disable the smiles again?
I think they're disabled between [CODE] tags and enabled otherwise. I'll ask our admin to cancel them completely.
Danny Kalev
-
I think I will leave it here, as my time to play with this is limited. If anyone else has additional commentary I will be reading...
-
 Originally Posted by vijayan
> But what seems to be killing us is that procedure call ...
you are absolutely right.
For those of us who do not speak machine code as their first (or second) language, can you summarize your conclusions please? Where's the problem with the MS generated code? Can you at least try to guess why VC++ generates that procedure call and what in heaven's name that procedure does? Finally, is it possible to hand optimize the VC++ machine code and eliminate that procedure call?
Danny Kalev
-
ok:
the problem simply is the line with the call instruction, which invokes a procedure. This makes a jump, which breaks the pipeline and costs you (potentially) page faults, lost instructions (re-fill of the pipelines (there are 2 per intel cpu)) and the like. Worse, these are slow instructions which write to memory.
You could hand optimize it, but it's not trivial to re-write the built in library: you would have to either generate the assembler, modify it (every time you compile!) and then assemble that, or hack on the library itself, or something. It's easier just to use an array of doubles IMO; I was only going to use the valarray to do built in A*B and A+B sorts of things, which are trivial loops. If you *had* to have it you could probably download a better valarray class. In the old days you could patch a new stl into vs 6.0 because its built in one was awful.
I was unable to figure out what that code actually does. It's just a few lines moving a few bytes around, presumably for some internal safety check or baby sitting or something. If the compiler could just inline it, I suspect that would clear up the performance hits for the most part. As to why: it seems that this is the routine that is invoked whenever the compiler sees the valarray [] operator, going by the comment in the machine language. So it looks like it just inserts this call whenever it sees a [] operation, which is fairly horrible since I am willing to bet the internal valarray code also uses its own [] operator in places... (I did not check this)
Finally, g++ generates code without the procedure call. It's nearly the same as the double array code and therefore just as fast (not even registerable differences, and this is *me* talking).
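The "trivial loops" mentioned above, written against plain arrays of doubles, could look roughly like this. It is only a sketch; `multiply_arrays` and `add_arrays` are hypothetical names, and the caller is assumed to supply three arrays of at least n elements.

```cpp
#include <cstddef>

// Element-wise product: c[i] = a[i] * b[i] for i in [0, n).
void multiply_arrays(const double* a, const double* b, double* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] * b[i];
}

// Element-wise sum: c[i] = a[i] + b[i] for i in [0, n).
void add_arrays(const double* a, const double* b, double* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

With no operator[] call in sight, there is nothing here for the compiler to fail to inline, which is the whole appeal of this route.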
-
> For those of us who do not speak machine code as their first (or second) language, can you summarize your conclusions ...
i guess i could skip this as i can't see how i can improve on what jonnin has posted about this.
> Can you at least try to guess why VC++ generates that procedure call and what in heaven's name that procedure does?
that procedure is the out of line implementation of valarray<>::operator[]
the version of vc++ that generates the procedure call is vc++ 6.0. it is a ten year old compiler with poor standards compliance and libraries.
i tried the same code with vc++ 8.0 (Visual C++ Express 2005) and the code it generates is much much better. the array subscript operator is inlined and performance is comparable to that of gcc 4.2
with the switches /O2 /Ob1 /Oi /Oy /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /FAs /Fa"Release\\" /Fo"Release\\" /Fd"Release\vc80.pdb" /W3 /nologo /c /Wp64 /Zi /TP
this is the code that is generated:
Code:
; Listing generated by Microsoft (R) Optimizing Compiler Version 14.00.50727.762
TITLE c:\cygwin\home\projects\valarray_test\valarray_test\valarray2.cpp
.686P
.XMM
include listing.inc
.model flat
INCLUDELIB OLDNAMES
PUBLIC ??A?$valarray@N@std@@QAEAANI@Z ; std::valarray<double>::operator[]
EXTRN @__security_check_cookie@4:PROC
; Function compile flags: /Ogtpy
; File c:\program files\microsoft visual studio 8\vc\include\valarray
; COMDAT ??A?$valarray@N@std@@QAEAANI@Z
_TEXT SEGMENT
??A?$valarray@N@std@@QAEAANI@Z PROC ; std::valarray<double>::operator[], COMDAT
; _this$ = eax
; __Off$ = edx
; 295 : return (_Myptr[_Off]);
mov ecx, DWORD PTR [eax]
lea eax, DWORD PTR [ecx+edx*8]
; 296 : }
ret 0
??A?$valarray@N@std@@QAEAANI@Z ENDP ; std::valarray<double>::operator[]
_TEXT ENDS
PUBLIC __real@3ff0000000000000
PUBLIC ?bar_c_array@@YAXXZ ; bar_c_array
; COMDAT __real@3ff0000000000000
; File c:\cygwin\home\projects\valarray_test\valarray_test\valarray2.cpp
CONST SEGMENT
__real@3ff0000000000000 DQ 03ff0000000000000r ; 1
; Function compile flags: /Ogtpy
CONST ENDS
; COMDAT ?bar_c_array@@YAXXZ
_TEXT SEGMENT
?bar_c_array@@YAXXZ PROC ; bar_c_array, COMDAT
; 22 : ca[1] = 1.0 ;
fld1
fstp QWORD PTR ?ca@@3PANA+8
; 23 : }
ret 0
?bar_c_array@@YAXXZ ENDP ; bar_c_array
_TEXT ENDS
PUBLIC ?bar_valarray@@YAXXZ ; bar_valarray
; Function compile flags: /Ogtpy
; COMDAT ?bar_valarray@@YAXXZ
_TEXT SEGMENT
?bar_valarray@@YAXXZ PROC ; bar_valarray, COMDAT
; 16 : va[1] = 1.0 ;
fld1
mov eax, DWORD PTR ?va@@3V?$valarray@N@std@@A
fstp QWORD PTR [eax+8]
; 17 : }
ret 0
?bar_valarray@@YAXXZ ENDP ; bar_valarray
_TEXT ENDS
PUBLIC ?foo_c_array@@YAXPANIN@Z ; foo_c_array
; Function compile flags: /Ogtpy
; COMDAT ?foo_c_array@@YAXPANIN@Z
_TEXT SEGMENT
_ca$ = 8 ; size = 4
_i$ = 12 ; size = 4
_d$ = 16 ; size = 8
?foo_c_array@@YAXPANIN@Z PROC ; foo_c_array, COMDAT
; 10 : ca[i] = d ;
fld QWORD PTR _d$[esp-4]
mov eax, DWORD PTR _i$[esp-4]
mov ecx, DWORD PTR _ca$[esp-4]
fstp QWORD PTR [ecx+eax*8]
; 11 : }
ret 0
?foo_c_array@@YAXPANIN@Z ENDP ; foo_c_array
_TEXT ENDS
PUBLIC ?foo_valarray@@YAXAAV?$valarray@N@std@@IN@Z ; foo_valarray
; Function compile flags: /Ogtpy
; COMDAT ?foo_valarray@@YAXAAV?$valarray@N@std@@IN@Z
_TEXT SEGMENT
_va$ = 8 ; size = 4
_i$ = 12 ; size = 4
_d$ = 16 ; size = 8
?foo_valarray@@YAXAAV?$valarray@N@std@@IN@Z PROC ; foo_valarray, COMDAT
; 5 : va[i] = d ;
mov eax, DWORD PTR _va$[esp-4]
fld QWORD PTR _d$[esp-4]
mov ecx, DWORD PTR [eax]
mov edx, DWORD PTR _i$[esp-4]
fstp QWORD PTR [ecx+edx*8]
; 6 : }
ret 0
?foo_valarray@@YAXAAV?$valarray@N@std@@IN@Z ENDP ; foo_valarray
END
however, while trying out this test code (same as with gcc)
Code:
#include <iostream>
#include <valarray>
#include <boost/random.hpp>
#include <ctime>
using namespace std ;
// boost names are qualified explicitly so they cannot clash with std::mt19937
int main()
{
    enum { N = 1024*1024 };
    valarray<double> a(N), b(N), c(N) ;
    boost::mt19937 engine ;
    boost::uniform_real<> distribution( 0.0, 100.0 ) ;
    boost::variate_generator< boost::mt19937&, boost::uniform_real<> > rng( engine, distribution ) ;
    for( int trial=0 ; trial<8 ; ++trial )
    {
        // fresh random inputs for every trial
        for( size_t i=0 ; i<N ; ++i ) { a[i] = rng() ; b[i] = rng() ; }
        double* a1 = &a[0], *b1 = &b[0], *c1 = &c[0] ;

        // 1. raw pointers into the valarray storage
        clock_t begin = clock() ;
        for( size_t i=0 ; i<N ; ++i ) c1[i] = a1[i] * b1[i] ;
        clock_t end = clock() ;
        cout << "double operator* " << double(end-begin) / CLOCKS_PER_SEC << " seconds\n" ;

        // 2. whole-array valarray expression
        begin = clock() ;
        c = a*b ;
        end = clock() ;
        cout << "valarray operator* " << double(end-begin) / CLOCKS_PER_SEC << " seconds\n" ;

        // 3. element-wise loop through valarray::operator[]
        begin = clock() ;
        for( size_t i=0 ; i<N ; ++i ) c[i] = a[i] * b[i] ;
        end = clock() ;
        cout << "valarray[i] operator* " << double(end-begin) / CLOCKS_PER_SEC << " seconds\n" ;
        cout << "------------------------------------------------------\n" ;
    }
}
the results are like
------------------------------------------------------
double operator* 0.015 seconds
valarray operator* 0.047 seconds
valarray[i] operator* 0.015 seconds
------------------------------------------------------
the reason why c = a*b is less efficient than
for( size_t i=0 ; i<N ; ++i ) c[i] = a[i] * b[i] ;
is that the expression a*b causes the creation of a temporary valarray which is then copied into c. vc++ 8.0 does not optimize away the creation of this anonymous temporary.
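One way to sidestep that temporary is to copy one operand into the destination and then use the compound assignment operator. A minimal sketch, assuming all three valarrays already have the same size; the function names are made up for illustration.

```cpp
#include <valarray>

// c = a * b: the product may be materialised as a temporary valarray
// and then copied into c.
void multiply_with_temporary(const std::valarray<double>& a,
                             const std::valarray<double>& b,
                             std::valarray<double>& c)
{
    c = a * b;
}

// Same result without the intermediate product: copy a into c,
// then multiply in place by b.
void multiply_in_place(const std::valarray<double>& a,
                       const std::valarray<double>& b,
                       std::valarray<double>& c)
{
    c = a;
    c *= b;
}
```

Whether the second form is actually faster depends on the compiler and library; it trades the temporary for one extra pass over c, so it would need to be timed the same way as the loops above.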
Last edited by vijayan; 01-10-2008 at 05:35 AM.
-
That was .net 2005 actually. I tossed 6.0 for .net 2002, a bit later than I would have liked. But not the express edition, we have the version just under the extremely expensive one, professional maybe?
Last edited by jonnin; 01-10-2008 at 09:57 AM.
-
so the problem with the generated code still persists with VC 8.0?
Danny Kalev
-
Scratch that, I need to redo it in 2005. The test project I was using somehow had reverted to 2002 and since they look the same, I did not notice! So yes, it still does it in 2002 but may not in 2005.
I need to make some change in 2002 so I can tell if it decides to open and surprise me again. Thankfully, I verified that my real projects are using 2005.
I will try to retime it this weekend.
-
> so the problem with the generated code still persists with VC 8.0?
the array subscript operation does not have a problem with vc++ 8.0; it compares well with the performance of g++ (same as that for a c style array).
at least with the compiler settings that i tried, it does have a problem with expressions of these types: c = a * b ; or c = a + b ;
but not with c *= a ; or c += b ; or for( size_t i=0 ; i<N ; ++i ) c[i] = a[i] * b[i] ;
perhaps more aggressive optimization settings might make the problem go away; i've only had a cursory look at it so far.
-
No, it was doing it in 2002 (visual 7) not 8. My computer randomly decided to open the old IDE on my test project. I will have to redo a lot of this stuff with 2005 and see what the deal is. I started from the top, a full matrix multiply, and so far it's .06 seconds for the double pointer version and .5 seconds for an OOP approach using slices and the sum and * operations from valarray. I will post more later on, I have a busy week or so ahead of me.
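The two matrix multiply styles being compared might look something like the following. This is a guess at the shape of the code, not the actual test project: it assumes square n x n matrices stored row-major, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <valarray>

// Plain pointer version: c = a * b for n x n row-major matrices.
void matmul_pointer(const double* a, const double* b, double* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
        {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

// valarray version: row i of a times column j of b, then sum().
// Every slice and every product creates a new valarray.
void matmul_slices(const std::valarray<double>& a,
                   const std::valarray<double>& b,
                   std::valarray<double>& c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        const std::valarray<double> row = a[std::slice(i * n, n, 1)];
        for (std::size_t j = 0; j < n; ++j)
        {
            const std::valarray<double> col = b[std::slice(j, n, n)];
            const std::valarray<double> prod = row * col;
            c[i * n + j] = prod.sum();
        }
    }
}
```

Written this way, the slice version allocates two valarrays per output element, which would be one plausible source of a large gap between the two timings.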