
Thread: valarray again

  1. #16
    Join Date
    Dec 2003
    Posts
    3,366
    I will take a look, but wouldn't those affect any doubles in the system, regardless of container used? Also these are copy instructions, not math, so there should be no rounding or anything at all involved (?!).

  2. #17
    Join Date
    Dec 2007
    Posts
    401
    > but wouldn't those affect any doubles in the system, regardless of container used?
    yes, i think they would affect all doubles. ( #pragma float_control would allow selective fast floating point operations. )
    > Also these are copy instructions, not math, so there should be no rounding or anything at all involved (?!).
    my guess (this is a guess) is that under fp:precise, the double value would be moved to a floating point register (where it would be held in register precision) and then the contents of the register would be moved to the target of the assignment. this could explicitly force the compiler to narrow the intermediate result to the target precision of the left-hand side of the assignment, in this case to double.
    Last edited by vijayan; 01-08-2008 at 01:00 PM.

  3. #18
    Join Date
    Dec 2003
    Posts
    3,366
    This is just the data movement portion of the assembly. Compiled in release mode with sensible optimization set, but not the low precision (which would mess up my numerics for the real project anyway).

    straight up c++ double pointer:

    ; 11 : va[1] = 1;

    fld1
    fstp QWORD PTR _va$[ebp+8]


    Straight up valarray code:

    push 1
    lea ecx, DWORD PTR _va$[ebp]
    call ??A?$valarray@N@std@@QAEAANI@Z ; std::valarray<double>::operator[]
    fld1
    fstp QWORD PTR [eax]

    where the called function is this:
    It looks like the second set of code after the first return may not execute, I did not notice that before.

    ??A?$valarray@N@std@@QAEAANI@Z PROC NEAR ; std::valarray<double>::operator[], COMDAT
    ; _this$ = ecx

    ; 273 : { // subscript mutable sequence

    push ebp
    mov ebp, esp
    push ecx
    mov DWORD PTR _this$[ebp], ecx

    ; 274 : return (_Myptr[_Off]);

    mov eax, DWORD PTR _this$[ebp]
    mov eax, DWORD PTR [eax]
    mov ecx, DWORD PTR __Off$[ebp]
    lea eax, DWORD PTR [eax+ecx*8]

    ; 275 : }

    leave
    ret 4



    Valarray using a double pointer hack:

    ; 14 : pva[100] = 12.3;

    mov eax, DWORD PTR _pva$[ebp]
    fld QWORD PTR __real@402899999999999a
    fstp QWORD PTR [eax+800]


    But what seems to be killing us is that procedure call. The extra instructions do not help, but the useless jump is brutal.
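For anyone who wants to reproduce the two access patterns compared above, here is a minimal sketch. The function names `fill_subscript` and `fill_pointer` are made up for illustration; the pointer version assumes the valarray is not resized while the pointer is in use, and relies on the standard's element-access guarantee that `&a[i+j] == &a[i] + j`.

```cpp
#include <cstddef>
#include <valarray>

// Store through valarray::operator[] on every iteration. If the compiler
// does not inline operator[], each store pays for a function call.
void fill_subscript(std::valarray<double>& va, double value)
{
    for (std::size_t i = 0; i < va.size(); ++i)
        va[i] = value;
}

// "Pointer hack": take the address of element 0 once, then index the raw
// pointer. No call to operator[] remains inside the loop.
void fill_pointer(std::valarray<double>& va, double value)
{
    const std::size_t n = va.size();
    if (n == 0)
        return;                 // &va[0] is not valid on an empty valarray
    double* pva = &va[0];
    for (std::size_t i = 0; i < n; ++i)
        pva[i] = value;
}
```

Both functions leave the valarray in the same state; whether they differ in speed depends on whether the compiler and library inline `operator[]`.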

  4. #19
    Join Date
    Dec 2003
    Posts
    3,366
    Danny, Hack, can we disable the smilies again?

  5. #20
    Join Date
    Dec 2007
    Posts
    401
    > But what seems to be killing us is that procedure call ...
    you are absolutely right.
    just to check it out, i generated the assembly code with the gcc (4.2) compiler (which had earlier given similar timings for valarrays and c style arrays). and there is hardly any difference in the machine code generated with the gnu toolset.
    the microsoft compiler and/or their valarray implementation is the culprit for the dreadful performance.

    here is code i tried with gcc 4.2.
    Code:
    #include <valarray>
    
    void foo_valarray( std::valarray<double>& va, size_t i, double d )
    {
      va[i] = d ;
    }
    
    void foo_c_array( double* ca, size_t i, double d )
    {
      ca[i] = d ;
    }
    
    extern std::valarray<double> va ;
    void bar_valarray()
    {
      va[1] = 1.0 ;
    }
    
    extern double ca[] ;
    void bar_c_array()
    {
      ca[1] = 1.0 ;
    }
    c++ -c -O3 -fomit-frame-pointer -Wa,-a,-ad -march=pentium4 valarray2.cc > valarray2.asm

    and this is the assembly code generated.
    Code:
    GAS LISTING /var/tmp//ccqTgL2o.s 			page 1
    
    
       1              		.file	"valarray2.cc"
       2              		.text
       3              		.align 2
       4              	.globl _Z11foo_c_arrayPdjd
       6              	_Z11foo_c_arrayPdjd:
       7              	.LFB2240:
       8 0000 DD44240C 	fldl	12(%esp)
       9 0004 8B542408 		movl	8(%esp), %edx
      10 0008 8B442404 	movl	4(%esp), %eax
      11 000c DD1CD0   		fstpl	(%eax,%edx,8)
      12 000f C3       		ret
      13              	.LFE2240:
      15              	.globl __gxx_personality_v0
      16              		.align 2
      17              	.globl _Z11bar_c_arrayv
      19              	_Z11bar_c_arrayv:
      20              	.LFB2242:
      21 0010 C7050800 		movl	$0, ca+8
      21      00000000 
      21      0000
      22 001a C7050C00 	movl	$1072693248, ca+12
      22      00000000 
      22      F03F
      23 0024 C3       		ret
      24              	.LFE2242:
      26 0025 90       		.align 2
      27              	.globl _Z12bar_valarrayv
      29              	_Z12bar_valarrayv:
      30              	.LFB2241:
      31 0026 A1040000 		movl	va+4, %eax
      31      00
      32 002b C7400800 	movl	$0, 8(%eax)
      32      000000
      33 0032 C7400C00 	movl	$1072693248, 12(%eax)
      33      00F03F
      34 0039 C3       		ret
      35              	.LFE2241:
      37              		.align 2
      38              	.globl _Z12foo_valarrayRSt8valarrayIdEjd
      40              	_Z12foo_valarrayRSt8valarrayIdEjd:
      41              	.LFB2239:
      42 003a 8B442404 	movl	4(%esp), %eax
      43 003e 8B5004   		movl	4(%eax), %edx
      44 0041 DD44240C 	fldl	12(%esp)
      45 0045 8B442408 		movl	8(%esp), %eax
      46 0049 DD1CC2   		fstpl	(%edx,%eax,8)
      47 004c C3       		ret
      48              	.LFE2239:
      50              		.ident	"GCC: (GNU) 4.2.3 20071024 (prerelease)"
    GAS LISTING /var/tmp//ccqTgL2o.s 			page 2
    
    
    DEFINED SYMBOLS
                                *ABS*:0000000000000000 valarray2.cc
    /var/tmp//ccqTgL2o.s:6      .text:0000000000000000 _Z11foo_c_arrayPdjd
    /var/tmp//ccqTgL2o.s:19     .text:0000000000000010 _Z11bar_c_arrayv
    /var/tmp//ccqTgL2o.s:29     .text:0000000000000026 _Z12bar_valarrayv
    /var/tmp//ccqTgL2o.s:40     .text:000000000000003a _Z12foo_valarrayRSt8valarrayIdEjd
    
    UNDEFINED SYMBOLS
    __gxx_personality_v0
    ca
    va
    Last edited by vijayan; 01-09-2008 at 04:00 PM.

  6. #21
    Join Date
    Nov 2003
    Posts
    4,118
    Quote Originally Posted by jonnin
    Danny, Hack, can we disable the smilies again?
    I think they're disabled between [CODE] tags and enabled otherwise. I'll ask our admin to cancel them completely.
    Danny Kalev

  7. #22
    Join Date
    Dec 2003
    Posts
    3,366
    I think I will leave it here, as my time to play with this is limited. If anyone else has additional commentary I will be reading...

  8. #23
    Join Date
    Nov 2003
    Posts
    4,118
    Quote Originally Posted by vijayan
    > But what seems to be killing us is that procedure call ...
    you are absolutely right.
    just to check it out, i generated the assembly code with the gcc (4.2) compiler (which had earlier given similar timings for valarrays and c style arrays). and there is hardly any difference in the machine code generated with the gnu toolset.
    the microsoft compiler and/or their valarray implementation is the culprit for the dreadful performance.

    For those of us who do not speak machine code as our first (or second) language, can you summarize your conclusions please? Where's the problem with the MS generated code? Can you at least try to guess why VC++ generates that procedure call and what in heaven's name that procedure does? Finally, is it possible to hand optimize the VC++ machine code and eliminate that procedure call?
    Danny Kalev

  9. #24
    Join Date
    Dec 2003
    Posts
    3,366
    ok:

    The problem simply is the line with the call instruction, which invokes a procedure. This makes a jump, which breaks the pipeline and costs you (potentially) page faults, lost instructions (re-fill of the pipelines (there are 2 per intel cpu)) and the like. Worse, the call, the pushes and the return are slow instructions that touch memory (the stack).

    You could hand optimize it, but it's not trivial to rewrite the built-in library: you would have to either generate the assembler, modify it (every time you compile!) and then assemble that, or hack on the library itself, or something. It's easier just to use an array of doubles IMO; I was only going to use the valarray to do built-in A*B and A+B sorts of things, which are trivial loops. If you *had* to have it you could probably download a better valarray class; in the old days you could patch a new STL into VS 6.0 because its built-in one was awful.

    I was unable to figure out what that code actually does. It's just a few lines moving a few bytes around, presumably for some internal safety check or babysitting or something. If the compiler could just inline it, I suspect that would clear up the performance hits for the most part. As to why: judging from the comment in the assembly listing, this is the routine that is invoked whenever the compiler sees valarray's [] operator. So it looks like it just inserts this call whenever it sees a [] operation, which is fairly horrible since I am willing to bet the internal valarray code also uses its own [] operator in places... (I did not check this)

    Finally, g++ generates code without the procedure call; it's nearly the same as the double array code and therefore just as fast (the differences do not even register, and this is *me* talking).
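Going by the source line quoted in the listing (`return (_Myptr[_Off]);`), the called routine appears to be nothing more than the subscript itself: load the data pointer, add the scaled offset, return the address. A rough sketch of that shape follows; `tiny_valarray` and `element_address` are hypothetical names for illustration, not the library's actual class.

```cpp
#include <cstddef>

// Hypothetical stand-in for the library class; only the parts that matter
// for operator[] are modelled.
class tiny_valarray
{
public:
    explicit tiny_valarray(std::size_t n)
        : myptr_(new double[n]()), size_(n) {}   // zero-initialized storage
    ~tiny_valarray() { delete[] myptr_; }

    // Same body as the source line in the listing. Defined inside the
    // class, so it is implicitly inline, but a compiler may still emit it
    // as a separate out-of-line function and call it.
    double& operator[](std::size_t off) { return myptr_[off]; }

    std::size_t size() const { return size_; }

private:
    tiny_valarray(const tiny_valarray&);             // copying disabled
    tiny_valarray& operator=(const tiny_valarray&);  // (declared, not defined)

    double* myptr_;     // first member: what "mov eax, [eax]" would load
    std::size_t size_;
};

// What the subscript reduces to once inlined: base + off * sizeof(double),
// i.e. the "lea eax, [eax+ecx*8]" in the listing.
inline double* element_address(double* base, std::size_t off)
{
    return base + off;
}
```

When the call is inlined, only the address computation and the store remain, which matches the shape of the g++ output.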

  10. #25
    Join Date
    Dec 2007
    Posts
    401
    > For those of us who do not speak machine code as their first (or second) language, can you summarize your conclusions ...
    i guess i could skip this as i can't see how i can improve on what jonnin has posted about this.

    > Can you at least try to guess why VC++ generates that procedure call and what in heaven's name that procedure does?
    that procedure is the out of line implementation of valarray<>::operator[]
    the version of vc++ that generates the procedure call is vc++ 6.0. it is a ten year old compiler with poor standards compliance and libraries.
    i tried the same code with vc++ 8.0 (Visual C++ Express 2005) and the code it generates is much much better. the array subscript operator is inlined and performance is comparable to that of gcc 4.2
    with the switches /O2 /Ob1 /Oi /Oy /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /FD /EHsc /MD /FAs /Fa"Release\\" /Fo"Release\\" /Fd"Release\vc80.pdb" /W3 /nologo /c /Wp64 /Zi /TP
    this is the code that is generated:
    Code:
    ; Listing generated by Microsoft (R) Optimizing Compiler Version 14.00.50727.762 
    
    	TITLE	c:\cygwin\home\projects\valarray_test\valarray_test\valarray2.cpp
    	.686P
    	.XMM
    	include listing.inc
    	.model	flat
    
    INCLUDELIB OLDNAMES
    
    PUBLIC	??A?$valarray@N@std@@QAEAANI@Z			; std::valarray<double>::operator[]
    EXTRN	@__security_check_cookie@4:PROC
    ; Function compile flags: /Ogtpy
    ; File c:\program files\microsoft visual studio 8\vc\include\valarray
    ;	COMDAT ??A?$valarray@N@std@@QAEAANI@Z
    _TEXT	SEGMENT
    ??A?$valarray@N@std@@QAEAANI@Z PROC			; std::valarray<double>::operator[], COMDAT
    ; _this$ = eax
    ; __Off$ = edx
    
    ; 295  : 		return (_Myptr[_Off]);
    
    	mov	ecx, DWORD PTR [eax]
    	lea	eax, DWORD PTR [ecx+edx*8]
    
    ; 296  : 		}
    
    	ret	0
    ??A?$valarray@N@std@@QAEAANI@Z ENDP			; std::valarray<double>::operator[]
    _TEXT	ENDS
    PUBLIC	__real@3ff0000000000000
    PUBLIC	?bar_c_array@@YAXXZ				; bar_c_array
    ;	COMDAT __real@3ff0000000000000
    ; File c:\cygwin\home\projects\valarray_test\valarray_test\valarray2.cpp
    CONST	SEGMENT
    __real@3ff0000000000000 DQ 03ff0000000000000r	; 1
    ; Function compile flags: /Ogtpy
    CONST	ENDS
    ;	COMDAT ?bar_c_array@@YAXXZ
    _TEXT	SEGMENT
    ?bar_c_array@@YAXXZ PROC				; bar_c_array, COMDAT
    
    ; 22   :   ca[1] = 1.0 ;
    
    	fld1
    	fstp	QWORD PTR ?ca@@3PANA+8
    
    ; 23   : }
    
    	ret	0
    ?bar_c_array@@YAXXZ ENDP				; bar_c_array
    _TEXT	ENDS
    PUBLIC	?bar_valarray@@YAXXZ				; bar_valarray
    ; Function compile flags: /Ogtpy
    ;	COMDAT ?bar_valarray@@YAXXZ
    _TEXT	SEGMENT
    ?bar_valarray@@YAXXZ PROC				; bar_valarray, COMDAT
    
    ; 16   :   va[1] = 1.0 ;
    
    	fld1
    	mov	eax, DWORD PTR ?va@@3V?$valarray@N@std@@A
    	fstp	QWORD PTR [eax+8]
    
    ; 17   : }
    
    	ret	0
    ?bar_valarray@@YAXXZ ENDP				; bar_valarray
    _TEXT	ENDS
    PUBLIC	?foo_c_array@@YAXPANIN@Z			; foo_c_array
    ; Function compile flags: /Ogtpy
    ;	COMDAT ?foo_c_array@@YAXPANIN@Z
    _TEXT	SEGMENT
    _ca$ = 8						; size = 4
    _i$ = 12						; size = 4
    _d$ = 16						; size = 8
    ?foo_c_array@@YAXPANIN@Z PROC				; foo_c_array, COMDAT
    
    ; 10   :   ca[i] = d ;
    
    	fld	QWORD PTR _d$[esp-4]
    	mov	eax, DWORD PTR _i$[esp-4]
    	mov	ecx, DWORD PTR _ca$[esp-4]
    	fstp	QWORD PTR [ecx+eax*8]
    
    ; 11   : }
    
    	ret	0
    ?foo_c_array@@YAXPANIN@Z ENDP				; foo_c_array
    _TEXT	ENDS
    PUBLIC	?foo_valarray@@YAXAAV?$valarray@N@std@@IN@Z	; foo_valarray
    ; Function compile flags: /Ogtpy
    ;	COMDAT ?foo_valarray@@YAXAAV?$valarray@N@std@@IN@Z
    _TEXT	SEGMENT
    _va$ = 8						; size = 4
    _i$ = 12						; size = 4
    _d$ = 16						; size = 8
    ?foo_valarray@@YAXAAV?$valarray@N@std@@IN@Z PROC	; foo_valarray, COMDAT
    
    ; 5    :   va[i] = d ;
    
    	mov	eax, DWORD PTR _va$[esp-4]
    	fld	QWORD PTR _d$[esp-4]
    	mov	ecx, DWORD PTR [eax]
    	mov	edx, DWORD PTR _i$[esp-4]
    	fstp	QWORD PTR [ecx+edx*8]
    
    ; 6    : }
    
    	ret	0
    ?foo_valarray@@YAXAAV?$valarray@N@std@@IN@Z ENDP	; foo_valarray
    END
    however, while trying out this test code (same as with gcc)
    Code:
    #include <ctime>
    #include <iostream>
    #include <valarray>
    #include <boost/random.hpp>
    using namespace std ;
    // boost names are qualified explicitly: with newer standard libraries
    // an unqualified mt19937 could be ambiguous with std::mt19937
    
    int main()
    {
      enum { N = 1024*1024 };
      valarray<double> a(N), b(N), c(N) ;
    
      // random doubles in [0,100)
      boost::mt19937 engine ;
      boost::uniform_real<> distribution( 0.0, 100.0 ) ;
      boost::variate_generator< boost::mt19937&, boost::uniform_real<> > rng( engine, distribution ) ;
    
      for( int pass=0 ; pass<8 ; ++pass ) // repeat the measurements 8 times
      {
        for( size_t i=0 ; i<N ; ++i ) { a[i] = rng() ; b[i] = rng() ; }
    
        // 1. raw pointers into the valarray storage
        double* a1 = &a[0], *b1 = &b[0], *c1 = &c[0] ;
        clock_t begin = clock() ;
        for( size_t i=0 ; i<N ; ++i ) c1[i] = a1[i] * b1[i] ;
        clock_t end = clock() ;
        cout << "double operator* " << double(end-begin) / CLOCKS_PER_SEC << " seconds\n" ;
        
        // 2. whole-array expression
        begin = clock() ;
        c = a*b ;
        end = clock() ;
        cout << "valarray operator* " << double(end-begin) / CLOCKS_PER_SEC << " seconds\n" ;
    
        // 3. element-wise loop through valarray::operator[]
        begin = clock() ;
        for( size_t i=0 ; i<N ; ++i ) c[i] = a[i] * b[i] ;
        end = clock() ;
        cout << "valarray[i] operator* " << double(end-begin) / CLOCKS_PER_SEC << " seconds\n" ;
        
        cout << "------------------------------------------------------\n" ;
      }
    }
    the results are like
    ------------------------------------------------------
    double operator* 0.015 seconds
    valarray operator* 0.047 seconds
    valarray[i] operator* 0.015 seconds
    ------------------------------------------------------

    the reason why c = a*b is less efficient than
    for( size_t i=0 ; i<N ; ++i ) c[i] = a[i] * b[i] ;
    is that the expression a*b causes the creation of a temporary valarray which is then copied into c. vc++ 8.0 does not optimize away the creation of this anonymous temporary.
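To make the cost of that temporary concrete, here is a hand-written sketch of the two shapes of work. It is only an illustration of the explanation above, not the library's actual code; the function names are made up, and all three arrays are assumed to have the same size.

```cpp
#include <cstddef>
#include <valarray>

// Roughly what "c = a * b" costs when the temporary is not optimized away:
// a fresh array is allocated and filled, then copied into c.
void multiply_with_temporary(std::valarray<double>& c,
                             const std::valarray<double>& a,
                             const std::valarray<double>& b)
{
    std::valarray<double> tmp(a.size());      // extra allocation
    for (std::size_t i = 0; i < a.size(); ++i)
        tmp[i] = a[i] * b[i];                 // first pass: compute
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = tmp[i];                        // second pass: copy
}

// The explicit loop writes each product straight into c: one pass over
// memory and no allocation.
void multiply_direct(std::valarray<double>& c,
                     const std::valarray<double>& a,
                     const std::valarray<double>& b)
{
    for (std::size_t i = 0; i < c.size(); ++i)
        c[i] = a[i] * b[i];
}
```

Both produce the same values; the first simply touches more memory, which is consistent with the roughly threefold difference in the timings above.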
    Last edited by vijayan; 01-10-2008 at 05:35 AM.

  11. #26
    Join Date
    Dec 2003
    Posts
    3,366
    That was .net 2005 actually. I tossed 6.0 for .net 2002, a bit later than I would have liked. But not the express edition, we have the version just under the extremely expensive one, professional maybe?
    Last edited by jonnin; 01-10-2008 at 09:57 AM.

  12. #27
    Join Date
    Nov 2003
    Posts
    4,118
    so the problem with the generated code still persists with VC 8.0?
    Danny Kalev

  13. #28
    Join Date
    Dec 2003
    Posts
    3,366
    Scratch that, I need to redo it in 2005. The test project I was using somehow had reverted to 2002 and since they look the same, I did not notice! So yes, it still does it in 2002 but may not in 2005.

    I need to make some change in 2002 so I can tell if it decides to open and surprise me again. Thankfully, I verified that my real projects are using 2005.


    I will try to retime it this weekend.

  14. #29
    Join Date
    Dec 2007
    Posts
    401
    > so the problem with the generated code still persists with VC 8.0?
    the array subscript operation does not have a problem with vc++ 8.0; it compares well with the performance of g++ (same as that for a c style array).
    at least with the compiler settings that i tried, it does have a problem with expressions of these types: c = a * b ; or c = a + b ;
    but not with c *= a ; or c += b ; or for( size_t i=0 ; i<N ; ++i ) c[i] = a[i] * b[i] ;
    perhaps more aggressive optimization settings might make the problem go away; i've only had a cursory look at it so far.
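A possible workaround based on the observation above, sketched with made-up helper names: express `c = a * b` and `c = a + b` through the compound forms, so that no `a * b` or `a + b` temporary is ever formed. The sketch assumes all three arrays have the same size and that c is a different object from b.

```cpp
#include <valarray>

// c = a * b without building an "a * b" temporary.
// Note: c must not be the same object as b, since c is overwritten first.
void multiply_in_place(std::valarray<double>& c,
                       const std::valarray<double>& a,
                       const std::valarray<double>& b)
{
    c = a;     // element-wise copy into existing storage
    c *= b;    // in-place multiply
}

// c = a + b, same idea.
void add_in_place(std::valarray<double>& c,
                  const std::valarray<double>& a,
                  const std::valarray<double>& b)
{
    c = a;
    c += b;
}
```

This still makes two passes over c (the copy, then the update), so the plain element-wise loop may remain the cheapest form; it only avoids the extra allocation.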

  15. #30
    Join Date
    Dec 2003
    Posts
    3,366
    No, it was doing it in 2002 (visual 7) not 8. My computer randomly decided to open the old IDE on my test project. I will have to redo a lot of this stuff with 2005 and see what the deal is. I started from the top, a full matrix multiply, and so far it's .06 seconds for the double pointer version and .5 seconds for an OOP approach using slices, sum and * operations from valarray. I will post more later on, I have a busy week or so ahead of me.
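The code behind those two timings was not posted, so the following is only a guess at what the two approaches might look like: square n x n matrices stored row-major in flat arrays, once as a plain triple loop over pointers and once with `std::slice`, `operator*` and `sum()`. The function names and the storage layout are assumptions.

```cpp
#include <cstddef>
#include <valarray>

// Plain pointer version: c = a * b for row-major n x n matrices.
void matmul_pointer(const double* a, const double* b, double* c, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
        {
            double sum = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

// Slice version: each element is the sum of (row i of a) * (column j of b).
// Every row, column and product here may materialize a temporary valarray.
void matmul_slices(const std::valarray<double>& a,
                   const std::valarray<double>& b,
                   std::valarray<double>& c,
                   std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        // row i: n elements starting at i*n, stride 1
        const std::valarray<double> row(a[std::slice(i * n, n, 1)]);
        for (std::size_t j = 0; j < n; ++j)
        {
            // column j: n elements starting at j, stride n
            const std::valarray<double> col(b[std::slice(j, n, n)]);
            c[i * n + j] = (row * col).sum();
        }
    }
}
```

The slice version reads nicely, but it copies a column for every output element, which could plausibly account for a gap of the size reported.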
