C++ 2D matrix iterator efficiency compared to nested for-loop


I have a row-major iterator for a 2D array, whose dereference operator is as follows:

int& iterator::operator*(){ return matrix_[y_][x_]; }  //matrix_ has type int** 

The (prefix) increment operator is as follows:

iterator& iterator::operator++()
{
    if((++x_ == xs_) && (++y_ != ys_)) //ys_, xs_ are the dimensions of the matrix
        x_ = 0;
    return *this;
}
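
For context, the rest of the iterator class looks roughly like this (a minimal sketch only - the constructors, operator!= and so on are omitted, and the member layout is inferred from the snippets above):

class iterator
{
public:
    int& operator*();
    iterator& operator++();   //prefix
    // constructors, operator!=, etc. omitted
private:
    int** matrix_;   // the 2D array being traversed (row-major)
    int   x_, y_;    // current column and row
    int   xs_, ys_;  // matrix dimensions
};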

I can use the iterator with an optimised version of std::transform (one that doesn't return the un-needed result, in order to save a few instructions):

template < class inputiterator, class outputiterator, class unaryoperator >
inline void mytransform( inputiterator first1, inputiterator last1,
                         outputiterator result, unaryoperator op )
{
    for(; first1 != last1; ++first1, ++result)
        *result = op(*first1);
}

Calling it thus:

mytransform(matrix1.begin(),matrix1.end(),matrix2.begin(), myfunctor()); 
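
(myfunctor here is just a cheap element-wise operation; for the purposes of this question assume something trivial along these lines - the body is a placeholder, not my actual functor:)

struct myfunctor
{
    int operator()(int v) const { return v + 1; }   // any cheap per-element operation
};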

However, when I compare its performance against a classic nested for-loop:

myfunctor f;
for(int y=0; y<ysize; y++)
    for(int x=0; x<xsize; x++)
        matrix2[y][x] = f(matrix1[y][x]);

the iterator-based solution is approx. 25% slower than the nested for-loop solution. This is the case for both the MSVC and Intel C++ compilers (both of which seem to auto-inline everything needed).

Now, the problem does not seem to lie in the iterator increment operator, since if I use the following (ugly) hybrid solution combining iterator traversal and raw array access (the latter indexed using the iterators' internal counters):

myfunctor f;
for(; mat1begin != mat1end; ++mat1begin, ++mat2begin)
{
    //mat1 and mat2 are of type int**
    mat2[mat2begin.y_][mat2begin.x_] = f(mat1[mat1begin.y_][mat1begin.x_]);
}

it is a little faster than the nested for-loop solution. This suggests to me that the performance hit lies in the iterator's dereferencing when doing the assignment.

My question is, why does dereferencing the iterators in the assignment

*result = op(*first1); 

incur such a massive performance hit, relative to raw array access? Is there a technique I can use with this simple design so that the performance is (almost) equivalent to the raw array version?
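
(For comparison, here is roughly what an optimiser can do with the raw nested loop, because matrix1[y] and matrix2[y] are invariant in the inner loop - a hand-written sketch of the transformation, not actual compiler output:)

myfunctor f;
for(int y = 0; y < ysize; y++)
{
    int* src = matrix1[y];   // row pointers hoisted out of the inner loop
    int* dst = matrix2[y];
    for(int x = 0; x < xsize; x++)
        dst[x] = f(src[x]);  // a single indexed access per element
}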

In response to the helpful feedback from the community, I have modified the code so that the row selected by the outer loop counter is cached; the code now looks as follows:

int& iterator::operator*()
{
    return column_[x_];
}

iterator& iterator::operator++()
{
    if(++x_ == xs_)      //ys_, xs_ are the dimensions of the matrix
    {
        if(++y_ != ys_)
        {
            x_ = 0;
            column_ = matrix_[y_];
        }
    }
    return *this;
}

This improves the performance to ~85% of the raw 2D array performance with the Intel C++ compiler, and similarly with the MSVC compiler (actually the call to mytransform is slower on MSVC - more assembly instructions are generated - but let's ignore that for now, as I am more interested in the loop/dereference behaviour).

When I convert the code to use pointer arithmetic (again caching the column) as follows, the performance is worse relative to the raw 2D array (~70%) on the Intel compiler, but is again ~85% of the raw 2D array under the MSVC compiler:

int& image32iterator::operator*()
{
    return *ptr_;
}

//prefix
image32iterator& image32iterator::operator++()
{
    if(++ptr_ == ptrend_)
    {
        if(++row_ != rowend_)
        {
            ptrend_ = (ptr_ = *row_) + xs_;
        }
    }
    return *this;
}
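
(For reference, this version assumes the iterator carries a row pointer plus cached bounds for the current row, initialised along these lines - a sketch only, since the actual constructor isn't shown here and its signature is hypothetical:)

image32iterator::image32iterator(int** matrix, int xs, int ys)
    : row_(matrix),            // pointer to the first row pointer
      rowend_(matrix + ys),    // one past the last row pointer
      ptr_(*matrix),           // first element of the first row
      ptrend_(*matrix + xs),   // one past the end of the first row
      xs_(xs)                  // row length, needed when advancing to the next row
{
}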

So I am trying to understand whether ~85% is the maximum performance I can get from an iterator-based solution. I am surprised that the pointer arithmetic solution performs so much worse (and galling, as I was trying pointer arithmetic precisely to see if I could get > 85%!).

I'll continue my investigations and update with any findings, but any insight is welcome...

...So, focussing on the issue of why the pointer-arithmetic version of the iterator performs so badly with Intel, whereas it performs okay with the MSVC compiler, I have looked at the assembly, and the problem seems to be in the code generated for the loop. For all the other functions (i.e., the constructors, the iterator and dereference operators, the inequality operator, etc.) the generated code is pretty much the same for both Intel and MSVC, if anything slightly more concise for Intel.

Here is the assembler for the Intel-generated code, followed by the assembler for the MSVC-generated code. I have changed the loop to a while loop to make the generated assembler easier to read:

Intel-generated code:

while(begin != end)
01392d31  push        eax  
01392d32  lea         eax,[begin]  
01392d35  lea         edx,[end]  
01392d38  mov         dword ptr [esp],edx  
01392d3b  mov         ecx,eax  
01392d3d  call        imageexperiments::image32iterator::operator!= (139103ch)  
01392d42  mov         byte ptr [ebp-74h],al  
01392d45  movzx       eax,byte ptr [ebp-74h]  
01392d49  movzx       eax,al  
01392d4c  test        eax,eax  
01392d4e  je          imageexperiments::greyscale_iterator2+0bch (1392dach)  
{
    *it8 = gsf(*begin);
01392d50  lea         eax,[begin]  
01392d53  mov         ecx,eax  
01392d55  call        imageexperiments::image32iterator::operator* (13910a5h)  
01392d5a  mov         dword ptr [ebp-10h],eax  
01392d5d  push        eax  
01392d5e  lea         eax,[gsf]  
01392d61  mov         edx,dword ptr [ebp-10h]  
01392d64  mov         edx,dword ptr [edx]  
01392d66  mov         dword ptr [esp],edx  
01392d69  mov         ecx,eax  
01392d6b  call        imageexperiments::greyscalefunctor::operator() (139101eh)  
01392d70  mov         byte ptr [ebp-72h],al  
01392d73  movzx       eax,byte ptr [ebp-72h]  
01392d77  mov         byte ptr [ebp-71h],al  
01392d7a  lea         eax,[it8]  
01392d7d  mov         ecx,eax  
01392d7f  call        imageexperiments::image8iterator::operator* (1391050h)  
01392d84  mov         dword ptr [ebp-0ch],eax  
01392d87  mov         eax,dword ptr [ebp-0ch]  
01392d8a  movzx       edx,byte ptr [ebp-71h]  
01392d8e  mov         byte ptr [eax],dl  
    ++begin; 
01392d90  lea         eax,[begin]  
01392d93  mov         ecx,eax  
01392d95  call        imageexperiments::image32iterator::operator++ (1391028h)  
01392d9a  mov         dword ptr [ebp-8],eax  
    ++it8;
01392d9d  lea         eax,[it8]  
01392da0  mov         ecx,eax  
01392da2  call        imageexperiments::image8iterator::operator++ (1391014h)  
01392da7  mov         dword ptr [ebp-4],eax  
01392daa  jmp         imageexperiments::greyscale_iterator2+41h (1392d31h)  
}
}
00ca2dac  leave  
00ca2dad  ret 

MSVC-generated code:

while(begin != end)
010316e0  lea         eax,[end]  
010316e3  push        eax  
010316e4  lea         ecx,[begin]  
010316e7  call        imageexperiments::image32iterator::operator!= (1031096h)  
010316ec  movzx       ecx,al  
010316ef  test        ecx,ecx  
010316f1  je          imageexperiments::greyscale_iterator2+74h (1031724h)  
{
    *it8 = gsf(*begin);
010316f3  lea         ecx,[begin]  
010316f6  call        imageexperiments::image32iterator::operator* (10311eah)  
010316fb  mov         eax,dword ptr [eax]  
010316fd  push        eax  
010316fe  lea         ecx,[gsf]  
01031701  call        imageexperiments::greyscalefunctor::operator() (1031032h)  
01031706  mov         bl,al  
01031708  lea         ecx,[it8]  
0103170b  call        imageexperiments::image8iterator::operator* (1031118h)  
01031710  mov         byte ptr [eax],bl  
    ++begin; 
01031712  lea         ecx,[begin]  
01031715  call        imageexperiments::image32iterator::operator++ (1031041h)  
    ++it8;
0103171a  lea         ecx,[it8]  
0103171d  call        imageexperiments::image8iterator::operator++ (103101eh)  
}
01031722  jmp         imageexperiments::greyscale_iterator2+30h (10316e0h)  
}
01031724  pop         edi  
01031725  pop         esi  
01031726  pop         ebx  
01031727  mov         esp,ebp  
01031729  pop         ebp  
0103172a  ret  

So it looks to me like the Intel compiler generates approximately 50% more instructions for the loop. I have tried qualifying the pointers with __restrict to see if that would make any difference to the Intel code generation, but it did not. If anyone has a suggestion as to why the loop code from the Intel compiler is so bulky/slow compared to the MSVC++ compiler, I'd be interested!
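
(For reference, the no-aliasing hint looks like this when applied to raw pointer parameters - a minimal example of the qualifier rather than my actual code; both MSVC and Intel accept this spelling:)

template < class unaryoperator >
inline void transform_row( const int* __restrict src, int* __restrict dst,
                           int xs, unaryoperator op )
{
    // __restrict promises the compiler that src and dst never alias, so loads
    // from src may be hoisted/reordered across stores to dst.
    for(int x = 0; x < xs; ++x)
        dst[x] = op(src[x]);
}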

You're hoisting the double indirection matrix_[y_][x_] into a function call in the loop. Possibly the compiler is managing to cache the pointer matrix_[y_] in one case but not the other; try caching matrix_[y_] in the iterator?

