C++ 2D matrix iterator efficiency compared to nested for-loop
I have a row-major iterator for a 2D array; the dereference operator is as follows:

int& iterator::operator*() { return matrix_[y_][x_]; } // matrix_ has type int**
The (prefix) increment operator is as follows:

iterator& iterator::operator++()
{
    if((++x_ == xs_) && (++y_ != ys_)) // xs_, ys_ are the dimensions of the matrix
        x_ = 0;
    return *this;
}
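Taken together, a minimal self-contained sketch of such an iterator (the two operators are from the snippets above; the member layout and the inequality operator are my assumptions):

```cpp
#include <cassert>

// Hypothetical row-major iterator over an int** matrix. Member names
// (matrix_, x_, y_, xs_, ys_) follow the question; everything else is assumed.
struct iterator {
    int** matrix_;  // the 2D array being traversed
    int x_, y_;     // current column and row
    int xs_, ys_;   // matrix dimensions (columns, rows)

    int& operator*() { return matrix_[y_][x_]; }

    iterator& operator++() {
        // Wrap x_ to the start of the next row, except after the last row,
        // where x_ stays at xs_ so the iterator compares equal to end.
        if ((++x_ == xs_) && (++y_ != ys_))
            x_ = 0;
        return *this;
    }

    bool operator!=(const iterator& o) const {
        return x_ != o.x_ || y_ != o.y_;
    }
};
```

With this layout, the end iterator for a matrix of ys_ rows and xs_ columns is the position (x_ = xs_, y_ = ys_) that the increment operator leaves behind after the last element.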
I can use the iterator with an optimised version of std::transform (which doesn't return the unneeded result, in order to save a few instructions):

template <class inputiterator, class outputiterator, class unaryoperator>
inline void mytransform(inputiterator first1, inputiterator last1,
                        outputiterator result, unaryoperator op)
{
    for(; first1 != last1; ++first1, ++result)
        *result = op(*first1);
}
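To see the template in isolation, here it is exercised with ordinary std::vector iterators rather than the matrix iterator (the doubling functor is just an illustrative stand-in for myfunctor):

```cpp
#include <vector>
#include <cassert>

// Repeated from the question so this example is self-contained.
template <class inputiterator, class outputiterator, class unaryoperator>
inline void mytransform(inputiterator first1, inputiterator last1,
                        outputiterator result, unaryoperator op)
{
    for (; first1 != last1; ++first1, ++result)
        *result = op(*first1);
}

// Stand-in for myfunctor: doubles its argument.
struct doubler {
    int operator()(int v) const { return v * 2; }
};
```

Because it is a template over iterator types, exactly the same call site works with plain pointers, std::vector iterators, or the custom 2D iterator.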
calling it thus:

mytransform(matrix1.begin(), matrix1.end(), matrix2.begin(), myfunctor());
However, when I compare its performance against a classic nested for-loop:

myfunctor f;
for(int y = 0; y < ysize; y++)
    for(int x = 0; x < xsize; x++)
        matrix2[y][x] = f(matrix1[y][x]);
the iterator-based solution is approx. 25% slower than the nested for-loop solution. This is the case for both the MSVC and Intel C++ compilers (both of which seem to auto-inline as needed).
Now the problem does not seem to be with the iterator increment operator, because if I use the following (ugly) hybrid solution combining iterator traversal and raw array access (the latter indexed using the iterators' internal counts):

myfunctor f;
for(; mat1begin != mat1end; ++mat1begin, ++mat2begin)
{
    // mat1 and mat2 have type int**
    mat2[mat2begin.y_][mat2begin.x_] = f(mat1[mat1begin.y_][mat1begin.x_]);
}
it is a little faster than the nested for-loop solution. This suggests to me that the performance hit is in the iterator's dereferencing when doing the assignment.
My question is: why does dereferencing the iterators in the assignment

*result = op(*first1);

incur such a massive performance hit relative to raw array access? Is there a technique I can use with this simple design so that performance is (almost) equivalent to the raw array version?
In response to helpful feedback from the community, I have modified the code so that the row pointer for the outer loop counter is cached; the code now looks as follows:

int& iterator::operator*()
{
    return column_[x_];
}

iterator& iterator::operator++()
{
    if(++x_ == xs_) // xs_, ys_ are the dimensions of the matrix
    {
        if(++y_ != ys_)
        {
            x_ = 0;
            column_ = matrix_[y_];
        }
    }
    return *this;
}
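For completeness, a self-contained sketch of this cached-column variant (the two operators are as above; the class layout, constructor, and inequality operator are my assumptions). The point of column_ is that operator* now performs a single indirection instead of the double indirection matrix_[y_][x_]:

```cpp
#include <cassert>

// Cached-column variant: column_ always holds matrix_[y_], so operator*
// does one indirection. Members beyond the question's snippet are assumed.
struct iterator {
    int** matrix_;
    int*  column_;   // cached pointer to the current row, matrix_[y_]
    int x_, y_;
    int xs_, ys_;

    iterator(int** m, int x, int y, int xs, int ys)
        : matrix_(m), column_(y < ys ? m[y] : nullptr),
          x_(x), y_(y), xs_(xs), ys_(ys) {}

    int& operator*() { return column_[x_]; }

    iterator& operator++() {
        if (++x_ == xs_) {             // reached the end of a row
            if (++y_ != ys_) {         // not yet past the last row
                x_ = 0;
                column_ = matrix_[y_]; // re-cache the row pointer
            }
        }
        return *this;
    }

    bool operator!=(const iterator& o) const {
        return x_ != o.x_ || y_ != o.y_;
    }
};
```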
This improves performance to ~85% of the raw 2D array performance with the Intel C++ compiler, and similarly with the MSVC compiler (actually the call to mytransform is slower on MSVC - more assembly instructions are generated - but let's ignore that for now, as I am more interested in the loop/dereference behaviour).
When I convert the code to use pointer arithmetic (again caching the column) as follows, performance is worse than the raw 2D array (~70%) on the Intel compiler, but again ~85% of the raw 2D array under the MSVC compiler:

int& image32iterator::operator*()
{
    return *ptr_;
}

// prefix
image32iterator& image32iterator::operator++()
{
    if(++ptr_ == ptrend_)
    {
        if(++row_ != rowend_)
        {
            ptrend_ = (ptr_ = *row_) + xs_;
        }
    }
    return *this;
}
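A self-contained sketch of how this pointer-arithmetic variant might fit together (the two operators are as above; the member layout, constructor, and end-iterator construction are my assumptions). Here ptr_ walks the elements of the current row while row_ walks the array of row pointers:

```cpp
#include <cassert>

// Pointer-arithmetic variant: no x_/y_ counters, just pointers.
// Members beyond the question's snippet are assumptions.
struct image32iterator {
    int** row_;      // current slot in the array of row pointers
    int** rowend_;   // one past the last row slot
    int*  ptr_;      // current element in the current row
    int*  ptrend_;   // one past the end of the current row
    int   xs_;       // row length

    // at_end = true builds the past-the-end iterator: the state the
    // increment operator leaves behind after the last element.
    // Assumes at least one row.
    image32iterator(int** rows, int** rowsend, int xs, bool at_end = false)
        : row_(at_end ? rowsend : rows), rowend_(rowsend), xs_(xs)
    {
        ptr_ = at_end ? rowsend[-1] + xs : *rows;
        ptrend_ = at_end ? ptr_ : ptr_ + xs;
    }

    int& operator*() { return *ptr_; }

    image32iterator& operator++() { // prefix
        if (++ptr_ == ptrend_) {     // end of the current row
            if (++row_ != rowend_) { // more rows remain: hop to the next
                ptrend_ = (ptr_ = *row_) + xs_;
            }
        }
        return *this;
    }

    bool operator!=(const image32iterator& o) const { return ptr_ != o.ptr_; }
};
```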
So I am trying to understand whether ~85% is the maximum performance I can get with an iterator-based solution. I am surprised that the pointer arithmetic solution performs so much worse (which is galling, as I was trying to use pointer arithmetic to see if I could get > 85%!).

I'll continue my investigations and update with findings, but any insight is welcome...
...So, focussing on the issue of why the pointer-arithmetic version of the iterator performs badly with Intel, whereas it performs okay with the MSVC compiler, I have looked at the assembly, and the problem seems to be in the code generated for the loop. For all the other functions (i.e., the constructors, iterator and dereference operators, inequality operator, etc.), the generated code is pretty much the same for both Intel and MSVC, if anything more concise for Intel.

Here is the assembler for the Intel-generated code, followed by the assembler for the MSVC-generated code. I have changed the for loop to a while loop to make the generated assembler easier to read:
Intel-generated code:
while(begin != end)
01392d31  push        eax
01392d32  lea         eax,[begin]
01392d35  lea         edx,[end]
01392d38  mov         dword ptr [esp],edx
01392d3b  mov         ecx,eax
01392d3d  call        imageexperiments::image32iterator::operator!= (139103ch)
01392d42  mov         byte ptr [ebp-74h],al
01392d45  movzx       eax,byte ptr [ebp-74h]
01392d49  movzx       eax,al
01392d4c  test        eax,eax
01392d4e  je          imageexperiments::greyscale_iterator2+0bch (1392dach)
{
    *it8 = gsf(*begin);
01392d50  lea         eax,[begin]
01392d53  mov         ecx,eax
01392d55  call        imageexperiments::image32iterator::operator* (13910a5h)
01392d5a  mov         dword ptr [ebp-10h],eax
01392d5d  push        eax
01392d5e  lea         eax,[gsf]
01392d61  mov         edx,dword ptr [ebp-10h]
01392d64  mov         edx,dword ptr [edx]
01392d66  mov         dword ptr [esp],edx
01392d69  mov         ecx,eax
01392d6b  call        imageexperiments::greyscalefunctor::operator() (139101eh)
01392d70  mov         byte ptr [ebp-72h],al
01392d73  movzx       eax,byte ptr [ebp-72h]
01392d77  mov         byte ptr [ebp-71h],al
01392d7a  lea         eax,[it8]
01392d7d  mov         ecx,eax
01392d7f  call        imageexperiments::image8iterator::operator* (1391050h)
01392d84  mov         dword ptr [ebp-0ch],eax
01392d87  mov         eax,dword ptr [ebp-0ch]
01392d8a  movzx       edx,byte ptr [ebp-71h]
01392d8e  mov         byte ptr [eax],dl
    ++begin;
01392d90  lea         eax,[begin]
01392d93  mov         ecx,eax
01392d95  call        imageexperiments::image32iterator::operator++ (1391028h)
01392d9a  mov         dword ptr [ebp-8],eax
    ++it8;
01392d9d  lea         eax,[it8]
01392da0  mov         ecx,eax
01392da2  call        imageexperiments::image8iterator::operator++ (1391014h)
01392da7  mov         dword ptr [ebp-4],eax
01392daa  jmp         imageexperiments::greyscale_iterator2+41h (1392d31h)
}
}
00ca2dac  leave
00ca2dad  ret
MSVC-generated code:
while(begin != end)
010316e0  lea         eax,[end]
010316e3  push        eax
010316e4  lea         ecx,[begin]
010316e7  call        imageexperiments::image32iterator::operator!= (1031096h)
010316ec  movzx       ecx,al
010316ef  test        ecx,ecx
010316f1  je          imageexperiments::greyscale_iterator2+74h (1031724h)
{
    *it8 = gsf(*begin);
010316f3  lea         ecx,[begin]
010316f6  call        imageexperiments::image32iterator::operator* (10311eah)
010316fb  mov         eax,dword ptr [eax]
010316fd  push        eax
010316fe  lea         ecx,[gsf]
01031701  call        imageexperiments::greyscalefunctor::operator() (1031032h)
01031706  mov         bl,al
01031708  lea         ecx,[it8]
0103170b  call        imageexperiments::image8iterator::operator* (1031118h)
01031710  mov         byte ptr [eax],bl
    ++begin;
01031712  lea         ecx,[begin]
01031715  call        imageexperiments::image32iterator::operator++ (1031041h)
    ++it8;
0103171a  lea         ecx,[it8]
0103171d  call        imageexperiments::image8iterator::operator++ (103101eh)
}
01031722  jmp         imageexperiments::greyscale_iterator2+30h (10316e0h)
}
01031724  pop         edi
01031725  pop         esi
01031726  pop         ebx
01031727  mov         esp,ebp
01031729  pop         ebp
0103172a  ret
So it looks to me like the Intel compiler generates approx. 50% more instructions. I have tried qualifying the pointers with __restrict to see if it would make a difference to the Intel code generation, but it did not. If anyone has a suggestion as to why the loop code from the Intel compiler is so bulky/slow compared to the MSVC++ compiler, I'd be interested!
In the hybrid version you're hoisting the double indirection matrix_[y_][x_] out of a function call and into the loop. Possibly the compiler is managing to cache the pointer matrix_[y_] in one case but not the other; have you tried caching matrix_[y_] in the iterator?