java - Why is it faster to process a sorted array than an unsorted array?


Here is a piece of C++ code that seems very peculiar. For some strange reason, sorting the data miraculously makes the code almost six times faster.

#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
}
  • Without std::sort(data, data + arraySize);, the code runs in 11.54 seconds.
  • With the sorted data, the code runs in 1.93 seconds.

Initially, I thought this might be just a language or compiler anomaly, so I tried it in Java.

import java.util.Arrays;
import java.util.Random;

public class Main
{
    public static void main(String[] args)
    {
        // Generate data
        int arraySize = 32768;
        int data[] = new int[arraySize];

        Random rnd = new Random(0);
        for (int c = 0; c < arraySize; ++c)
            data[c] = rnd.nextInt() % 256;

        // !!! With this, the next loop runs faster
        Arrays.sort(data);

        // Test
        long start = System.nanoTime();
        long sum = 0;

        for (int i = 0; i < 100000; ++i)
        {
            // Primary loop
            for (int c = 0; c < arraySize; ++c)
            {
                if (data[c] >= 128)
                    sum += data[c];
            }
        }

        System.out.println((System.nanoTime() - start) / 1000000000.0);
        System.out.println("sum = " + sum);
    }
}

With a similar but less extreme result.


My first thought was that sorting brings the data into the cache, but then I thought how silly that was: the array was just generated.

  • What is going on?
  • Why is it faster to process a sorted array than an unsorted array?
  • The code is summing up some independent terms, so the order should not matter.

You are a victim of branch prediction fail.


What is branch prediction?

Consider a railroad junction:

Image by Mecanismo, via Wikimedia Commons. Used under the CC-BY-SA 3.0 license.

Now for the sake of argument, suppose this is back in the 1800s - before long-distance or radio communication.

You are the operator of a junction and you hear a train coming. You have no idea which way it is supposed to go. You stop the train to ask the driver which direction they want. And then you set the switch appropriately.

Trains are heavy and have a lot of inertia, so they take forever to start up and slow down.

Is there a better way? You guess which direction the train will go!

  • If you guessed right, it continues on.
  • If you guessed wrong, the captain will stop, back up, and yell at you to flip the switch. Then it can restart down the other path.

If you guess right every time, the train will never have to stop.
If you guess wrong too often, the train will spend a lot of time stopping, backing up, and restarting.


Consider an if-statement: at the processor level, it is a branch instruction:


You are a processor and you see a branch. You have no idea which way it will go. What do you do? You halt execution and wait until the previous instructions are complete. Then you continue down the correct path.

Modern processors are complicated and have long pipelines, so they take forever to "warm up" and "slow down".

Is there a better way? You guess which direction the branch will go!

  • If you guessed right, you continue executing.
  • If you guessed wrong, you need to flush the pipeline and roll back to the branch. Then you can restart down the other path.

If you guess right every time, the execution will never have to stop.
If you guess wrong too often, you spend a lot of time stalling, rolling back, and restarting.


This is branch prediction. I admit it's not the best analogy, since the train could just signal the direction with a flag. But in computers, the processor doesn't know which direction a branch will go until the last moment.

So how would you strategically guess to minimize the number of times the train must back up and go down the other path? You look at the past history! If the train goes left 99% of the time, you guess left. If it alternates, you alternate your guesses. If it goes one way every three times, you guess the same...

In other words, you try to identify a pattern and follow it. This is more or less how branch predictors work.

Most applications have well-behaved branches, so modern branch predictors typically achieve >90% hit rates. But when faced with unpredictable branches with no recognizable patterns, branch predictors are virtually useless.

Further reading: the "Branch predictor" article on Wikipedia.


As hinted from above, the culprit is this if-statement:

if (data[c] >= 128)
    sum += data[c];

Notice that the data is evenly distributed between 0 and 255. When the data is sorted, roughly the first half of the iterations will not enter the if-statement. After that, they all enter the if-statement.

This is very friendly to the branch predictor, since the branch consecutively goes the same direction many times. Even a simple saturating counter will correctly predict the branch, except for the few iterations after it switches direction.

Quick visualization:

T = branch taken
N = branch not taken

data[] = 0, 1, 2, 3, 4, ... 126, 127, 128, 129, 130, ... 250, 251, 252, ...
branch = N  N  N  N  N  ...   N    N    T    T    T  ...   T    T    T  ...

       = NNNNNNNNNNNN ... NNNNNNNTTTTTTTTT ... TTTTTTTTTT  (easy to predict)

However, when the data is completely random, the branch predictor is rendered useless because it can't predict random data. Thus there will be around 50% misprediction (no better than random guessing).

data[] = 226, 185, 125, 158, 198, 144, 217,  79, 202, 118,  14, 150, 177, 182, 133, ...
branch =   T,   T,   N,   T,   T,   T,   T,   N,   T,   N,   N,   T,   T,   T,   N  ...

       = TTNTTTTNTNNTTTN ...   (completely random - hard to predict)
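The difference is easy to quantify. Here is a minimal sketch of the classic two-bit saturating counter scheme (an illustration only, not any real CPU's predictor), run against this exact branch on random and then sorted data:

#include <algorithm>
#include <cstdlib>
#include <iostream>

// One 2-bit saturating counter: states 0-1 predict "not taken",
// states 2-3 predict "taken". Each outcome nudges the counter
// toward the direction the branch actually went.
static double hitRate(const int* data, unsigned n)
{
    int counter = 0;
    unsigned hits = 0;
    for (unsigned c = 0; c < n; ++c)
    {
        bool taken = data[c] >= 128;    // the branch from the benchmark
        bool predicted = counter >= 2;
        if (predicted == taken)
            ++hits;
        counter = taken ? std::min(counter + 1, 3)
                        : std::max(counter - 1, 0);
    }
    return static_cast<double>(hits) / n;
}

int main()
{
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    std::cout << "random: " << hitRate(data, arraySize) << '\n';
    std::sort(data, data + arraySize);
    std::cout << "sorted: " << hitRate(data, arraySize) << '\n';
}

On the sorted array the counter mispredicts only for a couple of iterations around the 128 boundary; on the random array its hit rate hovers near 50%, no better than guessing.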

So what can be done?

If the compiler isn't able to optimize the branch into a conditional move, you can try some hacks if you are willing to sacrifice readability for performance.

Replace:

if (data[c] >= 128)
    sum += data[c];

with:

int t = (data[c] - 128) >> 31;
sum += ~t & data[c];

This eliminates the branch and replaces it with some bitwise operations.

(Note that this hack is not strictly equivalent to the original if-statement. But in this case, it's valid for all the input values of data[].)
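To see exactly what the trick does, here is a small self-checking walk-through (maskedAdd is a hypothetical helper for illustration; like the hack itself, it assumes a 32-bit int and an arithmetic right shift on signed values, which is what mainstream compilers do):

#include <cassert>

// Apply the branchless hack to a single element.
long long maskedAdd(long long sum, int x)
{
    int t = (x - 128) >> 31;   // sign bit smeared across the word:
                               // 0 if x >= 128, -1 (all ones) if x < 128
    return sum + (~t & x);     // ~t is all ones when x >= 128 (adds x),
                               // all zeros when x < 128 (adds 0)
}

int main()
{
    assert(maskedAdd(0, 200) == 200);   // x >= 128: value is added
    assert(maskedAdd(0,  50) ==   0);   // x <  128: nothing is added
    assert(maskedAdd(10, 128) == 138);  // boundary: 128 - 128 = 0, so it is added
}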

Benchmarks: Core i7 920 @ 3.5 GHz

C++ - Visual Studio 2010 - x64 Release

//  Branch - Random
seconds = 11.777

//  Branch - Sorted
seconds = 2.352

//  Branchless - Random
seconds = 2.564

//  Branchless - Sorted
seconds = 2.587

Java - NetBeans 7.1.1 JDK 7 - x64

//  Branch - Random
seconds = 10.93293813

//  Branch - Sorted
seconds = 5.643797077

//  Branchless - Random
seconds = 3.113581453

//  Branchless - Sorted
seconds = 3.186068823

Observations:

  • With the branch: there is a huge difference between the sorted and unsorted data.
  • With the hack: there is no difference between sorted and unsorted data.
  • In the C++ case, the hack is actually a tad slower than with the branch when the data is sorted.

A general rule of thumb is to avoid data-dependent branching in critical loops (such as in this example).
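If you would rather not hand-roll bit hacks, a more readable formulation that compilers can often (though not always) turn into a conditional-move instruction instead of a branch is a plain ternary; whether a cmov is actually emitted depends on the compiler and flags (see the update below):

// Selection expressed without an if-statement; still data-dependent,
// but cmov-friendly for compilers that choose to use it.
sum += data[c] >= 128 ? data[c] : 0;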


Update:

  • GCC 4.6.1 with -O3 or -ftree-vectorize on x64 is able to generate a conditional move, so there is no difference between the sorted and unsorted data - both are fast.

  • VC++ 2010 is unable to generate conditional moves for this branch even under /Ox.

  • Intel C++ Compiler 11 does something miraculous. It interchanges the two loops, thereby hoisting the unpredictable branch to the outer loop (see the sketch after this list). Not only is it immune to the mispredictions, it is also twice as fast as whatever VC++ and GCC can generate! In other words, ICC took advantage of the test loop to defeat the benchmark...

  • If you give the Intel compiler the branchless code, it just outright vectorizes it... and it is just as fast as with the branch (with the loop interchange).
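For reference, here is a hand-written sketch of what that loop interchange amounts to (an illustration of the transformation, not ICC's actual output). The i and c loops trade places, so the data-dependent branch executes arraySize times instead of 100000 * arraySize times:

// After interchange: the branch runs once per element, and the
// (now perfectly predictable) inner loop does the repeated adds.
for (unsigned c = 0; c < arraySize; ++c)
{
    if (data[c] >= 128)
        for (unsigned i = 0; i < 100000; ++i)
            sum += data[c];   // same result as sum += 100000LL * data[c]
}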

This goes to show that even mature modern compilers can vary wildly in their ability to optimize code...

