c - Unbalanced nested for loops in OpenMP


I've been trying to parallelize an algorithm with unbalanced nested loops using OpenMP. I can't post the original code as it's a secret project of an unheard-of government, but here's a toy example:

for (i = 0; i < 100; i++) {
    #pragma omp parallel for private(j, k)
    for (j = 0; j < 1000000; j++) {
        for (k = 0; k < 2; k++) {
            temp = i * j * k;      /* dummy operation (don't mind the race) */
        }
        if (i % 2 == 0) temp = 0;  /* so I can't use OpenMP collapse */
    }
}

Currently this example works slower with multiple threads (~1 sec in a single thread, ~2.4 sec with 2 threads, etc.).

Things to note:

  • The outer loop needs to be done in order (it depends on the previous step). (As far as I know, OpenMP handles the inner loop so that threads don't get created/destroyed at each step, right?)

  • Typical index numbers are as given in the example (100, 1000000, 2).

  • The dummy operation consists of only a few operations.

  • There are conditional operations outside the inner loop, so collapse is not an option (and it doesn't seem to increase performance anyway).

It looks like an embarrassingly parallel algorithm, but I can't seem to get any speedups for the last 2 days. What would be the best strategy here?

Unfortunately your otherwise embarrassingly parallel algorithm is an embarrassingly bad example of how performant parallelism should be implemented. And since my crystal ball tells me that, besides i, temp is also a shared automatic variable, I will assume so for the rest of this text. It also tells me that you have a pre-Nehalem CPU...

There are two sources of slowdown here - code transformation and cache coherency.

The way parallel regions are implemented is that their code gets extracted into separate functions. Shared local variables get extracted into structures that are then shared between the threads in the team that executes the parallel region. Under the OpenMP transformations your code sample would become something similar to this:

typedef struct {
    int i;
    int temp;
} main_omp_fn_0_shared_vars;

void main_omp_fn_0 (void *data)
{
    main_omp_fn_0_shared_vars *vars = data;
    int j, k;

    // compute values of j_min and j_max for this thread

    for (j = j_min; j < j_max; j++) {
        for (k = 0; k < 2; k++) {
            vars->temp = vars->i * j * k;
        }
        if (vars->i % 2 == 0) vars->temp = 0;
    }
}

int main (void)
{
    int i, temp;
    main_omp_fn_0_shared_vars vars;

    for (i = 0; i < 100; i++)
    {
        vars.i = i;
        vars.temp = temp;

        // this is how GCC implements parallel regions with libgomp

        // start main_omp_fn_0 in the other threads
        gomp_parallel_start(main_omp_fn_0, &vars, 0);

        // start main_omp_fn_0 in the main thread too
        main_omp_fn_0(&vars);

        // wait for the other threads to finish (implicit barrier)
        gomp_parallel_end();

        i = vars.i;
        temp = vars.temp;
    }
}

You pay a small penalty for accessing temp and i this way, as their intermediate values cannot be stored in registers but are loaded and stored each time.

The other source of degradation is the cache coherency protocol. Accessing the same memory location from multiple threads executing on multiple CPU cores leads to lots of cache invalidation events. Worse, vars.i and vars.temp are likely to end up in the same cache line, and although vars.i is only read and vars.temp is only written to, full cache invalidation is likely to occur at each iteration of the inner loop.
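A minimal sketch of one way to attack both problems at once (assuming, as the toy code suggests, that the value of temp is not actually needed after the loops): make temp private so each thread keeps its own copy in a register, and hoist the parallel region outside the outer loop so the thread team is created once instead of 100 times. The work-sharing omp for inside still preserves the step-by-step ordering of the outer loop through its implicit barrier.

#include <omp.h>

int main (void)
{
    int i, j, k, temp = 0;

    // one parallel region for the whole computation: the team of threads
    // is created once here, not at every iteration of the outer loop
    #pragma omp parallel private(i, j, k, temp)
    for (i = 0; i < 100; i++) {
        // work-sharing: the j iterations are split among the threads;
        // the implicit barrier at its end keeps the outer steps in order
        #pragma omp for
        for (j = 0; j < 1000000; j++) {
            for (k = 0; k < 2; k++) {
                temp = i * j * k;   /* dummy operation on a private temp */
            }
            if (i % 2 == 0) temp = 0;
        }
    }
    return 0;
}

Since nothing shared is written in the hot path, both the register penalty and the cache-line invalidation described above disappear.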

Normally access to shared variables is protected with explicit synchronisation constructs like atomic statements and critical sections, and performance degradation is expected in that case.
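For reference, a minimal sketch of what those constructs look like when shared state genuinely has to be updated from multiple threads (the sum and max variables here are illustrative, not part of the question):

#include <omp.h>
#include <stdio.h>

int main (void)
{
    int j, sum = 0, max = 0;

    #pragma omp parallel for
    for (j = 0; j < 1000000; j++) {
        int v = j & 0xff;

        // atomic: cheap protection for a single simple update
        #pragma omp atomic
        sum += v;

        // critical: general mutual exclusion around a whole block
        #pragma omp critical
        {
            if (v > max) max = v;
        }
    }

    printf("sum = %d, max = %d\n", sum, max);
    return 0;
}

Even these correct versions serialise the threads on every iteration, which is why a reduction clause (e.g. reduction(+:sum)) is usually preferred where it applies.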

