opencl - clBuildProgram yields AccessViolationException when building this specific kernel -


this part of sort of parallel reduction/extremum kernel. have reduced minimum code still gets clbuildprogram crashing (note crashes, , doesn't return error code):

edit: seems happens when local_value declared global instead of local.

edit2 / solution: problem there infinite loop. should have written remaining_items >>= 1 instead of remaining_items >> 1. has been said in answers, nvidia compiler seems not robust when comes compile/optimization errors.

kernel void testkernel(local float *local_value) {     size_t thread_id = get_local_id(0);      int remaining_items = 1024;      while (remaining_items > 1)     {         // throw away right half of threads         remaining_items >> 1; // <-- spotted bug         if (thread_id > remaining_items)         {             return;         }          // greater value in right half of memory space         int right_index = thread_id + remaining_items;         float right_value = local_value[right_index];         if (right_value > local_value[thread_id])         {             local_value[thread_id] = right_value;         }          barrier(clk_global_mem_fence);     } }

removing lines return; and/or local_value[thread_id] = right_value; causes clbuildprogram finish successfully.

i can reproduce problem on of computers (nvidia gtx 560, gt 555m, gt 540m, they're fermi 2.1 architecture). it's apparent on nvidia cuda toolkit sdk versions 4.0, 4.1 , 4.2, when using either x64 or x86 libraries.

does have idea problem?

is possible, local (aka shared) memory automatically assumed (work_group_size) * siezof(its_base_type)? explain why works when lines mentioned above removed.


minimal host code (c99 compatible) reproduction:

#include <stdio.h> #include <stdlib.h> #include <string.h> #include <time.h>  #ifdef __apple__ #include <opencl/opencl.h> #else #include <cl/cl.h> #endif  #define return_throw(expression) { cl_int ret = expression; if (ret) { printf(#expression " failed: %d\n" , ret); exit(1); } } while (0) #define ref_throw(expression) { cl_int ret; expression; if (ret) { printf(#expression " failed: %d\n" , ret); exit(1); } } while (0)  int main(int argc, char **argv) {     // load kernel source code array source_str     file *fp;      fp = fopen("testkernel.cl", "rb");     if (!fp)     {         fprintf(stderr, "failed load kernel.\n");         exit(1);     }     fseek(fp, 0, seek_end);     int filesize = ftell(fp);     rewind(fp);     char *source_str = (char*)calloc(filesize, sizeof(char));     size_t bytes_read = fread(source_str, 1, filesize, fp);     source_str[bytes_read] = 0;     fclose(fp);      // platform information     cl_uint num_platforms;     return_throw(clgetplatformids(0, null, &num_platforms));      cl_platform_id *platform_ids = (cl_platform_id *)calloc(num_platforms, sizeof(cl_platform_id));     return_throw(clgetplatformids(num_platforms, platform_ids, null));      cl_device_id selected_device_id = null;      printf("available platforms:\n");     (cl_uint = 0; < num_platforms; i++)     {         char platform_name[50];         return_throw(clgetplatforminfo(platform_ids[i], cl_platform_name, 50, platform_name, null));         printf("%s\n", platform_name);          // devices platform         cl_uint num_devices;         return_throw(clgetdeviceids(platform_ids[i], cl_device_type_gpu, 0, null, &num_devices));          cl_device_id *device_ids = (cl_device_id *)calloc(num_devices, sizeof(cl_device_id));         return_throw(clgetdeviceids(platform_ids[i], cl_device_type_gpu, num_devices, device_ids, null));          // select first nvidia device         if (strstr(platform_name, "nvidia"))        // adapt accordingly         {             selected_device_id = device_ids[0];         }     }      if (selected_device_id == null)     {         printf("no nvidia device found\n");         exit(1);     }      // create opencl context     cl_context context;     ref_throw(context = clcreatecontext(null, 1, &selected_device_id, null, null, &ret));      // create program kernel source     cl_program program;     ref_throw(program = clcreateprogramwithsource(context, 1, (const char **)&source_str, null, &ret));      // build program     cl_int ret = clbuildprogram(program, 1, &selected_device_id, null, null, null);     if (ret)     {         printf("build error\n");         // build error - build log , display         size_t build_log_size;         ret = clgetprogrambuildinfo(program, selected_device_id, cl_program_build_log, 0, null, &build_log_size);         char *build_log = new char[build_log_size];         ret = clgetprogrambuildinfo(program, selected_device_id, cl_program_build_log, build_log_size, build_log, null);         printf("%s\n", build_log);         exit(1);     }      printf("build finished successfully\n");     return 0; }

in experience nvidia compiler isn't robust when comes handling build errors, have compile error somewhere.

i think problem indeed return, or more point combination barrier. according opencl spec barriers:

all work-items in work-group executing kernel on processor must execute function before allowed continue execution beyond barrier. function must encountered work-items in work-group executing kernel.

if barrier inside conditional statement, work-items must enter onditional if work-item enters conditional statement , executes barrier.

if barrer inside loop, work-items must execute barrier each iteration of loop before allowed continue execution beyond barrier.

so think problem lot of threads return before getting barrier, making code invalid. maybe should try this:

kernel void testkernel(local float *local_value) {     size_t thread_id = get_local_id(0);     int remaining_items = 1024;     while (remaining_items > 1) {         remaining_items >>= 1;// throw away right half of threads         if (thread_id <= remaining_items) {              // greater value in right half of memory space              int right_index = thread_id + remaining_items;              float right_value = local_value[right_index];              if (right_value > local_value[thread_id])                  local_value[thread_id] = right_value;         }         barrier(clk_global_mem_fence);     } } 

edit: furthermore noted in comments needs remaining_items>>=1 instead of remaining_items>>1 in order avoid producing infinite loop.


Comments

Popular posts from this blog

c# - SVN Error : "svnadmin: E205000: Too many arguments" -

c# - Copy ObservableCollection to another ObservableCollection -

All overlapping substrings matching a java regex -