Writing Efficient Code for OpenCL Applications

This is part 2 of the OpenCL Webinar Series put out by Intel. There was some good information about optimization in general, but a lot of it focused on how to optimize OpenCL running on 3rd Gen Intel Core processors. I jotted down some optimization notes that are applicable to all processors.

Avoid Invariants in OpenCL Kernels

Move anything that is independent of kernel execution (invariants, constants) to the host, so it is computed once instead of once per work item.

Bad:

__kernel void test1 (__global int* data, int2 size, int base) {

       int offset = size.y * base + size.x;

       float offsetF = (float)offset;

...

Good:

__kernel void test1 (__global int* data, int offset, float offsetF) {

...
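Here is a rough host-side sketch, in plain C, of where the hoisted computation could live. The names `Test1Args` and `make_test1_args` are made up for illustration and are not from the webinar; in a real program the two values would then be handed to the kernel with clSetKernelArg.

```c
#include <assert.h>

/* Hypothetical container for the two precomputed kernel arguments. */
typedef struct {
    int   offset;   /* size.y * base + size.x, computed once */
    float offsetF;  /* the same value as a float */
} Test1Args;

/* Compute the invariants once on the host instead of once per work item. */
static Test1Args make_test1_args(int size_x, int size_y, int base)
{
    /* Assumption: sizes and the base index are non-negative. */
    assert(size_x >= 0 && size_y >= 0 && base >= 0);

    Test1Args args;
    args.offset  = size_y * base + size_x;
    args.offsetF = (float)args.offset;
    return args;
}
```

The kernel then only reads `offset` and `offsetF`; none of the work items repeat the multiply, add, or int-to-float conversion.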

Avoid Initialization in Kernels

Move one-time initialization to the host or into a separate kernel.

Bad:

__kernel void something (__global int* data) {

       size_t tid = get_global_id(0);

       if (0 == tid) {

              // Do something

       }

       barrier(CLK_GLOBAL_MEM_FENCE);

}

Use the Built-in Functions

e.g. dot, hypot, clamp, etc.

Trade Accuracy vs Speed

Once the output is correct, look at the performance. If some precision can be given up, use the faster, less accurate built-ins such as "mad" or "native_sin".

Get rid of edge conditions

Bad:

__kernel void myKernel(__global int* data, int maxRow, int maxCol){

       int row = get_global_id(0);

       int col = get_global_id(1);

       if (row > maxRow){

...

       else if (col > maxCol) {

...

       else{

...

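Use the ND range: launch over a smaller range that only covers valid elements, so the kernel can skip the conditionals.

Use padded buffers, so that work items past the edge still read and write valid memory.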
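A small sketch of the padding arithmetic, in plain C on the host side. `round_up`, `Extent2D` and `padded_extent` are names I made up for illustration; the idea is just to round each dimension up to a multiple of the work group size and allocate the buffer with the padded extent.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of `multiple` (multiple must be > 0). */
static size_t round_up(size_t n, size_t multiple)
{
    assert(multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Hypothetical helper type: a 2D extent in elements. */
typedef struct {
    size_t rows;
    size_t cols;
} Extent2D;

/* Pad both dimensions so every work item in every work group maps to
   allocated memory and the kernel needs no bounds checks. */
static Extent2D padded_extent(size_t rows, size_t cols,
                              size_t local_rows, size_t local_cols)
{
    Extent2D e;
    e.rows = round_up(rows, local_rows);
    e.cols = round_up(cols, local_cols);
    return e;
}
```

For a 100 x 30 data set and 8 x 8 work groups this gives a 104 x 32 extent. The buffer would be allocated with `e.rows * e.cols` elements and the same extent used as the global work size; the padding elements get computed and are simply ignored afterwards.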

Get rid of edge conditions in general

Use logical operations instead of comparisons and conditional branches.
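To show what replacing a condition with logical ops can look like, here is a plain C sketch. The function names are made up for illustration. It turns the comparison result into a bit mask and selects with AND/OR, so there is no branch. In OpenCL C, the built-in select() can be used for the same purpose.

```c
#include <assert.h>
#include <stdint.h>

/* Branchy version: the result depends on a comparison and a jump. */
static int32_t pick_branch(int32_t x, int32_t limit, int32_t a, int32_t b)
{
    if (x > limit)
        return a;
    return b;
}

/* Branch-free version: the comparison yields 0 or 1 in C, and negating it
   gives an all-zeros or all-ones mask. */
static int32_t pick_mask(int32_t x, int32_t limit, int32_t a, int32_t b)
{
    int32_t mask = -(int32_t)(x > limit);   /* -1 when true, 0 when false */
    return (a & mask) | (b & ~mask);
}
```

Both functions return the same value for every input; whether the masked form is actually faster depends on the device and the compiler, so measure it.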

Reduce the number of registers used to increase parallelism

Avoid Byte/Short Loads and Stores

Load and store in larger chunks.

e.g. uchar -> uint4
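The same idea in plain C, as a sketch: fetch four bytes with one 32-bit load and unpack them in registers, instead of issuing four one-byte loads. The function names are made up for illustration, and the wide version assumes the length is a multiple of 4 (e.g. a padded buffer). In an OpenCL kernel this would be written with the vector types rather than with memcpy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One byte per load: the pattern the note advises against. */
static uint32_t sum_bytes_narrow(const uint8_t *data, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

/* Four bytes per load: fetch a 32-bit word, then unpack it in registers. */
static uint32_t sum_bytes_wide(const uint8_t *data, size_t n)
{
    assert(n % 4 == 0);   /* assumption: length padded to a multiple of 4 */

    uint32_t sum = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint32_t word;
        memcpy(&word, data + i, sizeof word);   /* one wide load */
        sum += (word & 0xFFu) + ((word >> 8) & 0xFFu)
             + ((word >> 16) & 0xFFu) + (word >> 24);
    }
    return sum;
}
```

Because all four bytes of each word are summed, the result does not depend on byte order.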

Don't use too many barriers in a kernel

A barrier makes every work item in the work group wait until all of them have reached it.

If a work item never reaches the barrier, the code may hang.

Use the vector types (float8, double4, int4, etc.) for better performance on the CPU

The preferred work group size for kernels is 64 or 128 work items.

-Make the work group size a multiple of 8.

-Query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE parameter by calling clGetKernelWorkGroupInfo.

-Match the number of work groups to the number of logical cores.
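One way to combine these hints is sketched below in plain C. This is my own heuristic, not something from the webinar, and `pick_work_group_size` is a made-up name: aim for roughly one work group per logical core, keep the size a multiple of the queried preferred multiple, and cap it at a maximum such as 128.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical heuristic: aim for one work group per logical core, keep the
   size a multiple of `multiple`, and stay within [multiple, max_size]. */
static size_t pick_work_group_size(size_t global_size, size_t logical_cores,
                                   size_t multiple, size_t max_size)
{
    assert(logical_cores > 0 && multiple > 0 && max_size >= multiple);

    size_t per_core = global_size / logical_cores;
    size_t size = (per_core / multiple) * multiple;   /* round down to a multiple */

    if (size < multiple)
        size = multiple;                              /* never below one multiple */
    if (size > max_size)
        size = (max_size / multiple) * multiple;      /* cap, still a multiple */
    return size;
}
```

With 1024 work items, 8 logical cores, a multiple of 8 and a cap of 128, this picks 128. When the cap kicks in there will be more work groups than cores, and the global size may still need rounding up to a multiple of the chosen size, so treat the result as a starting point for measurement.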

Buffer Objects vs. Image Objects

CPU Device:

- Avoid using image objects; use buffer objects instead.