Tag Archives: OpenCL

Latency Comparison of Lua, OpenCL, and native C/C++

Just-in-time compilation is an appealing technique for producing optimized code at run time rather than at compile time. In an earlier post I looked into the just-in-time compilation overhead of various OpenCL SDKs. This blog post looks into the cost of launching OpenCL kernels on the CPU and compares it with the cost of calling a plain C/C++ function through a function pointer, and with the cost of calling a precompiled Lua script. Continue reading →

Strided Memory Access on CPUs, GPUs, and MIC

Optimization guides for GPUs discuss at length the importance of contiguous ("coalesced", etc.) memory access for achieving high memory bandwidth (e.g. this parallel4all blog post). But how does strided memory access compare across different architectures? Is this something specific to NVIDIA GPUs? Let's shed some light on these questions with some benchmarks. Continue reading →
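
To see why stride matters at all, here is a minimal back-of-the-envelope sketch in Python. It is an idealized model, not a benchmark: it assumes memory is always transferred in whole, aligned cache lines and that no element straddles two lines. The function name `useful_fraction` and the default sizes (4-byte elements, 64-byte lines) are assumptions chosen for illustration; real hardware differs, which is why measurements are needed.

```python
def useful_fraction(stride_elems: int, elem_bytes: int = 4, line_bytes: int = 64) -> float:
    """Average fraction of each fetched cache line that is actually used
    when reading every `stride_elems`-th element of a long array.

    Idealized model: whole, aligned cache lines; elements never straddle lines.
    """
    if stride_elems < 1:
        raise ValueError("stride_elems must be at least 1")
    stride_bytes = stride_elems * elem_bytes
    # Once the stride reaches the line size, every line fetched holds
    # exactly one useful element, so the fraction stops shrinking.
    return elem_bytes / min(stride_bytes, line_bytes)


if __name__ == "__main__":
    for stride in (1, 2, 4, 8, 16, 32):
        print(f"stride {stride:2d}: useful fraction {useful_fraction(stride):.4f}")
```

In this model the useful fraction halves with every doubling of the stride until the stride spans a full line. Whether measured bandwidth follows the same curve on a given CPU, GPU, or MIC is exactly the kind of question only a benchmark can answer.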

GPU Memory Bandwidth vs. Thread Blocks (CUDA) / Workgroups (OpenCL)

The massive parallelism of GPUs provides ample performance for certain algorithms in scientific computing. At the same time, however, Amdahl's Law imposes limits on possible performance gains from parallelization. Thus, let us look in this blog post at how *few* threads one can launch on GPUs while still getting good performance (here: memory bandwidth). Continue reading →
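
For readers who want the limit from Amdahl's Law in concrete terms, a minimal sketch follows. With a parallelizable fraction p of the runtime and n workers, the speedup is bounded by 1 / ((1 - p) + p / n). The function name `amdahl_speedup` and the value p = 0.95 in the usage loop are hypothetical and chosen only for illustration.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup according to Amdahl's Law.

    parallel_fraction: share of the serial runtime that can be parallelized (0..1).
    workers: number of parallel workers (threads, cores, ...).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical workload in which 95% of the runtime parallelizes.
    for workers in (1, 32, 1024, 32768):
        print(f"{workers:6d} workers: speedup <= {amdahl_speedup(0.95, workers):.2f}")
```

However many workers are added, the bound never exceeds 1 / (1 - p), which is why the serial part of an algorithm ends up deciding how much parallel hardware pays off.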

OpenCL Just-In-Time (JIT) Compilation Benchmarks

The beauty of the vendor-independent standard OpenCL is that a single kernel language is sufficient to program many different architectures, ranging from dual-core CPUs through Intel's Many Integrated Core (MIC) architecture to GPUs and even FPGAs. The kernels are just-in-time compiled during the program run, which has several advantages and disadvantages. An incomplete list is as follows:

Advantage: Binary can be fully optimized for the underlying hardware
Advantage: High portability
Disadvantage: Just-in-Time compilation induces overhead
Disadvantage: No automatic performance portability

Today's blog post is about just-in-time (jit) compilation overhead. Ideally, jit-compilation is infinitely fast. In reality, it is sufficient to keep the jit-compilation time small compared to the overall execution time. But what is 'small'?
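
One way to make 'small' concrete is to look at the share of total run time spent in jit-compilation. The sketch below assumes a simple model in which a kernel is compiled once and then launched repeatedly with a constant run time per launch; the function names and all numbers in the usage example are hypothetical, not measurements.

```python
import math


def jit_overhead_fraction(jit_seconds: float, kernel_seconds: float, launches: int) -> float:
    """Share of the total time spent on jit-compilation, assuming the kernel
    is compiled once and then launched `launches` times."""
    total = jit_seconds + kernel_seconds * launches
    if total <= 0.0:
        raise ValueError("total time must be positive")
    return jit_seconds / total


def launches_to_amortize(jit_seconds: float, kernel_seconds: float, max_fraction: float) -> int:
    """Smallest number of launches for which the jit share drops to `max_fraction`
    or below. Follows from jit / (jit + n * kernel) <= f, i.e.
    n >= jit * (1 - f) / (f * kernel)."""
    if kernel_seconds <= 0.0 or not 0.0 < max_fraction < 1.0:
        raise ValueError("need kernel_seconds > 0 and 0 < max_fraction < 1")
    return math.ceil(jit_seconds * (1.0 - max_fraction) / (max_fraction * kernel_seconds))


if __name__ == "__main__":
    # Hypothetical numbers: 0.5 s of compilation, 2 ms per kernel launch.
    n = launches_to_amortize(0.5, 0.002, 0.05)
    print(f"launches needed for <= 5% jit share: {n}")
    print(f"jit share after {n} launches: {jit_overhead_fraction(0.5, 0.002, n):.3f}")
```

Under this model a long-running simulation hides even a slow compiler, while a short script with a handful of kernel launches is dominated by compilation time.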

Continue reading →

CfP: Intl. Workshop on OpenCL 2016 (IWOCL 2016)

The International Workshop on OpenCL (IWOCL) is an annual meeting bringing together the experts on OpenCL, an open standard for programming heterogeneous parallel computing systems. In 2016 IWOCL will run its fourth installment from April 19-21 in Vienna, Austria, and as local chair of IWOCL 2016 I'm proud to share the IWOCL 2016 Call for Papers, Technical Submissions, Tutorials and Posters. Continue reading →

Karl Rupp

Computational Scientist