Three Suggestions for Improving OpenCL for Library Developers

OpenCL is not (yet) a success story in high performance computing. More researchers are drawn towards NVIDIA's CUDA, which offers a richer toolchain and makes it easier to get started. Vendor lock-in seems to be less of a concern for my colleagues, a view I do not share as somebody who is paid from public money.

Anyway, this blog post is not yet-another-OpenCL-vs-CUDA discussion. Instead, it provides three suggestions on how OpenCL could become more attractive for software library developers, so that the OpenCL library ecosystem can grow. Only if OpenCL libraries provide 90+ percent of the functionality a user needs will the user be willing to spend the time on getting the remaining percent (if any) done.

In the following I summarize the key elements of my IWOCL 2016 Keynote. Slides and a technical note are available for download.

Current Problems of OpenCL for Library Developers

Before discussing my suggestions, let's have a quick look at how OpenCL library development differs from OpenCL-enabled application development:

High number of kernels. A library may provide hundreds of different OpenCL kernels, which need to be just-in-time (jit) compiled at some point. Often a user only needs a small subset of these kernels, so jit-compiling all kernels upfront is not a viable option. As a previous blog post has shown, OpenCL kernel compilation has significant overhead. This is especially pronounced if a user wants to combine several OpenCL-enabled libraries.
Complex interaction of kernels. Unlike CUDA, OpenCL is not a single-source approach. This means that OpenCL kernels have to be provided as strings to the jit-compiler at runtime, so kernel compilation is completely decoupled from the host compiler. While this allows for a clean and non-intrusive build process, it also makes it fairly difficult to support user customizations of implementations, such as a user-defined sorting criterion for common sorting algorithms.
Heterogeneous OpenCL support. OpenCL library development would be a lot easier if all OpenCL SDK providers (this, in particular, includes hardware vendors) supported the latest standard shortly after its release. However, NVIDIA still only provides OpenCL 1.2 support, so a library developer cannot easily make OpenCL 2.0 a requirement.
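To illustrate the second point: because kernels are plain strings, a library that wants to accept a user-defined sorting criterion typically ends up splicing user-supplied OpenCL C source text into a kernel template before handing the result to the jit-compiler. The following is only a minimal sketch in Python. The template, the kernel name compare_exchange, and the helper make_sort_kernel_source are hypothetical and not taken from any existing library; the generated string is what would eventually be passed to clCreateProgramWithSource.

```python
from string import Template

# Hypothetical kernel template: one compare-exchange step, as used inside
# sorting networks. $type and $compare are filled in at runtime; $compare is
# user-supplied OpenCL C source text, not a host-language function.
KERNEL_TEMPLATE = Template("""
inline bool user_less($type a, $type b) { return $compare; }

__kernel void compare_exchange(__global $type *data, uint stride)
{
    uint i = get_global_id(0);
    uint j = i ^ stride;
    if (j > i && user_less(data[j], data[i])) {
        $type tmp = data[i];
        data[i] = data[j];
        data[j] = tmp;
    }
}
""")


def make_sort_kernel_source(value_type: str, compare_expr: str) -> str:
    """Return OpenCL C source with the user's comparison spliced in (assumed helper)."""
    return KERNEL_TEMPLATE.substitute(type=value_type, compare=compare_expr)


# The user has to express the criterion as a string of OpenCL C code.
source = make_sort_kernel_source("float", "fabs(a) < fabs(b)")
```

Note that the host compiler never sees the comparison: a typo in the user's expression only surfaces as a jit-compilation error at runtime, which is part of what makes this kind of customization awkward.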

Other issues such as performance portability are not library-specific. Also, there is no "compiler magic" (and I postulate that there will never be such magic) to completely solve performance portability issues. Performance portability is just hard. (Neil Trevett, IWOCL 2016)

My Suggestions

Here are my suggestions for making the life of an OpenCL library developer easier:

Reducing overhead of just-in-time compilation. Most notably, this can be achieved by making the currently optional OpenCL program cache a requirement for OpenCL SDKs. (An OpenCL program cache stores the binaries on the filesystem upon first compilation. In subsequent runs, the binary is loaded from disk within milliseconds.) Today, libraries such as Boost.Compute, VexCL, or ViennaCL have to implement their own kernel caching strategies, because OpenCL SDKs from e.g. Intel and AMD still do not provide an OpenCL program cache (somewhat ironically, NVIDIA does provide an OpenCL program cache). It makes much more sense to require the small number of OpenCL SDKs to implement kernel caching than to let hundreds of OpenCL-enabled libraries and applications reimplement such a cache. Not to mention the usability nightmare when combining several OpenCL-enabled libraries, each implementing its own OpenCL program cache.
Better support for user-provided function pointers. If OpenCL is used to run a kernel on the CPU, user experience could benefit tremendously if function pointers could be passed to OpenCL kernels and called from there. Wouldn't it be cool if you could call any of the thread-safe routines from glib in your OpenCL kernel? I see technical reasons why this can't be done for OpenCL kernels on GPUs; but I don't see technical reasons why this should not be possible if OpenCL kernels are run on CPUs.
Consider fusion of OpenCL and Vulkan. Vulkan has gained a lot of momentum recently, with vendors being eager to support the latest standard to get the most out of their hardware. The same is not quite the case for OpenCL. Given that OpenCL and Vulkan share infrastructure, it is worth considering the option of integrating OpenCL into Vulkan as something like Vulkan-Compute.
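To make the first suggestion more concrete, here is a rough sketch of what a library-side program cache boils down to today. It is written in plain Python and does not reflect the actual implementation in Boost.Compute, VexCL, or ViennaCL. The functions cache_key and load_or_build and the stand-in compiler fake_compile are hypothetical; in a real library the compile step would build the program and fetch the binary via clGetProgramInfo with CL_PROGRAM_BINARIES, and a cache hit would go through clCreateProgramWithBinary.

```python
import hashlib
import os
import tempfile


def cache_key(source: str, device_name: str, build_options: str = "") -> str:
    """Hash everything the binary depends on (a real cache should also
    include the driver version; omitted here for brevity)."""
    h = hashlib.sha256()
    for part in (device_name, build_options, source):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator so that ("ab", "c") != ("a", "bc")
    return h.hexdigest()


def load_or_build(source, device_name, build_options, compile_fn, cache_dir):
    """Return the program binary, compiling only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    key = cache_key(source, device_name, build_options)
    path = os.path.join(cache_dir, key + ".bin")
    if os.path.exists(path):
        with open(path, "rb") as f:  # cache hit: no jit compilation
            return f.read()
    binary = compile_fn(source, build_options)  # cache miss: slow path
    # Write to a temporary file first, then rename, so that concurrent
    # processes never observe a partially written binary.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(binary)
    os.replace(tmp_path, path)
    return binary


# Stand-in for the real jit compiler (assumption for this sketch only).
calls = []

def fake_compile(source, build_options):
    calls.append(source)
    return b"BINARY:" + source.encode("utf-8")


with tempfile.TemporaryDirectory() as cache_dir:
    src = "__kernel void f() {}"
    first = load_or_build(src, "Some Device", "", fake_compile, cache_dir)
    second = load_or_build(src, "Some Device", "", fake_compile, cache_dir)
```

Even this small sketch has to deal with cache keys, file locations, and concurrent writers. Every library making these choices separately is exactly the duplication that a mandatory SDK-level cache would avoid.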

Final Words

OpenCL is evolving, with many new features such as SYCL on the (C++)-horizon. At the same time, OpenCL does not primarily aim at high performance computing (unlike CUDA), but also needs to take the specifics of e.g. FPGAs into account. Also, with AMD's Boltzmann initiative and its promise to be able to compile CUDA code, OpenCL will have a hard time in HPC.

This blog post is for calendar week 12 of my weekly blogging series for 2016.

Karl Rupp

Computational Scientist
