GPU & Accelerator Computing

While central processing units (CPUs) roughly doubled their clock speed every two years up until the early 2000s, physical limitations no longer allow such improvements for a single core. Instead, modern CPUs provide multiple cores, so suitable parallel algorithms are required to benefit from the additional cores rather than from higher clock frequencies. For certain parallel algorithms it may even pay off to run general purpose computations on graphics processing units (GPUs), which are by design tailored to work efficiently in parallel.

Programming environments for GPUs, however, typically provide only low-level access to the hardware. For linear algebra operations, Florian Rudolf and I put a high-level C++ interface on top of various compute kernels for iterative solvers and released the result as ViennaCL. The library has gained a lot of additional functionality since then and now supports both dense and sparse linear algebra using CUDA, OpenCL, and OpenMP compute backends.
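To illustrate the interface, the following minimal sketch solves a sparse system with ViennaCL's conjugate gradient solver; the tridiagonal test matrix and the problem size are placeholders, not part of any real application:

```cpp
#include <vector>
#include <map>

#include "viennacl/vector.hpp"
#include "viennacl/compressed_matrix.hpp"
#include "viennacl/linalg/cg.hpp"

int main()
{
  std::size_t n = 1000;

  // Assemble a simple tridiagonal test system on the host:
  std::vector<std::map<unsigned int, double> > host_A(n);
  std::vector<double> host_b(n, 1.0);
  for (std::size_t i = 0; i < n; ++i)
  {
    host_A[i][i] = 2.0;
    if (i > 0)     host_A[i][i-1] = -1.0;
    if (i < n - 1) host_A[i][i+1] = -1.0;
  }

  // Copy the system to the compute device:
  viennacl::compressed_matrix<double> A(n, n);
  viennacl::vector<double> b(n);
  viennacl::copy(host_A, A);
  viennacl::copy(host_b, b);

  // Solve A x = b using the conjugate gradient method:
  viennacl::vector<double> x = viennacl::linalg::solve(A, b, viennacl::linalg::cg_tag());

  // Copy the result back to the host:
  std::vector<double> host_x(n);
  viennacl::copy(x, host_x);
}
```

Depending on how the library is configured at compile time, the same code runs via the CUDA, OpenCL, or OpenMP backend, so user code does not need to change when switching hardware.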

Finally, it is important to keep the following in mind: Even though some publications on GPUs claim speed-ups of a factor of a hundred or more over a traditional CPU-based implementation, such claims are not backed by the hardware specifications and thus only show that the reference implementation is poor. A comparison of hardware specifications shows that GPUs may offer up to ten-fold performance gains; higher gains are only possible in very specialized scenarios. Also, moving data from main memory to GPU memory is an expensive operation because of the relatively low memory bandwidth provided by PCI-Express. Benchmarks often neglect this transfer overhead.
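One can quantify this overhead on a given machine by timing a host-to-device transfer directly. The following is a minimal sketch using ViennaCL; the vector size of 10^7 doubles is an arbitrary choice for illustration:

```cpp
#include <chrono>
#include <iostream>
#include <vector>

#include "viennacl/vector.hpp"

int main()
{
  std::size_t n = 10 * 1000 * 1000;          // 10^7 doubles, about 80 MB
  std::vector<double>      host_x(n, 1.0);
  viennacl::vector<double> device_x(n);

  viennacl::copy(host_x, device_x);          // warm-up transfer
  viennacl::backend::finish();

  auto t0 = std::chrono::high_resolution_clock::now();
  viennacl::copy(host_x, device_x);          // timed host-to-device transfer
  viennacl::backend::finish();               // wait until the transfer has completed
  auto t1 = std::chrono::high_resolution_clock::now();

  double seconds = std::chrono::duration<double>(t1 - t0).count();
  std::cout << "Effective transfer bandwidth: "
            << n * sizeof(double) / seconds * 1e-9 << " GB/s" << std::endl;
}
```

For example, on a PCI-Express 3.0 x16 link with a theoretical maximum of about 16 GB/s, transferring these 80 MB alone takes several milliseconds, whereas a GPU with hundreds of GB/s of memory bandwidth streams through the same data in a fraction of a millisecond.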