Category Archives: GPGPU/MIC Computing

Topics related to general-purpose computations on GPUs or the MIC platform.

Strided Memory Access on CPUs, GPUs, and MIC

Optimization guides for GPUs discuss at length the importance of contiguous ("coalesced", etc.) memory access for achieving high memory bandwidth (e.g. this parallel4all blog post). But how does strided memory access compare across different architectures? Is this something specific to NVIDIA GPUs? Let's shed some light on these questions with a few benchmarks. Continue reading
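To illustrate the effect on a CPU, here is a minimal C++ sketch (array sizes and stride range chosen arbitrarily for illustration; no warm-up or repeated runs) that reads an array with increasing stride and reports the bandwidth over the elements actually used:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  const std::size_t N = 1 << 22;          // number of elements actually touched
  const std::size_t max_stride = 16;
  std::vector<double> x(N * max_stride, 1.0), y(N, 0.0);

  for (std::size_t stride = 1; stride <= max_stride; ++stride) {
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; ++i)
      y[i] += x[i * stride];              // strided read, contiguous write
    auto t1 = std::chrono::steady_clock::now();
    double sec = std::chrono::duration<double>(t1 - t0).count();
    double gb  = 2.0 * N * sizeof(double) / 1e9;  // bytes actually used
    std::printf("stride %2zu: %6.2f GB/s\n", stride, gb / sec);
  }
  std::printf("check: %g\n", y[N / 2]);   // keeps the compiler from eliding the loop
}
```

Since the hardware always transfers full cache lines, the useful bandwidth drops as the stride grows, even though the loop executes the same number of operations each time.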

GPU Memory Bandwidth vs. Thread Blocks (CUDA) / Workgroups (OpenCL)

The massive parallelism of GPUs provides ample performance for certain algorithms in scientific computing. At the same time, however, Amdahl's Law imposes limits on the possible performance gains from parallelization. Thus, let us look in this blog post at how *few* threads one can launch on GPUs while still getting good performance (here: memory bandwidth). Continue reading
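To make the setup concrete, here is a small OpenCL host sketch (not the exact benchmark code behind this post; error handling and warm-up runs are omitted for brevity) that launches a copy kernel with a growing number of fixed-size workgroups and reports the resulting memory bandwidth:

```cpp
#include <CL/cl.h>
#include <chrono>
#include <cstdio>

// Copy kernel in which each work-item strides over the whole buffer, so even
// a single workgroup covers all N elements (just more slowly).
static const char *src = R"(
__kernel void copy(__global const float *x, __global float *y, ulong N) {
  for (ulong i = get_global_id(0); i < N; i += get_global_size(0))
    y[i] = x[i];
})";

int main() {
  const cl_ulong N = 1 << 24;
  cl_int err;
  cl_platform_id platform;  clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id   device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);
  cl_context       ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
  cl_command_queue q   = clCreateCommandQueue(ctx, device, 0, &err);

  // buffer contents are irrelevant for a bandwidth measurement
  cl_mem x = clCreateBuffer(ctx, CL_MEM_READ_WRITE, N * sizeof(float), nullptr, &err);
  cl_mem y = clCreateBuffer(ctx, CL_MEM_READ_WRITE, N * sizeof(float), nullptr, &err);

  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
  clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
  cl_kernel k = clCreateKernel(prog, "copy", &err);
  clSetKernelArg(k, 0, sizeof(cl_mem),   &x);
  clSetKernelArg(k, 1, sizeof(cl_mem),   &y);
  clSetKernelArg(k, 2, sizeof(cl_ulong), &N);

  const size_t local = 256;                       // workgroup size
  for (size_t groups = 1; groups <= 1024; groups *= 2) {
    size_t global = groups * local;
    clFinish(q);
    auto t0 = std::chrono::steady_clock::now();
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, &local, 0, nullptr, nullptr);
    clFinish(q);                                  // wait for the kernel to complete
    double sec = std::chrono::duration<double>(
                     std::chrono::steady_clock::now() - t0).count();
    std::printf("%5zu workgroup(s): %7.2f GB/s\n",
                groups, 2.0 * N * sizeof(float) / 1e9 / sec);
  }
}
```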

OpenCL Just-In-Time (JIT) Compilation Benchmarks

The beauty of the vendor-independent OpenCL standard is that a single kernel language suffices to program many different architectures, ranging from dual-core CPUs and Intel's Many Integrated Core (MIC) architecture all the way to GPUs and even FPGAs. The kernels are just-in-time compiled at program runtime, which has several advantages and disadvantages. An incomplete list is as follows:

  • Advantage: Binary can be fully optimized for the underlying hardware
  • Advantage: High portability
  • Disadvantage: Just-in-Time compilation induces overhead
  • Disadvantage: No automatic performance portability

Today's blog post is about just-in-time (JIT) compilation overhead. Ideally, JIT compilation is infinitely fast. In reality, it is sufficient to keep the JIT compilation time small compared to the overall execution time. But what is 'small'?

Continue reading
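To get a feeling for the numbers on your own machine, one can simply time the build call. The following minimal C++ sketch does exactly that for a deliberately tiny kernel (this is not the benchmark code used in the post, and real-world kernel collections will take considerably longer to build):

```cpp
#include <CL/cl.h>
#include <chrono>
#include <cstdio>

// A deliberately tiny kernel; build times grow with source size and with
// the optimizations applied by the vendor's compiler.
static const char *src =
  "__kernel void axpy(__global float *y, __global const float *x, float a) {\n"
  "  size_t i = get_global_id(0);\n"
  "  y[i] += a * x[i];\n"
  "}\n";

int main() {
  cl_platform_id platform;  clGetPlatformIDs(1, &platform, nullptr);
  cl_device_id   device;
  clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);
  cl_int err;
  cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);

  auto t0 = std::chrono::steady_clock::now();
  cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
  err = clBuildProgram(prog, 1, &device, "", nullptr, nullptr);  // the JIT step
  auto t1 = std::chrono::steady_clock::now();

  std::printf("JIT compilation: %.2f ms (build status %d)\n",
              std::chrono::duration<double, std::milli>(t1 - t0).count(), err);
}
```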

CfP: Intl. Workshop on OpenCL 2016 (IWOCL 2016)

The International Workshop on OpenCL (IWOCL) is an annual meeting bringing together experts on OpenCL, an open standard for programming heterogeneous parallel computing systems. In 2016 IWOCL will hold its fourth installment from April 19-21 in Vienna, Austria, and as local chair of IWOCL 2016 I'm proud to share the IWOCL 2016 Call for Papers, Technical Submissions, Tutorials and Posters. Continue reading

GPU Research Center at TU Wien

Today it was announced that TU Wien hosts an NVIDIA GPU Research Center, for which Josef Weinbub, Florian Rudolf, and I are PIs. The agenda includes improvements to ViennaCL as well as PETSc, both open source libraries I'm actively involved in. In addition to continued, incremental improvements, we will also look into two interesting research questions related to the numerical solution of partial differential equations. Continue reading

40 Years of Microprocessor Trend Data

One of the most popular plots when it comes to technological advancements in microprocessors in general and Moore's Law in particular is a plot entitled 35 Years of Microprocessor Trend Data, based on data by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. Later, trend lines with some (speculative) extrapolation were added by C. Moore. One can find the plot with and without trend lines at various places on the web (and further down). However, the plot suffers from the sands of time: data is only plotted up to the year 2010, missing the last five years. Continue reading

STREAM Benchmark Results on Intel Xeon and Xeon Phi

While the number of floating point operations per second (FLOPS) is often considered to be the primary indicator of achievable performance, in many important application areas the limiting factor nowadays is memory bandwidth (cf. the memory wall). The standard benchmark for measuring memory bandwidth is the STREAM benchmark. Despite consisting of 'just simple vector operations', the benchmark is a very helpful indicator of actual application performance. Continue reading
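For reference, the triad kernel at the heart of STREAM fits in a few lines. Here is a minimal single-threaded C++ sketch of it (the official benchmark additionally runs copy, scale, and add kernels, uses OpenMP across cores, and imposes stricter rules on array sizes and timing):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
  const std::size_t N = 1 << 25;    // arrays must be much larger than the caches
  std::vector<double> a(N), b(N, 1.0), c(N, 2.0);
  const double q = 3.0;

  double best = 1e30;
  for (int rep = 0; rep < 10; ++rep) {          // STREAM reports the best run
    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < N; ++i)
      a[i] = b[i] + q * c[i];                   // triad: 24 bytes per index
    auto t1 = std::chrono::steady_clock::now();
    best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
  }
  std::printf("triad: %.2f GB/s (check: %g)\n",
              3.0 * N * sizeof(double) / 1e9 / best, a[N / 2]);
}
```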

Mentored Project Ideas for GSoC 2014

Our organization Computational Science and Engineering at TU Wien was selected for the Google Summer of Code 2014. Within our organization, a couple of great open source software projects hosted at TU Wien are reaching out to students all over the world to work on free scientific software over the summer. The application deadline for students is March 21, 2014. The funding provided by Google for the students is again highly appreciated 🙂

This year I'm again mentoring project ideas for ViennaCL, which I'll describe briefly in the following: Continue reading

PyViennaCL: GPU-accelerated Linear Algebra for Python

Toby St Clere Smithe, whom I mentored during the Google Summer of Code 2013, released PyViennaCL 1.0.0 today. PyViennaCL provides Python bindings for the ViennaCL linear algebra and numerical computation library for general-purpose computations on massively parallel hardware such as graphics processing units (GPUs) and other heterogeneous systems. ViennaCL itself is a header-only C++ library, so these bindings make ViennaCL's fast OpenCL and CUDA algorithms available to Python programmers in a way that is idiomatic and compatible with the Python community's most popular scientific packages, NumPy and SciPy. Continue reading
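For context, this is roughly what the wrapped functionality looks like at the underlying C++ level (a minimal sketch based on ViennaCL's documented vector interface; please consult the ViennaCL manual for the authoritative version, and note that PyViennaCL exposes the same operations with NumPy-like syntax instead):

```cpp
#include <iostream>
#include <vector>

#include "viennacl/vector.hpp"
#include "viennacl/linalg/inner_prod.hpp"

int main() {
  std::vector<double> host_x(1000, 1.0), host_y(1000, 2.0);

  // transfer host data to the compute device (GPU, MIC, or CPU backend)
  viennacl::vector<double> x(1000), y(1000);
  viennacl::copy(host_x, x);
  viennacl::copy(host_y, y);

  // the dot product is computed on the device
  double result = viennacl::linalg::inner_prod(x, y);
  std::cout << "inner product: " << result << std::endl;
}
```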

CPU, GPU and MIC Hardware Characteristics over Time

Recently I was looking for useful graphs on parallel computing hardware for reuse in a presentation, but struggled to find any. While I know that colleagues use such graphs and data in their presentations, I couldn't find a convenient source on the net. So I ended up collecting all the data (again) and decided to make the outcome of my efforts available here. Continue reading