Today it was announced that TU Wien hosts an NVIDIA GPU Research Center, for which Josef Weinbub, Florian Rudolf, and I are PIs. The agenda includes improvements to ViennaCL as well as PETSc, both open-source libraries I'm actively involved in. In addition to continued, incremental improvements, we will also look into two interesting research questions related to the numerical solution of partial differential equations.

### Finite Element Methods on GPUs

In joint research with my colleagues Matt Knepley and Andy Terrel, we developed efficient finite element residual evaluation routines, for which we demonstrate over 300 GFLOP/sec in an upcoming paper. Within the next few months we want to extend the approach so that we can also assemble the system matrix directly on the GPU, overcoming certain limitations of the PCI-Express bus. Most importantly, the availability of the full system matrix allows for various preconditioning techniques to speed up convergence.
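To make the distinction between residual evaluation and matrix assembly concrete, here is a minimal CPU-only sketch in plain Python for a 1D Poisson problem with linear elements. It is not the GPU code from the paper; the function names, the 1D model problem, and the dense matrix storage are all assumptions chosen for brevity. It only shows that the same element loop can either accumulate the residual directly or accumulate entries of the system matrix.

```python
def element_matrix(h):
    """Stiffness matrix of one linear 1D element of length h (illustrative model problem)."""
    return [[1.0 / h, -1.0 / h], [-1.0 / h, 1.0 / h]]


def residual_matrix_free(u, f, h):
    """Evaluate r = K u - f element by element, without ever storing K."""
    r = [-fi for fi in f]
    ke = element_matrix(h)
    for e in range(len(u) - 1):          # element e couples nodes e and e + 1
        nodes = (e, e + 1)
        for a in range(2):
            for c in range(2):
                r[nodes[a]] += ke[a][c] * u[nodes[c]]
    return r


def assemble_matrix(n_nodes, h):
    """Accumulate the element contributions into a (dense, for simplicity) system matrix K."""
    K = [[0.0] * n_nodes for _ in range(n_nodes)]
    ke = element_matrix(h)
    for e in range(n_nodes - 1):
        nodes = (e, e + 1)
        for a in range(2):
            for c in range(2):
                K[nodes[a]][nodes[c]] += ke[a][c]
    return K


def residual_assembled(K, u, f):
    """Evaluate r = K u - f using the assembled matrix."""
    n = len(u)
    return [sum(K[i][j] * u[j] for j in range(n)) - f[i] for i in range(n)]
```

Both residual functions return the same vector; the difference is that only the assembled variant leaves behind a matrix from which a preconditioner could be built.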

### Asynchronous Algebraic Multigrid Methods

Algebraic multigrid methods are popular as preconditioners after discretizing elliptic partial differential equations, because they allow for asymptotically optimal linear complexity (i.e., the time to solution scales linearly with the number of unknowns). Moreover, the algebraic nature of multigrid is very appealing for practical use, because the coarse grid hierarchy is constructed in a black-box manner from the system matrix only. This *setup stage*, i.e., building the grid hierarchy and the transfer operators between the different grids, is fairly challenging to parallelize, whereas the *cycle stage*, which involves mostly matrix-vector products and relaxation methods such as the Jacobi method, exposes enough fine-grained parallelism.
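The fine-grained parallelism of the cycle stage can be seen in a weighted Jacobi sweep: every entry of the new iterate depends only on the old iterate, so each row could be handled by its own GPU thread. The following is a minimal sequential sketch in plain Python, not code from ViennaCL or PETSc. The row-wise sparse storage, the helper names, the damping factor of 2/3, and the 1D Poisson test matrix are assumptions for illustration.

```python
def jacobi_sweep(rows, x, b, omega=2.0 / 3.0):
    """One weighted Jacobi sweep; rows[i] is a list of (column, value) pairs."""
    x_new = [0.0] * len(x)
    for i, row in enumerate(rows):       # iterations are independent of each other
        diag = 0.0
        sigma = 0.0
        for j, a_ij in row:
            if j == i:
                diag = a_ij
            else:
                sigma += a_ij * x[j]     # reads only the old iterate x
        x_new[i] = (1.0 - omega) * x[i] + omega * (b[i] - sigma) / diag
    return x_new


def poisson_1d(n):
    """Tridiagonal (-1, 2, -1) matrix as a hypothetical test problem."""
    rows = []
    for i in range(n):
        row = [(i, 2.0)]
        if i > 0:
            row.append((i - 1, -1.0))
        if i < n - 1:
            row.append((i + 1, -1.0))
        rows.append(row)
    return rows


def residual_norm(rows, x, b):
    """Maximum norm of the residual b - A x."""
    return max(abs(b[i] - sum(a * x[j] for j, a in row))
               for i, row in enumerate(rows))
```

Because `x_new[i]` never reads another entry of `x_new`, the outer loop can be executed in any order or fully in parallel, which is what makes relaxation schemes of this kind a good fit for GPUs.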

Rather than starting the cycle stage only after the setup stage has completed, we will experiment with overlapping the two. This way we can keep the CPU busy with the sequential parts of the setup, while the GPU already makes progress towards convergence. Ideally, this will reduce the total solver time by a factor of two.
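Where the factor of two comes from can be seen with a simple timing model, sketched below in plain Python. The model is an idealization and an assumption on my part: both stages run fully concurrently, and none of the cycle work done during the setup is wasted. The function names are purely illustrative.

```python
def sequential_time(t_setup, t_cycle):
    """Total time when the cycle stage starts only after the setup stage."""
    return t_setup + t_cycle


def overlapped_time(t_setup, t_cycle):
    """Idealized total time when both stages run concurrently on CPU and GPU."""
    return max(t_setup, t_cycle)


def ideal_speedup(t_setup, t_cycle):
    """Upper bound on the gain from overlapping under the idealized model."""
    return sequential_time(t_setup, t_cycle) / overlapped_time(t_setup, t_cycle)
```

In this model the speedup can never exceed two, and it reaches two only when both stages take equally long. If one stage dominates, the gain shrinks accordingly: for example, a setup three times as long as the cycle stage gives at most a factor of 4/3.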