Multi-Threading in C/C++: Implications on Software Library Design

With the increase in parallelism in response to stagnating clock frequencies, software libraries are pushed towards multi-threading. However, there are several different threading approaches out there: the most popular in the C/C++ world are POSIX Threads (pthread), OpenMP, and C++11 threads. Clearly, a good software library does not enforce the use of one particular approach, but works with (almost) any of them. In this blog post I discuss a possible software library design to achieve this.

This recent discussion thread on the Boost.uBLAS mailing list as well as this discussion thread on the Eigen mailing list show that the current state of multi-threading support even in high-performance libraries is mediocre at best. Multi-threading is treated as a feature that is either globally enabled or globally disabled. Part of the blame certainly falls on OpenMP, which encourages decisions at compile time (i.e. either compile with or without OpenMP). The same mindset can also be found in commercial libraries, as the example of the Intel MKL shows: You can either link with the single-threaded version of MKL, or with the multi-threaded version of MKL. What if I need both in my application? What if I want half of my threads to work on operation A and the other half on operation B?

Let's have a look at what is required to provide better multi-threading capabilities in software libraries. Consider a software library that offers only two worker routines: the addition of two vectors, and the dot product of two vectors. I claim that these two routines already cover almost everything you need with respect to thread management in a full-fledged high-performance linear algebra library:

  • massively parallel computations with no thread interaction (think of matrix-matrix multiplications here), and
  • operations involving reductions and thus thread synchronization via shared temporary arrays.

A wish list for such a library, reflecting the needs of the high-performance computing community, is as follows:

  • C binary interface to simplify use from Fortran and Python
  • No dependency on a particular compiler
  • Works well with different threading approaches, in particular pthread, OpenMP, and C++11 threads
  • User can call routines with any number of threads (including a single thread only) based on runtime decisions

You may think that these requirements are very basic and should be fulfilled by almost every library out there. Well, this is not quite the case: C++ libraries don't blend well with Fortran and carry a hidden compiler dependency (libstdc++) if provided pre-compiled. Also, many libraries are hard-wired to e.g. OpenMP for the convenience of the developer rather than the user.

Allow Thread Control by the User

Let us consider the following first attempt at a multi-threaded library routine for the addition of two vectors:

// z = x + y
void vector_add(double *x, double *y, double *z, int N)
{
  #pragma omp parallel for
  for (int i=0; i<N; ++i)
    z[i] = x[i] + y[i];
}

The for-loop is parallelized via OpenMP, so this first attempt does not work with pthread or C++11 threads. Moreover, if you want *any* control over the threads, such as changing their number at runtime, you *need* to alter the function signature in order to propagate thread information. Note for C++ developers: If you decided to use expression templates, you don't even have that option.

To propagate thread information, let us introduce an object of type ThreadControl, from which we can query thread information (for simplicity via direct member access):

// library code for z = x + y
void vector_add(ThreadControl control,
                double *x, double *y, double *z, int N)
{
  int entries_per_thread = (N-1) / control.thread_count + 1;
  int begin =  control.thread_id      * entries_per_thread;
  int end   = (control.thread_id + 1) * entries_per_thread;
  end = (end > N) ? N : end;    // respect upper bound

  for (int i=begin; i<end; ++i)
    z[i] = x[i] + y[i];
}

Now, a user calls the routine with OpenMP as follows:

// user code for calling vector_add with multiple OpenMP threads
#pragma omp parallel
{
  ThreadControl control; // per-thread control object (details follow)
  control.thread_id    = omp_get_thread_num();
  control.thread_count = omp_get_num_threads();

  vector_add(control, x, y, z, N);
}

There is nothing specific to OpenMP in the library code any longer; all thread control is left to the user. Consequently, a user could just as well use pthread or C++11 threads:

// user code for calling vector_add with multiple C++11 threads
std::vector<std::thread> my_threads(4);   //use four threads

for (int i=0; i<my_threads.size(); ++i)
{
  ThreadControl control; // per-thread control object (details follow)
  control.thread_id    = i;
  control.thread_count = my_threads.size();

  my_threads[i] = std::thread(...);  // call into vector_add here
}

Technical note: Ideally you can pass the function vector_add() as well as its arguments directly to the constructor of std::thread. However, I could not get this to work, so my demo code wraps everything in a launcher function (which you need to do with pthread as well).
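To make this concrete, here is a minimal sketch of such a launcher for vector_add(), assuming the per-thread arguments are bundled in a small struct. The struct and function names are illustrative and not taken from the demo repository:

// illustrative launcher: bundles the arguments and forwards them to vector_add()
struct VectorAddTask
{
  ThreadControl control;
  double *x, *y, *z;
  int N;
};

void vector_add_launcher(VectorAddTask task)
{
  vector_add(task.control, task.x, task.y, task.z, task.N);
}

// inside the loop above one would then write something like
//   VectorAddTask task = {control, x, y, z, N};
//   my_threads[i] = std::thread(vector_add_launcher, task);

Since std::thread copies its arguments, each thread receives its own copy of the task (and thus its own ThreadControl object).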

Thread Synchronization and Shared Buffers

Let us now consider the case of the vector dot product. There are several implementation options here; we will use one based on barriers rather than mutexes, since mutexes are a little tricky to wrap in OpenMP. Each thread computes a partial result for the chunk of the vector assigned to it. These intermediate results are stored in a shared buffer and, after thread synchronization, are summed up by the first thread. The code is as follows:

// library code for a = dot(x, y)
void vector_dot(ThreadControl control,
                double *x, double *y, double *a, int N)
{
  int entries_per_thread = (N-1) / control.thread_count + 1;
  int begin =  control.thread_id      * entries_per_thread;
  int end   = (control.thread_id + 1) * entries_per_thread;
  end = (end > N) ? N : end;    // respect upper bound

  // shared buffer with one entry per thread
  double *shared_buffer = control.malloc(sizeof(double) * control.thread_count);

  double thread_a = 0;
  for (int i=begin; i<end; ++i)
    thread_a += x[i] * y[i];
  shared_buffer[control.thread_id] = thread_a;

  control.sync();  // sync threads

  if (control.thread_id == 0)
  {
    // first thread sums up the intermediate results of the other threads
    for (int i = 1; i < control.thread_count; ++i)
      thread_a += shared_buffer[i];

    *a = thread_a;
  }

  control.sync();  // sync threads for consistent results
}

There are two new ingredients in this code snippet: First, a shared buffer is allocated via control.malloc(). Second, threads are synchronized via calls to control.sync(), which acts like an OpenMP barrier: Threads may only progress once all threads have reached this synchronization point.

Since our demo library does not know which threading model the user will select, the user has to provide a callback for a synchronization routine. For the OpenMP case, this is as simple as:

/* User callback routine for OpenMP synchronization.
   The thread index, thread count, and extra data are not needed for an OpenMP barrier. */
void openmp_sync(int tid, int tsize, void *data)
{
  #pragma omp barrier
}

The user code for calling the vector dot routine is now similar to the vector addition:

// user code for calling vector_dot with multiple OpenMP threads
#pragma omp parallel
{
  ThreadControl control; // per-thread control object (details follow)
  control.thread_id    = omp_get_thread_num();
  control.thread_count = omp_get_num_threads();
  control.sync         = openmp_sync;  // set sync callback

  vector_dot(control, x, y, a, N);
}

This is the high level picture of the control flow. There are, however, a few more technical details to consider to get the shared memory allocation working.

Remaining Details

The ThreadControl object has so far been created on the stack of each thread. This, however, prevents effective thread synchronization, because threads have no means of communicating with each other through thread-local variables. To enable that, we also need to introduce *shared* state into the ThreadControl object. To this end, let us introduce a ThreadFactory object which takes care of the shared state; each ThreadControl object is then derived from a common ThreadFactory object. User code for calling both of our library routines now looks as follows in the OpenMP case (for clarity I assume C++ classes for ThreadFactory and ThreadControl here):

// user code for calling vector_add, vector_dot with OpenMP
ThreadFactory factory;      // shared state, assuming a C++ constructor
factory.sync = openmp_sync; // register sync callback

#pragma omp parallel
{
  ThreadControl control(factory); // per-thread control object
  control.thread_id    = omp_get_thread_num();
  control.thread_count = omp_get_num_threads();

  vector_add(control, x, y, z, N);
  vector_dot(control, x, y, a, N);
}
// one may also use two separate parallel regions instead

In addition to the synchronization callback, the shared factory contains internal shared variables through which e.g. shared memory allocation is provided. This, however, is an implementation detail the user is not exposed to.
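For concreteness, one possible (purely illustrative) layout of the two objects is sketched below; the actual demo library may organize this differently, and the member names are only meant to match the snippets above:

// illustrative sketch only; not the actual layout used in the demo repository
struct ThreadFactory
{
  void (*sync)(int tid, int tsize, void *data) = nullptr; // user-provided barrier callback
  void   *sync_data = nullptr;                            // optional state for that callback
  double *buffer    = nullptr;                            // scratch memory shared by all threads
};

struct ThreadControl
{
  explicit ThreadControl(ThreadFactory &f) : factory(&f) {}

  int            thread_id    = 0;  // index of this thread: 0, ..., thread_count-1
  int            thread_count = 1;  // number of threads cooperating on the operation
  ThreadFactory *factory;           // shared state common to all threads

  // forward the barrier to the user-provided callback
  void sync() { factory->sync(thread_id, thread_count, factory->sync_data); }

  // naive shared allocation: the first thread allocates, then all threads share the pointer
  // (freeing is omitted in this sketch)
  double *malloc(int bytes)
  {
    if (thread_id == 0)
      factory->buffer = new double[bytes / sizeof(double)];
    sync();
    return factory->buffer;
  }
};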

The synchronization callback routine is more involved to provide for pthread and C++11-threads. I will not go into the details at this point (maybe in a later blog post), but feel free to have a look at the multi-threading library demo repository supplementing this blog post.
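To give a rough idea of what is involved, below is a sketch of a C++11-based synchronization callback built on a small condition-variable barrier. The names and the exact mechanism are illustrative and not necessarily what the demo repository implements; the important point is that the barrier state is shared by all threads and passed in via the data pointer:

#include <condition_variable>
#include <mutex>

// minimal reusable barrier for C++11 threads (illustrative only)
struct cpp11_barrier
{
  std::mutex              mtx;
  std::condition_variable cv;
  int count      = 0;
  int generation = 0;
};

// matches the callback signature used above; 'data' points to a cpp11_barrier
// created by the user for tsize threads
void cpp11_sync(int tid, int tsize, void *data)
{
  cpp11_barrier *b = static_cast<cpp11_barrier*>(data);
  std::unique_lock<std::mutex> lock(b->mtx);
  int gen = b->generation;
  if (++b->count == tsize)
  {
    // last thread to arrive opens the barrier for everybody
    b->count = 0;
    ++b->generation;
    b->cv.notify_all();
  }
  else
  {
    // wait until the last thread of this generation has arrived
    b->cv.wait(lock, [b, gen]{ return gen != b->generation; });
  }
}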

Threads vs. MPI

Multi-threading is only one way of keeping your CPU cores busy. Let us also have a quick look at how MPI deals with multi-process orchestration. Taking the example of the vector dot product, our function signature was

void vector_dot(ThreadControl control,
                double *x, double *y, double *a, int N);

What would such a signature look like if you chose MPI instead of threads? Well, not that different:

void vector_dot(MPI_Comm comm,
                double *x, double *y, double *a, int N);

At this point I would like to point out that many MPI-based libraries have been demonstrated to interact nicely with each other, and very complex application code has been built on top of them. Part of this success is that MPI was designed as a library to aid library-based development. If multi-threading is to be similarly successful, multi-threaded software libraries should learn from the example of MPI and expose their 'thread communicator' just like MPI exposes a process communicator. Such a thread communicator would actually be much simpler: since threads operate in a shared memory domain, only control flow needs to be taken care of.
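To complete the comparison, here is a sketch of how such an MPI-based vector_dot() could be implemented, assuming each process holds a local chunk of x and y of length N:

#include <mpi.h>

// each process computes the dot product of its local chunk; the partial
// results are then summed across all processes in the communicator
void vector_dot(MPI_Comm comm,
                double *x, double *y, double *a, int N)
{
  double local_a = 0;
  for (int i = 0; i < N; ++i)
    local_a += x[i] * y[i];

  MPI_Allreduce(&local_a, a, 1, MPI_DOUBLE, MPI_SUM, comm);
}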

Conclusions

Multi-threaded software libraries need to start exposing a clean thread interface in their APIs. There is no excuse for half-baked solutions like inserting OpenMP pragmas somewhere deep in the library code; this merely results in technical debt one will have to pay for later.

Resources: Demo multi-threaded library repository on GitHub with working code for the examples discussed in this post.

This blog post is for calendar week 10 of my weekly blogging series for 2016.