This post is part of the AD Myths Debunked series.
You’ve decided you want your parallel library or application to benefit from the advantages of Adjoint Automatic Differentiation (AAD), but you are concerned that AAD may not be thread-safe, may break vectorization, and may not play well with GPUs. How much of that is true? That is what we want to look at in today’s post.
AD is a semantic program transformation that adds functionality to your original implementation. The transformed program computes not only the function values, e.g., the price of a financial instrument, but also their sensitivities or derivatives, e.g., the first-order Greeks. Using AAD for this affects parallelizability because, roughly speaking, the adjoint program has to run from the outputs backwards through every single statement to the inputs. As a consequence, parallel reads of non-thread-local data in the original program become parallel writes to adjoint data in the reverse one. These parallel writes are a challenge since they potentially introduce race conditions. The straightforward way of resolving this, especially for parallel algorithms such as the Monte Carlo method, is to duplicate the shared data so that each thread writes to its own copy. Together with dco/c++’s multi-threading support, the parallelism can then be restored and inherited by the AAD computation. Let’s look at specific aspects of adjoints and parallelism in a little more detail.
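To make the race and its resolution concrete, here is a minimal hand-written sketch in plain C++; it does not use dco/c++. The primal is assumed to be y = sum_i x * a[i], so every path reads the shared input x, and in the adjoint each of those reads turns into an increment of x_bar. The function name `adjoint_shared_input`, the buffer `x_bar_local`, and the chunking scheme are illustrative assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Primal: y = sum_i x * a[i]. Every path reads the shared input x.
// Adjoint: x_bar += a[i] * y_bar for every i, i.e. the shared read of x
// becomes a shared write to x_bar.
// Hypothetical helper, not part of dco/c++: each thread accumulates into
// its own copy of x_bar, and the copies are summed after the join.
double adjoint_shared_input(const std::vector<double>& a, double y_bar,
                            unsigned num_threads) {
  if (num_threads == 0) num_threads = 1;
  std::vector<double> x_bar_local(num_threads, 0.0);  // one adjoint per thread
  std::vector<std::thread> workers;
  const std::size_t chunk = (a.size() + num_threads - 1) / num_threads;

  for (unsigned t = 0; t < num_threads; ++t) {
    workers.emplace_back([&a, &x_bar_local, y_bar, chunk, t] {
      const std::size_t begin = std::min(a.size(), t * chunk);
      const std::size_t end = std::min(a.size(), begin + chunk);
      double local = 0.0;  // thread-private accumulator
      for (std::size_t i = begin; i < end; ++i) local += a[i] * y_bar;
      x_bar_local[t] = local;  // each thread writes only its own slot
    });
  }
  for (auto& w : workers) w.join();

  double x_bar = 0.0;  // serial reduction over the duplicated adjoints
  for (double v : x_bar_local) x_bar += v;
  return x_bar;
}
```

Had all threads incremented a single shared `x_bar` directly, the updates would race. The duplicated buffer trades a little memory and one serial reduction for race-free parallel accumulation.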
With dco/c++, the user replaces the arithmetic type with a drop-in replacement (e.g., ‘double’ becomes ‘dco::ga1s<double>::type’). This type carries additional data, e.g., a tangent component for tangent/forward mode AD, or a so-called ‘tape index’ for AAD. As a result, function values that were previously consecutive in memory are now separated by gaps, which has a negative impact on auto-vectorization. There are various solutions to this problem, though. dco/c++ comes with a vector data type (dco::gv<double, vector_size>::type), which shifts the vectorization to a lower level. By doing so, the vectorization can be preserved and even inherited by the adjoint code. Other approaches include dco/c++’s specialized support for linear algebra libraries (e.g., BLAS or Eigen). NAG is also working on a new approach that avoids changing the data layout altogether.
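The layout effect can be illustrated with a small sketch. The two structs below are stand-ins invented for this example and do not reflect dco/c++’s actual types: `active_scalar` pairs each value with a tape index, while the hypothetical `active_packet` stores N values in front of a single index so that the values inside one packet stay contiguous.

```cpp
#include <cstddef>

// Illustrative stand-in for an AAD scalar type: the value plus a tape index.
// This is an assumption for the sketch, not dco/c++'s actual layout.
struct active_scalar {
  double value;
  int tape_index;
};

// Hypothetical packet type: N values stored back to back, one index for the
// whole packet. Loops over `value` touch contiguous doubles.
template <std::size_t N>
struct active_packet {
  double value[N];
  int tape_index;
};

// Distance in bytes between the values of two neighbouring array elements
// when the data is stored as an array of active_scalar.
constexpr std::size_t value_stride_scalar() { return sizeof(active_scalar); }

// Element-wise product of the values in one packet. The loop has unit stride
// and is a natural candidate for auto-vectorization. Recording for the
// adjoint is deliberately left out of this sketch.
template <std::size_t N>
void multiply_values(const active_packet<N>& a, const active_packet<N>& b,
                     active_packet<N>& out) {
  for (std::size_t i = 0; i < N; ++i) out.value[i] = a.value[i] * b.value[i];
}
```

In an array of `active_scalar`, neighbouring values are `value_stride_scalar()` bytes apart rather than `sizeof(double)`, which is the gap described above. Grouping values into packets restores unit stride inside each packet.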
For AAD, dco/c++ writes an image of your program into memory while executing it. This image is called the ‘tape’. For efficiency reasons, this is a single global tape. Multi-threaded code might therefore introduce data races on this global data. To avoid these, dco/c++ comes with thread-safe tape storage. In addition, dco/c++ has ‘multiple tape’ support, i.e., each variable can be associated with a (thread-)locally allocated tape. Depending on requirements, either feature can be used to avoid writing into global data from multiple threads.
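The idea of one tape per thread can be sketched with a deliberately tiny tape written from scratch. Nothing below is dco/c++ API: the `tape` struct, its `input`, `mul` and `interpret` members, and `derivatives_per_thread` are hypothetical names chosen for this illustration, and the tape only supports multiplication.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Minimal illustrative tape, not dco/c++'s implementation. Each entry stores
// the tape indices of two arguments and the matching local partials.
struct tape {
  struct entry {
    std::size_t arg[2];
    double partial[2];
  };
  std::vector<entry> entries;

  // Register an independent input and return its tape index.
  std::size_t input() {
    entries.push_back({{0, 0}, {0.0, 0.0}});  // zero partials: no dependencies
    return entries.size() - 1;
  }

  // Record z = x * y (with values xv, yv) and return the index of z.
  std::size_t mul(std::size_t x, double xv, std::size_t y, double yv) {
    entries.push_back({{x, y}, {yv, xv}});  // dz/dx = yv, dz/dy = xv
    return entries.size() - 1;
  }

  // Reverse sweep: seed the output with 1 and propagate adjoints backwards.
  std::vector<double> interpret(std::size_t output) const {
    std::vector<double> adj(entries.size(), 0.0);
    adj[output] = 1.0;
    for (std::size_t i = output + 1; i-- > 0;)
      for (int k = 0; k < 2; ++k)
        adj[entries[i].arg[k]] += entries[i].partial[k] * adj[i];
    return adj;
  }
};

// Each thread records f(x) = x * x on its own locally allocated tape, so no
// two threads ever write to the same tape.
std::vector<double> derivatives_per_thread(const std::vector<double>& xs) {
  std::vector<double> dfdx(xs.size(), 0.0);
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < xs.size(); ++t) {
    workers.emplace_back([&xs, &dfdx, t] {
      tape local;  // lives on this thread's stack
      const std::size_t x = local.input();
      const std::size_t y = local.mul(x, xs[t], x, xs[t]);
      dfdx[t] = local.interpret(y)[x];  // d(x*x)/dx = 2x
    });
  }
  for (auto& w : workers) w.join();
  return dfdx;
}
```

The sketch spawns one thread per input only to keep it short. The alternative mentioned above, a single shared tape with thread-safe storage, would instead synchronize the writes into `entries`.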
The last aspect we want to talk about today is AAD of GPU code. This is an inherently difficult problem. Depending on the application, GPU adjoints can be obtained either by using higher-level libraries or by using AAD tools built specifically for GPUs, like dco/map (‘map’ stands for ‘meta adjoint programming’). As with GPU programming in general, getting efficient GPU adjoints is a hard task. Please let us know if you need an experienced partner at your side.
NAG’s AD toolset has been developed and enhanced over the last 12 years and it builds upon a further 10 years of AD R&D experience. We know that details matter. Applying AD to a parallel code base is part of our day-to-day work.
Myths are narratives that might sound like truths. By talking through them in some detail and sharing our experiences, we hope to help businesses navigate these issues. Results matter; myths should not.