$\newcommand{\R}{\mathbb{R}} \newcommand{\costR}{\mathcal{R}}$
To delve a little more into AD, consider a computer implementation of a function $f$ with $n$ inputs and one output, i.e. $f:\R^n\rightarrow \R$. AD can be applied to vector-valued functions as well, but to keep things simple we only consider real-valued functions below. AD comes in two modes: forward and reverse.
The forward (or tangent-linear) AD version of $f$ is a function $F^{(1)}:\R^{2n}\rightarrow\R$ given by \begin{equation*} y^{(1)} = F^{(1)}(\mathbf{x},\mathbf{x^{(1)}}) = \nabla f(\mathbf{x}) \cdot \mathbf{x^{(1)}} \end{equation*}
for inputs $\mathbf{x},\mathbf{x^{(1)}}\in\R^n$, where the dot is the regular dot product. To get the whole gradient of $f$ we let $\mathbf{x^{(1)}}$ range over the Cartesian basis vectors and call $F^{(1)}$ repeatedly, once per basis vector, so $n$ times in total.
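To make the tangent-linear model concrete, here is a minimal forward-mode sketch using a small dual-number type and operator overloading. It is only an illustration: the `Tangent` type, the example function `f` and the driver are invented for this sketch and are not dco's API.

```cpp
#include <array>
#include <cmath>
#include <cstdio>

// Minimal tangent (dual number) type: every intermediate result carries a
// value v and a directional derivative d alongside it.
struct Tangent {
    double v;  // value
    double d;  // tangent (directional derivative)
};

Tangent operator*(Tangent a, Tangent b) { return {a.v * b.v, a.d * b.v + a.v * b.d}; }
Tangent operator+(Tangent a, Tangent b) { return {a.v + b.v, a.d + b.d}; }
Tangent sin(Tangent a) { return {std::sin(a.v), std::cos(a.v) * a.d}; }

// Toy example f : R^2 -> R, f(x) = x0 * x1 + sin(x0)
Tangent f(std::array<Tangent, 2> x) { return x[0] * x[1] + sin(x[0]); }

int main() {
    const std::array<double, 2> x = {1.5, -2.0};
    // One call of the tangent-linear model per Cartesian basis vector
    // recovers the whole gradient (n = 2 calls here).
    for (int i = 0; i < 2; ++i) {
        std::array<Tangent, 2> xt = {{{x[0], 0.0}, {x[1], 0.0}}};
        xt[i].d = 1.0;                      // seed x^(1) = e_i
        Tangent y = f(xt);
        std::printf("df/dx%d = %g\n", i, y.d);
    }
}
```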
Forward mode AD is typically used when $n$ is small, say less than 30, although the exact figure will depend on the function being differentiated. Above this, adjoint methods are used.
To understand adjoint mode AD it helps to consider an input $\mathbf{x}\in\mathbb{R}^n$ being moved along by a sequence of function calls to an output $y\in\mathbb{R}$: \begin{equation*} \mathbf{x} \longrightarrow \mathbf{x_1} \longrightarrow \mathbf{x_2} \longrightarrow \cdots \longrightarrow \mathbf{x_k} \longrightarrow y \end{equation*}
We want the gradient $\partial y/\partial \mathbf{x}$, and by the Chain Rule this is just \begin{equation*} \frac{\partial y}{\partial \mathbf{x}} = \frac{\partial \mathbf{x_1}}{\partial \mathbf{x}} \cdot \frac{\partial \mathbf{x_2}}{\partial \mathbf{x_1}} \cdots \frac{\partial \mathbf{x_k}}{\partial \mathbf{x_{k-1}}} \cdot \frac{\partial y}{\partial \mathbf{x_k}} \end{equation*}
where we have written the Jacobian factors in the same order in which the program computes the $\mathbf{x_i}$.
Mathematically it doesn’t matter which way we evaluate this. The usual way is left to right \begin{equation*} \frac{\partial y}{\partial \mathbf{x}} = \left( \left( \frac{\partial \mathbf{x_1}}{\partial \mathbf{x}} \cdot \frac{\partial \mathbf{x_2}}{\partial \mathbf{x_1}} \right) \cdots \frac{\partial \mathbf{x_k}}{\partial \mathbf{x_{k-1}}} \right) \cdot \frac{\partial y}{\partial \mathbf{x_k}} \end{equation*}
and this is natural since it corresponds to the order of program execution: the program first computes $\mathbf{x_1}$, then $\mathbf{x_2}$, and so on. However, it involves matrix-matrix multiplications followed by a final matrix-vector product, since in general each Jacobian $\partial \mathbf{x_{i+1}}/\partial \mathbf{x_i}$ is a matrix.
Suppose instead we started from the right \begin{equation*} \frac{\partial y}{\partial \mathbf{x}} = \frac{\partial \mathbf{x_1}}{\partial \mathbf{x}} \cdot \left( \frac{\partial \mathbf{x_2}}{\partial \mathbf{x_1}} \cdots \left( \frac{\partial \mathbf{x_k}}{\partial \mathbf{x_{k-1}}} \cdot \frac{\partial y}{\partial \mathbf{x_k}} \right) \right) \end{equation*}
Now everything is matrix-vector products, which are much faster. However, the first factor we need, $\partial \mathbf{x_k}/\partial \mathbf{x_{k-1}}$, is only available once the whole computation has finished, so we effectively need to run the program backwards: \begin{equation*} y \longrightarrow \mathbf{x_k} \longrightarrow \mathbf{x_{k-1}} \longrightarrow \cdots \longrightarrow \mathbf{x_1} \longrightarrow \mathbf{x} \end{equation*}
This whole approach to computing gradients is called the adjoint mode of AD.
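To make the cost difference concrete, suppose, purely for illustration, that every intermediate state has the same dimension $\mathbf{x_i}\in\R^m$. Evaluating left to right then costs roughly \begin{equation*} (k-1)\,n\,m^2 + n\,m \end{equation*} floating point operations (matrix-matrix products plus the final matrix-vector product), while evaluating from the right costs roughly \begin{equation*} (k-1)\,m^2 + n\,m \end{equation*} which is about a factor of $n$ cheaper when $n$ is large.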
The adjoint model of $f$ is a function $\mathbf{x_{(1)}} = F_{(1)}(\mathbf{x},\mathbf{x_{(1)}},y_{(1)})$ mapping $\R^n\!\times\!\R^n\!\times\!\R$ to $\R^n$ given by \begin{equation*} \mathbf{x_{(1)}} = \mathbf{x_{(1)}} + \nabla f(\mathbf{x}) \cdot y_{(1)} \end{equation*}
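Seeding $y_{(1)} = 1$ therefore yields the whole gradient in a single adjoint call, no matter how large $n$ is. The fragment below shows this for a hand-written adjoint of the same toy function used in the forward-mode sketch above; `f_adjoint` is a made-up example, not something generated by or provided by dco.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Hand-written adjoint model of f(x) = x0 * x1 + sin(x0), matching the
// signature above: x_a is incremented by nabla f(x) * y_a.
void f_adjoint(const std::vector<double>& x, std::vector<double>& x_a, double y_a) {
    x_a[0] += (x[1] + std::cos(x[0])) * y_a;   // d f / d x0
    x_a[1] += x[0] * y_a;                      // d f / d x1
}

int main() {
    std::vector<double> x   = {1.5, -2.0};
    std::vector<double> x_a = {0.0, 0.0};      // x_(1), initialised to zero
    f_adjoint(x, x_a, 1.0);                    // seed y_(1) = 1
    std::printf("grad f = (%g, %g)\n", x_a[0], x_a[1]);  // whole gradient, one call
}
```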
Performing adjoint calculations requires solving a data flow reversal problem: the program essentially has to be run backwards. Many AD tools (including dco) approach this by running the program forwards and storing intermediate calculations in memory in a data structure called a tape. Even for relatively simple codes the tape can be several GB, and for production codes it will typically exceed the memory capacity of even large-memory machines.
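To give a flavour of how a tape works, here is a deliberately tiny reverse-mode sketch: the forward sweep records each elementary operation's arguments and local partial derivatives, and the reverse sweep walks the records backwards, accumulating adjoints. It is a toy model only, assuming a single global tape of unary and binary operations; dco's real tape is far more sophisticated.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// One tape entry: the result variable plus the (at most two) variables it
// depends on and the corresponding local partial derivatives.
struct TapeEntry {
    int result, arg0, arg1;   // variable indices (-1 means unused)
    double d0, d1;            // partial derivatives w.r.t. arg0, arg1
};

static std::vector<TapeEntry> tape;
static std::vector<double> value;     // primal values, indexed by variable

static int record(double v, int a0, double d0, int a1 = -1, double d1 = 0.0) {
    value.push_back(v);
    int idx = static_cast<int>(value.size()) - 1;
    tape.push_back({idx, a0, a1, d0, d1});
    return idx;
}

// Taped elementary operations on variables (identified by their index).
static int mul(int a, int b) { return record(value[a] * value[b], a, value[b], b, value[a]); }
static int add(int a, int b) { return record(value[a] + value[b], a, 1.0, b, 1.0); }
static int sin_(int a)       { return record(std::sin(value[a]), a, std::cos(value[a])); }

int main() {
    // Forward sweep: run the program and tape it. f(x) = x0 * x1 + sin(x0).
    int x0 = record(1.5, -1, 0.0), x1 = record(-2.0, -1, 0.0);
    int y  = add(mul(x0, x1), sin_(x0));

    // Reverse sweep: play the tape backwards, accumulating adjoints.
    std::vector<double> adj(value.size(), 0.0);
    adj[y] = 1.0;                                  // seed y_(1) = 1
    for (auto it = tape.rbegin(); it != tape.rend(); ++it) {
        if (it->arg0 >= 0) adj[it->arg0] += it->d0 * adj[it->result];
        if (it->arg1 >= 0) adj[it->arg1] += it->d1 * adj[it->result];
    }
    std::printf("grad f = (%g, %g)\n", adj[x0], adj[x1]);  // x1 + cos(x0), x0
}
```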
To solve this problem, dco has a flexible interface which allows users to easily insert checkpoints at various points in their code. When the code is run backwards, the final checkpoint is restored and that section of computation is taped and played back; then the second-to-last checkpoint is restored and that section is taped and played back (using the previous playback’s results), and so on. In this way memory is traded for flops, with the result that the size of the tape can be constrained almost arbitrarily.
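The sketch below illustrates the checkpointing idea in skeletal C++. The segment structure, the `State`, `Tape` and `Adjoint` types and the helper names are all invented for the illustration and are not dco's interface; only the control flow matters: checkpoints are stored on the way forward, and on the way back each segment is re-run under the tape, reversed, and discarded before the previous segment is processed.

```cpp
#include <vector>

// Invented placeholder types for this sketch (not dco types).
struct State   {};   // everything needed to restart a segment of the program
struct Tape    {};   // the tape recorded for one segment
struct Adjoint {};   // adjoints of the state variables

// Stubs standing in for the real work; only the control flow below matters.
State   run_segment(int /*i*/, const State& in)            { return in; }  // plain, untaped execution
Tape    run_segment_taped(int /*i*/, const State&)         { return {}; }  // re-run one segment, recording a tape
Adjoint play_tape_backwards(const Tape&, const Adjoint& a) { return a; }   // reverse sweep over that tape

Adjoint adjoint_with_checkpoints(const State& x0, const Adjoint& y_adjoint, int num_segments) {
    // Forward pass: store a checkpoint at the start of every segment but no
    // tape, so memory grows with the number of checkpoints, not the tape size.
    std::vector<State> checkpoint = {x0};
    for (int i = 0; i < num_segments; ++i)
        checkpoint.push_back(run_segment(i, checkpoint.back()));

    // Reverse pass: restore the last checkpoint, tape that segment, play it
    // back, then move to the second-to-last checkpoint, and so on. Only one
    // segment's tape is alive at any time: memory is traded for flops.
    Adjoint adj = y_adjoint;
    for (int i = num_segments - 1; i >= 0; --i) {
        Tape t = run_segment_taped(i, checkpoint[i]);
        adj = play_tape_backwards(t, adj);
    }
    return adj;
}

int main() { adjoint_with_checkpoints(State{}, Adjoint{}, 4); }
```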
This functionality is essential in getting adjoint models of production codes to run at all. For more information on checkpointing as well as other techniques for reducing the memory footprint of adjoint codes, please contact us.