Interpolation – A Lesson in Superscalar and SIMD


Published 18/11/2024 by Jason Charlesworth

In this blog nAG Head Quant, Jason Charlesworth explores the importance and application of interpolation within high-performance computing, especially in financial contexts like derivatives pricing. Readers will see how optimizing interpolation can drastically improve computational efficiency, showcasing nearly 16x performance gains by leveraging hardware-specific features like SIMD (Single Instruction, Multiple Data) and superscalar capabilities. This post not only explains why interpolation can be a bottleneck in complex calculations but also guides readers through step-by-step optimizations—from basic code enhancements to advanced vectorization techniques. Through practical examples and detailed performance comparisons, the post equips readers with actionable insights to boost calculation speed and maximize hardware performance.

Interpolation – It’s Trivial!

Consider a Heston Stochastic Local Volatility Monte Carlo; for every time step and every path we’ll be interpolating on a leverage surface. A typical leverage surface consists of 300 non-uniformly spaced interpolation points and for each of ~1000 timesteps, we’ll need to interpolate ~50000 times (once for each path). It’s very easy for a significant percentage of the compute time to just be interpolation; a standard implementation would take over 3 seconds of compute time. A hardware-friendly implementation would take less than 0.2 seconds.

The nZetta Derivatives Pricing Toolkit is designed to give the best performance, whatever the hardware and whatever the model/product. To do that, though, requires a solid understanding of both the problem domain as well as the hardware. So, let’s see what we can do with interpolation.

Interpolation is used throughout computational maths. We interpolate whenever we have a set of {x_i, y_i} pairs and want to infer y(x). In finance, this is often used to interpolate volatilities or yield curves. We typically use linear, cubic or spline interpolation for x values inside the supported range, and flat or linear extrapolation outside it.

If the x points are uniformly spaced it’s trivial to locate the interpolation region. If they’re not uniformly spaced we’ll have to do some form of searching to locate the points bracketing the point of interest. If there’s no known structure to the x points we’ll have to use some form of bisection search.

So, coding up such an interpolator is simple.

Take a value, z, for which we want y(z). Check whether it's inside or outside the supported range. If it's outside, extrapolate from the edge points. If it's inside, use bisection to find the bracketing interval and then interpolate. We'll assume the number of points defining our interpolator fits easily in the cache; for enormous interpolators an Eytzinger layout is optimal for memory access, but that doesn't change the discussion here.

For simple bisection we could end up with some code looking like:
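(The following is a simplified sketch assuming linear interpolation and flat extrapolation; the names and details are illustrative rather than the nZetta Toolkit's actual code.)

```cpp
#include <cstddef>
#include <vector>

// Scalar bisection interpolation: one z value per call.
double interpolate(const std::vector<double>& x,
                   const std::vector<double>& y,
                   double z)
{
    const std::size_t n = x.size();

    // Flat extrapolation outside the supported range.
    if (z <= x.front()) return y.front();
    if (z >= x.back())  return y.back();

    // Bisection to find the bracketing interval [x[ilo], x[ilo+1]].
    std::size_t ilo = 0, ihi = n - 1;
    while (ihi - ilo > 1) {
        const std::size_t mid = (ilo + ihi) / 2;
        if (z < x[mid]) ihi = mid;
        else            ilo = mid;
    }

    // Linear interpolation within the bracket.
    const double t = (z - x[ilo]) / (x[ilo + 1] - x[ilo]);
    return y[ilo] + t * (y[ilo + 1] - y[ilo]);
}
```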

where x and y are the vectors defining the interpolation points and z is the point we wish to interpolate. This is an entirely scalar approach.

It cannot benefit from any of the processor's SIMD capabilities. SIMD operations do 2, 4 or 8 operations for the price of one, so being unable to use them costs us a lot of performance.

In addition, the core bisection loop is not ideal for superscalar execution, as each iteration depends on the preceding one. Modern processors can launch up to 4 independent operations at once and overlap them; where one result depends on a previous result, however, they cannot, and much of that benefit is lost.

Different compilers will have different benefits for this scalar approach (the details of the experimental setup are given at the end).

First Steps to Vectorisation

The first, obvious step is to give the compiler the chance to handle blocks of values. At the very least, avoiding repeated trips through the call stack will give some (minor) benefit. So, now we'll do the loop inside the function. We'll additionally mark the input (z) and output (yz) with __restrict to indicate that they do not refer to the same block of data. This gives the compiler the best opportunity to vectorise.
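A sketch of this block form (again illustrative, with linear interpolation and flat extrapolation):

```cpp
#include <cstddef>
#include <vector>

// Block version: interpolate a whole array of z values in one call.
// __restrict promises the compiler that z and yz do not alias.
void interpolate(const std::vector<double>& x,
                 const std::vector<double>& y,
                 const double* __restrict z,
                 double* __restrict yz,
                 std::size_t nz)
{
    const std::size_t n = x.size();
    for (std::size_t j = 0; j < nz; ++j) {
        const double zj = z[j];
        if (zj <= x.front()) { yz[j] = y.front(); continue; }
        if (zj >= x.back())  { yz[j] = y.back();  continue; }

        std::size_t ilo = 0, ihi = n - 1;
        while (ihi - ilo > 1) {
            const std::size_t mid = (ilo + ihi) / 2;
            if (zj < x[mid]) ihi = mid;
            else             ilo = mid;
        }
        const double t = (zj - x[ilo]) / (x[ilo + 1] - x[ilo]);
        yz[j] = y[ilo] + t * (y[ilo + 1] - y[ilo]);
    }
}
```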

This gives a small performance boost.

The outer loop cannot be vectorised: there are simply too many branches for the extrapolation, and the bisection loop has an unclear termination condition.

First Steps to SIMD

Let’s get rid of some of the branches. We do this by putting in artificial ‘ghost’ points at the start and end of the values used for interpolation, e.g.,
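the padding might be built as follows (a sketch for the flat-extrapolation case; addGhostPoints, xg and yg are illustrative names):

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Build ghost-padded copies of the interpolation nodes (flat extrapolation shown).
// The padded arrays have N+2 entries; the sentinels guarantee every z falls in some bracket.
void addGhostPoints(const std::vector<double>& x, const std::vector<double>& y,
                    std::vector<double>& xg, std::vector<double>& yg)
{
    xg.assign(x.size() + 2, 0.0);
    yg.assign(y.size() + 2, 0.0);
    xg.front() = std::numeric_limits<double>::lowest();  // ghost before the first node
    xg.back()  = std::numeric_limits<double>::max();     // ghost after the last node
    std::copy(x.begin(), x.end(), xg.begin() + 1);
    yg.front() = y.front();                              // flat extrapolation: edge value
    yg.back()  = y.back();
    std::copy(y.begin(), y.end(), yg.begin() + 1);
}
```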

For flat extrapolation the ghost y values are just those of the edge points; for cubic or spline interpolation we would adjust the expansion coefficients at the ghost points. This allows us to write the interpolation much more compactly.
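Something like the following sketch (using the ghost-padded xg and yg from above; still a simplified illustration):

```cpp
#include <cstddef>
#include <vector>

// Compact form: the ghost points remove the extrapolation branches,
// but the bisection loop's iteration count can still vary slightly with z.
void interpolate(const std::vector<double>& xg,
                 const std::vector<double>& yg,
                 const double* __restrict z,
                 double* __restrict yz,
                 std::size_t nz)
{
    const std::size_t n = xg.size();     // original N plus the 2 ghost points
    for (std::size_t j = 0; j < nz; ++j) {
        const double zj = z[j];
        std::size_t ilo = 0, ihi = n - 1;
        while (ihi - ilo > 1) {
            const std::size_t mid = (ilo + ihi) / 2;
            if (zj < xg[mid]) ihi = mid;
            else              ilo = mid;
        }
        const double t = (zj - xg[ilo]) / (xg[ilo + 1] - xg[ilo]);
        yz[j] = yg[ilo] + t * (yg[ilo + 1] - yg[ilo]);
    }
}
```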

Instead of bisecting on N points, we’re now bisecting on N+2 but this is a minor overhead.

The simplified code gives only a very small improvement but it’s a step towards vectorised code:

Unfortunately, even though we've got rid of the branches, the inner bisection loop has an ill-determined termination condition; different values of z can require different numbers of iterations, so we cannot (easily) treat this with SIMD. For SIMD we want every z value to be treated identically.

Branchless Interpolation

We can achieve this by determining the maximum number of bisections we will need to do in advance. In some cases, this will be 1 more than actually required but this extra work will (hopefully) allow SIMD vectorisation.
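A sketch of the branchless form (the fixed bisection count and the select-style updates are the key changes; again illustrative):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Branchless version: every z does exactly the same number of bisection steps,
// so the loop body is identical for every value and can be SIMD vectorised.
void interpolate(const std::vector<double>& xg,
                 const std::vector<double>& yg,
                 const double* __restrict z,
                 double* __restrict yz,
                 std::size_t nz)
{
    const std::size_t n = xg.size();
    // Enough halvings to reduce any bracket of n-1 intervals to a single interval.
    const int nBisect = static_cast<int>(std::ceil(std::log2(double(n - 1))));
    const double* xp = xg.data();
    const double* yp = yg.data();

    #pragma omp simd
    for (std::size_t j = 0; j < nz; ++j) {
        const double zj = z[j];
        std::size_t ilo = 0, ihi = n - 1;
        for (int k = 0; k < nBisect; ++k) {
            const std::size_t mid = (ilo + ihi) / 2;
            const bool left = zj < xp[mid];   // branchless select, not a branch
            ihi = left ? mid : ihi;
            ilo = left ? ilo : mid;
        }
        const double t = (zj - xp[ilo]) / (xp[ilo + 1] - xp[ilo]);
        yz[j] = yp[ilo] + t * (yp[ilo + 1] - yp[ilo]);
    }
}
```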

The #pragma omp simd is a hint to the compiler to try to use SIMD operations on the loop; it is part of the OpenMP standard. A good compiler shouldn't even need the hint.

Every path through the z loop now involves identical work and so can be SIMD vectorised.

On GCC and Intel OneAPI, this gives a massive speed-up over the scalar version. Currently Visual Studio and Clang do not recognise or support this SIMD possibility.

Despite achieving over a 6x speed-up for the Intel-compiled version, we are still missing out on performance. The reason is that, within the core bisection loop, we cannot start one iteration before the previous iteration has completed. Modern processors are superscalar: they can launch multiple independent operations at once and overlap them, which allows up to 4x the performance.

Superscalar-Friendly Interpolation

To get a superscalar version we must swap the order of our loops, i.e., at each level of bisection we want to treat all the z values. Unfortunately, doing this directly would need additional memory (with consequent construct/destruct costs) and would cause cache issues. To avoid this we'll treat blocks of z and use local (stack) storage. This allows us to rewrite our bisection.
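A sketch of the blocked form (BLOCK, the helper arrays and the function signature are illustrative choices):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Blocked version: the loop order is swapped so that, at each bisection level,
// a whole block of z values is advanced. The block entries are independent, so
// the processor can overlap their work (superscalar) as well as vectorise it (SIMD).
void interpolate(const std::vector<double>& xg,
                 const std::vector<double>& yg,
                 const double* __restrict z,
                 double* __restrict yz,
                 std::size_t nz)
{
    constexpr std::size_t BLOCK = 64;                 // small enough for the stack/cache
    const std::size_t n = xg.size();
    const int nBisect = static_cast<int>(std::ceil(std::log2(double(n - 1))));
    const double* xp = xg.data();
    const double* yp = yg.data();

    for (std::size_t j0 = 0; j0 < nz; j0 += BLOCK) {
        const std::size_t blk = std::min(BLOCK, nz - j0);
        std::size_t ilo[BLOCK];                       // local (stack) bracket indices
        std::size_t ihi[BLOCK];

        #pragma omp simd
        for (std::size_t j = 0; j < blk; ++j) {
            ilo[j] = 0;
            ihi[j] = n - 1;
        }

        // One bisection level at a time, across the whole block.
        for (int k = 0; k < nBisect; ++k) {
            #pragma omp simd
            for (std::size_t j = 0; j < blk; ++j) {
                const std::size_t mid = (ilo[j] + ihi[j]) / 2;
                const bool left = z[j0 + j] < xp[mid];
                ihi[j] = left ? mid : ihi[j];
                ilo[j] = left ? ilo[j] : mid;
            }
        }

        // Final interpolation for the block.
        #pragma omp simd
        for (std::size_t j = 0; j < blk; ++j) {
            const std::size_t lo = ilo[j];
            const double t = (z[j0 + j] - xp[lo]) / (xp[lo + 1] - xp[lo]);
            yz[j0 + j] = yp[lo] + t * (yp[lo + 1] - yp[lo]);
        }
    }
}
```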

Each of the SIMD loops contains a block of independent work, so many calculations can be overlapped. Depending on the compiler, using aligned data for ilo and adding the aligned clause to the OpenMP pragma may give a small additional boost.

The Intel OneAPI-compiled interpolator is now running over 14x faster than the original and the Clang-compiled interpolator almost 13x faster. In fact, it is now so fast that the cost of the gradient calculation starts to have an impact (division is a slow operation). Calculating the gradients at interpolator construction time and then using them in the interpolation (trivial, so the code is not shown here) gives the final performance:

The final version has the Intel OneAPI interpolator running over 15.5x faster than the scalar version, the Clang interpolator running almost 14x faster and GCC almost 9 times faster.

Key Takeaways
  • To get the best possible performance for core parts of the calculation you must consider performing vectors of calculations.
  • Treating vectors of values is not vectorisation; vectorisation requires treating every part of the calculation identically irrespective of the data values.
  • More is less: doing additional calculations can pay huge dividends when it enables SIMD and/or superscalar execution.
  • Superscalar (and also SIMD) may require a change to the algorithm to get the best performance.
Experimental Setup:

We used 300 points to define the interpolator and measured the time to evaluate it on 50000 points (of which 5% would result in extrapolation). We repeated the calculations 100 times (with new random z values each iteration) and took the minimum time.

Hardware was an Intel i9-11900K @ 3.5 GHz running Windows.

Compilation was performed targeting AVX512 with maximum optimisation. OpenMP was enabled (for Visual Studio, using /openmp:experimental). GCC compilation used MinGW; Clang compilation used LLVM within Visual Studio.
