This blog post by nAG Head Quant Jason Charlesworth, explores the cumulative normal distribution function, a fundamental tool in quantitative finance, especially within pricing models like Black-Scholes. Readers will learn about the performance gaps between naïve implementations of this function and optimized alternatives. The post details a streamlined approach using polynomial approximations in Horner form to achieve faster and highly accurate results, specifically designed for modern hardware with vectorization capabilities. Through clear explanations and performance comparisons, the post illustrates how strategic code optimizations can yield speed improvements of up to 7x, making it a must-read for quant developers aiming to boost computational efficiency in financial calculations.
The cumulative normal distribution function, N(x), is part of the essential infrastructure of quant pricing. Whether needed for Black-Scholes option pricing, replication or control variate corrections, N(x), is used everywhere. After all, it’s just the integral of a Gaussian up to a given point.
In the past, all sorts of approximations were used to calculate N(x) often with very limited accuracy. Since the C++11 standard, however, we have std::erf and std::erfc to make everything simpler.
Unfortunately, these functions can be very, very, very slow. The actual performance obviously depends on hardware and compiler and, particularly, which maths mode we’re using:
For e.g., Visual Studio using fp:fast, computing std::erfc can take over 100x the time to compute std::exp!
So you could say “well, for a start, let’s just use the erf formulation of N(x) as it’s about twice the speed of erfc”. Unfortunately, although this’d be great for small x it’d have a terrible numerical error for large negative x. This is because we’d be taking the difference between 1 and something tending to -1 and so would get a result of zero for x<-7 or so. This would lead to no end of problems when calculating implied vols so we can’t use that approach.
The nZetta Derivatives Pricing Toolkit is designed to give the fastest results, whatever the hardware so when thinking about implementing a normdist function we really need to think about how the hardware does calculations.
So, how can we implement a version designed for modern hardware that’s faster and just as accurate as that using std::erfc?
We know that N(-x)=1-N(x) so we can restrict ourselves to x at or below zero. The asymptotic form of N(x) for negative x is
Doing such a sum would be inefficient, numerically unstable and inaccurate for small x. Instead, let’s consider a related form:
With q0=1.
Why this form? Well, having a higher-order polynomial in the denominator than in the numerator we can largely follow the asymptotic form. From a performance point of view, though, both the numerator and denominator can be expressed in Horner form, i.e., a0+x(a1+x(a2…). Consequently both the numerator and denominator are calculated as a series of fused multiply-add operations which do a+b*c as a single operation. Even better is that on most x86 processors you can launch two of these in a clock cycle. This means we can (effectively) calculate both the numerator and denominator for the same cost of just doing one.
We can constrain some of the coefficients to give us exact results by construction, e.g.,
The first two of these are obvious. The last of these is to ensure good behaviour when we change domain; because we are treating negative and positive x differently we need to ensure that the function is at least 2nd order continuous at x=0. If we don’t ensure continuity across domain changes everything from integration to differentiation to risk calculations becomes noisy and unstable.
To get the best-fit values for the remaining parameters we do a gradient-based minimization of the objective function that minimizes the relative error, i.e.,
Where xi is a set of points in the range -37<=x<=0 and Nex(x) is the normdist calculated to at least 32 decimal places. We only need to consider this range as the exponential term will underflow for x much below -37. The choice of n is 2 if we’re minimizing the L2 norm. If we want to minimize the largest relative error it can be worth using n=4. The choice of objective function really depends on what your goal is.
To get the high-accuracy values for Nex and to use in the minimizer, we can either use arbitrary-accuracy maths tools such as Maple or Mathematica or, alternatively, make use of the open-source boost::multiprecision.
It should be noted that we avoid branches as much as possible to ensure the version can be vectorised. The maximum relative error for this form in the region -37<=x<=0 is 2 bits i.e., less than 9e-16 relative error.
The performance of the Horner form is always faster than naively using the std::erfc form and, depending on compiler and flags, can be over 7x faster.
Experimental setup: All timings have been performed on an Intel i9-11900K with code compiled for AVX512.
This will close in 20 seconds
This will close in 20 seconds