Your server has lots of cores and hyper-threads. Can you really use all that theoretical capability, or is most of it just wasted money? Do you even know? Reducing memory usage is essential for distributed calculations, but sometimes you just need an enormous L3 cache. Even with a 1GB L3 cache, can all the threads of a massively multicore processor actually be used?
Imagine you’ve written derivative pricing code for a particular trade. Running a single PV calculation takes 10 seconds on your dev machine. As part of your risk run, you now distribute 1024 of these calculations on your nice, chunky, 32-core server of seemingly comparable core performance. How long do you expect this risk calculation to take to complete?
If it were perfectly distributed to 32 threads or processes (no hyper-threading) you’d expect about (10*1024/32)=320 seconds. If you distribute to 64 threads to use hyper-threading, a perfectly distributed calculation would be 160 seconds. Hyper-threading isn’t a miracle cure but you’d hope the time would be somewhere between these.
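For reference, the idealised figures above come from simply dividing the total single-threaded work evenly across the threads:

\[
T_{\text{ideal}} \;\approx\; \frac{N_{\text{PV}} \times T_{\text{single}}}{N_{\text{threads}}} \;=\; \frac{1024 \times 10\,\text{s}}{32} \;=\; 320\,\text{s}.
\]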
But what on earth’s going on when you find it takes 10 times that?
Recently we were investigating the performance of risk calculations using the nZetta Derivatives Pricing Toolkit. nZetta is heavily vectorised to take full advantage of all the superscalar and SIMD capabilities of the processor. Performance in a single-threaded environment is easy to analyse. Performance on systems running many threads needs a more holistic approach.
Most quant development is done on Windows, typically on a dedicated (high-end) workstation. The calibration, model or product is developed using Visual Studio. It is debugged and performance-tuned on Windows, usually as an isolated calculation with nothing else significant happening on the machine.
Once upon a time, this wasn’t too dissimilar to the environment the trader used to price and manage a trade.
These days, however, most of the heavy-duty calculations are performed on a compute grid. Whether for overnight risk or for getting a price to a client, the majority of calculations will be run on a large grid of processors. These processors will be running Linux and the code will typically be compiled using GCC or Clang. Each processor will have multiple cores. Calculations will be distributed (whether as processes or threads) across the processors, one per physical core or, with hyper-threading enabled, one per logical core.
It’s rare for a quant to check that the performance seen on the compute grid is broadly consistent with that seen in their isolated development environment. There are many reasons: the grid may be ‘owned’ by a different team, tooling may be limited, there may be no audit capability, or the quant may simply not be a Linux expert.
Considering that the OS, compiler, hardware and, most importantly, the usage patterns are completely different, it shouldn’t come as a great surprise that the performance can be totally different!
Let’s ignore, for now, the issue of a different OS (which is probably out of your control) or compiler (hopefully you’re using the best available). What could be causing your performance difference on the grid?
In the early days of computing, it was easy to have a usable mental model of how a computer worked. The computer carried out one instruction at a time. It got the arguments (memory or register), did the operation, and stored the result (again, memory or register). And repeat. The clock speed was fixed and an operation that took 5 clock cycles yesterday would take 5 clock cycles today.
These days a single processor is an insanely complicated collection of sub-processors. It comprises multiple cores and a hierarchy of caches, with some caches (L1 and L2) attached to individual cores and some shared between cores (L3). Data is moved into the caches for fast access, but another calculation on a different core may change whether the data your calculation needs is still in the cache when you need it. Worse still, if other calculations use up too much memory, you may need to make use of swap space, drastically slowing your calculations. The clock speed can vary second-by-second depending on load and even on which instructions the processor encounters. Even at the level of individual assembler instructions, it’s hard to predict performance. Deep within each core, operations are broken into micro-operations and reordered for best superscalar performance. Performance depends on the exact sequence of operations encountered, both in the main thread and in any hyper-thread.
All of this means that how long an operation takes is strongly affected by everything else going on inside the computer. An operation that took 5 clock cycles yesterday could take 2 cycles today. It could take 200.
If asked about the performance of your code, these days the most honest answer is “Well, it depends!”.
Let’s consider pricing a 4Y single-asset autocallable trade using the Heston Local Stochastic Volatility model. Calibration of this model requires solving a 2D PDE using an ADI method (e.g., Hundsdorfer-Verwer) to construct the leverage surface on each of the ~1000 business days. We’ll use a square 2D PDE grid (nS=nNu) for the calibration. Pricing of the trade uses a Monte Carlo (MC) calculation over the same ~1000 business days, using 50K paths. All code is heavily vectorised and tuned for best SIMD and superscalar performance.
For the initial timings, we used a desktop workstation: an 8-core Intel i9-11900K @ 3.5GHz so, with hyper-threading, 16 logical cores. It has 16MB of shared L3 cache.
We will consider 35 first-order sensitivities of interest so, other than the PV, we have 70 individual pricings (a bump up and a bump down for each sensitivity) in the risk calculation. As we distribute the individual PVs comprising the risk calculation across ever-increasing numbers of threads, we see the following:
What on earth is going on? The blue line, showing the time for the entire risk calculation, tells us that although we get speed benefits using up to 3 threads, adding more than 3 threads actually slows the risk run down! Looking at the components, we find that both the calibration and the Monte Carlo get worse and worse as the number of threads increases. The calibration actually takes longer using 16 threads than it does using just 1.
Each PV calculation is independent. There are no obvious race conditions or mutexes. Sometimes, when a lot of threads are allocating and deallocating memory, there is a slowdown due to contention on the heap allocator’s locks. However, all the memory used in this calibration is allocated up front, so that is unlikely to be the cause.
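To make the setup concrete, here is a minimal sketch (not the nZetta API; pricePV and the workspace are hypothetical stand-ins) of how such a risk run is typically fanned out: each worker thread owns its own pre-allocated workspace and the individual pricings share nothing.

```cpp
// Minimal sketch of fanning independent PV calculations out across a fixed
// pool of worker threads. Each worker owns a pre-allocated workspace, so
// there is no shared state and no allocation inside the pricing loop.
// pricePV() is a hypothetical stand-in for the real pricer.
#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

struct Workspace { std::vector<double> scratch; };        // allocated up front

double pricePV(std::size_t bumpId, Workspace& ws)         // placeholder pricer
{
    ws.scratch[0] = static_cast<double>(bumpId);
    return ws.scratch[0];
}

std::vector<double> runRisk(std::size_t nPVs, unsigned nThreads)
{
    std::vector<double> results(nPVs);
    std::atomic<std::size_t> next{0};
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < nThreads; ++t)
        workers.emplace_back([&] {
            Workspace ws{std::vector<double>(1'000'000)}; // per-thread scratch, up front
            for (std::size_t i = next++; i < nPVs; i = next++)
                results[i] = pricePV(i, ws);              // independent pricings
        });

    for (auto& w : workers) w.join();
    return results;
}
```

Nothing here shares data between pricings, yet, as the measurements show, the threads can still interfere with each other through the shared L3 cache.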
What’s going on?
Derivative pricing calculations typically evolve states from one time to the next.
Whether calibrating using a 2D PDE or pricing using a Monte Carlo calculation, we calculate a state for time t1 and then use this to calculate the state for t2 and so on. We are unable to use non-temporal storage; the state at t2 is required immediately to evolve to t3. Consequently, by their very nature, quant calculations are heavily dependent on cache performance; we need the result from t1 to stay in the cache while calculating the state for t2. Anything that flushes these values from the cache will badly impact performance.
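Schematically (with evolve() as a hypothetical stand-in for one ADI sweep or one Monte Carlo step), both calculations have the shape below: the output of one step is immediately the input of the next, so the working set needs to stay cache-resident for the whole run.

```cpp
// Sketch of the time-stepping pattern shared by the PDE calibration and the
// Monte Carlo pricing. The state at step i is the input for step i+1, so the
// state vectors must remain in cache between steps for good performance.
#include <vector>

// Placeholder dynamics standing in for one ADI sweep or one MC step.
void evolve(const std::vector<double>& in, std::vector<double>& out, int /*step*/)
{
    for (std::size_t j = 0; j < in.size(); ++j)
        out[j] = 0.5 * (in[j] + 1.0);
}

void runTimeSteps(std::vector<double>& state, int nSteps)
{
    std::vector<double> next(state.size());   // scratch, allocated once
    for (int i = 0; i < nSteps; ++i) {        // ~1000 business days
        evolve(state, next, i);               // state(t_i) -> state(t_{i+1})
        state.swap(next);                     // the result immediately becomes
    }                                         // the next step's input
}
```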
For the 2D PDEs used in the Heston SLV calibration, our memory requirement is set by the grid size. An N×N 2D-PDE grid of doubles uses 8N² bytes, and all the non-trivial 2D PDE schemes (Craig-Sneyd, Hundsdorfer-Verwer and variations) require an absolute minimum of 3 grids (in, out and scratch space). For a 300×300 PDE lattice, the PDE solver therefore requires an absolute minimum of 2.2MB of temporal space. There is no way around this limit.
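As a quick check of that figure (double-precision values, so 8 bytes each):

\[
3 \text{ grids} \times 300 \times 300 \times 8\,\text{bytes} \;\approx\; 2.16\,\text{MB} \;\approx\; 2.2\,\text{MB}.
\]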
For Monte Carlo calculations, we may take a path-by-path approach, treating one path at a time. This has a minimal memory footprint but terrible superscalar and SIMD performance — it loses about an order of magnitude! Alternatively, we may use a time-by-time approach in which we construct all the states at one time, do all the payoff updates and then evolve all states to the next time. And so on. This has a larger memory footprint but is optimal for superscalar and SIMD performance. As an intermediate, we can process tranches of paths, getting some of the benefits of both.
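A rough sketch of the two loop orderings (the stepping functions are simplified placeholders, not the real model dynamics):

```cpp
// Path-by-path: tiny working set, but each step operates on a single value,
// leaving the SIMD lanes and superscalar units largely idle.
// Time-by-time: each step operates on full vectors of all 50,000 states,
// which vectorises well but keeps several 0.4MB vectors live between steps.
#include <vector>

// Placeholder single-path step (not the real Heston SLV dynamics).
inline void stepOnePath(double& spot, double vol, int /*step*/)
{
    spot *= 1.0 + 0.01 * vol;
}

// Placeholder vectorised step over all paths at one time.
inline void stepAllPaths(std::vector<double>& spot, const std::vector<double>& vol, int /*step*/)
{
    for (std::size_t i = 0; i < spot.size(); ++i)
        spot[i] *= 1.0 + 0.01 * vol[i];
}

void pricePathByPath(int nPaths, int nSteps)
{
    for (int p = 0; p < nPaths; ++p) {
        double spot = 100.0, vol = 0.2;
        for (int t = 0; t < nSteps; ++t)
            stepOnePath(spot, vol, t);        // scalar work, minimal memory
        // ... payoff update for path p ...
    }
}

void priceTimeByTime(int nPaths, int nSteps)
{
    std::vector<double> spot(nPaths, 100.0);  // 0.4MB at 50K paths
    std::vector<double> vol(nPaths, 0.2);     // another 0.4MB
    for (int t = 0; t < nSteps; ++t)
        stepAllPaths(spot, vol, t);           // vectorises well, cache-hungry
    // ... payoff update over all paths ...
}
```

The tranche approach mentioned above simply runs the time-by-time version on blocks of paths small enough to stay cache-friendly.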
The key point is that if either the calibration or the Monte Carlo requires more temporal memory than its corresponding fraction of the L3 cache, we will suffer from cache contention. Remember, the L3 cache is shared between cores. L3 cache contention will significantly degrade performance.
A vector representing 50,000 states takes 0.4MB. In our trade, the Monte Carlo requires vectors representing the spot, the vol and sundry payoff components. For more complicated models and trades, we need even more memory. In many PV calculations, assorted diagnostic and informational values are also calculated: the expected exercise date, the probability of triggering, etc. If these are not disabled for the risk run, even more memory is used.
Fortunately, Monte Carlo calculations are embarrassingly parallelisable; each path is independent, so we can break the calculation into tranches of paths. There will be a small amount of additional overhead from duplicating some operations but, in general, this is easily offset by the improvement in L2 cache behaviour; at 0.4MB for a vector of 50,000 states, even a single vector will not fit into the L2 cache.
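For example, splitting the 50,000 paths into 16 tranches of 3,125 paths shrinks each per-tranche state vector from 0.4MB to

\[
3{,}125 \times 8\,\text{bytes} = 25\,\text{kB},
\]

comfortably small enough to sit in a per-core L2 cache.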
Stubbing out the calibration to focus on the Monte Carlo, we see that as we increase the number of tranches (and hence reduce the memory for each Monte Carlo calculation), we drastically improve the speed of the risk calculation. At 16 tranches (each of 3125 paths) we can almost fully utilise the entire hyper-thread capability of the processor.
Of course, if your MC calculation uses a scripted payoff, this may impose significant additional memory costs.
With the minimum memory requirement (above) of 2.2MB per PDE and 16MB of L3, we should be able to run 7 threads (16/2.2 ≈ 7) without significant contention. We don’t. From the earlier graph, we manage to benefit from a paltry 2 threads before subsequent threads slow things down.
Looking at the risk time of the calibration alone (setting the MC part to use 1 path) and varying the grid size to vary the memory usage, we see that, except for really small grids, we get very little benefit from the additional cores and certainly not from the hyper-threading.
(Times normalised to the 1-thread risk run of the corresponding size).
Solving 2D PDEs requires us to construct and solve sets of tridiagonal equations. For each timestep, we need to do this twice each in the x and y directions, so it would seem sensible to save them for reuse the second time around. Each set of tridiagonal coefficients is a further 3 grids (one per diagonal), bringing our total number of grids to 9 and taking 6.5MB per thread. Doing this, we can only run 2 threads on our processor before encountering cache contention.
Can we reduce the memory footprint? We can, if we’re prepared to slow down the single-PV case. The extra memory is used to avoid redoing calculations. What if we just recalculated the tridiagonal components each time rather than storing them? It involves extra work, and this shows up in a single PV, but we could then reduce our footprint to just 6 grids, taking 4.3MB/thread.
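In numbers, for the 300×300 lattice:

\[
9 \times 300 \times 300 \times 8\,\text{bytes} \approx 6.5\,\text{MB}, \qquad 6 \times 300 \times 300 \times 8\,\text{bytes} \approx 4.3\,\text{MB}.
\]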
The additional, duplicated, work results in a 15% slowdown for a single calibration (which is ~1/3 of the risk time) but allows us to use more threads before L3 cache contention adversely affects performance.
Even with this, beyond 3 threads we are still unable to benefit fully from the remaining 13 threads. More generally, even if we got the memory usage down to the minimum possible 2.2MB per PDE, we could never fully utilise this processor, as it has only 1MB of L3 per thread.
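The arithmetic is unforgiving: with 16MB of L3 shared across the chip,

\[
\left\lfloor \frac{16\,\text{MB}}{4.3\,\text{MB/thread}} \right\rfloor = 3\ \text{threads}, \qquad \frac{16\,\text{MB}}{16\ \text{hyper-threads}} = 1\,\text{MB/thread} \;<\; 2.2\,\text{MB}.
\]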
The performance of the risk calculation for the Heston SLV autocallable is clearly restricted by the amount of L3 per core or per thread.
For simple Monte Carlo calculations, we can use ever-smaller tranches to try to keep the temporal memory small enough. Even then, eventually, cache contention will hit us. For multi-asset, more complex or scripted payoffs, even tranching is unlikely to lower the temporal memory enough.
For the PDE calibration, however, there are hard limits below which we cannot shrink the memory needed. Any processor with less than 2.2MB/thread will always be L3 cache-bound for this size of calculation (300×300). This is a hard lower limit. Practically, any processor with less than 4MB/thread will lose performance. That means not getting the benefit of all those cores you’ve paid for.
To fully utilise the processors, we need sufficient L3 cache per thread that cache contention doesn’t arise. For this particular trade that is ~4MB/thread.
Unfortunately, the trend over the last few years has been fairly stagnant, with L3 cache sizes staying at around 1-2MB/core.
Recently, however, some of the newer ranges of massively multicore processors have started incorporating truly enormous L3 caches. The AMD Epyc 9684X, for example, gives us 2×96 cores, each hyper-threaded, for a total of 384 hyper-threads. That’s great, but only if we can actually use them. For this problem, what’s really important is that the processor comes with 1152MB of L3 cache.
Running the risk for the same product, but with 384 sensitivities (768 risk PVs) to stress the processor sufficiently, we get the following performance response.
With this amount of L3, we should largely avoid L3 contention until ~300 threads. There is a tail-off at very high thread counts, but each extra thread still reduces the wall time, so we benefit from every thread we can use.
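That ~300-thread figure is just the earlier per-thread requirement applied to the bigger cache:

\[
\frac{1152\,\text{MB}}{\sim 4\,\text{MB/thread}} \approx 288\ \text{threads}.
\]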
If we now focus on the contributing components, we see that with this size of L3 the Monte Carlo is largely unaffected by the number of tranches and we pretty much fully utilise the threads.
For the 2D PDE, unless we go for huge PDE sizes (425×425 would not normally be seen in calibration), with a large L3 cache we can still benefit from every extra hyper-thread.
Going big can certainly solve some limiting memory problems.