Changing Quant Pricing Hardware


Published 12/12/2024 by Jason Charlesworth

Changing hardware for existing systems can be extremely lengthy and painful. However, changing hardware within the x86 family can be fairly straightforward. Using the compiler and compiler settings best suited to the hardware, and understanding the hardware differences, can yield real benefits and allow easy migration within the x86 family.

We’re at an incredible time for high-performance computing hardware. As the cost of further shrinking transistor sizes or increasing clock speeds becomes prohibitive, hardware manufacturers have come up with various clever approaches to provide ever more computing performance. Many of these technologies either have a very niche application (TPU/NPU, Cerebras with 900,000 cores on a single processor!) or require a considerable redesign of key algorithms in the library (GPU, FPGA).

Even among the more classical processor approaches, a lot is changing: Intel has its E-core/P-core architecture, AMD has its chiplets, and then we also have the various ARM processors on the cloud (Google Axion, AWS Graviton, Alibaba Yitian) and the first RISC-V processors emerging.

Change is constant.

Clean Slate

If you were creating an entirely new pricing system, you could consider choosing the perfect hardware for the specific problem. Whether the goal is to improve the speed of calculations or the dollar cost per FLOP, you could pick a metric and decide accordingly.

However, particularly within banking, this is seldom the case. Even a supposed ‘green field’ development has to interact with other existing bank components. If a compiled library links to these components, you cannot change one library without changing them all.

Banking Reality

The 64-bit x86 processor was commercially available in 2003. Some banks were unable to migrate some systems from 32-bit to 64-bit until almost 2020. This wasn’t because they didn’t see the benefits. The reality of bank systems is that they are large, complex and interdependent. Changing such systems is hard and that change is slow!

Consider a bank’s equities derivatives pricing and risk system. It will typically be a compiled (C++) library and will link to lots of libraries from many different internal and external groups:

  • Internal: The rates curves library will typically be provided by the rates quant team, the credit curves library by the credit team, and FX volatilities for quanto products by the FX team. Other cross-asset libraries for common components will also be needed.
  • External Vendor Software: Most bank systems make use of third-party vendor libraries, database work, grid distribution, domain-specific languages …
  • Open Source: Harried quants will often bring in open-source libraries. Ideally as source code, but sometimes (depending on their IT security policies), as prebuilt packages.
  • Here-there-be-Dragons: Virtually every bank codebase that I’ve seen contains libraries whose origins, and dependencies, nobody knows.

To change anything requires getting all the internal groups to do work that may not benefit them individually. With their bonuses tied to delivery to their traders, it can be hard to get this prioritised. External vendors may have fees associated with any changes, especially if the library is at end-of-support. For open source, we can hope that it’s easy to recompile for a new setup but this can be painful if the library hasn’t been supported for a while.

With all these issues, it’s unsurprising that most banks avoid hardware changes; most quants and quant devs have fully embraced Machiavelli’s observation:

“It ought to be remembered that there is nothing more difficult to take in hand, more perilous to conduct, or more uncertain in its success, than to take the lead in the introduction of a new order of things. Because the innovator has for enemies all those who have done well under the old conditions, and lukewarm defenders in those who may do well under the new. This coolness arises partly from fear of the opponents, who have the laws on their side, and partly from the incredulity of men, who do not readily believe in new things until they have had a long experience of them.”

Unless you own the entire software stack, changing hardware for an existing compiled system involves both work and time. It is typically only undertaken when the system is decoupled from other libraries, when it is a green-field site, or when the benefits are compelling, e.g., a 10x speed-up of some critical pricing or a significant cost saving.

A Baby Change: From x86 to x86

So, what about a change from x86 to x86? Until now, most banks have used Intel processors. Recently, many have been looking to migrate to AMD processors.

With our client base’s platforms changing, we also needed to support this shift for the nZetta Derivatives Pricing Toolkit. nZetta is designed to fully utilise whatever hardware is available, whether that is an x86 processor, an attached GPU or thread distribution. All code is heavily vectorised and, on x86, makes full use of the hardware’s superscalar and SIMD capabilities.

Our naïve initial guess was that this should be a no-brainer: both Intel and AMD processors fully support the x86 instruction set. Different generations of chips often have additional instructions but, provided we’re compiling to the core standard and all our additional libraries do likewise, surely we should be fine?

This proved to be generally true but there were a few subtleties.

Step 1 – No compiler change

For a Linux build, most banks use one of GCC, Clang or Intel OneAPI to compile their C++ code. What if you don’t want to change the compiler?

For both GCC and Clang, you can simply take your existing libraries and executables and use them, unchanged, on your AMD hardware. It should all be fine if the AMD hardware capability is at or above the hardware compiled for (e.g., -march=core-avx2 or -mavx2).
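
If you want to be defensive about this, GCC and Clang provide a built-in for confirming at runtime that the machine supports the instruction set a binary was compiled for. A minimal sketch (checking AVX2 here purely as an illustration):

    #include <cstdio>
    #include <cstdlib>

    int main() {
        // GCC/Clang x86 builtin: non-zero if the running CPU supports the feature.
        if (!__builtin_cpu_supports("avx2")) {
            std::fprintf(stderr, "This build requires a CPU with AVX2 support.\n");
            return EXIT_FAILURE;
        }
        std::puts("AVX2 available: safe to run the AVX2 build.");
        return EXIT_SUCCESS;
    }

A check like this turns a mysterious illegal-instruction crash on under-specified hardware into a readable error message.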

Unfortunately, if you are using the Intel OneAPI C++ compiler, running an executable on AMD hardware can result in a runtime error with a message along the lines of:

“Please verify that both the operating system and the processor support Intel(R) X87, CMOV, MMX, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2, MOVBE, POPCNT, AVX, F16C, FMA, BMI, LZCNT and AVX2 instructions.”

This is fairly easy to side-step: the cpuid check appears to be inserted into the executable’s start-up code but not into the library object files. So, you can compile all the library files using -xCORE-AVX2 and just the file containing the main() function using -march=core-avx2, and this works. Likewise for AVX512-supporting processors.
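
Concretely, the split looks something like this (file names are illustrative; icx is the OneAPI C++ compiler driver):

    icx -O3 -xCORE-AVX2 -c pricing.cpp curves.cpp    # library code: no cpuid check emitted here
    icx -O3 -march=core-avx2 -c main.cpp             # the file with main(): avoids the Intel-only check
    icx -o pricer main.o pricing.o curves.o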

Step 2 – AMD compiler

AMD provides a C++ compiler, AOCC. Like Intel OneAPI, this is built on top of Clang. Consequently, as the nZetta toolkit already builds cleanly with Intel OneAPI and GCC, we had little problem swapping the compiler for AOCC. The AOCC documentation says that it does some additional optimisations not present in Clang but this difference didn’t raise any issues.

If you were previously using GCC, you may need a few minor changes, typically around which standard header files you include, or their order. Other than that, the differences in standards compliance/finickiness between the modern compilers are minimal.
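
A typical example: code that uses std::uint64_t may build under GCC because some other standard header pulls in <cstdint> transitively, then fail under a Clang-based compiler. The fix is simply to include what you use (illustrative snippet):

    #include <vector>
    #include <cstdint>   // include explicitly: don't rely on <vector> dragging it in

    std::uint64_t totalPaths(const std::vector<std::uint64_t>& batches) {
        std::uint64_t total = 0;
        for (auto b : batches) total += b;
        return total;
    }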

Timings were performed using AOCC 4.2.0 compared to GCC 13.1.0, pricing an autocallable (Heston SLV, PDE calibration, Monte Carlo pricing). These were run on an AMD EPYC 9684X and wall times are shown below.

The GCC baseline used -mavx2 -Ofast and the comparable settings for AOCC Build 1 are -mavx2 -O3 -ffast-math. It should be noted that the choice of -Ofast/-ffast-math can be a concern for some quant codes due to the possibility of increased numerical errors. We did not find this to be an issue for these builds in this test, but many bank codes avoid these optimisations to reduce platform differences.

For well-vectorised quant libraries, it is essential to have a SIMD short vector maths library, and fortunately most hardware manufacturers provide one. The AMD maths library, AOCL-LibM, comes with the compiler; it is enabled by compiling with -fveclib=AMDLIBM and linking with -lamdlibm. This gave minimal improvement in the PDE (calibration) performance, where the predominant vector operations are simple +, -, *, /, but a noticeable improvement in the Monte Carlo (pricing) due to its use of expensive functions, e.g., the exp for the process evolution. This is shown in AOCC Build 2.
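
The kind of loop that benefits is an exponential applied across a whole path array. With -fveclib=AMDLIBM, the auto-vectoriser can map the scalar std::exp calls onto AOCL-LibM's vector exp (a hedged sketch; the function is illustrative, not nZetta code):

    #include <cmath>
    #include <cstddef>

    // Convert log-space Monte Carlo states to spot prices. Compiled with
    // -fveclib=AMDLIBM, the loop vectorises using the library's vector exp.
    void toSpot(const double* logS, double* S, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i)
            S[i] = std::exp(logS[i]);
    }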

The AOCC documentation suggests an additional optimisation, -zopt, which performs some further safe optimisations, but this gave only marginal improvements and is shown in AOCC Build 3.

Lastly, we can compile to take maximum benefit from the hardware. The Zen 4 and Zen 5 families support AVX512 (see below). Telling AOCC to compile for these families with -march=znver4 gave some nice additional performance improvements, shown in AOCC Build 4.

The final AOCC Build 4 was compiled using:

-march=znver4 -pthread -m64 -O3 -fopenmp -zopt -ffast-math -fveclib=AMDLIBM

and linked using:

-lstdc++ -lomp -lpthread -lamdlibm -lm

It should be noted that executables built with -march=znver4 will run perfectly happily on Intel processors too (provided they have AVX512 support).

For most users, an executable taking only 55% of the time of the corresponding GCC executable is sufficient; over-tuning compiler flags is rarely worth the extra hassle unless you know exactly the hardware and the level of numerical approximation you can tolerate. For bank code, where cross-OS/compiler stability matters, it may even be necessary to avoid -ffast-math, which does cost some performance.

And what about the Hardware?

So, we’ve dealt with the compiler. What about hardware differences?

While both Intel and AMD x86 processors run exactly the same assembler instructions, with exactly the same numeric results, how the processors execute them is not always the same. In particular, modern processors are superscalar: the processor has several (typically 8) ‘ports’, which are simpler processors that can each perform certain operations. All assembler instructions are broken up and shuffled around to make the greatest parallel use of these ports. How many clock cycles an operation takes, and how many ports can handle a particular operation (and hence the superscalar parallelism), varies from processor to processor. Chip designers dedicate their limited space on the processor to benefit the largest number of users, and different manufacturers will prioritise different coding needs.
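
One place this shows up in quant code is in reductions: a single running sum serialises on the latency of one dependency chain, whereas splitting the accumulator lets independent additions issue to different ports in parallel (an illustrative sketch; the actual payoff depends on the processor's port layout):

    #include <cstddef>

    // Four independent accumulators let the out-of-order core overlap
    // additions across ports instead of serialising on one dependency chain.
    double sumPayoffs(const double* v, std::size_t n) {
        double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        std::size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += v[i]; s1 += v[i + 1]; s2 += v[i + 2]; s3 += v[i + 3];
        }
        for (; i < n; ++i) s0 += v[i];  // remainder
        return (s0 + s1) + (s2 + s3);
    }

Note that this reorders the floating-point additions, which is exactly the kind of reassociation that -ffast-math licenses the compiler to perform automatically.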

For end users, it’s seldom worth obsessing over this; you’re using the compute grid that the bank IT group has provided. However, for quants, there are two areas where the differences between processors should be understood as they could impact the performance of pricing code, namely AVX512 SIMD operations and division operations. How much it affects things depends on the precise nature of your calculations.

AVX512

The AVX512 extensions provide the x86 processor with a very large set of 512-bit registers and operations on those registers. A single operation can, e.g., multiply two sets of 8 doubles for the same cost as a single scalar multiplication, and likewise for many other operations. Having a lot of registers can also reduce cache reads/writes, so having AVX512 available is valuable in finance (the performance impact is shown above, comparing AOCC Builds 3 and 4). There is a lot of fragmentation in AVX512 support but, for quant work, provided the processor supports AVX512-F (the foundation set), we’re generally fine.
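
To make that concrete, here is what one such operation looks like written with intrinsics (purely illustrative; in practice the compiler's auto-vectoriser generates these instructions for you):

    #include <immintrin.h>

    // Multiply two arrays of 8 doubles with a single AVX512 instruction.
    // Requires compiling with AVX512 enabled, e.g. -march=znver4 or -mavx512f.
    void mul8(const double* a, const double* b, double* out) {
        __m512d va = _mm512_loadu_pd(a);               // load 8 doubles (unaligned)
        __m512d vb = _mm512_loadu_pd(b);
        _mm512_storeu_pd(out, _mm512_mul_pd(va, vb));  // 8 multiplies in one op
    }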

Intel and AMD have taken different approaches to implementing AVX512. A simplistic explanation of the difference is that Intel has dedicated silicon for AVX512 operations. Using it generates more heat than other operations, so Intel lowers the clock speed when you use these operations, and this persists for a while after use too (see, e.g., Travis Downs’ exhaustive analysis here and here). Consequently, (a) you don’t get the full speed-up you’d naively expect and (b) mixing small amounts of AVX512 into mostly non-AVX512 code can hurt overall performance.

AMD, on the other hand, has implemented AVX512 by “double pumping” two AVX2 registers for each AVX512 operation (see, e.g., section 2.21 here). All things being equal, this would be slower at the same clock speed. However, (a) it requires no clock-speed reduction and (b) superscalar overlapping means that the two halves can largely be executed in parallel.

Which approach works better on your code depends on the amount of vectorisation and the type of operations vectorised.

An additional wrinkle is that many of the newer Intel desktop processors do not support AVX512, due to the E- and P-core architecture: the energy-efficient (slower) E-cores do not handle it. However, the top end of the range still supports AVX512.

Division

Compared to floating point multiplication or addition, the division operation can be very slow. In most aspects of finance calculations, this should not be a significant proportion of the time. However, for multifactor PDEs, the division operation can become a measurable cost.

Intel’s implementation of division seems to gain minimal benefit from SIMD registers: it takes pretty much a constant time per element whether using scalar x86, SSE2, AVX2 or AVX512 (see Intel’s intrinsics guide for throughput and latency). It also appears to use only one port, so it gets limited benefit from superscalar execution. On AMD, SSE2 and AVX2 divisions appear to consume the same number of clock cycles, so AVX2 is twice as fast per element, and AVX512 division has about twice the per-element throughput of SSE2, possibly because AMD can handle division on two ports.
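
Where the same denominator is reused, a standard mitigation is to hoist the division out of the inner loop and multiply by the reciprocal instead (a sketch of the idea; note that without -ffast-math the compiler will not do this for you, as it changes rounding):

    #include <cstddef>

    // Scaling a PDE grid row: one division then n cheap multiplications,
    // instead of n expensive divisions.
    void scaleRow(double* row, std::size_t n, double denom) {
        const double inv = 1.0 / denom;  // hoisted: a single divide
        for (std::size_t i = 0; i < n; ++i)
            row[i] *= inv;               // multiply is far cheaper than divide
    }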

Take Away Points

  • Migration within x86 processors is easy.
  • For vectorised libraries, the AOCC compiler is almost a drop-in replacement for GCC, Clang or Intel OneAPI and provides a significant performance boost over GCC, especially if you use the AMD short vector maths library.
  • For heavily vectorised libraries, on hardware that supports AVX512, we found that compiling for the generation of hardware can provide a further significant performance boost.
