This paper describes SIMD (Single Instruction, Multiple Data) APIs for .NET that expose the parallel execution capabilities of modern processors. They comprise platform-independent APIs, which abstract both the element type and the length of the widest vector available on the target hardware, and platform-specific APIs, which map more directly to the target instructions. The APIs are consumable by high-level languages such as C# and F#, and the just-in-time (JIT) compiler for .NET translates the operations to the appropriate hardware instructions directly in the referencing method, avoiding overhead due to calls or managed/native transitions.
The most straightforward approach is to add new data types and operations to the programming language that directly reflect the data types and operations offered on the target. Operations on these data types are implemented as "intrinsics" which simply translate to the low-level instruction on the target processor [1, 4-7, 9, 11, 12, 15]. This allows (or requires, depending on your perspective) the developer to directly leverage the new functionality.
Auto-vectorization [2] takes existing apps with no explicit SIMD use, and identifies code sequences that can be transformed to utilize SIMD operations. It's a specialized, constrained form of auto-parallelization, and depends upon many of the same analyses.
These are rather costly in their full generality, requiring sophisticated alias analysis and loop dependence analysis (identifying where the same data might be read/written across loop iterations).
Both the addition of new APIs and support for auto-vectorization require changes to the compiler and runtime, but auto-vectorization is significantly more costly, both in development time for the compiler itself and in the time it takes to perform the optimization at compile time. Auto-vectorization has the natural advantage that it doesn't necessarily require developers to change their code. However, it doesn't always achieve the results that developers expect and, due to its compile-time cost, can generally be applied only to statically compiled programs or to long-running server applications to which adaptive compilation techniques can be applied.
The intrinsics approach has two problems. First, it makes the program dependent on both the target architecture and the target vector size (which may range from 64 to 2048 bits, and is likely to continue to increase). Second, it requires the developer to implement their algorithm for each numeric type to which it may be applied (float, double, integer, etc.).
This work first presents a higher-level generic data type which is abstracted by both size and type. It provides efficient access from a managed language to SIMD instructions available on the target processor. Its width can be queried programmatically, and it can be instantiated over the full range of primitive numeric types available on modern SIMD hardware. It provides standard arithmetic operations, which are readily mapped by a JIT compiler to the target instruction set.
This platform-independent solution consists of a generic type, Vector<T>, which can be instantiated over primitive types (integer or floating point). This type provides numeric operations enabling developers to create a single implementation of their algorithm, while applying it to a range of numeric types or precisions. This design provides for efficient translation to the target architecture, as the operations on Vector<T> translate directly into a short sequence of one or more processor instructions. This contrasts with auto-vectorizing compilers, which have a high throughput cost for runtime compilation. Further, this design also provides a high-level programming model, which allows developers to write code that is both target-independent and can be instantiated over a range of element types. This contrasts with the "intrinsic" approach, in which both the element type and length of the vector are fixed. In some cases (such as the x86 C++ intrinsics), even the target processor is fixed at development time.
Vector<T> provides a general-purpose API that programs can use to take advantage of hardware vector instructions, when available at runtime. These can often be introduced with only minor refactoring. However, this very generality makes it difficult to fully take advantage of the complex and specialized vector instructions on modern hardware. For example, some operations (such as shuffles, which provide essential capabilities for matrix operations) are difficult to express in a size- and platform-independent manner. Furthermore, abstractions can impose a penalty in order to map platform-independent semantics onto target platforms with different characteristics. In order to fully utilize the target platform capabilities, we introduced fixed-size data types: Vector64<T>, Vector128<T> and Vector256<T>. APIs defined over these types are platform-specific and map directly to instructions on the target ISA.
Both the platform-independent and the platform-specific types benefit from the support for value types in .NET. These types need not be allocated on the managed heap.
Terminology
In this paper, when we use the term ISA, an abbreviation of Instruction Set Architecture, we use it to mean a set of instructions that are specified to be implemented as a set. That is, if the ISA is implemented, then all the instructions within that ISA can be safely used. For example, AVX is an ISA defined for the x86 family of architectures.
We use the term "managed code" to refer to code that runs in a virtual machine environment such as .NET, generally with automatic memory management. This is in contrast with the term "native code" which refers to code that runs directly on the OS.
The coreclr code base generally refers to the 64-bit ARM architecture AArch64 as Arm64, so that is the term used in this paper. Similarly, the term x64 is used to describe the 64-bit mode of operation on x86-64.
Design Objectives
Dijkstra said "in programming, simplicity and clarity ... are not a dispensable luxury, but a crucial matter that decides between success and failure" [3]. While code performance is important, there are far more available computing resources in the world than there are developers to exploit them, and developer productivity is a large factor in "time to solution." Minimizing complexity and maximizing simplicity and clarity by ensuring that the APIs exposed by the frameworks are clean and consistent is a strength of .NET. In addition, we wanted to ensure that SIMD support was designed in a way that made it a natural extension of the .NET runtime and Frameworks.
Support for General Parallelism
In order to determine how to add SIMD support to .NET, we considered the different types of data parallelism, and how we should support them.
First, general data parallelism refers to problems that operate on large or variable numbers of data elements, and where most of the computation for each element is independent of the other elements. Conceptually, this is a model where each input vector is uniform; that is, all elements are modelling the same thing. The inputs are generally large or variable in size. A SIMD implementation operates on chunks of N elements at a time, where N is the number of elements in each vector.
Explicit N-dimensional parallelism refers to problems that operate on data with fixed dimensionality, and which apply the same operation across dimensions. Examples are points in N-dimensional space and colors.
We wanted to enable .NET developers to leverage both types of data parallelism in their code. For N-dimensional parallelism, an approach similar to that of XNA [11] and Swift SIMD [4, 9] is effective. Vector2, Vector3 and Vector4 data types represent vectors of 2, 3, and 4 floating-point elements, respectively, providing N-dimensional parallelism for dimensions 2, 3, and 4. For general data parallelism, however, we want to ensure that developers can leverage the maximum degree of parallelism available from the target hardware. That is the objective of the Vector<T> API.
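As a minimal sketch of the N-dimensional case, the following uses the System.Numerics Vector3 and Vector4 types for a point and a color. The class and method names (FixedDimensionDemo, Distance, Blend) are hypothetical and chosen only for illustration.

```csharp
using System;
using System.Numerics;

// Hypothetical helper class; only Vector3 and Vector4 come from System.Numerics.
public static class FixedDimensionDemo
{
    // Distance between two 3-D points; the three components are subtracted together.
    public static float Distance(Vector3 p, Vector3 q) => (p - q).Length();

    // Blend two RGBA colors held in Vector4; t = 0 yields a, t = 1 yields b.
    public static Vector4 Blend(Vector4 a, Vector4 b, float t) => Vector4.Lerp(a, b, t);
}
```

Here the dimensionality is fixed by the type, so no width query or remainder handling is needed, in contrast to the general data-parallel case.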
Explicit, Extensible and Target-Independent
The parallelism exposed by the SIMD APIs is explicit and developer directed. This is largely because the JIT is operating in a time-constrained context, but it is also motivated by the fact that auto-vectorization doesn't always yield the expected performance, or can be lost unexpectedly when the source code is refactored. These types and their implementation must be effectively "pay for play." That is, they must have minimal start-up, throughput and working set impact on code that does not use them.
Further, it should be designed for extensibility. That is, it must easily accommodate larger width vector operations (e.g. 128, 256 and eventually larger vectors). Similarly, we would like to expose its capabilities through both a set of platform-independent APIs as well as a set of APIs that provide "code to the metal" capability.
Vector<T> API
A Vector<T> structure holds as many elements of type T as will fit into the SIMD vector of the target instruction set. The number of elements is available via the Vector<T>.Count property. On a given target architecture, the bit size of all Vector<T> instantiations will be the same, but will contain different numbers of elements depending on the size of T.
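The relationship between element size and element count can be checked directly. This is a small sketch; VectorWidthDemo and its members are hypothetical names, while Vector<T>.Count, Vector.IsHardwareAccelerated and Unsafe.SizeOf<T> are existing .NET APIs.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;

// Hypothetical helper class for inspecting the vector width chosen at runtime.
public static class VectorWidthDemo
{
    // Size in bytes of Vector<T> in the current process; expected to be the same for every T.
    public static int WidthInBytes<T>() where T : struct
        => Vector<T>.Count * Unsafe.SizeOf<T>();

    // Human-readable summary of the lane counts for a few element types.
    public static string Describe()
        => $"hardware accelerated: {Vector.IsHardwareAccelerated}, " +
           $"float lanes: {Vector<float>.Count}, " +
           $"double lanes: {Vector<double>.Count}, " +
           $"byte lanes: {Vector<byte>.Count}";
}
```

The concrete counts depend on the machine and runtime configuration, so code should always query Count rather than assume a value.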
Vector<T> is unique in the .NET runtime in that its size is determined at runtime based on the instruction set present at execution time. This is generally determined by the highest-level ISA supported by all of:
• The runtime (i.e. the .NET virtual machine),
• The hardware on which the code is executing, and
• The support available in the JIT.
First, the runtime asks the JIT how large a vector it can support on the target hardware, and then it tells the JIT how large it will be, based on what the runtime is configured to allow. This allows for a reasonable handshake between different runtime and JIT implementations and versions. The developer doesn't make any explicit decisions about what to target, nor must the JIT support a 512-bit vector, even if available on the target hardware.
The operations available on Vector<T> include most of the expected arithmetic operators, as well as additional mathematical operations such as Abs, Sqrt, Min, and Max.
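A minimal sketch of these operations in use follows. In the System.Numerics API the mathematical operations are static methods on the Vector class, with the square root named Vector.SquareRoot. VectorOpsDemo and ClampedRoot are hypothetical names, and the scalar loop for leftover elements is one of several possible ways to handle lengths that are not a multiple of the vector width.

```csharp
using System;
using System.Numerics;

// Hypothetical example class; the Vector and Vector<float> members are existing APIs.
public static class VectorOpsDemo
{
    // result[i] = min(sqrt(|x[i]|), limit), computed Vector<float>.Count elements at a time.
    public static void ClampedRoot(float[] x, float[] result, float limit)
    {
        if (result.Length < x.Length) throw new ArgumentException("result is too short");

        int n = Vector<float>.Count;
        var vLimit = new Vector<float>(limit);    // broadcast the scalar to every lane
        int i = 0;
        for (; i <= x.Length - n; i += n)
        {
            var v = new Vector<float>(x, i);      // load n consecutive elements
            var r = Vector.Min(Vector.SquareRoot(Vector.Abs(v)), vLimit);
            r.CopyTo(result, i);                  // store n results
        }
        for (; i < x.Length; i++)                 // scalar loop for leftover elements
            result[i] = Math.Min((float)Math.Sqrt(Math.Abs(x[i])), limit);
    }
}
```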
The Vector<T> type is effectively constrained to primitive integer or floating-point types. A prototype implementation included an additional Static Interface Constraints feature that would allow us to define an INumeric or IArithmetic interface that included the expected arithmetic operators. Such support has not yet been added; instead, if you instantiate Vector<T> with types that are not supported, you will get a runtime error.
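The single-implementation property can be sketched as a generic method that only requires T to be a value type. GenericVectorDemo and Add are hypothetical names. Because scalar arithmetic on an unconstrained T is not available, this sketch handles leftover elements by copying them into padded temporary arrays, which is an assumption of the example rather than a recommended pattern.

```csharp
using System;
using System.Numerics;

// Hypothetical example class showing one implementation used for several element types.
public static class GenericVectorDemo
{
    // Element-wise result = a + b for any primitive numeric T supported by Vector<T>.
    public static void Add<T>(T[] a, T[] b, T[] result) where T : struct
    {
        if (a.Length != b.Length || a.Length != result.Length)
            throw new ArgumentException("arrays must have equal length");

        int n = Vector<T>.Count;
        int i = 0;
        for (; i <= a.Length - n; i += n)
            (new Vector<T>(a, i) + new Vector<T>(b, i)).CopyTo(result, i);

        int rest = a.Length - i;
        if (rest > 0)
        {
            // Pad the tail to a full vector so that the same vector add can be used.
            var ta = new T[n];
            var tb = new T[n];
            var tr = new T[n];
            Array.Copy(a, i, ta, 0, rest);
            Array.Copy(b, i, tb, 0, rest);
            (new Vector<T>(ta) + new Vector<T>(tb)).CopyTo(tr);
            Array.Copy(tr, 0, result, i, rest);
        }
    }
}
```

The same method body serves int, float, double and the other supported primitive types; the compiler accepts any struct for T, so an unsupported type is only detected when the code runs.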
Vector<T> Example
For illustration purposes, we will use the Mandelbrot set. The Mandelbrot set is the set of complex numbers c for which the following sequence remains bounded:
z_0 = 0, z_{n+1} = z_n^2 + c
The computation of this series for each value of c is entirely independent of any other value of c. And since we will be computing this series for many values, it is a good example of a general data parallel problem. Indeed, it is an example of what is often called an "embarrassingly parallel" problem in that one could continue to add processing units until there was one for each value being computed.
The Mandelbrot program starts with a range of complex numbers, represented as points in the 2D image. It computes the series for each value in that range until it diverges or we reach an iteration limit.
The code shown in Listing 1 is the heart of the Mandelbrot computation. It is using a Complex data type, which is just a struct with the two components of a complex number, and on which we have defined methods to perform addition, square and absolute square. It iterates over the series, retaining the previous value in the series in the accum variable, so that it can be used to compute the subsequent value.
The visualization is done by the DrawPixel method, which assigns a color to each point, based on the number of iterations it took to diverge (or black if iters is the maximum value).
After this method completes, the program then "zooms in" by reducing the range and step and recomputing the set.
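The scalar computation described above can be sketched as follows. This is an illustrative reconstruction, not the paper's Listing 1: the member names (Square, AbsSquared, Iterations, Render), the iteration limit of 1000, and the divergence threshold of 4 on the squared magnitude are assumptions, and the drawPixel callback stands in for the DrawPixel method.

```csharp
using System;

// A struct holding the two components of a complex number (single precision).
public struct Complex
{
    public float Real, Imag;
    public Complex(float real, float imag) { Real = real; Imag = imag; }

    public Complex Square() => new Complex(Real * Real - Imag * Imag, 2f * Real * Imag);
    public float AbsSquared() => Real * Real + Imag * Imag;
    public static Complex operator +(Complex a, Complex b)
        => new Complex(a.Real + b.Real, a.Imag + b.Imag);
}

public static class ScalarMandelbrot
{
    public const int MaxIters = 1000;   // assumed iteration limit
    public const float Limit = 4f;      // |z|^2 > 4 is treated as divergence

    // Number of iterations completed before the series for c diverges (MaxIters if it never does).
    public static int Iterations(Complex c)
    {
        var accum = new Complex(0f, 0f);          // previous value in the series
        int iters = 0;
        while (iters < MaxIters)
        {
            accum = accum.Square() + c;
            if (accum.AbsSquared() > Limit) break; // early exit on divergence
            iters++;
        }
        return iters;
    }

    // Visit every point in the range and hand its iteration count to the caller.
    public static void Render(float xMin, float xMax, float yMin, float yMax, float step,
                              Action<float, float, int> drawPixel)
    {
        for (float y = yMin; y < yMax; y += step)
            for (float x = xMin; x < xMax; x += step)
                drawPixel(x, y, Iterations(new Complex(x, y)));
    }
}
```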
Because each iteration of the inner loop is independent of all the others, we can compute N points in the x-dimension per iteration of the inner loop. This requires only that we can perform N multiplications, additions and absolute values in parallel.
Listing 2 shows the same code, vectorized to compute N values in the series with each iteration of the loop, where N is Vector<float>.Count.
First, we turn the Complex data type into a ComplexVector data type, so that it contains two vectors, each with N values of the two components. This is an example of using an SoA, or Struct of Array representation, where our Vector is effectively an array of length N. Note: This contrasts with the AoS, or Array of Structs representation in which the two components are allocated contiguously, and which doesn't lend itself as easily to vectorization in this kind of problem.
We then modify the inner loop to perform N computations in each iteration. We define methods on ComplexVector to perform the necessary operations, using vector addition and multiplication.
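A sketch of the vectorized form, under the same assumptions as a scalar version with a divergence threshold of 4, is shown below. It is not the paper's Listing 2; ComplexVector's member names and the VectorMandelbrot class are hypothetical. The loop exit is handled with a per-lane mask: Vector.LessThanOrEqual yields all-ones for lanes still within the limit, and that mask is ANDed into the increment so that a lane stops counting once it has diverged.

```csharp
using System;
using System.Numerics;

// Struct-of-arrays form: N real parts and N imaginary parts.
public struct ComplexVector
{
    public Vector<float> Real, Imag;
    public ComplexVector(Vector<float> real, Vector<float> imag) { Real = real; Imag = imag; }

    public ComplexVector Square()
    {
        Vector<float> ri = Real * Imag;
        return new ComplexVector(Real * Real - Imag * Imag, ri + ri);
    }

    public Vector<float> AbsSquared() => Real * Real + Imag * Imag;

    public static ComplexVector operator +(ComplexVector a, ComplexVector b)
        => new ComplexVector(a.Real + b.Real, a.Imag + b.Imag);
}

public static class VectorMandelbrot
{
    // Per-lane iteration counts for N values of c, computed together.
    public static Vector<int> Iterations(ComplexVector c, int maxIters)
    {
        var accum = new ComplexVector(Vector<float>.Zero, Vector<float>.Zero);
        Vector<int> iters = Vector<int>.Zero;
        Vector<int> increment = Vector<int>.One;   // 1 in lanes that are still counting
        var limit = new Vector<float>(4f);

        for (int i = 0; i < maxIters; i++)
        {
            accum = accum.Square() + c;
            // All-ones (-1) in lanes still within the limit, 0 in lanes that have diverged.
            Vector<int> within = Vector.LessThanOrEqual(accum.AbsSquared(), limit);
            increment &= within;                   // a diverged lane never counts again
            if (Vector.EqualsAll(increment, Vector<int>.Zero)) break; // every lane diverged
            iters += increment;
        }
        return iters;
    }
}
```

Lanes that diverge early keep being computed until every lane has diverged or the limit is reached; their results are simply ignored by the mask.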
The only thing slightly complicated about this is that we have an early exit in the loop. For that, we evaluate the condition across the vectors into a Vector<int> and use that to control whether the increment is added in each element. Note that this means that if the original number of iterations was not a multiple of N, we will compute extra values in the series. Of course, for the large number of iterations we are computing, this is not significant.
Figures 1 and 2 show the execution time of Mandelbrot on an Intel® Core™ i7-6700 3.40 GHz CPU and on an ARMv8 pi4 processor, respectively, normalized to the "Raw" scalar single-precision configuration. Note that the same code is used for both targets, and is dynamically compiled to the target instructions at runtime. The ADT numbers show the performance when using the Complex (scalar) and ComplexVector abstract data types, while "Raw" numbers show the performance when using the scalar or Vector<T> types directly.
• The Raw performance shows a performance improvement for all vector sizes on x64.
• The Raw performance shows a performance improvement for single precision on Arm64.
• The ADT performance illustrates an abstraction penalty for all configurations, but is particularly large for x64 single precision scalar, and Arm64 double precision vector.
  - On x64 this is due to early ABI-related pessimization for call arguments that are later inlined.
  - On Arm64 this is due to the lack of struct promotion (scalar replacement) of these wrapped vector types.
• Overall the Arm64 performance reflects the need for further tuning.
Though the abstraction penalty for the vector types is significant, the initial implementation of these types showed a much larger abstraction penalty, which has motivated tuning of the scalar replacement approach for structs that "wrap" the vector types.
On Arm64, however, the abstraction penalty is much larger and the overall benefit from vectorization is smaller. This reflects the fact that JIT support for Arm64 is a less mature implementation and needs further tuning.
Limitations of Vector<T>
Early experience with Vector<T> exposed limitations in the breadth of its applicability.
Gather Operations
As shown with the bi-linear interpolation example in a later section, a gather instruction can benefit algorithms that perform sparse data accesses. While a reasonable API could be defined for a gather operation, not all targets implement an accelerated gather instruction. Providing a software fallback implementation on targets that don't support a hardware-accelerated gather operation may not achieve expected performance, especially since the abstract data type is intended to provide full type and memory safety, while the hardware intrinsic is inherently unsafe and therefore avoids the bounds check overheads.
Shuffle and Permute Operations
Additional examples of challenging problem domains include multi-dimension algorithms, where effective use of SIMD hardware requires support of shuffle and permute operations. These are operations that operate across one or more vectors and move elements around. It is difficult to come up with a good way to express these for a type with dynamic size, and we have yet to develop a good solution for this.
Hardware Intrinsics
Hardware intrinsics were introduced to address the need to provide access to platform-specific instructions, especially those that are not easily abstracted in a more general-purpose API. They differ from the System.Numerics.Vector intrinsics in that they are not general-purpose (they target specific ISAs) and instead directly expose platform and hardware specific functionality to the .NET developer. In addition to providing access to SIMD instructions, they also expose other specialized instructions, such as the CRC checksum instructions available in the SSE4.2 x86 ISA and the CRC32 Arm ISA.
The hardware intrinsics are exposed under the System.Runtime.Intrinsics namespace. For .NET Core 3.0 there currently exists one platform namespace: System.Runtime.Intrinsics.X86. We are working on exposing hardware intrinsics for other platforms, such as System.Runtime.Intrinsics.Arm.
Fixed-Size Vector Types
Unlike the APIs defined over Vector<T>, the hardware intrinsic APIs are defined over fixed-size vectors. These are actually platform-agnostic, although the APIs defined on them are platform-specific:
Vector64<T>: A 64-bit vector of type T.
Vector128<T>: A 128-bit vector of type T.
Vector256<T>: A 256-bit vector of type T.
Note that these are generic types, which distinguishes these from native intrinsic vector types. It also currently limits interoperability with native code, as the runtime currently doesn't support interop for generic types (though this is being prototyped). Although these types are defined across platforms, Vector64<T> intrinsics are defined only on Arm64, and Vector256<T> intrinsics are defined only on x86.
From a hardware perspective, the vector registers are, at least nominally, of the same type. That is, a vector of floats may occupy the same register, by the same name, as a vector of doubles or even a vector of ints. The type information is a part of the opcode. However, for a high-level programming language, it is desirable for the element type to be part of the vector type itself. This further allows for the use of more mnemonic names. For example, we can use Add as the method name for all of the vector add intrinsics, with the type providing the mapping to the correct opcode, e.g. vaddps (x86 AVX), paddw (x86 SSE2), or vaddvq_s8 (Arm NEON).
ISA Classes
Under the platform specific namespaces, the intrinsics are grouped into classes which represent logical ISAs. Each class then exposes an IsSupported property that indicates at runtime whether the hardware supports that ISA. Each class then also exposes a set of methods that map to the underlying instructions exposed by that instruction set.
In some cases, the implementation of one ISA implies another, in which case it is declared as a subclass of the other. For example, the Lzcnt class provides access to the leading zero count instructions. There is then a subclass named X64 which exposes the forms of the instruction that are only usable on 64-bit machines. Thus, if Lzcnt.X64.IsSupported returns true, then Lzcnt.IsSupported must also return true since Lzcnt is the base class. Likewise, if Sse2.IsSupported returns true, then Sse.IsSupported must also return true because Sse2 explicitly inherits from the Sse class. However, it is worth noting that similar class names do not imply a relationship in the class hierarchy. For example, Bmi2 does not inherit from Bmi1 and so the IsSupported checks for the two instruction sets are distinct from each other. Listing 3 shows an outline of two of these classes.
An important feature of the hardware intrinsics is that the IsSupported checks are evaluated at the time the method is compiled by the JIT, which is generally at runtime, and any code that becomes unreachable due to these (known constant) checks is discarded by the JIT. Frameworks such as CoreFX and ML.NET take advantage of these methods to help accelerate things like copying memory, searching for the index of an item in an array/string, resizing images, or working with vectors, matrices, and tensors.
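The typical shape of code guarded by an IsSupported check can be sketched as follows. IsaCheckDemo and its Add method are hypothetical; Sse2.IsSupported, Sse2.Add, Vector128<T> and MemoryMarshal.Cast are existing APIs. Reinterpreting the spans with MemoryMarshal.Cast is just one way to obtain vectors without pointers and is an assumption of this example.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical example class: element-wise integer add with an SSE2 path and a scalar path.
public static class IsaCheckDemo
{
    public static void Add(ReadOnlySpan<int> a, ReadOnlySpan<int> b, Span<int> result)
    {
        if (a.Length != result.Length || b.Length != result.Length)
            throw new ArgumentException("spans must have equal length");

        int i = 0;
        if (Sse2.IsSupported)
        {
            // The check is a JIT-time constant: on hardware without SSE2 this block is
            // discarded, and on hardware with SSE2 the check itself disappears.
            ReadOnlySpan<Vector128<int>> va = MemoryMarshal.Cast<int, Vector128<int>>(a);
            ReadOnlySpan<Vector128<int>> vb = MemoryMarshal.Cast<int, Vector128<int>>(b);
            Span<Vector128<int>> vr = MemoryMarshal.Cast<int, Vector128<int>>(result);
            for (int j = 0; j < vr.Length; j++)
                vr[j] = Sse2.Add(va[j], vb[j]);
            i = vr.Length * Vector128<int>.Count;   // elements already processed
        }

        for (; i < result.Length; i++)              // scalar fallback and leftover elements
            result[i] = a[i] + b[i];
    }
}
```

Because unsupported paths are removed at JIT time, a method like this can carry several ISA-specific branches without a runtime dispatch cost.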
Bi-linear Interpolation
Bi-linear interpolation is a technique that is used to estimate additional data points from a limited data set.
Because a discrete set of reference points is used, mapping the indices of the reference points to and from the reference values they represent involves conversion between integer and floating point. Further, because these indices are not contiguous for contiguous input values, the algorithm benefits from support for gather operations, i.e. loading of non-contiguous array elements simultaneously.
The bi-linear interpolation sample we are looking at came from a customer in the finance sector who wanted to achieve higher performance using vectors. It illustrates some of the limitations of the initial implementations of Vector<T>. This code interpolates values of a function F using two different reference sets, so the values being interpolated will all lie along the diagonal. Each reference set is given a different weight, and the range and delta of the input values for each set may differ. To interpolate the value of F at a point x, we use the two nearest values of each of our reference functions and apply the given weight. Unlike some bi-linear interpolation algorithms, this one is not actually quadratic.
The algorithm is described in Listing 4. Listings 5 and 6 show an excerpt (the computation of the A reference points) of the bi-linear interpolation algorithm implemented using Vector<T> and the hardware intrinsics, respectively.
For each point in our input set (the array x), we compute the lower and upper reference points for each reference set A and B. Note that our input values, as well as our reference points, are floating point values, and we use the range and the delta between the points to compute the integral index. This computation requires conversion between integer and floating-point values. Once we have the indices for our reference points, we load them and apply the weights to the interpolated values.
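One plausible reading of the scalar algorithm is sketched below. It is not the customer's code or the paper's Listing 4: the class, method and parameter names are hypothetical, each reference set is assumed to be sampled at evenly spaced points beginning at a start value, and the caller is assumed to supply inputs that fall inside the range covered by both reference arrays.

```csharp
using System;

// Hypothetical sketch of weighted interpolation over two reference sets A and B.
public static class ScalarInterpolationSketch
{
    // Linear interpolation of one reference set at x.
    // values[k] is assumed to be the reference value at start + k * delta.
    static double InterpolateOne(double x, double[] values, double start, double delta)
    {
        double pos = (x - start) / delta;   // position in units of reference points
        int lower = (int)pos;               // floating point to integer: index of lower point
        double frac = pos - lower;          // integer back to floating point: offset in [0, 1)
        return values[lower] + frac * (values[lower + 1] - values[lower]);
    }

    // result[i] = weightA * A(x[i]) + weightB * B(x[i]).
    public static void Interpolate(
        double[] x, double[] result,
        double[] refA, double startA, double deltaA, double weightA,
        double[] refB, double startB, double deltaB, double weightB)
    {
        for (int i = 0; i < x.Length; i++)
        {
            double a = InterpolateOne(x[i], refA, startA, deltaA);
            double b = InterpolateOne(x[i], refB, startB, deltaB);
            result[i] = weightA * a + weightB * b;
        }
    }
}
```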
The initial step in our vectorized version is to load the next N values from our input array x (where N is the number of elements in our vector). This is done with a constructor taking an array or pointer and the starting index. We then subtract the initial value in our reference vector and multiply by the delta between reference points. However, the indices are not contiguous for contiguous x[i], so this requires a gather operation, not currently supported for Vector<T>, so we must use a loop to construct the result vectors.
Figure 3 shows the performance of this algorithm in several configurations on an Intel® Core™ i7-6700 3.40 GHz CPU:
• Scalar is the sequential algorithm.
• Vector<T> 256 is vectorized using the Vector<T> APIs, also using the VEX encodings.
• Vector256 is vectorized using the Hardware Intrinsics APIs.
• Scalar no VEX is the sequential algorithm compiled to use no VEX encodings. This demonstrates both that the VEX encodings occasionally impose a penalty and that the JIT is not using optimal encodings.
• Vector<T> 128 is the same code as the Vector<T> 256 configuration, but with the JIT configured to use no VEX encodings. This limits the vector size to 128 bits, thus enabling only 2-way parallelism.
Listing 9. Generated Code Excerpt for AVX Intrinsics
An excerpt of the AVX and Arm64 generated code for the Vector<T> implementation is shown in Listings 7 and 8 (note that they are generated from the same source), and the generated code for the AVX hardware intrinsics implementation is shown in Listing 9.
There are a couple of limitations of the Vector<T> APIs that make it difficult to achieve an optimal speedup. First, they do not include a gather operation, so the individual indices must be extracted from the vector in order to do the final set of loads from the A and B arrays. Second, the conversion operations for narrowing and converting between integer and floating point values are separate, so we can't leverage the native support for converting between vectors of double and 32-bit integer. These constraints offset the gain from vectorizing the other operations.
In addition, the array accesses are bounds-checked. Even though the algorithm is designed to ensure that the references will be within the bounds of the array (and this should be discernible by the JIT within the method), the JIT is not able to eliminate the bounds checks for these references, which has a significant impact. There is also a bounds check at the beginning of each iteration for the vector load from the input array x. This has less impact, and is more difficult to eliminate, as the JIT doesn't know a priori that the x array has been allocated to ensure that it will evenly divide by the size of the vector. Future API, language and JIT enhancements may alleviate these issues while preserving the safety desired in a high-level programming model.
The 2-way parallelism enabled by the 128-bit Vector<T> is insufficient to overcome these overheads on x86/64. However, on Arm64, the different characteristics of the load/store architecture make the 128-bit Vector<T> implementation slightly faster than the scalar version.
The Vector<T> APIs are designed to be safe, and so do not provide reads and writes to or from pointers. In contrast, the hardware intrinsics implementation is able to utilize the Avx2.Gather method, and uses pointers to access memory, avoiding the bounds checks.
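The manual gather that the Vector<T> version needs can be sketched as follows for a single reference set. This is an illustration under assumptions, not the paper's Listing 5: the names are hypothetical, the reference values are assumed to be evenly spaced from a start value, and the inputs are assumed to lie inside the reference range. Vector.ConvertToInt64 and Vector.ConvertToDouble are existing System.Numerics methods.

```csharp
using System;
using System.Numerics;

// Hypothetical sketch: vectorized index computation followed by an element-by-element gather.
public static class VectorInterpolationSketch
{
    public static void Interpolate(double[] x, double[] values, double start, double delta,
                                   double[] result)
    {
        int n = Vector<double>.Count;
        var vStart = new Vector<double>(start);
        var vInvDelta = new Vector<double>(1.0 / delta);
        var lowerBuf = new double[n];   // staging buffers for the gathered reference values
        var upperBuf = new double[n];

        int i = 0;
        for (; i <= x.Length - n; i += n)
        {
            var vx = new Vector<double>(x, i);                 // bounds-checked vector load
            Vector<double> pos = (vx - vStart) * vInvDelta;
            Vector<long> idx = Vector.ConvertToInt64(pos);     // truncating conversion
            Vector<double> frac = pos - Vector.ConvertToDouble(idx);

            // No gather on Vector<T>: extract each index and load the elements one by one.
            for (int k = 0; k < n; k++)
            {
                int j = (int)idx[k];
                lowerBuf[k] = values[j];                       // bounds-checked
                upperBuf[k] = values[j + 1];                   // bounds-checked
            }

            var lower = new Vector<double>(lowerBuf);
            var upper = new Vector<double>(upperBuf);
            (lower + frac * (upper - lower)).CopyTo(result, i);
        }

        for (; i < x.Length; i++)                              // scalar loop for leftover inputs
        {
            double pos = (x[i] - start) / delta;
            int j = (int)pos;
            result[i] = values[j] + (pos - j) * (values[j + 1] - values[j]);
        }
    }
}
```

The inner loop over k is where the gain from the vectorized index arithmetic is given back, which matches the overheads discussed above.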
Reading, Writing and Retyping Vectors
Effectively using these intrinsics requires the ability to manipulate values that may reside in registers or in memory, as well as the ability to reinterpret vectors between different element types. The addition of the System.Runtime.CompilerServices.Unsafe set of APIs makes this easier:
• Unsafe.As enables a reinterpretation cast (i.e. no actual change in representation) between vectors of different element types.
• Unsafe.ReadUnaligned and Unsafe.WriteUnaligned methods make it possible to express low-level loads and stores in a high-level language.
• Explicit load or store intrinsics allow reading or writing with pointers, and include access to target instructions with attributes such as Aligned, NonTemporal, or Scalar.
• The Span<T> data type represents a slice of arbitrary memory containing elements of type T. This is used by the Vector<T>.CopyTo methods.
In addition, the Vector<T> type provides explicit reinterpreting conversions between vectors of different element types (e.g. between Vector<byte> and Vector<int>).
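A brief sketch of these facilities follows. RetypeDemo and its method names are hypothetical; Vector.AsVectorInt32, Unsafe.As, Unsafe.ReadUnaligned, MemoryMarshal.GetReference and Vector<T>.CopyTo(Span<T>) are existing APIs.

```csharp
using System;
using System.Numerics;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical example class showing reinterpretation, unaligned reads and span stores.
public static class RetypeDemo
{
    // Vector<T>: view the same bits as a different element type.
    public static Vector<int> BitsOf(Vector<float> v) => Vector.AsVectorInt32(v);

    // Fixed-size vector: Unsafe.As reinterprets without changing the representation.
    public static Vector128<int> BitsOf(Vector128<float> v)
        => Unsafe.As<Vector128<float>, Vector128<int>>(ref v);

    // Read a Vector<float> from bytes that need not be aligned.
    public static Vector<float> Load(ReadOnlySpan<byte> bytes)
    {
        if (bytes.Length < Vector<byte>.Count)
            throw new ArgumentException("not enough bytes for one vector");
        return Unsafe.ReadUnaligned<Vector<float>>(ref MemoryMarshal.GetReference(bytes));
    }

    // Write a vector into a slice of memory described by a Span<T>.
    public static void Store(Vector<float> v, Span<float> destination) => v.CopyTo(destination);
}
```

The length check in Load is the caller-visible cost of using the unsafe read responsibly: Unsafe.ReadUnaligned itself performs no bounds checking.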
Since their initial implementation, both the abstract vector types as well as the hardware intrinsics have been incorporated into applications that run on .NET.
ML.NET
The ML.NET machine learning framework previously utilized SSE intrinsics by calling out to a library written in native C++, using the C++ intrinsics. It is now using both SSE and AVX intrinsics directly from managed C#. This achieved comparable performance on SSE and an improvement of roughly 20% when using AVX [10]. Since those results were published, additional intrinsics have been added in support of factorization.
Frameworks
Since their introduction, both the abstract Vector<T> type and the hardware intrinsics have been used in the frameworks to accelerate code such as text manipulation. Examples include:
• System.Text.Json.JsonReader
• System.Text.Encodings.Web.JavaScriptEncoder
• System.Span
Implementation Details
C# Source Files
Each ISA class has an associated implementation in a C# source file. The class is marked with an [Intrinsic] attribute. The implementations of the intrinsic methods on that class are recursive. When the VM encounters such a method, it will communicate to the JIT that this is an intrinsic method and will also pass a mustExpand flag to indicate that the JIT must generate code. This allows these methods to be invoked indirectly to support the following scenarios:
• Debugging
• Reflection invocation
• Execution of an intrinsic with a non-constant operand when an immediate operand is required by the target instruction
In most cases, when a call to an intrinsic is encountered, the mustExpand flag is false. If the JIT fails to expand the intrinsics (e.g. because the target platform is not supported, or a required immediate operand is non-constant), a regular method call is emitted. Then, when the actual method is subsequently jitted, the runtime sets the mustExpand flag to true, and the JIT will either emit code to throw the PlatformNotSupportedException, generate the single instruction (for the debugging or reflection case) or generate a switch table for non-constant operands.
JIT Support for Intrinsics
Platform Target Information
The JIT depends on the VM and configuration settings to determine what target platform to generate code for. These are communicated to the JIT through a handshake mechanism.
Importation
The JIT has a NamedIntrinsic mechanism to identify method calls that may be recognized as intrinsics. In the incoming IL, intrinsic invocations are just method calls, so the JIT must distinguish intrinsic calls from ordinary call-sites and map them to a special intrinsic IR node. The JIT maintains a set of intrinsic tables that provide the information it needs to translate the intrinsic to machine code.
The [Intrinsic] attribute has a different meaning on each attribute target:
• Method: call targets marked with [Intrinsic] will be checked by the JIT when importing call-sites. If the method's (namespace, class name, method name) triple matches a record in one of the intrinsic tables, it will be recognized as an intrinsic call.
• Struct: value types marked with [Intrinsic] are recognized by the JIT as special vector types.
• Class: marking reference types with [Intrinsic] causes any member methods to be considered as possible intrinsics.
Currently, the JIT determines in the importer whether it will:
• Generate code for the intrinsic (i.e. it is recognized and supported on the current platform).
• Generate a call (e.g. if it is a recognized intrinsic but an operand is not immediate as it is expected to be). The mustExpand option, which is returned by the VM as an "out" parameter to the getIntrinsicID method, must be false in this case.
• Throw PlatformNotSupportedException if it is not a recognized and supported intrinsic for the current platform.
Intrinsics Tables
There are intrinsics tables for the abstract vector types, as well as each platform that supports hardware intrinsics. These tables are intended to capture information that can assist in making the implementation as data-driven as possible, which greatly reduced the cost of implementation by enabling many intrinsics to be supported simply by adding entries to the table, or a whole set of intrinsics to be added by adding an additional classification.
IR
The intrinsic nodes contain the following information:
• The intrinsic ID, which indexes into the associated table.
• The "base type" indicates the type of the vector elements (the generic type argument).
• The size field indicates the full byte width of the vector (e.g. 16 bytes for Vector128<T>).
Lowering
Lowering is responsible for transforming the IR in such a way that the control flow, and any register requirements, are fully exposed. This includes determining what instructions can be "contained" in another, such as immediates or addressing modes.
It is the job of Lowering to perform the necessary legality checks, and then to mark operands as contained as appropriate. This allows the implementation of the intrinsics to be somewhat flexible with regard to whether the operands are in memory or in registers, on targets that support both register and memory operands for the same instruction.
Register Allocation
The register allocator uses a linear scan approach. Its primary consideration regarding vector registers is satisfying the ABI requirements.
Some vector calling conventions (256-bit vectors on x86, and 128-bit vectors on Arm64) do not preserve the upper half of the callee-save vector registers. This requires the register allocator to separately track the upper and lower halves of these values across calls.
Code Generation
By design, the actual code generation is fairly straightforward, since the hardware intrinsics are intended to each map to a specific target instruction. Much of the implementation of the x86 intrinsics is table-driven.
12 Future Work
Auto-Vectorization
The APIs outlined here provide the opportunity for a high-level optimizer to perform auto-vectorization, either in a platform-independent manner (targeting Vector<T>) or by generating multi-versioned code for specific target ISAs.
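As an illustration of the platform-independent target, the sketch below pairs a scalar loop with a hand-written Vector<T> equivalent of the kind such an optimizer might produce. It is a minimal example, not the output of any existing tool; the class and method names are hypothetical, and it assumes all three arrays have the same length.

```csharp
using System;
using System.Numerics;

public static class VectorAddSketch
{
    // Scalar loop: the kind of code an auto-vectorizer would take as input.
    public static void AddScalar(float[] a, float[] b, float[] result)
    {
        for (int i = 0; i < a.Length; i++)
            result[i] = a[i] + b[i];
    }

    // Equivalent loop written against Vector<T>; assumes equal-length arrays.
    public static void AddVectorized(float[] a, float[] b, float[] result)
    {
        int width = Vector<float>.Count; // elements per vector on this hardware
        int i = 0;

        // Main loop: process one full vector of elements per iteration.
        for (; i <= a.Length - width; i += width)
        {
            var va = new Vector<float>(a, i);
            var vb = new Vector<float>(b, i);
            (va + vb).CopyTo(result, i);
        }

        // Remainder loop: handle the elements that do not fill a vector.
        for (; i < a.Length; i++)
            result[i] = a[i] + b[i];
    }
}
```

Because the vector width comes from Vector<float>.Count rather than a constant, the same code adapts to whatever width the target hardware provides.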
Additional APIs
Future APIs may extend both sets of intrinsics in several directions:
Arm Neon
Future work will include Neon intrinsics for 64-bit Arm processors.
16-bit Floating Point
Developers have requested support for half-precision floating point, which can be useful in a number of scenarios, including machine learning.
Floating Point Controls
This would include flush-to-zero on denormals, and NaN flagging, as well as the ability to opt in to a "fast math" mode that allows for optimizations that may change either the precision of results or the order of possible exceptions.
Vector Versions of Various Mathematical Functions
This would include trigonometric, hyperbolic, exponentiation and logarithmic functions, as well as the ability to compute single-precision results for vector operations (currently the scalar versions of these functions return double-precision results).
Full Vector ABI Support
While the Vector ABI for Arm64 is currently supported, the x86 implementation doesn't yet implement the default ABI for vectors, either on Windows or Linux. This support must be implemented to enable interoperability with native code, and will also improve performance for managed-to-managed calls.
General Performance
One of the distinguishing features of the .NET runtime is its support for value types, i.e. types that can live on the stack (or in registers) and can be directly contained in other objects or value types.
First, value types, including vector values, are currently copied more often than we would like. Also, generating the most efficient code for the fixed vector types (Vector2, Vector3 and Vector4) sometimes requires the value to live exclusively in a single vector register, while at other times it requires splitting the value into its constituent fields. In the future, we hope to add support for making this choice according to usage.
Second, the JIT is not yet handling expressions involving vectors as well as those involving only primitive types, so redundant expressions are not always eliminated or pulled out of loops when they could be. This is an area of current work in the JIT.
Third, we have tuned the inlining heuristics to favor inlining of methods involving SIMD vectors, but more tuning is needed, and work on the inlining heuristics is ongoing.
Finally, in spite of best efforts, the performance of non-accelerated Vector<T> methods is often such that an alternate algorithm is preferable when the intrinsic is unsupported. We've had requests for a mechanism that would enable the application to query, on a per-method basis, which methods will be recognized as intrinsics by the JIT.
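Today, an application can make this choice only at a coarse granularity, using the existing Vector.IsHardwareAccelerated property. The sketch below selects between a Vector<T> algorithm and a scalar alternative on that basis; it illustrates the global check, not the per-method query requested above, and the class and method names are hypothetical.

```csharp
using System;
using System.Numerics;

public static class SumSketch
{
    // Picks the vector algorithm only when Vector<T> is accelerated by the JIT.
    public static int Sum(int[] values)
    {
        return Vector.IsHardwareAccelerated ? SumVectorized(values) : SumScalar(values);
    }

    public static int SumScalar(int[] values)
    {
        int total = 0;
        for (int i = 0; i < values.Length; i++)
            total += values[i];
        return total;
    }

    public static int SumVectorized(int[] values)
    {
        Vector<int> acc = Vector<int>.Zero;
        int width = Vector<int>.Count;
        int i = 0;

        // Accumulate one partial sum per vector lane.
        for (; i <= values.Length - width; i += width)
            acc += new Vector<int>(values, i);

        // Horizontal sum: the dot product with all-ones adds up the lanes.
        int total = Vector.Dot(acc, Vector<int>.One);

        // Add the elements that did not fill a whole vector.
        for (; i < values.Length; i++)
            total += values[i];
        return total;
    }
}
```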
Related Work
Stojanov et al. [5] present a mechanism for automatically translating the XML description of x86 intrinsics available from Intel to native methods callable by code running under the JVM. While this has the significant advantage of being automatically generated, it suffers from a transition penalty when the intrinsics are called from managed code.
Nie et al. [14] present a combined approach for adding SIMD support to Java: a Java vectorization interface for explicit use of vector types, as well as auto-vectorization support (which this work does not include). Because the explicit vector types are recognized and expanded in the JIT compiler, it doesn't suffer from the transition penalty of [5], though supported operations are limited, and the vector size is limited to 128 bits.
XNA [11] is a framework (now deprecated) for Xbox development that was released in 2006. It includes a number of data types with SIMD semantics, including Vector2-4, Matrix, and Quaternion. The Matrix type is equivalent to our Matrix4x4. However, the type sizes and the methods defined on them are limited.
Many C++ compilers include support for the set of SIMD intrinsics defined by Intel [7] for the x86 family of architectures and/or those defined by Arm [1] for the Neon architecture. These each define target-specific data types and intrinsic methods that are a one-to-one match with target instructions. These are analogous to our hardware intrinsics, but don't provide as seamless an integration with the language and type system.
Enoki [8] is a C++ template library that provides a high-level abstraction, including support for wide and multidimensional types, that builds upon CUDA and low-level intrinsics. It solves a distinct set of problems from this work, which is specifically targeted toward a managed runtime and is more analogous to, yet more abstract than, the low-level intrinsics Enoki builds upon.
Rust [15] provides a set of intrinsics similar to those provided by C++ under std::arch. It provides a guard facility similar to the IsSupported property of .NET's Hardware Intrinsics ISA classes: is_target_feature_detected!. There is also proposed abstract SIMD support under std::simd. These appear to be fairly similar to those described here, though it is also a statically compiled language, and so does not generate code specific to the current execution target. We believe the APIs defined for .NET provide a more seamless integration into the runtime.
Mono.SIMD [12] offered a range of 128-bit vectors which largely reflected the available functionality on the SSE family of instructions.
Swift simd [4, 9] includes a broader range of types, with a uniform naming convention (type name followed by number of elements, including 2D matrices) but only fixed sizes (not scalable to higher widths). It also has support for some interesting cases of permutation (extracts, rotates).
Go [6] provides fixed-size abstract vector types Vec32 and Vec64, and defines a set of mathematical operations over those types.
Most of these approaches are closely tied to the x86 family of Instruction Set Architectures.
The approaches taken by XNA, Swift and Go seem to be closest in spirit to the fixed-size vector types (Vector2, Vector3 and Vector4) described here, offering vector types as first-class data types. But the set of types is limited and doesn't scale beyond 128-bit vectors.
The Rust and C/C++ intrinsics are closest in spirit to the hardware intrinsics APIs described here. However, they do not readily provide the ability to eliminate the dead paths (for other than the current target ISA) at runtime at a fine granularity.
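For comparison, the following C# sketch shows the guard pattern used with the .NET hardware intrinsics. The class and method names are hypothetical, and the example is kept deliberately small. Because IsSupported is known to the JIT when the method is compiled, the branch for the ISA that is not available can be dropped from the generated code.

```csharp
using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

public static class GuardedAddSketch
{
    // Adds the first four elements of a and b; assumes both have at least four.
    public static int[] Add4(int[] a, int[] b)
    {
        var result = new int[4];

        if (Sse2.IsSupported)
        {
            // Hardware path: a single 128-bit integer add.
            Vector128<int> va = Vector128.Create(a[0], a[1], a[2], a[3]);
            Vector128<int> vb = Vector128.Create(b[0], b[1], b[2], b[3]);
            Vector128<int> sum = Sse2.Add(va, vb);
            for (int i = 0; i < 4; i++)
                result[i] = sum.GetElement(i);
        }
        else
        {
            // Software fallback for targets without SSE2 (e.g. Arm64).
            for (int i = 0; i < 4; i++)
                result[i] = a[i] + b[i];
        }

        return result;
    }
}
```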
In summary, the SIMD approach taken for .NET provides both abstract and low-level intrinsics for leveraging SIMD operations. In contrast to the statically compiled solutions, it alleviates the need to choose between dynamically checked multi-version code and compiling to the lowest common denominator. In contrast to other dynamically-compiled solutions, it provides a wider range of both abstract SIMD capabilities and "to the metal" programming.
Lessons Learned
The main lesson learned from adding SIMD support to .NET was that iteration and feedback are essential in determining the right abstractions and how to expose them. The original implementation of Vector<T> was done when the .NET Framework was closed source, and turnaround time for releases was relatively long. The hardware intrinsics were added after .NET was open-sourced, and they benefited greatly from community feedback and faster turnaround.
The other notable lesson was the extent to which this work impacted .NET more broadly:
• It has been (and continues to be) a driving force in improving the performance of value types in .NET. While the existence of value types has long been a strength of .NET, the vector types exposed a number of ways in which performance of these types could be improved. The plans for further improvement in the handling of those types are in [13].
• Effective use of these types was one of the motivating factors in adding support for unsafe reads and writes.
• Tuning of inlining heuristics was necessary to ensure good performance.
Conclusions
The SIMD support in .NET makes vector and specialized computation available to .NET developers in a set of APIs that uniquely provide:
• Seamless integration with the type system and runtime,
• A breadth of capabilities, and
• No transition penalty.
It is used in the runtime itself and thereby accelerates code running at industry scale, and has accelerated real world code such as ML.NET.
A Artifact Appendix

A.1 Abstract
The artifacts include two versions of the .NET runtime (for x86-64 Windows and ARM AArch64 Ubuntu), and instructions for downloading and running the .NET benchmarks for these targets. These benchmarks illustrate the performance improvement achieved when using the vector types described in the paper. 
A.2 Artifact Check-List (Meta-Information)
A.3.3 Software Dependencies
In order to install Ubuntu 18.04 on the Raspberry Pi, it is necessary to upgrade the firmware using a Raspbian installation. Both the upgrade and the installation of Ubuntu 18.04 were done according to the instructions here: https://jamesachambers.com/raspberry-pi-4-ubuntu-serverdesktop-18-04-3-image-unofficial/. On Ubuntu, dotnet depends on libgdiplus, which can be installed via "sudo apt-get install libgdiplus".
A.6 Evaluation and Expected Result
The paper presents the Mean execution time, normalized to the scalar version of the algorithm. The raw data is shown in the included tables, truncated to no more than 5 digits of precision, and omitting the Median, Min, Max and memory allocation statistics. Note that the BilinearInterpol benchmark reports an AVX number even when AVX is not present; this number can be ignored, as the AVX version doesn't actually run. These numbers are somewhat noisy, so results are expected to be within about two percent of those reported.
A.7 Experiment Customization
The .NET Performance repo includes additional benchmarks, and can be extended to include others. Benchmarks can be run with 
A.8 Notes
These benchmarks can be run with a more recent version of .NET 5, available here: https://github.com/dotnet/core-sdk/blob/master/README.md#installers-and-binaries. Since it is undergoing active development, performance is likely to change over time; hence it was deemed desirable to provide the version used for the paper.
The performance of most of these benchmarks is reasonably stable, though the VectorDoubleSinglethreadADT version is noisy.
A.9 Methodology
Submission, reviewing and badging methodology: 
