This paper describes a short and simple way of improving the performance of vector operations (e.g. X = aY + bZ + ...) applied to large vectors. In a previous paper [1] we described how to take advantage of the high-performance vector copy operation provided by the ATLAS library [2] in the context of the C++ Expression Template (ET) mechanism. Here we present a multi-threaded implementation of this approach. The proposed ET implementation, which involves a parallel blocking technique, leads to a significant performance increase (up to ×2.7) compared to existing implementations on dual-socket x86_64 targets.
INTRODUCTION
In this paper we propose a simplified C++ implementation of a linear algebra vector class that allows composing compact and abstract vector expressions such as:

X = aY + bZ + ...     (1)
This implementation is based on the Expression Template (ET) mechanism introduced by Veldhuizen [3] and Vandevoorde [4]. ET avoids temporary vectors, and the performance of abstract expressions like (1) competes with that of the corresponding low-level, loop-based basic implementations such as:

for (i = 0; i < N; i++) X[i] = a*Y[i] + b*Z[i];     (2)
In a previous paper [1] we illustrated the gain obtained from mixing our ET-based vector class with the ATLAS [2] implementation of the procedural BLAS library. The high-performance ATLAS implementation of the vector copy operation, which relies partly on low-level assembly language, is used as a kernel for the vector ET evaluation.
In this paper, we present two multi-threaded implementations of our ET-based vector class, based on Intel Threading Building Blocks and on OpenMP respectively. The ubiquitous presence of multi-core architectures has rekindled interest in shared memory parallel programming models. While the computational power of a chip scales almost linearly with the number of cores, this is not the case for the memory access bandwidth. Hence, the fraction of so-called memory bound applications, whose performance is limited by the memory access bandwidth of the target architecture, increases with the number of cores per processor.
Vector expressions like (1) exhibit a small arithmetic intensity and are strongly memory bound. Hence it might be surprising that multi-threaded implementations accelerate these tasks significantly on shared memory multi-core machines. Nevertheless, we have observed a ×2.5 acceleration factor on dual-socket quad-core processors compared to our previous implementation. The resulting vector class allows composing abstract vector expressions like (1) that perform better than both loop-based implementations such as (2) and off-the-shelf vector libraries such as Blitz++ [5], uBLAS [6] or std::valarray. This performance gain reaches a factor of ×2.7 for large vectors.
The paper is organised as follows: Section 2 presents the considered vector operations. Section 3 presents large vector operations as typical memory bound problems. Section 4 gives a short description of our C++ vector class implementation based on ET; performance measurements show that this implementation avoids abstraction penalties. Section 5 presents our enhanced ET vector class relying on the ATLAS dcopy kernel, together with the corresponding performance measurements. The remaining sections present the parallel implementations of this blocked evaluation, compare them with other available libraries, and conclude.
VECTOR DEFINITION AND TEST PLATFORMS
Let us first define the scope of this paper and what we refer to as vector operations. From the linear algebra point of view, vectors can be defined as indexed collections of numerical elements of the same type. Indexed means that the value of every vector element can be accessed through an integer index chosen in a given range. While a wide variety of linear algebra vector types (sparse, multidimensional, ...) can be considered, we focus on simple vector types where real elements (single or double precision) are stored in basic containers that can be defined and exchanged through the following common programming languages: F77, C and C++. Within these languages, the location of a contiguous memory region containing a given number of floating point elements can be manipulated either as a pointer type (C and C++) or as an array type (F77). These arrays are the main input/output types for the Basic Linear Algebra Subroutines (BLAS) API [7].
Performance Measurements and Target Architectures Description
This paper is based on performance measurements that have been carried out on different x86_64 target architectures. Table 1 provides a short description of the main features of these machines. In the following, the performance curves will be named after these four architectures. Most of the time, the different targets exhibit the same kind of performance behavior; in this case, we report only the Xeon E5410 curves. The performance measurements are carried out with the tools developed by the BTL++ project [8].

Fig. 1 shows the performance of the vector operation Y ← αX + Y (axpy) on a Xeon E5410 using single and double precision floating point elements respectively. Performance is maximal for vectors in the range of [10^2, 10^3] elements. From sizes around 10^4, performance decreases and reaches its lowest level when vector sizes exceed 10^5 elements. The reason for this behavior, which is common to all vector operations, is that the performance is mainly driven by the memory access bandwidth. This is true for all computations for which the ratio

r = (number of memory accesses (read + write)) / (number of floating point operations)

is not small compared to 1. For the axpy operation, each element requires two reads, one write and two floating point operations, hence r = 3/2 and performance depends on the memory access bandwidth. Most modern architectures exhibit a memory cache hierarchy with high bandwidth for data access inside the cache (data size ≤ a few MBytes) and a much lower bandwidth for data access in the main memory (a few MBytes ≤ data size ≤ a few GBytes). This explains the low level of performance observed for large vectors that do not fit in the cache hierarchy.
LARGE VECTOR OPERATIONS AND SUPERSCALAR ARCHITECTURE
Large vectors: vector sizes ∈ [10^6, 10^8].
Since the double precision axpy operation requires twice as much memory bandwidth as the single precision one, it naturally runs twice as slowly for large vectors (0.33 Gflops vs. 0.66 Gflops).
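As a rough consistency check of these numbers (the bandwidth figure used below is an assumption chosen for illustration, not a measurement reported here): the axpy operation performs 2 floating point operations per element while touching 3 elements (two reads and one write). Assuming a sustained main-memory bandwidth of about 4 GB/s, double precision (8-byte) elements give (2 flops / 24 bytes) × 4 GB/s ≈ 0.33 Gflops, while single precision (4-byte) elements give (2 flops / 12 bytes) × 4 GB/s ≈ 0.66 Gflops, in line with the large-vector plateaus of Fig. 1.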
MINIMAL C++ VECTOR CLASS
This section presents a minimal C++ vector class based on the Expression Template mechanism.
Expression Template
ET were introduced by T. Veldhuizen [3] and D. Vandevoorde [4]. Applied to a vector class, ET allows writing arbitrarily complex vector expressions, such as multi-operand linear combinations of the form (1), that do not imply temporary vector construction and do not incur any performance penalty compared to the corresponding loop-based implementation.

All expression types derive from a common base class template BaseVec<DERIVED>. Its static-cast method getCDR() extracts the embedded DERIVED object from its BaseVec<> capsule. The DERIVED template class parameter is one of the following three classes: Vec<>, VecExpr<> and VecScalExpr<>. VecExpr<> and VecScalExpr<> instances are constructed by arithmetic operators applied to BaseVec<> objects; these two classes store references to their operands. In addition, the VecExpr<> class statically defines the type of operation (+ or -) as a template parameter.

Both expression classes define an operator[] that performs the actual evaluation of the expression. Note that this evaluation is not performed when the expression objects are constructed: this lazy evaluation process avoids the temporary vectors involved in standard operator implementations. A more detailed presentation of these classes is given in [1].
Vec<T>
The class Vec<T> is the main vector class. Its implementation is classical except for the assignment operator =, which is specialized for a BaseVec<DERIVED> right-hand side. The template parameter ELEMENT_TYPE is either float or double, and the vector elements are stored in a C array of this type (data_). The operator = evaluates the value of the right operand, which can be the result of an arbitrarily complex expression.
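As an illustration of the design described above, the following minimal, self-contained sketch shows the kind of CRTP-based structure involved. Only the names BaseVec, getCDR, VecExpr, VecScalExpr and Vec, and the choice of the +/- operation as a template parameter, come from the text; the element type is fixed to double and every other identifier and simplification is ours, so this should be read as an illustration rather than the authors' actual implementation.

#include <cstddef>
#include <cstdio>

// Static +/- operation tags, selected at compile time.
struct Add { static double apply(double a, double b) { return a + b; } };
struct Sub { static double apply(double a, double b) { return a - b; } };

// CRTP base class: every vector-like expression derives from BaseVec<DERIVED>.
template <class DERIVED>
struct BaseVec {
  // getCDR() extracts the embedded DERIVED object from its BaseVec<> capsule.
  const DERIVED& getCDR() const { return static_cast<const DERIVED&>(*this); }
};

// Node representing OP(L, R); nothing is computed at construction time.
template <class L, class R, class OP>
struct VecExpr : BaseVec<VecExpr<L, R, OP> > {
  const L& l_; const R& r_;
  VecExpr(const L& l, const R& r) : l_(l), r_(r) {}
  double operator[](std::size_t i) const { return OP::apply(l_[i], r_[i]); }
};

// Node representing a * V (scalar times vector-like expression).
template <class V>
struct VecScalExpr : BaseVec<VecScalExpr<V> > {
  double a_; const V& v_;
  VecScalExpr(double a, const V& v) : a_(a), v_(v) {}
  double operator[](std::size_t i) const { return a_ * v_[i]; }
};

// Main vector class; elements are stored in a plain C array (data_).
class Vec : public BaseVec<Vec> {
  double* data_;
  std::size_t size_;
  Vec(const Vec&);            // copying omitted in this sketch
  Vec& operator=(const Vec&);
public:
  explicit Vec(std::size_t n) : data_(new double[n]), size_(n) {}
  ~Vec() { delete[] data_; }
  std::size_t size() const { return size_; }
  double& operator[](std::size_t i) { return data_[i]; }
  double operator[](std::size_t i) const { return data_[i]; }

  // Assignment from any expression: the single evaluation loop lives here.
  template <class DERIVED>
  Vec& operator=(const BaseVec<DERIVED>& rhs) {
    const DERIVED& e = rhs.getCDR();
    for (std::size_t i = 0; i < size_; ++i) data_[i] = e[i];
    return *this;
  }
};

// Operators only build expression nodes; no floating point work is done here.
template <class L, class R>
VecExpr<L, R, Add> operator+(const BaseVec<L>& l, const BaseVec<R>& r) {
  return VecExpr<L, R, Add>(l.getCDR(), r.getCDR());
}
template <class L, class R>
VecExpr<L, R, Sub> operator-(const BaseVec<L>& l, const BaseVec<R>& r) {
  return VecExpr<L, R, Sub>(l.getCDR(), r.getCDR());
}
template <class V>
VecScalExpr<V> operator*(double a, const BaseVec<V>& v) {
  return VecScalExpr<V>(a, v.getCDR());
}

int main() {
  Vec t(4), s1(4), s2(4);
  for (std::size_t i = 0; i < 4; ++i) { s1[i] = 1.0; s2[i] = 2.0; }
  t = 2.0 * s1 + 3.0 * s2;   // one pass over t, no temporary vectors
  std::printf("%g\n", t[0]); // prints 8
  return 0;
}

With these definitions, an expression such as t = 2.0*s1 + 3.0*s2 builds a small tree of lightweight expression objects and is evaluated in a single loop inside Vec::operator=, without any temporary vector.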
Expression Template Performance
A large variety of vector expressions is handled by this ET vector class implementation. In this paper we present performance measurements carried out for a limited set of vector expressions that should give a fair picture of the general performance level to be expected from the Vec implementation.
Considered Set of Vector Operations:
The vector operations that we will study in this paper are linear combinations of vectors of the form

T = a1 S1 + a2 S2 + ... + aNc SNc,

where T is a given target vector, {Si} is a set of source vectors of the same size and {ai} the corresponding set of scalar factors. These combinations are characterized by the number Nc of source vectors involved (the loop-based reference kernels for the first three cases are sketched after the list below):
• Nc = 1: unary combinations (part of the L1 BLAS API).
• Nc = 2: binary combinations.
• Nc = 3: ternary combinations.
• . . .
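As a reference point, the loop-based kernels corresponding to the first three cases can be written as follows; the function and variable names are purely illustrative.

#include <cstddef>

// Nc = 1: T = a1*S1
void unary(std::size_t n, double* T, double a1, const double* S1) {
  for (std::size_t i = 0; i < n; ++i) T[i] = a1 * S1[i];
}

// Nc = 2: T = a1*S1 + a2*S2
void binary(std::size_t n, double* T, double a1, const double* S1,
            double a2, const double* S2) {
  for (std::size_t i = 0; i < n; ++i) T[i] = a1 * S1[i] + a2 * S2[i];
}

// Nc = 3: T = a1*S1 + a2*S2 + a3*S3
void ternary(std::size_t n, double* T, double a1, const double* S1,
             double a2, const double* S2, double a3, const double* S3) {
  for (std::size_t i = 0; i < n; ++i)
    T[i] = a1 * S1[i] + a2 * S2[i] + a3 * S3[i];
}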
EXPRESSION TEMPLATE AND ATLAS BASED BLOCKED EVALUATION
In this section, we combine the previous Expression Template mechanism with a blocked copy technique. The principle is to take advantage of the high performance copy operation provided by the ATLAS library.
Vector Class Modification
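The modification essentially replaces the plain evaluation loop of the assignment operator by a blocked version built on the ATLAS copy kernel. The sketch below shows one plausible shape of such a routine, assuming that each block of the expression is first evaluated into a small cache-resident buffer and then written to the destination with the CBLAS dcopy routine provided by ATLAS; the block size, the buffer handling and the exact kernel used in [1] may differ.

#include <cstddef>
extern "C" {
#include <cblas.h>   // CBLAS interface, as provided by ATLAS
}

// Blocked evaluation of dst[i] = expr[i], i = 0..n-1.
// EXPR is any expression type exposing operator[](std::size_t).
template <class EXPR>
void BlockAssign(double* dst, const EXPR& expr, std::size_t n) {
  const std::size_t block = 1024;          // small enough to stay resident in cache
  double buffer[block];

  for (std::size_t start = 0; start < n; start += block) {
    const std::size_t len = (n - start < block) ? (n - start) : block;
    // 1) evaluate the expression block into the cache-resident buffer
    for (std::size_t i = 0; i < len; ++i) buffer[i] = expr[start + i];
    // 2) move the buffer to its destination with the tuned ATLAS copy
    cblas_dcopy(static_cast<int>(len), buffer, 1, dst + start, 1);
  }
}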
Sequential Blocked Results
Fig. 3 shows the performance results of binary and ternary linear combinations implemented via our ET Vec class, with and without blocking, and via direct loop-based C implementations. One can see that the abstract expression of the combinations does not lead to any performance penalty. Moreover, the blocked ET implementation yields a performance improvement for both binary (+25%) and ternary (+16%) vector combinations.
PARALLEL EXPRESSION TEMPLATE
In this section, we present a parallel implementation of the previous blocked Expression Template mechanism. The possibility of such an approach is mentioned in reference [10].
Vector Class Modification
The only change in the Vec class is the replacement of the sequential BlockAssign routine by an OpenMP-parallel version, sketched below. In Fig. 4 we compare this implementation to another parallel implementation of BlockAssign based on the Intel Threading Building Blocks [11].
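One possible shape of such an OpenMP-parallel routine is sketched here; the block size, the scheduling policy and the details of the authors' actual BlockAssign (in particular how it combines OpenMP with the ATLAS copy kernel) may differ. The blocks of the sequential version are simply distributed over the available threads.

#include <cstddef>
extern "C" {
#include <cblas.h>   // CBLAS interface, as provided by ATLAS
}

// OpenMP-parallel blocked evaluation of dst[i] = expr[i], i = 0..n-1.
// EXPR is any expression type exposing a thread-safe, read-only operator[].
template <class EXPR>
void ParallelBlockAssign(double* dst, const EXPR& expr, std::size_t n) {
  const std::ptrdiff_t block = 1024;
  const std::ptrdiff_t size = static_cast<std::ptrdiff_t>(n);
  const std::ptrdiff_t nblocks = (size + block - 1) / block;

  #pragma omp parallel for schedule(static)
  for (std::ptrdiff_t b = 0; b < nblocks; ++b) {
    double buffer[block];                    // per-thread, cache-resident buffer
    const std::ptrdiff_t start = b * block;
    const std::ptrdiff_t len = (size - start < block) ? (size - start) : block;
    // evaluate this block of the expression, then flush it with the tuned copy
    for (std::ptrdiff_t i = 0; i < len; ++i) buffer[i] = expr[start + i];
    cblas_dcopy(static_cast<int>(len), buffer, 1, dst + start, 1);
  }
}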
Parallel Results
Fig. 4 shows the performance improvement that this parallel blocked ET implementation entails for both binary (×2.7) and ternary (×2.6) vector combinations.
When vector sizes exceed the total size of the cache hierarchy, the performance improvement remains small compared to the number of cores involved in the computation. The main reason is that performance is then determined by the memory bandwidth and by other hardware mechanisms, such as the number of in-flight cache requests that each cache can sustain.
In Fig. 5, we use the STREAM benchmarks [12, 13] to evaluate the memory performance of our implementation on different architectures. These benchmarks correspond to OpenMP-parallel C code for a few particular vector expressions (copy, triad and daxpy, for instance) and are used here as reference code. The benchmarks are compiled with the same gcc versions and with Intel icc version 10.1.
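For reference, the core of the STREAM triad kernel used in these comparisons is essentially the following OpenMP loop (paraphrased here in C++, not the benchmark's verbatim source):

#include <cstddef>

// STREAM "triad" kernel: a[i] = b[i] + q * c[i], parallelised with OpenMP.
void stream_triad(std::size_t n, double* a, const double* b, const double* c, double q) {
  #pragma omp parallel for
  for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(n); ++i)
    a[i] = b[i] + q * c[i];
}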
Fig. 5 shows several important results:
• Memory bandwidth usage increases with the number of threads, but the increase is not linear: from 2 to 4 threads on the Xeon machine, for instance, performance increases only slightly. This is due to the fact that cores do not have uniform memory access. On each chip, four cores compete for access to memory through the same Front Side Bus. Furthermore, two cores share the same L2 cache, which can sustain only a limited number of in-flight memory requests (cache misses). This could explain the performance stall observed on the Xeon machine between 2 and 4 threads.
• The performance of the expression template code is comparable to that of the C code: there is no loss of performance, while there is a gain in the abstraction of the formulation.
• There is a difference between the performance obtained with gcc and the performance obtained with the Intel compiler. While the latter provides an upper, sustainable bound on the memory bandwidth, the investigation of the causes of this difference is left for future work.
The apparent memory bandwidth obtained does not correspond to the peak memory bandwidth of the machine, and changing the compiler (considering icc instead of gcc on the Intel machine, for instance) would probably lead to some further performance improvement. Depending on how the assembly instructions are scheduled, the cycles taken in the decoding phase of the program execution and the usage of the functional units will differ (removing possible stalls, for instance).
Finally, Fig. 6 shows the variation of memory bandwidth performance with the number of operands in the vector expressions. While on the 16-core Xeon performance improves when the number of operands increases, the opposite holds for the Opteron machine. In the Xeon case, this shows that the limit on the number of memory requests that can be in flight at a given point of the computation has not been reached, even with ternary expressions: memory requests are not serialized and there is some amount of overlap in their resolution, which accounts for the performance increase. In the Opteron case, on the contrary, some requests are serialized or introduce additional stalls. This may occur in the cache hierarchy, and a more detailed explanation requires further investigation.
Comparison with other libraries
We have carried out performance comparisons with other available expression template linear algebra libraries: armadillo [14], Blitz++ [5], DealII [15], Eigen [16], FLENS [17], genial [18], gmm++ [19], MTL4 [20], Seldon [21], uBLAS [6] and std::valarray.
CONCLUSION AND OUTLOOKS
This paper presents a new formulation for vector expressions using Expression Templates. This formulation is a short and simple way to describe vector expressions and to improve their performance. Parallelism, expressed as OpenMP code, is used in the vector class definition but does not appear at the vector expression level at all, so that the same expression may be implemented using different strategies.
