Abstract
Heterogeneity, parallelization and vectorization are key techniques to improve the performance and energy efficiency of modern computing systems. However, programming and maintaining code for these architectures poses a huge challenge due to the ever-increasing architecture complexity. Task-based environments hide most of this complexity, improving scalability and usage of the available resources. In these environments, while there has been a lot of effort to ease parallelization and improve the usage of heterogeneous resources, vectorization has been considered a secondary objective. Furthermore, there has been a swift and unstoppable burst of vector architectures across all market segments, from embedded to HPC. Vectorization can no longer be ignored, but manual vectorization is tedious, error-prone and not practical for the average programmer. This work evaluates the feasibility of user-directed vectorization in task-based applications. Our evaluation is based on the OmpSs programming model, extended to support user-directed vectorization for different SIMD architectures (i.e., SSE, AVX2, AVX512). Results show that user-directed codes match the performance and energy efficiency of manually optimized code with minimal code modifications, favoring portability across different SIMD architectures.
Introduction
While transistor shrinking allows additional features and structures to be included on the die, the increasing power density prevents the simultaneous use of all available resources. The importance of instruction-level parallelism (ILP) subsides, while data-level parallelism (DLP) becomes a critical factor in improving the energy efficiency of microprocessors. Among other features, SIMD instructions have been gradually included in microprocessors for various market segments, from mobile to high-performance computing (HPC). Each new generation includes more sophisticated, powerful and flexible instructions. The higher investment in SIMD resources per core makes extracting the full computational power of these vector units more important than ever.
From the programmer's point of view, SIMD units can be exploited in several ways, including (a) compiler auto-vectorization, (b) low-level intrinsics or assembly code and (c) programming models/languages with explicit SIMD support. Auto-vectorization in compilers has strong limitations in the analysis and code transformation phases that prevent an efficient extraction of SIMD parallelism in real applications [1]. Low-level hardware-specific intrinsics enable developers to fine-tune their applications by providing direct access to all of the SIMD features of the hardware. However, the use of intrinsics is time-consuming, tedious and error-prone even for advanced programmers. To facilitate the use of SIMD features, some programming models and languages have been extended with a new set of directives that allow programmers to guide the compiler in the vectorization process (e.g., OpenMP 4.0). This approach is high level, orthogonal to the actual code and portable across different SIMD architectures.
The OpenMP 4.0 standard supports tasking and data dependencies. The parallel execution model of OpenMP 4.0 can be conceived as a directed acyclic graph where each node is a task and each edge a dependence, which must be explicitly annotated by the programmer. Such annotations also provide the opportunity for the runtime system to automatically offload tasks to accelerators like GPUs or Intel MIC coprocessors. The runtime system is empowered to take care of data movements without the need for specific programming intervention besides annotating tasks' input and output dependencies. The runtime system may also deploy optimizations such as data prefetching or overlapping of computation and communication. It is also possible to exploit locality by allocating computation where its data reside.
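As a rough illustration of the dependence annotations described above, the following sketch chains two tasks through OpenMP 4.0 depend clauses, so that the runtime can derive the edge between them. The function names (scale, accumulate, scaled_sum) and the two-stage pipeline are hypothetical and are not taken from the evaluated benchmarks. When compiled without OpenMP support the pragmas are ignored and the code runs sequentially with the same result.

```c
#include <stddef.h>

/* Hypothetical producer stage: writes tmp[i] = k * src[i]. */
static void scale(const float *src, float *tmp, size_t n, float k)
{
    for (size_t i = 0; i < n; i++)
        tmp[i] = k * src[i];
}

/* Hypothetical consumer stage: sums the values produced by scale(). */
static float accumulate(const float *tmp, size_t n)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += tmp[i];
    return s;
}

/* Two tasks linked by a dependence on tmp: the second task may only
 * start once the first one has finished writing tmp[0:n]. */
float scaled_sum(const float *src, float *tmp, size_t n, float k)
{
    float result = 0.0f;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(in: src[0:n]) depend(out: tmp[0:n])
        scale(src, tmp, n, k);

        #pragma omp task depend(in: tmp[0:n]) depend(out: result) shared(result)
        result = accumulate(tmp, n);

        /* Wait for both tasks before leaving the region. */
        #pragma omp taskwait
    }
    return result;
}
```

The programmer only states what each task reads and writes; the ordering between the two tasks is inferred by the runtime from those annotations.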
In this article, we evaluate the efficiency of an implementation of a user-directed vectorization proposal using a task-based programming model. Our main contributions include:
• Development of a task-based version of a subset of benchmarks from the ParVec benchmark suite [2]. As discussed by the ParVec authors, benchmarks can be classified as scalable (S), resource limited (RL) and code/input limited (CI). While we agree that additional benchmarks would strengthen the evaluation, the evaluated benchmarks already cover this classification: blackscholes (S), canneal and streamcluster (RL) and swaptions (CI).
• We present the code modifications necessary to generate a user-directed code version that achieves similar performance and energy results to those obtained with manual vectorization.
• We discuss our findings and propose improvements for both the manually vectorized versions and the user-directed vectorization module in the Mercurium [3] source-to-source compiler.
This article is organized as follows. Section 2 introduces our evaluation methodology. Section 3 shows our main experimental results and discussion. Section 4 presents a brief summary of the related work on SIMD benchmarking and programming models. Finally, Sect. 5 shows our concluding remarks and future work.
Methodology
In this paper, we evaluate three versions of the codes: (a) two manually vectorized implementations, one based on pthreads [2] and one based on the OmpSs programming model [4] (labeled pthreads and OmpSs, respectively), and (b) a user-directed vectorization (labeled U.D.). Both the user-directed and OmpSs versions were developed for this paper. The same loops/functions are vectorized in all three versions. The user-directed code is compiled using the Mercurium source-to-source infrastructure [5]. Mercurium's vectorizer recognizes user annotations on the code to produce a SIMD version of the scalar code, as depicted in Fig. 1. Binaries are then built and linked using the Intel C/C++ Compiler (14.0.1) as a back-end. We also use Intel's Short Vector Math Library (SVML), the O2 optimization level and the -no-vec flag. This flag is used to isolate our results from the automatic vectorization performed by the Intel compiler. We also tested automatic vectorization on the original scalar code without any visible performance or energy improvement.
The user-directed vectorization process only requires the addition of the "pragma omp simd" construct (along with some clauses) to vectorize selected loops and functions. Most challenges of user-directed vectorization appear at the Mercurium level, so that it can automatically generate the code from the user annotations. There has been feedback during application development to correct bugs and improve on the automatic code generation quality (e.g., function vectorization).
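A minimal sketch of what such annotations can look like is given below, using the standard OpenMP 4.0 simd and declare simd directives on a hypothetical axpy kernel. The kernel and its names are illustrative only and do not come from the evaluated benchmarks; clauses such as aligned or uniform, which a real annotation might add, are left out.

```c
#include <stddef.h>

/* Scalar helper function. The "declare simd" directive asks the compiler
 * to also generate a vector variant callable from SIMD loops. */
#pragma omp declare simd
static float axpy_elem(float a, float x, float y)
{
    return a * x + y;
}

/* Annotated loop: the programmer asserts that iterations are independent
 * and may be executed in SIMD lanes. */
void axpy(float a, const float *x, float *y, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        y[i] = axpy_elem(a, x[i], y[i]);
}
```

The scalar source stays unchanged apart from the two directives, which is what keeps this approach portable across SIMD instruction sets.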
However, limitations in the current version of the Mercurium vectorizer require additional changes in the code. First, all data structures should be aligned in order to improve performance (we can generate unaligned loads, but on some platforms, like Intel's Xeon Phi Knights Corner, performance can be compromised). Second, complex "if-then-else" statements are not fully supported in SSE and AVX, due to the lack of mask operations (a mask register allows us to decide which lanes of the vector register perform the operation). There are no plans to implement this feature in the near future, since these architectures will soon become obsolete. Nevertheless, there is some basic support for ternary operators, so simple if-then-else statements need to be rewritten as ternary operators to be vectorized. Third, as in manual vectorization, it is recommended to transform data structures from array of structures (AoS) into structure of arrays (SoA) (though there is some ongoing work to automate this process).
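The second point can be sketched as follows. The example rewrites a simple if-then-else into a ternary operator, which maps naturally onto a vector select. The function names and the clamping operation are hypothetical and chosen only for illustration; alignment handling is deliberately omitted.

```c
#include <stddef.h>

/* Original form: control flow inside the loop body. */
void relu_branch(const float *x, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (x[i] > 0.0f)
            out[i] = x[i];
        else
            out[i] = 0.0f;
    }
}

/* Rewritten form: the same result expressed as a ternary operator,
 * so every iteration performs one unconditional store. */
void relu_select(const float *x, float *out, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        out[i] = (x[i] > 0.0f) ? x[i] : 0.0f;
}
```

Both functions compute the same values; only the second one exposes the condition as data rather than as control flow.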
The evaluation platform is a dual-socket E5-2603v3 processor running at 1.60GHz, with a total of 12 cores, 30MB of L3 cache and 64GB of DDR3. We use PAPI [6] to measure energy, L1D/L2/L3 cache miss rate and total instruction count. The E5-2603v3 only provides energy information for the whole socket, since the power plane 
Evaluation
This section shows performance and energy results for a subset of the ParVec benchmarks. Execution times are shown in absolute numbers in order to compare performance between versions. In addition, speedup is reported relative to the scalar, sequential configuration of each version to show scalability when varying thread count and vector length. L1D cache miss rate and total instruction count are not shown due to space limitations, but are mentioned in the discussion whenever necessary. For further details on ParVec benchmark specifics and the manual vectorization process, please refer to the work of Cebrián et al. [2].
Power dissipation remains approximately constant in all SIMD versions. Intel platforms share both floating point registers and arithmetic units for scalar and SIMD instructions. While bit toggling increases power dissipation due to the extra vector length, the pressure on the processor's front-end is reduced, and so is its power. The processor also spends more time idle, waiting for data dependencies and memory operations, and thus dissipates similar average power regardless of whether it runs scalar or SIMD code. Therefore, any performance improvement from vectorization comes "for free" in terms of power, leading to substantial energy savings.
Although we do not show results for the experiments done without prefetching due to space limitations, we briefly comment on them here; the authors can provide further details if needed. We observe very few differences in the L1 cache results between the binaries executed with and without prefetching. We attribute this behavior to the fact that we cannot interact with the prefetcher at this level; we did not find an alternative way of disabling the prefetching mechanism for the L1 cache. As for the L2 cache, the OmpSs and user-directed (U.D.) versions behave similarly to each other and differently from pthreads. For both the OmpSs and U.D. versions, the number of accesses is significantly lower without prefetching, while the absolute number of misses is much higher. These two facts are not surprising and simply confirm that the prefetcher for that cache level works well. On the other hand, the pthreads version has significantly fewer accesses than the other two versions.
That can be explained by the way the OmpSs runtime works, creating the dependence task graph and incrementing the number of memory accesses. Moreover, when we disable the prefetching mechanism, we see that most of the accesses are misses. Results for the L3 cache level are similar to the ones for L2 and can be explained by the same factors.
For blackscholes, prefetching has a significant impact on performance (and consequently on energy efficiency). The speedup per thread of the native inputs on the pthreads version compared to the scalar OmpSs version goes up from
Overall, the binaries executed with the prefetching mechanism enabled obtain a better energy reduction, as memory accesses result in a higher number of hits. Power consumption stays approximately constant as the register size varies. Combined with the instruction count reduction mentioned previously, this yields an energy reduction proportional to the speedup.
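The proportionality between energy reduction and speedup follows from a short argument, sketched here under the assumption, taken from the observations in this section, that average power is roughly the same for scalar and SIMD runs. The symbols $\bar{P}$ (average power), $t$ (execution time) and $S$ (speedup) are introduced only for this illustration.

```latex
E = \bar{P}\, t,
\qquad
\frac{E_{\text{scalar}}}{E_{\text{SIMD}}}
  = \frac{\bar{P}_{\text{scalar}}\; t_{\text{scalar}}}
         {\bar{P}_{\text{SIMD}}\; t_{\text{SIMD}}}
  \approx \frac{t_{\text{scalar}}}{t_{\text{SIMD}}}
  = S
\quad \text{when } \bar{P}_{\text{scalar}} \approx \bar{P}_{\text{SIMD}}.
```

Any deviation from constant average power, for example runtime overhead from spinning threads, shows up as a gap between the measured energy reduction and the measured speedup.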
Blackscholes
The blackscholes benchmark shows almost linear scalability with both thread count and vector length (Fig. 2). This is mainly because of the high arithmetic intensity of the benchmark (computations per loaded data element) and the low L1D cache miss rate. Instruction count is also reduced linearly with vector length, meaning that we are vectorizing most of the application code.
Pthreads and OmpSs versions have the BlkSchlsEqEuroNoDiv and CNDF functions vectorized manually, accounting for roughly 50 lines of intrinsics code per instruction set (SSE, NEON, AVX). In addition, some of the data structures have been aligned. Furthermore, after the if-then-else to ternary operator conversion and a complete data alignment, the user-directed version only requires a single directive per function and loop to vectorize all 50 lines of code. C standard math library calls are replaced with Intel's Short Vector Math Library calls in all codes, so that we can further improve performance.
Energy is reduced by close to a factor of three for OmpSs and five (per thread) for pthreads (as shown in Fig. 2); the gap is most likely due to the overhead of the Nanos++ runtime when distributing work among threads. This leads to an energy reduction of 40x when running on 12 threads. Finally, it is worth mentioning that Nanos++ has an additional energy overhead when using one and two sockets, because threads spin while searching for work.
Canneal
We use a clustered version of this benchmark to increase computational requirements, thus easing the vectorization process. Scalability when increasing thread count is clearly linear (Fig. 3) for the task-based versions and almost linear for the pthreads version. Performance also increases linearly with the vector length for the manually vectorized versions. For this application, we performed an array-of-structures (AoS) to structure-of-arrays (SoA) data conversion for all three versions. However, the AVX user-directed version does not achieve the same performance and instruction count reduction. This is because the user-directed version for AVX is similar to that of SSE (except for the wider registers). By contrast, the hand-vectorized version uses a more sophisticated and more efficient strategy for AVX.
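The AoS to SoA conversion mentioned above can be sketched as follows. The record layout (a two-field location) and the kernel are hypothetical and are not the data structures used in canneal; the point is only that, after the conversion, each field is stored contiguously, so a SIMD loop can load consecutive elements with unit stride.

```c
#include <stddef.h>

/* AoS: one record per element, fields interleaved in memory. */
typedef struct { float x; float y; } loc_aos;

/* SoA: one array per field, each field contiguous in memory.
 * The caller owns the storage behind both pointers. */
typedef struct { float *x; float *y; } loc_soa;

/* One-time layout conversion from AoS to SoA. */
void aos_to_soa(const loc_aos *src, loc_soa *dst, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst->x[i] = src[i].x;
        dst->y[i] = src[i].y;
    }
}

/* Hypothetical kernel over the SoA layout: Manhattan distance of each
 * location to the origin, written with selects instead of branches. */
void manhattan_to_origin(const loc_soa *p, float *out, size_t n)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        float ax = (p->x[i] < 0.0f) ? -p->x[i] : p->x[i];
        float ay = (p->y[i] < 0.0f) ? -p->y[i] : p->y[i];
        out[i] = ax + ay;
    }
}
```

With the AoS layout the same kernel would need strided or gathered loads to fill a vector register with x values, which is typically less efficient.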
The 16x performance gain translates to energy improvements by a factor of 10x when using AVX for 12 threads. This is 4x more than when only relying on threading (Fig. 3) . The use of SIMD instructions increases the miss rate of the L1D cache by a factor of 4 for SSE instructions and by a factor of 6 for AVX instructions, becoming a significant bottleneck for scalability as we increase the vector size.
Streamcluster
This application has almost linear scalability up to 6 threads (Fig. 4). Using both sockets of the machine prevents it from reaching full performance due to the significant number of synchronization barriers, since message flow between different sockets is slower than within the same socket. The dist function has been vectorized because it accounts for most of the execution time, although it contains only a small set of arithmetic operations. As a consequence, scalability depends on whether the memory subsystem can keep up. Performance increases by a factor of 1.8x and 2.1x when using SSE and AVX instructions, respectively, relative to the scalar version. This means that we are unable to fully benefit from AVX due to the memory subsystem.
While energy scales linearly with thread count until the socket is fully used, performance scalability issues with vector size translate into sub-linear energy improvements (5x for AVX with 6 threads, compared to 3x without SIMD). Consumption, speedup and energy reduction when using two sockets are explained by the time spent in inter-socket barrier synchronization.
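To illustrate why a distance kernel of this kind tends to be limited by memory rather than by arithmetic, the following sketch shows a squared Euclidean distance annotated with a SIMD reduction. It is a simplified stand-in written for this illustration, not the benchmark's actual dist implementation, and the name sq_dist is hypothetical.

```c
#include <stddef.h>

/* Squared Euclidean distance between two points of dimension dim.
 * The reduction clause lets each SIMD lane keep a partial sum that is
 * combined when the loop finishes. */
float sq_dist(const float *a, const float *b, size_t dim)
{
    float sum = 0.0f;

    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < dim; i++) {
        float d = a[i] - b[i];  /* two loads feed one subtract ... */
        sum += d * d;           /* ... one multiply and one add    */
    }
    return sum;
}
```

Each iteration performs only three arithmetic operations for two loaded values, so wider vectors mostly increase the demand on the memory subsystem.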
Swaptions
Finally, the swaptions benchmark shows linear scalability with thread/task count, but limited scalability with vector size (Fig. 5). In fact, instruction count is reduced by a factor of 1.75x and 2.5x for SSE and AVX, respectively (down from a potential 4x for SSE and 8x for AVX). In addition, swaptions performance is also limited by multiple data dependencies and a high L1D cache miss rate (1.5% for scalar, 2.5% for AVX).
This application has several complex if-then-else statements that the compiler vectorizer is unable to translate efficiently, so we decided to manually predicate the code. In this way, arithmetic operations are executed unconditionally and conditions are only used on memory operations. This causes a significant increase in the execution time of the scalar user-directed version, but makes the SSE and AVX versions scale better with vector length (as compared with the manual vectorization with if-then-else statements). We plan to rewrite the manual vectorization using the same strategy in future work. The user-directed code saves around 20 lines of intrinsic-based code plus a large vector initialization phase that is specific to the target instruction sets (SSE and AVX, around 40 lines of code), making it suitable for any other architecture.
Energy follows performance closely (Fig. 5), again experiencing a moderate increase in average power due to the runtime system. The increase in vector scalability due to predication of the code yields moderate energy benefits for SSE and AVX on the user-directed code. AVX instructions manage to almost double the energy efficiency of the application.
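The predication strategy described above can be sketched on a small hypothetical loop; the update rule and all names below are illustrative and not taken from swaptions. In the predicated form the arithmetic is computed for every element, and the condition only decides which value is written back.

```c
#include <stddef.h>

/* Branchy form: both the arithmetic and the store sit behind the condition. */
void update_branch(const float *v, float *out, size_t n, float thr, float k)
{
    for (size_t i = 0; i < n; i++) {
        if (v[i] > thr)
            out[i] = v[i] * k + thr;
    }
}

/* Predicated form: the arithmetic is executed unconditionally and the
 * condition is applied only when selecting the value to store. */
void update_predicated(const float *v, float *out, size_t n, float thr, float k)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++) {
        float r = v[i] * k + thr;            /* computed for every element */
        out[i] = (v[i] > thr) ? r : out[i];  /* keep old value otherwise   */
    }
}
```

The predicated loop does redundant work for elements that fail the condition, which is consistent with the slowdown of the scalar version noted above, but it removes the control flow that hinders vectorization.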
Related work
New programming models and languages with support for vector data and vector operations have emerged in recent years. Among them, the most popular are OpenCL and CUDA, which define vector data types to describe vector operations that will be automatically vectorized by the compiler. In addition, ISPC [7] is an SPMD programming model that allows programmers to natively write applications with SIMD parallelism in mind. Chorus [8] extends C with the map and fold functions commonly used in functional programming, but targets them at vector operations. Although these approaches could yield good performance and more portability, they sometimes require not only deeper code rewriting but also a full redesign of the applications.
On the other hand, user-directed vectorization exploits the power of SIMD instructions with minimal code modifications. There are several proposals based on compiler directives that allow programmers to indicate loops that should be vectorized by the compiler. The SIMD extensions defined in OpenMP 4.0 and OpenACC 2.0 are two examples. In this way, programmers can guide the vectorization of their code and benchmark different vector versions by introducing simple annotations in their applications. These approaches require little effort from the programmer, yet still provide very good performance across platforms. Our work is also in this direction, but we define SIMD extensions in the context of the OpenMP programming model. Our extensions also allow annotating a loop as safely vectorizable, but we go a step further: we define the interaction between SIMD parallelism and fork-join parallelism in an integrated parallel approach. Nowadays, OpenMP includes SIMD extensions to express SIMD parallelism. These extensions are the result of our proposal and Intel's proposal. We collaborated in the definition of the extensions that are now part of the 4.0 standard, in the same way that OmpSs was the building block for the task support in OpenMP.
Regarding SIMD benchmarking, Molka et al. [9] discuss weaknesses of the Green500 list with respect to ranking HPC system energy efficiency. They introduce their own benchmark using a parallel workload generator and SIMD support to stress the main components of an HPC system, but do not consider a task-based scenario. Kim et al. [10] show how blocking, vectorization and minor algorithmic changes can speed up applications close to the best known tuned version. We want to evaluate whether task-based benchmarks with user-directed vectorization can behave in the same way. The RODINIA [11] and ALPBench [12] benchmark suites also offer limited SIMD support, but not in the scenario that we explore. Cebrián et al. [2] extended the PARSEC benchmark suite to add SIMD support using manual vectorization. We use these benchmarks as our starting point to build user-directed/task-based versions and compare them in terms of performance and energy. The goal is to produce code of similar quality without the need for low-level programming.
Conclusions and future work
The main contribution of this paper is a comparison of different vectorization strategies and scenarios that would ease the vectorization of end-user applications and libraries. Our goal is to validate whether user-directed vectorization, that is, annotating the code and using the Mercurium vectorizer tool (or the equivalent ICC/OpenMP4.5 support), can produce code of similar quality to that obtained by manual vectorization. We show results for four of the benchmarks from the ParVec benchmark suite, parallelized with both the OmpSs and pthreads programming models. The Mercurium source-to-source infrastructure is used to obtain a SIMD version of the application based on annotated scalar code. We use the manually vectorized versions as a reference against which to assess the results of the user-directed codes.
Vectorization strategies are the same for user-directed and manual vectorization, since both try to keep the applications as close as possible to the scalar implementation. The main difference comes from the user-directed vectorization being unable to capture some vectorization opportunities. Since we had the manual vectorization available, we could improve the Mercurium vectorizer to produce better quality code, up to the point where both achieve similar performance without the additional complexity of low-level programming.
The applications show good energy scalability with vector length, especially blackscholes and canneal. The main reason for that is the reduction of executed instructions and memory accesses with respect to the scalar versions. The combination of the task-based benchmarks and user-directed vectorization provides a similar reduction in runtime while keeping power consumption approximately the same. This is explained by the fact that Intel architectures share registers and arithmetic units for scalar and SIMD instructions. Power dissipation increases with vector length due to additional bit toggling, but the processor spends more time idle waiting for the memory subsystem. As a consequence, average power remains similar, while energy is reduced superlinearly with runtime.
We have shown that scalable applications running with 12 threads can achieve energy improvements of up to 40x (blackscholes), while other applications can still benefit from up to 5x (streamcluster). As a result, we can confirm that vectorization and parallelization are key techniques to improve energy efficiency.
As vector support is already present in most commodity processors, we can reduce energy simply by learning how to vectorize our applications in a comfortable manner. Both parallelization and vectorization can be achieved without the need for low-level programming, by using task and vector annotations (pragmas) on the code. In addition, user-directed vectorization preserves the abstraction layer over the underlying architecture, making the code portable between architectures and saving many lines of intrinsics code, though it may require reorganizing or modifying the code.
As future work, we aim to improve energy awareness for runtime systems running on multisocket machines. We will also extend the evaluation of user-directed performance to other applications, while extending Mercurium with additional features. Finally, we would like to replace Intel's Short Vector Math Library (SVML) with a more generic one in order to support other ISAs (e.g., ARM® NEON technology).
