Abstract-The degree of data-level parallelism (DLP) in applications is not fixed and varies with the computational characteristics of each application. In contrast, most processors today include single-width SIMD (vector) hardware to exploit DLP. Single-width SIMD architectures may not be optimal for applications with varying DLP, and they may cause performance and energy inefficiency. We propose the use of VLIW processors with multiple native vector-widths to better serve applications with changing DLP. SHAVE is an example of such a VLIW processor and provides hardware support for native 32-bit and 128-bit wide vector operations. This paper investigates and implements mixed-length SIMD code generation support for the SHAVE processor. More specifically, we target generating 32-bit and 128/64-bit SIMD code for the native 32-bit and 128-bit wide vector units of the SHAVE processor. In this way, we improve the performance of compiler-generated SIMD code by reducing the number of overhead operations and by increasing the SIMD hardware utilization. Experimental results demonstrate that our methodology, implemented in the compiler, improves the performance of synthetic benchmarks by up to 47%.
I. INTRODUCTION
Massive parallelism in numerous video, image and signal processing applications mostly shows up in the form of data-level parallelism (DLP). DLP is typically exploited through special single-instruction multiple-data (SIMD) hardware units. The SIMD model of computation and its related architecture are a proven, effective technique for improving computational efficiency [1]-[4]: applying a single instruction to multiple parallel processing elements keeps the hardware simple, which yields low power and high performance. As shown in [1], adding SIMD execution units to the base processor improves performance by 10x and energy efficiency by 7x for a 720p HD H.264 encoder. SIMD is adopted in many modern processor architectures targeting the embedded, desktop, graphics and supercomputing domains.
Application parallelism to be exploited inside a single processor also appears in the form of instruction-level parallelism (ILP). ILP is exploited by issuing scalar operations and SIMD operations in parallel. To exploit both ILP and DLP, the very-long instruction-word (VLIW) architecture is one of the most commonly used types of processor architecture in many mobile [5]-[7] and embedded platforms. However, the degree of parallelism in applications is not fixed and varies with the computational characteristics of applications or application parts. A DLP analysis of multimedia and signal processing algorithms in [8] shows that many algorithms have different natural SIMD widths (number of elements): 4, 8, 16, 96, 1024. Moreover, an analysis of multimedia applications in [9] shows that SIMD width requirements change over the run-time of the same application. On the other hand, most architectures support a single vector-width, such as the 128-bit Intel SSE, ARM NEON and PowerPC Altivec SIMD architectures. A single vector-width may not be optimal for applications which exhibit smaller DLP than the architecture's vector-width. In such cases, either only a part of the SIMD hardware unit is used to perform the computation, or the smaller DLP is first converted to match the vector-width of the architecture before the actual computation. Both cases result in performance and/or energy inefficiency.
In our previous work [10], we proposed the use of VLIW architectures with multiple native vector-widths to better serve applications with varying DLP. The SHAVE (Streaming Hybrid Architecture Vector Engine) VLIW vector processor [11] is an example of such an architecture. SHAVE is a unique VLIW processor that provides hardware support for native 32-bit and 128-bit wide vector operations. This industrial processor is used as an experimentation and demonstration engine for our multiple vector-width methodology.
The research reported in this paper focuses on mixed-length SIMD code generation to efficiently exploit the multiple-width vector operations. The contributions made in this work are twofold. First, we implemented SIMD code generation support in the SHAVE compiler for the native 32-bit wide vector units. As a result of this work, the new compiler is able to generate mixed-length SIMD code. More specifically, we target generating 32-bit (short) SIMD code for the native 32-bit wide vector units in parallel to the 128/64-bit (long/medium) SIMD code for the native 128-bit wide vector units of the SHAVE processor. In this way, we aim at improving the performance and energy efficiency of compiler-generated SIMD code by reducing the number of overhead operations and by increasing the SIMD hardware utilization via better exploitation of ILP. We demonstrate that our compiler improves the performance of synthetic benchmarks by up to 47%.
The paper is structured as follows. In the next section, background information is given. In Section III, limitations of the current SHAVE compiler are discussed and our proposed solution is presented. We further motivate the short and mixed-length SIMD in Section IV. Section V explains our contributions to the compiler for enabling the short SIMD code generation. Sections VI and VII focus on experiments and discuss their results. Related research is discussed in Section VIII. Finally, Section IX concludes the paper.
II. BACKGROUND
In this section, background information on the SHAVE vector processor, the SIMD capability of SHAVE, and vector data classification is provided.
A. SHAVE Vector Processor
The SHAVE VLIW vector processor shown in Figure 1 contains wide and deep register files combined with a variable-length long instruction-word (VLLIW). The function units in the data-path are the predicated execution unit (PEU), branch and repeat unit (BRU), two 64-bit wide load-store units (LSU0/1), 128-bit wide vector arithmetic unit (VAU), 32-bit wide scalar arithmetic unit (SAU), 32-bit wide integer arithmetic unit (IAU) and 128-bit wide compare move unit (CMU). Each function unit is enabled separately by a header in the variable-length instruction. The additional two blocks in Figure 1 are the instruction decode (IDC) and debug control unit (DCU). The function units are fed with operands from a 128-bit x 32-entry vector register file (VRF) with 12 ports, a 32-bit x 32-entry scalar register file (SRF) with 12 ports and a 32-bit x 32-entry integer register file (IRF) with 17 ports. The function units in the VAU accept operands from VRF. Similarly, units in the IAU accept operands from IRF, while units in the SAU accept operands from both IRF and SRF.
Data and instructions reside in a shared connection matrix (CMX) memory block and data is moved between peripherals, processors and memory via a bank of software-controlled DMA engines. A LEON3 RISC processor controls the whole multi-processor system and manages the initialization and termination of the kernels on SHAVE.
B. SIMD Capability of SHAVE
VLLIW packets control multiple function units which have SIMD capability for highly parallel and high throughput execution at the function unit and processor level. Vector function units are designed to support vector arithmetic at different precision levels. Function units in VAU are designed to support 128-bit vector arithmetic of (signed/unsigned) 8-bit/16-bit/32-bit integer, and IEEE-compatible 16-bit and 32-bit floating-point. Half and quarter of the function units in VAU can also be used for 64-bit and 32-bit vector arithmetic. Function units in SAU are designed to support 32-bit vector 
C. Vector Data Classification
An SIMD instruction performs the same operation on multiple data elements of the same type. Register contents are used as operands of an SIMD operation and are considered as vectors of elements. Therefore, packing of multiple data elements of the same type in a register is required before performing an SIMD instruction. In SHAVE, IRF, SRF and VRF registers are used to store the operands of SIMD operations. The contents of the IRF, SRF and VRF can be composed of multiple elements of different plain data as shown in Table I. The leftmost column in Table I lists plain data formats which involve both integer and floating-point data. Integer data is represented in (signed or unsigned) 8/16/32-bit binary formats. The 32-bit format represents a wide range of data, while 16-bit and 8-bit formats usually provide sufficient precision to represent audio and video data, respectively. Floating-point data is represented in 16/32-bit binary formats. The 32-bit format is used to represent single-precision floating-point data, while the 16-bit (a.k.a. half) format represents half-precision floating-point data. Composition of multiple plain data into one vector results in vector data. We further classify vector data as long, medium and short according to their bit-widths.
Long (128-bit) Vector Data: 128-bit data is classified as long vector data. As shown in the second leftmost column of Table I, possible combinations of long vector data are 16 x {i8, u8}, 8 x {i16, u16, f16} and 4 x {i32, u32, f32}. VRF registers are used to store the long vector data.
Medium (64-bit) Vector Data: 64-bit data is classified as medium vector data. As listed in the middle column of Table I, possible combinations of medium vector data are 8 x {i8, u8}, 4 x {i16, u16, f16} and 2 x {i32, u32, f32}. Half of the VRF registers are used to store the medium vector data.
Short (32-bit) Vector Data: Finally, 32-bit vector data is classified as short. As presented in the rightmost column of Table I, possible combinations of 32-bit short vector data are 4 x {i8, u8} and 2 x {i16, u16, f16}. SRF is used to store the floating-point vector data, while IRF is used to store the integer vector data. A quarter of VRF is also used to store both the floating-point and integer vector data.
The original compiler targeting the SHAVE processor is capable of SIMD code generation for 128-bit and 64-bit vector operations. The compiler accepts the long and medium vector data as supported types. However, there is no support in the compiler for 32-bit SIMD code generation using the native 32-bit vector operations, even though the SAU and a quarter of the VAU function units are designed to support 32-bit vector operations. Therefore, several type conversions are required to handle the short vector data types. Currently, the compiler promotes the short vector data type to a longer data type before a vector computation. After the computation, the result is converted back to the short vector data type. This is exemplified in Figure 2. The two input short vector data (v0, v2) of 4 x i8 type are first promoted to the medium vector data (v1, v3) of 4 x i16 type. After the 64-bit vector computation on VAU, the result data (v4) of 4 x i16 type is converted back to its original 4 x i8 vector type (v5). In this example, only the useful computation is carried out by VAU, and the data type conversion operations are considered overhead operations. Therefore, only one out of four operations is a useful operation. All short vector data is handled in the same manner and promoted to a supported medium or long vector data type.
In order to reduce the number of overhead operations and to increase the parallel SIMD hardware utilization, we propose to add appropriate compiler support for 32-bit SIMD code generation for the short vector data. As mentioned earlier, the SAU and a quarter of the VAU function units are designed to support 32-bit vector operations. However, we initially chose to generate 32-bit SIMD code only for SAU, in order to utilize both SIMD hardware units at the same time and to avoid the energy cost of using more VAU hardware than needed. In this way, we aim at generating 32-bit (short) SIMD code for SAU in parallel to the 128/64-bit (long/medium) SIMD code for VAU. Such mixed-length SIMD code generation contributes to the performance improvement and energy savings. The following section gives more insights into the short and mixed-length SIMD.
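The promote-compute-truncate sequence of Figure 2 can be mimicked in a few lines of Python. This is only an illustrative sketch: the helper names (`to_signed8`, `add_via_promotion`, `add_native_i8`) are hypothetical and model lane arithmetic on Python lists, not SHAVE instructions.

```python
def to_signed8(x):
    """Interpret the low 8 bits of x as a two's complement i8 value."""
    x &= 0xFF
    return x - 0x100 if x >= 0x80 else x

def to_signed16(x):
    """Interpret the low 16 bits of x as a two's complement i16 value."""
    x &= 0xFFFF
    return x - 0x10000 if x >= 0x8000 else x

def add_via_promotion(v0, v2):
    """Emulate the original flow: promote 4 x i8 to 4 x i16, add, truncate back."""
    ops = []
    v1 = [to_signed16(to_signed8(e)) for e in v0]   # sign extension (overhead)
    ops.append("promote")
    v3 = [to_signed16(to_signed8(e)) for e in v2]   # sign extension (overhead)
    ops.append("promote")
    v4 = [to_signed16(a + b) for a, b in zip(v1, v3)]  # 64-bit vector add (useful)
    ops.append("add")
    v5 = [to_signed8(e) for e in v4]                # truncation (overhead)
    ops.append("truncate")
    return v5, ops

def add_native_i8(a, b):
    """Emulate a direct 32-bit (4 x i8) vector add with 8-bit wrap-around."""
    return [to_signed8(x + y) for x, y in zip(a, b)]

result, ops = add_via_promotion([100, -128, 5, -1], [100, -1, 7, 1])
print(result, ops)
print(add_native_i8([100, -128, 5, -1], [100, -1, 7, 1]))
```

Both paths produce the same lanes, but the promoted path needs four operations of which only the add is useful, which is the one-in-four ratio noted for Figure 2.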
IV. SHORT AND MIXED-LENGTH SIMD
In this section, we further explain how the code generation support for short and mixed-length SIMD contributes to the performance and energy efficiency by reducing the number of overhead operations and increasing the SIMD hardware utilization.
A. Reducing Overhead Operations
A motivating example is used to demonstrate the benefits of adding SIMD code generation support to a compiler for the short vector data. Listing 1 presents an example of the LLVM intermediate representation (IR). LLVM-IR is a strongly typed language, and an instruction in IR is explicitly associated with a type. Listing 1 includes two load instructions, one add instruction and one store instruction applied to data of 4 x i8 vector type. For the sake of simplicity, the IR code for the stack maintenance is omitted.
; adds the vectors
%z = add <4 x i8> %x, %y
; stores the result vector back to stack
store <4 x i8> %z
}
The schedule of the original compiler-generated assembly for the IR code in Listing 1 is presented in Table II. In the first cycle, the data is loaded (LD32) to IRF from memory with two parallel LSU load instructions. Then, the two input data are copied (CPIV) from IRF to VRF with CMU instructions in the second and third cycles. Since there is no support for 32-bit vector addition, the input data is sign extended (CPVV) from type 4 x i8 to 4 x i16. This way, the data type becomes eligible for the 64-bit vector addition (ADD), which is carried out in the sixth cycle on VAU. After the computation, the result is converted/truncated (CPVV) back to its original type in the next two cycles. Finally, the result is copied (CPVI) from VRF to IRF and stored (ST32) back to memory in the tenth cycle.
Blue and red colored instructions in Table II are the overhead instructions. Blue colored load and store instructions are inevitable, but the red colored operations can be eliminated. Table III represents the schedule of the generated assembly for the IR code in Listing 1 after adding our SIMD code generation support for short vectors in the compiler. In the first cycle, the data is loaded to IRF from memory with two parallel LSU load instructions. Then, 32-bit vector addition is carried out on SAU in the second cycle directly using the data from the IRF. Finally, the result is stored back to memory in the third cycle. As it can be seen from the schedule, all the red colored overhead operations in Table II are eliminated. In total, 7 overhead operations are eliminated and the cycle count is reduced from 10 to 3 (70% improvement). Besides reduction of overhead operations and related performance improvement, the short SIMD code generation leads to a more compact assembly code which results in a program size reduction. Reduction in both the program size and operation count contributes to the energy saving.
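The comparison between the two schedules can be reproduced with a small model. The sketch below is an assumption-laden reconstruction: the cycle placement follows the prose description of Tables II and III, while the unit assigned to CPVV and the LSU used for the store are guesses made for illustration.

```python
# Each schedule is a list of cycles; each cycle holds (unit, operation, kind).
# Placement is reconstructed from the text, not copied from Tables II/III.
ORIGINAL = [
    [("LSU0", "LD32", "memory"), ("LSU1", "LD32", "memory")],  # cycle 1
    [("CMU", "CPIV", "overhead")],   # cycle 2: copy IRF -> VRF
    [("CMU", "CPIV", "overhead")],   # cycle 3: copy IRF -> VRF
    [("CMU", "CPVV", "overhead")],   # cycle 4: sign extend 4 x i8 -> 4 x i16
    [("CMU", "CPVV", "overhead")],   # cycle 5: sign extend 4 x i8 -> 4 x i16
    [("VAU", "ADD", "useful")],      # cycle 6: 64-bit vector addition
    [("CMU", "CPVV", "overhead")],   # cycle 7: convert back to 4 x i8
    [("CMU", "CPVV", "overhead")],   # cycle 8: convert back to 4 x i8
    [("CMU", "CPVI", "overhead")],   # cycle 9: copy VRF -> IRF
    [("LSU0", "ST32", "memory")],    # cycle 10
]

SHORT_SIMD = [
    [("LSU0", "LD32", "memory"), ("LSU1", "LD32", "memory")],  # cycle 1
    [("SAU", "ADD", "useful")],      # cycle 2: 32-bit vector addition from IRF
    [("LSU0", "ST32", "memory")],    # cycle 3
]

def count_kind(schedule, kind):
    """Count the operations of a given kind across all cycles."""
    return sum(1 for cycle in schedule for (_, _, k) in cycle if k == kind)

def cycle_reduction(before, after):
    """Relative reduction in cycle count between two schedules."""
    return (len(before) - len(after)) / len(before)

print(len(ORIGINAL), count_kind(ORIGINAL, "overhead"))      # cycles, overhead ops
print(len(SHORT_SIMD), count_kind(SHORT_SIMD, "overhead"))
print(f"{cycle_reduction(ORIGINAL, SHORT_SIMD):.0%}")
```

Under these assumptions the model yields 10 cycles with 7 overhead operations for the original schedule and 3 cycles with none for the short SIMD schedule, matching the figures quoted in the text.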
The examples shown in Fig. 2 and Table II are simplified and show worst-case situations. Therefore, the share of overhead operations is high, around 70-75% of all operations. However, we expect a lower overhead ratio in real cases. One reason is that a load instruction requires 6 delay slots before writing data to a register file. This introduces 5 NOP (no operation) instructions after the load instruction. The NOPs are not shown in Table II, and adding them to the assembly code results in a more realistic 46% (7 out of 15 cycles) overhead ratio. Furthermore, complex operation pattern selection can hide several conversion operations. For instance, extended load (load with type extension/promotion) and truncated store (store with truncation) are examples of such complex patterns. Moreover, if an overhead operation is scheduled either in parallel with another operation or in a delay slot of a multi-cycle operation, eliminating it may not reduce the total cycle count and may even reduce the average ILP.
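The overhead ratios quoted above follow from simple fractions. The short sketch below recomputes them; the function name `overhead_share` is hypothetical, and the inputs are the counts stated in the text.

```python
from fractions import Fraction

def overhead_share(overhead, total):
    """Fraction of operations (or cycles) that are overhead."""
    return Fraction(overhead, total)

fig2 = overhead_share(3, 4)             # Fig. 2: 3 conversions out of 4 operations
table2 = overhead_share(7, 10)          # Table II: 7 overhead out of 10 cycles
with_nops = overhead_share(7, 10 + 5)   # Table II plus the 5 load-delay NOPs

for name, share in [("Fig. 2", fig2), ("Table II", table2), ("with NOPs", with_nops)]:
    print(f"{name}: {share} = {float(share):.3f}")
```

The last value, 7/15, is approximately 0.467, which the text reports as 46%.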
B. Increasing SIMD Hardware Utilization
Function units in VAU and SAU support parallel execution of SIMD operations. However, the original compiler schedules all the SIMD operations on VAU and does not exploit any other available parallel SIMD hardware. With mixed-length SIMD code generation support, we aim at increasing the SIMD hardware utilization via a better exploitation of SIMD operation parallelism.
Listing 2: An IR with short and medium vector types
define <4 x i8> @main(<4 x i8> %a, <4 x i8> %b, <8 x i8> %x, <8 x i8> %y, <8 x i8> * %zptr){
  %c = add <4 x i8> %a, %b
  %z = add <8 x i8> %x, %y
  store <8 x i8> %z, <8 x i8> * %zptr
  ret <4 x i8> %c
}
Listing 2 provides an example of IR code that applies two addition operations on data of short and medium vector types in the same function. Since there is no dependency between the two addition operations, they are eligible for parallel execution. Listing 4 presents the assembly code generated by the original compiler for the IR code in Listing 2. Instructions in the assembly code follow the convention shown in Listing 3. The lack of code generation support for short vectors results in the promotion of the short vector of 4 x i8 type to 4 x i16, as seen in Listing 4. As a result, two independent operations are scheduled sequentially on the same hardware resource. On the other hand, Listing 5 presents mixed-length SIMD code for the same IR code in Listing 2, generated by the compiler with short SIMD code generation support. The double pipe sign (||) in the assembly code denotes parallel execution of instructions. As can be seen from the assembly code, the two independent addition operations (VAU.ADD and SAU.ADD) are scheduled in parallel on two available hardware resources. Moreover, in addition to the increased hardware utilization, the energy consumption of the vector computation on the 32-bit wide hardware is expected to be much lower than on the 128-bit wide hardware (even when only a part of the wider hardware is used for useful computation).
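The scheduling effect described for Listing 2 can be illustrated with a toy scheduler. This is a minimal sketch under strong assumptions: operations are independent, each unit issues one operation per cycle, and conversion or copy operations are ignored. The function and field names are hypothetical and do not reflect the SHAVE compiler's scheduler.

```python
def schedule(ops, unit_for):
    """Greedy in-order scheduling: place each op in the first cycle whose unit is free."""
    cycles = []  # each cycle maps a function unit to the op it executes
    for op in ops:
        unit = unit_for(op)
        for cycle in cycles:
            if unit not in cycle:
                cycle[unit] = op["name"]
                break
        else:
            cycles.append({unit: op["name"]})
    return cycles

# The two independent additions of Listing 2, tagged with their vector width.
OPS = [
    {"name": "add <4 x i8>", "bits": 32},
    {"name": "add <8 x i8>", "bits": 64},
]

def original_units(op):
    """Original compiler: every SIMD operation is mapped to VAU."""
    return "VAU"

def mixed_units(op):
    """Mixed-length compiler: 32-bit vector operations go to SAU, wider ones to VAU."""
    return "SAU" if op["bits"] == 32 else "VAU"

print(schedule(OPS, original_units))  # two cycles, both on VAU
print(schedule(OPS, mixed_units))     # one cycle, SAU and VAU in parallel
```

With a single target unit the two additions serialize, whereas mapping the short vector operation to SAU lets both issue in the same cycle.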
V. COMPILER SUPPORT FOR SHORT SIMD CODE GENERATION
An LLVM [12] based commercial compiler targeting code generation for the SHAVE processor is used as the base compiler. This section explains the implementation work required to support the short SIMD code generation. This work enables the compiler to generate mixed-length SIMD code. The new compiler is able to generate 32-bit SIMD code for the short vector types that can be executed on SAU in parallel with the 128/64-bit SIMD code that is executed on VAU.
Three-phase Compiler Flow: The LLVM compiler flow consists of the front-end, middle-end and back-end phases. The front-end and middle-end provide target-independent code optimizations and transformations, while the back-end enables the target-dependent code generation. The short SIMD code generation requires additional features to be implemented in the compiler back-end. The compiler back-end comprises the data type and operation legalization, instruction selection, register allocation, instruction scheduling and code emission phases. The following subsections explain the features added to the compiler back-end in order to support the short SIMD code generation.
Type Legalization: First of all, the signed integer 4 x i8 and 2 x i16, and floating-point 2 x f16 LLVM-IR vector types are registered as supported (legal) vector types in the compiler back-end. Adding new vector types to the back-end requires associating the types with register files. Therefore, the integer types are associated with IRF, while the floating-point type is associated with SRF. This implies that the short integer vector data can be stored in IRF registers, whereas the short floating-point data can be stored in SRF registers. Moreover, all short vector types are associated with a quarter of VRF. The register class association information is further used for the register allocation.
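The type-to-register-file association described above can be pictured as a small lookup table. The following is a Python model for illustration only, not the back-end code: the names `LEGAL_SHORT_TYPES`, `bit_width`, `register_classes` and the label `"VRF/4"` are assumptions introduced here.

```python
# Short vector types registered as legal, with their primary register file
# (as stated in the text: integer types -> IRF, floating-point type -> SRF).
LEGAL_SHORT_TYPES = {
    "4 x i8": "IRF",
    "2 x i16": "IRF",
    "2 x f16": "SRF",
}

def bit_width(vtype):
    """Total width in bits of a vector type written as 'N x tW' (e.g. '4 x i8')."""
    count, elem = vtype.split(" x ")
    return int(count) * int(elem[1:])

def register_classes(vtype):
    """Register files that may hold vtype; an empty list means the type is not legal."""
    if vtype not in LEGAL_SHORT_TYPES:
        return []
    # All short vector types are additionally associated with a quarter of VRF.
    return [LEGAL_SHORT_TYPES[vtype], "VRF/4"]

for t in LEGAL_SHORT_TYPES:
    print(t, bit_width(t), register_classes(t))
print("2 x i8", register_classes("2 x i8"))  # not registered, so not legal
```

A type absent from the table, such as 2 x i8, has no register class here and would have to be promoted or scalarized by legalization.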
Instruction Selection: The instruction selection consists of several phases, of which the most important ones are the legalization of operand types, the legalization of operations and the actual instruction selection. The legalization phase converts the IR types and operations to the types that are supported natively by the target machine. After the legalization, the IR operations are associated with the target machine instructions by using pattern matching. In order to facilitate the instruction selection, we added the following patterns to the compiler back-end.
Load & store: Several load patterns are defined to load the 32-bit data of the integer vector types to the IRF registers and the data of the floating-point vector type to the SRF registers. Similarly, the 32-bit store patterns are defined for the short vector types to store the data from the corresponding register files to memory. These patterns result in selection of the LSU instructions.
Arithmetic, logical & shift: New patterns are added for the 32-bit vector addition (add), subtraction (sub), multiplication (mul), division (div) and integer modulo (mod) operations. New patterns for the bit-wise logical operations (and, or, xor, not) supporting the short integer vector types are also added. Additionally, the right and left shift operation patterns are added for the short integer vector types. All these operation patterns match the corresponding SAU instructions.
Data movement: Several patterns are added in order to enable bidirectional data movement of short vector types between the SRF, IRF and quarter of VRF. This implies data movement between register files without the type conversion.
Data type conversion: Several patterns that correspond to data movement within the same or between different register files with type conversion are added. For instance, conversion of a signed integer in IRF to a floating-point in SRF (e.g. 2 x i16 → 2 x f16) or conversion of a floating-point in SRF to an integer in IRF (e.g. 2 x f16 → 2 x i16) are examples of such operations. Moreover, these operations also involve the type conversion from a short vector type (e.g. 4 x i8) to a longer vector type (e.g. 4 x f16) and vice versa.
Widening/extending & truncating/rounding: New patterns are added that correspond to widening a short vector to a longer vector, and truncating a longer vector to a short vector, of the same data type. Sign extension is used for the signed (e.g. 4 x i8 → 4 x i16) and zero extension is used for the unsigned (e.g. 4 x u8 → 4 x u16) widening of the integer data type. On the other hand, truncation is only applied to the unsigned integer data type. Moreover, patterns for extending (e.g. 2 x f16 → 2 x f32) and rounding (e.g. 2 x f32 → 2 x f16) operations between the floating-point data types are also added.
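The effect of sign extension, zero extension and truncation on packed integer lanes can be sketched as follows. This is an illustrative model only: vectors are represented as Python integers with lane 0 in the least significant bits, which is an assumption and not a statement about SHAVE's register layout, and the helper names are hypothetical.

```python
def unpack(word, lanes, bits):
    """Split word into `lanes` unsigned fields of `bits` bits (lane 0 = low bits)."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(lanes)]

def pack(fields, bits):
    """Pack fields into one integer, keeping only the low `bits` bits of each."""
    mask = (1 << bits) - 1
    word = 0
    for i, field in enumerate(fields):
        word |= (field & mask) << (i * bits)
    return word

def sign_extend(word, lanes, src_bits, dst_bits):
    """Widen each lane, replicating its sign bit (e.g. 4 x i8 -> 4 x i16)."""
    out = []
    for field in unpack(word, lanes, src_bits):
        if field >> (src_bits - 1):      # sign bit set: value is negative
            field -= 1 << src_bits
        out.append(field)
    return pack(out, dst_bits)

def zero_extend(word, lanes, src_bits, dst_bits):
    """Widen each lane, filling the new high bits with zeros."""
    return pack(unpack(word, lanes, src_bits), dst_bits)

def truncate(word, lanes, src_bits, dst_bits):
    """Narrow each lane by dropping its high bits."""
    return pack(unpack(word, lanes, src_bits), dst_bits)

w = 0x80FF017F  # lanes (low to high): 0x7F, 0x01, 0xFF, 0x80
print(hex(sign_extend(w, 4, 8, 16)))
print(hex(zero_extend(w, 4, 8, 16)))
print(hex(truncate(sign_extend(w, 4, 8, 16), 4, 16, 8)))
```

Truncating the sign-extended value returns the original 32-bit word, which is why the promote and truncate steps around a computation cancel out and count as pure overhead.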
Data pack & unpack: Finally, new patterns are defined that correspond to the insertion of a scalar data element (e.g. i8) into a vector (e.g. 4 x i8) or, vice versa, the extraction of a scalar element from a vector.
The patterns added for the data movement, data type conversion, widening and truncating, and data pack/unpack result in selection of the corresponding CMU instructions.
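The pack and unpack patterns amount to inserting or extracting one lane of a 32-bit word. A minimal sketch, assuming lane 0 occupies the least significant bits (an assumption made for illustration) and using hypothetical helper names:

```python
def insert_element(word, index, value, bits=8):
    """Return word with lane `index` replaced by the low `bits` bits of value."""
    mask = (1 << bits) - 1
    shift = index * bits
    return (word & ~(mask << shift)) | ((value & mask) << shift)

def extract_element(word, index, bits=8, signed=True):
    """Return lane `index` of word, optionally interpreted as two's complement."""
    mask = (1 << bits) - 1
    field = (word >> (index * bits)) & mask
    if signed and field >> (bits - 1):
        field -= 1 << bits
    return field

v = 0
for lane, value in enumerate([1, -2, 3, -4]):   # build a 4 x i8 vector lane by lane
    v = insert_element(v, lane, value)
print(hex(v))
print([extract_element(v, lane) for lane in range(4)])
```

Building a vector one lane at a time in this way is what makes element-wise handling of unsupported types costly compared with a single packed operation.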
VI. EXPERIMENTS & RESULTS
Several synthetic benchmarks were used to evaluate the impact of our new compiler, which generates mixed-length SIMD code, with respect to the original compiler. Synthetic benchmarks are chosen in order to minimize the interference of compiler optimizations (e.g. auto-vectorization) with the code generation. In this way, we obtain repeatable results for an accurate comparison. The benchmarks, which are written in C, are first compiled with the original and the new compiler. The resulting assembly codes are then simulated with a cycle-accurate simulator in order to measure the performance metrics.
Arithmetic Operations: The first benchmark set involves the add/sub, div, mod and mul arithmetic operations on different vector data types. The front-end compiler supports new vector type definitions through vector extensions. Table IV provides a list of the scalar data types and their corresponding vector extended types that are used in the benchmarks as types of the input data. For instance, the char4, short2 and half2 vector types in the high-level language correspond to the 4 x i8, 2 x i16 and 2 x f16 vector types in LLVM-IR. Similarly, the unsigned integer uchar4 type corresponds to the 4 x u8 type. Each benchmark test file applies an arithmetic operation on input data of a specific vector type and checks the result vector by comparing each of its elements with the expected value.
Supported short (32-bit) vector types: A subset of the benchmarks applies arithmetic operations on data of the char4, short2 and half2 vector types. Table V reports the number of instructions and stalls, and the average ILP of the test programs compiled with both the original and the new (mixed-length) compiler. The total number of cycles corresponds to the sum of the number of instructions and stalls. The performance improvement due to the mixed-length compiler is measured as the reduction in the number of instructions (cycle count). As can be seen from the table, the number of instructions is reduced by between 24 and 41, which leads to a cycle count reduction in the range of 6% to 15%. These figures correspond to the cost of the program execution on the whole system (SHAVE + LEON3 host processor). There is a static cost for the initialization and termination of the SHAVE program. Subtracting the static cost (144 instructions) from the total number of instructions gives the number of instructions that are executed only on the SHAVE processor. Multiplying this remaining number of instructions by the average ILP gives the number of operations scheduled on the SHAVE processor. Table VI presents the number of operations saved by the mixed-length SIMD compiler w.r.t. the original compiler (for the same test programs provided in Table V). As can be seen from Table VI, between 13% and 40% of operations can be eliminated. The rightmost column of Table V presents the cycle count improvements for the part of the program running only on SHAVE. The cycle count improvement of each test differs from the corresponding operation count improvement due to the change in average ILP.
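The derivation of the SHAVE-only operation count can be written down directly. In the sketch below, only the static cost of 144 instructions comes from the text; the instruction counts and ILP values in the usage example are made-up inputs chosen for illustration and are not measurements from Table V or Table VI.

```python
STATIC_COST = 144  # instructions spent on SHAVE program initialization/termination

def shave_instructions(total_instructions, static_cost=STATIC_COST):
    """Instructions executed on SHAVE only, after removing the static cost."""
    return total_instructions - static_cost

def shave_operations(total_instructions, avg_ilp, static_cost=STATIC_COST):
    """Operations scheduled on SHAVE: SHAVE-only instructions times average ILP."""
    return shave_instructions(total_instructions, static_cost) * avg_ilp

def saving(before, after):
    """Relative reduction from `before` to `after`."""
    return (before - after) / before

# Hypothetical inputs (NOT taken from the paper's tables).
orig_instr, orig_ilp = 300, 1.5
new_instr, new_ilp = 270, 1.4

instr_saving = saving(shave_instructions(orig_instr), shave_instructions(new_instr))
ops_saving = saving(shave_operations(orig_instr, orig_ilp),
                    shave_operations(new_instr, new_ilp))
print(f"instruction saving on SHAVE: {instr_saving:.1%}")
print(f"operation saving on SHAVE:   {ops_saving:.1%}")
```

Because the average ILP differs between the two builds, the operation saving and the instruction (cycle) saving do not coincide, which mirrors the remark closing the paragraph above.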
Unsupported (less than 64-bit) vector types: The second subset of benchmarks serves to analyze the impact of the new compiler on the unsupported (non-legal) types. These vector types cover the data range from 16-bit (e.g. char2) to 48-bit (e.g. short3), and also include unsigned integer (e.g. uchar4) data. Since these vector types are not natively supported by the compiler, they are either promoted to a supported vector type or converted to a scalar type before the vector or scalar computation. Figure 4 presents the impact of the new compiler on the performance of the arithmetic operations accepting integer data of the unsupported types. As can be seen from the figure, the new compiler positively contributes to the code generation for the unsupported types and improves the performance of the test programs by up to 47%. This corresponds to an instruction count reduction of up to 57. On the other hand, the performance of the arithmetic operations on byte2, char2 and uchar2 is degraded by up to 12%. This corresponds to up to 7 additional instructions. In these cases, the new compiler decides to extract each vector element and apply a scalar operation to it, because we did not add patterns for handling the 2 x {i8, u8} types. For instance, new patterns for extending the 2 x {i8, u8} type to the 2 x {i16, u16} type and truncating it back to the original type would enable short vector computation. This would improve the performance of the tests with the byte2, char2 and uchar2 types.
Type Conversions: The second benchmark set focuses on type conversions. In image processing, converting image pixel values from integer to floating-point for further data processing (e.g. filtering or transformations) and converting the processed data from floating-point back to integer for image display are common operations. In this subsection, we discuss bidirectional type conversions between the char and short integer types and the half floating-point type.
Figure 3
Conversion between char and short: The char to short conversion requires sign extension. The performance for the vector types of length 2, 3 and 4 is improved by up to 33.33% (a reduction of up to 24 instructions). However, the performance is degraded by 9.17% for the vector length of 8 (10 additional instructions). The short to char conversion requires truncation. The performance of this conversion is improved for all vector lengths, by up to 38.89% (a reduction of up to 42 instructions).
Conversion between half and char: The conversion from half to char results in a performance improvement of up to 41.59% (a reduction of up to 47 instructions) for all vector lengths. The conversion from char to half leads to performance improvements of up to 14.44% (a reduction of up to 26 instructions) for the vector lengths 2 and 3. On the other hand, the performance is degraded by 1.1% (an increase of 6 instructions) for the vector length 8.
Conversion between half and short: The conversion between the vector types of length 2 leads to a performance improvement of up to 45.57%. This corresponds to a reduction of up to 36 instructions. On the other hand, the performance of the other tests is either unchanged or degraded by up to 2.17% (up to 6 additional instructions).
The performance improvement depends on the number of operations saved and on the increase of the average ILP. As observed from Figure 3, the number of operations is reduced in all tests (except for the conversion of half3 to short3, which increases by only 1 operation). The performance improvement is mainly due to the removal of the redundant copy operations between the IRF, SRF and VRF register files: the conversion is carried out on CMU using SRF and IRF, whereas the original compiler moves the data to VRF before the conversion operation. However, the average ILP does not increase for all tests, which negates the gain from the reduced number of operations. As a result, the performance of a few tests is degraded. All in all, performance is improved for the conversion of all types that fit in the 32-bit register files. Moreover, the conversion from a higher to a lower number of bits (e.g. short to char, half to char) always results in performance improvements for all vector lengths.
VII. DISCUSSION
As seen from the results, our contribution to the compiler leads to the consolidation of short data in IRF and SRF before the actual computation. This results in a performance improvement for benchmarks using the short vector types. On the other hand, the performance of some benchmarks is degraded. In the ideal case, the compiler should evaluate the possible code generation options and choose the optimum option for each specific benchmark. However, code generation for architectures having two SIMD widths is not a common practice in compilers. Our prototype compiler is also not yet flexible and mature enough to cope with all the challenges of code generation for architectures with two SIMD widths. One possible improvement is in the instruction scheduling phase. For instance, the scheduler can migrate the operations on the short vector type to a larger vector type in the case of either register or function unit pressure. Such migration can also reduce register spilling, increase ILP and lead to better code. This is especially true for a VLIW architecture where two or more function units can perform the same operation.
Another possible enhancement is improving the compiler's calling convention. The current calling convention requires placing the arguments and the return value of functions in VRF for all vector types. Listing 2 in Section IV-B gives an example of a function having two arguments and a return value of type 4 x i8. Even though the 4 x i8 vector type fits in the IRF register file, due to the calling convention, the arguments and the return value are placed in VRF. This requires copying the arguments from VRF to IRF before data processing and copying the return value from IRF to VRF after processing, as shown in Listing 5. As a solution, a vector-type-aware calling convention can pass arguments and the return value via IRF or SRF for the short vector types. This would remove the copy instructions introduced in Listing 5.
VIII. RELATED WORK
In traditional architectures, DLP is usually exploited through single-width SIMD hardware units, as in the SODA [3], Imagine [13] and NXP EVP [14] processors. More recently, the SIMD width has grown across architecture generations, as in the Intel MMX (64-bit), SSE (128-bit) and AVX (256/512-bit) SIMD engines. The older SIMD widths are usually still supported in the instruction sets of newer generations, but mostly for backward software compatibility. Hardware reconfiguration techniques have also been introduced in many architectures to exploit the varying DLP of applications more efficiently. The two most commonly used techniques are power gating [15] and reconfiguration of SIMD lanes. In [8], an architecture referred to as AnySP is proposed, with a configurable SIMD data-path that supports wide and narrow vector widths. Another work [16] studies dynamic reconfiguration of the SIMD width of the architecture based on the DLP characteristics of loops. In these works, dynamic configurability enables the lane resources to operate as a traditional SIMD processor, to be re-purposed as a clustered VLIW processor, or a combination of both. On the other hand, the work presented in [17] uses power gating to turn off lanes of the SIMD function units when DLP is low; the corresponding portion of the code is devectorized to keep the higher lanes off and is executed on the lowest SIMD lane. The above-mentioned works focus on either SIMD lane reconfiguration or power gating and therefore suffer from performance and energy penalties due to the reconfiguration. In contrast, our approach shifts the problem to the software side and tackles it in the compiler, which eliminates the penalties introduced by reconfigurable-hardware-based solutions.
Another related work that proposes a compiler-based solution is presented in [9]. This work presents a vectorization pass, referred to as the SIMD Defragmenter, that identifies groups of compatible instructions (subgraphs) which can be executed in parallel across the SIMD lanes. Although most of the work is carried out in the compiler, static configuration of the hardware is still needed, which incurs some, albeit minimized, overhead. Our approach, by contrast, is purely compiler-based and requires no reconfiguration of the target architecture. The work presented in [18] describes SoftSIMD, in which source code transformations are used to facilitate the use of a standard 32-bit adder as a 2- or 3-way SIMD unit in order to increase hardware utilization. Another work, presented in [19], describes a compilation framework that optimizes data-parallel computation on a scalar ARM processor, the ARM NEON vector engine or the Vectorblox MXP soft vector processor. While ARM NEON provides a fixed vector lane configuration, the MXP soft vector processor offers a range of vector lane configurations to better match application requirements.
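The idea of using a standard 32-bit adder as a 2-way SIMD unit can be illustrated with the generic SIMD-within-a-register trick shown below. This is a textbook formulation given only for illustration, not the specific transformation of [18]; the helper names `pack`, `unpack` and `packed_add_2x16` are introduced for this example. Two 16-bit lanes are added with a single 32-bit addition by masking out the top bit of each lane, so that no carry crosses the lane boundary, and then restoring that bit with an exclusive OR.

```python
LANE_MSB = 0x80008000   # most significant bit of each 16-bit lane
LANE_LOW = 0x7FFF7FFF   # remaining 15 bits of each lane
MASK32 = 0xFFFFFFFF


def pack(hi: int, lo: int) -> int:
    # Place two 16-bit values in one 32-bit word.
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack(word: int):
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def packed_add_2x16(a: int, b: int) -> int:
    # Add the low 15 bits of both lanes with one 32-bit addition; the sum of
    # two 15-bit values fits in 16 bits, so no carry leaves its lane.
    low_sum = ((a & LANE_LOW) + (b & LANE_LOW)) & MASK32
    # The top bit of each lane is the XOR of the two operand bits and the
    # carry coming from the lower 15 bits, which is already in low_sum.
    return low_sum ^ ((a ^ b) & LANE_MSB)


example = unpack(packed_add_2x16(pack(1, 2), pack(3, 4)))
wrapped = unpack(packed_add_2x16(pack(0xFFFF, 0x0001), pack(0x0001, 0xFFFF)))
```

Each lane wraps around modulo 2^16 independently: adding the lane pairs (1, 2) and (3, 4) gives (4, 6), and an overflow in one lane does not disturb its neighbour.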
IX. CONCLUSION & FUTURE WORK
In this paper, we have proposed and researched mixed-length SIMD code generation for the SHAVE VLIW architecture, which comprises multiple native vector widths. To our knowledge, we have implemented the first (prototype) compiler producing such mixed-length SIMD code for a fixed-variation SIMD architecture. We demonstrated that our compiler improves the performance of synthetic benchmarks by up to 47%. However, as the experimental results show, the code generation slightly degrades the performance of some benchmarks. This indicates the need for further code generation optimizations, such as adding 32-bit vector code generation support for the VAU to facilitate the parallel execution of two short vectors when there are no wider vectors to process. We also plan to study the impact of mixed-length SIMD on energy consumption more precisely, since reducing the operation and cycle counts directly contributes to energy savings.
