Abstract-The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and "real-world" application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using "real-world" benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.
I. INTRODUCTION
POWER AND energy efficiency have become the dominant limiting factor to processor performance and have significantly increased processor design complexity, especially when considering the mobile market. Being able to exploit high degrees of data-level parallelism (DLP) at low cost in a power- and energy-efficient way [2]-[4], vector processors are an attractive architectural-level solution. Undoubtedly, the design goals for mobile vector processors clearly differ from the performance-driven designs of traditional vector machines [5]. Therefore, mobile vector processors require a redesign of their functional units (FUs) in a power-efficient manner. Clock gating is a common method to reduce switching power in synchronous pipelines [6]-[10]. It is practically a standard in low-power design. The goal is to "gate" the clock of any component whenever it does not perform useful work. In that way, the power spent in the associated clock tree, registers, and the logic between the registers is reduced. It is the most efficient power reduction technique for active operating mode.¹ Therefore, the conditions under which clock gating can be applied should be extensively studied and identified. A widely used approach is to clock gate a whole FU when it is idle [6], [7]. A complementary and more challenging approach is clock gating the FU or its subblocks when it is active, i.e., operating at peak performance [8]. Furthermore, there are characteristics of vector processors that provide additional clock-gating opportunities (that we discuss in Section IV).
Manuscript received June 2, 2017; revised October 7, 2017; accepted December 6, 2017. Date of publication January 9, 2018; date of current version March 20, 2018. The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P. The work of I. Ratković
Since fused multiply-add (FMA) units usually dissipate the most power of all FUs, their design requires special attention. Abundant floating-point (FP) FMA is typically found in vector workloads, such as multimedia, computer graphics, or deep learning workloads [11]. Although in the past FMA has been used for high performance, it has recently been included in mobile processors as well [4], [12]. In contrast to high-performance vector processors (e.g., NEC SX-series [13] and Tarantula [14]) that have separate units for each FP operation, mobile vector processors' resources are limited; thus, we typically have a single unit per vector lane capable of performing multiple FP operations rather than separate FP units [4]. Apart from that, additional advantages of using FMA over a separate FP adder and multiplier are as follows.
1) Computation localization inside the same unit reduces the number of interconnections (power and energy efficiency).
2) Higher accuracy (a single round/normalize step instead of two).
3) Improved performance (shorter latency).
In this paper, we investigate the design of a low-power fully pipelined double-precision IEEE 754-2008 compliant FMA unit for vector processors (VFU). In our main contribution, we comprehensively identify, propose, and evaluate (using both synthetic and real-world workloads) the most suitable clock-gating techniques for VFU running at peak performance periods without jeopardizing performance. We present three kinds of techniques:
1) novel ideas to exploit unique characteristics of vector architectures for clock gating during active periods of execution (e.g., vector instructions with a scalar operand or vector masking);
2) novel ideas for clock gating during active periods of execution that are also applicable to scalar architectures but especially beneficial to vector processors (e.g., gating internal blocks depending on the values of input data);
3) ideas that are already used in other architectures and that we present because their application is beneficial to vector processors and for the sake of completeness (e.g., idle VFU).
Regarding the second and third groups of ideas, an advantage of vector processing that extends the applicability of clock gating is that vector instructions last many cycles, so the state of the clock-gating and bypassing logic remains the same during the whole instruction. As a result, power savings typically outweigh the switching overhead of the added hardware (which is often not the case in scalar processors).
¹ Active operating mode assumes a busy functional unit.
1063-8210 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
To follow current trends in digital design that promote building generators rather than instances [15], [16], we perform this research in a fully parameterizable, scalable, and automated manner. We developed an integrated architecture-circuit framework that consists of several generators, simulators, and other tools, in order to join architectural-level information (e.g., vector length or benchmark configuration) with circuit-level outputs (e.g., VFU power and timing measurements). We implement our clock-gating techniques and generate hardware VFU models using a fully parameterizable Chisel-based [17] FMA generator (FMAgen) and a 40-nm low-power technology.
We discuss the related work individually for each of our clock-gating techniques together with the description of the technique in Section IV. Besides, in the context of alternative low-power techniques for FP units, interesting approaches have been proposed: memoing (caching results that can be reused) and byte encoding (computation performed over significant bytes). However, detailed and accurate evaluation reveals that the actual savings are often low and with an unaffordable area overhead [18] .
In summary, the main contributions of this paper are as follows.
1) The first proposal of active clock-gating techniques for VFU (see Section IV).
2) An in-depth evaluation of the proposed techniques.
a) Detailed power savings evaluation of the limits of each proposed technique separately using synthetic benchmarking (see Section VI-B).
b) Realistic, combined, scenario evaluation using real application-based benchmarking. We find the techniques that significantly reduce power with no performance loss (see Section VI-C).
3) A fully automated, parameterizable, and scalable exploration framework including our hardware and software (benchmark) generators (see Section V).
II. VECTOR PROCESSORS BACKGROUND
Vector processors operate on vectors of data within the same instruction.² A vector instruction set architecture (ISA) provides an efficient organization for controlling a large amount of computation resources. Furthermore, vector ISAs emphasize local communication and provide excellent computation/area ratios. Vector instructions express DLP in a very compact form, thus removing much redundant work (e.g., instruction fetch, decode, and issue). For example, a vector FP FMA instruction (FPFMAV) indicates the operation (FMA), three source vector registers, and one destination vector register. Thus, tuples of three elements, one from each source register, are the inputs for the VFU, and the result is written to the destination. All tuples can be processed independently, and multiple elements can be accommodated in a vector register.
The register file is designed so that a single named register holds a number of elements. The entire architecture is designed to take advantage of the vector style in organizing data. Additionally, the memory system of vector processors allows efficient strided and indexed memory access. The number of elements of a vector register is denoted by the maximum vector length (MVL). Occasionally, fewer elements than the MVL are used, which reduces the effective vector length (EVL).
The vector execution model streams one vector register element per cycle to a fully pipelined vector FU. As a result, the execution time of a vector instruction is the startup latency (the number of pipeline stages) of the vector FU plus the EVL. A common technique to reduce this time is to implement multiple vector lanes through replicated lock-stepped vector FUs. Each lane accesses its own "slice" of the vector register file, which reduces the need for increasing the number of ports typically associated with a larger number of FUs. Lockstepping the lanes simplifies the control logic and is power efficient. These concepts are shown in Fig. 1. Although lanes were proposed for increasing performance, using multiple lanes can increase the energy efficiency of a vector architecture [3], [9], [10].
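The timing model described above can be illustrated with a short Python sketch. The single-lane case (startup latency plus EVL) follows the text directly; the multilane formula, which assumes that lock-stepped lanes consume one element each per cycle, and the function name are illustrative assumptions rather than part of the original framework.

```python
import math


def vector_instruction_cycles(evl: int, n_stages: int, n_lanes: int = 1) -> int:
    """Cycles needed by one vector instruction on a fully pipelined vector FU.

    Assumption (not stated in the paper): with n_lanes lock-stepped lanes,
    each lane consumes one element per cycle, so EVL elements need
    ceil(EVL / n_lanes) issue cycles.
    """
    if evl < 0 or n_stages < 1 or n_lanes < 1:
        raise ValueError("evl must be >= 0, n_stages and n_lanes must be >= 1")
    # Startup latency (pipeline depth) is paid once per instruction.
    return n_stages + math.ceil(evl / n_lanes)


if __name__ == "__main__":
    for lanes in (1, 2, 4):
        print(lanes, vector_instruction_cycles(evl=64, n_stages=4, n_lanes=lanes))
```

Under these assumptions, a 64-element instruction on a four-stage unit takes 68 cycles with one lane and 20 cycles with four lanes.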
Additionally, an interesting feature that vector processors typically offer is vector mask control. Masked operations are used to vectorize loops that include conditional statements. A masked operation uses an MVL-bit vector mask register (VMR) to indicate which operations of the vector instruction are actually performed. In other words, masked vector instructions operate only on the vector elements whose corresponding entries in the VMR are "1."
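As a behavioral illustration of masked execution, the following Python sketch applies a multiply-add only to the elements whose VMR bit is "1." It is a minimal sketch: the function name is hypothetical, the policy of leaving masked-off destination elements unchanged is an assumption, and the arithmetic is ordinary (unfused) Python arithmetic.

```python
def masked_fma(a, b, c, dest, vmr):
    """Return the destination register after a masked vector multiply-add.

    Only elements whose mask bit is 1 are computed. Masked-off elements keep
    their previous destination value (an assumed policy, for illustration).
    """
    if not (len(a) == len(b) == len(c) == len(dest) == len(vmr)):
        raise ValueError("all vector operands and the mask must have equal length")
    result = list(dest)
    for i, bit in enumerate(vmr):
        if bit == 1:
            # Plain Python arithmetic: rounded after the multiply and after the add.
            result[i] = a[i] * b[i] + c[i]
    return result


if __name__ == "__main__":
    print(masked_fma([1.0, 2.0, 3.0, 4.0], [2.0] * 4, [1.0] * 4, [0.0] * 4, [1, 0, 1, 0]))
```

Every "0" entry in the mask corresponds to a cycle in which the unit performs no useful work.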
Conventional vector processors should not be confused with single instruction multiple data (SIMD) multimedia extensions such as AVX-512 [19], which are an alternative way to exploit DLP and indicate operations to perform on multiple elements.³ The main difference between these extensions and a conventional vector processor is that they exploit subword-SIMD parallelism and are typically implemented with multiple vector FUs that operate on all independent elements in parallel. Having a vector FU per element to operate on all of them in parallel would be inefficient for vector processors because they operate on much longer vectors. Instead, a vector FU is fully pipelined, and the elements of the vector register are streamed to the unit, one per cycle, possibly using a small number of vector lanes.
III. FLOATING-POINT FUSED MULTIPLY-ADD
This section briefly describes FP representation and FP FMA. Additional details about FP arithmetic are available in [21] - [23] .
A. Floating-Point Representation
FP representation is the most common way to represent real numbers in computers. It is based on the scientific notation to encode numbers, $M \times 10^{E}$, where $M$ and $E$ are the mantissa and the exponent, respectively. For example, 123.4 could be represented as $1.234 \times 10^{2}$. In the same way, the binary number $10100.1_2$ could be represented as $1.01001_2 \times 2^{4}$.
IEEE754 FP numbers have three basic components: the sign ($S$), the exponent ($E$), and the fraction ($F$). IEEE754 double- and single-precision FP formats are shown in Table I. The sign bit "1" indicates negative, while "0" indicates positive numbers. The mantissa is composed of the fraction and an implicit (hidden) leading "1."⁴ The exponent base (2) is implicit and need not be stored. The exponent field contains the sum of the bias ($B$) and the true exponent ($E_T$). The bias is 127 for single- and 1023 for double-precision numbers. Therefore, the value represented by a normalized IEEE754 FP number is:
$$(-1)^{S} \times 1.F \times 2^{E - B}, \qquad E - B = E_T.$$
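The field layout can be checked with a small Python sketch that splits a double-precision number into its sign, biased exponent, and fraction, and then rebuilds the value from those fields. The function names are illustrative, and the reconstruction covers normalized numbers only.

```python
import struct
from fractions import Fraction

BIAS = 1023          # double-precision exponent bias
FRACTION_BITS = 52   # stored fraction width in double precision


def decode_double(x: float):
    """Split a double into (sign, biased exponent, fraction) bit fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> FRACTION_BITS) & 0x7FF
    fraction = bits & ((1 << FRACTION_BITS) - 1)
    return sign, exponent, fraction


def normal_value(sign: int, exponent: int, fraction: int) -> float:
    """Evaluate (-1)^S * 1.F * 2^(E - B) for a normalized number."""
    if not 0 < exponent < 0x7FF:
        raise ValueError("only normalized numbers are handled in this sketch")
    mantissa = 1 + Fraction(fraction, 1 << FRACTION_BITS)  # hidden leading 1
    return float((-1) ** sign * mantissa * Fraction(2) ** (exponent - BIAS))


if __name__ == "__main__":
    s, e, f = decode_double(-6.5)   # -6.5 = -1.625 * 2^2
    print(s, e, hex(f), normal_value(s, e, f))
```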
The special value NaN is used for representing undefined values. It is produced when one (or more) operand is NaN or when the operation is: 1) 0 * ∞; 2) ∞ − ∞; 3) 0/0, ∞/∞; 4) x mod 0, ∞ mod y; or 5) √x, x < 0. Another important special value is infinity (±∞). It is produced when either input is ∞ or in the case of division by zero. The handling of NaN and ∞ is explained in [21].
B. FMA
The FMA unit executes the FMA instruction (FMA R <- A, B, C) that implements R = A * B + C. In contrast to a multiplication followed by an addition, the FMA instruction takes all three operands at the same time. It was introduced for the first time in IBM's RS/6000 in 1990 [24]. The IEEE754-2008 standard defines the FMA instruction to be computed initially with unbounded range and precision, rounding only once to the destination format. For this reason, FMA is faster and more precise than a multiplication followed by an addition. The FMA unit performs operand alignment in parallel with the multiplication. This leads to shorter latency ($n_S$) compared with a multiplication followed by an addition. Additionally, the FMA operation reduces the number of interconnections between FP units and the number of adders and normalizers. The FMA instructions help compilers to produce more efficient code. Potential drawbacks are increased latency of
A simplified implementation block diagram of the FMA unit used in our research is shown in Fig. 2. As we assume double precision, we need a 162-bit adder and a 53 × 53 multiplier. For the adder and the multiplier, we choose the Brent-Kung and Wallace algorithms, respectively, in line with our findings in [9] and [10]. The aligner shifts the addend based on the exponent difference in order to align it with the product ($M_A \times M_B$).
FP addition using the FMA unit is implemented by setting the first operand to 1 (A = 1.0), while FP multiplication is implemented by setting the third operand to 0 (C = 0.0).
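The effect of rounding only once, and the reuse of the FMA datapath for addition and multiplication, can be illustrated in Python. The sketch below uses exact rational arithmetic as a software reference model for a fused operation on finite inputs; it is not the hardware algorithm, the function names are illustrative, and NaN, infinity, and signed-zero handling are deliberately left out.

```python
from fractions import Fraction


def fma_reference(a: float, b: float, c: float) -> float:
    """a * b + c with an exact intermediate result and one final rounding.

    Reference model for finite inputs only (Fraction rejects NaN and infinity).
    """
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def mul_then_add(a: float, b: float, c: float) -> float:
    """Unfused version: the product is rounded before the addition."""
    return a * b + c


def fp_add(b: float, c: float) -> float:
    """Addition on the FMA datapath: first operand fixed to 1.0."""
    return fma_reference(1.0, b, c)


def fp_mul(a: float, b: float) -> float:
    """Multiplication on the FMA datapath: third operand fixed to 0.0."""
    return fma_reference(a, b, 0.0)


if __name__ == "__main__":
    a, b, c = 1.0 + 2.0 ** -30, 1.0 - 2.0 ** -30, -1.0
    # The exact product is 1 - 2^-60, which rounds to 1.0 in double precision.
    print(fma_reference(a, b, c), mul_then_add(a, b, c))
```

In this example the unfused sequence loses the result entirely (it returns 0.0), whereas the single-rounding result is $-2^{-60}$.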
IV. PROPOSED CLOCK-GATING TECHNIQUES
This section presents the proposed clock-gating techniques for VFU. The classification is presented in Table II .
A. Scalar Operand Clock-Gating
We propose this technique to tackle the cases in which one or two operands do not change during the vector instruction. Table III lists the types of instructions during which scalar operand clock-gating (ScalarCG) is active. As only one of all the supported vector instructions has all three vector operands, often at least one operand is scalar. Only the FPFMAV instruction, in which all operands are vectors, does not benefit from this technique. During these instructions, the corresponding input register(s) of scalar operand(s) should latch a new value only on the first clock edge of the execution of the instruction, while during the rest of the instruction, they can be clock-gated. To implement this, we introduce the signals VS[2..0] (Fig. 2), where VS[i] = 0 means that the ith operand is gated after the mentioned first cycle. VS signals are derived from the instruction OPCODE before the first pipeline stage (as shown in Fig. 2). This generation (decoding) requires regular comparators, and they are not on the critical path as the OPCODE is available at least one cycle in advance. Table III shows the corresponding VS signals for all the instructions.
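The effect of the VS signals on the input registers can be sketched behaviorally in Python. The sketch counts how many clock edges each operand register receives during one instruction; the function name and the list-based encoding of VS are illustrative assumptions, and the opcode-to-VS mapping of Table III is intentionally not reproduced.

```python
def input_register_clock_edges(vs, evl):
    """Clock edges seen by each operand input register during one instruction.

    vs[i] == 1: operand i is a vector, so its register is clocked every cycle.
    vs[i] == 0: operand i is scalar, so its register latches on the first
                cycle only and is clock-gated afterwards.
    Assumes a single lane issuing one element per cycle for evl cycles.
    """
    if evl < 0:
        raise ValueError("evl must be non-negative")
    return [evl if bit == 1 else min(1, evl) for bit in vs]


if __name__ == "__main__":
    # Hypothetical instruction with a scalar second operand and EVL = 64.
    print(input_register_clock_edges([1, 0, 1], 64))
```

For this hypothetical case, the register of the scalar operand is clocked once instead of 64 times.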
B. Implicit Scalar Operand Clock-Gating (ImplCG)
This technique is an additional optimization of ScalarCG and aims to exploit further the information given through the instruction OPCODE for clock-gating, operand isolation, and computation bypassing. In the case of addition and subtraction instructions, such as FPADDV and FPSUBV, the 53 × 53 mantissa multiplier is not needed as it is known that one of the multiplicands is "1," and thus, we can bypass, isolate, and clock-gate it providing the value of the other multiplicand directly to the adder. There is an analogous situation for FPMULV since the addend is known to be "0." In this case, the 162-bit wide adder, leading zero anticipation, and the aligning part are not needed.
To control bypassing, isolation, and clock-gating of the mentioned submodules, we introduce the signal INSTYP (see Fig. 2 and Table III), generated from the instruction OPCODE, which indicates whether an FPFMAV or an
In the context of instruction-dependent techniques, there is interesting research done in the past for scalar processors [8]. The main advantages of our ImplCG proposal over the mentioned research are: 1) we apply the technique for a variable number of pipeline stages; 2) we evaluate power, timing, and area; and 3) we propose the technique for vector processors.
The advantage of applying this technique on a vector processor over other models (e.g., scalar) is that vector instructions last many cycles, so the related hardware (clock-gating logic and MUXs) stays in the same state during the whole instruction. Thus, there is less switching overhead than in the scalar case.
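The instruction-dependent gating decisions of ImplCG described in this subsection can be summarized as a lookup table. The Python sketch below is only a behavioral summary: the block labels ("multiplier", "adder", "lza", "aligner") and the function name are hypothetical, and the actual INSTYP encoding is not modeled.

```python
# Submodules that can be bypassed, isolated, and clock-gated per instruction,
# as described in the text. Block labels are illustrative.
IMPLCG_GATED_BLOCKS = {
    "FPFMAV": frozenset(),                              # full FMA: nothing gated
    "FPADDV": frozenset({"multiplier"}),                # one multiplicand is 1
    "FPSUBV": frozenset({"multiplier"}),                # one multiplicand is 1
    "FPMULV": frozenset({"adder", "lza", "aligner"}),   # addend is 0
}


def implcg_gated_blocks(opcode: str) -> frozenset:
    """Return the set of submodules gated for the whole vector instruction."""
    try:
        return IMPLCG_GATED_BLOCKS[opcode]
    except KeyError:
        raise ValueError(f"unsupported opcode: {opcode}") from None


if __name__ == "__main__":
    for op in IMPLCG_GATED_BLOCKS:
        print(op, sorted(implcg_gated_blocks(op)))
```

Because the decision depends only on the opcode, it is fixed for all elements of a vector instruction.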
C. Vector Masking and Vector Multilane-Aware Clock-Gating (MaskCG)
Here, we target cases in which there are idle cycles during masked vector instructions (e.g., FPFMAV_MASK). Common cases in which vector mask control is used are: 1) sparse matrix operations, and 2) conditional statements inside a vectorized loop. Additionally, we assume that the same mechanism is also used to reduce the EVL to less than the MVL. We assume that the control logic will detect and optimize this case, skipping the last elements of the vector corresponding to the trailing 0s of the mask. However, in vector designs with $n_L$ lanes, some lanes can still be idle in the last cycle of the operation: when $\mathrm{mod}(EVL, n_L) \neq 0$, only $\mathrm{mod}(EVL, n_L)$ lanes hold valid elements in that cycle.
The VMR directly controls the clock-gating of the whole arithmetic unit during these idle cycles (see Fig. 2). Regarding the internal implementation, we perform clock-gating at pipeline stage granularity [25], so we prevent useless cycles inside the unit, i.e., the data are latched in subsequent stages only if necessary. Once the Enable signal of the first pipeline stage's register gets the value "1," this Enable signal propagates to the end of the pipeline, one stage per cycle (see Fig. 2). In other words, the Enable signal of the $n$th stage is actually the first stage's Enable signal delayed by $n - 1$ cycles. This is implemented by adding a 1-bit-wide, $(n_S - 1)$-long shift register that drives the clock-gating cells.
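The stage-granularity Enable propagation can be sketched as a cycle-by-cycle Python model. The function name and the tuple-based trace format are illustrative assumptions; the sketch models only the 1-bit shift register, not the clock-gating cells themselves.

```python
def stage_enables(first_stage_enable, n_stages):
    """Per-cycle Enable values for pipeline stages 1..n_stages.

    The first stage uses the incoming Enable (e.g., the mask bit); stage n
    sees that Enable delayed by n - 1 cycles through a 1-bit-wide,
    (n_stages - 1)-long shift register.
    """
    if n_stages < 1:
        raise ValueError("n_stages must be >= 1")
    shift = [0] * (n_stages - 1)   # shift register state, reset to 0
    trace = []
    for en in first_stage_enable:
        trace.append((en, *shift))
        # Shift by one position; the oldest bit falls off the end.
        shift = ([en] + shift)[: n_stages - 1]
    return trace


if __name__ == "__main__":
    for cycle, enables in enumerate(stage_enables([1, 0, 1, 1, 0, 0, 0], 4)):
        print(cycle, enables)
```

A stage whose Enable is 0 in a given cycle does not latch new data, so a masked-off element causes no register or logic activity as it "moves" through the pipeline.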
To the best of our knowledge, there is no related work that aims to exploit vector conditional execution with VMR to lower the power of vector processors.
D. Input Data Aware Clock-Gating (InputCG)
Here, we identify the scenarios in which, depending on the input data, a part of mantissa processing is not needed for the correct result and, thus, can be bypassed. We use a recoded format for internal representation [26] that allows us to detect special cases (explained in Section III-A) and zeros with a negligible hardware overhead; it requires inspection of only the three most significant bits of the exponent (fourth column in Table IV). Table IV presents the identified scenarios (conditions) in which a hardware block of mantissa arithmetic computations and the corresponding input registers can be bypassed, isolated, and clock-gated. The recoded format allows detection of relevant scenarios by using simple 3-bit comparators. They are located at the inputs of the VFU (A, B, and C processing block in Fig. 2).⁶ In that way, we ensure that the mentioned detection comparators are not on the VFU's critical path, i.e., gating information is available in time.
The added internal hardware is similar to that for ImplCG. Having a zero addend is analogous to the FPMULV instruction case (see Fig. 3). A zero multiplicand allows gating and bypassing all the modules from Fig. 3 except the registers that hold the operand C value, as in that case the final result is operand C. In the case of NaN and infinity, there is no need for any computation as the result that has to be at the VFU output is already known (explained in Section III-A), so we can gate/bypass/isolate the vast majority of FMA submodules.
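The data-dependent scenarios described above can be sketched as a simple classifier. This is a behavioral illustration only: the scenario labels, the function name, and the priority order (special values checked before zeros) are assumptions, and the sketch uses library predicates instead of the 3-bit exponent comparison performed on the recoded format.

```python
import math


def inputcg_scenario(a: float, b: float, c: float) -> str:
    """Classify one (A, B, C) operand tuple for R = A * B + C.

    Returns an illustrative label:
      "special"           - a NaN or infinity is present; the result is known
                            without mantissa computation
      "zero_multiplicand" - A or B is zero; the result is operand C
      "zero_addend"       - C is zero; analogous to the FPMULV case
      "none"              - full FMA computation is needed
    """
    if any(math.isnan(x) or math.isinf(x) for x in (a, b, c)):
        return "special"
    if a == 0.0 or b == 0.0:
        return "zero_multiplicand"
    if c == 0.0:
        return "zero_addend"
    return "none"


if __name__ == "__main__":
    for operands in [(2.0, 3.0, 4.0), (2.0, 3.0, 0.0), (0.0, 3.0, 4.0),
                     (float("inf"), 0.0, 1.0)]:
        print(operands, inputcg_scenario(*operands))
```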
There are many workloads whose data contain a lot of zero values [27], [28] and that can therefore benefit considerably from the last two subtechniques presented in Table IV. Although these techniques are applicable to other architectures as well, their application to vector processors is more efficient since recurrent values are common within vector data, thus lowering the switching overhead in the added hardware (clock-gating logic and MUXs).
While both the ImplCG and InputCG techniques aim to exploit cases in which the addend is "0," in the InputCG case the existence of the "0" is not announced externally via the VS signals; it has to be detected from the data, and the gating has to be done in time.
As in the case of ImplCG, the research done in [8] presents a related data-driven technique for scalar processors. The main advantages (that enable additional savings) of our InputCG technique over the mentioned research are: 1) detection of zero operands; 2) distinction between ∞ and NaN; and 3) gating the mantissa multiplier when processing NaNs.
E. Idle Unit Clock-Gating (IdleCG)
This technique clock-gates the VFU when no data are supplied to it. The clock (un)gate decision is made in the instruction issue pipeline stage, where it is known if an instruction will be sent to the VFU in the next cycle (see Fig. 2 ). As indicated in Fig. 2 , this technique uses the same internal clock-gating circuitry as MaskCG. A similar approach is widely used in scalar processors and is known as deterministic clock-gating [6] , [7] . Nonetheless, this technique has more potential for power savings than its scalar equivalent as it can benefit from the following vector specific advantages.
1) Vector FUs are used in burst fashion (with idle periods between bursts) since a single FMA/ADD/SUB/MUL instruction processes all vector elements in consecutive cycles. This makes clock-gating more efficient as the overhead of its buffers is minimized.
2) For high-frequency designs, the issue stage may need an additional cycle to determine if a unit will be used in the next cycle. In a scalar processor, we would need to waste that cycle once per each scalar FMA/ADD/SUB/MUL instruction. In a vector processor, we waste this cycle EVL − 1 times less often.
Although here we focus on lowering the power of the VFU when it is active, we present this technique for the sake of completeness.
V. METHODOLOGY
A simplified block diagram of the framework is shown in Fig. 4. It includes architectural-level (uKernel, VectorSim, and FMAgen) and circuit-level [RC, Encounter Digital Implementation (EDI), and NCSim] simulators and tools, as well as an interfacing tool (tBenchGen). For various parameters, we obtain the power (P) and area (A) of the VFU.
The first step is feeding VectorSim (described in Section V-A) with vectorized microbenchmarks (uBench) and vector parameters [MVL and the number of vector lanes ($n_L$)]. Using these inputs, VectorSim generates data and timing traces for the vector FP operations. We use two kinds of uBenchs (both explained in Section V-B) as follows.
1) param-uBenchs are generated by feeding the parameterizable microkernel uKernel with its parameters.
2) app-uBenchs are manually vectorized kernels extracted from applications.
The synthesizable Verilog netlists are generated using FMAgen (described in Section V-C). The output of architectural-level simulations together with FMAgen parameters is transformed into Verilog test benchmarks (tBenchs) using tBenchGen, a tool that we developed. The most important inputs are: data and time traces from VectorSim, vector and FMAgen (explained in Section V-C) parameters, clock cycle, and tBench length expressed in the number of Verilog test vectors. As output, it provides a tBench for each lane separately and its profiling report.
Afterward, we use Cadence RTL Compiler (RC) to obtain synthesized mapped netlists and to perform static timing analysis (STA) of the VFU. All the designs are synthesized for the minimum clock period that provides a safe slack on the critical path. We optimize noncritical path logic for leakage using high-$V_{TH}$ cells. We provide synthesized Verilog netlists together with the physical layout information to the Cadence EDI system to get placed and routed designs and to perform STA again. Additionally, the most critical paths are verified with Cadence Spectre.
In order to verify the designs and extract resulting switching activity information (written into value change dump files), we simulate each VFU in Cadence NCSim for each matching tBench with back-annotated delays using standard delay format files. Afterward, we perform precise power estimation using EDI PowerSim.
All designs are implemented using the mentioned low-power and low-leakage TSMC40LP library for typical operating conditions ($V_{dd}$ = 1.1 V and temp = 25°C). Since practically all existing vector processors are developed using standard cells [29], we selected this approach in our research. We use latch-based integrated clock-gating cells from the cell library. We set the tools to meet timing constraints while prioritizing power over area. All the optimizations in all the tools are applied using high effort.
An initial target core density of 70% is selected as a sensible balance between timing improvement and shrinking area for the wide set of design parameters that we use. We experimentally found that, in the place and route (PnR) stage in general, a density below 70% sometimes provides negligibly faster timing for a nonnegligible area overhead, while densities higher than 70% can degrade timing noticeably as the tool suffers from the lack of free space for optimizations and routing. Additionally, initial densities below 70% sometimes even cause design rule check errors and density (congestion) violations.
A. VectorSim
We built VectorSim based on the vector architecture library (VALib) and the SimpleVector simulator [11], developed in our group. VALib is a library that implements vector instructions and allows rapid manual vectorization and characterization of applications. SimpleVector is a simple and very fast trace-based simulator, which helps to estimate the performance of a vector processor. We took advantage of the fact that both tools have been designed to be easily extended with new instructions or implementation alternatives. Therefore, we modified them to satisfy our research goals and to enable their integration into our exploration framework. Among other upgrades, we added a set of vector FP FMA instructions to VectorSim. The high-level VectorSim configuration is presented in Table V.
We set up VectorSim to model a decoupled 32-bit vector machine with support for 64-bit FP. The decoupled execution [14], [30]-[32]. They share instruction fetch and decode, and they separate issue logic and FUs, allowing in that way independent scalar execution. In-order execution is common in low-power processors due to its simplicity (e.g., some of the ARM Cortex-A architectures: Cortex-A7, Cortex-A8, Cortex-A32, Cortex-A35, and Cortex-A53 [33]). It is more efficient in vector than in scalar processing, as in vector architectures the drawbacks of in-order execution are diminished, especially if the vectors are long. Additionally, we model chaining (the vector equivalent of data forwarding) and dead-time elimination (which allows the ALU to be reused immediately after the current instruction).
The vector execution engine is organized as $n_L$ identical vector lanes. Possible values of the number of lanes ($n_L$) are 1, 2, and 4. In our experiments, we do not examine more lanes as that would not fit well within a low-power core budget.⁷ Moreover, the values that we choose are typical in vector processor design [29]. Each lane has a slice of the vector register file, a slice of the vector mask file, 1 vector integer ALU, 1 VFU, and a private TLB. There is no communication across lanes, except for gather/scatter, reduction, and compress instructions. In addition to the vector ALU, each lane also includes 1 logic unit that handles logic operations, shifting, and rotating. We assume that division is done in software since it is rare in vectorizable applications and the hardware support is costly. Additionally, a control unit is needed to detect hazards, both from conflicts for the FUs (structural hazards) and from conflicts for register accesses (data hazards).
B. Benchmarking
This section explains two benchmarking methods that we employ for an in-depth evaluation of the proposed techniques. The first method has a goal to stress each of the techniques separately, while the second tests all the techniques simultaneously and provides the results for "real-world" applications.
1) Fully Parameterizable Kernel-uKernel:
We generate different param-uBenchs using the same uKernel. It is a variant of the DAXPY loop: D = A * B + C. The inputs are random values unless specified otherwise. uKernel parameters (see Table VI) are used to determine the characteristics of the generated param-uBench. There are parameters that modify the code (INSTYP, ADD/MUL, MULS, and ADDS), execution (IR and $p_m$), and data ($p_{inf}$, $p_{NaN}$, and $p_0$). Listing 1 shows an example of uBench pseudocode generated with the uKernel.
⁷ The total number of FUs per core is in accordance with many other low-power processors [
2) Application-Based Microbenchmarks-app-uBench:
An app-uBench is a manually vectorized, FP intensive microbenchmark (kernel) extracted from an application. It is a representative part of the application and small enough (between 100k and 150k test vectors) to keep circuit simulation time reasonable. We use four different app-uBenchs extracted from the vectorized applications described in Table VII. We selected different types of applications to make the results more general. These applications are used in mobile devices and can also be found in server workloads.
C. Fully Parameterizable FMA Generator
We developed FMAgen as a hardware generator written in Constructing Hardware in Scala Embedded Language (Chisel), a hardware construction language aimed at designing hardware by using parameterized generators [17]. Chisel is based on the Scala programming language, and it supports a combination of object-oriented and functional programming and good software engineering techniques. We find it to be an optimal way to design and test parameterizable FUs. On the one hand, it provides the ability to design and connect hardware blocks in the same way as in other hardware description languages (HDLs) (Verilog or VHDL), while on the other hand, it is significantly more [26]. This open-source library internally uses a recoded format (the exponent has an additional bit) to detect and handle special cases, such as subnormal numbers, more efficiently.⁹ BHFPU can produce FMAs for a configurable FP format, i.e., an arbitrary number of mantissa and exponent bits.
FMAgen generates synthesizable Verilog code of one-lane VFUs according to the input parameters (FMAgen parameters): clock-gating technique type ($CG_{type}$), latency, i.e., the number of pipeline stages ($n_S$), and the input FP format. The presented advanced clock-gating techniques are compatible with each other and can be arbitrarily combined. Therefore, possible values for $CG_{type}$ are any combination of the aforementioned clock-gating techniques (IdleCG, MaskCG, ScalarCG, ImplCG, and InputCG), including all of them together (AllCG) or none of them (NoCG). A combination of clock-gating techniques that is discussed below is ActiveCG, which combines all active clock-gating techniques from Table II (MaskCG, ScalarCG, ImplCG, and InputCG). $n_S$ can be an arbitrary number. In this paper, we set four stages as a reasonable limit for a low-power processor. Additionally, we set the VFU input FP format to double precision.
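The space of $CG_{type}$ values described above can be enumerated with a short Python sketch. It only illustrates how the named configurations relate to the five techniques; the data structures and the function name are illustrative and do not reflect the actual FMAgen parameter interface.

```python
from itertools import combinations

TECHNIQUES = ("IdleCG", "MaskCG", "ScalarCG", "ImplCG", "InputCG")

# Named configurations mentioned in the text.
NAMED_CONFIGS = {
    "NoCG": frozenset(),
    "ActiveCG": frozenset({"MaskCG", "ScalarCG", "ImplCG", "InputCG"}),
    "AllCG": frozenset(TECHNIQUES),
}


def all_cg_types():
    """Enumerate every combination of the five clock-gating techniques."""
    return [
        frozenset(combo)
        for size in range(len(TECHNIQUES) + 1)
        for combo in combinations(TECHNIQUES, size)
    ]


if __name__ == "__main__":
    configs = all_cg_types()
    print(len(configs))  # number of distinct technique combinations
    for name, members in NAMED_CONFIGS.items():
        print(name, sorted(members), members in configs)
```

With five independent techniques, there are 32 possible combinations, of which NoCG, ActiveCG, and AllCG are the ones named in the text.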
Apart from the mentioned features that we added to BHFPU (support for all the clock-gating techniques as well as support for combining them arbitrarily, pipelining, and different pipelining styles), we also added full IEEE 754-2008 compliance [40] (which introduces some timing overhead). A simplified block diagram of the modeled VFU is shown in Fig. 2.
We paid special attention to ensure that clock-gating logic does not create a critical timing path. The only circuitry that could be on the critical path is the bypassing multiplexors (see Fig. 3). However, compared with the rest of the FMA submodules, their delay impact is fairly small. Additionally, we debug timing and apply retiming 10 when necessary. Therefore, the timing cost of the added circuitry is almost negligible, especially in the case of the four-stage VFU (the most relevant one).

9 We assume that recoding is done when loading and storing to memory.
Since we target low power, we do not incorporate any speculative hardware for improving performance, and thus, no energy is wasted on precomputed results that get discarded.
VI. EVALUATION
This section presents an evaluation of the presented vector processing aware clock-gating proposals in terms of power savings (S) and area efficiency. Regarding power measurements, first we evaluate each technique separately using the benchmarking method from Section V-B1, and afterward, we evaluate combined scenarios using the method explained in Section V-B2.
VFU designs with one to four stages are synthesized and run at 0.45, 0.85, 1.1, and 1.3 GHz, respectively. We assume a NoCG VFU as the baseline. Its power for a two-lane VFU is 15.6, 30.9, 44.9, and 59.2 mW for one to four stages, respectively.
We observe that the static (leakage) power is practically negligible compared with dynamic power. For NoCG, it is around 0.01% of total power on average. The leakage is highest for IdleCG, where it is up to 1%. It is practically negligible for the following reasons: 1) arithmetic topologies produce high switching activity (high dynamic power); 2) the technology that we use has low leakage; and 3) we optimize noncritical path logic for leakage using high-V_TH cells. Although leakage is negligible when considering active operating modes (i.e., the execution inside a vector kernel), when the execution is outside a vector kernel (i.e., when the vector core is inactive), the leakage might be additionally suppressed via power gating. 11 However, power gating is out of the scope of this paper since we target lowering power during active operating modes with no performance loss.
We focus on four-stage results as they are the most important from the processor design perspective. Nonetheless, the one-stage results are presented as a reference; in most cases, the one-stage design has the highest overhead in terms of power and area across all n_S values. For the sake of simplicity, in the rest of this section, we typically omit results for two- and three-stage designs, but we observe that these results regularly scale between the results for one and four stages.

A. Area Efficiency

Table VIII shows the area efficiency of the proposed techniques. The area of a NoCG two-lane VFU configuration is 36191, 38060, 40693, and 43419 µm² for one to four stages, respectively. The area overhead is in some cases higher than expected because: 1) during synthesis, we prioritized timing and power over area to assure power savings without spoiling timing, and 2) Chisel-generated Verilog code is sometimes less area efficient than equivalent manually written Verilog [39]. However, we observe that this overhead has a strong decreasing trend as n_S increases.

10 Critical path optimization by adjusting the position of the flip-flops.
11 The gate signal in this case could be generated from vector kill instruction KILLV (similar to VRIP instruction in Cray X1 instruction set [41]).
B. Per Technique Power Analysis
Fig. 5 and Table IX reveal the results for each of the presented vector processing aware clock-gating proposals separately, in terms of power savings for four- and one-stage VFUs. In these experiments, we set MVL and the number of vector lanes (n_L) to 64 and 2, respectively.
We observe that in most of the cases, the savings grow with n_S, as more pipeline stages enable finer granularity of clock gating. Due to its higher practical importance, in the rest of the discussion, we focus on four-stage results. Fig. 5 shows results for MaskCG, InputCG, and IdleCG.

1) MaskCG: Due to its simplicity, this technique comes with practically no overhead, and the savings are between 8% and 52% depending on p_m. The saving attainable when p_m = 1 (S = 52%) is the maximum possible power reduction for an active four-stage VFU.

2) InputCG: In order to isolate the savings for each of the mentioned subtechniques (Table IV), we test all of them separately by applying the probabilities p_inf, p_NaN, and p_0 (see Table VI) to the operands. In InputCG_∞, InputCG_NaN, and InputCG_mul0, the corresponding probability affects operand A, while in InputCG_add0, it affects operand C, the addend.
The maximum saving of 48.3% is available when p_inf or p_NaN is 1 (InputCG_∞ and InputCG_NaN). The same savings are available whether an operand is NaN or ∞, as in both cases the same hardware is clock-gated. The minimum probability p_NaN or p_inf (of any operand) necessary for saving power is the point where the savings graph crosses the probability axis (16%). When considering InputCG_mul0, the maximum saving is 40%, and the minimum probability p_0 (of any multiplicand) necessary for saving power is 18.5%. A much lower maximum saving (2.3%) is available when the addend is zero (InputCG_add0), as the adder consumes much less power than the multiplier (around five times less on average). However, when these scenarios occur together (which is reasonable to assume in a real application), higher savings would be available. Therefore, even though the savings associated with detecting a zero addend and clock-gating the adder and the corresponding aligner and input registers are not enough to justify this subtechnique by itself, supporting this case improves the overall savings of the complete InputCG technique when a real, combined scenario is considered.
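The break-even probabilities quoted above can be related to the maximum savings with a simple model. The following Python sketch assumes, purely for illustration, that the savings grow linearly with the probability between the break-even point and p = 1; the text reports only those two points, so the linear shape and the function name are assumptions rather than measured behavior.

```python
def linear_savings(p, s_max, p_break_even):
    """Savings in percent at probability p under an assumed linear model.

    The line passes through (p_break_even, 0) and (1, s_max); negative
    values mean the added clock-gating logic costs more than it saves.
    """
    slope = s_max / (1.0 - p_break_even)
    return slope * (p - p_break_even)

# InputCG_inf / InputCG_NaN: maximum 48.3%, break-even at 16%.
overhead_nan = linear_savings(0.0, 48.3, 0.16)
# InputCG_mul0: maximum 40%, break-even at 18.5%.
overhead_mul0 = linear_savings(0.0, 40.0, 0.185)
```

Under this assumption, both subtechniques would extrapolate to a power overhead of roughly 9% when the special inputs never occur, which is the quantity that the zero-probability measurements expose directly.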
Since the zero-addend case shares some hardware with other InputCG subtechniques, the overhead of adding it is less than the saving it can achieve. The power overhead of the added hardware can be identified in the case when the probability is 0, i.e., when InputCG is never active. The cost is a bit higher than expected considering the amount of additional logic that we include (detecting, bypassing, and clock-gating logic). In line with our discussion of area results, Chisel-generated designs sometimes suffer from unexpected overhead, and our initial experiments confirm it. However, we observe that it decreases significantly as n_S increases; thus, we expect it to be negligible for high n_S.

One parameter from Table VI is assumed to be variable while the other parameters are zero. This is indicated in the legend with (technique, probability) pairs.

We show the profiling results as well to understand the VFU behavior and where the savings come from. Data for MaskCG, InputCG_NaN, and InputCG_inf are not present in Table X as they are 0%. The reason is that the selected app-uBenchs do not have vector mask instructions. Also, none of the input values are NaNs or infinities. However, abundant vector mask instructions can be found in any vector workload that has conditional execution [42], so in this kind of workload, we can expect fair savings from the MaskCG technique. Regarding NaNs and infinities, for some other applications and/or input data sets, their occurrence might be more common, and thus, the benefit of the InputCG_NaN and InputCG_inf subtechniques will be visible. Common cases of NaN and infinity processing are explained in [21].

Fig. 6 and Table X reveal that AllCG efficiency is very high, i.e., clock gating is often in use for almost 100% of the total execution time. This is a consequence of the fact that the proposed clock-gating techniques are used during both idle and active VFU execution (i.e., both ActiveCG and IdleCG are used).
Due to this very high AllCG efficiency, the power savings are also fairly high. We observe that power savings are available for practically all the combinations of app-uBenchs and vector parameters. The highest savings are obtained for the computer vision app-uBenchs (Facerec and Disparity) and are between 60% and 80%. The only case in which the techniques do not result in savings is K-means with MVL = 16 and n_L = 4. There are two reasons for that: 1) the clock-gating efficiency (i.e., the percentage of execution time during which any clock-gating technique is used) is not high; and 2) with n_L = 4 and MVL = 16, the effective vector length per lane is 4, which makes ImplCG (the most used technique in this case) less fruitful, since it is used for only three consecutive cycles per vector on each lane (which, as a result, causes more switching activity in the clock-gating and bypassing logic). Additionally, from Table X, we observe that the presented novel ideas/approaches (ActiveCG) provide significant savings in addition to the standard one (IdleCG).

Fig. 7 shows that the ratio of active to idle VFU execution varies across app-uBenchs and vector parameters, and explains the nature of each combination of parameters. There are situations in which the VFU is active most of the time and vice versa. However, we notice a trend that vector processors with an MVL of 128 have their VFU active (busy) most of the time, as with longer vectors the effects of cache misses are diminished.

IdleCG is used whenever the VFU is idle, and thus, IdleCG efficiency inside these idle periods is 100%. When considering active VFU execution, the efficiency (ActiveExeCG) varies across the app-uBenchs and vector parameters and is shown in Fig. 8. As we can observe from the figure, at least one of the active clock-gating techniques is used for a very high percentage of the time, and depending on the benchmark, it goes up to 100%.
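The K-means corner case discussed above follows from simple arithmetic on the vector parameters, sketched below in Python. The function names are hypothetical, and the rule that ImplCG stays engaged for one cycle less than the per-lane vector length is an assumption chosen to match the three cycles quoted for an effective length of 4; it is not a statement about the actual control logic.

```python
import math

def elements_per_lane(mvl, n_lanes):
    """Effective vector length seen by each lane of a multilane VFU."""
    return math.ceil(mvl / n_lanes)

def implcg_run_length(mvl, n_lanes):
    """Consecutive gated cycles per vector and lane (assumed: length - 1)."""
    return max(elements_per_lane(mvl, n_lanes) - 1, 0)

# MVL = 16 with four lanes: 4 elements per lane, 3 consecutive gated cycles.
short_run = implcg_run_length(16, 4)
# MVL = 64 with two lanes: 32 elements per lane, 31 consecutive gated cycles.
long_run = implcg_run_length(64, 2)
```

Short runs mean that the gating and bypassing logic toggles more often relative to the cycles it saves, which is consistent with the lack of savings reported for the MVL = 16, n_L = 4 configuration.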
Table X shows that in all these cases the used techniques are some variants of ImplCG 12 and InputCG. Also, there are cases when these techniques overlap.
12 As explained before, ScalarCG is integrated in ImplCG.

ActiveCG techniques can arbitrarily overlap, and there are two potential kinds of overlaps. The first kind refers to the cases when the techniques jointly produce higher savings than each technique separately. This happens when the techniques target different hardware. For example, when we have a zero addend inside an FPADDV instruction, InputCG_add0 gates the mantissa adder and the aligner, while ImplCG gates the input register of the A operand and the mantissa multiplier. The second kind happens when one technique gates just a part of the hardware that another technique gates. In these cases, the savings are equal to the savings of the technique that has the larger scope. For example, if the corresponding bit in the VMR is "0" and the current instruction is FPFMAVVS, the savings are going to be equal to the savings achieved by MaskCG alone.
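Both kinds of overlap can be viewed as a set union over the hardware blocks that each technique gates. The Python sketch below illustrates this view; the block names, the relative power weights, and the scope assumed for ImplCG under FPFMAVVS are hypothetical placeholders and do not come from the measured design.

```python
# Hypothetical relative power weights per block (multiplier about 5x the adder).
BLOCK_WEIGHT = {
    "in_reg_A": 1.0, "in_reg_B": 1.0, "in_reg_C": 1.0,
    "mant_mul": 5.0, "aligner": 1.0, "mant_add": 1.0,
}
ALL_BLOCKS = frozenset(BLOCK_WEIGHT)

def gated_weight(*scopes):
    """Total weight of the blocks gated by any of the given technique scopes."""
    gated = frozenset().union(*scopes)
    return sum(BLOCK_WEIGHT[b] for b in gated)

# First kind: disjoint scopes (zero addend inside FPADDV), savings add up.
input_add0 = frozenset({"mant_add", "aligner"})
impl_fpaddv = frozenset({"in_reg_A", "mant_mul"})
joint = gated_weight(input_add0, impl_fpaddv)

# Second kind: nested scopes (masked element under FPFMAVVS), larger scope wins.
mask_cg = ALL_BLOCKS                    # assumed: masking gates the whole lane
impl_fmavvs = frozenset({"in_reg_B"})   # assumed scope, for illustration only
nested = gated_weight(mask_cg, impl_fmavvs)
```

In the disjoint case the union is as large as the two scopes together, whereas in the nested case it collapses to the larger scope, matching the observation that the savings then equal those of MaskCG alone.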
VII. CONCLUSION
In this paper, we extensively identify, propose, and evaluate the most suitable clock-gating techniques for VFUs considering peak performance, and focusing on the active operating mode. We propose techniques that are either: 1) completely novel ideas to lower the power of the VFU using active clock gating (e.g., vector instructions with a scalar operand or vector masking), or 2) ideas that exist in some form in scalar architectures and that we extend to achieve more savings by taking advantage of vector processing characteristics. We find that each of the proposed optimizations achieves power reductions while maintaining the performance. As a consequence, an area increase is sometimes observed.
An in-depth evaluation is performed, and each of the techniques is evaluated separately as well as combined with other techniques. For this evaluation, both synthetic and real application-based benchmarks are employed. We considered a variety of benchmarks with different behaviors to assure a fair evaluation and general conclusions.
In the case of a four-stage VFU with two lanes actively operating at peak performance, power savings of up to 52% are available when using a single technique. Regarding the vector instruction-dependent techniques that we propose, we observe savings for all FP vector instructions.
Testing all the techniques together and using real application benchmarks (especially computer vision ones) reveals fairly high power reductions that go up to 80%. Clock-gating efficiency (the percentage of time during which some of the proposed techniques are used) is quite high, often close to 100%. When considering the efficiency of only the active clock-gating techniques, this number is usually between 70% and 100%. We observe that these novel ideas/approaches (applied when the VFU is active) provide significant savings in addition to the standard ones (idle VFU). Moreover, we notice the trend that savings for the proposed techniques rise with the number of pipeline stages.
We performed this research in a fully parameterizable, scalable, and automated manner using simulators and tools at many levels. Although we target the FP FMA, as the major power consumer among all FUs, similar low-power techniques and the framework could be retailored for other vector FUs as well. We would also like to stress that the combination of Chisel-based generators and state-of-the-art synthesis and PnR tools is a powerful means for flexible hardware generation with Verilog-like quality of results.
