ABSTRACT Hash functions are a crucial tool in a large variety of applications, ranging from security protocols to cryptocurrencies down to the Internet-of-Things devices used, for example, as biomedical appliances. In particular, SHA-2 is today a ubiquitous hashing primitive. Its acceleration has driven a wealth of contributions in the technical literature and even a whole industry segment involving dedicated hash processing accelerators. Because of the variety of requirements in terms of performance, resources, and energy consumption as well as the impact of the particular hardware technology of choice, evaluating and comparing different architectural schemes is a nontrivial task, along with the exploration of new solutions matching given user requirements. Based on a careful review of the state of the art, this paper introduces an SHA-2 workbench to be used as a framework for evaluating different implementation styles and architectural choices. The workbench comes in the form of a generic HDL description, where the various implementation options are exposed in the form of user-configurable parameters and can be variously combined obtaining either known solutions or possibly new configurations to be explored. We systematically use the workbench to analyze the available SHA-2 architectural techniques. This extensive evaluation provides a deep understanding of the performance and energy implications of each implementation style and even allows the identification of nonobvious matches between architectural choices and target technologies in order to optimize hash rate and area efficiency figures.
I. INTRODUCTION
Cryptographic hash functions underlie many aspects of our everyday life today. They are a fundamental building block for a number of cryptographic services which are critical for security applications and protocols, but they also provide the essential ingredient for emerging innovative applications, like blockchains and distributed ledgers, involving a wide range of platforms from high-end servers down to resource-constrained Internet-of-Things (IoT) devices. In all such applications, especially those requiring large amounts of hashing operations, it is of paramount importance to meet strict performance and energy requirements under certain cost budgets. In fact, while current hashing algorithms like SHA-2 are deemed secure and well suited for sensitive applications (unlike widely adopted predecessors like SHA-1, now phased out due to major vulnerabilities pointed out recently), the increased level of security comes at higher costs in terms of both processing load and electrical energy. Because of the very inherent nature of the hashing operation as exploited in the above applications, this trade-off turns out to be crucial. Consider for example the distributed consensus protocol in the Bitcoin blockchain, where the energy cost of hash computation is the critical factor determining the economic profitability of coin mining, or the energy requirements of many battery-powered IoT devices, where the battery life often coincides with the lifetime of the device itself.
The associate editor coordinating the review of this manuscript and approving it for publication was Junaid Shuja.
Although diverse in nature, the above fields of application all share the need for dedicated acceleration of hash processing. Indeed, hash-based mining has driven the development of a whole industry in the area of mining accelerators, while the technical literature contains a wealth of proposals for hardware-accelerated implementations of hash functions like SHA-2. These proposals target a diverse range of applications and systems, making it difficult to assess the impact of the hash block itself due to complex interactions of different components within the designed system. Furthermore, after being captured in a Hardware Description Language (HDL) such as VHDL or Verilog, the proposed accelerator architectures are implemented, via automated tools, for different target technologies, which can be a Field-Programmable Gate Array (FPGA) or an Application-Specific Integrated Circuit (ASIC) technological library. The impact of the target technology on the reported results can hardly be factored out, making it very difficult to compare different proposals in terms of raw performance as well as energy efficiency.
Driven by the increasing importance of the SHA-2 hash function, this work directly addresses the above limitations by introducing what we call a SHA-2 workbench, a framework for evaluating different implementation styles and architectural choices as well as understanding their performance and energy implications. The work is based on a careful review of the state of the art in order to exhaustively cover the available implementation options and capture the commonalities of existing solutions. Our SHA-2 workbench comes in the form of a generic HDL description, where the various implementation choices are exposed in the form of user-configurable parameters and can be variously combined, obtaining either known solutions or possibly new configurations to be explored. As such, the proposed workbench can be effectively used as a design exploration framework, letting the designer find a SHA-2 architecture that meets specific requirements by simply reconfiguring the framework instead of implementing from scratch all the possible alternatives. The main contributions of this paper are hence the following:
• Making the comparison of different SHA-2 designs easier and fair, by defining a common architecture where different design proposals can be easily plugged in and synthesized to the same target architecture;
• Allowing the performance evaluation of different SHA-2 designs without taking into account optimizations or operating conditions specific to a particular application;
• Making the development of new SHA-2 implementations easier by defining a comprehensive architecture exploration framework.
The rest of this paper is organized as follows. Section II reviews the SHA-2 family of hash algorithms. Section III deals with the hardware implementation of SHA-2 by surveying the most representative proposals in the technical literature. Section IV summarizes the fundamental contribution of this work, while Section V presents the design of the proposed workbench, then evaluated in Section VI. Section VII analyzes the results we obtained and Section VIII concludes the paper with some final remarks.
II. SHA-2 HASH FUNCTIONS
The Secure Hash Algorithm (SHA) is actually a set of cryptographic hash algorithms defined by the National Institute of Standards and Technology (NIST) in the Secure Hash Standard (SHS) for use by U.S. government agencies. SHA-2 is the family of hash algorithms defined in the SHS, excluding SHA-1 [1]. It was first introduced in 2001, with the definition of three hash functions; subsequent updates added three variants, totaling six hash functions. SHA-2 hash functions are currently in use as secure hash algorithms [2].
A. DETAILS OF THE ALGORITHMS
SHA-2 hash algorithms are distinguished by the length of the output they produce. The two basic variants are SHA-256 and SHA-512, which are in fact the same algorithm, applied to different word lengths. SHA-256 operates on 32-bit words, whereas SHA-512 works on 64-bit words. The two variants differ also in some constant parameters and values, and employ different initialization values. The other four versions, namely SHA-224, SHA-384, SHA-512/224, and SHA-512/256, are the same function as SHA-256 (the first one) or SHA-512 (the others), with the output truncated to the specified number of bits and different initial values. The whole family can hence be described by looking only at the two basic variants.
1) INITIALIZATION
SHA-2 hash functions work on blocks of a fixed length, which is 512 bits for SHA-256 and 1024 bits for SHA-512, and yield a fixed-length output. In order to ensure that the block length divides the overall input length, the input message is padded with a single '1' bit followed by as many '0' bits as needed to reach a length that is congruent to 448 (mod 512) for SHA-256, and to 896 (mod 1024) for SHA-512. This leaves 512 − 448 = 64 bits for encoding the length of the original message for SHA-256 and 1024 − 896 = 128 bits for SHA-512. The size of the message length field in the padding scheme determines the upper bound on the length of the message.
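The padding rule above can be illustrated with a minimal Python sketch for SHA-256. It assumes byte-aligned messages (so the '1' bit and the first seven '0' bits form the byte 0x80); the function name `sha256_pad` is chosen here for illustration and is not part of the workbench.

```python
def sha256_pad(message: bytes) -> bytes:
    """Pad a byte-aligned message for SHA-256 (512-bit blocks)."""
    bit_len = 8 * len(message)
    # A single '1' bit followed by seven '0' bits.
    padded = message + b"\x80"
    # '0' bytes until the length is congruent to 448 mod 512 bits,
    # i.e. 56 mod 64 bytes.
    padded += b"\x00" * ((56 - len(padded)) % 64)
    # The last 64 bits encode the original length in bits (big-endian).
    return padded + bit_len.to_bytes(8, "big")
```

A 55-byte message still fits in one block, whereas a 56-byte message needs a second block because the length field no longer fits in the first one.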
After padding, the message is split into blocks, often called Padded Data Blocks (PDBs) to recall that the original message was padded. A partial hash value carries the output of the processing of previous blocks into the processing of each block: it initializes 8 working variables, usually labeled A to H, which are updated in each iteration of the algorithm to yield the new partial hash value. For SHA-256, the working variables are 32 bits wide and 64 iterations, also called rounds, are performed, whereas for SHA-512 the working variables are 64 bits wide and 80 rounds are performed. Of course, the partial hash value itself needs to be initialized; the initialization values, which are defined in the standard [1], are different for each of the 6 variants of the SHA-2 family of algorithms.
2) COMPRESSOR
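The part of the algorithm devoted to the update of the working variables is often, especially in the context of hardware implementations, referred to as the compressor. Following a top-down description, at each round t the working variables are updated as shown by the assignments below:
$$
\begin{aligned}
A_{t+1} &= T_1 + T_2, & B_{t+1} &= A_t, & C_{t+1} &= B_t, & D_{t+1} &= C_t, \\
E_{t+1} &= D_t + T_1, & F_{t+1} &= E_t, & G_{t+1} &= F_t, & H_{t+1} &= G_t,
\end{aligned}
\tag{1}
$$
where all additions are performed modulo $2^{32}$ ($2^{64}$ for SHA-512) and the $T_1$ and $T_2$ functions are defined as
$$
T_1 = H_t + \Sigma_1(E_t) + Ch(E_t, F_t, G_t) + K_t + W_t, \qquad
T_2 = \Sigma_0(A_t) + Maj(A_t, B_t, C_t).
\tag{2}
$$
The Choose and Majority functions are defined as
$$
Ch(x, y, z) = (x \wedge y) \oplus (\neg x \wedge z), \qquad
Maj(x, y, z) = (x \wedge y) \oplus (x \wedge z) \oplus (y \wedge z),
\tag{3}
$$
whereas the $\Sigma$ functions are defined for SHA-256 as
$$
\Sigma_0(x) = (x \ggg 2) \oplus (x \ggg 13) \oplus (x \ggg 22), \qquad
\Sigma_1(x) = (x \ggg 6) \oplus (x \ggg 11) \oplus (x \ggg 25),
\tag{4}
$$
with $\ggg r$ denoting the circular right shift (rotation) by r bits. SHA-512 uses the same structure but different rotation values, namely
$$
\Sigma_0(x) = (x \ggg 28) \oplus (x \ggg 34) \oplus (x \ggg 39), \qquad
\Sigma_1(x) = (x \ggg 14) \oplus (x \ggg 18) \oplus (x \ggg 41).
\tag{5}
$$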
The K values which appear in Equation 2 are round constants defined by the standard [1] as the first 32 (64 for SHA-512) bits of the fractional parts of the cube roots of the first 64 (80 for SHA-512) prime numbers. On the other hand, the W values which appear in the same equation are produced by the function described next.
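As an illustration of how the round constants follow from their definition, the sketch below derives them with exact integer arithmetic instead of copying the table from the standard. The helper names (`first_primes`, `integer_cube_root`, `round_constants`) are introduced here for illustration only; a hardware design would simply store the resulting values in a ROM.

```python
def first_primes(count):
    """Return the first `count` prime numbers by trial division."""
    primes = []
    candidate = 2
    while len(primes) < count:
        if all(candidate % p for p in primes):
            primes.append(candidate)
        candidate += 1
    return primes


def integer_cube_root(n):
    """Return the largest integer r such that r**3 <= n (binary search)."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid ** 3 <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo


def round_constants(word_bits, rounds):
    """First `word_bits` fractional bits of the cube root of each prime."""
    mask = (1 << word_bits) - 1
    # floor(cbrt(p) * 2**w) is the integer cube root of p * 2**(3w);
    # masking drops the integer part and keeps the w fractional bits.
    return [integer_cube_root(p << (3 * word_bits)) & mask
            for p in first_primes(rounds)]


K256 = round_constants(32, 64)  # SHA-256 round constants
K512 = round_constants(64, 80)  # SHA-512 round constants
```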
3) EXPANDER
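The input message block is absorbed in the hash computation by means of words W. More precisely, for SHA-256 the 512 bits of the input message block are expanded into 64 32-bit words $W_t$ to be used in Equation 2, one expanded word per round. Similarly, for SHA-512 the 1024 bits of the input message block are expanded into 80 64-bit words $W_t$, one per round. The expander performs the following computation, where $M_j$ is the current message block and $M_j^{(t)}$ its t-th word:
$$
W_t =
\begin{cases}
M_j^{(t)} & 0 \le t \le 15, \\
\sigma_1(W_{t-2}) + W_{t-7} + \sigma_0(W_{t-15}) + W_{t-16} & 16 \le t \le 63 \ (79 \text{ for SHA-512}),
\end{cases}
\tag{6}
$$
where the $\sigma$ functions are defined for SHA-256 as
$$
\sigma_0(x) = (x \ggg 7) \oplus (x \ggg 18) \oplus (x \gg 3), \qquad
\sigma_1(x) = (x \ggg 17) \oplus (x \ggg 19) \oplus (x \gg 10),
\tag{7}
$$
with $\ggg r$ denoting the circular right shift and $\gg r$ the logical right shift by r bits. As for the $\Sigma$ functions, SHA-512 uses the same structure for the $\sigma$ functions but different rotation and shift values, namely
$$
\sigma_0(x) = (x \ggg 1) \oplus (x \ggg 8) \oplus (x \gg 7), \qquad
\sigma_1(x) = (x \ggg 19) \oplus (x \ggg 61) \oplus (x \gg 6).
\tag{8}
$$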
Note that the first 16 elements of W coincide with the words of the message block itself.
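The expansion can be sketched in software as follows for SHA-256, using the message schedule recurrence and rotation amounts of the standard [1]. This is a purely functional model with illustrative function names; it does not reflect the shift-register organization used in hardware.

```python
MASK32 = 0xFFFFFFFF


def rotr(x, r):
    """Circular right shift of a 32-bit word by r bits."""
    return ((x >> r) | (x << (32 - r))) & MASK32


def small_sigma0(x):
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3)


def small_sigma1(x):
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10)


def expand(block: bytes) -> list:
    """Expand one 512-bit padded block into the 64 words W_0..W_63."""
    if len(block) != 64:
        raise ValueError("a SHA-256 block is exactly 64 bytes long")
    # The first 16 words are the big-endian words of the block itself.
    w = [int.from_bytes(block[4 * i:4 * i + 4], "big") for i in range(16)]
    for t in range(16, 64):
        w.append((small_sigma1(w[t - 2]) + w[t - 7]
                  + small_sigma0(w[t - 15]) + w[t - 16]) & MASK32)
    return w
```

Each new word depends only on the previous 16 words, which is why a 16-position shift register is sufficient to hold the state of the expander.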
4) BLOCK CHAINING
At the end of all iterations, the current partial hash value needs to be updated with the output of the processing of the current block to produce the new partial hash value. This operation is performed by means of a modular addition of each accumulator variable with the corresponding word of the current partial hash value.
The new partial hash value is provided as the input to the processing of the next block, if any, otherwise it coincides with the final hash output. The fact that the partial hash value is updated at the end of the processing of each PDB, and then used to initialize the processing of the subsequent PDB, implies that message blocks need to be processed sequentially. This is an obstacle for aggressive parallelization, and also the reason why many hardware implementations deal only with the hash of a single PDB.
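To make the sequential dependency between blocks concrete, the following self-contained sketch models the complete SHA-256 flow in software: padding, expansion, 64 compressor rounds, and the final chaining sum. It is a reference model under our own naming, not the workbench HDL; the constants are derived from their definitions in the standard [1] to keep the listing short.

```python
import math

MASK = 0xFFFFFFFF


def _primes(count):
    out, c = [], 2
    while len(out) < count:
        if all(c % p for p in out):
            out.append(c)
        c += 1
    return out


def _icbrt(n):
    """Largest integer r with r**3 <= n."""
    lo, hi = 0, 1 << (n.bit_length() // 3 + 1)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        lo, hi = (mid, hi) if mid ** 3 <= n else (lo, mid - 1)
    return lo


# Fractional bits of cube roots (K) and square roots (initial hash value).
K = [_icbrt(p << 96) & MASK for p in _primes(64)]
H0 = [math.isqrt(p << 64) & MASK for p in _primes(8)]


def rotr(x, r):
    return ((x >> r) | (x << (32 - r))) & MASK


def compress(h, block):
    """Process one 64-byte block and return the new partial hash value."""
    w = [int.from_bytes(block[i:i + 4], "big") for i in range(0, 64, 4)]
    for t in range(16, 64):
        s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3)
        s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10)
        w.append((s1 + w[t - 7] + s0 + w[t - 16]) & MASK)
    a, b, c, d, e, f, g, hh = h
    for t in range(64):
        t1 = (hh + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))
              + ((e & f) ^ (~e & g)) + K[t] + w[t]) & MASK
        t2 = ((rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))
              + ((a & b) ^ (a & c) ^ (b & c))) & MASK
        a, b, c, d, e, f, g, hh = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    # Block chaining: word-wise modular addition with the previous value.
    return [(x + y) & MASK for x, y in zip(h, (a, b, c, d, e, f, g, hh))]


def sha256(message: bytes) -> bytes:
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += (8 * len(message)).to_bytes(8, "big")
    h = list(H0)
    # Each block needs the partial hash of the previous one: no parallelism.
    for offset in range(0, len(padded), 64):
        h = compress(h, padded[offset:offset + 64])
    return b"".join(x.to_bytes(4, "big") for x in h)
```

The loop in `sha256` cannot be parallelized across blocks of the same message, whereas independent messages can be processed concurrently.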
B. APPLICATIONS OF SHA-2
Hash functions are a flexible cryptographic tool [3] and, as such, they have always been a fundamental component for a number of security services, such as Hash-Based Message Authentication Code (HMAC) [4], Digital Signature Algorithm (DSA) [5], message authentication through the Authentication Header (AH) in IPSec [6], Pseudo-Random Number Generation (PRNG) through Deterministic Random Bit Generators [7], and so forth. Security standards issued by NIST require that the involved hash operation is based on one of the NIST-approved functions, i.e. SHA-1, SHA-2, or SHA-3. While the relatively recent SHA-3 relies on a complete re-design of the hashing algorithm and has not reached widespread adoption yet, the SHA-2 algorithm family is immune to any known cryptographic vulnerability, as opposed to SHA-1 [8] and MD5 [9], [10], and is currently the preferred choice for all the above mentioned cryptographic applications. Of course, the increased robustness of SHA-2 comes at the price of a higher computational cost. This can potentially affect the performance of security applications relying on SHA-2, for example IPSec and its underlying HMAC algorithm [11], [12], pointing out the need for hardware acceleration in contexts where massive hashing is required.
But security protocols are not the only field of adoption for cryptographic hash functions. In particular, SHA-2 has also been chosen as the basic building block for a few innovative applications inherently requiring hashing operations, mainly because of their one-way, collision-resistance, and pseudo-randomness properties.
VOLUME 7, 2019
1) BLOCKCHAINS
The blockchain technology has become today a popular innovative application. It relies on hash functions to ensure the integrity of the distributed ledger [13] . The hash of each block is included in the subsequent block, so that any modification of a block invalidates its hash and the hash of all the following blocks in the chain. SHA-2, particularly SHA-256, is chosen as the underlying hash function in many blockchains as it is deemed secure yet computationally efficient and thus ideally suited for use in the blockchain application [13] . The integrity guarantees of the ledger rely on its collision resistance property, since a collision would make it possible to alter the block content without changing its hash.
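The integrity mechanism described above can be sketched with a toy hash chain built on Python's standard `hashlib` module. The block layout (a dictionary with `prev`, `payload`, and `hash` fields) is a simplification assumed here for illustration and does not correspond to the block format of any real blockchain.

```python
import hashlib


def block_hash(prev_hash: bytes, payload: bytes) -> bytes:
    """Hash of a block: SHA-256 over the previous hash and the payload."""
    return hashlib.sha256(prev_hash + payload).digest()


def build_chain(payloads):
    """Link the payloads so that each block stores the previous block's hash."""
    chain, prev = [], bytes(32)  # all-zero hash for the first block
    for payload in payloads:
        digest = block_hash(prev, payload)
        chain.append({"prev": prev, "payload": payload, "hash": digest})
        prev = digest
    return chain


def first_invalid_block(chain):
    """Return the index of the first inconsistent block, or None if valid."""
    prev = bytes(32)
    for index, block in enumerate(chain):
        recomputed = block_hash(block["prev"], block["payload"])
        if block["prev"] != prev or recomputed != block["hash"]:
            return index
        prev = block["hash"]
    return None
```

Altering a payload invalidates that block; recomputing its hash to hide the change merely moves the inconsistency to the following block, whose stored `prev` no longer matches.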
Furthermore, SHA-256 is also the underlying function of the distributed consensus protocol in the Bitcoin blockchain [14], which has prompted the popularity of the blockchain concept itself and stimulated the introduction of newer blockchain schemes [15]. The protocol, called Nakamoto consensus [15], is a competitive process where the first node which finds a new block gets paid with newly mined currency, hence the name mining given to the process. This places an extremely high demand on the performance of the hash algorithm, driving the development of a whole industry of dedicated hardware accelerators [16], [17]. In addition to the processing power, energy efficiency must also be taken into account, since the energy spent in the mining process is a cost which reduces the revenues of mining. An excessive increase in power consumption can wipe out any profit obtained from a faster miner.
2) INTERNET OF THINGS
A further innovative application field for hash functions is the IoT [18] , for example targeting the biomedical sector [19] , [20] as well as other areas where security is of paramount importance due to the sensitiveness and potential life-criticality of the involved data. An example is given in [19] , where SHA-256 is employed as a PRNG to generate random encryption keys for medical images sent over the cloud from wireless sensors. Although SHA-1 is still approved for use as a PRNG [2] , SHA-256 was preferred in this work due to its larger number of output bits, which determines a longer key and therefore a higher security level. Another application is presented in [20] , where SHA-256 is employed as the underlying hash function for the authentication of the nodes in the IoT network, employing the algorithm presented in [21] .
Moreover, IoT applications make increasing use of the blockchain technology to enforce integrity and nonrepudiation of data communications [18] , [22] , or to implement smart contracts [23] in distributed environments such as an IoT network. The adoption of a blockchain in IoT applications is challenging due to the resource limitations of the involved devices. These often have reduced computation capabilities [18] , meaning that their processors struggle to perform complex cryptographic algorithms such as SHA-2 [24] , let alone the Proof-of-Work of the Nakamoto consensus [22] . This limitation can be addressed by a hardware accelerator which, however, must be designed by taking into account the power budget available for the device. In fact, power consumption is a fundamental issue in IoT scenarios, since the battery life is typically limited and often determines the very lifetime of the device itself.
III. RELATED WORK
The variety of applications relying on hash functions and their wide range of different requirements has driven the interest in the design of dedicated SHA-2 accelerators. The technical literature includes proposals focusing on the hash implementation alone as well as complete architectures where the hash function is a component of a more complex design.
A detailed description of a basic hardware implementation of the SHA-2 family is given in [25] , which employs hardware reuse of a single transformation round block to perform all the required iterations of a SHA-2 hash function. The work is further improved in [26] , where the architecture is modified so as to perform both SHA-256 and SHA-512 with the same hardware. The choice between the two is made at run-time, according to the user's request.
A different approach is proposed in [27] , implementing the design methodology presented in [28] . This methodology is derived from the structure of SHA-1 and SHA-2, which share the same underlying architecture, and can be applied to other algorithms having commonalities with these hash functions. The key idea is not to apply pipelining in the standard way, i.e. by distributing different iterations of the compression function among different stages, but rather within the iteration round. The accumulator register of the compressor is redesigned as a shift register, similarly to the working register of the expander, and quasi-pipelining registers are introduced within the compressor function so as to shorten the critical path. Although the resulting stage achieves improved speed, as a downside this architecture cannot take advantage of classical pipelining, which can compensate for the frequency limitation of a single round by processing multiple messages simultaneously.
Yet another approach is exploited in [29]. The precomputation technique takes advantage of the SHA-2 iteration function structure to compute in a previous iteration some values that are useful in a subsequent iteration. This results in lighter logic, and hence a reduced critical path for the iteration round, at the price of a more complicated initialization step, where the precomputed values for the first iteration must be generated. Further architectures for the transformation round block are proposed in [29] and compared on a common platform. In that respect, [29] resembles the philosophy of our workbench, aiming to factor out different design aspects and facilitate the optimization of the transformation round. However, [29] compares only single-cycle architectures, with or without precomputation but without unrolling.
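The idea of precomputation can be illustrated with one commonly used form of the technique, which is not necessarily the exact scheme adopted in [29]: since the working variable H of round t+1 is simply G of round t, the partial sum H + K + W needed by round t+1 can be prepared during round t. The sketch below, with illustrative names, checks that this rearrangement leaves the result unchanged; the values of K and W are arbitrary here, since the equivalence does not depend on them.

```python
MASK = 0xFFFFFFFF


def rotr(x, r):
    return ((x >> r) | (x << (32 - r))) & MASK


def big_sigma0(x):
    return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22)


def big_sigma1(x):
    return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25)


def ch(x, y, z):
    return (x & y) ^ (~x & z)


def maj(x, y, z):
    return (x & y) ^ (x & z) ^ (y & z)


def run_plain(state, ks, ws):
    """Reference rounds: T1 is a five-operand sum inside each round."""
    a, b, c, d, e, f, g, h = state
    for k, w in zip(ks, ws):
        t1 = (h + big_sigma1(e) + ch(e, f, g) + k + w) & MASK
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
    return (a, b, c, d, e, f, g, h)


def run_precomputed(state, ks, ws):
    """Same rounds, with delta = H + K + W prepared one round in advance."""
    a, b, c, d, e, f, g, h = state
    delta = (h + ks[0] + ws[0]) & MASK  # extra initialization step
    for t in range(len(ks)):
        # G_t becomes H_{t+1}, so the next delta needs no result of this round.
        last = t + 1 == len(ks)
        next_delta = 0 if last else (g + ks[t + 1] + ws[t + 1]) & MASK
        t1 = (delta + big_sigma1(e) + ch(e, f, g)) & MASK  # three operands only
        t2 = (big_sigma0(a) + maj(a, b, c)) & MASK
        a, b, c, d, e, f, g, h = (t1 + t2) & MASK, a, b, c, (d + t1) & MASK, e, f, g
        delta = next_delta
    return (a, b, c, d, e, f, g, h)
```

In hardware, `next_delta` would be computed in parallel with the round logic and stored in an additional register, which shortens the adder chain on the critical path.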
The framework presented in this paper, instead, allows for evaluating and comparing a wider range of designs, supporting both pipelining and single-cycle architectures, including for example unrolled and non-unrolled architectures.
In [30] a highly optimized round architecture is proposed for the compressor pipeline stage. The work makes use of variable precomputation, spatial reordering, and data prefetching techniques. Spatial reordering moves the pipeline register to the middle of the round, so as to store intermediate values in it and, most importantly, alter the critical path in the circuit. Spatial reordering has the interesting side effect of placing the adder of the final sum in parallel with the adders in the precomputing circuitry, hence removing them from the critical path. This allows avoiding a separate pipeline stage for the final sum. Data prefetching refers to the fact that the values K t , which are constants established by the SHA-2 standard, and W t , which are derived from the input message M j according to Equation 6 , are precomputed in advance of their usage. Therefore, their sum K t + W t , which appears within the Compressor as shown in Equation 2, can be precomputed.
The architecture of [30] is further optimized in [31] and [32], with the addition of the unrolling technique and the use of Carry-Save Adders, which allow a three-operand addition to be performed in almost the same time as a two-operand addition. With unrolling, more than one iteration of the hash function is performed in a single clock cycle. Unrolling an algorithm has the obvious consequence of increasing the area occupation, but it can enable other optimizations, improving the resulting area efficiency. SHA-2 is an ideal candidate for unrolling, since it updates only two working variables per iteration out of eight. In [31], a preliminary study on the unrolling factor is performed, concluding that 2 is the best factor according to area efficiency, but if a reasonable area penalty can be sustained in favor of performance, unrolling factor 4 is preferred.
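The principle behind a Carry-Save Adder can be shown with a bit-level software model: three operands are reduced to a sum word and a carry word using only bitwise logic, with no carry propagation, so that a single conventional adder is needed at the end. The function names are illustrative.

```python
MASK = 0xFFFFFFFF


def carry_save_add(a, b, c):
    """Reduce three 32-bit operands to a (sum, carry) pair without carry propagation."""
    partial_sum = a ^ b ^ c
    # A carry is generated wherever at least two input bits are set;
    # it has the weight of the next bit position.
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK
    return partial_sum, carry


def add3(a, b, c):
    """Three-operand modular addition: one CSA level plus one ordinary adder."""
    partial_sum, carry = carry_save_add(a, b, c)
    return (partial_sum + carry) & MASK
```

The delay of `carry_save_add` is that of a single full adder regardless of the word width, which is why chains of additions such as those in the compressor benefit from it.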
For the sake of completeness, it is worth mentioning further approaches to the acceleration of the SHA-2 hash function based on application-specific processors. In fact, while all the designs cited above can be regarded as dedicated coprocessors, the alternative approach is to design a software-programmable processor with an extremely reduced instruction set that matches the computational patterns in SHA-2 much better than a general-purpose processor. An example of this approach is presented in [33] . Such approaches are not addressed by our work since, while being more flexible in terms of supported algorithms, they cannot reach the performance and energy efficiency levels that can be achieved by a pure hardware accelerator.
IV. THE SHA-2 WORKBENCH
Most of the designs proposed in the literature [25] , [26] , [29] - [32] follow the same approach to the hardware implementation, i.e. focusing on the optimization of the transformation round core, the component in charge of implementing Equations 1 to 5, then building the whole hashing circuit around the optimized round core. As explained in Section III, each design exploits one or several of the following techniques:
• improved arithmetic components, for example Carry-Save Adders (CSAs);
• pipelining;
• loop unrolling;
• variable precomputation within the round;
• data prefetching (system-level);
• spatial reordering.
Despite the common approaches and the functional compatibility between the various designs, it is still challenging to compare architectures presented in different articles, for two main reasons. First, each work develops its own control and supporting circuitry, which influences the reported performance of the proposal. Second, the experimental results are also influenced by the specific target technology and synthesis toolchain, whose impact is deeply intertwined with purely architectural aspects.
The above issues are both tackled in this work by means of a newly introduced SHA-2 workbench, a flexible, easy-to-use exploration framework for the evaluation and comparison of different alternatives for the hardware implementation of SHA-2. The evaluation platform we propose provides a control and supporting circuit flexible enough to accommodate a wide range of different designs. By contrast, previous works such as [29] only provide support for a limited number of design techniques. A particular implementation of the transformation round core which is captured in an HDL can be easily plugged into the framework and synthesized for a given target. The common evaluation platform facilitates the task of comparing different designs factoring out the impact of the target hardware technology and related software toolchain, which typically introduce a great deal of variability and unpredictability in the design performance. The workbench also ensures that the obtained results solely depend on the optimization techniques implemented by the design proposal of the round core, effectively supporting an extensive architectural exploration for SHA-2 implementations. An architecture which fulfills a given set of constraints can be found according to the process outlined in Figure 1. Once the designer has chosen a transformation round core, a full hash circuit architecture can be obtained by simply tuning the parameters of the architecture. The resulting circuit can be implemented against a target FPGA. If all the design constraints are met, the exploration stops. Otherwise, another iteration of the exploration is performed, where the same transformation round core can be evaluated with different architectural parameters. The designer may also choose to insert a newly developed transformation round core in the evaluation loop. In fact, the proposed SHA-2 workbench also facilitates the development of new transformation round cores, since the designer can focus solely on the implementation of their own optimizations and then properly configure the framework to obtain a complete hash circuit.
FIGURE 2. Top level entity of the proposed evaluation platform.
V. WORKBENCH ARCHITECTURE
The architecture of the evaluation platform proposed in this paper implements a SHA-2 hash core taking as input a full PDB and generating as output the corresponding hash value. The framework may be configured as a SHA-256 or SHA-512 hash core, while padding and multiple block handling are performed externally to the SHA-2 core. Figure 2 shows the datapath of the configurable architecture. The workbench may be configured to employ n-stage pipelining, in which case the same combinatorial block for round computation is replicated throughout the stages and each stage performs a fraction of the total number of iterations. With pipelining, multiple PDBs belonging to different messages can be processed simultaneously, whereas the processing of successive PDBs of the same message is strictly sequential, as observed in Section II. On the other hand, when pipelining is disabled, the combinatorial circuitry is instantiated only once to perform all the iterations.
The architecture is composed of two parallel pipelines, one for the Compressor and one for the Expander, driven by a counter whose carry output works as the major cycle signal for the circuit. The number of pipeline stages can be configured, and can also be set to 1 to disable pipelining altogether. For each stage of the Compressor pipeline, there is also an associated ROM containing the values of the constant K corresponding to that stage.
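The relation between the configuration parameters and the round counter can be sketched as follows. The class and field names are hypothetical and do not correspond to the generics of the actual HDL description; the model also assumes that the rounds are evenly divided among the stages and that an unrolled core performs `unroll` rounds per clock cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkbenchConfig:
    """Hypothetical parameter set for one configuration of the hash core."""
    rounds: int = 64   # 64 for SHA-256, 80 for SHA-512
    stages: int = 1    # 1 disables pipelining
    unroll: int = 1    # rounds performed per clock cycle

    def cycles_per_stage(self) -> int:
        """Clock cycles between two assertions of the major cycle signal."""
        rounds_per_pass = self.stages * self.unroll
        if self.rounds % rounds_per_pass:
            raise ValueError("rounds must be divisible by stages * unroll")
        return self.rounds // rounds_per_pass

    def round_counter_bits(self) -> int:
        """Width of a counter able to count all cycles of one stage."""
        return max(1, (self.cycles_per_stage() - 1).bit_length())
```

For example, under these assumptions a four-stage SHA-256 pipeline without unrolling spends 16 cycles in each stage and needs a 4-bit round counter.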
A. COMPRESSOR
Each stage of the Compressor pipeline is an instance of the transformation round core selected by the designer. The compressor pipeline registers are expected to contain at least the eight working variables and a validity flag, which is set during the first stage and is carried to the output, in order to signal that the value of the output hash register is valid. If required by the transformation round core, the compressor pipeline can also be configured so that the pipeline registers contain additional working variables. When this is the case, an initialization unit (not shown in Figure 2 ) is instantiated before the Compressor pipeline to compute the initial values for these variables.
The compressor pipeline may be configured to work with a transformation round core which employs loop unrolling and system-level data prefetching. If the latter optimization is to be used, initial values for the K and W parameters, not used by the first stage, are forwarded to the initialization unit.
The compressor ends with the chaining sum, which may be configured to be placed into a separate stage. Both the optional final stage and the initialization unit work as separate stages even if pipelining of the transformation round core is disabled.
B. EXPANDER
Within the Expander, the round registers work as 16-position word-wide shift registers during the stage, turned into parallel registers when the major cycle signal is asserted, by means of a multiplexer array. Since the last shifted value of the stage works with the major cycle signal asserted, it is not written to the shift register. Instead, it must be captured by properly rearranging the connection with the register of the following stage, as shown in Figure 3 .
To perform unrolling, the shift register chain of each Expander stage is split into a number of chains being equal to the unrolling factor, as shown in Figure 4 , since a number of expanded words W must be generated at each clock cycle. Words are distributed among split chains cyclically with respect to their positions in the original chain.
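The cyclic distribution of words among the split chains can be sketched as below, under the assumption that the word at position p of the original 16-position chain is assigned to chain p mod UF. The function name is illustrative.

```python
def split_chain(positions, unroll):
    """Distribute the positions of the original chain cyclically over `unroll` chains."""
    if len(positions) % unroll:
        raise ValueError("the chain length must be a multiple of the unrolling factor")
    return [positions[i::unroll] for i in range(unroll)]


# Positions of the original 16-word shift register.
ORIGINAL_CHAIN = list(range(16))
```

With an unrolling factor of 2, for instance, even and odd positions end up in two chains of 8 words each, and each chain shifts once per clock cycle.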
According to Equation 6 , the 16 initial words, making up the input message, are in big-endian order, causing a reverse sorting of the input message, which must be taken in little-endian order. This makes it necessary to reverse the input of the expander, and this reversal must in turn be taken into account when splitting the expander into stages. In particular, the right shift becomes a left shift when reversed.
C. CONTROL UNIT
The Control Unit is responsible for properly driving the internal signals of the circuitry. It provides for the correct loading of the two pipelines at the very beginning of the operations of the circuit, when the major cycle signal is not active yet. Moreover, it properly enables and clears the round counter.
The finite state machines (FSMs) of the Control Unit are shown in Figure 5 and Figure 6. A specific configuration parameter determines which one is to be used. The two FSMs differ in that one uses the same major cycle signal for the two pipelines, the Compressor and the Expander, while the other moves the Expander ahead, as required by some transformation round designs, including those employing spatial reordering or variable precomputation.
FIGURE 5. FSM of the Control Unit (Compressor and Expander aligned).
FIGURE 6. FSM of the Control Unit (Expander moved ahead).
To accommodate a varying number of pipeline stages, a stage counter is employed to flush the pipeline when no new PDBs are supplied to the core for hashing. Therefore, only one state called last_stages is required within the FSM to manage pipeline flushing, which ends when the stage counter reaches its maximum. As with the round counter, the size of the stage counter is determined according to the provided configuration parameters.
VI. EXPERIMENTAL RESULTS
The proposed architecture has been described in VHDL, synthesized, as well as placed-and-routed with the Xilinx Vivado IDE for an extensive range of configurations. Since the proposed design is meant to be used as a part of a larger system, the VHDL description has been synthesized in Out of Context mode.
A. DESIGN COMPARISON AGAINST A SPECIFIC TARGET
Different design alternatives have been analyzed assuming the same target technology, in order to show how the proposed framework helps in fairly comparing different designs on the same target technology, allowing the designer to identify the best one for a given platform. The target platform considered in this section is the Xilinx Kintex UltraScale+ XCKU5P, which is a 16 nm FPGA featuring more than 200k Look-Up Tables (LUTs) [ The transformation round cores which have been considered are described below, where UF stands for the unrolling factor.
1) NAIVE
This is a straightforward implementation of the transformation round. The pipeline register is placed before all the combinatorial parts, letting the Compressor and the Expander stay aligned on the same major cycle signal. This also allows for encapsulating the combinatorial part in a single component, making this transformation round core reusable. Two combinatorial components have been considered, the former implementing a single iteration, and the latter employing loop unrolling with factor 4.
2) PRECOMPUTED (UF1)
This is an implementation of the round function with precomputation, originally presented in [29] within a nonpipelined architecture.
3) REORDERED (UF1)
This is an architecture with precomputation, system-level data prefetching, and spatial reordering, originally presented in [30] within a four-stage pipeline.
4) REORDERED (UF2)
This is an architecture with precomputation, system-level data prefetching, spatial reordering, unrolling, and CSAs, presented in [32] within a four-stage pipeline for the specific application of HMAC.
It is worth stressing that, by relying on the proposed evaluation platform, it has been only necessary to implement the internal transformation core. Then, by configuring the parameters of the SHA-2 workbench, a wide range of different architectures have been obtained with nearly zero effort.
The exploration started by comparing architectures with and without pipelining, and with and without unrolling. Table 1 summarizes the architectures that have been explored, along with the corresponding parameter configurations. Table 2 lists the post-routing implementation results for all the architectures described in Table 1 on the Kintex UltraScale+ XCKU5P FPGA for the SHA-256 hash algorithm. Table 3 lists the results of the same implementation for the SHA-512 hash algorithm. Performance is expressed in terms of the hash rate, which is equivalent to the throughput in Mbps except for the constant value of the hash size. On the other hand, efficiency values are computed with respect to the throughput in Mbps. It is worth recalling that, when n-stage pipelining is enabled, the same combinatorial circuitry is replicated throughout the stages, increasing the hash rate n times without degrading, in principle, the critical path delay of the design.
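The relation between these metrics can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hash rate is taken in millions of hashes per second, the area is counted in LUTs, and the function names and numeric inputs are hypothetical rather than taken from the tables.

```python
def throughput_mbps(hash_rate_mhps: float, hash_size_bits: int) -> float:
    """Throughput in Mbps: the hash rate scaled by the constant hash size."""
    return hash_rate_mhps * hash_size_bits


def area_efficiency(throughput: float, luts: int) -> float:
    """Throughput per unit of area (here, Mbps per LUT)."""
    return throughput / luts


def power_efficiency(throughput: float, power_w: float) -> float:
    """Throughput per unit of power (here, Mbps per watt)."""
    return throughput / power_w


def ideal_pipelined_hash_rate(hash_rate_mhps: float, n_stages: int) -> float:
    """Ideal n-stage pipelining: n times the hash rate at the same clock."""
    return hash_rate_mhps * n_stages


# Hypothetical figures, for illustration only.
base_rate = 2.0                                  # MH/s
pipelined_rate = ideal_pipelined_hash_rate(base_rate, 4)
tp = throughput_mbps(pipelined_rate, 256)        # SHA-256 digest size
```

In practice the measured hash rate of a pipelined design falls slightly short of the ideal value, because placement and routing of the larger circuit can lengthen the critical path.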
The results show that the Reordered transformation round architecture slightly outperforms the Naive implementation, while the Precomputed transformation round implementation underperforms the straightforward implementation, both in the basic (Table 2a) and pipelined (Table 2b) variants. On the other hand, the 2-unrolled variant of the Reordered transformation round architecture, despite the fact that it includes more optimizations, underperforms the unrolled Naive transformation round, both in the basic (Table 2c) and pipelined (Table 2d) variants.
Overall, the best architecture depends on the evaluation metric. The 4-stage pipeline based on a nonoptimized transformation round core, unrolled by a factor of 4 (architecture Number 4), turns out to be the architecture with the highest hash rate and the best power efficiency, but the 4-stage pipeline based on the highly optimized Reordered (UF2) transformation round (architecture Number 9) shows the best area efficiency.
B. ARCHITECTURAL EXPLORATION
As an example of effective architecture exploration enabled by the SHA-2 workbench, the analysis of the critical paths reported by the hardware synthesis tool shows that the final adder is on the critical path even when the circuit is synthesized with the Reordered cores, which theoretically should not be the case [32]. This observation raises the question of whether separating that adder with a register could benefit the critical path and hence the hash rate.
With the proposed framework, obtaining the modified architecture is as fast as switching a Boolean parameter. Designs 7 to 10 in Table 1 have therefore been reimplemented with the modified architecture. The results are reported in Table 4 for SHA-256 and Table 5 for SHA-512. The architectures labeled with a quote (') in the tables correspond to those in Table 1 having the same number, with the addition of the final stage. All but one of the architectures benefit from the introduction of the additional stage. This gain might not have been exposed by a theoretical analysis of the critical path.
C. EXPLORING A DIFFERENT TARGET
To investigate the influence of the target platform on the measured results, the evaluation was repeated targeting a different technology, namely the Xilinx Artix-7 XC7A200T device, a 28 nm FPGA [36], smaller and slower compared to the Kintex device considered in the previous section. Table 6 shows the post-route implementation results.
Some of the considerations made for the Kintex target can be repeated for the Artix-7. The Precomputed transformation round still underperforms the straightforward implementation, and the Reordered (UF2) transformation round still underperforms the unrolled Naive transformation round.
However, there are also significant differences. The Reordered transformation round no longer outperforms the straightforward implementation, either in the basic (Table 6a) or in the pipelined (Table 6b) variants; in the latter case, the Reordered transformation round even underperforms the Naive variant. More surprisingly, the addition of the final stage turns out to be counterproductive, leading to an increase in the critical path delay. This suggests that the gain observed for the Kintex implementation, which was not predicted by the theoretical analysis of the critical path, needs to be evaluated on a platform-specific basis.
VII. DISCUSSION
The described exploration leads to a number of considerations about the effect of each design decision on throughput, area occupation, and power efficiency, and most importantly on the joint metrics of area- and energy-efficiency. The latter metrics are of great importance for practical applications, since these are often constrained in terms of area occupation or power consumption, if not both, and typically place a lower bound on the throughput and an upper bound on the area occupation and/or power consumption. The resulting solution space can be explored using the corresponding joint metric as the evaluation criterion. The results obtained show that the effect of the design decisions on the evaluation metrics is not always what is expected based on purely theoretical considerations. This highlights the need for experimental comparisons, which is facilitated by exploration frameworks like the one presented in this work.
A. THROUGHPUT AND AREA CONSIDERATIONS
It is clear that loop unrolling by a factor of 4 is profitable for the straightforward implementation of SHA-256. This happens because the increase factor of the critical path delay due to the unrolling is between 3× for the Artix-7 and 3.3× for the Kintex, i.e., less than the unrolling factor 4. This is no longer the case for the Reordered implementation: the increase factor here is between 2.1× and 2.3×, which is larger than the unrolling factor 2. This suggests that a more aggressive unrolling would be profitable also for the Reordered transformation round core. It must be also stressed, however, that the Reordered (UF2) implementation involves a number of architectural differences compared to the Reordered (UF1) implementation, so it is not a simple unrolling of the same architecture.
For SHA-512, the effect of unrolling is vastly reduced, since the increase factor in the critical path becomes ∼ 3.8× for the straightforward implementation and ∼ 3.4× for the Reordered implementation.
The area occupation increase factor due to unrolling is ∼ 1.8× for the straightforward implementation and ∼ 2× for the Reordered implementation. Even in the former case, this increase factor is higher than the increase factor for throughput, which corresponds to UF/DIF, with DIF being the increase factor in the critical path delay discussed earlier in this section. This figure is ∼ 1.2× for the straightforward implementation, which is lower than the increase factor in terms of area occupation, resulting in a negative impact of loop unrolling on area efficiency.
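The arithmetic behind these figures can be reproduced with a short Python sketch. The function names are hypothetical, and pairing the 1.8× area factor with the 3.3× delay factor measured on the Kintex is an assumption made here only to illustrate the trend.

```python
def throughput_gain(uf: float, dif: float) -> float:
    """Throughput increase factor of unrolling: UF / DIF.

    uf  -- unrolling factor
    dif -- increase factor of the critical path delay
    """
    return uf / dif


def area_efficiency_gain(uf: float, dif: float, area_factor: float) -> float:
    """Change in throughput per area: values below 1 mean a loss."""
    return throughput_gain(uf, dif) / area_factor


# Straightforward SHA-256 core unrolled by 4 (delay factors from the text).
naive_kintex = throughput_gain(4, 3.3)     # about 1.21
naive_artix = throughput_gain(4, 3.0)      # about 1.33

# Reordered core unrolled by 2: the delay grows faster than the unrolling.
reordered_best = throughput_gain(2, 2.1)   # about 0.95
reordered_worst = throughput_gain(2, 2.3)  # about 0.87

# Area efficiency of the unrolled straightforward core (assumed pairing).
naive_area_eff = area_efficiency_gain(4, 3.3, 1.8)  # about 0.67
```

A throughput gain of about 1.2× obtained at an area cost of about 1.8× leaves roughly two thirds of the original area efficiency, which is the negative impact noted above.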
Resorting to an n-stage pipelined architecture, which essentially replicates n times the same combinatorial block available for round computation, provides the expected benefit of improving the hash rate by a factor n, only slightly reduced by a small increase in the critical path delay. However, the increase in area occupation turns out to be lower than the number of pipeline stages, with factors varying between 2.8× and 3.5× (lower values for the Artix-7, higher values for SHA-512). Therefore, pipelining turns out to be effective for improving area efficiency.
B. POWER CONSIDERATIONS
Tables 2 to 5 show the power consumption data broken down into static and dynamic power consumption, because the two components are influenced by different aspects. Static power depends on the target device as well as some of its electrical characteristics like the voltage supply and some operating conditions, like the air conditioning. In our experiments, we kept the default values for these parameters, since we are interested in the impact on power consumption of architectural design decisions. Therefore, the static power mostly remains an inherent characteristic of the device, only slightly influenced by the area occupation.
The dynamic power is far more interesting, since it can be influenced in complex ways by the different design decisions. In fact, the dynamic power consumption at the node level can be written as

$$P_{dyn} = \alpha \, C_L \, V^2 \, f$$

where V is the switching voltage, C_L is the load capacitance of the node, f is the operating frequency and α is the switching activity. Apart from the switching voltage, which depends on the supply voltage and hence is a parameter of the device, all the other variables are influenced by the logic of the circuit (the load capacitance being influenced by the fan-out of the node).
Among the explored round core designs, the Precomputed implementation has the lowest power consumption, with a 0.8× factor against the straightforward implementation in the SHA-256 implementation targeting the Artix-7, and a 0.6× factor in the SHA-512 implementation, whereas the Reordered implementation has the same or slightly higher power consumption than the Naive implementation.
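The dependence of the node-level dynamic power on these variables can be illustrated with a minimal Python sketch. It assumes the common form of the CMOS switching power relation, the product of switching activity, load capacitance, squared voltage, and frequency, without any constant factor; all numeric values are hypothetical and chosen only to show the scaling.

```python
def dynamic_power(alpha: float, c_load: float, v: float, f: float) -> float:
    """Node-level dynamic power in watts (assumed form: alpha * C_L * V^2 * f).

    alpha  -- switching activity (dimensionless)
    c_load -- load capacitance in farads
    v      -- switching voltage in volts
    f      -- operating frequency in hertz
    """
    return alpha * c_load * v ** 2 * f


# Hypothetical node: 10 fF load, 0.85 V swing, 10% switching activity.
p_full = dynamic_power(0.1, 10e-15, 0.85, 400e6)
p_half = dynamic_power(0.1, 10e-15, 0.85, 200e6)   # halved clock frequency
p_2v = dynamic_power(0.1, 10e-15, 1.70, 400e6)     # doubled voltage
```

Halving the frequency halves the dynamic power of the node, whereas doubling the voltage quadruples it. This is why design choices that lower the clock frequency, or reduce the number of switching nodes, act directly on this metric.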
Moreover, the experimental results show a direct dependence of the dynamic power consumption on the number of flip-flops. Therefore, n-stage pipelined configurations, which need to buffer data across the n instances of the combinatorial round block, have an adverse effect on this metric, only marginally compensated for by the slight frequency reduction incurred by the larger design. In fact, the increase factor due to pipelining ranges from 3.6× to 4×.
On the other hand, loop unrolling is beneficial, since it reduces the operating frequency of the circuit. For the straightforward implementation, the power consumption is 0.7× to 0.8× that of the corresponding non-unrolled implementation. On the contrary, the Reordered implementation does not benefit from unrolling, since the Reordered (UF2) variant shows an increase in the number of flip-flops which offsets the benefit of the lowered frequency. This explains why the Naive transformation round core, unrolled by a factor of 4 and within a 4-stage pipeline, achieves the best power efficiency.
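The combined effect on power efficiency can be approximated from the factors quoted in this section, as in the following Python sketch. It considers dynamic power only and ignores static power, the function name is hypothetical, and pairing the extreme throughput and power factors is an assumption that merely bounds the possible range.

```python
def power_efficiency_gain(throughput_factor: float, power_factor: float) -> float:
    """Relative change in throughput per watt: values above 1 mean a gain."""
    return throughput_factor / power_factor


# Straightforward SHA-256 core unrolled by 4: throughput grows by 4/3.3 to 4/3,
# while the dynamic power drops to 0.7x-0.8x of the non-unrolled design.
unroll_low = power_efficiency_gain(4 / 3.3, 0.8)    # about 1.52
unroll_high = power_efficiency_gain(4 / 3.0, 0.7)   # about 1.90

# Ideal 4-stage pipelining: throughput grows by 4x, power by 3.6x-4x.
pipe_low = power_efficiency_gain(4.0, 4.0)          # 1.0
pipe_high = power_efficiency_gain(4.0, 3.6)         # about 1.11
```

Under these assumptions, unrolling the straightforward core improves throughput per watt by roughly 1.5× to 1.9×, whereas pipelining leaves it almost unchanged.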
VIII. CONCLUSIONS
While the SHA-2 primitive is adopted in a large range of applications, the different solutions for its acceleration are often difficult to evaluate and compare because of the variety of requirements in terms of performance, resources, and energy consumption as well as the decisive impact of the particular hardware technology used for the implementation of the SHA-2 accelerator. Exposing a comprehensive range of parameterized architectural choices, the SHA-2 workbench proposed in this work turned out to be very effective for the systematic exploration of different combinations of architectural choices and implementation styles as well as the understanding of their performance and energy implications.
The use of the framework even allowed the identification of nonobvious matches between architectural choices and target technologies optimizing hash rate and area efficiency figures.
