Abstract. The U.S. National Institute of Standards and Technology encouraged the publication of works that investigate and evaluate the performance of the second round SHA-3 candidates. Besides the hardware characterization of the 14 candidate algorithms, the main goal of this paper is the description of a reliable methodology to efficiently characterize and compare VLSI circuits of cryptographic primitives. We took the opportunity to apply it to the ongoing SHA-3 competition. To this end, we implemented several architectures in a 90 nm CMOS technology, targeting high- and moderate-speed constraints separately. Thanks to this analysis, we were able to present a complete benchmark of the achieved post-layout results of the circuits.
Introduction
In 2007, the U.S. National Institute of Standards and Technology (NIST) started a public competition aiming at the selection of a new standard for cryptographic hashing [9]. Hash functions are cryptographic primitives that generate a sort of digital fingerprint of an arbitrary-length file, following some fundamental principles. Due to their flexibility, hash functions are used in a wide range of communication protocols where they provide data integrity, user authentication and many other security features. The motivation behind the NIST competition has been the growing concern about the security of two widely deployed hash functions, MD5 and SHA-1, following a series of successful attacks [1, 2, 12]. The structural similarity of MD5 and SHA-1 with the current standard SHA-2 encouraged NIST to start a new evaluation and selection process similar to the competition which promoted the Rijndael block cipher as the new Advanced Encryption Standard (AES) in 2001. The cryptographic community was asked to propose new hash functions and to evaluate the security level of other candidates. In 2008, a total of 51 functions were accepted to the first round, while in July 2009 this number was reduced to 14 second round candidates. The final decision, i.e., the proclamation of the winning algorithm, has been scheduled for 2012. To this end, the organizers are not only interested in the cryptographic strength of the candidates but also in the evaluation of the performance of the algorithms implemented on different platforms. The new SHA-3 standard is indeed expected to provide at least the security of SHA-2 with significantly improved efficiency. Several applications, from multi-gigabit mass storage devices to radio-frequency identification (RFID) tags, are expected to utilize SHA-3. It is therefore crucial that the final SHA-3 function be flexible enough to be used in both high-performance and resource-constrained environments.
From a pure hardware point of view, the SHA-3 algorithm should provide good performance in terms of speed, area, and power.
Our interest in the SHA-3 selection process started with our involvement with the development of the candidate algorithm BLAKE. We participated in the algorithm specification, providing relevant information on the hardware performance and possible optimizations in this direction. When the SHA-3 competition entered the second phase, we started a VLSI characterization of several candidates within three separate student projects at our institute. The resulting designs were manufactured in three different ASICs, each containing a dedicated interface for I/O communication and the selected algorithms. At this time, we had implemented twelve out of fourteen candidate algorithms (all apart from ECHO and SIMD). We then decided to extend the analysis to all candidate algorithms.
In this paper we develop and present one methodology to evaluate the ASIC implementation of all SHA-3 second round algorithms. Rather than going for extremes of performance (fastest or smallest implementation) we propose to optimize all algorithms for multiple clearly defined specifications. We have applied our methodology and have evaluated several architectural variations of all candidate algorithms and presented the results.
The organization of the paper is as follows: A discussion of our methodology is the focus of Sect. 2. We present our approach to a fair comparison and provide details and reasoning for key design decisions. Implementation details are given in Sect. 3. Due to limited space we were unable to provide implementation details for the architectures; an abbreviated summary of all architectures is provided in the Appendix. The results of our evaluation are presented in Sect. 4 together with a subsection that explains the sources of error in our methodology. We hope that this "open" approach will allow independent researchers to validate our findings. Finally, Sect. 5 contains our concluding remarks.
Evaluation Methodology
In this work we will attempt to make a fair comparison between VLSI implementations of a set of algorithms all of which realize a similar function, but have very different structures. The main difficulty in this particular evaluation is the lack of concrete hardware specifications for the secure hash function candidates.
In practice, the specifications of the hardware are determined by the application. The hardware designers can then make several well-known trade-offs to come up with a design that offers the best compromise between the required silicon area, the amount of energy required for the operation and the throughput/latency of the operation. For this study the requirements state efficient hardware implementation without being specific 4 .
In some cases, such as telecommunication algorithms which have to fulfill requirements of certain well-defined standards, the application field alone sets sufficient constraints on the system. However, cryptographic functions, like the SHA-3 hash function candidates that are the topic of this paper, are used for a very wide range of applications with different requirements. This makes it difficult to determine which of the performance parameters is more important. A hash function that is part of a battery operated wireless transmitter would probably be optimized for energy consumption, while the same algorithm when implemented in a telecommunication base station would most likely favor a high-throughput realization.
For comparative studies, if concrete specifications are not present, the authors will usually determine one parameter to be more important (e.g., throughput) [7, 8, 11], or will come up with aggregate performance metrics such as throughput per mm² [3, 5, 10]. Both approaches have their problems. Focusing on one parameter will favor algorithms which are strong on that parameter (e.g., throughput), but will not reward algorithms which perform better in other scenarios. Aggregate performance metrics, on the other hand, may end up hiding the absolute performance of an implementation, and impractical design corners (e.g., very large area, very low throughput) may distort the results.
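The masking effect of an aggregate metric can be illustrated with a small numerical sketch. The two designs and their figures below are hypothetical, chosen only for illustration; they are not results from this evaluation.

```python
# Hypothetical designs: (throughput in Gbps, area in mm^2). Not measured data.
designs = {
    "large_fast": (20.0, 1.00),
    "tiny_slow": (0.2, 0.01),
}


def throughput_per_area(throughput_gbps, area_mm2):
    """Aggregate metric: throughput per unit area, in Gbps/mm^2."""
    return throughput_gbps / area_mm2


for name, (tput, area) in designs.items():
    score = throughput_per_area(tput, area)
    print(f"{name}: {score:.1f} Gbps/mm^2")
```

Both hypothetical designs obtain the same aggregate score, although they differ by a factor of 100 in both throughput and area, so the metric alone cannot tell which one suits a given application.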
In the following subsection we will first define the performance metrics that we will consider in this evaluation. The next step will be to define specifications that will set limits on these performance metrics.
Performance Metrics
The most common metrics for hardware include the operation speed, the circuit area and the power consumption. For this analysis we have decided to use the following three main metrics for performance:
-Circuit Area
Generally speaking, the cost of an ASIC implementation of a function for a particular technology directly depends on the area required to realize the function 5 . In this evaluation we will use the net circuit area of a placed and routed design, including the overhead for power routing and clock trees. The area will be reported in kilo gate equivalents (kGE), where a gate equivalent corresponds to the area of a nominal drive strength 2-input NAND (or NOR) gate in the standard cell library used for the design realization. This metric covers the evaluation criterion 4.B.ii Memory requirements in the NIST specification [9].
-Throughput
We need a measure to determine how fast the implementation is. To this end we define the throughput of a hash function as the amount of message (input information) in bits for which a message digest can be computed per second. Furthermore, we assume that the hash function has been properly initialized, and the message sizes are matched to individual candidate functions for best case performance. The throughput numbers are given in Gigabits per second (Gbps). This metric covers the evaluation criterion 4.B.i Computational Efficiency in the NIST specification [9].
-Energy Consumption
Power and energy metrics have gained more importance in recent years. On one hand there are power density limits that circuits have to comply with in sub-100 nm technologies, and on the other hand, for systems with scarce energy resources (handheld devices, smartcards, RFID devices etc.), reduced energy consumption translates into increased functionality or longer operating time. In this evaluation we will consider the energy consumption as our metric and will calculate the energy per bit of input information processed by the hash function. This will be obtained by dividing the total power consumption (in Watts) by the throughput (Gigabits/s) described above. The energy consumption will be given in milli Joules per Gigabit (mJ/Gbit). This metric partly covers the evaluation criterion 4.C.i.b Flexibility in the NIST specification [9], as the energy efficiency is a deciding factor for implementation in constrained environments.
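The throughput and energy metrics can be written out as a short calculation. The sketch below assumes an iterative core that consumes one message block every fixed number of clock cycles; this relation, the function names and all numbers are illustrative assumptions and are not taken from the evaluated designs.

```python
def throughput_gbps(block_bits, cycles_per_block, clock_mhz):
    """Message bits hashed per second, in Gbit/s.

    Assumes the core accepts one block of `block_bits` bits every
    `cycles_per_block` clock cycles (long, already padded message).
    """
    bits_per_second = block_bits * clock_mhz * 1e6 / cycles_per_block
    return bits_per_second / 1e9


def energy_mj_per_gbit(power_mw, tput_gbps):
    """Energy per processed message bit: power divided by throughput.

    mW / (Gbit/s) = mJ/Gbit.
    """
    return power_mw / tput_gbps


# Hypothetical core: 512-bit blocks, 32 cycles per block, 500 MHz, 40 mW.
tput = throughput_gbps(block_bits=512, cycles_per_block=32, clock_mhz=500.0)
energy = energy_mj_per_gbit(power_mw=40.0, tput_gbps=tput)
print(f"throughput = {tput:.2f} Gbps, energy = {energy:.2f} mJ/Gbit")
```

Expressing the power in milliwatts keeps the result directly in mJ/Gbit, the unit used for the energy figures in this paper.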
SHA-3 parameters
The SHA-3 Minimum Acceptability Requirements state that all candidates should support message digest sizes of 224, 256, 384, and 512 bits, and support a maximum message length of at least 2^64 − 1 bits. All algorithms process the message in blocks. The so-called message block size differs from algorithm to algorithm. In addition, several submissions have included a salt input that can be used as a parameter in the hash function. In our evaluation we have chosen:
-Message Digest Size of 256
Several algorithms use (slightly) different architectures for different output lengths. Additional circuitry is then required to support all possible digest sizes. By selecting a single length, we aim to focus on the core algorithm which also simplifies certain architectural decisions. Out of the four required sizes, we have eliminated 224 and 384 as they are not a power of two (always an advantage in hardware design). We have settled on 256 as it will usually result in smaller hardware and faster implementations.
-Use the largest message block size available
For each algorithm we have used the largest message block size and we have assumed that the message has already been padded (i.e. the length of the padded message is an exact multiple of the message block size). For throughput computation we always give the maximum achievable values, e.g., very long message for algorithms that have an initialization procedure.
Since not all algorithms provide such an input, we have not included any salt inputs. For algorithms that provide a salt, the inputs are set to their default values according to the specification, and these constants have been propagated during synthesis to allow further optimizations whenever possible.
Defining Specifications
As mentioned earlier, the main difficulty in this evaluation is the lack of precise specifications that the candidate algorithms have to fulfill. Hardware design is based on finding a compromise between competing parameters that determine circuit performance. For example, there are several architectural transformations that increase the throughput at the expense of the circuit area (see [6]). Without guiding specifications, it is difficult to determine which of the circuit metrics is more important for a design.
In summary, the NIST specifications in [9] require the candidate algorithms to be computationally efficient (4.B.i), to have limited memory requirements (4.B.ii), and to be flexible (4.C.i) and simple (4.C.ii) 6 .
The classical way to perform this analysis would be to concentrate on only the throughput metric and try to find out which algorithms are the fastest. In the last year, several groups presented comparative works and, almost certainly, others will be publishing new results to this effect. However, if only the maximum throughput requirement is investigated, the flexibility of candidate algorithms may not be visible. Therefore we suggest using two separate specifications: an aggressive high-throughput target and a moderate-throughput target.
The high throughput target has been chosen to be beyond the expected performance of most algorithms, and would therefore still be able to rank the algorithms in their maximum throughput capability. Our observation has been that even with older fabrication technologies, such as 180 nm CMOS, several candidate algorithms are able to reach throughputs of multiple Gigabits/s.
There are certainly applications which could make use of such throughputs; however, such data rates are far beyond the requirements of many applications. For the moderate throughput requirement we have decided to use a throughput which is at least two orders of magnitude lower than that used in the first case.
Fixing one of the performance metrics allows us to make a fairer comparison between the remaining performance metrics (area and energy), and by considering two distinct throughput targets, we hope to uncover the flexibility of the candidate algorithms for different operational requirements. In particular, we will be interested in the circuit area for our high-throughput target, while we will be more interested in the energy consumption for our moderate-throughput target.
The maximum achievable throughput by a circuit implementing a cryptographic algorithm depends on the specific technology into which the circuit will be mapped. A throughput value that is easily achieved in a 65 nm process may not be feasible at all when using a 180 nm process. Therefore the specifications for our two scenarios have to be chosen while considering the capabilities of our target process.
We have decided to use the 90 nm CMOS process by UMC with the free libraries from Faraday Technology Corporation, mainly because we already had experience in designing ASICs with this technology and it was readily available within our design environment at the time of this study.
Our experiences from designing the three ASICs (one of which was manufactured using this target technology) have given us a good estimation for the expected performance of all algorithms in the 90 nm process. We have decided to use 20 Gigabits/s for our high throughput target and 0.2 Gigabits/s for our moderate performance specifications. In the high-speed mode, almost all designs should be pushed to their speed limit, while with the latter we could evaluate the scalability and therefore the flexibility of each candidate algorithm.
ASIC Realizations
During this work twelve out of the fourteen second round SHA-3 candidates (some with several architectural variations) were fabricated in three different ASICs as shown in Fig. 1 . Table 1 shows a list of algorithms that were implemented and their performances measured on the manufactured chips.
Actually implementing the designs in real silicon is certainly the best way to validate a design and determine its true potential. However, during this work we have realized that several practical factors have affected these results. The maximum available silicon area (that can be afforded for this project), the total number of I/O pins, the capabilities of the test infrastructure that is available for the test of the ASIC have all set limits on the implementations.
Since none of the designs was large enough to merit its own ASIC, each ASIC comprised several independent modules. All modules shared a common interface which provided the inputs and collected the outputs from individual hash function realizing cores. For practical reasons, cores with similar clock frequencies were grouped together and were optimized using common constraints. In many cases compromises had to be made to allow two or more cores to be optimized at the same time. All of these had non-negligible influence on the outcome. Practical considerations for testing of the systems have brought even more constraints. The necessity to include test structures (scan chains) adds some overhead, but more importantly, the maximum achievable clock rate greatly depends on the capabilities of the ASIC test infrastructure available. Designs with a high clock frequency (more than 500 MHz for 90 nm designs) impose yet other constraints. When compared to designs running at lower frequencies, these designs suffer more from clock and power distribution problems, and are difficult to test at speed.
When designing these three ASICs we were forced to make many design decisions based on practical constraints (e.g., blocks running faster than 700 MHz were deemed to be impractical within our environment), and these decisions influenced the results. Scheduling constraints have also played a role in the choice of technology used to implement the designs. For the last two ASICs, there were no feasible 90 nm MPW (Multi Project Wafer) runs available. Consequently we had to submit these designs to a 180 nm run, which in turn made direct comparisons more difficult.
For this reason we have taken the design experience from the actual implementation of the individual cores, and have decided to re-implement all cores without considering these practical limitations. In particular we have decided:
-No limits on the clock frequency
In this study we will not set any artificial limits on the clock rate. Obviously designs with high clock rates will still face the penalties for clock distribution, but we will not deal with practical considerations such as test, crosstalk and I/O limitations.
-No test structures
Testing is an essential part of IC design. The exact overhead for testing depends on many factors, such as the desired test quality, and a one-size-fits-all solution is difficult to find 7 . Since the designs in this study will not be manufactured directly, we chose not to include any test specific structures in the designs to have a fair comparison.
-Assumed an ideal interface
The candidate algorithms differ in the number of I/Os they require. We have assumed that these cores will eventually be part of a larger system which has an adequate I/O interface matching the requirements of each core. In this way, every function could reach its full potential without suffering from any external limitation. However, we made no assumptions about how long the inputs stayed valid; all required inputs were sampled by the cores at the beginning of the operation. In other words, we implemented an internal message block memory for designs that require the input to be stable for more than one clock cycle.
-No macro blocks
We have not used any macro blocks to realize look-up tables or register files for portability reasons. All look-up tables and memory blocks were realized by standard cells.
Implementation

Design flow
The same design procedure was used for all candidate algorithms. We have first developed a golden model based on the Known Answer Tests provided by the submission package. This golden model was then used to generate the stimuli vectors and expected responses that we have used to verify the RTL description of the algorithm written in VHDL.
We have then used Synopsys Design Vision-2009.06 to map the RTL description to the UMC 90 nm technology using the fsd0a_a_2009Q2v2.0 RVT standard cell library from Faraday Technology Corporation. All outputs are assumed to have a capacitive loading of 50 fF (equivalent to the input capacitance of about 9 medium strength buffers), and the input drive strength is assumed to be that of a medium strength buffer (BUFX8).
We use the worst case condition (1.08 V, 125°C) characterization of the standard cell libraries. We have decided to use worst case characterized libraries in order to guarantee that we can meet the specifications. Table 2 is given as a reference to be able to compare the three characterizations that are commonly available (worst, typical, best) for one of the candidate algorithms. Depending on the throughput requirements, we try different architectural transformations such as parallelization and pipelining to come up with an architecture that meets (or comes closest to meeting) the requirements. We then use the Cadence Design Systems Velocity-9.1 tool for the back-end design. The technology used in this evaluation uses 8 metal layers (metallization option 8m026), out of which the top-most two are double pitch (wider and thicker). A square floorplan is generated, leaving 30 µm space around the core for the power connections. For all designs we have used an 85 % utilization of the core area; in other words, we have left 15 % of the area for post-layout optimization and power and ground distribution overhead. For power routing we have used a power grid utilizing Metal-7 and Metal-8.
Then the design is placed, a clock tree is synthesized and subsequently the design is routed. After every step the timing is checked, and if necessary a timing optimization is performed. At the end, if a valid layout without any Design Rule Check (DRC) violations is found, the total core area is reported as the area of the system. The total core area excludes the 30 µm space reserved for power rings, but includes all the available area that the placement and routing tool can use for the design. By default, all designs start with a 15 % overhead for post-layout optimizations. Depending on the design, some amount of this overhead is used during various optimization phases of the backend design. However, it is difficult to quantify the minimum required overhead for every design reliably. We have decided to start all designs with the same initial placement density, and verified that the final design was not overly congested. In a congested design, the routing solution includes many detours which adversely affect timing. For these designs the initial row utilization would have been reduced by 5 %, increasing the overhead. This was not necessary for any designs in this study 8 . In some designs, the routing resources are sparsely utilized. Such designs could have benefited from a higher initial row utilization, which could have resulted in a slightly smaller circuit without noticeable timing penalties. As mentioned earlier, it is not trivial to make sure that two designs have exactly the same amount of overhead. Therefore, we have not considered changing the default row utilization, unless there was a noticeable problem.
The timing results are taken from the finalized design. First, the Velocity tool is used to extract the post-layout parasitics and an SDF file containing the delays of all interconnections and instances is generated. The final netlist and the SDF file are read by the Mentor Graphics Modelsim-6.5a simulator and the functionality of the design is verified. At the same time, a Value Change Dump (VCD) file that records the switching activity of all the nodes during the simulation is produced. To have more realistic results, the start of the VCD file is chosen after the circuit has been properly initialized. This VCD file is then read back into the Velocity tool and a statistical power analysis is performed. The Total Power number is used to determine the energy consumption of the system.
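The relation between the synthesized cell area, the initial row utilization and the reported core area can be sketched as follows. The function name, the cell area and in particular the NAND2 gate area are hypothetical placeholders; the actual gate area has to be taken from the standard cell library in use.

```python
import math


def floorplan(cell_area_um2, nand2_area_um2, utilization=0.85, ring_um=30.0):
    """Square floorplan sized from the cell area and the row utilization.

    The reported area is the whole core area (cells plus the space left
    for post-layout optimization), excluding the ring reserved for the
    power connections.
    """
    core_area = cell_area_um2 / utilization
    core_side = math.sqrt(core_area)
    return {
        "core_area_um2": core_area,
        "core_side_um": core_side,
        # The power ring adds ring_um on each side of the core.
        "outer_side_um": core_side + 2 * ring_um,
        # Gate equivalents: core area divided by the area of one NAND2 gate.
        "area_kge": core_area / nand2_area_um2 / 1000.0,
    }


# Hypothetical values: 85000 um^2 of cells, 2.5 um^2 per NAND2 gate.
plan = floorplan(cell_area_um2=85000.0, nand2_area_um2=2.5)
for key, value in plan.items():
    print(f"{key}: {value:.1f}")
```

Because the core area is fixed by the initial utilization, two designs with the same cell area are reported with the same area even if one of them uses more of the 15 % overhead during optimization.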
Algorithms
For a given candidate algorithm, there are several well-known architectural transformations such as parallelization, pipelining, loop-unrolling etc. that will allow different trade-offs between circuit size and throughput. In addition, within the submission document, the authors often suggest different computational methods to perform a specific transformation of their candidate function. A good example is the frequently used substitution boxes. They can be implemented as look-up tables, or can be realized as a circuit that computes the underlying function mathematically. To make matters worse, the exact trade-off between alternative realizations may only be visible after placement and routing. All these aspects broaden the spectrum of the possible hardware architectures. For a single candidate, there is often a large set of circuits with different trade-offs between size and speed. To identify the best design among many possibilities is not a trivial task. Despite all attempts to formalize architectural exploration, our experience has been that optimizing the circuit still remains a manual task, that relies on the skill and experience of the designer.
In this work, for each candidate algorithm we have selected what we believe was the most appropriate architecture that was able to reach the target throughput (20 and 0.2 Gbps) with minimal resources. For every candidate we designed and implemented two different architectures. The specifications of the individual designs used within this work are given in App. A. We make no claims that any of the architectures we have reported in this paper is the best possible architecture for a given candidate algorithm. In our opinion, it is not possible to make such a claim, and the exact implementations should be open to public scrutiny and review. For this purpose we have made all the source code that was used for this evaluation public on our website [4].
Results
In this section we present the performance of the circuits implemented for high and moderate speed environments. The comparison between these two scenarios gives a further overview of the efficiency and flexibility of the candidate algorithms. We will refrain from concluding remarks about the performance of the algorithms, as we do not consider the results complete without public scrutiny.
For each architecture we report two operating frequencies/throughputs. The Maximum Clock Frequency is the maximum achievable clock frequency of the given architecture. When operating with this clock frequency the circuit can achieve the given Maximum Achievable Throughput. In most cases, this throughput is not exactly the same as the required throughput (either 20 or 0.2 Gbps). The second clock frequency states the clock frequency required to reach the target throughput. The final value in the tables is a relative indicator of how close the architecture is in achieving the target clock frequency. A number lower than one means that the architecture failed to achieve the target throughput. One can take this as a ratio of how closely we were able to optimize the circuit to the given target performance.
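The reported quantities are related by simple arithmetic, sketched below. The sketch assumes that the relative indicator is the ratio of the maximum clock frequency to the clock frequency required for the target throughput, and that the core consumes one message block every fixed number of cycles; the function names and the example core are hypothetical.

```python
def required_clock_mhz(target_gbps, block_bits, cycles_per_block):
    """Clock frequency needed to reach the target throughput."""
    return target_gbps * 1e9 * cycles_per_block / block_bits / 1e6


def target_ratio(max_clock_mhz, target_gbps, block_bits, cycles_per_block):
    """Values below one mean the target throughput was not reached."""
    needed = required_clock_mhz(target_gbps, block_bits, cycles_per_block)
    return max_clock_mhz / needed


# Hypothetical core: 1024-bit blocks, 24 cycles per block, 400 MHz maximum.
for target in (20.0, 0.2):
    needed = required_clock_mhz(target, block_bits=1024, cycles_per_block=24)
    ratio = target_ratio(400.0, target, block_bits=1024, cycles_per_block=24)
    print(f"{target} Gbps: needs {needed:.2f} MHz, ratio {ratio:.2f}")
```

For this hypothetical core the ratio is below one for the 20 Gbps target and far above one for the 0.2 Gbps target, which mirrors the two situations described above.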
High Throughput Scenario
As expected, not all the circuits optimized for high speed were able to reach the target throughput. Only two algorithms, Keccak and Luffa, were able to meet the constraint. Table 3 lists the main performance figures for all architectures. In this scenario both area and energy were sacrificed to achieve high throughput. The corresponding layouts can be seen in Fig. 2. The scale is given in the lower right corner of the figure. Circuits with a higher congestion rate (e.g., BMW or SIMD) indeed require the entire core for routing, and would probably reach a higher throughput with more core area, i.e., a lower row utilization. Particularly interesting is also the local congestion for the 8-bit LUT-based S-boxes, which makes them easily identifiable within ECHO, Grøstl, Fugue, and partly SHAvite.
Medium Throughput Scenario
The moderate-throughput circuits match the target throughput of 0.2 Gbps without difficulty. As can be seen in Table 4, the maximum achievable clock rate always exceeds the clock frequency required for 0.2 Gbps operation. To some extent, the additional speed can be traded to reduce the overall energy consumption by lowering the supply voltage. It must be noted that there is a lower limit for the supply voltage (around 0.5 V for this process). Such voltage scaling techniques were not considered in this comparison; all results are listed for a 1.2 V supply voltage.
Since timing was quite relaxed in this scenario, the main figures of merit become the area and the energy dissipation. The layouts of all fourteen architectures are compared in Fig. 3, with the scale indicated on the bottom left.
The most interesting result is that a smaller area (or indeed throughput) does not always equal lower energy consumption (see Hamsi or Skein compared to BMW or SIMD). It must be noted that no special precautions were taken for a low-power design (e.g., proper clock-gating, input-silencing). In addition, some architectural decisions resulted in an increased number of operations and/or increased circuit activity, which affected the energy consumption differently for separate algorithms. We believe that there is much room for improvement in terms of low-power performance of the architectures. We must conclude that the present specifications do not necessarily result in low-power realizations in the medium-throughput corner. In a next step, the design methodology could be extended to provide a low-power scenario.
Sources of Error
Although we have tried our best to ensure a fair comparison, there are many factors that could have influenced the results. In this section we try to outline the possible sources of error in our results, and outline what we have done to address them.
-Conflict of interest
One of the authors of this paper, Luca Henzen, is involved with the SHA-3 candidate algorithm BLAKE. Our interest in implementing the SHA-3 candidate algorithms has started by investigating optimal hardware implementations of BLAKE. We have tried to be as impartial as possible when implementing other candidate algorithms. However, it is true that we are more familiar with this algorithm than any other algorithm.
-Designer experience
The algorithms have been implemented by a group of students over a period of several months. Different designers may have more or less success in optimizing a given design. We have confidence in our team, but it is possible that for some algorithms we have inadvertently missed a possible optimization while for the others we were more successful. In addition, over time the designers naturally gain more experience and are more successful with the designs. We believe that the most important aspect of a fair comparison is openness. For this reason we have made the source code and run scripts for the EDA tools used to implement all designs presented in this paper available on our website [4] . In this way, other groups can replicate our results, and can find and correct any mistakes we might have made in the process.
-Accuracy of numbers
The numbers delivered by synthesis and analysis tools rely on the library files provided by the manufacturer. The values in the libraries are essentially statistical entities and sometimes have large uncertainties associated with them. In addition, most of the design process involves heuristic algorithms which, depending on a vast number of parameters, can return different results. Our experience with synthesis tools suggests that the results have around ±5 % variation. We therefore consider results that are within 10 % of each other to be comparable.
In an effort to be more accurate we have chosen to report post-layout area numbers that include clock and power distribution overhead. We have designed all circuits with the same overhead. For some circuits this overhead is adequate, for others it is too much, and for others it is insufficient. We made sure that there is an acceptable solution for all cases.
-Bias through specification
We have chosen two design corners in our applications; these specifications have helped us to have a common base for comparing all 14 algorithms. Regardless of how these specifications are chosen, it is possible that they benefit some algorithms more than the others. We hope that similar studies by other groups which use different specifications will help to give a clearer picture.
-Simplification due to assumptions
All our assumptions, the specific choices we made for SHA-3 parameters and the practical choices we made in the design flow will have some effect on the results. For example, we have decided not to take IR-drop or crosstalk effects into account. As a result, the cores that achieve their reported performance by using very high clock frequencies will be more difficult to realize in practice. The assumptions in the design flow are a practical necessity and were designed to create a methodology in which the same solution could be used for all designs.
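The comparability rule stated under "Accuracy of numbers" can be expressed as a small check. Measuring the difference relative to the larger of the two values is an assumption made for this sketch, as are the function name and the example figures.

```python
def comparable(a, b, tolerance=0.10):
    """True if two results differ by at most `tolerance` of the larger one.

    The choice of the larger value as the reference is an assumption;
    the text only states "within 10 % of each other".
    """
    return abs(a - b) <= tolerance * max(abs(a), abs(b))


# Hypothetical area figures in kGE.
print(comparable(100.0, 92.0))  # difference of 8 % of the larger value
print(comparable(100.0, 85.0))  # difference of 15 % of the larger value
```

With a tool variation of around ±5 %, two results that pass this check should not be ranked against each other.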
Conclusions
In this paper we have presented a methodology to compare the SHA-3 candidate algorithms. Our previous experiences in designing ASIC implementations of candidate algorithms (Table 1) have been instrumental in developing what we believe is a fair set of specifications. Rather than targeting outright performance, we have set limits for one performance metric (throughput) and re-implemented all algorithms to meet two distinct throughput requirements. This enabled us to compare the flexibility of the algorithms (Tables 3 and 4). A public selection process, such as the SHA-3 competition, invariably attracts a large number of submissions with many different algorithms. In early stages of the selection process, the sheer number of algorithms (51 in the first round) makes it impractical to employ a detailed analysis for hardware suitability. Our experience has shown that even with the 14 second round candidates, it is difficult to present an authoritative and fair evaluation of all candidates. We believe that for the final round of evaluations, an approach similar to what we have demonstrated in this paper should be utilized: clear constraints should be set for the implementations, preferably more than one performance corner should be targeted, the evaluation process should be well documented and the errors in the evaluation process should be openly discussed. We would also suggest the addition to our methodology of a low-power corner that also considers voltage scaling for low-power operation.
In many parts of this paper, we have extensively commented on limitations of our methodology, and have included a whole subsection on sources of error. We strongly believe that any such comparison must be thorough with its analysis of error sources and clear with its performance metrics.
A Hardware Architectures
Table 5 gives an overview of the architectures used within this work. For some candidates we used the same design for the 20 Gbps (HS) and 0.2 Gbps (MS) analysis. In such cases, different optimization parameters were used. The detailed description of the architectures has been omitted because of the limited article length. Refer to [4] for the complete source code for all the architectures used in this evaluation.
