Temporal and system level modifications for high speed VLSI implementations of cryptographic core
Hash functions form a special family of cryptographic algorithms, applied wherever message integrity and authentication are critical. As time passes, it seems that all applications call for higher throughput due to their rapid acceptance by the market. In this work a new technique is presented for increasing the frequency and throughput of the currently most used hash function, SHA-1. The technique involves the application of spatial and temporal pre-computation. Compared to conventional pipelined implementations of hash functions, the proposed technique leads to an implementation with more than 75% higher throughput.
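As a rough software model of the temporal pre-computation idea (the spatial part and the exact pipeline organization are hardware-specific and not reproduced here), the sketch below pre-computes the e + K + W sum one round ahead: the next round's e is simply the current d, so that three-operand addition can be moved off the round's critical path. The function name is illustrative and W is assumed to be the already-expanded message schedule.

```c
#include <stdint.h>

#define ROTL(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* SHA-1 round function f_t and constant K_t (FIPS 180-4). */
static uint32_t f(int t, uint32_t b, uint32_t c, uint32_t d) {
    if (t < 20) return (b & c) | (~b & d);
    if (t < 40) return b ^ c ^ d;
    if (t < 60) return (b & c) | (b & d) | (c & d);
    return b ^ c ^ d;
}

static uint32_t K(int t) {
    if (t < 20) return 0x5A827999u;
    if (t < 40) return 0x6ED9EBA1u;
    if (t < 60) return 0x8F1BBCDCu;
    return 0xCA62C1D6u;
}

/* 80 SHA-1 rounds with the e + K + W sum pre-computed one round ahead:
 * next round's e equals the current d, so d + K[t+1] + W[t+1] can be
 * formed off the critical path (a separate adder stage in hardware). */
void sha1_rounds_precomp(uint32_t h[5], const uint32_t W[80]) {
    uint32_t a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];
    uint32_t pre = e + K(0) + W[0];              /* seed for round 0 */
    for (int t = 0; t < 80; t++) {
        uint32_t temp = ROTL(a, 5) + f(t, b, c, d) + pre;
        if (t + 1 < 80)
            pre = d + K(t + 1) + W[t + 1];       /* next round's e+K+W */
        e = d; d = c; c = ROTL(b, 30); b = a; a = temp;
    }
    h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}
```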
Novel high throughput implementation of SHA-256 hash function through pre-computation technique
Hash functions are utilized in the security layer of every communication protocol and in signature authentication schemes for electronic transactions. As time passes, more sophisticated applications that invoke a security layer arise and address more users and clients, which means that all these applications demand higher throughput. In this work a pre-computation technique has been developed for optimizing SHA-256, which has already started replacing both SHA-1 and MD-5. Compared to conventional pipelined implementations of the SHA-256 hash function, the applied pre-computation technique leads to about 30% higher throughput with an area penalty of only approximately 9.5%.
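To make the pre-computation concrete for SHA-256, here is a hedged C sketch of the round loop: since the next round's h equals the current g, the sum g + K[t+1] + W[t+1] can be formed a round early, outside the critical path. The function name is illustrative, W is assumed to be the expanded message schedule, and the paper's pipelined datapath is not reproduced.

```c
#include <stdint.h>

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define CH(e, f, g)  (((e) & (f)) ^ (~(e) & (g)))
#define MAJ(a, b, c) (((a) & (b)) ^ ((a) & (c)) ^ ((b) & (c)))
#define S1(e) (ROTR((e), 6) ^ ROTR((e), 11) ^ ROTR((e), 25))
#define S0(a) (ROTR((a), 2) ^ ROTR((a), 13) ^ ROTR((a), 22))

/* Standard SHA-256 round constants (FIPS 180-4). */
static const uint32_t K256[64] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
    0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
    0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
    0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
    0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
    0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
    0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
    0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
    0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
    0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
    0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
    0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
    0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
    0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
    0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
    0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

/* 64 SHA-256 rounds with h + K + W pre-computed one round ahead:
 * next round's h equals the current g, so g + K256[t+1] + W[t+1]
 * can be added outside the critical path of the round. */
void sha256_rounds_precomp(uint32_t s[8], const uint32_t W[64]) {
    uint32_t a = s[0], b = s[1], c = s[2], d = s[3];
    uint32_t e = s[4], f = s[5], g = s[6], h = s[7];
    uint32_t pre = h + K256[0] + W[0];           /* seed for round 0 */
    for (int t = 0; t < 64; t++) {
        uint32_t T1 = S1(e) + CH(e, f, g) + pre; /* h+K+W folded in */
        uint32_t T2 = S0(a) + MAJ(a, b, c);
        if (t + 1 < 64)
            pre = g + K256[t + 1] + W[t + 1];    /* next round's h+K+W */
        h = g; g = f; f = e; e = d + T1;
        d = c; c = b; b = a; a = T1 + T2;
    }
    s[0] += a; s[1] += b; s[2] += c; s[3] += d;
    s[4] += e; s[5] += f; s[6] += g; s[7] += h;
}
```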
Efficient implementation of the Keyed-Hash Message Authentication Code (HMAC) using the SHA-1 hash function
In this paper an efficient implementation, in terms of performance, of the Keyed-Hash Message Authentication Code (HMAC) using the SHA-1 hash function is presented. This mechanism is used for message authentication in combination with a shared secret key. The proposed hardware implementation can be synthesized easily for a variety of FPGA and ASIC technologies. Simulation results, using commercial tools, verified the efficiency of the HMAC implementation in terms of performance and throughput. Special care has been taken so that the proposed implementation does not introduce extra design complexity, while functionality was kept at the required levels.
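For reference, a hedged software sketch of the HMAC construction as defined in RFC 2104, mac = H((K ^ opad) || H((K ^ ipad) || m)). The one-shot sha1() helper is an assumption standing in for any SHA-1 implementation; the hardware described in the paper streams the message block by block rather than buffering it whole.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SHA1_BLOCK  64                     /* SHA-1 block size in bytes  */
#define SHA1_DIGEST 20                     /* SHA-1 digest size in bytes */

/* Assumed one-shot hash helper; any SHA-1 implementation will do. */
void sha1(const uint8_t *msg, size_t len, uint8_t digest[SHA1_DIGEST]);

/* HMAC-SHA1 per RFC 2104. Returns 0 on success, -1 on allocation failure. */
int hmac_sha1(const uint8_t *key, size_t key_len,
              const uint8_t *msg, size_t msg_len,
              uint8_t mac[SHA1_DIGEST]) {
    uint8_t k[SHA1_BLOCK] = {0};

    /* Keys longer than one block are hashed down first (RFC 2104). */
    if (key_len > SHA1_BLOCK) sha1(key, key_len, k);
    else                      memcpy(k, key, key_len);

    /* Inner hash: H((K ^ ipad) || message). */
    uint8_t *inner = malloc(SHA1_BLOCK + msg_len);
    if (!inner) return -1;
    for (int i = 0; i < SHA1_BLOCK; i++) inner[i] = k[i] ^ 0x36;
    memcpy(inner + SHA1_BLOCK, msg, msg_len);
    uint8_t ih[SHA1_DIGEST];
    sha1(inner, SHA1_BLOCK + msg_len, ih);
    free(inner);

    /* Outer hash: H((K ^ opad) || inner digest). */
    uint8_t outer[SHA1_BLOCK + SHA1_DIGEST];
    for (int i = 0; i < SHA1_BLOCK; i++) outer[i] = k[i] ^ 0x5c;
    memcpy(outer + SHA1_BLOCK, ih, SHA1_DIGEST);
    sha1(outer, SHA1_BLOCK + SHA1_DIGEST, mac);
    return 0;
}
```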
Decoupled processors architecture for accelerating data intensive applications using scratch-pad memory hierarchy
We present an architecture of decoupled processors with a memory hierarchy consisting only of scratch-pad memories and a main memory. This architecture exploits the more efficient pre-fetching of decoupled processors, which make use of the parallelism between address computation and application data processing that mainly exists in streaming applications. This benefit, combined with the ability of scratch-pad memories to store data with no conflict misses and low energy per access, contributes significantly to increasing the system's performance. The application code is split into two parallel programs: the first runs on the Access processor and computes the addresses of the data in the memory hierarchy; the second processes the application data and runs on the Execute processor, a processor with a limited address space (just the register-file addresses). Each transfer of any block in the memory hierarchy, up to the Execute processor's register file, is controlled by the Access processor and the DMA units, which strongly differentiates this architecture from traditional uniprocessors and from existing decoupled processors with cache memory hierarchies. The architecture is compared in performance with uniprocessor architectures with (a) scratch-pad and (b) cache memory hierarchies, and with (c) existing decoupled architectures, showing higher normalized performance. The reason for this gain is the efficiency of data transfer that the scratch-pad memory hierarchy provides, combined with the ability of the decoupled processors to eliminate memory latency by using memory-management techniques for transferring data instead of fixed prefetching methods. Experimental results show that performance increases up to almost 2 times compared to uniprocessor architectures with scratch-pad memory and up to 3.7 times compared to those with cache. The proposed architecture achieves the above performance without penalties in energy-delay-product cost.
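A minimal single-threaded C model of the access/execute split with double buffering may help. dma_start()/dma_wait() are hypothetical primitives standing in for the paper's DMA units, and both roles are interleaved in one function for readability; in the actual architecture they run on two decoupled processors.

```c
#include <stddef.h>
#include <stdint.h>

#define TILE 256

/* Hypothetical DMA primitives: start an asynchronous copy into the
 * scratch-pad, and wait for the most recently started transfer. */
void dma_start(void *spm_dst, const void *mem_src, size_t bytes);
void dma_wait(void);

/* Access/Execute split of `out[i] = in[i] * gain` with double buffering.
 * The "Access" role computes addresses and programs the DMA one tile
 * ahead; the "Execute" role only touches the scratch-pad buffers, so
 * memory latency overlaps with computation. */
void scale_stream(int32_t *out, const int32_t *in, size_t n, int32_t gain) {
    static int32_t spm[2][TILE];                /* scratch-pad double buffer */
    size_t tiles = n / TILE;                    /* assume n % TILE == 0 */
    if (tiles == 0) return;
    dma_start(spm[0], in, sizeof spm[0]);       /* Access: prefetch tile 0 */
    for (size_t t = 0; t < tiles; t++) {
        dma_wait();                             /* handshake: tile t arrived */
        if (t + 1 < tiles)                      /* Access: prefetch tile t+1 */
            dma_start(spm[(t + 1) & 1], in + (t + 1) * TILE, sizeof spm[0]);
        const int32_t *buf = spm[t & 1];
        for (size_t i = 0; i < TILE; i++)       /* Execute: compute on SPM */
            out[t * TILE + i] = buf[i] * gain;
    }
}
```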
Low-power architecture with scratch-pad memory for accelerating embedded applications with run-time reuse
Current embedded systems are usually designed for data-dominated applications but have a tight energy and time budget. Scratch-pad memories are completely software-controlled memories with predictable behaviour and good performance and energy characteristics, and thus tend to become a standard feature in many embedded systems. However, their predictability does not help when the application accesses its data dynamically, i.e. when the addresses of the accessed data depend on the application's input. In such cases, predetermining the scratch-pad content at design time is not always possible, as the compiler cannot predict the runtime input. Moreover, both data reuse and data placement in the scratch-pad then become inefficient, because chunks of data already stored cannot be efficiently reused and combined with the data blocks accessed at runtime. State-of-the-art techniques copy each new data block to the scratch-pad without considering whether portions of it are already there; such dynamic temporal locality cannot be predicted or exploited by the compiler. The authors present a system architecture, strongly coupled to the system's scratch-pad and the processor's compiler, that efficiently exploits run-time data reuse in the scratch-pad: it holds valuable information, such as the exact data contents of the scratch-pad at runtime, and uses it to perform all the operations necessary for placing each new data block in the scratch-pad. It is fine-tuned for applications with run-time reuse between rectangular data blocks. The application domain of the proposed architecture is multimedia applications with run-time reuse, certain applications with linked lists, and multi-threaded applications. It operates in a time- and energy-efficient manner compared with existing scratch-pad architectures without the authors' scratch-pad accelerator engine, showing higher normalised performance and lower normalised energy consumption. Experimental results show up to 2.5 times performance increase compared with existing scratch-pad architectures and 5 times compared with cache architectures, and energy decreases of up to 1.9 and 3.9 times, respectively.
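A hedged sketch of the run-time-reuse idea for rectangular blocks: the engine records which rectangle is resident in the scratch-pad, intersects each new request with it, and transfers only the missing part. All names here (rect_t, dma_copy_rect, spm_fetch) are illustrative, and the left/right strips of a partial overlap are omitted for brevity.

```c
/* A rectangular block of a 2-D array, in element coordinates. */
typedef struct { int x, y, w, h; } rect_t;

/* Hypothetical transfer primitive: copy one rectangular region from
 * main memory into the scratch-pad. */
void dma_copy_rect(rect_t r);

static rect_t spm_resident;   /* what the engine knows is in the SPM */

/* Intersection of two rectangles (w <= 0 or h <= 0 means empty). */
static rect_t intersect(rect_t a, rect_t b) {
    rect_t r;
    r.x = a.x > b.x ? a.x : b.x;
    r.y = a.y > b.y ? a.y : b.y;
    int x2 = (a.x + a.w < b.x + b.w) ? a.x + a.w : b.x + b.w;
    int y2 = (a.y + a.h < b.y + b.h) ? a.y + a.h : b.y + b.h;
    r.w = x2 - r.x; r.h = y2 - r.y;
    return r;
}

/* Bring `req` into the scratch-pad, fetching only what is missing:
 * compare the request against the recorded SPM contents and DMA only
 * the non-overlapping rows. */
void spm_fetch(rect_t req) {
    rect_t hit = intersect(req, spm_resident);
    if (hit.w <= 0 || hit.h <= 0) {
        dma_copy_rect(req);                  /* no reuse: full transfer */
    } else {
        /* Transfer the rows of `req` above and below the reused part;
         * a full engine would also handle the left/right strips. */
        if (req.y < hit.y)
            dma_copy_rect((rect_t){req.x, req.y, req.w, hit.y - req.y});
        if (req.y + req.h > hit.y + hit.h)
            dma_copy_rect((rect_t){req.x, hit.y + hit.h,
                                   req.w, req.y + req.h - (hit.y + hit.h)});
    }
    spm_resident = req;                      /* update the engine's view */
}
```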
A decoupled architecture of processors with scratch-pad memory hierarchy
We present a decoupled architecture of processors with a memory hierarchy of only scratch-pad memories and a main memory. The decoupled architecture also exploits the parallelism between address computation and the processing of application data. The application code is split into two programs: the first computes the addresses of the data in the memory hierarchy, and the second processes the application data. The first program is executed by one of the decoupled processors, called Access, which uses compiler methods for placing data in the memory hierarchy. In parallel, the second program is executed by the other processor, called Execute. The synchronization of the memory hierarchy and the Execute processor is achieved through a simple handshake protocol. The Access processor requires strong communication with the memory hierarchy, which strongly differentiates it from traditional uniprocessors. The architecture is compared in performance with the MIPS IV architecture of SimpleScalar and with existing decoupled architectures, showing higher normalized performance. Experimental results show that performance increases up to 3.7 times. Compared with MIPS IV, the proposed architecture achieves the above performance with insignificant overheads in terms of area.
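A minimal model, using C11 atomics, of the simple handshake described above: the Access side signals once a block is in the scratch-pad, and the Execute side signals when the buffer may be reused. Names and sizes are illustrative, and the real synchronization is a hardware protocol, not shared-memory flags.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 128
static int32_t spm_block[BLOCK];           /* models one scratch-pad buffer */
static atomic_bool ready = false;          /* Access -> Execute: block filled */
static atomic_bool done  = true;           /* Execute -> Access: buffer free  */

/* Access processor: computes addresses and fills the buffer. */
void access_side(const int32_t *mem, size_t nblocks) {
    for (size_t b = 0; b < nblocks; b++) {
        while (!atomic_load(&done)) ;      /* wait until Execute is finished */
        atomic_store(&done, false);
        for (int i = 0; i < BLOCK; i++)    /* stands in for a DMA transfer */
            spm_block[i] = mem[b * BLOCK + i];
        atomic_store(&ready, true);        /* handshake: block is ready */
    }
}

/* Execute processor: consumes blocks; never computes main-memory addresses. */
int64_t execute_side(size_t nblocks) {
    int64_t acc = 0;
    for (size_t b = 0; b < nblocks; b++) {
        while (!atomic_load(&ready)) ;     /* wait for the handshake */
        atomic_store(&ready, false);
        for (int i = 0; i < BLOCK; i++)
            acc += spm_block[i];
        atomic_store(&done, true);         /* handshake: buffer may be reused */
    }
    return acc;
}
```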
A top-down design methodology for ultrahigh-performance hashing cores
Many cryptographic primitives that are used in cryptographic schemes and security protocols such as SET, PKI, IPSec, and VPNs utilize hash functions, which form a special family of cryptographic algorithms. Applications that use these security schemes are becoming very popular as time goes by, and some of them call for higher throughput, either due to their rapid acceptance by the market or due to their nature. In this work, a new methodology is presented for achieving high operating frequency and throughput in implementations of all widely used (and those expected to be used in the near future) hash functions, such as MD-5, SHA-1, RIPEMD (all versions), SHA-256, SHA-384, SHA-512, and so forth. In the proposed methodology, five different techniques have been developed and combined in the finest way so as to achieve maximum performance. Compared to conventional pipelined implementations of hash functions (in FPGAs), the proposed methodology can lead to as much as a 160 percent throughput increase.