1,066 research outputs found

    Comparative Study of Keccak SHA-3 Implementations

    Get PDF
    This paper conducts an extensive comparative study of state-of-the-art solutions for im- plementing the SHA-3 hash function. SHA-3, a pivotal component in modern cryptography, has spawned numerous implementations across diverse platforms and technologies. This research aims to provide valuable insights into selecting and optimizing Keccak SHA-3 implementations. Our study encompasses an in-depth analysis of hardware, software, and software–hardware (hybrid) solutions. We assess the strengths, weaknesses, and performance metrics of each approach. Critical factors, including computational efficiency, scalability, and flexibility, are evaluated across differ- ent use cases. We investigate how each implementation performs in terms of speed and resource utilization. This research aims to improve the knowledge of cryptographic systems, aiding in the informed design and deployment of efficient cryptographic solutions. By providing a comprehensive overview of SHA-3 implementations, this study offers a clear understanding of the available options and equips professionals and researchers with the necessary insights to make informed decisions in their cryptographic endeavors

    Design of testbed and emulation tools

    Get PDF
    The research summarized was concerned with the design of testbed and emulation tools suitable to assist in projecting, with reasonable accuracy, the expected performance of highly concurrent computing systems on large, complete applications. Such testbed and emulation tools are intended for the eventual use of those exploring new concurrent system architectures and organizations, either as users or as designers of such systems. While a range of alternatives was considered, a software based set of hierarchical tools was chosen to provide maximum flexibility, to ease in moving to new computers as technology improves and to take advantage of the inherent reliability and availability of commercially available computing systems

    Instruction-set architecture synthesis for VLIW processors

    Get PDF

    Vector processor virtualization: distributed memory hierarchy and simultaneous multithreading

    Get PDF
    Taking advantage of DLP (Data-Level Parallelism) is indispensable in most data streaming and multimedia applications. Several architectures have been proposed to improve both the performance and energy consumption for such applications. Superscalar and VLIW (Very Long Instruction Word) processors, along with SIMD (Single-Instruction Multiple-Data) and vector processor (VP) accelerators, are among the available options for designers to accomplish their desired requirements. On the other hand, these choices turn out to be large resource and energy consumers, while also not being always used efficiently due to data dependencies among instructions and limited portion of vectorizable code in single applications that deploy them. This dissertation proposes an innovative architecture for a multithreaded VP which separates the path for performing data shuffle and memory-indexed accesses from the data path for executing other vector instructions that access the memory. This separation speeds up the most common memory access operations by avoiding extra delays and unnecessary stalls. In this multilane-based VP design, each vector lane uses its own private memory to avoid any stalls during memory access instructions. More importantly, the proposed VP has an innovative multithreaded architecture which makes it highly suitable for concurrent sharing in multicore environments. To this end, the VP which is developed in VHDL and prototyped on an FPGA (Field-Programmable Gate Array), serves as a coprocessor for one or more scalar cores in various system architectures presented in the dissertation. In the first system architecture, the VP is allocated exclusively to a single scalar core. Benchmarking shows that the VP can achieve very high performance. The inclusion of distributed data shuffle engines across vector lanes has a spectacular impact on the execution time, primarily for applications like FFT (Fast-Fourier Transform) that require large amounts of data shuffling. In the second system architecture, a VP virtualization technique is presented which, when applied, enables the multithreaded VP to simultaneously execute many threads of various vector lengths. The threads compete simultaneously for the VP resources having as a goal an improved aggregate VP utilization. This approach yields high VP utilization even under low utilization for the individual threads. A vector register file (VRF) virtualization technique dynamically allocates physical vector registers to running threads. The technique is implemented for a multi-core processor embedded in an FPGA. Under the dynamic creation of threads, benchmarking demonstrates large VP speedups and drastic energy savings when compared to the first system architecture. In the last system architecture, further improvements focus on VP virtualization relying exclusively on hardware. Moreover, a pipelined data shuffle network replaces the non-pipelined shuffle engines. The VP can then take advantage of identical instruction flows that may be present in different vector applications by running in a fused instruction mode that increases its utilization. A power dissipation model is introduced as well as two optimization policies towards minimizing the consumed energy, or the product of the energy and runtime for a given application. Benchmarking shows the positive impact of these optimizations

    Efficient design-space exploration of custom instruction-set extensions

    Get PDF
    Customization of processors with instruction set extensions (ISEs) is a technique that improves performance through parallelization with a reasonable area overhead, in exchange for additional design effort. This thesis presents a collection of novel techniques that reduce the design effort and cost of generating ISEs by advancing automation and reconfigurability. In addition, these techniques maximize the perfomance gained as a function of the additional commited resources. Including ISEs into a processor design implies development at many levels. Most prior works on ISEs solve separate stages of the design: identification, selection, and implementation. However, the interations between these stages also hold important design trade-offs. In particular, this thesis addresses the lack of interaction between the hardware implementation stage and the two previous stages. Interaction with the implementation stage has been mostly limited to accurately measuring the area and timing requirements of the implementation of each ISE candidate as a separate hardware module. However, the need to independently generate a hardware datapath for each ISE limits the flexibility of the design and the performance gains. Hence, resource sharing is essential in order to create a customized unit with multi-function capabilities. Previously proposed resource-sharing techniques aggressively share resources amongst the ISEs, thus minimizing the area of the solution at any cost. However, it is shown that aggressively sharing resources leads to large ISE datapath latency. Thus, this thesis presents an original heuristic that can be parameterized in order to control the degree of resource sharing amongst a given set of ISEs, thereby permitting the exploration of the existing implementation trade-offs between instruction latency and area savings. In addition, this thesis introduces an innovative predictive model that is able to quickly expose the optimal trade-offs of this design space. Compared to an exhaustive exploration of the design space, the predictive model is shown to reduce by two orders of magnitude the number of executions of the resource-sharing algorithm that are required in order to find the optimal trade-offs. This thesis presents a technique that is the first one to combine the design spaces of ISE selection and resource sharing in ISE datapath synthesis, in order to offer the designer solutions that achieve maximum speedup and maximum resource utilization using the available area. Optimal trade-offs in the design space are found by guiding the selection process to favour ISE combinations that are likely to share resources with low speedup losses. Experimental results show that this combined approach unveils new trade-offs between speedup and area that are not identified by previous selection techniques; speedups of up to 238% over previous selection thecniques were obtained. Finally, multi-cycle ISEs can be pipelined in order to increase their throughput. However, it is shown that traditional ISE identification techniques do not allow this optimization due to control flow overhead. In order to obtain the benefits of overlapping loop executions, this thesis proposes to carefully insert loop control flow statements into the ISEs, thus allowing the ISE to control the iterations of the loop. The proposed ISEs broaden the scope of instruction-level parallelism and obtain higher speedups compared to traditional ISEs, primarily through pipelining, the exploitation of spatial parallelism, and reducing the overhead of control flow statements and branches. A detailed case study of a real application shows that the proposed method achieves 91% higher speedups than the state-of-the-art, with an area overhead of less than 8% in hardware implementation

    Performance Aspects of Synthesizable Computing Systems

    Get PDF

    Increasing the efficacy of automated instruction set extension

    Get PDF
    The use of Instruction Set Extension (ISE) in customising embedded processors for a specific application has been studied extensively in recent years. The addition of a set of complex arithmetic instructions to a baseline core has proven to be a cost-effective means of meeting design performance requirements. This thesis proposes and evaluates a reconfigurable ISE implementation called “Configurable Flow Accelerators” (CFAs), a number of refinements to an existing Automated ISE (AISE) algorithm called “ISEGEN”, and the effects of source form on AISE. The CFA is demonstrated repeatedly to be a cost-effective design for ISE implementation. A temporal partitioning algorithm called “staggering” is proposed and demonstrated on average to reduce the area of CFA implementation by 37% for only an 8% reduction in acceleration. This thesis then turns to concerns within the ISEGEN AISE algorithm. A methodology for finding a good static heuristic weighting vector for ISEGEN is proposed and demonstrated. Up to 100% of merit is shown to be lost or gained through the choice of vector. ISEGEN early-termination is introduced and shown to improve the runtime of the algorithm by up to 7.26x, and 5.82x on average. An extension to the ISEGEN heuristic to account for pipelining is proposed and evaluated, increasing acceleration by up to an additional 1.5x. An energyaware heuristic is added to ISEGEN, which reduces the energy used by a CFA implementation of a set of ISEs by an average of 1.6x, up to 3.6x. This result directly contradicts the frequently espoused notion that “bigger is better” in ISE. The last stretch of work in this thesis is concerned with source-level transformation: the effect of changing the representation of the application on the quality of the combined hardwaresoftware solution. A methodology for combined exploration of source transformation and ISE is presented, and demonstrated to improve the acceleration of the result by an average of 35% versus ISE alone. Floating point is demonstrated to perform worse than fixed point, for all design concerns and applications studied here, regardless of ISEs employed

    A Survey of Recent Developments in Testability, Safety and Security of RISC-V Processors

    Get PDF
    With the continued success of the open RISC-V architecture, practical deployment of RISC-V processors necessitates an in-depth consideration of their testability, safety and security aspects. This survey provides an overview of recent developments in this quickly-evolving field. We start with discussing the application of state-of-the-art functional and system-level test solutions to RISC-V processors. Then, we discuss the use of RISC-V processors for safety-related applications; to this end, we outline the essential techniques necessary to obtain safety both in the functional and in the timing domain and review recent processor designs with safety features. Finally, we survey the different aspects of security with respect to RISC-V implementations and discuss the relationship between cryptographic protocols and primitives on the one hand and the RISC-V processor architecture and hardware implementation on the other. We also comment on the role of a RISC-V processor for system security and its resilience against side-channel attacks

    Vector coprocessor sharing techniques for multicores: performance and energy gains

    Get PDF
    Vector Processors (VPs) created the breakthroughs needed for the emergence of computational science many years ago. All commercial computing architectures on the market today contain some form of vector or SIMD processing. Many high-performance and embedded applications, often dealing with streams of data, cannot efficiently utilize dedicated vector processors for various reasons: limited percentage of sustained vector code due to substantial flow control; inherent small parallelism or the frequent involvement of operating system tasks; varying vector length across applications or within a single application; data dependencies within short sequences of instructions, a problem further exacerbated without loop unrolling or other compiler optimization techniques. Additionally, existing rigid SIMD architectures cannot tolerate efficiently dynamic application environments with many cores that may require the runtime adjustment of assigned vector resources in order to operate at desired energy/performance levels. To simultaneously alleviate these drawbacks of rigid lane-based VP architectures, while also releasing on-chip real estate for other important design choices, the first part of this research proposes three architectural contexts for the implementation of a shared vector coprocessor in multicore processors. Sharing an expensive resource among multiple cores increases the efficiency of the functional units and the overall system throughput. The second part of the dissertation regards the evaluation and characterization of the three proposed shared vector architectures from the performance and power perspectives on an FPGA (Field-Programmable Gate Array) prototype. The third part of this work introduces performance and power estimation models based on observations deduced from the experimental results. The results show the opportunity to adaptively adjust the number of vector lanes assigned to individual cores or processing threads in order to minimize various energy-performance metrics on modern vector- capable multicore processors that run applications with dynamic workloads. Therefore, the fourth part of this research focuses on the development of a fine-to-coarse grain power management technique and a relevant adaptive hardware/software infrastructure which dynamically adjusts the assigned VP resources (number of vector lanes) in order to minimize the energy consumption for applications with dynamic workloads. In order to remove the inherent limitations imposed by FPGA technologies, the fifth part of this work consists of implementing an ASIC (Application Specific Integrated Circuit) version of the shared VP towards precise performance-energy studies involving high- performance vector processing in multicore environments
    • 

    corecore