14 research outputs found

    A high-performance matrix-matrix multiplication methodology for CPU and GPU architectures

    Get PDF
    Current compilers cannot generate code that can compete with hand-tuned code in efficiency, even for a simple kernel like matrix–matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and number of levels of tiling. The scheduling parameter values selection is a very difficult and time-consuming task, since parameter values depend on each other; this is why they are found by using searching methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented where the optimum scheduling parameters are found by decreasing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem and not separately according to the hardware architecture parameters and input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data caches sizes and associativities), giving high-quality solutions and a smaller search space. This methodology refers to a wide range of CPU and GPU architectures

    Conclusions and Future Directions

    No full text

    Authentication with RIPEMD-160 and other alternatives: A Hardware Design Perspective 103 X Authentication with RIPEMD-160 and other alternatives: A Hardware Design Perspective

    No full text
    Abstract Taking into consideration the rapid evolution of communication standards that include message authentication and integrity verification, it is realized that constructions like MAC and HMAC, are widely used in the most popular cryptographic schemes since provision of a way to check the integrity of information transmitted over or stored in an unreliable medium is a prime necessity in the world of open computing and communications. MACs are used so as to protect both a message's integrity as well as its authenticity, by allowing verifiers (who also possess the secret key) to detect any changes to the message content. In every modern cryptographic scheme that is used to secure a crucial application that calls for security, a keyed-hash message authentication code, or HMAC, is incorporated. Beyond HMAC, a block cipher algorithm is also incorporated (i.e like AES), thus resulting to the whole security scheme. The proposed hardware design invokes a number of optimizing techniques like pipeline, evaluation-based partial unrolling, certain algorithmic transformations in space and time and computational re-ordering, leading to a highthroughput and low-power design for the whole HMAC construction. Finally, a new algorithm, CMAC, for producing message authenticating codes (MACs) which was recently proposed by NIST, is also described. The proposed security scheme incorporates a FIPS approved and a secure block cipher algorithm (that might have already been deployed in the security scheme) and was standardized by NIST in May, 2005. This work concludes with an efficient hardware implementation of the CMAC standard

    Near-Optimal Microprocessor and Accelerators Codesign with Latency and Throughput Constraints

    No full text
    A systematic methodology for near-optimal software/hardware codesign mapping onto an FPGA platform with microprocessor and HW accelerators is proposed. The mapping steps deal with the inter-organization, the foreground memory management, and the datapath mapping. A step is described by parameters and equations combined in a scalable template. Mapping decisions are propagated as design constraints to prune suboptimal options in next steps. Several performance-area Pareto points are produced by instantiating the parameters. To evaluate our methodology we map a real-time bio-imaging application and loop-dominated benchmarks

    Authentication with RIPEMD-160 and other alternatives: A Hardware Design Perspective

    Get PDF
    Taking into consideration the rapid evolution of communication standards that include message authentication and integrity verification, it is realized that constructions like MAC and HMAC, are widely used in the most popular cryptographic schemes since provision of a way to check the integrity of information transmitted over or stored in an unreliable medium is a prime necessity in the world of open computing and communications. MACs are used so as to protect both a message's integrity as well as its authenticity, by allowing verifiers (who also possess the secret key) to detect any changes to the message content. In every modern cryptographic scheme that is used to secure a crucial application that calls for security, a keyed-hash message authentication code, or HMAC, is incorporated. Beyond HMAC, a block cipher algorithm is also incorporated (i.e like AES), thus resulting to the whole security scheme. The proposed hardware design invokes a number of optimizing techniques like pipeline, evaluation-based partial unrolling, certain algorithmic transformations in space and time and computational re-ordering, leading to a highthroughput and low-power design for the whole HMAC construction. Finally, a new algorithm, CMAC, for producing message authenticating codes (MACs) which was recently proposed by NIST, is also described. The proposed security scheme incorporates a FIPS approved and a secure block cipher algorithm (that might have already been deployed in the security scheme) and was standardized by NIST in May, 2005. This work concludes with an efficient hardware implementation of the CMAC standard

    A scalable and near-optimal representation of access schemes for memory management

    No full text
    Memory management searches for the resources required to store the concurrently alive elements. The solution quality is affected by the representation of the element accesses: a sub-optimal representation leads to overestimation and a non-scalable representation increases the exploration time. We propose a methodology to near-optimal and scalable represent regular and irregular accesses. The representation consists of a set of pattern entries to compactly describe the behavior of the memory accesses and of pattern operations to consistently combine the pattern entries. The result is a final sequence of pattern entries which represents the global access scheme without unnecessary overestimation

    Ultra low energy domain specific instruction-set processor for on-line surveillance

    No full text
    \u3cp\u3eMany signal processing applications demand for highly energy efficient flexible implementations. In this paper, we propose a novel Domain Specific Instruction-set Processor (DSIP) architecture template which is tuned to deploy in the targeted domain of on-line surveillance. The architectur e, when implemented using a 40-nm CMOS standard cell library, executes a representative test vehicle with an energy efficiency of near ly 900 MOPS/mW including instruction and data memor ies. This is about 20 times higher than a state-of-the-ar t low power DSP architecture and less than a factor 2 below a heavily optimized ASIC realization for the same application benchmark.\u3c/p\u3
    corecore