333 research outputs found

    ANALYZING GENERAL-PURPOSE COMPUTING PERFORMANCE ON GPU

    Get PDF
    ABSTRACT Analyzing General-Purpose Computing Performance on GPU Graphic Processing Unit (GPU) has become one of the most important components in modern computer systems. GPUs have evolved from a single -purpose graphic rendering hardware to a powerful processor that is capable of handling many different kinds of computing tasks. However, GPUs don’t perform well on every application, and it takes a lot of design effort to get good performance on a GPU. This thesis aims to investigate the relative performance of a GPU vs. CPU. Design effort is held minimum for both CPU implementations and GPU implementations. Matrix multiplication, Advance Encryption Standard (AES) and 32-bit Cyclic Redundancy Check (CRC32) are implemented on both a CPU and GPU. Input data size is varied to test the performance of the CPU and the GPU. The GPU generally has better performance than the CPU for matrix multiplication and AES because of the applications\u27 good instruction and data parallelism. CRC has very poor parallelism, so the CPU performs better. For very small data inputs, the CPU generally outperformed the GPU because of GPU memory transfer overhead

    Within-Die Delay Variation Measurement And Analysis For Emerging Technologies Using An Embedded Test Structure

    Get PDF
    Both random and systematic within-die process variations (PV) are growing more severe with shrinking geometries and increasing die size. Escalation in the variations in delay and power with reductions in feature size places higher demands on the accuracy of variation models. Their availability can be used to improve yield, and the corresponding profitability and product quality of the fabricated integrated circuits (ICs). Sources of within-die variations include optical source limitations, and layout-based systematic effects (pitch, line-width variability, and microscopic etch loading). Unfortunately, accurate models of within-die PVs are becoming more difficult to derive because of their increasingly sensitivity to design-context. Embedded test structures (ETS) continue to play an important role in the development of models of PVs and as a mechanism to improve correlations between hardware and models. Variations in path delays are increasing with scaling, and are increasingly affected by neighborhood\u27 interactions. In order to fully characterize within-die variations, delays must be measured in the context of actual core-logic macros. Doing so requires the use of an embedded test structure, as opposed to traditional scribe line test structures such as ring oscillators (RO). Accurate measurements of within-die variations can be used, e.g., to better tune models to actual hardware (model-to-hardware correlations). In this research project, I propose an embedded test structure called REBEL (Regional dELay BEhavior) that is designed to measure path delays in a minimally invasive fashion; and its architecture measures the path delays more accurately. Design for manufacture-ability (DFM) analysis is done on the on 90 nm ASIC chips and 28nm Zynq 7000 series FPGA boards. I present ASIC results on within-die path delay variations in a floating-point unit (FPU) fabricated in IBM\u27s 90 nm technology, with 5 pipeline stages, used as a test vehicle in chip experiments carried out at nine different temperature/voltage (TV) corners. Also experimental data has been analyzed for path delay variations in short vs long paths. FPGA results on within-die variation and die-to-die variations on Advanced Encryption System (AES) using single pipelined stage are also presented. Other analysis that have been performed on the calibrated path delays are Flip Flop propagation delays for both rising and falling edge (tpHL and tpLH), uncertainty analysis, path distribution analysis, short versus long path variations and mid-length path within-die variation. I also analyze the impact on delay when the chips are subjected to industrial-level temperature and voltage variations. From the experimental results, it has been established that the proposed REBEL provides capabilities similar to an off-chip logic analyzer, i.e., it is able to capture the temporal behavior of the signal over time, including any static and dynamic hazards that may occur on the tested path. The ASIC results further show that path delays are correlated to the launch-capture (LC) interval used to time them. Therefore, calibration as proposed in this work must be carried out in order to obtain an accurate analysis of within-die variations. Results on ASIC chips show that short paths can vary up to 35% on average, while long paths vary up to 20% at nominal temperature and voltage. A similar trend occurs for within-die variations of mid-length paths where magnitudes reduced to 20% and 5%, respectively. The magnitude of delay variations in both these analyses increase as temperature and voltage are changed to increase performance. The high level of within-die delay variations are undesirable from a design perspective, but they represent a rich source of entropy for applications that make use of \u27secrets\u27 such as authentication, hardware metering and encryption. Physical unclonable functions (PUFs) are a class of primitives that leverage within-die-variations as a means of generating random bit strings for these types of applications, including hardware security and trust. Zynq FPGAs Die-to-Die and within-die variation study shows that on average there is 5% of within-Die variation and the range of die-to-Die variation can go upto 3ns. The die-to-Die variations can be explored in much further detail to study the variations spatial dependance. Additionally, I also carried out research in the area data mining to cater for big data by focusing the work on decision tree classification (DTC) to speed-up the classification step in hardware implementation. For this purpose, I devised a pipelined architecture for the implementation of axis parallel binary decision tree classification for meeting up with the requirements of execution time and minimal resource usage in terms of area. The motivation for this work is that analyzing larger data-sets have created abundant opportunities for algorithmic and architectural developments, and data-mining innovations, thus creating a great demand for faster execution of these algorithms, leading towards improving execution time and resource utilization. Decision trees (DT) have since been implemented in software programs. Though, the software implementation of DTC is highly accurate, the execution times and the resource utilization still require improvement to meet the computational demands in the ever growing industry. On the other hand, hardware implementation of DT has not been thoroughly investigated or reported in detail. Therefore, I propose a hardware acceleration of pipelined architecture that incorporates the parallel approach in acquiring the data by having parallel engines working on different partitions of data independently. Also, each engine is processing the data in a pipelined fashion to utilize the resources more efficiently and reduce the time for processing all the data records/tuples. Experimental results show that our proposed hardware acceleration of classification algorithms has increased throughput, by reducing the number of clock cycles required to process the data and generate the results, and it requires minimal resources hence it is area efficient. This architecture also enables algorithms to scale with increasingly large and complex data sets. We developed the DTC algorithm in detail and explored techniques for adapting it to a hardware implementation successfully. This system is 3.5 times faster than the existing hardware implementation of classification.\u2

    HARDWARE ATTACK DETECTION AND PREVENTION FOR CHIP SECURITY

    Get PDF
    Hardware security is a serious emerging concern in chip designs and applications. Due to the globalization of the semiconductor design and fabrication process, integrated circuits (ICs, a.k.a. chips) are becoming increasingly vulnerable to passive and active hardware attacks. Passive attacks on chips result in secret information leaking while active attacks cause IC malfunction and catastrophic system failures. This thesis focuses on detection and prevention methods against active attacks, in particular, hardware Trojan (HT). Existing HT detection methods have limited capability to detect small-scale HTs and are further challenged by the increased process variation. We propose to use differential Cascade Voltage Switch Logic (DCVSL) method to detect small HTs and achieve a success rate of 66% to 98%. This work also presents different fault tolerant methods to handle the active attacks on symmetric-key cipher SIMON, which is a recent lightweight cipher. Simulation results show that our Even Parity Code SIMON consumes less area and power than double modular redundancy SIMON and Reversed-SIMON, but yields a higher fault -detection-failure rate as the number of concurrent faults increases. In addition, the emerging technology, memristor, is explored to protect SIMON from passive attacks. Simulation results indicate that the memristor-based SIMON has a unique power characteristic that adds new challenges on secrete key extraction

    Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor

    Full text link
    With the ease-of-programming, flexibility and yet efficiency, MapReduce has become one of the most popular frameworks for building big-data applications. MapReduce was originally designed for distributed-computing, and has been extended to various architectures, e,g, multi-core CPUs, GPUs and FPGAs. In this work, we focus on optimizing the MapReduce framework on Xeon Phi, which is the latest product released by Intel based on the Many Integrated Core Architecture. To the best of our knowledge, this is the first work to optimize the MapReduce framework on the Xeon Phi. In our work, we utilize advanced features of the Xeon Phi to achieve high performance. In order to take advantage of the SIMD vector processing units, we propose a vectorization friendly technique for the map phase to assist the auto-vectorization as well as develop SIMD hash computation algorithms. Furthermore, we utilize MIMD hyper-threading to pipeline the map and reduce to improve the resource utilization. We also eliminate multiple local arrays but use low cost atomic operations on the global array for some applications, which can improve the thread scalability and data locality due to the coherent L2 caches. Finally, for a given application, our framework can either automatically detect suitable techniques to apply or provide guideline for users at compilation time. We conduct comprehensive experiments to benchmark the Xeon Phi and compare our optimized MapReduce framework with a state-of-the-art multi-core based MapReduce framework (Phoenix++). By evaluating six real-world applications, the experimental results show that our optimized framework is 1.2X to 38X faster than Phoenix++ for various applications on the Xeon Phi

    Radiation Hardened by Design Methodologies for Soft-Error Mitigated Digital Architectures

    Get PDF
    abstract: Digital architectures for data encryption, processing, clock synthesis, data transfer, etc. are susceptible to radiation induced soft errors due to charge collection in complementary metal oxide semiconductor (CMOS) integrated circuits (ICs). Radiation hardening by design (RHBD) techniques such as double modular redundancy (DMR) and triple modular redundancy (TMR) are used for error detection and correction respectively in such architectures. Multiple node charge collection (MNCC) causes domain crossing errors (DCE) which can render the redundancy ineffectual. This dissertation describes techniques to ensure DCE mitigation with statistical confidence for various designs. Both sequential and combinatorial logic are separated using these custom and computer aided design (CAD) methodologies. Radiation vulnerability and design overhead are studied on VLSI sub-systems including an advanced encryption standard (AES) which is DCE mitigated using module level coarse separation on a 90-nm process with 99.999% DCE mitigation. A radiation hardened microprocessor (HERMES2) is implemented in both 90-nm and 55-nm technologies with an interleaved separation methodology with 99.99% DCE mitigation while achieving 4.9% increased cell density, 28.5 % reduced routing and 5.6% reduced power dissipation over the module fences implementation. A DMR register-file (RF) is implemented in 55 nm process and used in the HERMES2 microprocessor. The RF array custom design and the decoders APR designed are explored with a focus on design cycle time. Quality of results (QOR) is studied from power, performance, area and reliability (PPAR) perspective to ascertain the improvement over other design techniques. A radiation hardened all-digital multiplying pulsed digital delay line (DDL) is designed for double data rate (DDR2/3) applications for data eye centering during high speed off-chip data transfer. The effect of noise, radiation particle strikes and statistical variation on the designed DDL are studied in detail. The design achieves the best in class 22.4 ps peak-to-peak jitter, 100-850 MHz range at 14 pJ/cycle energy consumption. Vulnerability of the non-hardened design is characterized and portions of the redundant DDL are separated in custom and auto-place and route (APR). Thus, a range of designs for mission critical applications are implemented using methodologies proposed in this work and their potential PPAR benefits explored in detail.Dissertation/ThesisDoctoral Dissertation Electrical Engineering 201

    Design and Characterization of a Secure Automatic Dependent Surveillance-Broadcast Prototype

    Get PDF
    During sensitive military operations, such as air operations in theater, the broadcasting of ADS-B messages poses a serious security concern, but the small message payload of ADS-B transmissions renders encryption standards such as AES unsuitable. Format-preserving encryption (FPE), a technique already in use on small message sizes such as credit card numbers, provides a suitable implementation of strong symmetric key encryption for the military context. This research proposes and details a Secure ADS-B (SADS-B) system which utilizes a bump-in-the-wire approach for encryption and decryption of ADS-B messages using FPE, going on to then characterize the prototype\u27s performance using the following metrics: detection and error rates, operational latency, and utilization of resources on the underlying field programmable gate array (FPGA) hardware. Findings include sub-millisecond operational latency, full data rate throughput capability for ADS-B, and approximately 50% utilization of the available FPGA resources. Overall, findings suggest that a layered security system such as SADS-B is suitable for adding confidentiality to ADS-B
    corecore