
    Address generator synthesis


    Randomized cache placement for eliminating conflicts

    Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors, conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache whose miss ratio depends solely on working-set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as by cache simulations of individual loop kernels that illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite.
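
    The key mechanism described above is replacing the conventional power-of-two, low-order-bit cache index with a pseudorandom hash of the block address. The following is a minimal sketch of that idea, using a simple XOR-folding hash as a stand-in for the paper's actual index function; the cache geometry and the hash itself are assumptions, not the authors' design.

# Sketch: conventional vs. pseudorandom (XOR-folded) cache indexing.
NUM_SETS = 256      # assumed direct-mapped cache with 256 sets
BLOCK_BITS = 5      # assumed 32-byte cache blocks
INDEX_BITS = 8      # log2(NUM_SETS)

def conventional_index(addr: int) -> int:
    """Classic indexing: the low-order bits above the block offset."""
    return (addr >> BLOCK_BITS) & (NUM_SETS - 1)

def xor_folded_index(addr: int) -> int:
    """Pseudorandom indexing: XOR-fold higher address fields into the
    index so that regular strides spread across many sets."""
    a = addr >> BLOCK_BITS
    idx = 0
    while a:
        idx ^= a & (NUM_SETS - 1)   # fold in the next INDEX_BITS-wide field
        a >>= INDEX_BITS
    return idx

# A stride equal to the cache size maps every access to set 0 under
# conventional indexing but to 16 distinct sets under XOR folding.
addrs = [i * NUM_SETS * (1 << BLOCK_BITS) for i in range(16)]
print({conventional_index(a) for a in addrs})        # {0}
print(len({xor_folded_index(a) for a in addrs}))     # 16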

    Image Display and Manipulation System (IDAMS) program documentation, Appendixes A-D

    The IDAMS Processor is a package of task routines and support software that performs convolution filtering, image expansion, fast Fourier transformation, and other operations on a digital image tape. A unique task control card, together with any necessary parameter cards, selects each processing technique to be applied to the input image. A variable number of tasks can be selected for execution by including the proper task and parameter cards in the input deck. An executive maintains control of the run; it initiates execution of each task in turn and handles any necessary error processing.
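
    The control flow described above (task cards selecting routines, parameter cards feeding them, an executive dispatching tasks and handling errors) can be sketched as follows. This is a hypothetical illustration only; the routine names and card format are invented and do not come from the IDAMS documentation.

# Hypothetical sketch of a card-driven task executive in the style described
# above; task names and the card format are invented for illustration.

def convolution_filter(params): print("convolution filtering:", params)
def image_expansion(params):    print("image expansion:", params)
def fft(params):                print("fast Fourier transform:", params)

TASK_ROUTINES = {"CONVOLVE": convolution_filter, "EXPAND": image_expansion, "FFT": fft}

def run_deck(cards):
    """Executive loop: each TASK card selects a routine; the parameter cards
    that follow it (up to the next TASK card) are handed to that routine."""
    task, params = None, []
    for card in cards + ["TASK END"]:          # sentinel flushes the last task
        if card.startswith("TASK "):
            if task is not None:
                try:
                    TASK_ROUTINES[task](params)
                except Exception as err:       # error handling stays in the executive
                    print("task failed:", task, err)
            task, params = card.split()[1], []
        else:
            params.append(card)

run_deck(["TASK FFT", "SIZE=256", "TASK CONVOLVE", "KERNEL=3x3"])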

    A fast implementation for correcting errors in high-throughput sequencing data

    Next-generation DNA sequencing (NGS) technologies have produced a revolution in biological research. New computational tools are needed to deal with the huge amounts of data they output. The significantly shorter read length and higher per-base error rate compared with Sanger technology make things more difficult, and critical problems such as genome assembly are still not satisfactorily solved. Significant effort has recently been spent on software programs aimed at increasing the quality of NGS data by correcting errors. The most accurate program to date is HiTEC, and our contribution is a completely new implementation, HiTEC2. The new program is many times faster and uses much less space, while correcting more errors in the same number of iterations. We have eliminated the need for the suffix array data structure, as well as the need to install complicated statistical libraries, making HiTEC2 not only more efficient but also friendlier to use.
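
    HiTEC2's internal algorithm is not detailed in the abstract above. Purely as a generic illustration of the kind of problem being solved (correcting reads using a plain hash table of k-mer counts in place of a suffix array), a toy sketch might look like the following; the k-mer length, threshold, and correction rule are illustrative assumptions, not the HiTEC/HiTEC2 method.

# Toy k-mer-spectrum error correction. This is NOT the HiTEC2 algorithm;
# it only illustrates correcting reads using a hash table of k-mer counts.
from collections import Counter

K = 5            # k-mer length (toy value)
THRESHOLD = 2    # k-mers seen fewer times than this are treated as erroneous

def kmer_counts(reads):
    counts = Counter()
    for r in reads:
        for i in range(len(r) - K + 1):
            counts[r[i:i + K]] += 1
    return counts

def correct_read(read, counts):
    """Change a base when doing so turns a rare k-mer into a frequent one."""
    read = list(read)
    for i in range(len(read) - K + 1):
        kmer = "".join(read[i:i + K])
        if counts[kmer] >= THRESHOLD:
            continue
        fixed = False
        for j in range(K):                       # try each position in the k-mer
            for base in "ACGT":
                cand = kmer[:j] + base + kmer[j + 1:]
                if counts[cand] >= THRESHOLD:
                    read[i + j] = base
                    fixed = True
                    break
            if fixed:
                break
    return "".join(read)

reads = ["ACGTACGTAC"] * 5 + ["ACGTTCGTAC"]      # one read carries a single error
counts = kmer_counts(reads)
print(correct_read("ACGTTCGTAC", counts))        # -> ACGTACGTAC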

    Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques

    As the performance and energy-efficiency requirements of GPGPUs have risen, GPGPU memory management techniques have improved to meet them by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower memory latency and higher memory bandwidth. However, they do not always guarantee improved performance and energy efficiency because of the small cache size and the heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management. In this dissertation, we analyze performance pathologies and propose several techniques to improve memory management.

    First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present an implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes using a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over the baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and to the cache indexing latency, and we demonstrate that ACI continues to achieve high performance in various settings.

    Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (by 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (by 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy-efficiency gains over the baseline GPGPU architecture even when it is enhanced with advanced architectural technologies (e.g., higher capacity and associativity).

    Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of an application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on this ratio, BLPP dynamically allocates pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and the state-of-the-art technique (by 13.4% and 16.7%, respectively) and performs similarly to the static-best version (a 1.2% difference), which requires extensive offline profiling.
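
    As a simplified sketch of the page-placement idea in the third contribution: for a bandwidth-bound access stream, splitting pages between memory nodes in proportion to their bandwidth lets both nodes serve their share of the traffic in roughly the same time. The ratio model below is an assumption for illustration, not the BLPP model from the dissertation.

# Simplified, hypothetical sketch of bandwidth-proportional page placement
# between GPU and CPU memory; this is not the BLPP model itself.

def placement_ratio(gpu_bw_gbs: float, cpu_bw_gbs: float) -> float:
    """Fraction of pages to place in GPU memory so that a bandwidth-bound
    stream keeps both memories busy for roughly the same amount of time."""
    return gpu_bw_gbs / (gpu_bw_gbs + cpu_bw_gbs)

def place_pages(num_pages: int, gpu_bw_gbs: float, cpu_bw_gbs: float):
    """Split page indices across the two memory nodes by the ratio."""
    gpu_share = int(num_pages * placement_ratio(gpu_bw_gbs, cpu_bw_gbs))
    return list(range(gpu_share)), list(range(gpu_share, num_pages))

# Example: 900 GB/s GPU memory and 100 GB/s CPU memory over the interconnect.
gpu_pages, cpu_pages = place_pages(1000, 900.0, 100.0)
print(len(gpu_pages), len(cpu_pages))   # 900 100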

    Domain specific high performance reconfigurable architecture for a communication platform


    Design and Performance Analysis of Hardware Realization of 3GPP Physical Layer for 5G Cell Search

    5G Cell Search (CS) is the first step for a user equipment (UE) to initiate communication with the 5G node B (gNB) every time it is powered on. In cellular networks, CS is accomplished via synchronization signals (SS) broadcast by the gNB. The 3rd Generation Partnership Project (3GPP) 5G specifications offer a detailed discussion of SS generation at the gNB, but only a limited understanding of their blind search and detection is available. Unlike in 4G, the 5G SS may not be transmitted at the center of the carrier frequency, and its frequency location is unknown to the UE. In this work, we demonstrate 5G CS by designing a 3GPP-compatible hardware realization of the physical layer (PHY) of the gNB transmitter and UE receiver. The proposed SS detection explores a novel down-sampling approach, resulting in a significant reduction in complexity and latency. Through detailed performance analysis, we evaluate the functional correctness, computational complexity, and latency of the proposed approach for different word lengths, signal-to-noise ratios (SNR), and down-sampling factors. We demonstrate the complete CS functionality on the GNU Radio-based RFNoC framework and a USRP-FPGA platform. The 3GPP compatibility and the demonstration on hardware strengthen the commercial significance of the proposed work.
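
    The detection step described above (blind search of the SS using a down-sampled correlation) can be sketched in floating point as follows. The placeholder sequences, fixed down-sampling factor, and decimation without filtering are simplifying assumptions; they do not follow the 3GPP numerology or the paper's fixed-point hardware design.

# Illustrative PSS-style detection by cross-correlating a down-sampled
# received signal against down-sampled local replicas. Sequences and
# numerology are placeholders, not the 3GPP-specified signals.
import numpy as np

def downsample(x: np.ndarray, factor: int) -> np.ndarray:
    """Crude decimation by keeping every factor-th sample (a real receiver
    would low-pass filter first)."""
    return x[::factor]

def detect_ss(rx: np.ndarray, candidates, factor: int = 4):
    """Return (sequence index, timing offset) of the strongest correlation
    peak over all candidate sequences, computed at the reduced rate."""
    rx_d = downsample(rx, factor)
    best_id, best_offset, best_metric = None, None, -np.inf
    for seq_id, seq in enumerate(candidates):
        ref = downsample(seq, factor)
        corr = np.abs(np.correlate(rx_d, ref, mode="valid"))
        peak = int(np.argmax(corr))
        if corr[peak] > best_metric:
            best_id, best_offset, best_metric = seq_id, peak * factor, corr[peak]
    return best_id, best_offset

# Toy example: three random +/-1 placeholder sequences, one embedded in noise.
rng = np.random.default_rng(0)
candidates = [np.sign(rng.standard_normal(128)) for _ in range(3)]
rx = np.concatenate([0.1 * rng.standard_normal(500),
                     candidates[1],
                     0.1 * rng.standard_normal(500)])
print(detect_ss(rx, candidates))   # expected: (1, 500)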

    Design and Testing of High-Speed Multipliers by Using a Reversible Linear Feedback Shift Register

    In recent integrated circuit (IC) designs, built-in self-test (BIST) is becoming vital for memory, which is an essential part of a system on chip (SoC). The BIST design technique allows a circuit to test itself. It can provide shorter test times than externally applied tests and allows the use of low-cost test instruments throughout all production stages. Because of the randomness properties of LFSRs, the approach requires little hardware overhead. In this dissertation, the optimization and structural design of BIST based on reversible LFSRs is described. The reversible LFSR and the proposed LT LFSR are also used to design and test the architectures of different multipliers, such as array multipliers and the Booth multiplier.
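
    As a behavioural sketch of the BIST idea above (an LFSR generating pseudo-random operands for a multiplier under test), the snippet below uses an ordinary Fibonacci LFSR and a golden-model comparison in place of a signature analyzer. The width, taps, and seed are illustrative, and the reversible-logic aspect is not modelled.

# Behavioural sketch of LFSR-based BIST for a multiplier. Taps, width, and
# seed are illustrative; the reversible-logic implementation is not modelled.

def lfsr_patterns(seed: int, taps: int, width: int, count: int):
    """Fibonacci LFSR: the feedback bit is the XOR (parity) of the tapped bits."""
    state = seed & ((1 << width) - 1)
    assert state != 0, "an all-zero seed would lock up the LFSR"
    for _ in range(count):
        yield state
        feedback = bin(state & taps).count("1") & 1
        state = ((state << 1) | feedback) & ((1 << width) - 1)

def bist_multiplier(mult, golden, width=8, seed=0xA5, taps=0xB8, tests=64):
    """Drive the multiplier under test with pseudo-random operand pairs and
    compare each result against a golden model (a stand-in for a real
    signature analyzer)."""
    gen = lfsr_patterns(seed, taps, width, 2 * tests)
    for a, b in zip(gen, gen):           # consecutive LFSR states as operands
        if mult(a, b) != golden(a, b):
            return False
    return True

# Self-test of the sketch with an ideal multiplier model.
print(bist_multiplier(lambda a, b: a * b, lambda a, b: a * b))   # True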

    A high-speed integrated circuit with applications to RSA Cryptography

    The rapid growth in the use of computers and networks in government, commercial and private communications systems has led to an increasing need for these systems to be secure against unauthorised access and eavesdropping. To this end, modern computer security systems employ public-key ciphers, of which probably the best known is the RSA cryptosystem, to provide both secrecy and authentication facilities. The basic RSA cryptographic operation is a modular exponentiation where the modulus and exponent are integers typically greater than 500 bits long. Therefore, obtaining reasonable encryption rates using the RSA cipher requires that it be implemented in hardware. This thesis presents the design of a high-performance VLSI device, called the WHiSpER chip, that can perform the modular exponentiations required by the RSA cryptosystem for moduli and exponents up to 506 bits long. The design has an expected throughput in excess of 64 kbit/s, making it attractive for use both as a general RSA processor within the security function provider of a security system and for direct use on moderate-speed public communication networks such as ISDN. The thesis investigates the low-level techniques used for implementing high-speed arithmetic hardware in general, and reviews the methods used by designers of existing modular multiplication/exponentiation circuits with respect to circuit speed and efficiency. A new modular multiplication algorithm, MMDDAMMM, based on Montgomery arithmetic, together with an efficient multiplier architecture, is proposed that removes the speed bottleneck of previous designs. Finally, the implementation of the new algorithm and architecture within the WHiSpER chip is detailed, along with a discussion of the application of the chip to ciphering and key generation.
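
    The arithmetic core mentioned above is modular exponentiation built from Montgomery multiplications, which replace division by the modulus with shifts and masks. The sketch below is the textbook Montgomery method for reference only, not the thesis's MMDDAMMM algorithm or the WHiSpER datapath.

# Textbook Montgomery multiplication/exponentiation for reference; this is
# not the MMDDAMMM algorithm or the WHiSpER hardware architecture.

def montgomery_setup(n: int, bits: int):
    r = 1 << bits                        # R = 2^bits, gcd(R, n) = 1 since n is odd
    n_prime = -pow(n, -1, r) % r         # n' with n * n' = -1 (mod R)
    return r, n_prime

def mont_mul(a: int, b: int, n: int, r: int, n_prime: int) -> int:
    """Return a * b * R^-1 mod n; the % r and // r become masks and shifts
    in hardware."""
    t = a * b
    m = (t * n_prime) % r
    u = (t + m * n) // r
    return u - n if u >= n else u

def mont_exp(base: int, exp: int, n: int, bits: int) -> int:
    """Left-to-right square-and-multiply, entirely in the Montgomery domain."""
    r, n_prime = montgomery_setup(n, bits)
    result = r % n                       # 1 in Montgomery form
    x = (base * r) % n                   # base in Montgomery form
    for bit in bin(exp)[2:]:
        result = mont_mul(result, result, n, r, n_prime)
        if bit == "1":
            result = mont_mul(result, x, n, r, n_prime)
    return mont_mul(result, 1, n, r, n_prime)   # convert back out of Montgomery form

# Small check against Python's built-in modular exponentiation.
n = 197 * 227                            # toy odd modulus, not a 506-bit RSA modulus
msg, e = 12345, 65537
print(mont_exp(msg, e, n, n.bit_length()), pow(msg, e, n))   # equal values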