2,067 research outputs found
Randomized cache placement for eliminating conflicts
Applications with regular patterns of memory access can experience high levels of cache conflict misses. In shared-memory multiprocessors, conflict misses can be increased significantly by the data transpositions required for parallelization. Techniques such as blocking, which are introduced within a single thread to improve locality, can result in yet more conflict misses. The tension between minimizing cache conflicts and the other transformations needed for efficient parallelization leads to complex optimization problems for parallelizing compilers. This paper shows how the introduction of a pseudorandom element into the cache index function can effectively eliminate repetitive conflict misses and produce a cache whose miss ratio depends solely on working set behavior. We examine the impact of pseudorandom cache indexing on processor cycle times and present practical solutions to some of the major implementation issues for this type of cache. Our conclusions are supported by simulations of a superscalar out-of-order processor executing the SPEC95 benchmarks, as well as by cache simulations of individual loop kernels that illustrate specific effects. We present measurements of instructions committed per cycle (IPC) when comparing the performance of different cache architectures on whole-program benchmarks such as the SPEC95 suite.
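The effect described in this abstract can be illustrated with a small sketch. It compares conventional modulo indexing with a simple XOR-folding hash on a strided access pattern. The XOR fold is only an assumed stand-in for a pseudorandom index function, not necessarily the function proposed in the paper, and the names `conventional_index`, `xor_folded_index`, and `distinct_sets` are hypothetical.

```python
def conventional_index(block_addr, num_sets):
    """Conventional indexing: the low-order bits of the block address."""
    return block_addr % num_sets


def xor_folded_index(block_addr, num_sets):
    """Assumed stand-in hash: XOR all index-width slices of the address.

    num_sets must be a power of two.
    """
    bits = num_sets.bit_length() - 1
    mask = num_sets - 1
    index = 0
    while block_addr:
        index ^= block_addr & mask
        block_addr >>= bits
    return index


def distinct_sets(index_fn, block_addrs, num_sets):
    """Count how many different cache sets the addresses map to."""
    return len({index_fn(a, num_sets) for a in block_addrs})


NUM_SETS = 64
# A stride equal to the number of sets: the worst case for modulo indexing.
strided = [i * NUM_SETS for i in range(64)]

print(distinct_sets(conventional_index, strided, NUM_SETS))  # 1
print(distinct_sets(xor_folded_index, strided, NUM_SETS))    # 64
```

With modulo indexing all 64 blocks compete for one set, whereas the hashed index spreads them over all 64 sets, so the repetitive conflicts disappear for this pattern.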
Image Display and Manipulation System (IDAMS) program documentation, Appendixes A-D
The IDAMS Processor is a package of task routines and support software that performs convolution filtering, image expansion, fast Fourier transformation, and other operations on a digital image tape. A unique task control card, together with any necessary parameter cards, selects each processing technique to be applied to the input image. A variable number of tasks can be selected for execution by including the proper task and parameter cards in the input deck. An executive maintains control of the run; it initiates execution of each task in turn and handles any necessary error processing.
A FAST IMPLEMENTATION FOR CORRECTING ERRORS IN HIGH THROUGHPUT SEQUENCING DATA
Next-generation DNA sequencing (NGS) technologies have revolutionized biological research. New computational tools are needed to deal with the huge amounts of data they output. The significantly shorter reads and higher per-base error rate compared with Sanger technology make things more difficult, and critical problems such as genome assembly are still not satisfactorily solved. Significant effort has recently been spent on software aimed at increasing the quality of NGS data by correcting errors. The most accurate program to date is HiTEC, and our contribution is a completely new implementation, HiTEC2. The new program is many times faster and uses much less space, while correcting more errors in the same number of iterations. We have eliminated the need for the suffix array data structure and for installing complicated statistical libraries, making HiTEC2 not only more efficient but also friendlier.
Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques
Department of Computer Science and Engineering
As the performance and energy efficiency requirements of GPGPUs have risen, memory management techniques of GPGPUs have improved to meet the requirements by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower latency and higher bandwidth of the memory. However, these methods do not always guarantee improved performance and energy efficiency due to the small cache size and heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques.
In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present their implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes using a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over the baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and to the cache indexing latency, and demonstrate that ACI continues to achieve high performance in various settings.
Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (i.e., 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy efficiency gains over the baseline GPGPU architecture even when enhanced with advanced architectural technologies (e.g., higher capacity, associativity).
Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of an application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on the optimal page allocation ratio, BLPP dynamically allocates pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and the state-of-the-art technique (i.e., by 13.4% and 16.7%) and performs similarly to the static-best version (i.e., 1.2% difference), which requires extensive offline profiling.
Design and Performance Analysis of Hardware Realization of 3GPP Physical Layer for 5G Cell Search
5G Cell Search (CS) is the first step for user equipment (UE) to initiate the
communication with the 5G node B (gNB) every time it is powered ON. In cellular
networks, CS is accomplished via synchronization signals (SS) broadcast by
gNB. 5G 3rd generation partnership project (3GPP) specifications offer a
detailed discussion on the SS generation at gNB, but only a limited understanding
of their blind search and detection is available. Unlike 4G, 5G SS may not be
transmitted at the center of carrier frequency and their frequency location is
unknown to UE. In this work, we demonstrate the 5G CS by designing 3GPP
compatible hardware realization of the physical layer (PHY) of the gNB
transmitter and UE receiver. The proposed SS detection explores a novel
down-sampling approach resulting in a significant reduction in complexity and
latency. Via detailed performance analysis, we analyze the functional
correctness, computational complexity, and latency of the proposed approach for
different word lengths, signal-to-noise ratio (SNR), and down-sampling factors.
We demonstrate the complete CS functionality on GNU Radio-based RFNoC framework
and USRP-FPGA platform. The 3GPP compatibility and demonstration on hardware
strengthen the commercial significance of the proposed work.
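As a rough illustration of the detection step in cell search, the sketch below generates the three primary synchronization signal (PSS) sequences and identifies which one was received by correlation. The generator follows our reading of the m-sequence definition in 3GPP TS 38.211 and should be checked against the specification. The function names are hypothetical, and the sketch works in the sequence domain only: it does not model the frequency search, OFDM processing, or the down-sampling approach proposed in the paper.

```python
def pss_sequence(n_id2):
    """Length-127 BPSK PSS for sector identity n_id2 in {0, 1, 2}.

    Assumed from TS 38.211: x(i+7) = (x(i+4) + x(i)) mod 2 with
    [x(6) .. x(0)] = [1 1 1 0 1 1 0] and a cyclic shift of 43 * n_id2.
    """
    x = [0, 1, 1, 0, 1, 1, 1]  # x(0) .. x(6)
    for i in range(120):
        x.append((x[i + 4] + x[i]) % 2)
    return [1 - 2 * x[(n + 43 * n_id2) % 127] for n in range(127)]


def detect_n_id2(received):
    """Correlate against all three candidates and return (best, scores)."""
    scores = []
    for candidate in range(3):
        reference = pss_sequence(candidate)
        scores.append(abs(sum(r * c for r, c in zip(received, reference))))
    best = max(range(3), key=scores.__getitem__)
    return best, scores


# Noise-free check: every transmitted sequence is identified correctly.
for transmitted in range(3):
    detected, scores = detect_n_id2(pss_sequence(transmitted))
    print(transmitted, detected, scores)
```

Because the three sequences are cyclic shifts of one m-sequence, the matching candidate gives a correlation of 127 while the others stay near zero, which is what makes a simple peak search workable.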
Design and Testing of High Speed Multipliers by using Reversible Linear Feedback Shift Register
In recent integrated circuit (IC) designs, built-in self-test (BIST) is becoming vital for memory, which is an essential part of a system on chip (SoC). The BIST design technique allows a circuit to test itself. It can provide a shorter test time than externally applied tests and allows the use of low-cost test instruments throughout all production stages. Because of the randomness properties of linear feedback shift registers (LFSRs), it requires less hardware overhead. This dissertation describes the optimization and structural design of BIST based on reversible LFSRs. The reversible LFSR and the proposed LT LFSR are also used to design and test the architectures of different multipliers, such as array multipliers and the Booth multiplier.
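To make the role of an LFSR as a BIST pattern generator concrete, here is a minimal sketch of a conventional Fibonacci LFSR. It is not the reversible LFSR or the LT LFSR from the dissertation. The function name `lfsr_patterns`, the 4-bit width, and the tap choice are assumptions made for illustration.

```python
def lfsr_patterns(seed, taps, width):
    """Yield the states of a Fibonacci LFSR until the seed state recurs.

    taps lists the bit positions (0 = least significant) that are XORed
    to form the bit shifted in on each step.
    """
    mask = (1 << width) - 1
    seed &= mask
    if seed == 0:
        raise ValueError("the all-zero state never changes")
    state = seed
    while True:
        yield state
        feedback = 0
        for tap in taps:
            feedback ^= (state >> tap) & 1
        state = ((state << 1) | feedback) & mask
        if state == seed:
            return


# 4-bit register with taps at bits 3 and 2 (assumed maximal-length choice).
patterns = list(lfsr_patterns(seed=0b0001, taps=(3, 2), width=4))
print([format(p, "04b") for p in patterns])
print(len(patterns))
```

With these taps the register steps through all 15 nonzero 4-bit states before repeating, so a few flip-flops and one XOR gate are enough to drive every nonzero test pattern into a 4-bit circuit input.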
A high-speed integrated circuit with applications to RSA Cryptography
The rapid growth in the use of computers and networks in government, commercial and
private communications systems has led to an increasing need for these systems to be
secure against unauthorised access and eavesdropping. To this end, modern computer
security systems employ public-key ciphers, of which probably the most well known is the
RSA ciphersystem, to provide both secrecy and authentication facilities.
The basic RSA cryptographic operation is a modular exponentiation where the modulus
and exponent are integers typically greater than 500 bits long. Therefore, to obtain reasonable
encryption rates using the RSA cipher requires that it be implemented in hardware.
This thesis presents the design of a high-performance VLSI device, called the WHiSpER
chip, that can perform the modular exponentiations required by the RSA cryptosystem
for moduli and exponents up to 506 bits long. The design has an expected throughput
in excess of 64 kbit/s, making it attractive for use both as a general RSA processor within
the security function provider of a security system, and for direct use on moderate-speed
public communication networks such as ISDN.
The thesis investigates the low-level techniques used for implementing high-speed arithmetic
hardware in general, and reviews the methods used by designers of existing modular
multiplication/exponentiation circuits with respect to circuit speed and efficiency.
A new modular multiplication algorithm, MMDDAMMM, based on Montgomery arithmetic,
together with an efficient multiplier architecture, are proposed that remove the
speed bottleneck of previous designs.
Finally, the implementation of the new algorithm and architecture within the WHiSpER
chip is detailed, along with a discussion of the application of the chip to ciphering and key
generation.
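The modular exponentiation at the core of RSA, and the idea behind Montgomery arithmetic, can be sketched in software as follows. This is the textbook Montgomery reduction combined with left-to-right square-and-multiply. It is not the MMDDAMMM algorithm or the WHiSpER architecture, whose details are not given in this abstract, and the function names are hypothetical. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
def mont_reduce(t, n, k, n_prime):
    """Return t * R^-1 mod n for R = 2**k, assuming 0 <= t < n * R."""
    r_mask = (1 << k) - 1
    m = ((t & r_mask) * n_prime) & r_mask
    u = (t + m * n) >> k  # exact division: the low k bits of t + m*n are zero
    return u - n if u >= n else u


def mont_pow(base, exp, n):
    """Compute base**exp mod n for odd n > 1 using Montgomery products."""
    if n <= 1 or n % 2 == 0:
        raise ValueError("modulus must be odd and greater than 1")
    k = n.bit_length()
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r   # n * n_prime == -1 (mod R)
    x = (base % n) * r % n           # operand in Montgomery form
    acc = r % n                      # Montgomery form of 1
    for bit in bin(exp)[2:]:
        acc = mont_reduce(acc * acc, n, k, n_prime)    # square
        if bit == "1":
            acc = mont_reduce(acc * x, n, k, n_prime)  # multiply
    return mont_reduce(acc, n, k, n_prime)             # leave Montgomery form


print(mont_pow(4, 13, 497) == pow(4, 13, 497))
```

The reduction replaces division by the modulus with masking and shifting by a power of two, which is why Montgomery-style multipliers are a common choice for hardware implementations.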