
    Hardware/software approaches for reducing the process variation impact on instruction fetches

    As technology moves towards finer process geometries, it is becoming extremely difficult to control critical physical parameters such as channel length, gate oxide thickness, and dopant ion concentration. Variations in these parameters lead to dramatic variations in access latencies in Static Random Access Memory (SRAM) devices. This means that different lines of the same cache may have different access latencies. A simple solution to this problem is to adopt the worst-case latency paradigm. While this egalitarian cache management is simple, it may introduce significant performance overhead during instruction fetches, when both address translation (instruction Translation Lookaside Buffer (TLB) access) and instruction cache access take place, making it infeasible for future high-performance processors. In this study, we first propose some hardware and software enhancements and then, based on those, investigate several techniques to mitigate the effect of process variation on the instruction fetch pipeline stage in modern processors. For address translation, we study an approach that performs the virtual-to-physical page translation once, stores it in a special register, and reuses it as long as execution remains on the same instruction page. To handle varying access latencies across different instruction cache lines, we annotate instructions with the access latency of their cache lines, giving the fetch circuitry a hint about how long to wait for the next instruction to become available.
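
    A minimal sketch of the translation-reuse idea, assuming hypothetical names (TranslationRegister, tlb_translate) and a 4 KiB page size; the paper keeps this state in a hardware register, modeled here in software:

        #include <cstdint>

        // Illustrative constants; the real page size and iTLB interface
        // are implementation details assumed here.
        constexpr uint64_t kPageBits = 12;  // 4 KiB instruction pages
        constexpr uint64_t kPageMask = ~((1ULL << kPageBits) - 1);

        // Stub standing in for a full iTLB lookup; identity-maps pages
        // purely for illustration.
        uint64_t tlb_translate(uint64_t vpage) { return vpage; }

        // The "special register": the most recent page translation.
        struct TranslationRegister {
            uint64_t vpage  = ~0ULL;  // virtual page of the last fetch
            uint64_t pframe = 0;      // its physical frame
        };

        // Translate a fetch address, reusing the stored translation while
        // execution stays on the same instruction page; only a page change
        // pays for a real iTLB access.
        uint64_t translate_fetch(TranslationRegister& tr, uint64_t vaddr) {
            const uint64_t vpage = vaddr & kPageMask;
            if (vpage != tr.vpage) {
                tr.vpage  = vpage;
                tr.pframe = tlb_translate(vpage);
            }
            return tr.pframe | (vaddr & ~kPageMask);
        }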

    Exploiting Fine-Grain Concurrency: Analytical Insights in Superscalar Processor Design

    This dissertation develops analytical models to provide insight into various design issues associated with superscalar-type processors, i.e., processors capable of executing multiple instructions per cycle. A survey of the existing machines and literature has been completed, with a proposed classification of the various approaches for exploiting fine-grain concurrency. Optimization of a single pipeline is discussed based on an analytical model. The model-predicted performance curves are found to be in close proximity to published results using simulation techniques. A model is also developed for comparing different branch strategies for single-pipeline processors in terms of their effectiveness in reducing branch delay. The additional instruction fetch traffic generated by certain branch strategies is also studied and is shown to be a useful criterion for choosing between equally well-performing strategies. Next, processors with multiple pipelines are modelled to study the tradeoffs associated with deeper pipelines versus multiple pipelines. The model developed can reveal the cause of a performance bottleneck: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection. The cost associated with speculative (i.e., beyond basic block) execution is examined via probability distributions that characterize the inherent parallelism in the instruction stream. The throughput prediction of the analytic model is shown, using a variety of benchmarks, to be close to the measured static throughput of the compiler output, under resource and scope constraints. Further experiments provide misprediction delay estimates for these benchmarks under scope constraints, assuming beyond-basic-block, out-of-order execution and run-time scheduling. These results were derived using traces generated by the Multiflow TRACE SCHEDULING™(*) compacting C and FORTRAN 77 compilers. A simplified extension of the model to include multiprocessors is also proposed. The extended model is used to analyze combined systems, such as superpipelined multiprocessors and superscalar multiprocessors, both with shared memory. It is shown that the number of pipelines (or processors) at which the maximum throughput is obtained becomes increasingly sensitive to the ratio of memory access time to network access delay as memory access time increases. Further, as a function of inter-iteration dependency distance, optimum throughput is shown to vary nonlinearly, whereas the corresponding optimum number of processors varies linearly. The predictions from the analytical model agree with published results based on simulations. (*)TRACE SCHEDULING is a trademark of Multiflow Computer, Inc.
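
    As one concrete illustration of the kind of single-pipeline model described (the notation here is illustrative, not the dissertation's own): assume an s-stage pipeline, a fraction p_b of instructions that are branches, and a probability p_t that a branch disrupts fetch for s - 1 cycles. Then:

        % Illustrative single-pipeline throughput model (assumed form).
        \[
          \mathrm{CPI} = 1 + p_b\, p_t\,(s-1), \qquad
          G = \frac{f_{\mathrm{clk}}}{\mathrm{CPI}}
            = \frac{f_{\mathrm{clk}}}{1 + p_b\, p_t\,(s-1)}.
        \]
        % A branch strategy that cuts the effective penalty from s-1 to d
        % cycles improves throughput by the factor
        % \( \bigl(1 + p_b\, p_t\,(s-1)\bigr) / \bigl(1 + p_b\, p_t\, d\bigr) \),
        % which is how strategies can perform equally well while generating
        % different amounts of extra fetch traffic.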

    A comprehensive approach in performance evaluation for modern real-time operating systems

    In real-time computing, the accurate characterization of the performance and determinism that a particular real-time operating system/hardware combination can provide for real-time applications is essential. This issue is not properly addressed by existing performance metrics, mainly due to their lack of completeness and generality. In this paper we present a set of comprehensive, easy-to-implement and useful metrics covering three basic real-time operating system features: response to external events, intertask synchronization and resource sharing, and intertask data transfer. The evaluation of real-time operating systems using a set of fine-grained metrics is fundamental to guarantee that we can reach the required determinism in real-world applications.
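
    A minimal sketch of one such fine-grained metric, intertask synchronization latency, measured as the mean round-trip time of a two-thread ping-pong; std::condition_variable and std::chrono stand in here for a specific RTOS's primitives and timestamp source:

        #include <chrono>
        #include <condition_variable>
        #include <cstdio>
        #include <mutex>
        #include <thread>

        // Hypothetical metric harness (not from the paper): a real RTOS
        // benchmark would use the kernel's own semaphore or message-queue
        // primitives and a calibrated hardware clock.
        int main() {
            std::mutex m;
            std::condition_variable cv;
            bool ping = false, pong = false;
            constexpr int kIters = 10000;

            std::thread responder([&] {
                for (int i = 0; i < kIters; ++i) {
                    std::unique_lock<std::mutex> lk(m);
                    cv.wait(lk, [&] { return ping; });
                    ping = false;   // consume the ping...
                    pong = true;    // ...and answer with a pong
                    cv.notify_one();
                }
            });

            auto t0 = std::chrono::steady_clock::now();
            for (int i = 0; i < kIters; ++i) {
                std::unique_lock<std::mutex> lk(m);
                ping = true;
                cv.notify_one();
                cv.wait(lk, [&] { return pong; });
                pong = false;
            }
            auto t1 = std::chrono::steady_clock::now();
            responder.join();

            auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
            std::printf("mean round-trip: %.0f ns\n", double(ns.count()) / kIters);
        }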

    FPGA-based acceleration of the RMAP short read mapping tool

    Bioinformatics is a rapidly emerging field. Next-generation sequencing technologies are producing data at rates of up to several gigabytes per day, making bioinformatics applications increasingly computationally intensive. In order to achieve greater speeds for processing this data, various techniques have been developed. These techniques involve parallelizing algorithms and/or spreading data across many computing nodes composed of devices such as microprocessors, Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). In this thesis, an FPGA is used to accelerate a bioinformatics application called RMAP, which is used for short-read mapping. The most computationally intensive function in RMAP, the read mapping function, is implemented on the FPGA's reconfigurable hardware fabric. This is a first step in a larger effort to develop a more optimal hardware/software co-design for RMAP. The Convey HC-1 Hybrid Computing System was used as the platform for development. The short-read mapping functionality of RMAP was implemented on one of the four Xilinx Virtex 5 FPGAs available in the HC-1 system. The RMAP 2.0 software was rewritten to separate out the read mapping function to facilitate porting it to hardware. The implemented design was evaluated by varying input parameters such as genome size and number of reads. In addition, the hardware design was analyzed to find potential bottlenecks. The implementation results showed a speedup of ~5x using datasets with a varying number of reads and a fixed reference genome, and ~2x using datasets with varying genome size and a fixed number of reads, for the hardware-implemented short-read mapping function of RMAP.
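
    A simplified sketch of the core read-mapping computation (not RMAP's actual algorithm, which uses seed tables and base-quality scores): slide each read along the reference and keep positions with at most max_mismatch mismatches. The inner loop is fixed-width and independent across positions, which is what makes it a natural fit for FPGA fabric:

        #include <cstddef>
        #include <string>
        #include <vector>

        // Returns all reference positions where `read` aligns with at most
        // `max_mismatch` mismatching bases (illustrative brute-force form).
        std::vector<size_t> map_read(const std::string& genome,
                                     const std::string& read,
                                     size_t max_mismatch) {
            std::vector<size_t> hits;
            for (size_t pos = 0; pos + read.size() <= genome.size(); ++pos) {
                size_t mismatches = 0;
                // Compare base by base, bailing out once over budget.
                for (size_t i = 0;
                     i < read.size() && mismatches <= max_mismatch; ++i)
                    mismatches += (genome[pos + i] != read[i]);
                if (mismatches <= max_mismatch)
                    hits.push_back(pos);   // candidate mapping location
            }
            return hits;
        }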

    Issues in the design and implementation of a real-time garbage collection architecture

    This dissertation proposes a new garbage-collected memory module architecture for hard real-time systems. The memory module is designed for compatibility with standard workstation architectures, and cooperates with standard cache consistency protocols. Processes read and write garbage-collected memory in the same manner as standard memory, with identical performance under most conditions. Occasional contention between user processes and the garbage collector results in delays to the user process of at most six memory cycles. Thus the proposed architecture guarantees real-time performance at fine granularity. This dissertation investigates the viability of the proposed architecture in two senses. First, it demonstrates that a fundamental component of the architecture, the object space manager, can be produced at a reasonable cost. Second, this dissertation reports the results of experiments that measure the performance of the proposed architecture under real workloads. Results of these experiments show that the architecture currently performs more slowly than traditional schemes; but this appears to be correctable by employing a more efficient function call mechanism that caches heap-allocated activation frames. Finally, this dissertation reports on some simple extensions to the C++ programming language to support slice objects. Slice objects, which are supported by the garbage collection architecture, are useful for implementing fragmentable arrays, i.e., arrays in which subarrays may be retained while unused elements become garbage and are collected. Experimental evidence demonstrates that slice objects can be used to implement strings more efficiently than at least some popular class libraries.
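
    A minimal sketch of a slice object, with std::shared_ptr standing in for the dissertation's hardware garbage collector. The approximation is conservative: shared_ptr keeps the entire backing array alive while any slice exists, whereas the proposed collector can reclaim the unreferenced fragments of the array:

        #include <cstddef>
        #include <memory>

        // Hypothetical slice type (the thesis defines the real C++
        // extensions): a view of a subrange of a shared backing array.
        template <typename T>
        class Slice {
            std::shared_ptr<T[]> data_;   // shared ownership of the array
            size_t off_ = 0, len_ = 0;
        public:
            Slice(std::shared_ptr<T[]> data, size_t off, size_t len)
                : data_(std::move(data)), off_(off), len_(len) {}
            T& operator[](size_t i) { return data_[off_ + i]; }
            size_t size() const { return len_; }
            // A sub-slice shares the same backing store: no copy is made,
            // and dropping the parent slice retains only this subrange's
            // claim on the array.
            Slice sub(size_t off, size_t len) const {
                return Slice(data_, off_ + off, len);
            }
        };

        // Usage: Slice<char> s(std::shared_ptr<char[]>(new char[64]), 0, 64);
        //        Slice<char> word = s.sub(8, 5);  // retained independently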

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some application paradigms. Software languages and systems such as NVIDIA's CUDA and the Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very-many-core GPUs for data parallelism is necessary. We describe our use of hybrid applications using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications-level programmer, and offer some suggested areas for language development and integration between coarse-grained and fine-grained multi-thread systems. We discuss results from three common simulation algorithmic areas: partial differential equations, graph cluster metric calculations, and random number generation. We report on programming experiences and selected performance for these algorithms on single and multiple GPUs, multi-core CPUs, a CellBE, and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scientific applications developers.
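
    A minimal sketch of the coarse-grained pattern described, with one host thread per accelerator; run_on_device is a hypothetical stand-in for code that binds a CUDA or OpenCL context to device id and processes its share of the problem domain:

        #include <cstdio>
        #include <thread>
        #include <vector>

        // Hypothetical per-device worker: in a real application this would
        // select the device (e.g. via cudaSetDevice or an OpenCL queue),
        // launch data-parallel kernels over [begin, end), and synchronize.
        void run_on_device(int id, size_t begin, size_t end) {
            std::printf("device %d: work items [%zu, %zu)\n", id, begin, end);
        }

        int main() {
            const int num_devices = 2;      // assumed device count
            const size_t n = 1 << 20;       // total work items
            std::vector<std::thread> workers;
            // Coarse-grained parallelism: one CPU thread drives each GPU.
            for (int d = 0; d < num_devices; ++d) {
                size_t begin = n * size_t(d) / num_devices;
                size_t end   = n * size_t(d + 1) / num_devices;
                workers.emplace_back(run_on_device, d, begin, end);
            }
            for (auto& t : workers) t.join();  // host-side barrier
        }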

    Separating instruction fetches from memory accesses: ILAR (Instruction Line Associative Registers)

    Due to the growing mismatch between processor performance and memory latency, many dynamic mechanisms that are “invisible” to the user have been proposed: for example, trace caches and automatic prefetch units. However, these dynamic mechanisms have become inadequate because the implicit memory accesses they depend on have become so expensive. On the other hand, compiler-visible mechanisms like SWAR (SIMD Within A Register) and LARs (Line Associative Registers) are potentially more effective at improving data access performance. This thesis investigates applying the same ideas to improve instruction access. ILARs (Instruction LARs) store instructions in wide registers. Instruction blocks are explicitly loaded into ILARs, using block compression to enhance memory bandwidth. The control flow of the program then refers to instructions directly by their position within an ILAR, rather than by lengthy memory addresses. Because instructions are accessed directly from registers, there is no implicit instruction fetch from memory. This thesis proposes an instruction set architecture for ILARs, investigates a mechanism for loading ILARs using the best available block compression algorithm, and develops hardware descriptions for both ILARs and a conventional memory cache model so that performance comparisons can be made on the instruction fetch stage.
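
    A minimal software model of the idea, with hypothetical sizes and layout (the thesis defines the actual ISA): instruction blocks are loaded explicitly into wide registers, and fetch then indexes a (register, offset) pair with no memory access on the fetch path:

        #include <array>
        #include <cstdint>
        #include <vector>

        constexpr size_t kIlarWidth = 32;   // instructions per ILAR (assumed)
        constexpr size_t kNumIlars  = 8;    // registers in the file (assumed)

        using Instruction = uint32_t;
        using IlarLine = std::array<Instruction, kIlarWidth>;

        struct IlarFile {
            std::array<IlarLine, kNumIlars> lines{};

            // Explicit block load: one wide transfer from memory (block
            // decompression omitted here) replaces the implicit
            // per-instruction fetch.
            void load(size_t reg, const std::vector<Instruction>& block) {
                for (size_t i = 0; i < kIlarWidth && i < block.size(); ++i)
                    lines[reg][i] = block[i];
            }

            // Fetch directly from a register by (register, offset):
            // no memory access occurs on this path.
            Instruction fetch(size_t reg, size_t offset) const {
                return lines[reg][offset];
            }
        };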