3,531 research outputs found

    From FPGA to ASIC: A RISC-V processor experience

    Get PDF
    This work document a correct design flow using these tools in the Lagarto RISC- V Processor and the RTL design considerations that must be taken into account, to move from a design for FPGA to design for ASIC

    Exploring performance and power properties of modern multicore chips via simple machine models

    Full text link
    Modern multicore chips show complex behavior with respect to performance and power. Starting with the Intel Sandy Bridge processor, it has become possible to directly measure the power dissipation of a CPU chip and correlate this data with the performance properties of the running code. Going beyond a simple bottleneck analysis, we employ the recently published Execution-Cache-Memory (ECM) model to describe the single- and multi-core performance of streaming kernels. The model refines the well-known roofline model, since it can predict the scaling and the saturation behavior of bandwidth-limited loop kernels on a multicore chip. The saturation point is especially relevant for considerations of energy consumption. From power dissipation measurements of benchmark programs with vastly different requirements to the hardware, we derive a simple, phenomenological power model for the Sandy Bridge processor. Together with the ECM model, we are able to explain many peculiarities in the performance and power behavior of multicore processors, and derive guidelines for energy-efficient execution of parallel programs. Finally, we show that the ECM and power models can be successfully used to describe the scaling and power behavior of a lattice-Boltzmann flow solver code.Comment: 23 pages, 10 figures. Typos corrected, DOI adde

    Optimization of Discrete-parameter Multiprocessor Systems using a Novel Ergodic Interpolation Technique

    Full text link
    Modern multi-core systems have a large number of design parameters, most of which are discrete-valued, and this number is likely to keep increasing as chip complexity rises. Further, the accurate evaluation of a potential design choice is computationally expensive because it requires detailed cycle-accurate system simulation. If the discrete parameter space can be embedded into a larger continuous parameter space, then continuous space techniques can, in principle, be applied to the system optimization problem. Such continuous space techniques often scale well with the number of parameters. We propose a novel technique for embedding the discrete parameter space into an extended continuous space so that continuous space techniques can be applied to the embedded problem using cycle accurate simulation for evaluating the objective function. This embedding is implemented using simulation-based ergodic interpolation, which, unlike spatial interpolation, produces the interpolated value within a single simulation run irrespective of the number of parameters. We have implemented this interpolation scheme in a cycle-based system simulator. In a characterization study, we observe that the interpolated performance curves are continuous, piece-wise smooth, and have low statistical error. We use the ergodic interpolation-based approach to solve a large multi-core design optimization problem with 31 design parameters. Our results indicate that continuous space optimization using ergodic interpolation-based embedding can be a viable approach for large multi-core design optimization problems.Comment: A short version of this paper will be published in the proceedings of IEEE MASCOTS 2015 conferenc

    Coarse-grained reconfigurable array architectures

    Get PDF
    Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code

    Instruction prefetching techniques for ultra low-power multicore architectures

    Get PDF
    As the gap between processor and memory speeds increases, memory latencies have become a critical bottleneck for computing performance. To reduce this bottleneck, designers have been working on techniques to hide these latencies. On the other hand, design of embedded processors typically targets low cost and low power consumption. Therefore, techniques which can satisfy these constraints are more desirable for embedded domains. While out-of-order execution, aggressive speculation, and complex branch prediction algorithms can help hide the memory access latency in high-performance systems, yet they can cost a heavy power budget and are not suitable for embedded systems. Prefetching is another popular method for hiding the memory access latency, and has been studied very well for high-performance processors. Similarly, for embedded processors with strict power requirements, the application of complex prefetching techniques is greatly limited, and therefore, a low power/energy solution is mostly desired in this context. In this work, we focus on instruction prefetching for ultra-low power processing architectures and aim to reduce energy overhead of this operation by proposing a combination of simple, low-cost, and energy efficient prefetching techniques. We study a wide range of applications from cryptography to computer vision and show that our proposed mechanisms can effectively improve the hit-rate of almost all of them to above 95%, achieving an average performance improvement of more than 2X. Plus, by synthesizing our designs using the state-of-the-art technologies we show that the prefetchers increase system’s power consumption less than 15% and total silicon area by less than 1%. Altogether, a total energy reduction of 1.9X is achieved, thanks to the proposed schemes, enabling a significantly higher battery life

    On Designing Multicore-aware Simulators for Biological Systems

    Full text link
    The stochastic simulation of biological systems is an increasingly popular technique in bioinformatics. It often is an enlightening technique, which may however result in being computational expensive. We discuss the main opportunities to speed it up on multi-core platforms, which pose new challenges for parallelisation techniques. These opportunities are developed in two general families of solutions involving both the single simulation and a bulk of independent simulations (either replicas of derived from parameter sweep). Proposed solutions are tested on the parallelisation of the CWC simulator (Calculus of Wrapped Compartments) that is carried out according to proposed solutions by way of the FastFlow programming framework making possible fast development and efficient execution on multi-cores.Comment: 19 pages + cover pag

    Garbage collection auto-tuning for Java MapReduce on Multi-Cores

    Get PDF
    MapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational scenarios. In this paper we present MRJ, a MapReduce Java framework for multi-core architectures. We evaluate its scalability on a four-core, hyperthreaded Intel Core i7 processor, using a set of standard MapReduce benchmarks. We investigate the significant impact that Java runtime garbage collection has on the performance and scalability of MRJ. We propose the use of memory management auto-tuning techniques based on machine learning. With our auto-tuning approach, we are able to achieve MRJ performance within 10% of optimal on 75% of our benchmark tests

    Lock-free Concurrent Data Structures

    Full text link
    Concurrent data structures are the data sharing side of parallel programming. Data structures give the means to the program to store data, but also provide operations to the program to access and manipulate these data. These operations are implemented through algorithms that have to be efficient. In the sequential setting, data structures are crucially important for the performance of the respective computation. In the parallel programming setting, their importance becomes more crucial because of the increased use of data and resource sharing for utilizing parallelism. The first and main goal of this chapter is to provide a sufficient background and intuition to help the interested reader to navigate in the complex research area of lock-free data structures. The second goal is to offer the programmer familiarity to the subject that will allow her to use truly concurrent methods.Comment: To appear in "Programming Multi-core and Many-core Computing Systems", eds. S. Pllana and F. Xhafa, Wiley Series on Parallel and Distributed Computin
    corecore