1,435 research outputs found

    Adaptive runtime-assisted block prefetching on chip-multiprocessors

    Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well-studied technique for alleviating this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Software-controlled prefetching covers a wide variety of schemes, including runtime-directed prefetching and, more specifically, runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software-driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low-cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speedup of 32% and an average speedup of 10% over a baseline with hardware prefetching only. As a result, we also reduce energy-to-solution by up to 18%, and by 3% on average. Peer reviewed. Postprint (author's final draft).

    The synergy of multithreading and access/execute decoupling

    This work presents and evaluates a novel processor microarchitecture that combines two paradigms: access/execute decoupling and simultaneous multithreading. We investigate how the two techniques complement each other: decoupling hides memory latency very efficiently, while multithreading supplies the in-order issue stage with enough ILP to hide the functional unit latencies. The partitioned layout, together with the in-order issue policy, makes the design potentially less complex, in terms of critical path delays, than a centralized out-of-order design, and so better able to support future growth in issue width and clock speed. The simulations show that adding decoupling to a multithreaded architecture sharply increases its miss latency tolerance and, in addition, reduces the number of threads it needs to achieve maximum throughput, especially for a large miss latency. Fewer threads mean lower hardware complexity and lower demands on the memory system, which becomes a critical resource for large miss latencies, since bandwidth may become a bottleneck. Peer reviewed. Postprint (published version).

    Performance and enhancement for HD videoconference environment

    This work is part of the V3 (Video, Videoconference, and Visualization) research project of the i2CAT Foundation, whose final goal is to design and develop a platform for video, videoconferencing, and resolution-independent visualization at high and super-high definition over new-generation IP networks. The i2CAT Foundation uses free software to achieve its goals: UltraGrid is used to transmit HD video, and SAGE is used for distributed visualization across multiple monitors. The equipment that handles the high-definition stream (capture, sending, visualization, etc.) has to be optimized so that all available resources can be used, in order to improve the quality and stability of the platform. The data flows involved exceed 1 Gbps in raw formats, so optimizing the use of the available system resources becomes a necessity. This project evaluates the requirements of uncompressed high-definition streams and studies the current platform in order to extract the functional requirements that an optimal system must meet to work in the best conditions. Based on this information, a series of system tests is carried out to improve performance, from the network level up to the application level. Different distributions of the Linux operating system, Debian 4 and openSUSE 10.3, were tested to evaluate their performance. Building a system from source, with the help of the Linux From Scratch project, was also tested as a way to optimize the code at compile time. Real-time (RT) systems were also tried with the distributions used; they offer a more stable stream frame rate. Once the operating systems had been tested, different compilers were tested to evaluate their efficiency. GCC and the Intel C++ Compiler were tested, the latter giving more satisfactory results. Finally, a Live CD was built to bundle all the possible improvements into an easily distributed system.

    Mechanistic modeling of architectural vulnerability factor

    Reliability to soft errors is a significant design challenge in modern microprocessors owing to an exponential increase in the number of transistors on chip and the reduction in operating voltages with each process generation. Architectural Vulnerability Factor (AVF) modeling using microarchitectural simulators enables architects to make informed performance, power, and reliability tradeoffs. However, such simulators are time-consuming and do not reveal the microarchitectural mechanisms that influence AVF. In this article, we present an accurate first-order mechanistic analytical model to compute AVF, developed using the first principles of an out-of-order superscalar execution. This model provides insight into the fundamental interactions between the workload and microarchitecture that together influence AVF. We use the model to perform design space exploration, parametric sweeps, and workload characterization for AVF

    Hardware Barrier Synchronization: Static Barrier MIMD (SBM)

    In this paper, we give the design and performance analysis of a new, highly efficient synchronization mechanism called “Static Barrier MIMD” or “SBM.” Unlike traditional barrier synchronization, the proposed barriers are designed to facilitate the use of static (compile-time) code scheduling for eliminating some synchronizations. For this reason, our barrier hardware is more general than most hardware barrier mechanisms, allowing any subset of the processors to participate in each barrier. Since code scheduling typically operates on fine-grain parallelism, it is also vital that barriers be able to execute in a small number of clock ticks. The SBM is actually only one of two new classes of barrier machines proposed to facilitate static code scheduling; the other architecture is the “Dynamic Barrier MIMD,” or “DBM,” which is described in a companion paper. The DBM differs from the SBM in that the DBM employs more complex hardware to make the system less dependent on the precision of the static analysis and code scheduling; for example, an SBM cannot efficiently manage simultaneous execution of independent parallel programs, whereas a DBM can.
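
The idea of a barrier in which only a chosen subset of processors participates can be illustrated with a small software model. The sketch below is not the SBM hardware design; the class name `MaskedBarrier`, the bitmask encoding, and the `arrive` method are all hypothetical and chosen only for illustration. It assumes one bit per processor and fires once every processor named in the mask has arrived.

```python
class MaskedBarrier:
    """Toy software model of a barrier joined by a subset of processors.

    Hypothetical illustration only; this is not the SBM hardware.
    """

    def __init__(self, mask):
        self.mask = mask    # bit i set => processor i participates
        self.arrived = 0    # bit i set => processor i is currently waiting

    def arrive(self, pid):
        """Record that processor `pid` reached the barrier.

        Returns True when this arrival completes the barrier, else False.
        """
        if not (self.mask >> pid) & 1:
            raise ValueError(f"processor {pid} is not part of this barrier")
        self.arrived |= 1 << pid
        if (self.arrived & self.mask) == self.mask:
            self.arrived = 0  # reset so the barrier can be reused
            return True
        return False


if __name__ == "__main__":
    barrier = MaskedBarrier(0b1011)  # processors 0, 1 and 3 participate
    for pid in (0, 1, 3):
        print(pid, barrier.arrive(pid))
```

In this model, processor 2 never has to reach the barrier, which is the property that lets a scheduler synchronize only the processors that actually share a dependence.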

    REBOUND: An open-source multi-purpose N-body code for collisional dynamics

    REBOUND is a new multi-purpose N-body code which is freely available under an open-source license. It was designed for collisional dynamics such as planetary rings but can also solve the classical N-body problem. It is highly modular and can be customized easily to work on a wide variety of different problems in astrophysics and beyond. REBOUND comes with three symplectic integrators: leap-frog, the symplectic epicycle integrator (SEI) and a Wisdom-Holman mapping (WH). It supports open, periodic and shearing-sheet boundary conditions. REBOUND can use a Barnes-Hut tree to calculate both self-gravity and collisions. These modules are fully parallelized with MPI as well as OpenMP. The former makes use of a static domain decomposition and a distributed essential tree. Two new collision detection modules based on a plane-sweep algorithm are also implemented. The performance of the plane-sweep algorithm is superior to a tree code for simulations in which one dimension is much longer than the other two and in simulations which are quasi-two-dimensional with fewer than one million particles. In this work, we discuss the different algorithms implemented in REBOUND, the philosophy behind the code's structure, and implementation-specific details of the different modules. We present results of accuracy and scaling tests which show that the code can run efficiently on both desktop machines and large computing clusters. Comment: 10 pages, 9 figures, accepted by A&A, source code available at https://github.com/hannorein/reboun
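
For readers unfamiliar with leap-frog integration, the following is a minimal sketch of one common variant (kick-drift-kick) with direct-summation gravity. It is not REBOUND's code and does not use REBOUND's API; the function names `acceleration`, `leapfrog_step`, and `total_energy`, the choice of units with G = 1, and the O(N²) force loop are all assumptions made for illustration.

```python
import math


def acceleration(pos, masses, G=1.0):
    """Direct-summation gravitational acceleration on every particle (O(N^2))."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            d = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(c * c for c in d)
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            for k in range(3):
                acc[i][k] += G * masses[j] * d[k] * inv_r3
    return acc


def leapfrog_step(pos, vel, masses, dt):
    """Advance positions and velocities in place by one kick-drift-kick step."""
    acc = acceleration(pos, masses)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]   # half kick
            pos[i][k] += dt * vel[i][k]         # full drift
    acc = acceleration(pos, masses)
    for i in range(len(pos)):
        for k in range(3):
            vel[i][k] += 0.5 * dt * acc[i][k]   # second half kick


def total_energy(pos, vel, masses, G=1.0):
    """Kinetic plus pairwise potential energy of the system."""
    kinetic = sum(0.5 * m * sum(c * c for c in v) for m, v in zip(masses, vel))
    potential = 0.0
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            r = math.sqrt(sum((pos[j][k] - pos[i][k]) ** 2 for k in range(3)))
            potential -= G * masses[i] * masses[j] / r
    return kinetic + potential


if __name__ == "__main__":
    # A light body on a near-circular orbit around a heavy one.
    masses = [1.0, 1e-3]
    pos = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
    vel = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    e0 = total_energy(pos, vel, masses)
    for _ in range(1000):
        leapfrog_step(pos, vel, masses, 0.01)
    e1 = total_energy(pos, vel, masses)
    print("relative energy drift:", abs((e1 - e0) / e0))
```

Because the scheme is symplectic, the energy error stays bounded over long runs instead of growing steadily, which is the usual reason such integrators are preferred for orbital dynamics.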

    Optimisation of multicore processor and GPU for use in embedded systems

    The advancement in technology continues to consume an increasing part of our lives and as we watch the slowing of Moore’s Law as Integrated Circuits approach physical limitations, we will continue to search for faster execution of programs. The advancement in robotics and machine vision will see them become part of our daily lives and the need for real time machine vision algorithms will increase. This dissertation will investigate optimisation options when executing machine vision algorithms on a multi-core processor and provide a guide for programmers to use when writing similar machine vision algorithms on Arm A7 or A15 processors containing a Mali T628 Graphics processing unit

    Multicore-optimized wavefront diamond blocking for optimizing stencil updates

    The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high bytes per lattice update case of variable coefficients. Our thread groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor
