1,435 research outputs found
Adaptive runtime-assisted block prefetching on chip-multiprocessors
Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well-studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Software-controlled prefetching covers a wide variety of schemes, including runtime-directed prefetching and, more specifically, runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software-driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low-cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speedup of 32 % and an average speedup of 10 % compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction in energy-to-solution of up to 18 %, and of 3 % on average. Peer Reviewed. Postprint (author's final draft)
The synergy of multithreading and access/execute decoupling
This work presents and evaluates a novel processor microarchitecture which combines two paradigms: access/execute decoupling and simultaneous multithreading. We investigate how the two techniques complement each other: while decoupling hides memory latency very efficiently, multithreading supplies the in-order issue stage with enough ILP to hide the functional unit latencies. Its partitioned layout, together with its in-order issue policy, makes it potentially less complex, in terms of critical path delays, than a centralized out-of-order design, and therefore better able to support future growth in issue width and clock speed. The simulations show that adding decoupling to a multithreaded architecture sharply increases its miss latency tolerance and, in addition, reduces the number of threads needed to achieve maximum throughput, especially for a large miss latency. Fewer threads result in a hardware complexity reduction and lower demands on the memory system, which becomes a critical resource for large miss latencies, since bandwidth may become a bottleneck. Peer Reviewed. Postprint (published version)
Performance and enhancement for HD videoconference environment
This work is part of the V3 (Video, Videoconference, and Visualization) research project of the i2CAT Foundation, whose final goal is to design and develop a resolution-independent platform for video, videoconferencing and visualization at high and super-high resolution over new-generation IP networks. The i2CAT Foundation uses free software to achieve its goals: UltraGrid for the transmission of HD video and SAGE for distributed visualization across multiple monitors. The equipment that handles the high-definition streams (capture, sending, visualization, etc.) has to be optimized so that all available resources can be used, in order to improve the quality and stability of the platform. Because the data flows exceed 1 Gbps in raw formats, optimizing the use of the system's available resources becomes a necessity. This project evaluates the requirements of uncompressed high-definition streams and studies the current platform in order to extract the functional requirements that an optimal system must meet to work under the best conditions. Based on this information, a series of system tests is carried out in order to improve performance, from the network level up to the application level. Two distributions of the Linux operating system, Debian 4 and openSUSE 10.3, were tested in order to evaluate their performance. Building a system from source, with the help of the Linux From Scratch project, was also tested in order to optimize the code at compile time. Real-time (RT) variants of the distributions were tried as well; they offer a more stable stream frame rate. Once the operating systems had been tested, different compilers were tested in order to evaluate their efficiency: GCC and the Intel C++ Compiler, the latter with more satisfactory results. Finally, a Live CD was built in order to package all the possible improvements in an easily distributed system
Mechanistic modeling of architectural vulnerability factor
Reliability to soft errors is a significant design challenge in modern microprocessors owing to an exponential increase in the number of transistors on chip and the reduction in operating voltages with each process generation. Architectural Vulnerability Factor (AVF) modeling using microarchitectural simulators enables architects to make informed performance, power, and reliability tradeoffs. However, such simulators are time-consuming and do not reveal the microarchitectural mechanisms that influence AVF. In this article, we present an accurate first-order mechanistic analytical model to compute AVF, developed using the first principles of an out-of-order superscalar execution. This model provides insight into the fundamental interactions between the workload and microarchitecture that together influence AVF. We use the model to perform design space exploration, parametric sweeps, and workload characterization for AVF
Hardware Barrier Synchronization: Static Barrier MIMD (SBM)
In this paper, we present the design and performance analysis of a new, highly efficient synchronization mechanism called "Static Barrier MIMD" or "SBM." Unlike traditional barrier synchronization, the proposed barriers are designed to facilitate the use of static (compile-time) code scheduling for eliminating some synchronizations. For this reason, our barrier hardware is more general than most hardware barrier mechanisms, allowing any subset of the processors to participate in each barrier. Since code scheduling typically operates on fine-grain parallelism, it is also vital that barriers be able to execute in a small number of clock ticks. The SBM is actually only one of two new classes of barrier machines proposed to facilitate static code scheduling; the other architecture is the "Dynamic Barrier MIMD," or "DBM," which is described in a companion paper¹. The DBM differs from the SBM in that the DBM employs more complex hardware to make the system less dependent on the precision of the static analysis and code scheduling; for example, an SBM cannot efficiently manage simultaneous execution of independent parallel programs, whereas a DBM can
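The idea that any subset of processors may take part in a barrier can be illustrated with a small software model. The sketch below is hypothetical and is not the SBM hardware design: it assumes that each barrier is described by a bitmask of participating processors, that the masks are consumed in an order fixed at compile time, and that a barrier releases once every processor in its mask has arrived. The names `barrier_ready` and `release_order` are invented for this illustration.

```python
def barrier_ready(mask: int, arrived: int) -> bool:
    """Return True when every processor named in `mask` has arrived."""
    return (arrived & mask) == mask


def release_order(masks, arrivals):
    """Replay processor arrivals against a statically ordered list of masks.

    masks:    barrier masks in their (assumed) compile-time order.
    arrivals: processor ids in the order they reach their next barrier.
    Returns, for each released barrier, the index of the arrival that
    allowed it to release.
    """
    released = []
    arrived = 0   # one arrival bit per processor
    current = 0   # index of the barrier at the head of the static order
    for step, pid in enumerate(arrivals):
        arrived |= 1 << pid
        # Only the head barrier may fire; later ones wait their turn.
        while current < len(masks) and barrier_ready(masks[current], arrived):
            arrived &= ~masks[current]  # participants resume execution
            released.append(step)
            current += 1
    return released
```

In this model, `release_order([0b0011, 0b1100], [2, 0, 3, 1])` shows the second barrier waiting behind the first even though its participants (processors 2 and 3) arrived earlier, which is one way to picture why a strictly static ordering is a poor fit for independent programs running side by side.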
REBOUND: An open-source multi-purpose N-body code for collisional dynamics
REBOUND is a new multi-purpose N-body code which is freely available under an
open-source license. It was designed for collisional dynamics such as planetary
rings but can also solve the classical N-body problem. It is highly modular and
can be customized easily to work on a wide variety of different problems in
astrophysics and beyond.
REBOUND comes with three symplectic integrators: leap-frog, the symplectic
epicycle integrator (SEI) and a Wisdom-Holman mapping (WH). It supports open,
periodic and shearing-sheet boundary conditions. REBOUND can use a Barnes-Hut
tree to calculate both self-gravity and collisions. These modules are fully
parallelized with MPI as well as OpenMP. The former makes use of a static
domain decomposition and a distributed essential tree. Two new collision
detection modules based on a plane-sweep algorithm are also implemented. The
performance of the plane-sweep algorithm is superior to a tree code for
simulations in which one dimension is much longer than the other two and in
simulations which are quasi-two dimensional with less than one million
particles.
In this work, we discuss the different algorithms implemented in REBOUND, the
philosophy behind the code's structure as well as implementation specific
details of the different modules. We present results of accuracy and scaling
tests which show that the code can run efficiently on both desktop machines and
large computing clusters. Comment: 10 pages, 9 figures, accepted by A&A, source code available at
https://github.com/hannorein/reboun
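To make the notion of a symplectic leap-frog integrator concrete, here is a minimal sketch for a one-dimensional harmonic oscillator. It is not taken from the REBOUND source: the drift-kick-drift ordering, the function names and the test problem are all assumptions chosen for illustration.

```python
def leapfrog_step(x, v, accel, dt):
    """Advance position x and velocity v by one drift-kick-drift step."""
    x_half = x + 0.5 * dt * v          # drift: half step in position
    v_new = v + dt * accel(x_half)     # kick: full step in velocity
    x_new = x_half + 0.5 * dt * v_new  # drift: second half step in position
    return x_new, v_new


def oscillator_energy(x, v):
    """Total energy of a unit-mass, unit-frequency harmonic oscillator."""
    return 0.5 * v * v + 0.5 * x * x


def integrate(x, v, dt, n_steps):
    """Integrate the oscillator (acceleration = -x) for n_steps steps."""
    for _ in range(n_steps):
        x, v = leapfrog_step(x, v, lambda q: -q, dt)
    return x, v
```

Calling `integrate(1.0, 0.0, 0.01, 10000)` and comparing `oscillator_energy` before and after shows the property that motivates symplectic schemes: the energy error stays bounded over many steps rather than drifting.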
Optimisation of multicore processor and GPU for use in embedded systems
The advancement in technology continues to consume an increasing part of our lives and as we watch the slowing of Moore's Law as integrated circuits approach physical limitations, we will continue to search for faster execution of programs.
The advancement in robotics and machine vision will see them become part of our daily lives and the need for real time machine vision algorithms will increase.
This dissertation will investigate optimisation options when executing machine vision algorithms on a multi-core processor and provide a guide for programmers to use when writing similar machine vision algorithms on Arm A7 or A15 processors containing a Mali T628 Graphics processing unit
Multicore-optimized wavefront diamond blocking for optimizing stencil updates
The importance of stencil-based algorithms in computational science has
focused attention on optimized parallel implementations for multilevel
cache-based processors. Temporal blocking schemes leverage the large bandwidth
and low latency of caches to accelerate stencil updates and approach
theoretical peak performance. A key ingredient is the reduction of data traffic
across slow data paths, especially the main memory interface. In this work we
combine the ideas of multi-core wavefront temporal blocking and diamond tiling
to arrive at stencil update schemes that show large reductions in memory
pressure compared to existing approaches. The resulting schemes show
performance advantages in bandwidth-starved situations, which are exacerbated
by the high bytes per lattice update case of variable coefficients. Our thread
groups concept provides a controllable trade-off between concurrency and memory
usage, shifting the pressure between the memory interface and the CPU. We
present performance results on a contemporary Intel processor
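For readers unfamiliar with the term, a stencil update recomputes each lattice point from its neighbours. The sketch below is a plain, unblocked sweep of a three-point variable-coefficient stencil in one dimension; it is not the wavefront diamond scheme, and the function name and coefficient layout are assumptions made for illustration. It shows why the variable-coefficient case moves more data per lattice update: each update reads three coefficient values in addition to the grid values.

```python
def stencil_sweep(u, c_left, c_center, c_right):
    """Apply one sweep of a 3-point variable-coefficient stencil.

    The first and last entries are treated as fixed boundary values.
    Each interior update reads three grid values and three coefficients.
    """
    n = len(u)
    new = list(u)  # boundary values are carried over unchanged
    for i in range(1, n - 1):
        new[i] = (c_left[i] * u[i - 1]
                  + c_center[i] * u[i]
                  + c_right[i] * u[i + 1])
    return new
```

A temporal blocking scheme would apply several such sweeps to a tile while it is still resident in cache, instead of streaming the whole grid from main memory once per sweep.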