583 research outputs found
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The tasks graph of the factorization step is made available to the two
runtimes, providing them the opportunity to process and optimize its traversal
in order to maximize the algorithm efficiency for the targeted hardware
platform. A comparative study of the performance of the PaStiX solver on top of
its native internal scheduler, PaRSEC, and StarPU frameworks, on different
execution environments, is performed. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.Comment: Heterogeneity in Computing Workshop (2014
Correct and efficient accelerator programming
This report documents the program and the outcomes of Dagstuhl Seminar 13142 “Correct and Efficient Accelerator Programming”. The aim of this Dagstuhl seminar was to bring together researchers from various sub-disciplines of computer science to brainstorm and discuss the theoretical foundations, design and implementation of techniques and tools for correct and efficient accelerator programming
MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory
Shared L1 memory clusters are a common architectural pattern (e.g., in
GPGPUs) for building efficient and flexible multi-processing-element (PE)
engines. However, it is a common belief that these tightly-coupled clusters
would not scale beyond a few tens of PEs. In this work, we tackle scaling
shared L1 clusters to hundreds of PEs while supporting a flexible and
productive programming model and maintaining high efficiency. We present
MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring
application-tunable functional units. We designed and implemented an efficient
low-latency PE to L1-memory interconnect, an optimized instruction path to
ensure each PE's independent execution, and a powerful DMA engine and system
interconnect to stream data in and out. MemPool is easy to program, with all
the cores sharing a global view of a large, multi-banked, L1 scratchpad memory,
accessible within at most five cycles in the absence of conflicts. We provide
multiple runtimes to program MemPool at different abstraction levels and
illustrate its versatility with a wide set of applications. MemPool runs at 600
MHz (60 gate delays) in typical conditions (TT/0.80V/25{\deg}C) in 22 nm FDX
technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less
than 2% of execution stalls.Comment: 14 pages, 17 figures, 2 table
Enforcing Predictability of Many-cores with DCFNoC
© 2021 IEEE. Personal use of this material is permitted. Permissíon from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertisíng or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.[EN] The ever need for higher performance forces industry to include technology based on multi-processors system on chip (MPSoCs) in their safety-critical embedded systems. MPSoCs include a network-on-chip (NoC) to interconnect the cores between them and with memory and the rest of shared resources. Unfortunately, the inclusion of NoCs compromises guaranteeing time predictability as network-level conflicts may occur. To overcome this problem, in this paper we propose DCFNoC, a new time-predictable NoC design paradigm where conflicts within the network are eliminated by design. This new paradigm builds on top of the Channel Dependency Graph (CDG) in order to deterministically avoid network conflicts. The network guarantees predictability to applications and is able to naturally inject messages using a TDM period equal to the optimal theoretical bound without the need of using a computationally demanding offline process. DCFNoC is integrated in a tile-based many-core system and adapted to its memory hierarchy. Our results show that DCFNoC guarantees time predictability avoiding network interference among multiple running applications. DCFNoC always guarantees performance and also improves wormhole performance in a 4 × 4 setting by a factor of 3.7× when interference traffic is injected. For a 8 × 8 network differences are even larger. In addition, DCFNoC obtains a total area saving of 10.79% over a standard wormhole implementation.This work has been supported by MINECO under Grant BES-2016-076885, by MINECO and funds from the European ERDF under Grant TIN2015-66972-C05-1-R and Grant RTI2018-098156-B-C51, and by the EC H2020 RECIPE project under Grant 801137.Picornell-Sanjuan, T.; Flich Cardo, J.; Hernández Luz, C.; Duato Marín, JF. (2021). Enforcing Predictability of Many-cores with DCFNoC. IEEE Transactions on Computers. 70(2):270-283. https://doi.org/10.1109/TC.2020.2987797S27028370
A RECONFIGURABLE AND EXTENSIBLE EXPLORATION PLATFORM FOR FUTURE HETEROGENEOUS SYSTEMS
Accelerator-based -or heterogeneous- computing has become increasingly
important in a variety of scenarios, ranging from High-Performance Computing (HPC) to embedded systems. While most solutions use sometimes
custom-made components, most of today’s systems rely on commodity highend CPUs and/or GPU devices, which deliver adequate performance while
ensuring programmability, productivity, and application portability. Unfortunately, pure general-purpose hardware is affected by inherently limited
power-efficiency, that is, low GFLOPS-per-Watt, now considered as a primary metric. The many-core model and architectural customization can
play here a key role, as they enable unprecedented levels of power-efficiency
compared to CPUs/GPUs. However, such paradigms are still immature and
deeper exploration is indispensable.
This dissertation investigates customizability and proposes novel solutions
for heterogeneous architectures, focusing on mechanisms related to coherence and network-on-chip (NoC). First, the work presents a non-coherent
scratchpad memory with a configurable bank remapping system to reduce
bank conflicts. The experimental results show the benefits of both using a
customizable hardware bank remapping function and non-coherent memories for some types of algorithms. Next, we demonstrate how a distributed
synchronization master better suits many-cores than standard centralized
solutions. This solution, inspired by the directory-based coherence mechanism, supports concurrent synchronizations without relying on memory
transactions. The results collected for different NoC sizes provided indications about the area overheads incurred by our solution and demonstrated
the benefits of using a dedicated hardware synchronization support. Finally, this dissertation proposes an advanced coherence subsystem, based
on the sparse directory approach, with a selective coherence maintenance
system which allows coherence to be deactivated for blocks that do not require it. Experimental results show that the use of a hybrid coherent and
non-coherent architectural mechanism along with an extended coherence
protocol can enhance performance.
The above results were all collected by means of a modular and customizable heterogeneous many-core system developed to support the exploration
of power-efficient high-performance computing architectures. The system is
based on a NoC and a customizable GPU-like accelerator core, as well as
a reconfigurable coherence subsystem, ensuring application-specific configuration capabilities. All the explored solutions were evaluated on this real heterogeneous system, which comes along with the above methodological
results as part of the contribution in this dissertation. In fact, as a key
benefit, the experimental platform enables users to integrate novel hardware/software solutions on a full-system scale, whereas existing platforms
do not always support a comprehensive heterogeneous architecture exploration
FIFTY YEARS OF MICROPROCESSOR EVOLUTION: FROM SINGLE CPU TO MULTICORE AND MANYCORE SYSTEMS
Nowadays microprocessors are among the most complex electronic systems that man has ever designed. One small silicon chip can contain the complete processor, large memory and logic needed to connect it to the input-output devices. The performance of today's processors implemented on a single chip surpasses the performance of a room-sized supercomputer from just 50 years ago, which cost over $ 10 million [1]. Even the embedded processors found in everyday devices such as mobile phones are far more powerful than computer developers once imagined. The main components of a modern microprocessor are a number of general-purpose cores, a graphics processing unit, a shared cache, memory and input-output interface and a network on a chip to interconnect all these components [2]. The speed of the microprocessor is determined by its clock frequency and cannot exceed a certain limit. Namely, as the frequency increases, the power dissipation increases too, and consequently the amount of heating becomes critical. So, silicon manufacturers decided to design new processor architecture, called multicore processors [3]. With aim to increase performance and efficiency these multiple cores execute multiple instructions simultaneously. In this way, the amount of parallel computing or parallelism is increased [4]. In spite of mentioned advantages, numerous challenges must be addressed carefully when more cores and parallelism are used.This paper presents a review of microprocessor microarchitectures, discussing their generations over the past 50 years. Then, it describes the currently used implementations of the microarchitecture of modern microprocessors, pointing out the specifics of parallel computing in heterogeneous microprocessor systems. To use efficiently the possibility of multi-core technology, software applications must be multithreaded. The program execution must be distributed among the multi-core processors so they can operate simultaneously. To use multi-threading, it is imperative for programmer to understand the basic principles of parallel computing and parallel hardware. Finally, the paper provides details how to implement hardware parallelism in multicore systems
High-Performance and Time-Predictable Embedded Computing
Nowadays, the prevalence of computing systems in our lives is so ubiquitous that we live in a cyber-physical world dominated by computer systems, from pacemakers to cars and airplanes. These systems demand for more computational performance to process large amounts of data from multiple data sources with guaranteed processing times. Actuating outside of the required timing bounds may cause the failure of the system, being vital for systems like planes, cars, business monitoring, e-trading, etc.
High-Performance and Time-Predictable Embedded Computing presents recent advances in software architecture and tools to support such complex systems, enabling the design of embedded computing devices which are able to deliver high-performance whilst guaranteeing the application required timing bounds.
Technical topics discussed in the book include: Parallel embedded platforms Programming models Mapping and scheduling of parallel computations Timing and schedulability analysis Runtimes and operating systems
The work reflected in this book was done in the scope of the European project P SOCRATES, funded under the FP7 framework program of the European Commission. High-performance and time-predictable embedded computing is ideal for personnel in computer/communication/embedded industries as well as academic staff and master/research students in computer science, embedded systems, cyber-physical systems and internet-of-things.info:eu-repo/semantics/publishedVersio
- …