3,765 research outputs found
LEGaTO: first steps towards energy-efficient toolset for heterogeneous computing
LEGaTO is a three-year EU H2020 project which started in December 2017. The LEGaTO project will leverage task-based programming models to provide a software ecosystem for Made-in-Europe heterogeneous hardware composed of CPUs, GPUs, FPGAs and dataflow engines. The aim is to attain one order of magnitude energy savings from the edge to the converged cloud/HPC.Peer ReviewedPostprint (author's final draft
Optimizing Collective Communication for Scalable Scientific Computing and Deep Learning
In the realm of distributed computing, collective operations involve coordinated communication and synchronization among multiple processing units, enabling efficient data exchange and collaboration. Scientific applications, such as simulations, computational fluid dynamics, and scalable deep learning, require complex computations that can be parallelized across multiple nodes in a distributed system. These applications often involve data-dependent communication patterns, where collective operations are critical for achieving high performance in data exchange. Optimizing collective operations for scientific applications and deep learning involves improving the algorithms, communication patterns, and data distribution strategies to minimize communication overhead and maximize computational efficiency.
Within the context of this dissertation, the specific focus is on optimizing the alltoall operation in 3D Fast Fourier Transform (FFT) applications and the allreduce operation in parallel deep learning, particularly on High-Performance Computing (HPC) systems. Advanced communication algorithms and methods are explored and implemented to improve communication efficiency, consequently enhancing the overall performance of 3D FFT applications.
Furthermore, this dissertation investigates the identification of performance bottlenecks during collective communication over Horovod on distributed systems. These bottlenecks are addressed by proposing an optimized parallel communication pattern specifically tailored to alleviate the aforementioned limitations during the training phase in distributed deep learning. The objective is to achieve faster convergence and improve the overall training efficiency.
Moreover, this dissertation proposes fault tolerance and elastic scaling features for distributed deep learning by leveraging the User-Level Failure Mitigation (ULFM) from Message Passing Interface (MPI). By incorporating ULFM MPI, the dissertation aims to enhance the elastic capabilities of distributed deep learning systems. This approach enables graceful and lightweight handling of failures while facilitating seamless scaling in dynamic computing environments
Elasticity and Petri nets
Digital electronic systems typically use synchronous clocks and primarily assume fixed duration of their operations to simplify the design process. Time elastic systems can be constructed either by replacing the clock with communication handshakes (asynchronous version) or by augmenting the clock with a synchronous version of a handshake (synchronous version). Time elastic systems can tolerate static and dynamic changes in delays (asynchronous case) or latencies (synchronous case) of operations that can be used for modularity, ease of reuse and better power-delay trade-off. This paper describes methods for the modeling, performance analysis and optimization of elastic systems using Marked Graphs and their extensions capable of describing behavior with early evaluation. The paper uses synchronous elastic systems (aka latency-tolerant systems) for illustrating the use of Petri nets, however, most of the methods can be applied without changes (except changing the delay model associated with events of the system) to asynchronous elastic systems.Peer ReviewedPostprint (author's final draft
Elastic bundles :modelling and architecting asynchronous circuits with granular rigidity
PhD ThesisIntegrated Circuit (IC) designs these days are predominantly System-on-Chips (SoCs).
The complexity of designing a SoC has increased rapidly over the years due to growing
process and environmental variations coupled with global clock distribution di culty.
Moreover, traditional synchronous design is not apt to handle the heterogeneous timing
nature of modern SoCs. As a countermeasure, the semiconductor industry witnessed
a strong revival of asynchronous design principles. A new paradigm of digital circuits
emerged, as a result, namely mixed synchronous-asynchronous circuits. With a wave
of recent innovations in synchronous-asynchronous CAD integration, this paradigm is
showing signs of commercial adoption in future SoCs mainly due to the scope for reuse
of synchronous functional blocks and IP cores, and the co-existence of synchronous and
asynchronous design styles in a common EDA framework.
However, there is a lack of formal methods and tools to facilitate mixed synchronousasynchronous
design. In this thesis, we propose a formal model based on Petri nets with
step semantics to describe these circuits behaviourally. Implication of this model in the
veri cation and synthesis of mixed synchronous-asynchronous circuits is studied. Till
date, this paradigm has been mainly explored on the basis of Globally Asynchronous
Locally Synchronous (GALS) systems. Despite decades of research, GALS design has
failed to gain traction commercially. To understand its drawbacks, a simulation framework
characterising the physical and functional aspects of GALS SoCs is presented.
A novel method for synthesising mixed synchronous-asynchronous circuits with varying
levels of rigidity is proposed. Starting with a high-level data ow model of a system which
is intrinsically asynchronous, the key idea is to introduce rigidity of chosen granularity
levels in the model without changing functional behaviour. The system is then partitioned
into functional blocks of synchronous and asynchronous elements before being transformed
into an equivalent circuit which can be synthesised using standard EDA tools
Revisiting Actor Programming in C++
The actor model of computation has gained significant popularity over the
last decade. Its high level of abstraction makes it appealing for concurrent
applications in parallel and distributed systems. However, designing a
real-world actor framework that subsumes full scalability, strong reliability,
and high resource efficiency requires many conceptual and algorithmic additives
to the original model.
In this paper, we report on designing and building CAF, the "C++ Actor
Framework". CAF targets at providing a concurrent and distributed native
environment for scaling up to very large, high-performance applications, and
equally well down to small constrained systems. We present the key
specifications and design concepts---in particular a message-transparent
architecture, type-safe message interfaces, and pattern matching
facilities---that make native actors a viable approach for many robust,
elastic, and highly distributed developments. We demonstrate the feasibility of
CAF in three scenarios: first for elastic, upscaling environments, second for
including heterogeneous hardware like GPGPUs, and third for distributed runtime
systems. Extensive performance evaluations indicate ideal runtime behaviour for
up to 64 cores at very low memory footprint, or in the presence of GPUs. In
these tests, CAF continuously outperforms the competing actor environments
Erlang, Charm++, SalsaLite, Scala, ActorFoundry, and even the OpenMPI.Comment: 33 page
Elastic circuits
Elasticity in circuits and systems provides tolerance to variations in computation and communication delays. This paper presents a comprehensive overview of elastic circuits for those designers who are mainly familiar with synchronous design. Elasticity can be implemented both synchronously and asynchronously, although it was traditionally more often associated with asynchronous circuits. This paper shows that synchronous and asynchronous elastic circuits can be designed, analyzed, and optimized using similar techniques. Thus, choices between synchronous and asynchronous implementations are localized and deferred until late in the design process.Peer ReviewedPostprint (published version
- …