Coarray-based Load Balancing on Heterogeneous and Many-Core Architectures
In order to reach challenging performance goals, computer architecture is expected to change significantly in the near future. Heterogeneous chips, equipped with different types of cores and memory, will force application developers to deal with irregular communication patterns, high levels of parallelism, and unexpected behavior.
Load balancing among the heterogeneous compute units will be a critical task in achieving effective usage of the computational power provided by such new architectures. In this highly dynamic scenario, Partitioned Global Address Space (PGAS) languages, like Coarray Fortran, appear to be a promising alternative to standard MPI programming with two-sided communication, in particular because of their one-sided communication semantics and ease of programmability. In this paper, we show how Coarray Fortran can be used to implement dynamic load balancing algorithms on an exascale compute node and how these algorithms can produce performance benefits for an Asian option pricing problem, running in symmetric mode on Intel Xeon Phi Knights Corner and Knights Landing architectures.
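The coarray implementation itself is not shown in the abstract. As a rough, language-agnostic analogue, the Python sketch below (with invented task counts and speeds) mimics the core pattern of such a dynamic scheme: heterogeneous workers repeatedly claim the next task from a shared counter, so faster units automatically take on more work. In Coarray Fortran the claim would be a one-sided atomic fetch-and-add on a coarray rather than a lock.

# Minimal sketch (not the paper's code): dynamic load balancing via a
# shared work counter. Worker speeds are deliberately unequal to mimic
# heterogeneous compute units.
import multiprocessing as mp
import time

N_TASKS = 1000

def worker(counter, lock, results, wid, speed):
    done = 0
    while True:
        with lock:                      # stands in for a one-sided atomic fetch-and-add
            if counter.value >= N_TASKS:
                break                   # no work left
            counter.value += 1
        time.sleep(speed * 0.0001)      # simulated per-task cost: higher = slower unit
        done += 1
    results[wid] = done

if __name__ == "__main__":
    counter = mp.Value("i", 0)
    lock = mp.Lock()
    results = mp.Array("i", 4)
    speeds = [1, 1, 4, 8]               # two fast units, two slow ones
    procs = [mp.Process(target=worker, args=(counter, lock, results, i, s))
             for i, s in enumerate(speeds)]
    for p in procs: p.start()
    for p in procs: p.join()
    print("tasks completed per worker:", list(results))  # fast units claim more tasks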
A framework for unit testing with coarray Fortran
Parallelism is a ubiquitous feature of modern computing architectures; indeed, we might even say that serial code is now automatically legacy code. Writing parallel code poses significant challenges to programmers and is often error-prone. Partitioned Global Address Space (PGAS) languages, such as Coarray Fortran (CAF), represent a promising development direction in the quest for a trade-off between simplicity and performance. CAF is a parallel programming model that allows a smooth migration from serial to parallel code. However, despite CAF's simplicity, refactoring serial code and migrating it to parallel versions is still error-prone, especially in complex software. The combination of unit testing, which drastically reduces defect injection, and CAF is therefore a very appealing prospect; however, it requires appropriate tools to realize its potential. In this paper, we present the first CAF-compatible framework for unit testing, developed as an extension to the Parallel Fortran Unit Test framework (pFUnit).
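The pFUnit extension is not detailed in the abstract. Purely as an illustration of the underlying problem, the hypothetical Python sketch below shows what a parallel-aware test runner must do: execute the same assertion on every worker (every image, in CAF terms) and report failure unless all of them pass.

# Minimal sketch (hypothetical, not pFUnit's API): a parallel-aware unit
# test runs an assertion on every worker and aggregates the outcomes.
import multiprocessing as mp

def assert_on_image(image_id, n_images, failures):
    # Each "image" checks the same parallel invariant; here, that a
    # (stand-in) collective reduction yields the expected global sum.
    try:
        total = sum(range(1, n_images + 1))
        expected = n_images * (n_images + 1) // 2
        assert total == expected, f"image {image_id}: {total} != {expected}"
    except AssertionError as exc:
        failures.append(str(exc))

if __name__ == "__main__":
    n = 4
    with mp.Manager() as mgr:
        failures = mgr.list()
        procs = [mp.Process(target=assert_on_image, args=(i, n, failures))
                 for i in range(n)]
        for p in procs: p.start()
        for p in procs: p.join()
        # The runner aggregates per-image results, as a CAF-aware
        # framework must: the test passes only if every image passed.
        print("PASS" if len(failures) == 0 else f"FAIL: {list(failures)}")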
QoS-aware offloading policies for serverless functions in the Cloud-to-Edge continuum
The Function-as-a-Service (FaaS) paradigm is increasingly attractive for bringing the benefits of serverless computing to the edge of the network, in addition to traditional Cloud data centers. However, FaaS adoption in the emerging Cloud-to-Edge continuum is challenging, mostly due to geographical distribution and heterogeneous resource availability. This emerging landscape calls for effective strategies to trade off the low latency of the network edge against the resource richness of the Cloud, taking into account the needs of different functions and users. In this paper, we present QoS-aware offloading policies for serverless functions running in the Cloud-to-Edge continuum. We consider heterogeneous functions and service classes, and aim to maximize utility given a monetary budget for resource usage. Specifically, we introduce a two-level approach, where (i) FaaS nodes rely on a randomized policy to schedule every incoming request according to a set of probability values, and (ii) periodically, a linear programming model is solved to determine the probabilities to use for scheduling. We show by extensive simulation that our approach outperforms alternative approaches in terms of generated utility across multiple scenarios. Moreover, we demonstrate that our solution is computationally efficient and can be adopted in large-scale systems. We also demonstrate the functionality of our approach through a proof-of-concept experiment on an open-source FaaS framework.
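The abstract does not give the LP itself. The Python sketch below, with entirely invented rates, utilities, and costs, illustrates the general shape of step (ii): choose per-class probabilities over three actions (execute locally, offload to the Cloud, or drop) that maximize expected utility subject to a monetary budget and a local capacity limit.

# Minimal sketch (illustrative numbers, not the paper's model): an LP
# that computes the scheduling probabilities used by the randomized
# policy. scipy.optimize.linprog minimizes, so utility is negated.
import numpy as np
from scipy.optimize import linprog

rates   = np.array([10.0, 5.0])            # request rates for two service classes
utility = np.array([[1.0, 0.7, 0.0],       # utility per request: [local, cloud, drop]
                    [2.0, 1.5, 0.0]])      # the premium class values completion more
cloud_cost = 0.02                          # monetary cost per offloaded request
budget     = 0.15                          # budget per unit time
local_cap  = 8.0                           # local capacity (requests per unit time)

# Decision vector: x = [p(c0,local), p(c0,cloud), p(c0,drop),
#                       p(c1,local), p(c1,cloud), p(c1,drop)]
c = -(rates[:, None] * utility).ravel()    # maximize expected utility

A_ub = [
    [0, rates[0]*cloud_cost, 0, 0, rates[1]*cloud_cost, 0],  # budget constraint
    [rates[0], 0, 0, rates[1], 0, 0],                        # local capacity
]
b_ub = [budget, local_cap]
A_eq = [[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]]              # probabilities sum to 1
b_eq = [1, 1]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, 1))
print(res.x.reshape(2, 3))                 # scheduling probabilities per class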
FIGARO: reinForcement learnInG mAnagement acRoss the computing cOntinuum
The widespread adoption of Artificial Intelligence applications to analyze data generated by Internet of Things sensors has driven the development of the edge computing paradigm. Deploying applications at the periphery of the network effectively addresses the cost and latency concerns associated with cloud computing. However, it generates a highly distributed environment with heterogeneous devices, raising the challenges of how to select resources and where to place application components. Starting from a state-of-the-art design-time tool, we present in this paper a novel framework based on Reinforcement Learning, named FIGARO (reinForcement learnInG mAnagement acRoss the computing cOntinuum). It handles the runtime adaptation of a computing continuum environment, dealing with the variability of the incoming load and service times. To reduce the training time, we exploit the design-time knowledge, achieving a significant reduction in violations of the response time constraint.
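FIGARO's actual state space, actions, and rewards are not described in the abstract. The toy Q-learning loop below (Python, with an invented cost model) only illustrates the general mechanism: a policy that adapts placement to a varying load, with the value table warm-started from design-time knowledge to shorten training.

# Minimal sketch (toy model, not FIGARO itself): tabular Q-learning for
# runtime placement adaptation, warm-started from a design-time solution.
import random

STATES  = 3            # discretized load: 0 = low, 1 = medium, 2 = high
ACTIONS = 2            # 0 = run on edge, 1 = offload to cloud
alpha, gamma, eps = 0.1, 0.9, 0.1

def response_time(state, action):
    # Hypothetical cost model: the edge is fast until it saturates,
    # while the cloud adds a fixed network latency.
    return [0.1, 0.3, 1.5][state] if action == 0 else 0.5

Q = [[0.0] * ACTIONS for _ in range(STATES)]
Q[2][1] = 1.0          # warm start from design-time knowledge:
                       # "offload under high load"

state = 0
for _ in range(10_000):
    action = random.randrange(ACTIONS) if random.random() < eps \
             else max(range(ACTIONS), key=lambda a: Q[state][a])
    reward = -response_time(state, action)     # lower response time = better
    nxt = random.randrange(STATES)             # load varies exogenously
    Q[state][action] += alpha * (reward + gamma * max(Q[nxt]) - Q[state][action])
    state = nxt

# Learned greedy policy; expected: edge (0) at low/medium load, cloud (1) at high.
print([max(range(ACTIONS), key=lambda a: Q[s][a]) for s in range(STATES)])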
C-peptide: a predictor of cardiovascular mortality in subjects with established atherosclerotic disease
Aim: Insulin resistance and type 2 diabetes are independent risk factors for cardiovascular diseases. Levels of C-peptide are increased in these patients, and its role in atherosclerosis progression has been studied in vitro and in vivo in recent years. To evaluate the possible use of C-peptide as a cardiovascular biomarker, we designed an observational study in which we enrolled patients with mono- or poly-vascular atherosclerotic disease.
Methods: We recruited 431 patients with stable atherosclerosis and performed a yearly follow-up to estimate cardiovascular and total mortality and cardiovascular events.
Results: We completed a mean follow-up of 56 months on 268 patients. A multivariate Cox analysis showed that C-peptide significantly increased the risk of cardiovascular mortality [hazard ratio: 1.29 (95% confidence interval: 1.02-1.65, p < 0.03513)] after adjustment for age, sex, diabetes treatment, estimated glomerular filtration rate and known diabetes status. Furthermore, levels of C-peptide were significantly correlated with metabolic parameters and atherogenic factors.
Conclusion: C-peptide was associated with cardiovascular mortality independently of known diabetes status in a cohort of patients with chronic atherosclerotic disease. Future studies incorporating C-peptide into a reclassification approach might be undertaken to assess its potential as a cardiovascular disease biomarker.
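For readers unfamiliar with the statistics, the sketch below shows the kind of adjusted Cox proportional-hazards fit reported above, using the lifelines library on synthetic data (not the study's dataset, and with only a subset of the study's covariates). The hazard ratio for C-peptide is the exponential of its fitted coefficient.

# Minimal sketch (synthetic data): an adjusted Cox proportional-hazards
# fit of the kind reported in the Results section.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
n = 268
df = pd.DataFrame({
    "c_peptide": rng.lognormal(0.5, 0.4, n),
    "age":       rng.normal(65, 9, n),
    "sex":       rng.integers(0, 2, n),
    "diabetes":  rng.integers(0, 2, n),
})
# Synthetic survival times: higher C-peptide shortens time to event.
df["months"] = rng.exponential(56 / np.exp(0.25 * df["c_peptide"]), n)
df["event"]  = rng.integers(0, 2, n)       # 1 = cardiovascular death observed

cph = CoxPHFitter()
cph.fit(df, duration_col="months", event_col="event")
cph.print_summary()                        # the exp(coef) column gives hazard ratios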
The Performance of Distributed Applications: A Traffic Shaping Perspective
Widely used in datacenters and clouds, network traffic shaping is a performance-influencing factor that is often overlooked when benchmarking or simply deploying distributed applications. While in theory traffic shaping should allow for a fairer sharing of network resources, in practice it also introduces new problems: performance (measurement) inconsistency and long tails. In this paper we investigate the effects of traffic shaping mechanisms on common distributed applications. We characterize the performance of a distributed key-value store, big data workloads, and high-performance computing under state-of-the-art benchmarks, while the underlying network's traffic is shaped using state-of-the-art mechanisms such as token buckets or priority queues. Our results show that the impact of traffic shaping needs to be taken into account when benchmarking or deploying distributed applications. To help researchers, practitioners, and application developers, we uncover several practical implications and make recommendations on how certain applications should be deployed so that performance is least impacted by the shaping mechanisms.
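To make the mechanism concrete, here is a minimal Python sketch of a token bucket, one of the shaping mechanisms mentioned above (rates and sizes are invented). A full bucket lets an initial burst through unshaped, after which traffic is throttled to the refill rate; this burst-then-throttle behavior is one source of the measurement inconsistency and long tails discussed in the paper.

# Minimal sketch of a token-bucket traffic shaper.
import time

class TokenBucket:
    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps          # tokens (bytes) added per second
        self.capacity = burst_bytes   # maximum bucket size
        self.tokens = burst_bytes     # start full: the first burst is unshaped
        self.last = time.monotonic()

    def send(self, nbytes):
        """Block until nbytes can be sent, then consume that many tokens."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= nbytes:
                self.tokens -= nbytes
                return
            time.sleep((nbytes - self.tokens) / self.rate)  # wait for refill

bucket = TokenBucket(rate_bps=1_000_000, burst_bytes=64_000)
start = time.monotonic()
for i in range(10):
    bucket.send(32_000)               # the first two sends burst, the rest are paced
    print(f"packet {i} sent at {time.monotonic() - start:.3f}s")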
A game-theoretic approach to computation offloading in mobile cloud computing
We consider a three-tier architecture for mobile and pervasive computing scenarios, consisting of a local tier of mobile nodes; a middle tier (cloudlets) of nearby computing nodes, typically located at the mobile nodes' access points but characterized by a limited amount of resources; and a remote tier of distant cloud servers, which have practically infinite resources. This architecture has been proposed to get the benefits of computation offloading from mobile nodes to external servers while limiting the use of distant servers, whose higher latency could negatively impact the user experience. For this architecture, we consider a usage scenario where no central authority exists and multiple non-cooperative mobile users share the limited computing resources of a close-by cloudlet and can selfishly decide to send their computations to any of the three tiers. We define a model to capture the users' interaction and to investigate the effects of computation offloading on the users' perceived performance. We formulate the problem as a generalized Nash equilibrium problem and show the existence of an equilibrium. We present a distributed algorithm for the computation of an equilibrium which is tailored to the problem structure and is based on an in-depth analysis of the underlying equilibrium problem. Through numerical examples, we illustrate its behavior and the characteristics of the achieved equilibria.
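The paper's algorithm and cost model are not given in the abstract. As an illustration of the general idea, the Python sketch below (with an invented, much simpler cost model) runs a best-response iteration for such an offloading game: each user repeatedly picks the tier minimizing its own delay given the others' choices, with the cloudlet as the shared, congestible resource, until no user wants to deviate.

# Minimal sketch (toy cost model, not the paper's algorithm):
# best-response dynamics for a three-tier offloading game.
LOCAL, CLOUDLET, CLOUD = 0, 1, 2
N_USERS = 6
local_delay = [1.0, 1.2, 0.8, 2.0, 1.5, 0.9]   # heterogeneous device speeds
cloud_delay = 1.1                               # distant cloud: fixed, higher latency

def cloudlet_delay(n_sharing):
    return 0.2 * n_sharing                      # delay grows with contention

choices = [LOCAL] * N_USERS
for _ in range(50):                             # iterate until no user deviates
    changed = False
    for u in range(N_USERS):
        n_others = sum(1 for v in range(N_USERS)
                       if v != u and choices[v] == CLOUDLET)
        costs = [local_delay[u], cloudlet_delay(n_others + 1), cloud_delay]
        best = min(range(3), key=lambda a: costs[a])
        if best != choices[u]:
            choices[u], changed = best, True
    if not changed:                             # fixed point = an equilibrium
        break

print(choices)   # equilibrium assignment of users to tiers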
Efficient algebraic multigrid preconditioners on clusters of GPUs
Many scientific applications require the solution of large and sparse linear systems of equations using Krylov subspace methods; in this case, the choice of an effective preconditioner may be crucial for the convergence of the Krylov solver. Algebraic MultiGrid (AMG) methods are widely used as preconditioners because of their optimal computational cost and their algorithmic scalability. The wide availability of GPUs, now found in many of the fastest supercomputers, poses the problem of implementing these methods efficiently on high-throughput processors. In this work we focus on the application phase of AMG preconditioners, and in particular on the choice and implementation of smoothers and coarsest-level solvers capable of exploiting the computational power of clusters of GPUs. We consider block-Jacobi smoothers that use sparse approximate inverses in the solve phase associated with the local blocks. The choice of approximate inverses instead of sparse matrix factorizations is driven by the large amount of parallelism exposed by the matrix-vector product, as compared to the solution of large triangular systems on GPUs. The selected smoothers and solvers are implemented within the AMG preconditioning framework provided by the MLD2P4 library, using suitable sparse matrix data structures from the PSBLAS library. Their behaviour is illustrated in terms of execution speed and scalability on a test case concerning groundwater modelling, provided by the Jülich Supercomputing Center within the Horizon 2020 Project EoCoE.
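The key design point, applying the smoother through matrix-vector products rather than triangular solves, can be shown in a few lines. The Python sketch below (a dense toy problem, not MLD2P4/PSBLAS code; exact block inverses stand in for the sparse approximate inverses) performs damped block-Jacobi smoothing sweeps where every step is a matvec, which is the GPU-friendly property the paper exploits.

# Minimal sketch: block-Jacobi smoothing where each local block's
# (approximate) inverse is applied as a matrix-vector product.
import numpy as np
import scipy.sparse as sp

n, nb = 12, 4                                  # matrix size, block size
A = sp.diags([[-1.0]*(n-1), [4.0]*n, [-1.0]*(n-1)], [-1, 0, 1], format="csr")

# Build M^{-1} as a block-diagonal inverse of A's diagonal blocks
# (a stand-in for a sparse approximate inverse with a fixed pattern).
blocks = [np.linalg.inv(A[i:i+nb, i:i+nb].toarray()) for i in range(0, n, nb)]
Minv = sp.block_diag(blocks, format="csr")

b = np.ones(n)
x = np.zeros(n)
for it in range(10):                           # damped block-Jacobi sweeps
    x = x + 0.8 * (Minv @ (b - A @ x))         # only matvecs, no triangular solves
    print(it, np.linalg.norm(b - A @ x))       # residual shrinks each sweep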
A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures
In recent years, the field of Deep Learning has seen many disruptive and impactful advancements. Given the increasing complexity of deep neural networks, the need for efficient hardware accelerators in the design of heterogeneous HPC platforms has become more and more pressing. The design of Deep Learning accelerators requires a multidisciplinary approach, combining expertise from several areas, spanning from computer architecture to approximate computing, computational models, and machine learning algorithms. Several methodologies and tools have been proposed to design accelerators for Deep Learning, including hardware-software co-design approaches, high-level synthesis methods, specific customized compilers, and methodologies for design space exploration, modeling, and simulation. These methodologies aim to maximize the exploitable parallelism and minimize data movement to achieve high performance and energy efficiency. This survey provides a holistic review of the most influential design methodologies and EDA tools proposed in recent years to implement Deep Learning accelerators, offering the reader a wide perspective on this rapidly evolving field. In particular, this work complements the previous survey proposed by the same authors in [203], which focuses on Deep Learning hardware accelerators for heterogeneous HPC platforms.