Many-Core CPUs Can Deliver Scalable Performance to Stochastic Simulations of Large-Scale Biochemical Reaction Networks
Stochastic simulation of large-scale biochemical reaction networks is becoming essential for Systems Biology. It enables the in-silico investigation of complex biological system dynamics under different conditions and intervention strategies, while also taking into account the inherent "biological noise" that is especially present in the low-species-count regime. It is, however, a great computational challenge, since in practice we need to execute many repetitions of a complex simulation model to assess the average- and extreme-case behavior of the dynamical system it represents. The problem's work scales quickly with the number of repetitions required and the number of reactions in the bio-model. The worst-case scenario is when there is a need to run thousands of repetitions of a complex model with thousands of reactions. We have developed a stochastic simulation software framework for many- and multi-core CPUs. It is evaluated using Intel's experimental many-core Single-chip Cloud Computer (SCC) CPU and the latest-generation consumer-grade Core i7 multi-core Intel CPU, running Gillespie's First Reaction Method exact stochastic simulation algorithm. We show that emerging many-core NoC processors can provide scalable performance, achieving linear speedup as simulation work scales in both dimensions.
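Gillespie's First Reaction Method, which the framework above executes, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the dict-based state and the (propensity, update) tuple convention are assumptions made here for clarity:

```python
import math
import random

def first_reaction_method(x, reactions, t_end, rng=random.random):
    """Gillespie's First Reaction Method (minimal sketch).

    x         -- dict mapping species name -> copy number
    reactions -- list of (propensity_fn, update_fn) pairs
    t_end     -- simulation horizon
    """
    t = 0.0
    trajectory = [(t, dict(x))]
    while t < t_end:
        # Draw a putative firing time for every active reaction:
        # tau_j ~ Exponential(a_j), i.e. -ln(1 - u) / a_j.
        taus = []
        for j, (propensity, _) in enumerate(reactions):
            a = propensity(x)
            if a > 0.0:
                taus.append((-math.log(1.0 - rng()) / a, j))
        if not taus:
            break  # no reaction can fire any more
        tau, j = min(taus)  # the earliest putative reaction fires
        t += tau
        reactions[j][1](x)  # apply that reaction's state update
        trajectory.append((t, dict(x)))
    return trajectory
```

For a pure-decay model A -> 0 with rate constant 2.0, the propensity would be `2.0 * x["A"]` and the update decrements A; running many repetitions of such a loop is exactly the embarrassingly parallel workload the abstract describes.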
Harnessing Performance Variability in Embedded and High-performance Many/Multi-core Platforms
This book describes the state-of-the-art of industrial and academic research in the architectural design of heterogeneous, multi/many-core processors. The authors describe methods and tools to enable next-generation embedded and high-performance heterogeneous processors to confront cost-effectively the inevitable variations by providing Dependable-Performance: correct functionality and timing guarantees throughout the expected lifetime of a platform under thermal, power, and energy constraints. Various aspects of the reliability problem are discussed, at both the circuit and architecture level, the intelligent selection of knobs and monitors in multicore platforms, and systematic design methodologies. The authors demonstrate how new techniques have been applied in real case studies from different application domains and report on the results and conclusions of those experiments.
Enables readers to develop performance-dependable heterogeneous multi/many-core architectures
Describes system software designs that support high performance dependability requirements
Discusses and analyzes low-level methodologies to trade off conflicting metrics, e.g., power, performance, reliability, and thermal management
Includes new application design guidelines to improve performance dependability
The Unexpected Efficiency of Bin Packing Algorithms for Dynamic Storage Allocation in the Wild: An Intellectual Abstract
Recent work has shown that viewing allocators as black-box 2DBP solvers bears
meaning. For instance, there exists a 2DBP-based fragmentation metric which
often correlates monotonically with maximum resident set size (RSS). Given the
field's indeterminacy with respect to fragmentation definitions, as well as the
immense value of physical memory savings, we are motivated to set
allocator-generated placements against their 2DBP-devised, makespan-optimizing
counterparts. Of course, allocators must operate online while 2DBP algorithms
work on complete request traces; but since both sides optimize criteria related
to minimizing memory wastage, the idea of studying their relationship preserves
its intellectual--and practical--interest.
Unfortunately, no implementations of 2DBP algorithms for DSA are available.
This paper presents a first, though partial, implementation of the
state-of-the-art. We validate its functionality by comparing its outputs'
makespan to the theoretical upper bound provided by the original authors. Along
the way, we identify and document key details to assist analogous future
efforts.
Our experiments comprise 4 modern allocators and 8 real application
workloads. We make several notable observations on our empirical evidence: in
terms of makespan, allocators outperform Robson's worst-case lower bound … of
the time. In … of cases, GNU's \texttt{malloc} implementation demonstrates
equivalent or superior performance to the 2DBP state-of-the-art, despite the
latter operating offline. Most surprisingly, the 2DBP algorithm proves
competent in terms of fragmentation, producing up to …x better solutions.
Future research can leverage such insights towards memory-targeting
optimizations.
Comment: 13 pages, 10 figures, 3 tables. To appear in ISMM '2
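To make the offline setting concrete: a 2DBP view of DSA treats each allocation request as a rectangle (lifetime along the time axis, size along the address axis) and seeks placements whose peak address, the makespan, is low. The first-fit placement below is an illustrative toy written for this summary, not the state-of-the-art algorithm the paper implements:

```python
def place_offline(requests):
    """First-fit offline placement for a complete allocation trace
    (an illustrative sketch, not the paper's algorithm).

    requests -- list of (start, end, size) lifetimes; each request gets a
    byte offset so that blocks alive at the same time never overlap in
    address space, while keeping the peak offset (the "makespan") low.
    """
    placed = []  # (start, end, offset, size)
    for start, end, size in sorted(requests):
        offset = 0
        # Scan placed blocks in address order, bumping past every block
        # that overlaps the candidate both in time and in address space.
        for s, e, o, sz in sorted(placed, key=lambda b: b[2]):
            overlaps_in_time = s < end and start < e
            if overlaps_in_time and o < offset + size and offset < o + sz:
                offset = o + sz  # first-fit: jump just above this block
        placed.append((start, end, offset, size))
    makespan = max((o + sz for _, _, o, sz in placed), default=0)
    return placed, makespan
```

Because an online allocator must commit to an offset the moment a request arrives, while this routine sees the whole trace up front, comparing the two makespans is precisely the kind of offline-vs-online study the abstract describes.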
Adjacent LSTM-Based Page Scheduling for Hybrid DRAM/NVM Memory Systems
Recent advances in memory technologies have led to the rapid growth of hybrid systems that combine traditional DRAM and Non-Volatile Memory (NVM) technologies, as the latter provide lower cost per byte, low leakage power, and larger capacities than DRAM, while guaranteeing comparable access latency. Such heterogeneous memory systems impose new challenges in terms of page placement and migration among the alternative memory technologies. In this paper, we present a novel approach for efficient page placement on heterogeneous DRAM/NVM systems. We design an adjacent LSTM-based approach for page placement, which relies strongly on page-access prediction while sharing knowledge among pages with behavioral similarity. The proposed approach improves performance by up to 65.5% compared to existing approaches, while achieving near-optimal results and saving 20.2% energy consumption on average. Moreover, we propose a new page replacement policy, namely clustered-LRU, achieving up to 8.1% better performance compared to the default Least Recently Used (LRU) policy.
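The default LRU baseline that the proposed clustered-LRU policy is compared against can be sketched with an ordered map; the class and method names below are illustrative, not taken from the paper:

```python
from collections import OrderedDict

class LRUPageCache:
    """Least-Recently-Used page replacement (the standard baseline policy),
    sketched with an OrderedDict that keeps pages oldest-first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = OrderedDict()  # page id -> payload, oldest first

    def access(self, page, payload=None):
        """Touch a page; returns the evicted page id, if any."""
        evicted = None
        if page in self.pages:
            self.pages.move_to_end(page)  # now the most recently used
        else:
            if len(self.pages) >= self.capacity:
                evicted, _ = self.pages.popitem(last=False)  # evict LRU page
            self.pages[page] = payload
        return evicted
```

A clustered variant, as the abstract suggests, would additionally group pages with similar access behavior and make eviction decisions per cluster rather than per individual page.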
Resource Aware GPU Scheduling in Kubernetes Infrastructure
Nowadays, an ever-increasing number of artificial intelligence inference workloads is pushed to and executed on the cloud. To effectively serve and manage the computational demands, data center operators have provisioned their infrastructures with accelerators. For GPUs specifically, support for efficient management is lacking, as state-of-the-art schedulers and orchestrators treat GPUs only as typical compute resources, ignoring their unique characteristics and application properties. This phenomenon, combined with the GPU over-provisioning problem, leads to severe resource under-utilization. Even though prior work has addressed this problem by colocating applications on a single accelerator device, its resource-agnostic nature fails to address the resulting resource under-utilization and quality-of-service violations, especially for latency-critical applications.
In this paper, we design a resource-aware GPU scheduling framework able to efficiently colocate applications on the same GPU accelerator card. We integrate our solution with Kubernetes, one of the most widely used cloud orchestration frameworks. We show that our scheduler can achieve 58.8% lower 99th-percentile end-to-end job execution time, while delivering 52.5% higher GPU memory usage, 105.9% higher GPU utilization on average, and 44.4% lower energy consumption on average, compared to state-of-the-art schedulers, for a variety of representative ML workloads.
EDEN: A high-performance, general-purpose, NeuroML-based neural simulator
Modern neuroscience employs in silico experimentation on ever-increasing and
more detailed neural networks. The high modelling detail goes hand in hand with
the need for high model reproducibility, reusability and transparency. Moreover,
the size of the models and the long timescales under study mandate the use of a
simulation system with high computational performance, so as to provide an
acceptable time to result. In this work, we present EDEN (Extensible Dynamics
Engine for Networks), a new general-purpose, NeuroML-based neural simulator
that achieves both high model flexibility and high computational performance,
through an innovative model-analysis and code-generation technique. The
simulator runs NeuroML v2 models directly, eliminating the need for users to
learn yet another simulator-specific, model-specification language. EDEN's
functional correctness and computational performance were assessed through
NeuroML models available on the NeuroML-DB and Open Source Brain model
repositories. In qualitative experiments, the results produced by EDEN were
verified against the established NEURON simulator, for a wide range of models.
At the same time, computational-performance benchmarks reveal that EDEN runs up
to two orders of magnitude faster than NEURON on a typical desktop computer, and
does so without additional effort from the user. Finally, and without added
user effort, EDEN has been built from scratch to scale seamlessly over multiple
CPUs and across computer clusters, when available.
Comment: 29 pages, 9 figures
Co-Design of Approximate Multilayer Perceptron for Ultra-Resource Constrained Printed Circuits
Printed Electronics (PE) offers on-demand, extremely low-cost hardware due to its additive manufacturing process, enabling machine learning (ML) applications for domains that feature ultra-low cost, conformity, and non-toxicity requirements that silicon-based systems cannot deliver. Nevertheless, the large feature sizes in PE prohibit the realization of complex printed ML circuits. In this work, we present, for the first time, an automated printed-aware software/hardware co-design framework that exploits approximate computing principles to enable ultra-resource constrained printed multilayer perceptrons (MLPs). Our evaluation demonstrates that, compared to the state-of-the-art baseline, our circuits feature on average 6x (5.7x) lower area (power) and less than 1% accuracy loss.
Approximate Computing Survey, Part II: Application-Specific & Architectural Approximation Techniques and Applications
The challenging deployment of compute-intensive applications from domains
such as Artificial Intelligence (AI) and Digital Signal Processing (DSP) forces
the computing-systems community to explore new design approaches. Approximate
Computing has emerged as a solution, allowing designers to tune the quality of
results in the design of a system in order to improve the energy
efficiency and/or performance. This radical paradigm shift has attracted
interest from both academia and industry, resulting in significant research on
approximation techniques and methodologies at different design layers (from
system down to integrated circuits). Motivated by the wide appeal of
Approximate Computing over the last 10 years, we conduct a two-part survey to
cover key aspects (e.g., terminology and applications) and review the
state-of-the-art approximation techniques from all layers of the traditional
computing stack. In Part II of our survey, we classify and present the
technical details of application-specific and architectural approximation
techniques, which both target the design of resource-efficient
processors/accelerators & systems. Moreover, we present a detailed analysis of
the application spectrum of Approximate Computing and discuss open challenges
and future directions.
Comment: Under Review at ACM Computing Survey
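As a toy illustration of the accuracy-for-efficiency trade-off that Approximate Computing exploits, consider loop perforation, a well-known software-level approximation technique in which a computation deliberately skips part of its work (the example and numbers below are mine, not the survey's):

```python
def mean_exact(xs):
    """Baseline: process every element."""
    return sum(xs) / len(xs)

def mean_perforated(xs, skip=2):
    """Loop perforation: process only every `skip`-th element,
    trading a small accuracy loss for proportionally less work."""
    sampled = xs[::skip]
    return sum(sampled) / len(sampled)
```

With `skip=2` the perforated version touches half the data; for well-behaved inputs the result stays close to the exact mean, which is exactly the kind of tunable quality/effort knob the survey classifies.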