Collective Tuning Initiative
Computing systems rarely deliver the best possible performance due to ever-increasing hardware and software complexity and the limitations of current optimization technology. Additional code and architecture optimizations are often required to improve execution time, size, power consumption, reliability and other important characteristics of computing systems. However, this is often a tedious, repetitive, isolated and time-consuming process. In order to automate, simplify and systematize program optimization and architecture design, we are developing an open-source, modular, plugin-based Collective Tuning Infrastructure (CTI, http://cTuning.org) that can distribute the optimization process and leverage the optimization experience of multiple users. CTI provides a novel, fully integrated, collaborative, "one button" approach to improving existing underperforming computing systems, ranging from embedded architectures to high-performance servers, based on systematic iterative compilation, statistical collective optimization and machine learning. Our experimental results show that it is possible to automatically reduce the execution time (and code size) of some programs, from SPEC2006 and EEMBC among others, by more than a factor of 2. It can also reduce development and testing time considerably. Together with the first production-quality machine-learning-enabled interactive research compiler (MILEPOST GCC), this infrastructure opens up many research opportunities to study and develop future realistic self-tuning and self-organizing adaptive intelligent computing systems based on systematic statistical performance evaluation and benchmarking. Finally, using a common optimization repository is intended to improve the quality and reproducibility of research on architecture and code optimization.
Comment: GCC Developers' Summit'09, 14 June 2009, Montreal, Canada
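The iterative-compilation idea behind CTI can be illustrated with a toy search loop. The flag names and the simulated `measure` function below are illustrative assumptions, not part of cTuning; a real deployment would compile and time an actual benchmark for each flag combination.

```python
import itertools

# Candidate optimization flags (illustrative; real CTI explores many GCC flags).
FLAGS = ("-funroll-loops", "-ftree-vectorize", "-fomit-frame-pointer")

def measure(flag_set):
    """Stand-in for compiling and timing a benchmark with the given flags.
    We fake a runtime here so the example is self-contained."""
    speedups = {"-funroll-loops": 1.2,
                "-ftree-vectorize": 1.5,
                "-fomit-frame-pointer": 1.05}
    t = 10.0  # baseline runtime in seconds (made up)
    for f in flag_set:
        t /= speedups[f]
    return t

def iterative_search(flags):
    """Exhaustively evaluate every flag combination, keep the fastest."""
    best_time, best_flags = measure(()), ()
    for r in range(1, len(flags) + 1):
        for combo in itertools.combinations(flags, r):
            t = measure(combo)
            if t < best_time:
                best_time, best_flags = t, combo
    return best_time, best_flags
```

In this toy model every flag helps, so the search converges on the full flag set; collective tuning amortizes such searches by sharing results across users.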
Optimized Composition: Generating Efficient Code for Heterogeneous Systems from Multi-Variant Components, Skeletons and Containers
In this survey paper, we review recent work on frameworks for the high-level,
portable programming of heterogeneous multi-/manycore systems (especially,
GPU-based systems) using high-level constructs such as annotated user-level
software components, skeletons (i.e., predefined generic components) and
containers, and discuss the optimization problems that need to be considered in
selecting among multiple implementation variants, generating code and providing
runtime support for efficient execution on such systems.
Comment: Presented at 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014) (arXiv:1405.2281)
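The core selection problem the survey discusses can be sketched minimally: a component exists in several implementation variants, and a dispatcher picks one per call based on the input. The predicate thresholds and variant names below are invented for illustration; real frameworks also consider the target execution unit (CPU vs. GPU) and calibration measurements.

```python
# Two implementation variants of the same abstract component (illustrative).
def sum_sequential(xs):
    total = 0
    for x in xs:
        total += x
    return total

def sum_builtin(xs):
    return sum(xs)

# Ordered dispatch table: (selection predicate on input size, variant).
VARIANTS = [
    (lambda n: n < 100, sum_sequential),  # assumed cheaper for small inputs
    (lambda n: True, sum_builtin),        # fallback for everything else
]

def smart_sum(xs):
    """Pick the first variant whose selection predicate accepts the input size."""
    n = len(xs)
    for predicate, impl in VARIANTS:
        if predicate(n):
            return impl(xs)
```

All variants compute the same result; only the cost differs, which is what makes variant selection an optimization problem rather than a correctness problem.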
EOS: Automatic In-vivo Evolution of Kernel Policies for Better Performance
Today's monolithic kernels often implement a small, fixed set of policies
such as disk I/O scheduling policies, while exposing many parameters to let
users select a policy or adjust the specific setting of the policy. Ideally,
the parameters exposed should be flexible enough for users to tune for good
performance, but in practice, users lack domain knowledge of the parameters and
are often stuck with bad, default parameter settings.
We present EOS, a system that bridges the knowledge gap between kernel
developers and users by automatically evolving the policies and parameters in
vivo on users' real, production workloads. It provides a simple policy
specification API for kernel developers to programmatically describe how the
policies and parameters should be tuned, a policy cache to make in-vivo tuning
easy and fast by memorizing good parameter settings for past workloads, and a
hierarchical search engine to effectively search the parameter space.
Evaluation of EOS on four main Linux subsystems shows that it is easy to use
and effectively improves each subsystem's performance.
Comment: 14 pages, technical report
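Two of EOS's ingredients, the policy cache and the parameter-space search, can be sketched together. The greedy coordinate search and the workload-signature key below are simplifications of my own; EOS's actual hierarchical search engine and kernel-level policy API are more involved.

```python
def search_params(evaluate, space):
    """Greedy coordinate search over a discrete parameter space.
    `space` maps parameter name -> candidate values; `evaluate` returns a
    score (higher is better) for a complete parameter assignment."""
    current = {name: values[0] for name, values in space.items()}
    improved = True
    while improved:
        improved = False
        for name, values in space.items():
            for v in values:
                trial = dict(current, **{name: v})
                if evaluate(trial) > evaluate(current):
                    current, improved = trial, True
    return current

class PolicyCache:
    """Memoize good settings for previously seen workloads, so in-vivo
    tuning is fast when a past workload recurs (EOS-style)."""
    def __init__(self):
        self._cache = {}

    def tune(self, workload_signature, evaluate, space):
        if workload_signature not in self._cache:
            self._cache[workload_signature] = search_params(evaluate, space)
        return self._cache[workload_signature]
```

The cache turns repeated tuning of a recurring workload into a dictionary lookup, which is what makes in-vivo tuning on production workloads affordable.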
Performance-Detective: Automatic Deduction of Cheap and Accurate Performance Models
The many configuration options of modern applications make it difficult for users to select a performance-optimal configuration. Performance models help users understand system performance and choose a fast configuration. Existing performance modeling approaches for applications and configurable systems either require a full-factorial experiment design or a sampling design based on heuristics. This results in high costs for achieving accurate models. Furthermore, they require repeated execution of experiments to account for measurement noise. We propose Performance-Detective, a novel code analysis tool that deduces insights on the interactions of program parameters. We use these insights to derive the smallest necessary experiment design and to avoid repetitions of measurements when possible, significantly lowering the cost of performance modeling. We evaluate Performance-Detective using two case studies in which we reduce the number of measurements from up to 3125 to only 25, decreasing cost to only 2.9% of the previously needed core hours, while maintaining the accuracy of the resulting model at 91.5%, compared to 93.8% when using all 3125 measurements.
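The 3125-to-25 reduction in the abstract corresponds to five parameters with five levels each: a full-factorial design costs 5^5 = 3125 runs, while sweeping one parameter at a time costs 5 * 5 = 25. The sketch below counts both designs; it assumes (as Performance-Detective's analysis would have to establish) that the parameters do not interact, and the function names are mine, not the tool's.

```python
import itertools

def full_factorial(levels_per_param):
    """All combinations of all levels: cost grows multiplicatively
    (five 5-level parameters -> 5^5 = 3125 runs)."""
    return list(itertools.product(*levels_per_param))

def one_at_a_time(levels_per_param, defaults):
    """Vary one parameter at a time around a default configuration.
    Valid only when analysis shows the parameters do not interact;
    cost is the sum of level counts (5 * 5 = 25 runs)."""
    design = []
    for i, levels in enumerate(levels_per_param):
        for level in levels:
            config = list(defaults)
            config[i] = level
            design.append(tuple(config))
    return design
```

The interaction analysis is the crucial step: without it, the cheap design could silently miss configuration effects that only appear when two parameters change together.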
Cloud engineering is search based software engineering too
Many of the problems posed by the migration of computation to cloud platforms can be formulated and solved using techniques associated with Search Based Software Engineering (SBSE). Much of cloud software engineering involves problems of optimisation: performance, allocation, assignment and the dynamic balancing of resources to achieve pragmatic trade-offs between many competing technical and business objectives. SBSE is concerned with the application of computational search and optimisation to solve precisely these kinds of software engineering challenges. Interest in both cloud computing and SBSE has grown rapidly in the past five years, yet there has been little work on SBSE as a means of addressing cloud computing challenges. Like many computationally demanding activities, SBSE has the potential to benefit from the cloud; ‘SBSE in the cloud’. However, this paper focuses, instead, on the ways in which SBSE can benefit cloud computing. It thus develops the theme of ‘SBSE for the cloud’, formulating cloud computing challenges in ways that can be addressed using SBSE.
Rolex: Resilience-Oriented Language Extensions for Extreme-Scale Systems
Future exascale high-performance computing (HPC) systems will be constructed
from VLSI devices that will be less reliable than those used today, and faults
will become the norm, not the exception. This will pose significant problems
for system designers and programmers, who for half a century have enjoyed an execution model that assumed correct behavior by the underlying computing system. The mean time to failure (MTTF) of a system scales inversely with the number of its components, so faults and the resulting system-level failures will increase as systems scale in the number of processor cores and memory modules used. However, not every detected error need cause catastrophic failure: many HPC applications are inherently fault resilient. Yet it is the application programmers who have this knowledge, and they lack mechanisms to convey it to the system.
In this paper, we present new Resilience Oriented Language Extensions (Rolex)
which facilitate the incorporation of fault resilience as an intrinsic property
of the application code. We describe the syntax and semantics of the language
extensions as well as the implementation of the supporting compiler
infrastructure and runtime system. Our experiments show that an approach that
leverages the programmer's insight to reason about the context and significance
of faults to the application outcome significantly improves the probability
that an application runs to a successful conclusion.
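The paper's central idea, letting the programmer mark which regions can survive or retry on a detected fault, can be sketched in a language-neutral way. Rolex itself is a set of C/C++ language extensions with compiler and runtime support; the decorator below is a hypothetical stand-in of mine, not Rolex syntax, and the fault model (an exception signalling a detected soft error) is likewise an assumption.

```python
def tolerant(retries=3):
    """Hypothetical resilience annotation: if the protected region raises a
    detected-fault error, re-execute it instead of aborting the application.
    (Rolex expresses this intent via C/C++ extensions, not decorators.)"""
    def wrap(fn):
        def run(*args, **kwargs):
            for _attempt in range(retries):
                try:
                    return fn(*args, **kwargs)
                except RuntimeError:  # stand-in for a detected soft error
                    continue
            raise RuntimeError("fault persisted after retries")
        return run
    return wrap

calls = {"n": 0}

@tolerant(retries=3)
def flaky_kernel():
    """Simulated compute region that fails once, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 2:
        raise RuntimeError("simulated detected fault")
    return 42
```

The point of the annotation is exactly the knowledge gap the abstract describes: only the programmer knows that re-executing this region (or tolerating an imprecise result) is acceptable for the application's outcome.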
OMP2MPI: Automatic MPI code generation from OpenMP programs
In this paper, we present OMP2MPI, a tool that automatically generates MPI source code from OpenMP. With this transformation, the original program can be adapted to exploit a larger number of processors by surpassing the limits of the node level on large HPC clusters. The transformation can also be useful for adapting the source code to execute on distributed-memory many-cores with message-passing support. In addition, the resulting MPI code can be used as a starting point that can be further optimized by software engineers. The transformation process focuses on detecting OpenMP parallel loops and distributing them in a master/worker pattern. A set of micro-benchmarks has been used to verify the correctness of the transformation and to measure the resulting performance. Surprisingly, the automatically generated code is not only correct by construction, but also often performs faster when executed with MPI.
Comment: Presented at HIP3ES, 2015 (arXiv:1501.03064)
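The master/worker pattern at the heart of the transformation amounts to partitioning the iteration space of a parallel loop among ranks and combining partial results at the master. The sketch below simulates that distribution in plain Python (no actual MPI calls); the chunking scheme and function names are illustrative, not OMP2MPI's generated code.

```python
def split_iterations(n_iters, n_workers):
    """Partition loop iterations [0, n_iters) into contiguous per-worker
    chunks, the way a master hands out work to workers."""
    chunks = []
    base, extra = divmod(n_iters, n_workers)
    start = 0
    for w in range(n_workers):
        size = base + (1 if w < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks

def master_worker_sum(xs, n_workers=4):
    """Each 'worker' reduces its own chunk; the 'master' combines the
    partial results, mirroring MPI send/receive of partial sums."""
    partials = [sum(xs[i] for i in chunk)
                for chunk in split_iterations(len(xs), n_workers)]
    return sum(partials)
```

In the generated MPI code the per-chunk reductions would run on separate ranks and the combine step would be message passing; the decomposition logic is the same.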
Matching non-uniformity for program optimizations on heterogeneous many-core systems
As computing enters an era of heterogeneity and massive parallelism, it exhibits a distinct feature: the deepening non-uniform relations among the computing elements in both hardware and software. Besides traditional non-uniform memory accesses, much deeper non-uniformity shows up in processors, runtimes, and applications, exemplified by asymmetric cache sharing, memory coalescing, and thread divergence on multicore and many-core processors. Being oblivious to this non-uniformity, current applications fail to tap into the full potential of modern computing devices.
My research presents a systematic exploration of this emerging property. It examines the existence of such a property in modern computing, its influence on computing efficiency, and the challenges of establishing a non-uniformity-aware paradigm. I propose several techniques to translate the property into efficiency, including data reorganization to eliminate non-coalesced accesses, asynchronous data transformations for locality enhancement, and controllable scheduling for exploiting non-uniformity among thread blocks. The experiments show much promise for these techniques in maximizing computing throughput, especially for programs with complex data access patterns.
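The data-reorganization technique mentioned above is commonly an array-of-structures to structure-of-arrays transform: when parallel threads each read the same field of consecutive records, storing each field contiguously lets those reads coalesce into few memory transactions. The sketch below shows the layout transform itself; the field names are illustrative.

```python
def aos_to_soa(records):
    """Reorganize an array of structures into a structure of arrays, so that
    threads reading the same field of consecutive records touch contiguous
    memory (the coalesced access pattern GPUs favor).
    `records` is a list of (x, y, z) tuples."""
    xs, ys, zs = [], [], []
    for x, y, z in records:
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return {"x": xs, "y": ys, "z": zs}
```

In the AoS layout, thread i reading field x touches address stride-3 apart from thread i+1; after the transform, the same accesses are unit-stride, which is the property the reorganization is designed to restore.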
Phase distance mapping: a phase-based cache tuning methodology for embedded systems
Networked embedded systems typically leverage a collection of low-power
embedded systems (nodes) to collaboratively execute applications spanning
diverse application domains (e.g., video, image processing, communication,
etc.) with diverse application requirements. The individual networked nodes
must operate under stringent constraints (e.g., energy, memory, etc.) and
should be specialized to meet varying application requirements in order to
adhere to these constraints. Phase-based tuning specializes system tunable
parameters to the varying runtime requirements of different execution phases to
meet optimization goals. Since the design space for tunable systems can be very
large, one of the major challenges in phase-based tuning is determining the
best configuration for each phase without incurring significant tuning overhead
(e.g., energy and/or performance) during design space exploration. In this
paper, we propose phase distance mapping, which directly determines the best
configuration for a phase, thereby eliminating design space exploration. Phase
distance mapping applies the correlation between the characteristics and best
configuration of a known phase to determine the best configuration of a new
phase. Experimental results verify that our phase distance mapping approach, when applied to cache tuning, determines cache configurations within 1% of the optimal configurations on average and yields an energy-delay product savings of 27% on average.
Comment: 26 pages, Springer Design Automation for Embedded Systems, Special Issue on Networked Embedded Systems, 201
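The mapping step can be sketched as nearest-neighbor lookup: characterize the new phase, find the known phase with the smallest characteristic distance, and reuse its best configuration, skipping exploration entirely. The characteristic vectors, distance metric, and configuration labels below are invented for illustration.

```python
def phase_distance(a, b):
    """Distance between two phases' characteristic vectors (e.g. cache miss
    rate, working-set size); a simple L1 distance for illustration."""
    return sum(abs(x - y) for x, y in zip(a, b))

def predict_config(new_phase, known_phases):
    """Return the best-known configuration of the closest known phase,
    eliminating design space exploration for the new phase.
    `known_phases` is a list of (characteristics, best_config) pairs."""
    _, best_config = min(
        known_phases,
        key=lambda kp: phase_distance(new_phase, kp[0]))
    return best_config
```

The approach trades a small accuracy loss (the 1%-of-optimal figure above) for eliminating the tuning overhead of exploring each new phase's configuration space.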
Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights
Machine learning (ML) models are widely used in many domains including media
processing and generation, computer vision, medical diagnosis, embedded
systems, high-performance and scientific computing, and recommendation systems.
For efficiently processing these computational- and memory-intensive
applications, tensors of these over-parameterized models are compressed by
leveraging sparsity, size reduction, and quantization of tensors. Unstructured
sparsity and tensors with varying dimensions yield irregular-shaped
computation, communication, and memory access patterns; processing them on
hardware accelerators in a conventional manner does not inherently leverage
acceleration opportunities. This paper provides a comprehensive survey on how
to efficiently execute sparse and irregular tensor computations of ML models on
hardware accelerators. In particular, it discusses additional enhancement
modules in architecture design and software support; categorizes different
hardware designs and acceleration techniques and analyzes them in terms of
hardware and execution costs; highlights further opportunities in terms of
hardware/software/algorithm co-design optimizations and joint optimizations
among described hardware and software enhancement modules. The takeaways from
this paper include: understanding the key challenges in accelerating sparse,
irregular-shaped, and quantized tensors; understanding enhancements in
acceleration systems for supporting their efficient computations; analyzing
trade-offs in opting for a specific type of design enhancement; understanding
how to map and compile models with sparse tensors on the accelerators;
understanding recent design trends for efficient accelerations and further
opportunities.
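A concrete instance of the irregular computation the survey targets is sparse matrix-vector multiply over a compressed format such as CSR, where only nonzeros are stored and accessed. The minimal sketch below shows the format and the resulting indirect, irregular access pattern (`x[col_idx[k]]`) that accelerators must handle efficiently.

```python
def to_csr(dense):
    """Compress a dense matrix (list of rows) into CSR form:
    nonzero values, their column indices, and per-row offsets."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr

def csr_spmv(values, col_idx, row_ptr, x):
    """Sparse matrix-vector multiply: only nonzeros are touched, and the
    gather x[col_idx[k]] is the irregular memory access pattern that
    conventional dense accelerators do not inherently exploit."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

Variable row lengths (`row_ptr[r+1] - row_ptr[r]`) are what cause the load imbalance and irregular memory traffic that the surveyed hardware designs work to mitigate.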