2,411 research outputs found
Relay: A New IR for Machine Learning Frameworks
Machine learning powers diverse services in industry including search,
translation, recommendation systems, and security. The scale and importance of
these models require that they be efficient, expressive, and portable across an
array of heterogeneous hardware devices. These constraints are often at odds;
in order to better accommodate them we propose a new high-level intermediate
representation (IR) called Relay. Relay is being designed as a
purely-functional, statically-typed language with the goal of balancing
efficient compilation, expressiveness, and portability. We discuss the goals of
Relay and highlight its important design constraints. Our prototype is part of
the open source NNVM compiler framework, which powers Amazon's deep learning
framework MxNet
Link-time smart card code hardening
This paper presents a feasibility study to protect smart card software against fault-injection attacks by means of link-time code rewriting. This approach avoids the drawbacks of source code hardening, avoids the need for manual assembly writing, and is applicable in conjunction with closed third-party compilers. We implemented a range of cookbook code hardening recipes in a prototype link-time rewriter and evaluate their coverage and associated overhead to conclude that this approach is promising. We demonstrate that the overhead of using an automated link-time approach is not significantly higher than what can be obtained with compile-time hardening or with manual hardening of compiler-generated assembly code
Blocked All-Pairs Shortest Paths Algorithm on Intel Xeon Phi KNL Processor: A Case Study
Manycores are consolidating in HPC community as a way of improving
performance while keeping power efficiency. Knights Landing is the recently
released second generation of Intel Xeon Phi architecture. While optimizing
applications on CPUs, GPUs and first Xeon Phi's has been largely studied in the
last years, the new features in Knights Landing processors require the revision
of programming and optimization techniques for these devices. In this work, we
selected the Floyd-Warshall algorithm as a representative case study of graph
and memory-bound applications. Starting from the default serial version, we
show how data, thread and compiler level optimizations help the parallel
implementation to reach 338 GFLOPS.Comment: Computer Science - CACIC 2017. Springer Communications in Computer
and Information Science, vol 79
A Domain-Specific Language and Editor for Parallel Particle Methods
Domain-specific languages (DSLs) are of increasing importance in scientific
high-performance computing to reduce development costs, raise the level of
abstraction and, thus, ease scientific programming. However, designing and
implementing DSLs is not an easy task, as it requires knowledge of the
application domain and experience in language engineering and compilers.
Consequently, many DSLs follow a weak approach using macros or text generators,
which lack many of the features that make a DSL a comfortable for programmers.
Some of these features---e.g., syntax highlighting, type inference, error
reporting, and code completion---are easily provided by language workbenches,
which combine language engineering techniques and tools in a common ecosystem.
In this paper, we present the Parallel Particle-Mesh Environment (PPME), a DSL
and development environment for numerical simulations based on particle methods
and hybrid particle-mesh methods. PPME uses the meta programming system (MPS),
a projectional language workbench. PPME is the successor of the Parallel
Particle-Mesh Language (PPML), a Fortran-based DSL that used conventional
implementation strategies. We analyze and compare both languages and
demonstrate how the programmer's experience can be improved using static
analyses and projectional editing. Furthermore, we present an explicit domain
model for particle abstractions and the first formal type system for particle
methods.Comment: Submitted to ACM Transactions on Mathematical Software on Dec. 25,
201
SEER: Super-Optimization Explorer for HLS using E-graph Rewriting with MLIR
High-level synthesis (HLS) is a process that automatically translates a
software program in a high-level language into a low-level hardware
description. However, the hardware designs produced by HLS tools still suffer
from a significant performance gap compared to manual implementations. This is
because the input HLS programs must still be written using hardware design
principles.
Existing techniques either leave the program source unchanged or perform a
fixed sequence of source transformation passes, potentially missing
opportunities to find the optimal design. We propose a super-optimization
approach for HLS that automatically rewrites an arbitrary software program into
efficient HLS code that can be used to generate an optimized hardware design.
We developed a toolflow named SEER, based on the e-graph data structure, to
efficiently explore equivalent implementations of a program at scale. SEER
provides an extensible framework, orchestrating existing software compiler
passes and hardware synthesis optimizers.
Our work is the first attempt to exploit e-graph rewriting for large software
compiler frameworks, such as MLIR. Across a set of open-source benchmarks, we
show that SEER achieves up to 38x the performance within 1.4x the area of the
original program. Via an Intel-provided case study, SEER demonstrates the
potential to outperform manually optimized designs produced by hardware
experts
Achieving High-Performance the Functional Way: A Functional Pearl on Expressing High-Performance Optimizations as Rewrite Strategies
Optimizing programs to run efficiently on modern parallel hardware is hard but crucial for many applications. The predominantly used imperative languages - like C or OpenCL - force the programmer to intertwine the code describing functionality and optimizations. This results in a portability nightmare that is particularly problematic given the accelerating trend towards specialized hardware devices to further increase efficiency.
Many emerging DSLs used in performance demanding domains such as deep learning or high-performance image processing attempt to simplify or even fully automate the optimization process. Using a high-level - often functional - language, programmers focus on describing functionality in a declarative way. In some systems such as Halide or TVM, a separate schedule specifies how the program should be optimized. Unfortunately, these schedules are not written in well-defined programming languages. Instead, they are implemented as a set of ad-hoc predefined APIs that the compiler writers have exposed.
In this functional pearl, we show how to employ functional programming techniques to solve this challenge with elegance. We present two functional languages that work together - each addressing a separate concern. RISE is a functional language for expressing computations using well known functional data-parallel patterns. ELEVATE is a functional language for describing optimization strategies. A high-level RISE program is transformed into a low-level form using optimization strategies written in ELEVATE . From the rewritten low-level program high-performance parallel code is automatically generated. In contrast to existing high-performance domain-specific systems with scheduling APIs, in our approach programmers are not restricted to a set of built-in operations and optimizations but freely define their own computational patterns in RISE and optimization strategies in ELEVATE in a composable and reusable way. We show how our holistic functional approach achieves competitive performance with the state-of-the-art imperative systems Halide and TVM
Reductie van het geheugengebruik van besturingssysteemkernen Memory Footprint Reduction for Operating System Kernels
In ingebedde systemen is er vaak maar een beperkte hoeveelheid geheugen beschikbaar. Daarom wordt er veel aandacht besteed aan het produceren van compacte programma's voor deze systemen, en zijn er allerhande technieken ontwikkeld die automatisch het geheugengebruik van programma's kunnen verkleinen. Tot nu toe richtten die technieken zich voornamelijk op de toepassingssoftware die op het systeem draait, en werd het besturingssysteem over het hoofd gezien. In dit proefschrift worden een aantal technieken beschreven die het mogelijk maken om op een geautomatiseerde manier het geheugengebruik van een besturingssysteemkern gevoelig te verkleinen. Daarbij wordt in eerste instantie gebruik gemaakt van compactietransformaties tijdens het linken. Als we de hardware en software waaruit het systeem samengesteld is kennen, is het mogelijk om nog verdere reducties te bekomen. Daartoe wordt de kern gespecialiseerd voor een bepaalde hardware-software combinatie. Overbodige functionaliteit wordt opgespoord en uit de kern verwijderd, terwijl de resterende functionaliteit wordt aangepast aan de specifieke gebruikspatronen die uit de hardware en software kunnen afgeleid worden. Als laatste worden technieken voorgesteld die het mogelijk maken om weinig of niet uitgevoerde code (bijvoorbeeld code voor het afhandelen van slechts zeldzaam optredende foutcondities) uit het geheugen te verwijderen. Deze code wordt dan enkel ingeladen op het moment dat ze effectief nodig is. Voor ons testsysteem kunnen we met de gecombineerde technieken het geheugengebruik van een Linux 2.4 kern met meer dan 48% verminderen
- …