18 research outputs found
TRACO: Source-to-Source Parallelizing Compiler
The paper presents a source-to-source compiler, TRACO, for automatic extraction of both coarse- and fine-grained parallelism available in C/C++ loops. Parallelization techniques implemented in TRACO are based on the transitive closure of a relation describing all the dependences in a loop. Coarse- and fine-grained parallelism is represented with synchronization-free slices (space partitions) and a legal loop statement instance schedule (time partitions), respectively. TRACO enables also applying scalar and array variable privatization as well as parallel reduction. On its output, TRACO produces compilable parallel OpenMP C/C++ and/or OpenACC C/C++ code. The effectiveness of TRACO, efficiency of parallel code produced by TRACO, and the time of parallel code production are evaluated by means of the NAS Parallel Benchmark and Polyhedral Benchmark suites. These features of TRACO are compared with closely related compilers such as ICC, Pluto, Par4All, and Cetus. Feature work is outlined
Recommended from our members
Logical partitioning of parallel system simulations
Simulation has been a fundamental tool to prototype, hypothesize, and evaluate
new ideas to continue improving system performance. However, increasing levels
of processor parallelism and heterogeneity have introduced additional
constraints when evaluating new designs. The work embodied in this dissertation
explores how to leverage novel ideas in simulator partitioning to improve
simulator speed and flexibility for simulating these new types of systems.
The contribution of this work includes the introduction of optimistic
partitioned simulation to improve parallelization, and the introduction of
warped partitioned simulation for improved flexibility. These ideas are refined
and demonstrated through the use of prototypes to demonstrate their benefits
compared to state-of-the-art approaches. By leveraging partitioning in a
structured manner, it is possible to design simulators that better address the
open challenges of parallel and heterogeneous systems design.Electrical and Computer Engineerin
Analysis and optimization of task granularity on the Java virtual machine
Task granularity, i.e., the amount of work performed by parallel tasks, is a key performance attribute of parallel applications. On the one hand, fine-grained tasks (i.e., small tasks carrying out few computations) may introduce considerable parallelization overheads. On the other hand, coarse-grained tasks (i.e., large tasks performing substantial computations) may not fully utilize the available CPU cores, leading to missed parallelization opportunities. We focus on task-parallel applications running in a single Java Virtual Machine on a shared- memory multicore. Despite their performance may considerably depend on the granularity of their tasks, this topic has received little attention in the literature. Our work fills this gap, analyzing and optimizing the task granularity of such applications. In this dissertation, we present a new methodology to accurately and efficiently collect the granularity of each executed task, implemented in a novel profiler. Our profiler collects carefully selected metrics from the whole system stack with low overhead. Our tool helps developers locate performance and scalability problems, and identifies classes and methods where optimizations related to task granularity are needed, guiding developers towards useful optimizations. Moreover, we introduce a novel technique to drastically reduce the overhead of task-granularity profiling, by reifying the class hierarchy of the target application within a separate instrumentation process. Our approach allows the instrumentation process to instrument only the classes representing tasks, inserting more efficient instrumentation code which decreases the overhead of task detection. Our technique significantly speeds up task-granularity profiling and so enables the collection of accurate metrics with low overhead.We use our novel techniques to analyze task granularity in the DaCapo, ScalaBench, and Spark Perf benchmark suites. We reveal inefficiencies related to fine-grained and coarse-grained tasks in several workloads. We demonstrate that the collected task-granularity profiles are actionable by optimizing task granularity in numerous benchmarks, performing optimizations in classes and methods indicated by our tool. Our optimizations result in significant speedups (up to a factor of 5.90x) in numerous workloads suffering from fine- and coarse-grained tasks in different environments. Our results highlight the importance of analyzing and optimizing task granularity on the Java Virtual Machine
Compile-time support for thread-level speculation
Una de las principales preocupaciones de las ciencias de la computación es el estudio de las capacidades paralelas tanto de programas como de los procesadores que los ejecutan. Existen varias razones que hacen muy deseable el desarrollo de técnicas que paralelicen automáticamente el código. Entre ellas se encuentran el inmenso número de programas secuenciales existentes ya escritos, la complejidad de los lenguajes de programación paralelos, y los conocimientos que se requieren para paralelizar un código. Sin embargo, los actuales mecanismos de paralelización automática implementados en los compiladores comerciales no son capaces de paralelizar la mayoría de los bucles en un código [1], debido a la dependencias de datos que existen entre ellos [2]. Por lo tanto, se hace necesaria la búsqueda de nuevas técnicas, como la paralelización especulativa [3-5], que saquen beneficio de las potenciales capacidades paralelas del hardware y arquitecturas multiprocesador actuales. Sin embargo, ésta y otras técnicas requieren la intervención manual de programadores experimentados.
Antes de ofrecer soluciones alternativas, se han evaluado las capacidades de paralelización de los compiladores comerciales, exponiendo las limitaciones de los mecanismos de paralelización automática que implementan. El estudio revela que estos mecanismos de paralelización automática sólo alcanzan un 19% de speedup en promedio para los benchmarks del SPEC CPU2006 [6], siendo este un resultado significativamente inferior al obtenido por técnicas de paralelización especulativa [7]. Sin embargo, la paralelización especulativa requiere una extensa modificación manual del código por parte de programadores.
Esta Tesis aborda este problema definiendo una nueva cláusula OpenMP [8], llamada ¿speculative¿, que permite señalar qué variables pueden llevar a una violación de dependencia. Además, esta Tesis también propone un sistema en tiempo de compilación que, usando la información sobre los accesos a las variables que proporcionan las cláusulas OpenMP, añade automáticamente todo el código necesario para gestionar la ejecución especulativa de un programa. Esto libera al programador de modificar el código manualmente, evitando posibles errores y una tediosa tarea. El código generado por nuestro sistema enlaza con la librería de ejecución especulativamente paralela desarrollada por Estebanez, García-Yagüez, Llanos y Gonzalez-Escribano [9,10].Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos
Performance analysis methods for understanding scaling bottlenecks in multi-threaded applications
In dit proefschrift stellen we drie nieuwe methodes voor om de prestatie van meerdradige programma's te analyseren. Onze eerste methode, criticality stacks, is bruikbaar voor het analyseren van onevenwicht tussen draden. Om deze stacks te construeren stellen we een nieuwe criticaliteitsmetriek voor, die de uitvoeringstijd van een applicatie opsplitst in een deel voor iedere draad. Hoe groter dit deel is voor een draad, hoe kritischer deze draad is voor de applicatie. De tweede methode, bottle graphs, stelt iedere draad van een meerdradig programma voor als een rechthoek in een grafiek. De hoogte van de rechthoek wordt berekend door middel van onze criticaliteitsmetriek, en de breedte stelt het parallellisme van een draad voor. Rechthoeken die bovenaan in de grafiek zitten, als het ware in de hals van de fles, hebben een beperkt parallellisme, waardoor we ze beschouwen als “bottlenecks” voor de applicatie. Onze derde methode, speedup stacks, toont de bereikte speedup van een applicatie en de verschillende componenten die speedup beperken in een gestapelde grafiek. De intuïtie achter dit concept is dat door het reduceren van de invloed van een bepaalde component, de speedup van een applicatie proportioneel toeneemt met de grootte van die component in de stapel
Structured parallelism discovery with hybrid static-dynamic analysis and evaluation technique
Parallel computer architectures have dominated the computing landscape for the
past two decades; a trend that is only expected to continue and intensify, with increasing specialization and heterogeneity. This creates huge pressure across the software
stack to produce programming languages, libraries, frameworks and tools which will
efficiently exploit the capabilities of parallel computers, not only for new software, but
also revitalizing existing sequential code. Automatic parallelization, despite decades of
research, has had limited success in transforming sequential software to take advantage
of efficient parallel execution. This thesis investigates three approaches that use commutativity analysis as the enabler for parallelization. This has the potential to overcome
limitations of traditional techniques.
We introduce the concept of liveness-based commutativity for sequential loops.
We examine the use of a practical analysis utilizing liveness-based commutativity in a
symbolic execution framework. Symbolic execution represents input values as groups
of constraints, consequently deriving the output as a function of the input and enabling
the identification of further program properties. We employ this feature to develop an
analysis and discern commutativity properties between loop iterations. We study the
application of this approach on loops taken from real-world programs in the OLDEN
and NAS Parallel Benchmark (NPB) suites, and identify its limitations and related
overheads.
Informed by these findings, we develop Dynamic Commutativity Analysis (DCA), a
new technique that leverages profiling information from program execution with specific
input sets. Using profiling information, we track liveness information and detect loop
commutativity by examining the code’s live-out values. We evaluate DCA against almost
1400 loops of the NPB suite, discovering 86% of them as parallelizable. Comparing
our results against dependence-based methods, we match the detection efficacy of two
dynamic and outperform three static approaches, respectively. Additionally, DCA is
able to automatically detect parallelism in loops which iterate over Pointer-Linked
Data Structures (PLDSs), taken from wide range of benchmarks used in the literature,
where all other techniques we considered failed. Parallelizing the discovered loops, our
methodology achieves an average speedup of 3.6× across NPB (and up to 55×) and up
to 36.9× for the PLDS-based loops on a 72-core host. We also demonstrate that our
methodology, despite relying on specific input values for profiling each program, is able
to correctly identify parallelism that is valid for all potential input sets.
Lastly, we develop a methodology to utilize liveness-based commutativity, as implemented in DCA, to detect latent loop parallelism in the shape of patterns. Our approach
applies a series of transformations which subsequently enable multiple applications
of DCA over the generated multi-loop code section and match its loop commutativity
outcomes against the expected criteria for each pattern. Applying our methodology on
sets of sequential loops, we are able to identify well-known parallel patterns (i.e., maps,
reduction and scans). This extends the scope of parallelism detection to loops, such
as those performing scan operations, which cannot be determined as parallelizable by
simply evaluating liveness-based commutativity conditions on their original form
Recommended from our members
Supervised Design-Space Exploration
Low-cost Very Large Scale Integration (VLSI) electronics have revolutionized daily life and expanded the role of computation in science and engineering. Meanwhile, process-technology scaling has changed VLSI design to an exploration process that strives for the optimal balance among multiple objectives, such as power, performance, and area, i.e. multi-objective Pareto-set optimization. Besides, modern VLSI design has shifted to synthesis-centric methodologies in order to boost the design productivity, which leads to better design quality given limited time and resources. However, current decade-old synthesis-centric design methodologies suffer from: (i) long synthesis tool runtime, (ii) elusive optimal setting of many synthesis knobs, (iii) limitation to one design implementation per synthesis run, and (iv) limited capability of digesting only component-level designs as opposed to holistic system-wide synthesis. These challenges make Design Space Exploration (DSE) with synthesis tools a daunting task for both novice and experienced VLSI designers, thus stagnating the development of more powerful (i.e. more complex) computer systems.
To address these challenges, I propose Supervised Design-Space Exploration (SDSE), an abstraction layer between a designer and a synthesis tool, aiming to autonomously supervise synthesis jobs for DSE. For system-level exploration, SDSE can approximate a system Pareto set given limited information: only lightweight component characterization is required, yet the necessary component synthesis jobs are discovered on-the-fly in order to compose the system Pareto set. For component-level exploration, SDSE can approximate a component Pareto set by iteratively refining the approximation with promising knob settings, guided by synthesis-result estimation with machine-learning models. Combined, SDSE has been applied with the three major synthesis stages, namely high-level, logic, and physical synthesis, to the design of heterogeneous accelerator cores as well as high-performance processor cores. In particular, SDSE has been successfully integrated into the IBM Synthesis Tuning System, yielding 20% better circuit performance than the original system on the design of a 22nm server processor that is currently in production.
Looking ahead, SDSE can be applied to other VLSI designs beyond the accelerator and the programmable cores. Moreover, SDSE opens several research avenues for: (i) new development and deployment platforms of synthesis tools, (ii) large-scale collaborative design engineering, and (iii) new computer-aided design approaches for new classes of systems beyond VLSI chips
Automated detection of structured coarse-grained parallelism in sequential legacy applications
The efficient execution of sequential legacy applications on modern, parallel computer
architectures is one of today’s most pressing problems. Automatic parallelization
has been investigated as a potential solution for several decades but its success
generally remains restricted to small niches of regular, array-based applications.
This thesis investigates two techniques that have the potential to overcome these
limitations.
Beginning at the lowest level of abstraction, the binary executable, it presents
a study of the limits of Dynamic Binary Parallelization (Dbp), a recently proposed
technique that takes advantage of an underlying multicore host to transparently
parallelize a sequential binary executable. While still in its infancy, Dbp has received
broad interest within the research community. This thesis seeks to gain an
understanding of the factors contributing to the limits of Dbp and the costs and
overheads of its implementation. An extensive evaluation using a parameterizable
Dbp system targeting a Cmp with light-weight architectural Tls support is presented.
The results show that there is room for a significant reduction of up to 54%
in the number of instructions on the critical paths of legacy Spec Cpu2006 benchmarks,
but that it is much harder to translate these savings into actual performance
improvements, with a realistic hardware-supported implementation achieving a
speedup of 1.09 on average.
While automatically parallelizing compilers have traditionally focused on data
parallelism, additional parallelism exists in a plethora of other shapes such as task
farms, divide & conquer, map/reduce and many more. These algorithmic skeletons,
i.e. high-level abstractions for commonly used patterns of parallel computation,
differ substantially from data parallel loops. Unfortunately, algorithmic skeletons
are largely informal programming abstractions and are lacking a formal characterization
in terms of established compiler concepts. This thesis develops compiler-friendly
characterizations of popular algorithmic skeletons using a novel notion of
commutativity based on liveness. A hybrid static/dynamic analysis framework for
the context-sensitive detection of skeletons in legacy code that overcomes limitations
of static analysis by complementing it with profiling information is described.
A proof-of-concept implementation of this framework in the Llvm compiler infrastructure
is evaluated against Spec Cpu2006 benchmarks for the detection of a typical skeleton. The results illustrate that skeletons are often context-sensitive in
nature.
Like the two approaches presented in this thesis, many dynamic parallelization
techniques exploit the fact that some statically detected data and control
flow dependences do not manifest themselves in every possible program execution
(may-dependences) but occur only infrequently, e.g. for some corner cases, or not
at all for any legal program input. While the effectiveness of dynamic parallelization
techniques critically depends on the absence of such dependences, not much
is known about their nature. This thesis presents an empirical analysis and characterization
of the variability of both data dependences and control flow across
program runs. The cBench benchmark suite is run with 100 randomly chosen
input data sets to generate whole-program control and data flow graphs (Cdfgs)
for each run, which are then compared to obtain a measure of the variance in the
observed control and data flow. The results show that, on average, the cumulative
profile information gathered with at least 55, and up to 100, different input data
sets is needed to achieve full coverage of the data flow observed across all runs.
For control flow, the figure stands at 46 and 100 data sets, respectively. This suggests
that profile-guided parallelization needs to be applied with utmost care, as
misclassification of sequential loops as parallel was observed even when up to 94
input data sets are used
Efficient Engineering and Execution of Pipe-and-Filter Architectures
Pipe-and-Filter (P&F) is a well-known and often used architectural style.
However, to the best of our knowledge, there is no P&F framework which
can model and execute generic P&F architectures. For example, the frameworks
Fastflow, StreamIt, and Spark do not support multiple input and
output streams per filter and thus cannot model branches. Other frameworks
focus on very specific use cases and neglect type-safety when interconnecting
filters. Furthermore, an efficient parallel execution of P&F architectures
is still an open challenge. Although some available frameworks can execute
filters in parallel, there is much potential for optimization. Unfortunately,
most frameworks have a fixed execution strategy which cannot be altered
without major changes.
In this thesis, we present our generic and parallel P&F framework
TeeTime. It is able to model and to execute arbitrary P&F architectures.
Simultaneously, it is open for modifications in order to experiment with the
P&F style. Moreover, it allows to execute filters in parallel by utilizing the
capabilities of contemporary multi-core processor systems.
Extensive lab experiments show that TeeTime imposes a very low and in
certain cases even no runtime overhead compared to implementations without
framework abstractions. Moreover, several case studies from research
and industry show TeeTime’s broad applicability and extensibility. We provide
a replication package containing all our experimental data and code
to facilitate the verifiability, repeatability, and further extensibility of our
results. Furthermore, we provide reference implementations for TeeTime in
Java and in C++ as open source software on https://teetime-framework.github.io
Formally designing and implementing cyber security mechanisms in industrial control networks.
This dissertation describes progress in the state-of-the-art for developing and deploying formally verified cyber security devices in industrial control networks. It begins by detailing the unique struggles that are faced in industrial control networks and why concepts and technologies developed for securing traditional networks might not be appropriate. It uses these unique struggles and examples of contemporary cyber-attacks targeting control systems to argue that progress in securing control systems is best met with formal verification of systems, their specifications, and their security properties. This dissertation then presents a development process and identifies two technologies, TLA+ and seL4, that can be leveraged to produce a high-assurance embedded security device. The method presented in this dissertation takes an informal design of an embedded device that might be found in a control system and 1) formalizes the design within TLA+, 2) creates and mechanically checks a model built from the formal design, and 3) translates the TLA+ design into a component-based architecture of a native seL4 application. The later chapters of this dissertation describe an application of the process to a security preprocessor embedded device that was designed to add security mechanisms to the network communication of an existing control system. The device and its security properties are formally specified in TLA+ in chapter 4, mechanically checked in chapter 5, and finally its native seL4 architecture is implemented in chapter 6. Finally, the conclusions derived from the research are laid out, as well as some possibilities for expanding the presented method in the future