Automatic Performance Optimization of Stencil Codes
Stencil codes are a widely used class of codes. Their general structure is very simple: data points in a large grid are repeatedly recomputed from neighboring values. This predefined neighborhood is the so-called stencil. Despite their very simple structure, stencil codes are hard to optimize since only a few computations are performed while a comparatively large number of values have to be accessed, i.e., stencil codes usually have a very low computational intensity. Moreover, the set of optimizations and their parameters also depend on the hardware on which the code is executed.
In short, current production compilers are not able to fully optimize this class of codes, and optimizing each application by hand is not practical. As a remedy, we propose a set of optimizations and describe how they can be applied automatically by a code generator for the domain of stencil codes. A combination of space and time tiling is able to increase the data locality, which significantly reduces the memory-bandwidth requirements: a standard three-dimensional 7-point Jacobi stencil can be accelerated by a factor of 3. This optimization can target basically any stencil code, while others are more specialized. For example, support for arbitrary linear data layout transformations is especially beneficial for colored kernels, such as a Red-Black Gauss-Seidel smoother. On the one hand, an optimized data layout for such kernels reduces the bandwidth requirements while, on the other hand, it simplifies an explicit vectorization.
Other notable optimizations described in detail are redundancy elimination techniques to eliminate common subexpressions both in a sequence of statements and across loop boundaries, arithmetic simplifications and normalizations, and the vectorization mentioned previously. In combination, these optimizations are able to increase the performance not only of the model problem given by Poisson’s equation, but also of real-world applications: an optical flow simulation and the simulation of a non-isothermal and non-Newtonian fluid flow.
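To make the stencil structure concrete, the following is a minimal NumPy sketch, not the code generator described in the abstract. It shows one Jacobi sweep based on the three-dimensional 7-point stencil for the Laplace equation (a zero right-hand side is an assumption made here), next to a spatially tiled variant that visits the grid block by block. The function names and the `tile` parameter are hypothetical, and time tiling is not shown.

```python
import numpy as np

def jacobi_sweep(u):
    """One Jacobi sweep: each interior point becomes the mean of its six neighbors."""
    new = u.copy()  # boundary values are kept unchanged
    new[1:-1, 1:-1, 1:-1] = (
        u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1]
        + u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1]
        + u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:]
    ) / 6.0
    return new

def jacobi_sweep_tiled(u, tile=4):
    """Same sweep, but the interior is traversed in tile x tile x tile blocks."""
    new = u.copy()
    nx, ny, nz = u.shape
    for i0 in range(1, nx - 1, tile):
        i1 = min(i0 + tile, nx - 1)
        for j0 in range(1, ny - 1, tile):
            j1 = min(j0 + tile, ny - 1)
            for k0 in range(1, nz - 1, tile):
                k1 = min(k0 + tile, nz - 1)
                # Reads come from the old grid `u`, so blocks are independent.
                new[i0:i1, j0:j1, k0:k1] = (
                    u[i0 - 1:i1 - 1, j0:j1, k0:k1] + u[i0 + 1:i1 + 1, j0:j1, k0:k1]
                    + u[i0:i1, j0 - 1:j1 - 1, k0:k1] + u[i0:i1, j0 + 1:j1 + 1, k0:k1]
                    + u[i0:i1, j0:j1, k0 - 1:k1 - 1] + u[i0:i1, j0:j1, k0 + 1:k1 + 1]
                ) / 6.0
    return new
```

Both functions compute the same result; the tiled variant only changes the traversal order. In NumPy this does not by itself yield a speedup; it merely illustrates the iteration structure that a tiling transformation produces in compiled code.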
Data-parallel concurrent constraint programming.
by Bo-ming Tong.
Thesis (M.Phil.)--Chinese University of Hong Kong, 1994.
Includes bibliographical references (leaves 104-[110]).
Chapter 1 --- Introduction
Chapter 1.1 --- Concurrent Constraint Programming
Chapter 1.2 --- Finite Domain Constraints
Chapter 2 --- The Firebird Language
Chapter 2.1 --- Finite Domain Constraints
Chapter 2.2 --- The Firebird Computation Model
Chapter 2.3 --- Miscellaneous Features
Chapter 2.4 --- Clause-Based Nondeterminism
Chapter 2.5 --- Programming Examples
Chapter 2.5.1 --- Magic Series
Chapter 2.5.2 --- Weak Queens
Chapter 3 --- Operational Semantics
Chapter 3.1 --- The Firebird Computation Model
Chapter 3.2 --- The Firebird Commit Law
Chapter 3.3 --- Derivation
Chapter 3.4 --- Correctness of Firebird Computation Model
Chapter 4 --- Exploitation of Data-Parallelism in Firebird
Chapter 4.1 --- An Illustrative Example
Chapter 4.2 --- Mapping Partitions to Processor Elements
Chapter 4.3 --- Masks
Chapter 4.4 --- Control Strategy
Chapter 4.4.1 --- A Control Strategy Suitable for Linear Equations
Chapter 5 --- Data-Parallel Abstract Machine
Chapter 5.1 --- Basic DPAM
Chapter 5.1.1 --- Hardware Requirements
Chapter 5.1.2 --- Procedure Calling Convention And Process Creation
Chapter 5.1.3 --- Memory Model
Chapter 5.1.4 --- Registers
Chapter 5.1.5 --- Process Management
Chapter 5.1.6 --- Unification
Chapter 5.1.7 --- Variable Table
Chapter 5.2 --- DPAM with Backtracking
Chapter 5.2.1 --- Choice Point
Chapter 5.2.2 --- Trailing
Chapter 5.2.3 --- Recovering the Process Queues
Chapter 6 --- Implementation
Chapter 6.1 --- The DECmpp Massively Parallel Computer
Chapter 6.2 --- Implementation Overview
Chapter 6.3 --- Constraints
Chapter 6.3.1 --- Breaking Down Equality Constraints
Chapter 6.3.2 --- Processing the Constraint 'As Is'
Chapter 6.4 --- The Wide-Tag Architecture
Chapter 6.5 --- Register Window
Chapter 6.6 --- Dereferencing
Chapter 6.7 --- Output
Chapter 6.7.1 --- Collecting the Solutions
Chapter 6.7.2 --- Decoding the solution
Chapter 7 --- Performance
Chapter 7.1 --- Uniprocessor Performance
Chapter 7.2 --- Solitary Mode
Chapter 7.3 --- Bit Vectors of Domain Variables
Chapter 7.4 --- Heap Consumption of the Heap Frame Scheme
Chapter 7.5 --- Eager Nondeterministic Derivation vs Lazy Nondeterministic Derivation
Chapter 7.6 --- Priority Scheduling
Chapter 7.7 --- Execution Profile
Chapter 7.8 --- Effect of the Number of Processor Elements on Performance
Chapter 7.9 --- Change of the Degree of Parallelism During Execution
Chapter 8 --- Related Work
Chapter 8.1 --- Vectorization of Prolog
Chapter 8.2 --- Parallel Clause Matching
Chapter 8.3 --- Parallel Interpreter
Chapter 8.4 --- Bounded Quantifications
Chapter 8.5 --- SIMD MultiLog
Chapter 9 --- Conclusion
Chapter 9.1 --- Limitations
Chapter 9.1.1 --- Data-Parallel Firebird is Specialized
Chapter 9.1.2 --- Limitations of the Implementation Scheme
Chapter 9.2 --- Future Work
Chapter 9.2.1 --- Extending Firebird
Chapter 9.2.2 --- Improvements Specific to DECmpp
Chapter 9.2.3 --- Labeling
Chapter 9.2.4 --- Parallel Domain Consistency
Chapter 9.2.5 --- Branch and Bound Algorithm
Chapter 9.2.6 --- Other Possible Future Work
Bibliography
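Contribution à la convergence d'infrastructure entre le calcul haute performance et le traitement de données à large échelle (Contribution to the infrastructure convergence between high-performance computing and large-scale data processing)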
The amount of data produced, whether in the scientific community or the commercial world, is constantly growing. The field of Big Data has emerged to handle large amounts of data on distributed computing infrastructures. High-Performance Computing (HPC) infrastructures are traditionally used for the execution of compute-intensive workloads. However, the HPC community is also facing an increasing need to process large amounts of data derived from high-definition sensors and large physics apparatuses. The convergence of the two fields, HPC and Big Data, is currently taking place. In fact, the HPC community already uses Big Data tools, which are not always integrated correctly, especially at the level of the file system and the Resource and Job Management System (RJMS).
In order to understand how we can leverage HPC clusters for Big Data usage, and what the challenges are for the HPC infrastructures, we have studied multiple aspects of the convergence. We first provide a survey of software provisioning methods, with a focus on data-intensive applications. We contribute a new RJMS collaboration technique called BeBiDa, which is based on 50 lines of code, whereas similar solutions use at least 1000 times more. We evaluate this mechanism under real conditions and in a simulated environment with our simulator Batsim. Furthermore, we provide extensions to Batsim to support I/O, and showcase the development of a generic file system model along with a Big Data application model. This allows us to complement the BeBiDa real-conditions experiments with simulations while enabling us to study file system dimensioning and trade-offs.
All the experiments and analyses of this work have been done with reproducibility in mind. Based on this experience, we propose to integrate the development workflow and data analysis in the reproducibility mindset, and give feedback on our experiences with a list of best practices.
Iterative Schedule Optimization for Parallelization in the Polyhedron Model
In high-performance computing, one primary objective is to exploit the performance that the given target hardware can deliver to the fullest. Compilers that have the ability to automatically optimize programs for a specific target hardware can be highly useful in this context. Iterative (or search-based) compilation requires little or no prior knowledge and can adapt more easily to concrete programs and target hardware than static cost models and heuristics. Thereby, iterative compilation helps in situations in which static heuristics do not reflect the combination of input program and target hardware well. Moreover, iterative compilation may enable the derivation of more accurate cost models and heuristics for optimizing compilers. In this context, the polyhedron model is of help as it provides not only a mathematical representation of programs but, more importantly, a uniform representation of complex sequences of program transformations by schedule functions. The latter facilitates the systematic exploration of the set of legal transformations of a given program.
Early approaches to purely iterative schedule optimization in the polyhedron model do not limit their search to schedules that preserve program semantics and, thereby, suffer from the need to explore large numbers of illegal schedules. More recent research ensures the legality of program transformations but presumes a sequential rather than a parallel execution of the transformed program. Other approaches do not perform a purely iterative optimization.
We propose an approach to iterative schedule optimization for parallelization and tiling in the polyhedron model. Our approach targets loop programs that profit from data locality optimization and coarse-grained loop parallelization. The schedule search space can be explored either randomly or by means of a genetic algorithm.
To determine a schedule's profitability, we rely primarily on measuring the transformed code's execution time. While benchmarking is accurate, it increases the time and resource consumption of program optimization tremendously and can even make it impractical. We address this limitation by proposing to learn surrogate models from schedules generated and evaluated in previous runs of the iterative optimization and to replace benchmarking by performance prediction to the extent possible.
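The idea of replacing part of the benchmarking with performance prediction can be sketched as follows. This is an illustration under stated assumptions, not the approach's actual model: the abstract does not say which kind of surrogate model is learned, so an ordinary least-squares linear model is swapped in here, and the numeric schedule features and all function names are hypothetical.

```python
import numpy as np

def _design_matrix(features):
    """Append a constant column so the model can learn an intercept."""
    features = np.asarray(features, dtype=float)
    return np.hstack([features, np.ones((len(features), 1))])

def fit_surrogate(features, times):
    """Fit time ~ features @ w + b on schedules measured in earlier runs."""
    coef, *_ = np.linalg.lstsq(_design_matrix(features), np.asarray(times, dtype=float), rcond=None)
    return coef

def predict(coef, features):
    """Predicted execution time for each candidate schedule."""
    return _design_matrix(features) @ coef

def select_for_benchmarking(coef, candidates, keep):
    """Indices of the `keep` candidates with the lowest predicted time.

    Only these candidates would then be compiled and actually measured.
    """
    return np.argsort(predict(coef, candidates))[:keep]
```

In such a setup, every candidate schedule is scored by the cheap model and only the most promising ones are benchmarked, which is one way to trade a small loss in accuracy for a large reduction in measurement effort.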
Our evaluation on the PolyBench 4.1 benchmark set reveals that, in a given setting, iterative schedule optimization yields significantly higher speedups in the execution of the program to be optimized. Surrogate performance models learned from training data that was generated during previous iterative optimizations can reduce the benchmarking effort without strongly impairing the optimization result. A prerequisite for this approach is a sufficient similarity between the training programs and the program to be optimized.
Augmented Computational Design: Methodical Application of Artificial Intelligence in Generative Design
This chapter presents methodological reflections on the necessity and utility
of artificial intelligence in generative design. Specifically, the chapter
discusses how generative design processes can be augmented by AI to deliver in
terms of a few outcomes of interest or performance indicators while dealing
with hundreds or thousands of small decisions. The core of the
performance-based generative design paradigm is about making statistical or
simulation-driven associations between these choices and consequences for
mapping and navigating such a complex decision space. This chapter will discuss
promising directions in Artificial Intelligence for augmenting decision-making
processes in architectural design for mapping and navigating complex design
spaces. Comment: This is the author's version of the book chapter Augmented
Computational Design: Methodical Application of Artificial Intelligence in
Generative Design. In Artificial Intelligence in Performance-Driven Design:
Theories, Methods, and Tools Towards Sustainability, edited by Narjes
Abbasabadi and Mehdi Ashayeri. Wiley, 202
Combining Representation Learning with Logic for Language Processing
The current state-of-the-art in many natural language processing and
automated knowledge base completion tasks is held by representation learning
methods which learn distributed vector representations of symbols via
gradient-based optimization. They require little or no hand-crafted features,
thus avoiding the need for most preprocessing steps and task-specific
assumptions. However, in many cases representation learning requires a large
amount of annotated training data to generalize well to unseen data. Such
labeled training data is provided by human annotators who often use formal
logic as the language for specifying annotations. This thesis investigates
different combinations of representation learning methods with logic for
reducing the need for annotated training data, and for improving
generalization. Comment: PhD Thesis, University College London, Submitted and accepted in 201
Research Reports: 1988 NASA/ASEE Summer Faculty Fellowship Program
The basic objectives are to further the professional knowledge of qualified engineering and science faculty members; to stimulate an exchange of ideas between participants and NASA; to enrich and refresh the research and teaching activities of the participants' institutions; and to contribute to the research objectives of the NASA centers. Topics addressed include: cryogenics; thunderstorm simulation; computer techniques; computer assisted instruction; system analysis; weather forecasting; rocket engine design; crystal growth; control systems design; turbine pumps for the Space Shuttle Main Engine; electron mobility; heat transfer predictions; rotor dynamics; mathematical models; computational fluid dynamics; and structural analysis.
Research reports: 1985 NASA/ASEE Summer Faculty Fellowship Program
A compilation of 40 technical reports on research conducted by participants in the 1985 NASA/ASEE Summer Faculty Fellowship Program at Marshall Space Flight Center (MSFC) is given. Weibull density functions, reliability analysis, directional solidification, space stations, jet stream, fracture mechanics, composite materials, orbital maneuvering vehicles, stellar winds, and gamma ray bursts are among the topics discussed.
Structured parallelism discovery with hybrid static-dynamic analysis and evaluation technique
Parallel computer architectures have dominated the computing landscape for the
past two decades, a trend that is only expected to continue and intensify, with increasing specialization and heterogeneity. This creates huge pressure across the software
stack to produce programming languages, libraries, frameworks and tools which will
efficiently exploit the capabilities of parallel computers, not only for new software, but
also for revitalizing existing sequential code. Automatic parallelization, despite decades of
research, has had limited success in transforming sequential software to take advantage
of efficient parallel execution. This thesis investigates three approaches that use commutativity analysis as the enabler for parallelization. This has the potential to overcome
limitations of traditional techniques.
We introduce the concept of liveness-based commutativity for sequential loops.
We examine the use of a practical analysis utilizing liveness-based commutativity in a
symbolic execution framework. Symbolic execution represents input values as groups
of constraints, consequently deriving the output as a function of the input and enabling
the identification of further program properties. We employ this feature to develop an
analysis and discern commutativity properties between loop iterations. We study the
application of this approach on loops taken from real-world programs in the OLDEN
and NAS Parallel Benchmark (NPB) suites, and identify its limitations and related
overheads.
Informed by these findings, we develop Dynamic Commutativity Analysis (DCA), a
new technique that leverages profiling information from program execution with specific
input sets. Using profiling information, we track liveness information and detect loop
commutativity by examining the code’s live-out values. We evaluate DCA on almost
1400 loops of the NPB suite and find 86% of them to be parallelizable. Comparing
our results against dependence-based methods, we match the detection efficacy of two
dynamic approaches and outperform three static ones. Additionally, DCA is
able to automatically detect parallelism in loops which iterate over Pointer-Linked
Data Structures (PLDSs), taken from a wide range of benchmarks used in the literature,
where all other techniques we considered failed. Parallelizing the discovered loops, our
methodology achieves an average speedup of 3.6× across NPB (and up to 55×) and up
to 36.9× for the PLDS-based loops on a 72-core host. We also demonstrate that our
methodology, despite relying on specific input values for profiling each program, is able
to correctly identify parallelism that is valid for all potential input sets.
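The core intuition behind a dynamic, liveness-based commutativity test can be illustrated with a small sketch. This is a strong simplification and not the DCA implementation: it runs a loop body for one specific input in its original order and in several permuted orders, and compares only the live-out state. All names (`run_loop`, `looks_commutative`, the example bodies) are hypothetical.

```python
import random

def run_loop(body, init_state, iterations, order):
    """Execute `body` once per iteration, in the given order; return the live-out state."""
    state = dict(init_state)  # fresh copy so runs do not affect each other
    for idx in order:
        body(state, iterations[idx])
    return state

def looks_commutative(body, init_state, iterations, trials=20, seed=0):
    """True if all tested iteration orders produce the same live-out state.

    A passing result is evidence for this input only, not a proof for all inputs.
    """
    n = len(iterations)
    reference = run_loop(body, init_state, iterations, range(n))
    rng = random.Random(seed)
    orders = [list(reversed(range(n)))]  # always test the reversed order
    for _ in range(trials):
        order = list(range(n))
        rng.shuffle(order)
        orders.append(order)
    return all(run_loop(body, init_state, iterations, o) == reference for o in orders)

def sum_body(state, x):
    """Reduction: the final value of acc does not depend on iteration order."""
    state["acc"] += x

def horner_body(state, x):
    """Order-dependent update: each iteration scales the previous result."""
    state["acc"] = state["acc"] * 2 + x
```

With the integer input `[1, 2, 3, 4]` and `{"acc": 0}` as initial state, `sum_body` passes the check while `horner_body` fails it, because reversing the iteration order changes its live-out value.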
Lastly, we develop a methodology to utilize liveness-based commutativity, as implemented in DCA, to detect latent loop parallelism in the shape of patterns. Our approach
applies a series of transformations which subsequently enable multiple applications
of DCA over the generated multi-loop code section and match its loop commutativity
outcomes against the expected criteria for each pattern. Applying our methodology on
sets of sequential loops, we are able to identify well-known parallel patterns (i.e., maps,
reductions, and scans). This extends the scope of parallelism detection to loops, such
as those performing scan operations, which cannot be determined as parallelizable by
simply evaluating liveness-based commutativity conditions in their original form.