36 research outputs found

    Automatic Performance Optimization of Stencil Codes

    Get PDF
    A widely used class of codes are stencil codes. Their general structure is very simple: data points in a large grid are repeatedly recomputed from neighboring values. This predefined neighborhood is the so-called stencil. Despite their very simple structure, stencil codes are hard to optimize since only few computations are performed while a comparatively large number of values have to be accessed, i.e., stencil codes usually have a very low computational intensity. Moreover, the set of optimizations and their parameters also depend on the hardware on which the code is executed. To cut a long story short, current production compilers are not able to fully optimize this class of codes and optimizing each application by hand is not practical. As a remedy, we propose a set of optimizations and describe how they can be applied automatically by a code generator for the domain of stencil codes. A combination of a space and time tiling is able to increase the data locality, which significantly reduces the memory-bandwidth requirements: a standard three-dimensional 7-point Jacobi stencil can be accelerated by a factor of 3. This optimization can target basically any stencil code, while others are more specialized. E.g., support for arbitrary linear data layout transformations is especially beneficial for colored kernels, such as a Red-Black Gauss-Seidel smoother. On the one hand, an optimized data layout for such kernels reduces the bandwidth requirements while, on the other hand, it simplifies an explicit vectorization. Other noticeable optimizations described in detail are redundancy elimination techniques to eliminate common subexpressions both in a sequence of statements and across loop boundaries, arithmetic simplifications and normalizations, and the vectorization mentioned previously. In combination, these optimizations are able to increase the performance not only of the model problem given by Poisson’s equation, but also of real-world applications: an optical flow simulation and the simulation of a non-isothermal and non-Newtonian fluid flow

    Data-parallel concurrent constraint programming.

    Get PDF
    by Bo-ming Tong.Thesis (M.Phil.)--Chinese University of Hong Kong, 1994.Includes bibliographical references (leaves 104-[110]).Chapter 1 --- Introduction --- p.1Chapter 1.1 --- Concurrent Constraint Programming --- p.2Chapter 1.2 --- Finite Domain Constraints --- p.3Chapter 2 --- The Firebird Language --- p.5Chapter 2.1 --- Finite Domain Constraints --- p.6Chapter 2.2 --- The Firebird Computation Model --- p.6Chapter 2.3 --- Miscellaneous Features --- p.7Chapter 2.4 --- Clause-Based N on determinism --- p.9Chapter 2.5 --- Programming Examples --- p.10Chapter 2.5.1 --- Magic Series --- p.10Chapter 2.5.2 --- Weak Queens --- p.14Chapter 3 --- Operational Semantics --- p.15Chapter 3.1 --- The Firebird Computation Model --- p.16Chapter 3.2 --- The Firebird Commit Law --- p.17Chapter 3.3 --- Derivation --- p.17Chapter 3.4 --- Correctness of Firebird Computation Model --- p.18Chapter 4 --- Exploitation of Data-Parallelism in Firebird --- p.24Chapter 4.1 --- An Illustrative Example --- p.25Chapter 4.2 --- Mapping Partitions to Processor Elements --- p.26Chapter 4.3 --- Masks --- p.27Chapter 4.4 --- Control Strategy --- p.27Chapter 4.4.1 --- A Control Strategy Suitable for Linear Equations --- p.28Chapter 5 --- Data-Parallel Abstract Machine --- p.30Chapter 5.1 --- Basic DPAM --- p.31Chapter 5.1.1 --- Hardware Requirements --- p.31Chapter 5.1.2 --- Procedure Calling Convention And Process Creation --- p.32Chapter 5.1.3 --- Memory Model --- p.34Chapter 5.1.4 --- Registers --- p.41Chapter 5.1.5 --- Process Management --- p.41Chapter 5.1.6 --- Unification --- p.49Chapter 5.1.7 --- Variable Table --- p.49Chapter 5.2 --- DPAM with Backtracking --- p.50Chapter 5.2.1 --- Choice Point --- p.52Chapter 5.2.2 --- Trailing --- p.52Chapter 5.2.3 --- Recovering the Process Queues --- p.57Chapter 6 --- Implementation --- p.58Chapter 6.1 --- The DECmpp Massively Parallel Computer --- p.58Chapter 6.2 --- Implementation Overview --- p.59Chapter 6.3 --- Constraints --- p.60Chapter 6.3.1 --- Breaking Down Equality Constraints --- p.61Chapter 6.3.2 --- Processing the Constraint 'As Is' --- p.62Chapter 6.4 --- The Wide-Tag Architecture --- p.63Chapter 6.5 --- Register Window --- p.64Chapter 6.6 --- Dereferencing --- p.65Chapter 6.7 --- Output --- p.66Chapter 6.7.1 --- Collecting the Solutions --- p.66Chapter 6.7.2 --- Decoding the solution --- p.68Chapter 7 --- Performance --- p.69Chapter 7.1 --- Uniprocessor Performance --- p.71Chapter 7.2 --- Solitary Mode --- p.73Chapter 7.3 --- Bit Vectors of Domain Variables --- p.75Chapter 7.4 --- Heap Consumption of the Heap Frame Scheme --- p.77Chapter 7.5 --- Eager Nondeterministic Derivation vs Lazy Nondeterministic Deriva- tion --- p.78Chapter 7.6 --- Priority Scheduling --- p.79Chapter 7.7 --- Execution Profile --- p.80Chapter 7.8 --- Effect of the Number of Processor Elements on Performance --- p.82Chapter 7.9 --- Change of the Degree of Parallelism During Execution --- p.84Chapter 8 --- Related Work --- p.88Chapter 8.1 --- Vectorization of Prolog --- p.89Chapter 8.2 --- Parallel Clause Matching --- p.90Chapter 8.3 --- Parallel Interpreter --- p.90Chapter 8.4 --- Bounded Quantifications --- p.91Chapter 8.5 --- SIMD MultiLog --- p.91Chapter 9 --- Conclusion --- p.93Chapter 9.1 --- Limitations --- p.94Chapter 9.1.1 --- Data-Parallel Firebird is Specialized --- p.94Chapter 9.1.2 --- Limitations of the Implementation Scheme --- p.95Chapter 9.2 --- Future Work --- p.95Chapter 9.2.1 --- Extending Firebird --- p.95Chapter 9.2.2 --- Improvements Specific to DECmpp --- p.99Chapter 9.2.3 --- Labeling --- p.100Chapter 9.2.4 --- Parallel Domain Consistency --- p.101Chapter 9.2.5 --- Branch and Bound Algorithm --- p.102Chapter 9.2.6 --- Other Possible Future Work --- p.102Bibliography --- p.10

    Contribution à la convergence d'infrastructure entre le calcul haute performance et le traitement de données à large échelle

    Get PDF
    The amount of produced data, either in the scientific community or the commercialworld, is constantly growing. The field of Big Data has emerged to handle largeamounts of data on distributed computing infrastructures. High-Performance Computing (HPC) infrastructures are traditionally used for the execution of computeintensive workloads. However, the HPC community is also facing an increasingneed to process large amounts of data derived from high definition sensors andlarge physics apparati. The convergence of the two fields -HPC and Big Data- iscurrently taking place. In fact, the HPC community already uses Big Data tools,which are not always integrated correctly, especially at the level of the file systemand the Resource and Job Management System (RJMS).In order to understand how we can leverage HPC clusters for Big Data usage, andwhat are the challenges for the HPC infrastructures, we have studied multipleaspects of the convergence: We initially provide a survey on the software provisioning methods, with a focus on data-intensive applications. We contribute a newRJMS collaboration technique called BeBiDa which is based on 50 lines of codewhereas similar solutions use at least 1000 times more. We evaluate this mechanism on real conditions and in simulated environment with our simulator Batsim.Furthermore, we provide extensions to Batsim to support I/O, and showcase thedevelopments of a generic file system model along with a Big Data applicationmodel. This allows us to complement BeBiDa real conditions experiments withsimulations while enabling us to study file system dimensioning and trade-offs.All the experiments and analysis of this work have been done with reproducibilityin mind. Based on this experience, we propose to integrate the developmentworkflow and data analysis in the reproducibility mindset, and give feedback onour experiences with a list of best practices.RésuméLa quantité de données produites, que ce soit dans la communauté scientifiqueou commerciale, est en croissance constante. Le domaine du Big Data a émergéface au traitement de grandes quantités de données sur les infrastructures informatiques distribuées. Les infrastructures de calcul haute performance (HPC) sont traditionnellement utilisées pour l’exécution de charges de travail intensives en calcul. Cependant, la communauté HPC fait également face à un nombre croissant debesoin de traitement de grandes quantités de données dérivées de capteurs hautedéfinition et de grands appareils physique. La convergence des deux domaines-HPC et Big Data- est en cours. En fait, la communauté HPC utilise déjà des outilsBig Data, qui ne sont pas toujours correctement intégrés, en particulier au niveaudu système de fichiers ainsi que du système de gestion des ressources (RJMS).Afin de comprendre comment nous pouvons tirer parti des clusters HPC pourl’utilisation du Big Data, et quels sont les défis pour les infrastructures HPC, nousavons étudié plusieurs aspects de la convergence: nous avons d’abord proposé uneétude sur les méthodes de provisionnement logiciel, en mettant l’accent sur lesapplications utilisant beaucoup de données. Nous contribuons a l’état de l’art avecune nouvelle technique de collaboration entre RJMS appelée BeBiDa basée sur 50lignes de code alors que des solutions similaires en utilisent au moins 1000 fois plus.Nous évaluons ce mécanisme en conditions réelles et en environnement simuléavec notre simulateur Batsim. En outre, nous fournissons des extensions à Batsimpour prendre en charge les entrées/sorties et présentons le développements d’unmodèle de système de fichiers générique accompagné d’un modèle d’applicationBig Data. Cela nous permet de compléter les expériences en conditions réellesde BeBiDa en simulation tout en étudiant le dimensionnement et les différentscompromis autours des systèmes de fichiers.Toutes les expériences et analyses de ce travail ont été effectuées avec la reproductibilité à l’esprit. Sur la base de cette expérience, nous proposons d’intégrerle flux de travail du développement et de l’analyse des données dans l’esprit dela reproductibilité, et de donner un retour sur nos expériences avec une liste debonnes pratiques

    Iterative Schedule Optimization for Parallelization in the Polyhedron Model

    Get PDF
    In high-performance computing, one primary objective is to exploit the performance that the given target hardware can deliver to the fullest. Compilers that have the ability to automatically optimize programs for a specific target hardware can be highly useful in this context. Iterative (or search-based) compilation requires little or no prior knowledge and can adapt more easily to concrete programs and target hardware than static cost models and heuristics. Thereby, iterative compilation helps in situations in which static heuristics do not reflect the combination of input program and target hardware well. Moreover, iterative compilation may enable the derivation of more accurate cost models and heuristics for optimizing compilers. In this context, the polyhedron model is of help as it provides not only a mathematical representation of programs but, more importantly, a uniform representation of complex sequences of program transformations by schedule functions. The latter facilitates the systematic exploration of the set of legal transformations of a given program. Early approaches to purely iterative schedule optimization in the polyhedron model do not limit their search to schedules that preserve program semantics and, thereby, suffer from the need to explore numbers of illegal schedules. More recent research ensures the legality of program transformations but presumes a sequential rather than a parallel execution of the transformed program. Other approaches do not perform a purely iterative optimization. We propose an approach to iterative schedule optimization for parallelization and tiling in the polyhedron model. Our approach targets loop programs that profit from data locality optimization and coarse-grained loop parallelization. The schedule search space can be explored either randomly or by means of a genetic algorithm. To determine a schedule's profitability, we rely primarily on measuring the transformed code's execution time. While benchmarking is accurate, it increases the time and resource consumption of program optimization tremendously and can even make it impractical. We address this limitation by proposing to learn surrogate models from schedules generated and evaluated in previous runs of the iterative optimization and to replace benchmarking by performance prediction to the extent possible. Our evaluation on the PolyBench 4.1 benchmark set reveals that, in a given setting, iterative schedule optimization yields significantly higher speedups in the execution of the program to be optimized. Surrogate performance models learned from training data that was generated during previous iterative optimizations can reduce the benchmarking effort without strongly impairing the optimization result. A prerequisite for this approach is a sufficient similarity between the training programs and the program to be optimized

    Augmented Computational Design: Methodical Application of Artificial Intelligence in Generative Design

    Full text link
    This chapter presents methodological reflections on the necessity and utility of artificial intelligence in generative design. Specifically, the chapter discusses how generative design processes can be augmented by AI to deliver in terms of a few outcomes of interest or performance indicators while dealing with hundreds or thousands of small decisions. The core of the performance-based generative design paradigm is about making statistical or simulation-driven associations between these choices and consequences for mapping and navigating such a complex decision space. This chapter will discuss promising directions in Artificial Intelligence for augmenting decision-making processes in architectural design for mapping and navigating complex design spaces.Comment: This is the author's version of the book chapter Augmented Computational Design: Methodical Application of Artificial Intelligence in Generative Design. In Artificial Intelligence in Performance-Driven Design: Theories, Methods, and Tools Towards Sustainability, edited by Narjes Abbasabadi and Mehdi Ashayeri. Wiley, 202

    Combining Representation Learning with Logic for Language Processing

    Get PDF
    The current state-of-the-art in many natural language processing and automated knowledge base completion tasks is held by representation learning methods which learn distributed vector representations of symbols via gradient-based optimization. They require little or no hand-crafted features, thus avoiding the need for most preprocessing steps and task-specific assumptions. However, in many cases representation learning requires a large amount of annotated training data to generalize well to unseen data. Such labeled training data is provided by human annotators who often use formal logic as the language for specifying annotations. This thesis investigates different combinations of representation learning methods with logic for reducing the need for annotated training data, and for improving generalization.Comment: PhD Thesis, University College London, Submitted and accepted in 201

    Research Reports: 1988 NASA/ASEE Summer Faculty Fellowship Program

    Get PDF
    The basic objectives are to further the professional knowledge of qualified engineering and science faculty members; to stimulate an exchange of ideas between participants and NASA: to enrich and refresh the research and teaching activities of the participants' institutions; and to contribute to the research objectives of the NASA centers. Topics addressed include: cryogenics; thunderstorm simulation; computer techniques; computer assisted instruction; system analysis weather forecasting; rocket engine design; crystal growth; control systems design; turbine pumps for the Space Shuttle Main engine; electron mobility; heat transfer predictions; rotor dynamics; mathematical models; computational fluid dynamics; and structural analysis

    Research reports: 1985 NASA/ASEE Summer Faculty Fellowship Program

    Get PDF
    A compilation of 40 technical reports on research conducted by participants in the 1985 NASA/ASEE Summer Faculty Fellowship Program at Marshall Space Flight Center (MSFC) is given. Weibull density functions, reliability analysis, directional solidification, space stations, jet stream, fracture mechanics, composite materials, orbital maneuvering vehicles, stellar winds and gamma ray bursts are among the topics discussed

    Structured parallelism discovery with hybrid static-dynamic analysis and evaluation technique

    Get PDF
    Parallel computer architectures have dominated the computing landscape for the past two decades; a trend that is only expected to continue and intensify, with increasing specialization and heterogeneity. This creates huge pressure across the software stack to produce programming languages, libraries, frameworks and tools which will efficiently exploit the capabilities of parallel computers, not only for new software, but also revitalizing existing sequential code. Automatic parallelization, despite decades of research, has had limited success in transforming sequential software to take advantage of efficient parallel execution. This thesis investigates three approaches that use commutativity analysis as the enabler for parallelization. This has the potential to overcome limitations of traditional techniques. We introduce the concept of liveness-based commutativity for sequential loops. We examine the use of a practical analysis utilizing liveness-based commutativity in a symbolic execution framework. Symbolic execution represents input values as groups of constraints, consequently deriving the output as a function of the input and enabling the identification of further program properties. We employ this feature to develop an analysis and discern commutativity properties between loop iterations. We study the application of this approach on loops taken from real-world programs in the OLDEN and NAS Parallel Benchmark (NPB) suites, and identify its limitations and related overheads. Informed by these findings, we develop Dynamic Commutativity Analysis (DCA), a new technique that leverages profiling information from program execution with specific input sets. Using profiling information, we track liveness information and detect loop commutativity by examining the code’s live-out values. We evaluate DCA against almost 1400 loops of the NPB suite, discovering 86% of them as parallelizable. Comparing our results against dependence-based methods, we match the detection efficacy of two dynamic and outperform three static approaches, respectively. Additionally, DCA is able to automatically detect parallelism in loops which iterate over Pointer-Linked Data Structures (PLDSs), taken from wide range of benchmarks used in the literature, where all other techniques we considered failed. Parallelizing the discovered loops, our methodology achieves an average speedup of 3.6× across NPB (and up to 55×) and up to 36.9× for the PLDS-based loops on a 72-core host. We also demonstrate that our methodology, despite relying on specific input values for profiling each program, is able to correctly identify parallelism that is valid for all potential input sets. Lastly, we develop a methodology to utilize liveness-based commutativity, as implemented in DCA, to detect latent loop parallelism in the shape of patterns. Our approach applies a series of transformations which subsequently enable multiple applications of DCA over the generated multi-loop code section and match its loop commutativity outcomes against the expected criteria for each pattern. Applying our methodology on sets of sequential loops, we are able to identify well-known parallel patterns (i.e., maps, reduction and scans). This extends the scope of parallelism detection to loops, such as those performing scan operations, which cannot be determined as parallelizable by simply evaluating liveness-based commutativity conditions on their original form
    corecore