573 research outputs found
On-the-fly memory compression for multibody algorithms.
Memory and bandwidth demands challenge developers of particle-based codes that have to scale on new architectures, as the growth of concurrency outpaces improvements in memory access facilities, as the memory per core tends to stagnate, and as communication networks cannot increase bandwidth arbitrarily. We propose to analyse each particle of such a code to find out whether a hierarchical data representation storing data with reduced precision caps the memory demands without exceeding given error bounds. For admissible candidates, we perform this compression and thus reduce the pressure on the memory subsystem, lower the total memory footprint and reduce the data to be exchanged via MPI. Notably, our analysis and transformation change the data compression dynamically, i.e. the choice of data format follows the solution characteristics, and they do not require us to alter the core simulation code.
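The per-particle admissibility check described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that the hierarchical representation stores a value as an offset from a reference (for example a cell centre) and that reduced precision means keeping fewer mantissa bits of an IEEE double. The names `truncate_mantissa`, `compress` and the candidate bit widths are hypothetical.

```python
import struct

def truncate_mantissa(x, kept_bits):
    """Zero all but the leading kept_bits of the 52 mantissa bits of a double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    mask = ~((1 << (52 - kept_bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
    return y

def compress(value, reference, max_error, candidates=(4, 12, 20, 28, 36, 44)):
    """Return (kept_bits, truncated_offset) for the smallest admissible format,
    or None if no candidate format stays within max_error (assumed policy)."""
    offset = value - reference  # hierarchical part: store relative to reference
    for kept in candidates:
        approx = truncate_mantissa(offset, kept)
        if abs((reference + approx) - value) <= max_error:
            return kept, approx
    return None  # not admissible: keep the value in full precision

# Example: a particle coordinate close to its cell centre
result = compress(0.5000123, 0.5, 1e-6)
```

Because the check runs per value, the chosen format can differ from particle to particle and from time step to time step, which is the dynamic behaviour the abstract refers to.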
The Parallelism Motifs of Genomic Data Analysis
Genomic data sets are growing dramatically as the cost of sequencing
continues to decline and small sequencing devices become available. Enormous
community databases store and share this data with the research community, but
some of these genomic data analysis problems require large scale computational
platforms to meet both the memory and computational requirements. These
applications differ from scientific simulations that dominate the workload on
high end parallel systems today and place different requirements on programming
support, software libraries, and parallel architectural design. For example,
they involve irregular communication patterns such as asynchronous updates to
shared data structures. We consider several problems in high performance
genomics analysis, including alignment, profiling, clustering, and assembly for
both single genomes and metagenomes. We identify some of the common
computational patterns or motifs that help inform parallelization strategies
and compare our motifs to some of the established lists, arguing that at least
two key patterns, sorting and hashing, are missing.
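To make the two motifs concrete, here is a small sketch of k-mer counting, a common step in genomic analysis. It is an illustration only, not code from the paper: hashing appears as the hash-table update per k-mer, sorting as the final ranking. The function names are hypothetical.

```python
from collections import Counter

def count_kmers(reads, k):
    """Hashing motif: every k-mer is used as a key into a hash table."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def top_kmers(counts, n):
    """Sorting motif: rank k-mers by descending count, ties broken alphabetically."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

counts = count_kmers(["ACGT", "CGTA"], 3)
ranking = top_kmers(counts, 2)
```

In a distributed setting the hash-table updates become the irregular, asynchronous updates to shared data structures that the abstract mentions.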
Communication-Avoiding Algorithms for a High-Performance Hyperbolic PDE Engine
The study of waves has always been an important subject of research. Earthquakes, for example,
have a direct impact on the daily lives of millions of people while gravitational waves reveal
insight into the composition and history of the Universe. These physical phenomena, despite
being tackled traditionally by different fields of physics, have in common that they are modelled
the same way mathematically: as a system of hyperbolic partial differential equations (PDEs).
The ExaHyPE project (“An Exascale Hyperbolic PDE Engine”) translates this similarity into
a software engine that can be quickly adapted to simulate a wide range of hyperbolic partial
differential equations. ExaHyPE’s key idea is that the user only specifies the physics while the
engine takes care of the parallelisation and the interplay of the underlying numerical methods.
Consequently, a first simulation code for a new hyperbolic PDE can often be realised within a
few hours. This is a task that traditionally can take weeks, months, even years for researchers
starting from scratch.
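The split between user-supplied physics and engine-supplied numerics can be sketched in a few lines. This is not ExaHyPE's API or its numerical scheme: it assumes a first-order finite volume method with a Rusanov flux on a periodic 1D grid as a stand-in, and all names (`rusanov_step`, `advection_flux`, `advection_speed`) are hypothetical.

```python
def rusanov_step(u, flux, max_speed, dx, dt):
    """Engine side: one explicit finite volume step on a periodic 1D grid.
    The engine only needs the flux and the maximum wave speed from the user."""
    n = len(u)

    def numerical_flux(ul, ur):
        s = max(max_speed(ul), max_speed(ur))
        return 0.5 * (flux(ul) + flux(ur)) - 0.5 * s * (ur - ul)

    # f[i] is the flux through the interface between cell i and cell i+1
    f = [numerical_flux(u[i], u[(i + 1) % n]) for i in range(n)]
    # f[i - 1] wraps around for i == 0, which gives the periodic boundary
    return [u[i] - dt / dx * (f[i] - f[i - 1]) for i in range(n)]

# User side: the physics of linear advection with unit speed
def advection_flux(u):
    return 1.0 * u

def advection_speed(u):
    return 1.0

u_next = rusanov_step([0.0, 1.0, 0.0, 0.0], advection_flux, advection_speed,
                      dx=0.25, dt=0.25)
```

Swapping in a different hyperbolic equation means replacing only the two user functions, which mirrors the division of labour described above.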
My main contribution to ExaHyPE is the development of the core infrastructure. This
comprises the development and implementation of ExaHyPE’s solvers and adaptive mesh
refinement procedures, its MPI+X parallelisation as well as high-level aspects of ExaHyPE’s
application-tailored code generation, which allows ExaHyPE to be adapted to model many different
hyperbolic PDE systems. Like any high-performance computing code, ExaHyPE has to tackle the
challenges of the coming exascale computing era, notably network communication latencies and
the growing memory wall. In this thesis, I propose memory-efficient realisations of ExaHyPE’s
solvers that avoid data movement together with a novel task-based MPI+X parallelisation
concept that hides network communication behind computation in dynamically adaptive
simulations.
Compact connectivity representation for triangle meshes
Many digital models used in entertainment, medical visualization, material science, architecture, Geographic Information Systems (GIS), and mechanical Computer Aided Design (CAD) are defined in terms of their boundaries. These boundaries are often approximated using triangle meshes. The complexity of models, which can be measured by triangle count, increases rapidly with the precision of scanning technologies and with the need for higher resolution. An increase in mesh complexity results in an increase of storage requirement, which in turn increases the frequency of disk access or cache misses during mesh processing, and hence decreases performance. For example, in a test application involving a mesh with 55 million triangles in a machine with 4GB of memory versus a machine with 1GB of memory, performance decreases by a factor of about 6000 because of memory thrashing. To help reduce memory thrashing, we focus on decreasing the average storage requirement per triangle measured in 32-bit integer references per triangle (rpt).
This thesis covers compact connectivity representation for triangle meshes and discusses four data structures:
1. Sorted Opposite Table (SOT), which uses 3 rpt and has been extended to support tetrahedral meshes.
2. Sorted Quad (SQuad), which uses about 2 rpt and has been extended to support streaming.
3. Laced Ring (LR), which uses about 1 rpt and offers an excellent compromise between storage compactness and performance of mesh traversal operators.
4. Zipper, an extension of LR, which uses about 6 bits per triangle (equivalently 0.19 rpt) and is therefore the most compact representation.
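As a point of reference for the rpt figures in the list above, the Corner Table that these structures compact can be sketched as two integer arrays: V (the vertex at each corner) and O (the opposite corner across the facing edge), three entries each per triangle, i.e. 6 rpt. The sketch below is an illustration under that assumption, not the thesis code. The helper names are hypothetical, and the mesh is assumed to be manifold.

```python
def next_corner(c):
    """Next corner in the same triangle (corners 3t, 3t+1, 3t+2 form triangle t)."""
    return 3 * (c // 3) + (c + 1) % 3

def prev_corner(c):
    """Previous corner in the same triangle."""
    return 3 * (c // 3) + (c + 2) % 3

def build_opposite(V):
    """Build the O table from V by matching corners that face the same edge.
    Corners facing a border edge keep the value -1."""
    O = [-1] * len(V)
    waiting = {}  # edge (min vertex, max vertex) -> corner that faces it
    for c in range(len(V)):
        a, b = V[next_corner(c)], V[prev_corner(c)]
        key = (min(a, b), max(a, b))
        if key in waiting:
            d = waiting.pop(key)
            O[c], O[d] = d, c
        else:
            waiting[key] = c
    return O

# Two triangles (0,1,2) and (2,1,3) sharing the edge between vertices 1 and 2
V = [0, 1, 2, 2, 1, 3]
O = build_opposite(V)
references_per_triangle = (len(V) + len(O)) / (len(V) // 3)
```

Here `references_per_triangle` evaluates to 6.0. The 6 bits per triangle quoted for Zipper correspond to 6/32 ≈ 0.19 of a 32-bit reference.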
The triangle mesh data structures proposed in this thesis support the standard set of mesh connectivity operators introduced by the previously proposed Corner Table at an amortized constant time complexity. They can be constructed in linear time and space from the Corner Table or any equivalent representation. If geometry is stored as 16-bit coordinates, using Zipper instead of the Corner Table increases the size of the mesh that can be stored in core memory by a factor of about 8.
PhD
Committee Chair: Rossignac, Jarek; Committee Co-Chair: Frost, David; Committee Member: Lindstrom, Peter; Committee Member: Liu, C. Karen; Committee Member: Turk, Gre
Run-time optimization of adaptive irregular applications
Compared to traditional compile-time optimization, run-time optimization could offer significant performance improvements when parallelizing and optimizing adaptive irregular applications, because it performs program analysis and adaptive optimizations during program execution. Run-time techniques can succeed where static techniques fail because they exploit the characteristics of input data, programs' dynamic behaviors, and the underlying execution environment. When optimizing adaptive irregular applications for parallel execution, a common observation is that the effectiveness of the optimizing transformations depends on programs' input data and their dynamic phases. This dissertation presents a set of run-time optimization techniques that match the characteristics of programs' dynamic memory access patterns and the appropriate optimization (parallelization) transformations. First, we present a general adaptive algorithm selection framework to automatically and adaptively select at run-time the best performing, functionally equivalent algorithm for each of its execution instances. The selection process is based on off-line automatically generated prediction models and characteristics (collected and analyzed dynamically) of the algorithm's input data. In this dissertation, we specialize this framework for automatic selection of reduction algorithms. In this research, we have identified a small set of machine independent high-level characterization parameters and then we deployed an off-line, systematic experiment process to generate prediction models. These models, in turn, match the parameters to the best optimization transformations for a given machine. The technique has been evaluated thoroughly in terms of applications, platforms, and programs' dynamic behaviors.
Specifically, for the reduction algorithm selection, the selected performance is within 2% of optimal performance and on average is 60% better than "Replicated Buffer," the default parallel reduction algorithm specified by the OpenMP standard. To reduce the overhead of speculative run-time parallelization, we have developed an adaptive run-time parallelization technique that dynamically chooses efficient shadow structures to record a program's dynamic memory access patterns for parallelization. This technique complements the original speculative run-time parallelization technique, the LRPD test, in parallelizing loops with sparse memory accesses. The techniques presented in this dissertation have been implemented in an optimizing research compiler and can be viewed as effective building blocks for comprehensive run-time optimization systems, e.g., feedback-directed optimization systems and dynamic compilation systems.
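The idea of selecting among functionally equivalent reduction algorithms at run time can be sketched as follows. This is a simplified illustration, not the dissertation's framework: the hash-table variant, the single "touched fraction" parameter and the fixed threshold are assumptions standing in for the off-line generated prediction models, and the workers are simulated sequentially.

```python
def reduce_replicated(indices, values, size, workers):
    """Replicated buffer: each worker owns a full-size private copy, merged at the end."""
    buffers = [[0.0] * size for _ in range(workers)]
    for pos, (i, v) in enumerate(zip(indices, values)):
        buffers[pos % workers][i] += v
    return [sum(b[j] for b in buffers) for j in range(size)]

def reduce_hashed(indices, values, size, workers):
    """Assumed alternative: each worker keeps a hash table of touched entries only."""
    tables = [{} for _ in range(workers)]
    for pos, (i, v) in enumerate(zip(indices, values)):
        table = tables[pos % workers]
        table[i] = table.get(i, 0.0) + v
    result = [0.0] * size
    for table in tables:
        for i, v in table.items():
            result[i] += v
    return result

def select_reduction(indices, size, threshold=0.1):
    """Pick an algorithm from a characteristic measured on the input at run time.
    The threshold is a hypothetical stand-in for a machine-specific model."""
    touched_fraction = len(set(indices)) / size
    return reduce_hashed if touched_fraction < threshold else reduce_replicated

sparse_indices, sparse_values = [0, 0, 3], [1.0, 2.0, 4.0]
chosen = select_reduction(sparse_indices, size=100)
total = chosen(sparse_indices, sparse_values, size=100, workers=2)
```

Both variants return the same array, so the choice affects only cost: the replicated buffers scale with the array size times the number of workers, whereas the hash tables scale with the number of distinct entries touched.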