A High-Performance Domain-Specific Language and Code Generator for General N-body Problems
General N-body problems are a set of problems in which an update to a single element in the system depends on every other element. N-body problems are ubiquitous, with applications in various domains ranging from scientific computing simulations in molecular dynamics, astrophysics, acoustics, and fluid dynamics all the way to computer vision, data mining and machine learning problems. Different N-body algorithms have been designed and implemented in these various fields. However, there is a big gap between the algorithm one designs on paper and the code that runs efficiently on a parallel system. It is time-consuming to write fast, parallel, and scalable code for these problems. On the other hand, the sheer scale and growth of modern scientific datasets necessitate exploiting the power of both parallel and approximation algorithms where there is a potential to trade off accuracy for performance. The main problem that we are tackling in this thesis is how to automatically generate asymptotically optimal N-body algorithms from the high-level specification of the problem. We combine the body of work in performance optimizations, compilers and the domain of N-body problems to build a unified system where domain scientists can write programs at the high level while attaining performance of code written by an expert at the low level. In order to generate high-performance, scalable code for this group of problems, we take the following steps in this thesis. First, we propose a unified algorithmic framework named PASCAL in order to address the challenge of designing a general algorithmic template to represent the class of N-body problems. PASCAL utilizes space-partitioning trees and user-controlled pruning/approximations to reduce the asymptotic runtime complexity from linear to logarithmic in the number of data points.
In PASCAL, we design an algorithm that automatically generates conditions for pruning or approximation of an N-body problem considering the problem's definition. In order to evaluate PASCAL, we developed tree-based algorithms for six well-known problems: k-nearest neighbors, range search, minimum spanning tree, kernel density estimation, expectation maximization, and Hausdorff distance. We show that applying domain-specific optimizations and parallelization to the algorithms written in PASCAL achieves 10x to 230x speedup compared to state-of-the-art libraries on a dual-socket Intel Xeon processor with 16 cores on real-world datasets. Second, we extend the PASCAL framework to build PASCAL-X that adds support for NUMA-aware parallelization. PASCAL-X also presents insights on the influence of tuning parameters. Tuning parameters such as leaf size (influences the shape of the tree) and cut-off level (controls the granularity of tasks) of the space-partitioning trees result in performance improvement of up to 4.6x. A key goal is to generate scalable and high-performance code automatically without sacrificing productivity. That implies minimizing the effort the users have to put in to generate the desired high-performance code. Another critical factor is the adaptivity, which indicates the amount of effort that is required to extend the high-performance code generation to new N-body problems. Finally, we consider these factors and develop a domain-specific language and code generator named Portal, which is built on top of PASCAL-X. Portal's language design is inspired by the mathematical representation of N-body problems, resulting in an intuitive language for rapid implementation of a variety of problems. Portal's back-end is designed and implemented to generate optimized, parallel, and scalable implementations for multi-core systems. 
We demonstrate that the performance achieved by using Portal is comparable to that of expert hand-optimized code while providing productivity for domain scientists. For instance, using Portal for the k-nearest neighbors problem achieves performance similar to that of the hand-optimized code, while reducing the lines of code by 68x. To the best of our knowledge, there are no known libraries or frameworks that implement parallel asymptotically optimal algorithms for the class of general N-body problems, and this thesis primarily aims to fill this gap. Finally, we present a case study of Portal for the real-world problem of face clustering. In this case study, we show that Portal not only provides a fast solution for the face clustering problem with accuracy similar to that of the state-of-the-art algorithm, but also provides productivity by implementing the face clustering algorithm in only 14 lines of Portal code.
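As a rough illustration of the tree-based pruning idea described in this abstract, the following sketch answers a nearest-neighbor query with a k-d tree and skips any subtree that cannot contain a closer point. It is not PASCAL or Portal code; the names `build_kdtree`, `sq_dist`, and `nearest`, and the `leaf_size` default, are assumptions made for this example.

```python
def build_kdtree(points, depth=0, leaf_size=4):
    """Split points on alternating axes; leaves hold at most leaf_size points."""
    if len(points) <= leaf_size:
        return {"leaf": True, "points": points}
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "leaf": False,
        "axis": axis,
        "split": points[mid][axis],
        "left": build_kdtree(points[:mid], depth + 1, leaf_size),
        "right": build_kdtree(points[mid:], depth + 1, leaf_size),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Return (squared distance, point) for the nearest neighbor of query."""
    if node["leaf"]:
        for p in node["points"]:
            d = sq_dist(p, query)
            if best is None or d < best[0]:
                best = (d, p)
        return best
    diff = query[node["axis"]] - node["split"]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, query, best)
    # Pruning rule: every point in the far subtree is at least |diff| away
    # along the split axis, so skip it when that bound cannot beat the best.
    if best is None or diff * diff < best[0]:
        best = nearest(far, query, best)
    return best
```

The leaf size plays the role the abstract attributes to it: larger leaves give a shallower tree with more brute-force work per leaf, smaller leaves give more pruning opportunities at the cost of deeper recursion.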
Multi-Architecture Monte-Carlo (MC) Simulation of Soft Coarse-Grained Polymeric Materials: SOft coarse grained Monte-carlo Acceleration (SOMA)
Multi-component polymer systems are important for the development of new
materials because of their ability to phase-separate or self-assemble into
nano-structures. The Single-Chain-in-Mean-Field (SCMF) algorithm in conjunction
with a soft, coarse-grained polymer model is an established technique to
investigate these soft-matter systems. Here we present an implementation of
this method: SOft coarse grained Monte-carlo Acceleration (SOMA). It is
suitable to simulate large system sizes with up to billions of particles, yet
versatile enough to study properties of different kinds of molecular
architectures and interactions. We achieve efficient simulations by
using accelerators such as GPUs on both workstations and
supercomputers. The implementation remains flexible and maintainable because
it is written in a scientific programming language enhanced by
OpenACC pragmas for the accelerators. We present implementation details and
features of the program package, investigate the scalability of our
implementation SOMA, and discuss two applications, which cover system sizes
that are difficult to reach with other, common particle-based simulation
methods.
A Fully-Pipelined Hardware Design for Gaussian Mixture Models
Gaussian Mixture Models (GMMs) are widely used in many applications such as data mining, signal processing and computer vision, for probability density modeling and soft clustering. However, the parameters of a GMM need to be estimated from data by, for example, the Expectation-Maximization algorithm for Gaussian Mixture Models (EM-GMM), which is computationally demanding. This paper presents a novel design for the EM-GMM algorithm targeting reconfigurable platforms, with five main contributions. First, we present a pipeline-friendly EM-GMM with diagonal covariance matrices that can easily be mapped to hardware architectures. Second, we present a function evaluation unit for Gaussian probability density based on fixed-point arithmetic. Third, our approach is extended to support a wide range of dimensions and/or components by fitting multiple pieces of smaller dimensions onto an FPGA chip. Fourth, we derive a cost and performance model that estimates logic resources. Fifth, our dataflow design targeting the Maxeler MPCX2000 with a Stratix-5SGSD8 FPGA can run over 200 times faster than a 6-core Xeon E5645 processor, and over 39 times faster than a Pascal TITAN-X GPU. Our design provides a practical solution for training GMMs and exploring better parameters with hundreds of millions of high-dimensional input instances, for low-latency and high-performance applications.
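To make the role of the diagonal covariance concrete, here is a minimal software sketch of the E-step for a single sample. With a diagonal covariance the log-density is a plain sum of independent per-dimension terms, which is the kind of structure that maps naturally onto a pipeline. This is ordinary floating-point Python, not the fixed-point hardware design of the paper; the function names `log_gauss_diag` and `e_step` are assumptions made for this example.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance.

    Each dimension contributes one independent term, so no matrix
    inverse or determinant is needed.
    """
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def e_step(x, weights, means, variances):
    """Responsibilities of each mixture component for one sample x."""
    logs = [
        math.log(w) + log_gauss_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp normalization keeps the computation numerically stable.
    top = max(logs)
    norm = top + math.log(sum(math.exp(value - top) for value in logs))
    return [math.exp(value - norm) for value in logs]
```

For a sample halfway between two equally weighted components with equal variances, `e_step` returns equal responsibilities for both components.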
Coupled Kinetic-Fluid Simulations of Ganymede's Magnetosphere and Hybrid Parallelization of the Magnetohydrodynamics Model
The largest moon in the solar system, Ganymede, is the only moon known to possess a strong intrinsic magnetic field.
The interaction between the Jovian plasma and Ganymede's magnetic field creates a mini-magnetosphere with periodically varying upstream conditions, which creates a perfect laboratory in nature for studying magnetic reconnection and magnetospheric physics.
Using the latest version of Space Weather Modeling Framework (SWMF), we study the upstream plasma interactions and dynamics in this subsonic, sub-Alfvénic system.
We have developed a coupled fluid-kinetic Hall Magnetohydrodynamics with embedded Particle-in-Cell (MHD-EPIC) model for Ganymede's magnetosphere, with a self-consistently coupled resistive body representing the electrical properties of the moon's interior, improved inner boundary conditions, and a high-resolution charge- and energy-conserving PIC scheme.
I reimplemented the boundary condition setup in SWMF for more versatile control and functionalities, and developed a new user module for Ganymede's simulation.
Results from the models are validated with Galileo magnetometer data of all close encounters and compared with Plasma Subsystem (PLS) data.
The energy flux associated with the upstream reconnection in the model is estimated to be about 10^-7 W/cm^2, which accounts for about 40% of the total peak auroral emissions observed by the Hubble Space Telescope.
We find that under steady upstream conditions, magnetopause reconnection in our fluid-kinetic simulations occurs in a non-steady manner.
Flux ropes with a length of one Ganymede radius form on the magnetopause at a rate of about 3 per minute and create spatiotemporal variations in plasma and field properties.
Upon reaching proper grid resolutions, the MHD-EPIC model can resolve both electron and ion kinetics at the magnetopause and show localized crescent shape distribution in both ion and electron phase space, non-gyrotropic and non-isotropic behavior inside the diffusion regions.
The estimated global reconnection rate from the models is about 80 kV with 60% efficiency.
There is weak evidence of minute periodicity in the temporal variations of the reconnection rate due to the dynamic reconnection process.
The requirement of high-fidelity results motivates the development of a hybrid parallelization strategy for the numerical model and of faster data processing techniques.
The state-of-the-art finite volume/difference MHD code Block Adaptive Tree Solarwind Roe Upwind Scheme (BATS-R-US) was originally designed with pure MPI parallelization.
The maximum problem size achievable was limited by the storage requirements of the block tree structure.
To mitigate this limitation, we have added multithreaded OpenMP parallelization to the previous pure MPI implementation.
We opt to use a coarse-grained approach by making the loops over grid blocks multithreaded and have succeeded in making BATS-R-US an efficient hybrid parallel code with modest changes in the source code while preserving the performance.
Good weak scaling up to 500,000 and 250,000 cores is achieved for the explicit and implicit time stepping schemes, respectively.
This parallelization strategy greatly extends the possible simulation scale by an order of magnitude, and paves the way for future GPU-portable code development.
To improve visualization and data processing, I have developed a whole new data processing workflow with the Julia programming language for efficient data analysis and visualization.
In summary,
1. I built a single-fluid Hall MHD-EPIC model of Ganymede's magnetosphere;
2. I carried out a detailed analysis of the upstream reconnection;
3. I developed an MPI+OpenMP parallel MHD model with BATS-R-US;
4. I wrote a package for data analysis and visualization.
PhD, Climate and Space Sciences and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/163032/1/hyzhou_1.pd
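The coarse-grained strategy described in this abstract, threading the loop over grid blocks instead of the loops over cells, can be sketched structurally as follows. This is only an analogy in Python with a thread pool, not BATS-R-US code (which uses MPI and OpenMP), and Python threads will not give a real speedup for this arithmetic. The per-block update `advance_block` and the names `advance_all` and `n_threads` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def advance_block(block, dt):
    """Hypothetical per-block update: diffusion-like smoothing of interior cells."""
    new = block[:]
    for i in range(1, len(block) - 1):
        new[i] = block[i] + dt * (block[i - 1] - 2.0 * block[i] + block[i + 1])
    return new


def advance_all(blocks, dt, n_threads=4):
    """Coarse-grained parallelism: one task per grid block, not per cell."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # map preserves block order, so results line up with the input list.
        return list(pool.map(lambda block: advance_block(block, dt), blocks))
```

Because each block is updated independently within a step, the per-block routine stays unchanged and only the outer loop is handed to the threads, which is why this style of change can be modest in the source code.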
Parallelization Strategies for Modern Computing Platforms: Application to Illustrative Image Processing and Computer Vision Applications
SUMMARY
The spectacular evolution of hardware and software technologies has enabled the emergence of new, very powerful parallel platforms. These platforms have marked the beginning of a new era of computing, and they are expected to remain in use for a long time. They are already present in high performance computing (HPC) as well as in embedded systems. Recently, the concept of heterogeneous computing has been adopted in these fields to achieve high performance. Several types of processors are therefore used, the most popular being central processing units (CPUs) and graphics processing units (GPUs).
Efficient programming for these new parallel platforms currently brings not only opportunities but also significant challenges for designers. Consequently, industry needs the support of the research community to ensure the success of this paradigm shift toward parallel computing. Three main challenges, present for massively parallel (many-core) GPUs as well as for multi-core CPUs, are: (1) selecting the best parallel platform for a given application, (2) selecting the best parallelization strategy, and (3) careful performance tuning to better exploit existing platforms.
In this context, the overall objective of our research project is to define new solutions that support the efficient programming of complex applications on modern parallel platforms.
The main research contributions are:
1. An evaluation of the acceleration efficiency of several parallel platforms for compute-intensive applications.
2. A quantitative analysis of parallelization and implementation strategies on platforms based on multi-core CPUs as well as on platforms based on massively parallel GPUs.
3. The definition and implementation of a performance tuning approach for parallel platforms.
The proposed contributions were validated using illustrative real applications and a varied set of modern parallel platforms.
ABSTRACT
With the technology improvement for both hardware and software, parallel platforms started a new computing era and they are here to stay. Parallel platforms may be found in High Performance Computers (HPC) or embedded computers. Recently, both HPC and embedded computers are moving toward heterogeneous computing platforms. They are employing both Central Processing Units (CPUs) and Graphics Processing Units (GPUs) to achieve the highest performance. Programming efficiently for parallel platforms brings new opportunities but also several challenges. Therefore, industry needs help from the research community to succeed in its recent dramatic shift to parallel computing.
Parallel programming presents several major challenges. These challenges are equally present whether one programs on a many-core GPU or on a multi-core CPU. Three of the main challenges are: (1) finding the best platform providing the required acceleration, (2) selecting the best parallelization strategy, and (3) tuning performance to efficiently leverage the parallel platforms.
In this context, the overall objective of our research is to propose a new solution helping designers to efficiently program complex applications on modern parallel architectures. The contributions of this thesis are:
1. The evaluation of the efficiency of several target parallel platforms to speed up compute-intensive applications.
2. The quantitative analysis for parallelization and implementation strategies on multicore CPUs and many-core GPUs.
3. The definition and implementation of a new performance tuning framework for heterogeneous parallel platforms.
The contributions were validated using real compute-intensive applications and modern parallel platforms based on multi-core CPUs and many-core GPUs.
Online Modeling and Tuning of Parallel Stream Processing Systems
Writing performant computer programs is hard. Code for high performance applications is profiled, tweaked, and re-factored for months specifically for the hardware on which it is to run. Consumer application code doesn't get the benefit of the endless massaging that benefits high performance code, even though heterogeneous processor environments are beginning to resemble those in more performance-oriented arenas. This thesis offers a path to performant, parallel code (through stream processing) which is tuned online and automatically adapts to the environment it is given. This approach has the potential to reduce the tuning costs associated with high performance code and brings the benefit of performance tuning to consumer applications where otherwise it would be cost prohibitive. This thesis introduces a stream processing library and multiple techniques to enable its online modeling and tuning. Stream processing (also termed data-flow programming) is a compute paradigm that views an application as a set of logical kernels connected via communication links or streams. Stream processing is increasingly used by computational-x and x-informatics fields (e.g., biology, astrophysics) where the focus is on safe and fast parallelization of specific big-data applications. A major advantage of stream processing is that it enables parallelization without necessitating manual end-user management of non-deterministic behavior often characteristic of more traditional parallel processing methods. Many big-data and high performance applications involve high throughput processing, necessitating usage of many parallel compute kernels on several compute cores. Optimizing the orchestration of kernels has been the focus of much theoretical and empirical modeling work. Purely theoretical parallel programming models can fail when the assumptions implicit within the model are mismatched with reality (i.e., the model is incorrectly applied).
Often it is unclear if the assumptions are actually being met, even when verified under controlled conditions. Full empirical optimization solves this problem by extensively searching the range of likely configurations under native operating conditions. This, however, is expensive in both time and energy. For large, massively parallel systems, even deciding which modeling paradigm to use is often prohibitively expensive and unfortunately transient (with workload and hardware). In an ideal world, a parallel run-time will re-optimize an application continuously to match its environment, with little additional overhead. This work presents methods aimed at doing just that through low overhead instrumentation, modeling, and optimization. Online optimization provides a good trade-off between static optimization and online heuristics. To enable online optimization, modeling decisions must be fast and relatively accurate. Online modeling and optimization of a stream processing system first requires the existence of a stream processing framework that is amenable to the intended type of dynamic manipulation. To fill this void, we developed the RaftLib C++ template library, which enables usage of the stream processing paradigm for C++ applications (it is the run-time which is the basis of almost all the work within this dissertation). An application topology is specified by the user, however almost everything else is optimizable by the run-time. RaftLib takes advantage of the knowledge gained during the design of several prior streaming languages (notably Auto-Pipe). The resultant framework enables online migration of tasks, auto-parallelization, online buffer-reallocation, and other useful dynamic behaviors that were not available in many previous stream processing systems. Several benchmark applications have been designed to assess the performance gains through our approaches and compare performance to other leading stream processing frameworks. 
Information is essential to any modeling task; to that end, a low-overhead instrumentation framework has been developed which is both dynamic and adaptive. Discovering a fast and relatively optimal configuration for a stream processing application often necessitates solving for buffer sizes within a finite capacity queueing network. We show that a generalized gain/loss network flow model can bootstrap the process under certain conditions. Any modeling effort requires that a model be selected, which is often a highly manual task involving many expensive operations. This dissertation demonstrates that machine learning methods (such as a support vector machine) can successfully select models at run-time for a streaming application. The full set of approaches is incorporated into the open source RaftLib framework.
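The stream processing model described in this abstract, logical kernels connected by streams, can be sketched in a few lines with threads and bounded FIFO queues. This is not the RaftLib API (RaftLib is a C++ template library); the names `run_kernel`, `run_pipeline`, and `capacity` are assumptions made for this example, and the fixed `capacity` stands in for the buffer sizes that the thesis tunes online.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a stream


def run_kernel(func, inbox, outbox):
    """Apply func to each item on the inbound stream and forward the result."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)
            return
        outbox.put(func(item))


def run_pipeline(items, funcs, capacity=8):
    """Chain kernels with bounded FIFO streams and return outputs in order."""
    streams = [queue.Queue(maxsize=capacity) for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=run_kernel, args=(f, streams[i], streams[i + 1]))
        for i, f in enumerate(funcs)
    ]

    def feed():
        for item in items:
            streams[0].put(item)
        streams[0].put(_DONE)

    # The source runs in its own thread so a full first stream cannot
    # block the consumer loop below.
    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    results = []
    while True:
        out = streams[-1].get()
        if out is _DONE:
            break
        results.append(out)
    for t in threads:
        t.join()
    return results
```

For example, `run_pipeline(range(5), [lambda x: x + 1, lambda x: x * x])` returns `[1, 4, 9, 16, 25]`. The kernels share no state and communicate only through the queues, which is the property that lets a run-time resize buffers or move kernels without changing application code.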