16 research outputs found
Improving the scalability of parallel N-body applications with an event driven constraint based execution model
The scalability and efficiency of graph applications are significantly
constrained by conventional systems and their supporting programming models.
Technology trends like multicore, manycore, and heterogeneous system
architectures are introducing further challenges and possibilities for emerging
application domains such as graph applications. This paper explores the space
of effective parallel execution of ephemeral graphs that are dynamically
generated using the Barnes-Hut algorithm to exemplify dynamic workloads. The
workloads are expressed using the semantics of an Exascale computing execution
model called ParalleX. For comparison, results using conventional execution
model semantics are also presented. We find that the advanced semantics for
Exascale computing improve load balancing at runtime and enable automatic
parallelism discovery, thereby improving efficiency
Visualization analysis of astrophysics n-bodied problem using image morphological processing techniques
This project's primary goal is to detect points of interest within the output data resulting from running a simulation of the Astrophysics N-Bodied problem (GRAPEcluster). Morphological Image Processing techniques will be applied to the visualized data in order to detect areas of interest within the original data. Several Morphological Image Processing techniques will be used and the results compared in the analysis. The final output of the VRAD (Visualization of Raw Astrophysics Data) System will be two-fold: first, the VRAD system will output a text file that contains the x, y and z coordinates of each region of interest in each time slice that is examined; second, the VRAD system will output three image files for each time slice with the 2D regions of interest highlighted by a bounding box. In this way the VRAD system can act as a stand-alone program or be used in conjunction with the Spiegel visualization framework
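The kind of morphological region detection the abstract describes can be sketched in a few lines. The following is a hypothetical illustration, not the VRAD code: it thresholds a density grid, applies a morphological closing (dilation followed by erosion) to merge nearby bright cells, and reports one bounding box per connected region, mirroring the text-file output described above. All function names and thresholds are assumptions.

```python
# Hypothetical sketch of a VRAD-style pipeline: threshold a density image,
# apply a morphological closing, then report bounding boxes of regions of
# interest. Names and thresholds are illustrative, not from the paper.
import numpy as np

def dilate(img, k=1):
    """Binary dilation with a (2k+1)x(2k+1) square structuring element."""
    out = np.zeros_like(img)
    for dy in range(-k, k + 1):
        for dx in range(-k, k + 1):
            out |= np.roll(np.roll(img, dy, axis=0), dx, axis=1)
    return out

def erode(img, k=1):
    """Binary erosion, via duality: complement of dilation of complement."""
    return ~dilate(~img, k)

def regions_of_interest(density, threshold=0.5):
    """Close small gaps, then label 4-connected components by flood fill."""
    mask = dilate(density > threshold)      # dilation ...
    mask = erode(mask)                      # ... then erosion = closing
    labels = np.zeros(mask.shape, dtype=int)
    boxes, current = [], 0
    for y, x in zip(*np.nonzero(mask)):
        if labels[y, x]:
            continue
        current += 1
        stack, ys, xs = [(y, x)], [], []
        labels[y, x] = current
        while stack:
            cy, cx = stack.pop()
            ys.append(cy); xs.append(cx)
            for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                if 0 <= ny < mask.shape[0] and 0 <= nx < mask.shape[1] \
                        and mask[ny, nx] and not labels[ny, nx]:
                    labels[ny, nx] = current
                    stack.append((ny, nx))
        boxes.append((min(ys), min(xs), max(ys), max(xs)))  # (y0, x0, y1, x1)
    return boxes

grid = np.zeros((10, 10))
grid[2:4, 2:4] = 1.0        # one dense clump of particles
grid[7, 7] = 1.0            # an isolated bright cell
print(regions_of_interest(grid))
```

A 3D version for the x, y, z output described in the abstract would use a cubic structuring element and 6-connected flood fill, but the closing-then-label structure is identical.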
Extreme scale parallel NBody algorithm with event driven constraint based execution model
Traditional scientific applications, such as Computational Fluid Dynamics and numerical methods based on Partial Differential Equations (like Finite Difference and Finite Element Methods), achieve sufficient efficiency on state-of-the-art high-performance computing systems and have been widely studied and implemented using conventional programming models. For emerging application domains such as graph applications, however, scalability and efficiency are significantly constrained by conventional systems and their supporting programming models. Furthermore, technology trends like multicore, manycore, and heterogeneous system architectures are introducing new challenges and possibilities. These emerging technologies require rethinking how to expose the underlying parallelism more effectively to applications and end users. This thesis explores the space of effective parallel execution of ephemeral graphs that are dynamically generated. A standard particle-based simulation, solved using the Barnes-Hut algorithm, is chosen to exemplify such dynamic workloads. In this thesis the workloads are expressed using sequential execution semantics, a conventional parallel programming model (shared-memory semantics), and the semantics of ParalleX, an innovative execution model designed for efficient, scalable performance toward Exascale computing. The main outcomes of this research are parallel processing of dynamic ephemeral workloads, dynamic load balancing during runtime, and the use of advanced semantics to expose parallelism in scaling-constrained applications
A High-Performance Domain-Specific Language and Code Generator for General N-body Problems
General N-body problems are a set of problems in which an update to a single element in the system depends on every other element. N-body problems are ubiquitous, with applications in domains ranging from scientific computing simulations in molecular dynamics, astrophysics, acoustics, and fluid dynamics all the way to computer vision, data mining, and machine learning. Different N-body algorithms have been designed and implemented in these various fields. However, there is a big gap between the algorithm one designs on paper and code that runs efficiently on a parallel system; it is time-consuming to write fast, parallel, and scalable code for these problems. On the other hand, the sheer scale and growth of modern scientific datasets necessitate exploiting the power of both parallel and approximation algorithms where there is potential to trade off accuracy for performance. The main problem that we tackle in this thesis is how to automatically generate asymptotically optimal N-body algorithms from a high-level specification of the problem. We combine the body of work in performance optimizations, compilers, and the domain of N-body problems to build a unified system where domain scientists can write programs at a high level while attaining the performance of code written by an expert at the low level. In order to generate high-performance, scalable code for this group of problems, we take the following steps in this thesis. First, we propose a unified algorithmic framework named PASCAL to address the challenge of designing a general algorithmic template representing the class of N-body problems. PASCAL utilizes space-partitioning trees and user-controlled pruning/approximations to reduce the asymptotic runtime complexity from linear to logarithmic in the number of data points.
In PASCAL, we design an algorithm that automatically generates conditions for pruning or approximation of an N-body problem from the problem's definition. To evaluate PASCAL, we developed tree-based algorithms for six well-known problems: k-nearest neighbors, range search, minimum spanning tree, kernel density estimation, expectation maximization, and Hausdorff distance. We show that applying domain-specific optimizations and parallelization to the algorithms written in PASCAL achieves 10x to 230x speedup over state-of-the-art libraries on real-world datasets, on a dual-socket Intel Xeon processor with 16 cores. Second, we extend the PASCAL framework to build PASCAL-X, which adds support for NUMA-aware parallelization. PASCAL-X also offers insight into tuning parameters of the space-partitioning trees: leaf size (which influences the shape of the tree) and cut-off level (which controls the granularity of tasks) yield performance improvements of up to 4.6x. A key goal is to generate scalable, high-performance code automatically without sacrificing productivity, which implies minimizing the effort users must put in to obtain the desired high-performance code. Another critical factor is adaptivity: the effort required to extend high-performance code generation to new N-body problems. Finally, considering these factors, we develop a domain-specific language and code generator named Portal, built on top of PASCAL-X. Portal's language design is inspired by the mathematical representation of N-body problems, resulting in an intuitive language for rapid implementation of a variety of problems. Portal's back end is designed and implemented to generate optimized, parallel, and scalable implementations for multicore systems.
We demonstrate that the performance achieved using Portal is comparable to that of expert hand-optimized code while providing productivity for domain scientists. For instance, using Portal for the k-nearest neighbors problem yields performance similar to hand-optimized code while reducing the lines of code by 68x. To the best of our knowledge, there are no known libraries or frameworks that implement parallel asymptotically optimal algorithms for the class of general N-body problems, and this thesis primarily aims to fill that gap. Finally, we present a case study of Portal on the real-world problem of face clustering. In this case study, we show that Portal not only provides a fast solution for the face clustering problem with accuracy similar to the state-of-the-art algorithm, but also provides productivity: the face clustering algorithm takes only 14 lines of Portal code
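The pruning idea behind PASCAL-style tree traversal can be illustrated with a toy k-d tree for k-nearest neighbors, one of the six problems evaluated above. This is a sketch of the general technique only, not Portal's generated code or API: a subtree is visited only if its splitting plane could hide a point closer than the current k-th best candidate.

```python
# Illustrative sketch (not PASCAL's actual code) of tree-based pruning:
# a k-d tree query descends a subtree only if its splitting plane could
# contain a point closer than the current k-th best distance.
import heapq

def build(points, depth=0):
    """Recursively split points on alternating axes (2-d k-d tree)."""
    if not points:
        return None
    axis = depth % 2
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"pt": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def knn(node, q, k, heap=None):
    """Max-heap of (-dist2, point); the root of the heap is the k-th best."""
    if heap is None:
        heap = []
    if node is None:
        return heap
    d2 = (node["pt"][0] - q[0]) ** 2 + (node["pt"][1] - q[1]) ** 2
    if len(heap) < k:
        heapq.heappush(heap, (-d2, node["pt"]))
    elif d2 < -heap[0][0]:
        heapq.heapreplace(heap, (-d2, node["pt"]))
    axis = node["axis"]
    diff = q[axis] - node["pt"][axis]
    near, far = (node["right"], node["left"]) if diff > 0 else (node["left"], node["right"])
    knn(near, q, k, heap)
    # Pruning condition: descend the far side only if the distance to the
    # splitting plane beats the current k-th best (or we lack k candidates).
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn(far, q, k, heap)
    return heap

pts = [(1, 1), (2, 2), (5, 5), (9, 9), (4, 1)]
best = sorted((-d, p) for d, p in knn(build(pts), (2, 1), 2))
print([p for _, p in best])
```

PASCAL generalizes exactly this structure: the pruning/approximation condition is derived automatically from the problem definition instead of being hand-written per problem, which is what makes one traversal template cover all six benchmarks.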
Estudio e implementación del algoritmo Barnes-Hut para el cálculo de la interacción gravitatoria entre N-cuerpos
[EN] In this work, we provide an original implementation of the Barnes-Hut algorithm in Python 3.7, in both 2D and 3D. This algorithm solves the N-body problem approximately and is well known for achieving O(N log N) order by treating groups of nearby bodies as single individuals when observed from a far enough distance. Besides, we came up with a clever scheme for grouping those bodies: a proof of concept that turned out to perform accurately. Further, we analyzed the validity range of our prototype and compared it against the direct-sum algorithm by means of a selected set of test examples. Our implementation of the BH algorithm relies heavily on a deeply nested tree data structure. As such, its manipulation is fundamentally recursive and highly complex. This is the reason why we have additionally included an extended explanation, with drawings and schemes, of the rather cumbersome bookkeeping strategy involved in its use. Finally, we have uploaded the full code
to the GitHub platform and thereby made it publicly available.
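The grouping idea described in the abstract above can be sketched compactly. The 2D quadtree below is an independent illustration of the standard Barnes-Hut scheme, not the code from this work: each internal cell stores total mass and centre of mass, and a cell is treated as a single pseudo-body whenever its size-to-distance ratio falls below an opening angle theta. All names are assumptions.

```python
# Minimal 2-D Barnes-Hut sketch (illustrative only): a quadtree whose cells
# store total mass and centre of mass, queried with the s/d < theta rule.
class Cell:
    def __init__(self, cx, cy, size):
        self.cx, self.cy, self.size = cx, cy, size   # centre and side length
        self.m, self.comx, self.comy = 0.0, 0.0, 0.0
        self.body, self.kids = None, None

    def insert(self, x, y, m):
        if self.m == 0 and self.body is None and self.kids is None:
            self.body = (x, y, m)                    # empty leaf: store body
        else:
            if self.kids is None:                    # occupied leaf: subdivide
                self.kids, h = [], self.size / 2
                for ox in (-h / 2, h / 2):
                    for oy in (-h / 2, h / 2):
                        self.kids.append(Cell(self.cx + ox, self.cy + oy, h))
                bx, by, bm = self.body
                self.body = None
                self._child(bx, by).insert(bx, by, bm)
            self._child(x, y).insert(x, y, m)
        # Update aggregate mass and centre of mass incrementally.
        self.comx = (self.comx * self.m + x * m) / (self.m + m)
        self.comy = (self.comy * self.m + y * m) / (self.m + m)
        self.m += m

    def _child(self, x, y):
        return self.kids[2 * (x >= self.cx) + (y >= self.cy)]

def accel(cell, x, y, theta=0.5, g=1.0):
    """Acceleration at (x, y); open a cell only when s/d >= theta."""
    if cell is None or cell.m == 0:
        return (0.0, 0.0)
    dx, dy = cell.comx - x, cell.comy - y
    d2 = dx * dx + dy * dy
    if d2 == 0:
        return (0.0, 0.0)                 # skip self-interaction
    if cell.kids is None or cell.size * cell.size < theta * theta * d2:
        f = g * cell.m / d2 ** 1.5        # far enough: one pseudo-body
        return (f * dx, f * dy)
    ax = ay = 0.0
    for kid in cell.kids:                 # too close: open the cell
        kx, ky = accel(kid, x, y, theta, g)
        ax += kx; ay += ky
    return (ax, ay)

root = Cell(0.5, 0.5, 1.0)
for x, y, m in [(0.1, 0.1, 1.0), (0.15, 0.12, 1.0), (0.9, 0.9, 1.0)]:
    root.insert(x, y, m)
print(accel(root, 0.9, 0.9))
```

The deep nesting and recursive bookkeeping the abstract mentions are visible even here: insertion, subdivision, and traversal all recurse over the same tree, and the incremental centre-of-mass update is the part most easily gotten wrong.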
Distribution independent parallel algorithms and software for hierarchical methods with applications to computational electromagnetics
Octrees are tree data structures used to represent multidimensional points in space. They are widely used to support hierarchical methods in scientific applications such as the N-body problem, molecular dynamics, and smoothed particle hydrodynamics. The size of an octree is known to depend on the spatial distribution of points in the computational domain, not just on the number of points. For this reason, the run time of any octree-based algorithm that depends on the octree's size is unknown for arbitrary distributions. In this thesis, we present the design and implementation of parallel algorithms for constructing compressed octrees and answering the queries typically used by hierarchical methods. Our parallel algorithms and implementation strategies perform well irrespective of the spatial distribution of data, are communication efficient, and require no explicit load balancing. We also developed a software library that provides parallel tree construction and various queries on compressed octrees. The purpose of the library is to enable rapid application development and to let developers use efficient parallel algorithms without needing detailed knowledge of the algorithms or having to implement them. To demonstrate the performance of our algorithms and the effectiveness of the library, we developed a complete end-to-end parallel electromagnetics code for computing the scattered electromagnetic fields from a Perfectly Electrically Conducting surface. We used the functions provided by the software library to develop a Fast Multipole Method based solution to this problem. Experimental results show that our algorithms scale well and have bounded communication irrespective of the shape of the scatterer
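Distribution-independent octree construction of the kind described above typically rests on space-filling-curve keys. The sketch below is a generic illustration, not the thesis's library: Morton keys interleave coordinate bits so that sorting points by key groups them by octant at every refinement level, which is the building block of linear and compressed octrees. The helper names are assumptions.

```python
# Generic Morton-key sketch (not the thesis's library): interleave the bits
# of integer coordinates so that sorting by key sorts points by octant.
def morton3(x, y, z, bits=10):
    """Interleave the low `bits` bits of integer coordinates x, y, z."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

def ancestor(key, level, bits=10):
    """Octree node containing `key` at a coarser `level` (0 = root)."""
    return key >> (3 * (bits - level))

pts = [(1, 0, 0), (0, 1, 0), (5, 5, 5), (1023, 0, 0)]
keys = sorted(morton3(*p) for p in pts)
# Points whose keys share a high-bit prefix share an octree node; a
# compressed octree keeps only the nodes where the point set actually
# splits, so its size tracks the distribution, not the depth.
print(keys)
print([ancestor(k, 1) for k in keys])
```

In a parallel setting, sorting the keys (e.g. with a distributed sample sort) and splitting the sorted sequence among processors is what gives construction costs that are independent of the spatial distribution, since each processor receives an equal share of keys regardless of where the points cluster.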