4 research outputs found

GOCE data analysis: optimized brute force solutions of large-scale linear equation systems on parallel computers

Author: Roth Matthias
Publication date: 21/12/2010

The satellite mission GOCE (Gravity field and steady-state Ocean Circulation Explorer) was set up to determine the figure of the Earth with unprecedented accuracy. The sampling frequency is 1 Hz, which results in a massive amount of data over the one-year period during which the satellite is intended to be operational. From these data we can set up an overdetermined linear system of equations to estimate the geopotential coefficients, which are required for modelling the Earth's gravity field with spherical harmonics at the desired high resolution. The linear system of equations is solved by "brute force", which means that the normal equation matrix has to be kept in memory as a whole. The normal equation matrix has a memory demand of up to 65 GByte; hence we need a computer that provides sufficient memory and multiple fast processors to complete the computations in a reasonable time. In this study, a program was written to compute the geopotential coefficients from simulated GOCE data, as real GOCE data were not yet available. As a first step, the program was optimized to make the computations more efficient. As a second step, the program was parallelized with two different techniques to speed up the computations. For the first version, the parallelization was done via OpenMP, which can be used on shared-memory systems, which usually have a small number of processors. For the second version, MPI was used, which is suited to a distributed-memory architecture and hence can incorporate many more processors in the computations. In summary, the optimization yielded a large gain in the program's efficiency, and the parallelization yielded a large further speed-up. The more processors are incorporated in the computation, however, the more the overall efficiency drops because of increasing communication between the processors. 
Here we could show that for huge problems the MPI version runs more efficiently than the OpenMP version.
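
The "brute-force" approach described in this abstract can be illustrated with a minimal sketch. It assumes a tiny dense problem and plain Python lists; the function names `accumulate_normal_equations` and `cholesky_solve` are hypothetical and are not taken from the thesis. The sketch accumulates the normal equation matrix N = AᵀA and the right-hand side b = Aᵀy one observation at a time, keeps N in memory as a whole, and then solves N x = b by Cholesky factorization. Whether the thesis uses Cholesky is not stated in the abstract; it is chosen here only because N is symmetric positive definite for a full-rank design matrix.

```python
import math


def accumulate_normal_equations(rows, observations, n_unknowns):
    """Build N = A^T A and b = A^T y by summing over observations.

    Each element of `rows` is one row of the design matrix A, so A itself
    never has to be stored; only the n x n matrix N is kept in memory.
    """
    N = [[0.0] * n_unknowns for _ in range(n_unknowns)]
    b = [0.0] * n_unknowns
    for a, y in zip(rows, observations):
        for i in range(n_unknowns):
            b[i] += a[i] * y
            # N is symmetric, so only the upper triangle is accumulated.
            for j in range(i, n_unknowns):
                N[i][j] += a[i] * a[j]
    # Mirror the upper triangle into the lower triangle.
    for i in range(n_unknowns):
        for j in range(i):
            N[i][j] = N[j][i]
    return N, b


def cholesky_solve(N, b):
    """Solve N x = b for symmetric positive definite N via N = L L^T."""
    n = len(b)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = N[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    # Forward substitution: L z = b
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    # Backward substitution: L^T x = z
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


# Toy example (made-up data): fit y = c0 + c1 * t to four exact samples.
times = [0.0, 1.0, 2.0, 3.0]
design_rows = [[1.0, t] for t in times]
samples = [2.0 + 3.0 * t for t in times]
N, b = accumulate_normal_equations(design_rows, samples, 2)
coefficients = cholesky_solve(N, b)  # approximately [2.0, 3.0]
```

The accumulation loop is the part that lends itself to parallelization: partial sums of N and b over disjoint subsets of observations can be computed independently and added together afterwards.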

Context adaptivity for selected computational kernels with applications in optoelectronics and in phylogenetics

Author: Schabauer Hannes
Publication date: 01/01/2010

Computational kernels are the crucial part of computationally intensive software, where most of the computing time is spent; hence, their design and implementation have to be carried out carefully. Two scientific application problems, from optoelectronics and from phylogenetics, and their corresponding computational kernels motivate this thesis. In the first application problem, components for the computational solution of complex symmetric eigenvalue problems (EVPs) are discussed, which arise in the simulation of waveguides in optoelectronics. LAPACK and ScaLAPACK contain highly effective reference implementations for certain numerical problems in linear algebra. With respect to EVPs, only real symmetric and complex Hermitian codes are available; therefore, efficient codes for complex symmetric (non-Hermitian) EVPs are highly desirable. In the second application problem, a parallel scientific workflow for computing phylogenies is designed, implemented, and evaluated. The reconstruction of phylogenetic trees is an NP-hard problem that demands large-scale computing capabilities, and therefore a parallel approach is necessary. One idea underlying this thesis is to investigate the interaction between the context of the kernels considered and their efficiency. The context of a computational kernel comprises model aspects (for instance, structure of input data), software aspects (for instance, computational libraries), hardware aspects (for instance, available RAM and supported precision), and certain requirements or constraints. 
Constraints may exist with respect to runtime, memory usage, required accuracy, etc. The concept of context adaptivity is demonstrated for selected problems in computational science. The method proposed here is a meta-algorithm that uses aspects of the context to achieve optimal performance with respect to the applied metric. It is important to consider the context, because requirements may be traded against each other, resulting in higher performance. For instance, if the required accuracy is low, a faster algorithmic approach may be favored over an established but slower method. With respect to EVPs, prototypical codes that specifically target complex symmetric EVPs aim at trading accuracy for speed. The innovation is evidenced by the implementation of new algorithmic approaches that exploit the algebraic structure. Concerning the computation of phylogenetic trees, the mapping of a scientific workflow onto a campus grid system is demonstrated. The adaptive implementation of the workflow features concurrent instances of a computational kernel on a distributed system. Here, adaptivity refers to the ability of the workflow to vary the computational load in terms of available computing resources, available time, and quality of the reconstructed phylogenetic trees. Context adaptivity is discussed by means of computational problems from optoelectronics and from phylogenetics. For the field of optoelectronics, a family of implemented algorithms aims at solving generalized complex symmetric EVPs. Our alternative approach, which exploits structural symmetry, trades accuracy for runtime: it is faster but (usually) less accurate than the conventional approach. In addition to a complete sequential solver, a parallel variant is discussed and partly evaluated on a cluster using up to 1024 CPU cores. The achieved runtimes demonstrate the superiority of our approach; however, further investigations into improving accuracy are suggested. 
For the field of phylogenetics, we show that phylogenetic tree reconstruction can be parallelized efficiently on a Condor-based campus grid infrastructure. The parallel scientific workflow has a low parallel overhead, resulting in excellent efficiency.
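
The idea of choosing a kernel from its context can be sketched as follows. This is a minimal illustration and not the meta-algorithm of the thesis: the class names `KernelVariant` and `Context`, the function `select_kernel`, and all numbers are hypothetical. The sketch assumes each variant comes with an estimated runtime, expected error, and memory demand, and it picks the fastest variant that satisfies the accuracy and memory constraints.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class KernelVariant:
    """One implementation of a kernel with estimated properties (made-up values)."""
    name: str
    relative_runtime: float  # lower is faster
    expected_error: float    # e.g. an expected relative residual
    memory_bytes: int


@dataclass(frozen=True)
class Context:
    """Constraints taken from the context in which the kernel runs."""
    max_error: float
    available_memory_bytes: int


def select_kernel(variants: Sequence[KernelVariant],
                  context: Context) -> Optional[KernelVariant]:
    """Return the fastest variant that meets the constraints, or None."""
    feasible = [
        v for v in variants
        if v.expected_error <= context.max_error
        and v.memory_bytes <= context.available_memory_bytes
    ]
    return min(feasible, key=lambda v: v.relative_runtime, default=None)


# Illustrative variants only: a slower, more accurate conventional solver
# and a faster, less accurate structure-exploiting solver.
VARIANTS = [
    KernelVariant("conventional", 1.0, 1e-12, 2_000_000),
    KernelVariant("structure_exploiting", 0.4, 1e-8, 1_000_000),
]

loose = select_kernel(VARIANTS, Context(max_error=1e-6,
                                        available_memory_bytes=4_000_000))
strict = select_kernel(VARIANTS, Context(max_error=1e-10,
                                         available_memory_bytes=4_000_000))
```

With a loose accuracy requirement the faster variant is selected; with a strict one only the conventional variant remains feasible, which mirrors the trade-off between accuracy and runtime described in the abstract.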

OTHES

Software Roadmap to Plug and Play Petaflop/s

Publication venue: Office of Scientific and Technical Information (OSTI)

Crossref

Tuning Parallel Applications in Parallel

Author: Tiwari Ananta N.
Publication date: 01/01/2011

Auto-tuning has recently received significant attention from the High Performance Computing community. Most auto-tuning approaches are specialized to work either on specific domains such as dense linear algebra and stencil computations, or only at certain stages of program execution such as compile time and runtime. Real scientific applications, however, demand a cohesive environment that can efficiently provide auto-tuning solutions at all stages of application development and deployment. Towards that end, we describe a unified end-to-end approach to auto-tuning scientific applications. Our system, Active Harmony, takes a search-based collaborative approach to auto-tuning. Application programmers, library writers and compilers collaborate to describe and export a set of performance related tunable parameters to the Active Harmony system. These parameters define a tuning search-space. The auto-tuner monitors the program performance and suggests adaptation decisions. The decisions are made by a central controller using a parallel search algorithm. The algorithm leverages parallel architectures to search across a set of optimization parameter values. Different nodes of a parallel system evaluate different configurations at each timestep. Active Harmony supports runtime adaptive code-generation and tuning for parameters that require new code (e.g. unroll factors). Effectively, we merge traditional feedback directed optimization and just-in-time compilation. This feature also enables application developers to write applications once and have the auto-tuner adjust the application behavior automatically when run on new systems. We evaluated our system on multiple large-scale parallel applications and showed that our system can improve the execution time by up to 46% compared to the original version of the program. 
Finally, we believe that the success of any auto-tuning research depends on how effectively application developers, domain experts and auto-tuners communicate and work together. To that end, we have developed and released a simple and extensible language that standardizes the parameter space representation. Using this language, developers and researchers can collaborate to export tunable parameters to the tuning frameworks. Relationships (e.g. ordering, dependencies, constraints, ranking) between tunable parameters, as well as search hints, can also be expressed.
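
A search-based tuner of the kind described in this abstract can be sketched in a few lines. This is not Active Harmony's API or its actual search algorithm: the functions `neighbours` and `tune`, the parameter names, and the synthetic cost function are all hypothetical. The sketch assumes integer parameters with bounds and evaluates a batch of candidate configurations concurrently at each step, using a thread pool as a stand-in for the different nodes of a parallel system.

```python
from concurrent.futures import ThreadPoolExecutor


def neighbours(config, bounds):
    """Configurations that differ from `config` by +/-1 in one parameter."""
    result = []
    for i, (low, high) in enumerate(bounds):
        for step in (-1, 1):
            value = config[i] + step
            if low <= value <= high:
                result.append(config[:i] + (value,) + config[i + 1:])
    return result


def tune(measure, start, bounds, max_steps=100, workers=4):
    """Greedy search: at each step, measure all neighbours concurrently
    and move to the best one until no neighbour improves the cost."""
    best, best_cost = start, measure(start)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_steps):
            candidates = neighbours(best, bounds)
            if not candidates:
                break
            costs = list(pool.map(measure, candidates))
            cost, candidate = min(zip(costs, candidates))
            if cost >= best_cost:
                break  # local minimum reached
            best, best_cost = candidate, cost
    return best, best_cost


def synthetic_runtime(config):
    """Made-up cost model with its minimum at unroll=4, tile=16."""
    unroll, tile = config
    return (unroll - 4) ** 2 + (tile - 16) ** 2


# Search space: unroll factor in [1, 8], tile size in [1, 32].
best_config, best_runtime = tune(synthetic_runtime, (1, 8), [(1, 8), (1, 32)])
```

In a real setting `measure` would run or time the application with the given configuration; the greedy neighbourhood search used here is only the simplest choice and can get stuck in local minima on non-convex search spaces.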

Digital Repository at the University of Maryland