
    Heterogeneous multicore systems for signal processing

    This thesis explores the capabilities of heterogeneous multi-core systems based on multiple Graphics Processing Units (GPUs) in a standard desktop framework. Multi-GPU accelerated desk-side computers are an appealing alternative to other high performance computing (HPC) systems: being composed of commodity hardware components fabricated in large quantities, their price-performance ratio is unparalleled in the world of high performance computing. Essentially bringing “supercomputing to the masses”, this opens up new possibilities for application fields where investing in HPC resources had previously been considered unfeasible. One of these is the field of bioelectrical imaging, a class of medical imaging technologies that occupy a low-cost niche next to million-dollar systems like functional Magnetic Resonance Imaging (fMRI). In the scope of this work, several computational challenges encountered in bioelectrical imaging are tackled with this new kind of computing resource, striving to help these methods approach their true potential. Specifically, the following main contributions were made: Firstly, a novel dual-GPU implementation of parallel triangular matrix inversion (TMI) is presented, addressing a crucial kernel in the computation of multi-mesh head models for electroencephalographic (EEG) source localization. This includes not only a highly efficient implementation of the routine itself, achieving excellent speedups over an optimized CPU implementation, but also a novel GPU-friendly compressed storage scheme for triangular matrices. Secondly, a scalable multi-GPU solver for non-Hermitian linear systems was implemented. It is integrated into a simulation environment for electrical impedance tomography (EIT) that requires frequent solution of complex systems with millions of unknowns, a task that this solution can perform within seconds. In terms of computational throughput, it outperforms not only a highly optimized multi-CPU reference, but related GPU-based work as well. Finally, a GPU-accelerated graphical EEG real-time source localization software was implemented. Thanks to the acceleration, it can meet real-time requirements at unprecedented anatomical detail while running more complex localization algorithms. Additionally, a novel implementation to extract anatomical priors from static Magnetic Resonance (MR) scans has been included.
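    The abstract does not spell out the compressed triangular storage scheme, so the sketch below is only a hedged illustration of the general idea: a lower-triangular matrix is packed column by column into a flat array holding the n(n+1)/2 non-zero entries, which is what makes such layouts attractive on GPUs. The function names and the NumPy setting are assumptions for illustration, not the thesis's actual GPU layout.

        import numpy as np

        def packed_index(i, j, n):
            # Index of element (i, j), with i >= j, of an n x n lower-triangular matrix
            # stored column by column: column j holds the n - j entries of rows j..n-1.
            return (i - j) + j * n - j * (j - 1) // 2

        def pack_lower(L):
            n = L.shape[0]
            ap = np.empty(n * (n + 1) // 2, dtype=L.dtype)   # n(n+1)/2 values instead of n*n
            for j in range(n):
                for i in range(j, n):
                    ap[packed_index(i, j, n)] = L[i, j]
            return ap

        def unpack_lower(ap, n):
            L = np.zeros((n, n), dtype=ap.dtype)
            for j in range(n):
                for i in range(j, n):
                    L[i, j] = ap[packed_index(i, j, n)]
            return L

        if __name__ == "__main__":
            rng = np.random.default_rng(0)
            n = 5
            L = np.tril(rng.random((n, n)))
            ap = pack_lower(L)
            assert np.allclose(unpack_lower(ap, n), L)
            # A TMI kernel would operate directly on ap, never touching the zero upper half.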

    Towards Distributed Task-based Visualization and Data Analysis

    To support scientific work with large and complex data, the field of scientific visualization emerged in computer science; it produces images through computational analysis of the data. Frameworks that allow the user to combine different analysis and visualization modules into flexible pipelines set the standard for interactive scientific visualization used by domain scientists. Existing frameworks employ a thread-parallel, message-passing approach to parallel and distributed scalability, leaving the field of scientific visualization in high performance computing to specialized ad-hoc implementations. The task-parallel programming paradigm promises better scalability and portability for high performance computing implementations; this thesis therefore aims towards the creation of a framework for distributed, task-based visualization modules and pipelines. The major contribution of the thesis is the establishment of modules for Merge Tree construction and, based on the former, topological simplification. Such modules already form a necessary first step for most visualization pipelines and can be expected to increase in importance for larger and more complex data produced and/or analysed by high performance computing. To create a task-parallel, distributed Merge Tree construction module, the construction process has to be completely revised. We derive a novel property of Merge Tree saddles and introduce a novel task-parallel, distributed Merge Tree construction method that offers both good performance and scalability. This forms the basis for a topological simplification module, which we extend by introducing novel, alternative simplification parameters that reduce the reliance on prior domain knowledge and thus increase flexibility in typical high performance computing scenarios. Both modules lay the groundwork for subsequent analysis and visualization steps and form a fundamental step towards an extensive task-parallel visualization pipeline framework for high performance computing.
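    To make the Merge Tree concept concrete, the toy sketch below builds the join tree of a one-dimensional scalar field with the standard sequential sweep-and-union-find construction. It is shown only for illustration; it is not the task-parallel, distributed method developed in the thesis, and the function name and 1-D setting are assumptions.

        import numpy as np

        def join_tree_1d(values):
            # Join tree of a 1-D scalar field: leaves are local minima, inner nodes are
            # the vertices at which two components of the sublevel sets merge.
            n = len(values)
            order = np.argsort(values, kind="stable")
            parent = {}    # union-find forest over already-processed vertices
            highest = {}   # component root -> vertex of that component's most recent tree node
            arcs = []      # (lower_node, upper_node) arcs of the join tree

            def find(x):
                while parent[x] != x:
                    parent[x] = parent[parent[x]]   # path compression
                    x = parent[x]
                return x

            for v in order:                  # sweep vertices from lowest to highest value
                parent[v] = v
                highest[v] = v
                for u in (v - 1, v + 1):     # 1-D neighbourhood
                    if 0 <= u < n and u in parent:
                        ru, rv = find(u), find(v)
                        if ru != rv:
                            arcs.append((highest[ru], v))   # that component merges into v
                            parent[ru] = rv
                            highest[rv] = v
            return arcs

        if __name__ == "__main__":
            field = [0.0, 3.0, 1.0, 4.0, 2.0]
            print(join_tree_1d(field))   # [(0, 1), (2, 1), (1, 3), (4, 3)]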

    Massively Parallel Oil Reservoir Simulation for History Matching


    Context adaptivity for selected computational kernels with applications in optoelectronics and in phylogenetics

    Computational kernels are the crucial part of computationally intensive software, where most of the computing time is spent; hence, their design and implementation have to be accomplished carefully. Two scientific application problems, from optoelectronics and from phylogenetics, and their corresponding computational kernels motivate this thesis. In the first application problem, components for the computational solution of complex symmetric eigenvalue problems (EVPs) are discussed, arising in the simulation of waveguides in optoelectronics. LAPACK and ScaLAPACK contain highly effective reference implementations for certain numerical problems in linear algebra. With respect to EVPs, however, only real symmetric and complex Hermitian codes are available; efficient codes for complex symmetric (non-Hermitian) EVPs are therefore highly desirable. In the second application problem, a parallel scientific workflow for computing phylogenies is designed, implemented, and evaluated. The reconstruction of phylogenetic trees is an NP-hard problem that demands enormous computing capacity, so a parallel approach is necessary. One idea underlying this thesis is to investigate the interaction between the context of the kernels considered and their efficiency. The context of a computational kernel comprises model aspects (for instance, the structure of the input data), software aspects (for instance, computational libraries), hardware aspects (for instance, available RAM and supported precision), and certain requirements or constraints. Constraints may exist with respect to runtime, memory usage, required accuracy, and so on. The concept of context adaptivity is demonstrated for selected problems in computational science. The method proposed here is a meta-algorithm that utilizes aspects of the context to achieve optimal performance with respect to the applied metric. It is important to consider the context because requirements may be traded against each other, resulting in higher performance. For instance, if only low accuracy is required, a faster algorithmic approach may be favored over an established but slower method. With respect to EVPs, prototypical codes that are especially targeted at complex symmetric EVPs aim at trading accuracy for speed. The innovation is evidenced by new algorithmic approaches exploiting the algebraic structure. Concerning the computation of phylogenetic trees, the mapping of a scientific workflow onto a campus grid system is demonstrated. The adaptive implementation of the workflow features concurrent instances of a computational kernel on a distributed system. Here, adaptivity refers to the ability of the workflow to vary the computational load in terms of available computing resources, available time, and quality of the reconstructed phylogenetic trees. Context adaptivity is discussed by means of computational problems from optoelectronics and from phylogenetics. For the field of optoelectronics, a family of implemented algorithms aims at solving generalized complex symmetric EVPs. Our alternative approach exploits the symmetric structure and trades accuracy for speed; hence, it is faster but (usually) less accurate than the conventional approach. In addition to a complete sequential solver, a parallel variant is discussed and partly evaluated on a cluster utilizing up to 1024 CPU cores. The achieved runtimes demonstrate the superiority of our approach; however, further investigations on improving accuracy are suggested. For the field of phylogenetics, we show that phylogenetic tree reconstruction can be efficiently parallelized on a Condor-based campus grid infrastructure. The parallel scientific workflow features a moderate parallel overhead, resulting in excellent efficiency.
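    As a small, hedged illustration of why structure-exploiting solvers for complex symmetric (non-Hermitian) EVPs are attractive, the sketch below tridiagonalises a complex symmetric matrix with the well-known complex symmetric Lanczos recurrence, which uses the bilinear form x.T @ y instead of the Hermitian inner product. This is a textbook variant shown for orientation only, not the solver family developed in the thesis; it has no re-orthogonalisation, can break down, and all names and test values are assumptions.

        import numpy as np

        def complex_symmetric_lanczos(A, m):
            # m steps of Lanczos tridiagonalisation for a complex *symmetric* matrix
            # (A == A.T, not Hermitian), using the bilinear form x.T @ y without
            # conjugation. No re-orthogonalisation; may break down if w.T @ w ~ 0.
            n = A.shape[0]
            rng = np.random.default_rng(0)
            v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
            v = v / np.sqrt(v @ v)                  # bilinear "normalisation", not the 2-norm
            v_prev = np.zeros(n, dtype=complex)
            alphas, betas, beta = [], [], 0.0
            for j in range(m):
                w = A @ v
                alpha = v @ w                       # v.T A v
                alphas.append(alpha)
                w = w - alpha * v - beta * v_prev
                if j == m - 1:
                    break
                beta = np.sqrt(w @ w)
                if abs(beta) < 1e-13:               # (near-)breakdown: stop early
                    break
                betas.append(beta)
                v_prev, v = v, w / beta
            # Complex symmetric tridiagonal matrix whose eigenvalues approximate those of A.
            return np.diag(alphas) + np.diag(betas, 1) + np.diag(betas, -1)

        if __name__ == "__main__":
            n = 6
            rng = np.random.default_rng(1)
            B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
            A = B + B.T                             # complex symmetric, NOT Hermitian
            ritz = np.sort_complex(np.linalg.eigvals(complex_symmetric_lanczos(A, n)))
            ref = np.sort_complex(np.linalg.eigvals(A))   # general-purpose reference solver
            print(np.abs(ritz - ref).max())         # deviation of the Ritz values from the spectrum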

    Scalable Applications on Heterogeneous System Architectures: A Systematic Performance Analysis Framework

    The efficient parallel execution of scientific applications is a key challenge in high-performance computing (HPC). With growing parallelism and heterogeneity of compute resources as well as increasingly complex software, performance analysis has become an indispensable tool in the development and optimization of parallel programs. This thesis presents a framework for the systematic performance analysis of scalable, heterogeneous applications. Based on event traces, it automatically detects the critical path and inefficiencies that result in waiting or idle time, e.g. due to load imbalances between parallel execution streams. As a prerequisite for the analysis of heterogeneous programs, this thesis specifies inefficiency patterns for computation offloading. Furthermore, an essential contribution was made to the development of tool interfaces for OpenACC and OpenMP, which enable portable data acquisition and subsequent analysis for programs with offload directives. These interfaces are now part of the latest OpenACC and OpenMP API specifications. The aforementioned work, existing preliminary work, and established analysis methods are combined into a generic analysis process that can be applied across programming models. Based on the detection of wait and idle states, which can propagate over several levels of parallelism, the analysis identifies wasted computing resources and their root cause as well as the critical-path share of each program region. It thus determines the influence of program regions on the load balancing between execution streams and on the program runtime. The analysis results include a summary of the detected inefficiency patterns and a program trace enhanced with information about wait states, their cause, and the critical path. In addition, a ranking based on the amount of waiting time a program region caused on the critical path highlights program regions that are relevant for optimization. The scalability of the proposed performance analysis and its implementation is demonstrated using High-Performance Linpack (HPL), and the analysis results are validated with synthetic programs. A scientific application that uses MPI, OpenMP, and CUDA simultaneously is investigated to show the applicability of the analysis.
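    As a hedged, minimal illustration of the critical-path idea (not the framework's actual trace analysis): activities on parallel execution streams form a directed acyclic graph through their dependencies, the critical path is the dependency chain with the largest accumulated duration, and idle time spent waiting for a predecessor on another stream is a potential optimization target. All region names, durations, and dependencies below are invented.

        def critical_path(durations, deps):
            # Longest path through a DAG of activities.
            # durations: {activity: time}; deps: {activity: [predecessor activities]}.
            # Returns (total time, activities on the critical path in order).
            finish, best_pred, done = {}, {}, set()

            def visit(a):
                if a in done:
                    return finish[a]
                start, best_pred[a] = 0.0, None
                for p in deps.get(a, []):
                    if visit(p) > start:
                        start, best_pred[a] = finish[p], p
                finish[a] = start + durations[a]
                done.add(a)
                return finish[a]

            for a in durations:
                visit(a)
            end = max(finish, key=finish.get)
            path, a = [], end
            while a is not None:
                path.append(a)
                a = best_pred[a]
            return finish[end], list(reversed(path))

        if __name__ == "__main__":
            # Two streams; "recv" on stream 1 depends on "send" on stream 0.
            durations = {"comp0": 4.0, "send": 1.0, "comp1": 2.0, "recv": 0.5, "post1": 3.0}
            deps = {"send": ["comp0"], "recv": ["send", "comp1"], "post1": ["recv"]}
            print(critical_path(durations, deps))   # (8.5, ['comp0', 'send', 'recv', 'post1'])
            # Stream 1 could start "recv" at t = 2.0 after "comp1", but its predecessor
            # "send" only finishes at t = 5.0, so the stream idles for 3.0 time units.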

    Development of advanced geometric models and acceleration techniques for Monte Carlo simulation in Medical Physics

    Monte Carlo simulation of radiation transport is currently applied in a large variety of areas. However, the geometric models implemented in most general-purpose codes impose limitations on the shape of the objects that can be defined. These models are not well suited to represent the free-form (i.e., arbitrary) shapes found in anatomic structures or complex medical devices. As a result, some clinical applications that require the use of highly detailed phantoms cannot be properly addressed. This thesis is devoted to the development of advanced geometric models and acceleration techniques that facilitate the use of state-of-the-art Monte Carlo simulation in medical physics applications involving detailed anatomical phantoms. To this end, two new codes, based on the PENELOPE package, have been developed. The first code, penEasy, implements a modular, general-purpose main program and provides various source models and tallies that can be readily used to simulate a wide spectrum of problems. Its associated geometry routines, penVox, extend the standard PENELOPE geometry, based on quadric surfaces, to allow the definition of voxelised phantoms. Such phantoms can be generated from the information provided by a computed tomography scan and, therefore, penVox is convenient for simulating problems that require the anatomy of real patients (e.g., radiotherapy treatment planning). The second code, penMesh, utilises closed triangle meshes to define the boundary of each simulated object. This approach, which is frequently used in computer graphics and computer-aided design, makes it possible to represent arbitrary surfaces and is suitable for simulations requiring high anatomical detail (e.g., medical imaging). To alleviate the long execution times, the algorithms implemented in the new codes have been accelerated with sophisticated techniques such as an octree structure. A set of software tools for the parallelisation of Monte Carlo simulations, clonEasy, has also been developed. These tools can reduce the simulation time by a factor that is roughly proportional to the number of processors available and, therefore, facilitate the study of complex settings that may require unaffordable execution times in a sequential simulation. The computer codes presented in this thesis have been tested in realistic medical physics applications and compared with other Monte Carlo codes and experimental data. They are therefore ready to be publicly distributed as free and open software and applied to real-life problems.
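    As a hedged illustration of the voxelised-geometry idea behind penVox-style phantoms (this is not the actual PENELOPE/penVox code): a particle position is mapped onto the index of the voxel it lies in, which in turn yields the material at that point from a CT-derived 3-D array. The class name, grid, and material IDs below are assumptions for illustration.

        import numpy as np

        class VoxelPhantom:
            # A rectangular phantom: a 3-D array of material IDs on a regular grid.
            def __init__(self, materials, origin, voxel_size):
                self.materials = np.asarray(materials)          # shape (nx, ny, nz), integer material IDs
                self.origin = np.asarray(origin, dtype=float)   # corner of voxel (0, 0, 0)
                self.voxel_size = np.asarray(voxel_size, dtype=float)

            def material_at(self, position):
                # Material ID at a point, or None if the point lies outside the phantom.
                idx = np.floor((np.asarray(position) - self.origin) / self.voxel_size).astype(int)
                if np.any(idx < 0) or np.any(idx >= self.materials.shape):
                    return None
                return int(self.materials[tuple(idx)])

        if __name__ == "__main__":
            # 4 x 4 x 4 phantom with 1 cm voxels: material 1 everywhere, with a cube of
            # material 2 embedded in one corner -- purely illustrative values.
            mat = np.ones((4, 4, 4), dtype=int)
            mat[2:, 2:, 2:] = 2
            phantom = VoxelPhantom(mat, origin=(0.0, 0.0, 0.0), voxel_size=(1.0, 1.0, 1.0))
            print(phantom.material_at((0.5, 0.5, 0.5)))   # 1
            print(phantom.material_at((3.2, 3.9, 2.1)))   # 2
            print(phantom.material_at((5.0, 0.0, 0.0)))   # None (outside the phantom)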

    An insight in cloud computing solutions for intensive processing of remote sensing data

    The investigation of Earth's surface deformation phenomena provides critical insights into several processes of great interest to science and society, especially from the perspective of further understanding the Earth system and the impact of human activities. Indeed, the study of ground deformation phenomena can be helpful for the comprehension of the geophysical dynamics dominating natural hazards such as earthquakes, volcanoes, and landslides. In this context, microwave space-borne Earth Observation (EO) techniques represent very powerful instruments for ground deformation estimation. In particular, the Small BAseline Subset (SBAS) approach is regarded as one of the key techniques for its ability to investigate surface deformation affecting large areas of the Earth with centimeter- to millimeter-level accuracy in different scenarios (volcanoes, tectonics, landslides, anthropogenically induced land motion). The current remote sensing scenario is characterized by the availability of huge archives of radar data, which will grow further with the advent of the Sentinel-1 satellites. The effective exploitation of this large amount of data requires both adequate computing resources and advanced algorithms able to properly exploit such facilities. In this work we concentrate on the use of the Parallel Small BAseline Subset (P-SBAS) algorithm, a parallel version of SBAS, within HPC infrastructures, in order to investigate the effectiveness of such technologies for EO applications. In particular, we demonstrate that cloud computing solutions represent a valid alternative for scientific applications and a promising research scenario: in all the experiments we conducted and the results obtained with P-SBAS processing, cloud technologies proved to be fully competitive in performance with an in-house HPC cluster solution.
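    As a hedged, toy-scale illustration of the inversion at the heart of the SBAS idea (not the P-SBAS implementation itself): each small-baseline interferogram constrains the phase difference between two acquisition dates, and a least-squares solution of the resulting linear system recovers the displacement history of a pixel. The dates, pairs, and values below are invented.

        import numpy as np

        def invert_time_series(pairs, dphi, n_dates):
            # Recover per-date phase (relative to date 0) from interferometric pairs.
            # pairs: list of (i, j) date indices with i < j; dphi[k] ~ phi[j] - phi[i].
            A = np.zeros((len(pairs), n_dates - 1))   # unknowns: phi[1..n_dates-1], phi[0] fixed to 0
            for k, (i, j) in enumerate(pairs):
                if i > 0:
                    A[k, i - 1] = -1.0
                A[k, j - 1] = 1.0
            phi, *_ = np.linalg.lstsq(A, np.asarray(dphi), rcond=None)
            return np.concatenate(([0.0], phi))

        if __name__ == "__main__":
            true_phi = np.array([0.0, 0.4, 1.1, 1.5, 2.3])            # synthetic deformation history
            pairs = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (2, 4)]  # small-baseline date pairs
            dphi = [true_phi[j] - true_phi[i] for i, j in pairs]
            print(invert_time_series(pairs, dphi, n_dates=5))          # ~ [0, 0.4, 1.1, 1.5, 2.3]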

    Data management in dynamic distributed computing environments

    Data management in parallel computing systems is a broad and increasingly important research topic. As network speeds have surged, so too has the movement to transition storage and computation loads to wide-area network resources. The Grid, the Cloud, and Desktop Grids all represent different aspects of this movement towards highly scalable, distributed, and utility computing. This dissertation contends that a peer-to-peer (P2P) networking paradigm is a natural match for data sharing within and between these heterogeneous network architectures. Peer-to-peer features such as dynamic discovery, fault tolerance, scalability, and ad-hoc security infrastructures provide excellent mappings for many of the requirements of today’s distributed computing environments. In recent years, volunteer Desktop Grids have seen a growth in data throughput as application areas expand and new problem sets emerge. These increasing data needs require storage networks that can scale to meet future demand while also facilitating expansion into new data-intensive research areas. Current practice is to mirror data from centralized locations, a technique that is not practical for growing data sets, dynamic projects, or data-intensive applications. The fusion of Desktop and Service Grids provides an ideal use case for researching peer-to-peer data distribution strategies in a hybrid environment: Desktop Grids have a data management gap, while integration with Service Grids raises new challenges with regard to cross-platform design. The work undertaken here is two-fold: first, it explores how P2P techniques can be leveraged to meet the data management needs of Desktop Grids, and second, it shows how the same distribution paradigm can provide migration paths for Service Grid data. The result of this research is a Peer-to-Peer Architecture for Data-Intensive Cycle Sharing (ADICS) that is capable not only of distributing volunteer computing data, but also of providing a transitional platform and storage space for migrating Service Grid jobs to Desktop Grid environments.
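    The abstract does not name a specific placement mechanism, so purely as a hedged illustration of the kind of decentralised data-distribution technique common in P2P storage, the sketch below uses consistent hashing to assign data chunks to peers, so that peers joining or leaving only remap a small fraction of the chunks. It is not the ADICS design; peer names and chunk identifiers are invented.

        import bisect
        import hashlib

        class ConsistentHashRing:
            # Map data chunks to peers so that adding or removing a peer only moves a
            # small fraction of the chunks -- a common building block of P2P storage.
            def __init__(self, peers, replicas=50):
                self.ring = sorted((self._hash(f"{p}#{r}"), p) for p in peers for r in range(replicas))
                self.keys = [h for h, _ in self.ring]

            @staticmethod
            def _hash(value):
                return int(hashlib.sha1(value.encode()).hexdigest(), 16)

            def peer_for(self, chunk_id):
                # First ring position at or after the chunk's hash (wrapping around).
                i = bisect.bisect(self.keys, self._hash(chunk_id)) % len(self.ring)
                return self.ring[i][1]

        if __name__ == "__main__":
            ring = ConsistentHashRing(["peer-a", "peer-b", "peer-c"])
            for chunk in ("input.dat:0", "input.dat:1", "input.dat:2", "input.dat:3"):
                print(chunk, "->", ring.peer_for(chunk))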