
    Data Parallel Line Relaxation (DPLR) Code User Manual: Acadia - Version 4.01.1

    The Data-Parallel Line Relaxation (DPLR) code is a computational fluid dynamics (CFD) solver that was developed at NASA Ames Research Center to help mission support teams generate high-value predictive solutions for hypersonic flow field problems. The DPLR Code Package is an MPI-based, parallel, full three-dimensional Navier-Stokes CFD solver with generalized models for finite-rate reaction kinetics, thermal and chemical non-equilibrium, accurate high-temperature transport coefficients, and ionized flow physics. DPLR also includes a large selection of generalized, realistic surface boundary conditions, as well as links that enable loose coupling with external thermal protection system (TPS) material response and shock layer radiation codes.
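    The line-relaxation idea behind the method's name can be illustrated with a toy example: the implicit update is made exact along one grid line at a time (a tridiagonal solve), while coupling to neighbouring lines is treated explicitly, so independent lines can be distributed across parallel ranks. The sketch below applies this to a model Poisson problem; the scalar five-point stencil and the Thomas solver are illustrative assumptions, not DPLR's actual block-tridiagonal discretisation.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Direct solve of a tridiagonal system with sub/main/super diagonals a, b, c."""
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                     # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):            # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

def line_relaxation_sweep(u, f, h):
    """One line-implicit relaxation sweep for -laplace(u) = f on a uniform grid:
    each x-line is solved exactly; neighbouring lines enter the RHS explicitly."""
    ny, nx = u.shape
    n = nx - 2
    for j in range(1, ny - 1):
        a = np.full(n, -1.0); b = np.full(n, 4.0); c = np.full(n, -1.0)
        a[0] = 0.0; c[-1] = 0.0
        rhs = h * h * f[j, 1:-1] + u[j - 1, 1:-1] + u[j + 1, 1:-1]
        rhs[0] += u[j, 0]; rhs[-1] += u[j, -1]  # fixed boundary values
        u[j, 1:-1] = thomas_solve(a, b, c, rhs)
    return u

u = np.zeros((32, 32))                        # boundary held at zero
f = np.ones((32, 32))
for _ in range(50):                           # repeated sweeps converge
    u = line_relaxation_sweep(u, f, h=1.0 / 31)
```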

    A multi-tier cached I/O architecture for massively parallel supercomputers

    Recent advances in storage technologies and high-performance interconnects have made it possible in recent years to build increasingly powerful storage systems that serve thousands of nodes. The majority of the storage systems of the clusters and supercomputers on the Top 500 list are managed by one of three scalable parallel file systems: GPFS, PVFS, and Lustre. Most large-scale scientific parallel applications are written using the Message Passing Interface (MPI), which has become the de facto standard for scalable distributed-memory machines. One part of the MPI standard is related to I/O and has among its main goals the portability and efficiency of file system accesses. All of the above-mentioned parallel file systems can also be accessed through the MPI-IO interface. The I/O access patterns of scientific parallel applications often consist of accesses to a large number of small, non-contiguous pieces of data. For small accesses, performance is dominated by the latency of network transfers and disks. Parallel scientific applications produce interleaved file access patterns with high interprocess spatial locality at the I/O nodes. Additionally, scientific applications exhibit repetitive behaviour when a loop, or a function containing loops, issues I/O requests. When I/O access patterns are repetitive, caching and prefetching can effectively mask their access latency. These characteristics of the access patterns have motivated several researchers to propose parallel I/O optimizations at both the library and file system levels. However, these optimizations are not always integrated across the different layers of the system. In this dissertation we propose a novel, generic parallel I/O architecture for clusters and supercomputers. Our design is aimed at large-scale parallel architectures with thousands of compute nodes. Besides acting as middleware for existing parallel file systems, our architecture provides on-line virtualization of storage resources. Another objective of this thesis is to factor out the common parallel I/O functionality of clusters and supercomputers into generic modules, in order to facilitate porting of scientific applications across these platforms. Our solution is based on a multi-tier cache architecture, collective I/O, and asynchronous data-staging strategies that hide the latency of data transfer between cache tiers. The thesis aims to reduce the file access latency perceived by data-intensive parallel scientific applications through multi-layer asynchronous data transfers. To accomplish this objective, our techniques leverage multi-core architectures by overlapping computation with communication and I/O in parallel threads. Prototypes of our solutions have been deployed on both clusters and Blue Gene supercomputers. Performance evaluation shows that the combination of collective strategies with overlapping of computation, communication, and I/O can bring a substantial performance benefit for access patterns common in parallel scientific applications.
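    For illustration, the core overlap idea — hiding write latency behind computation by staging data asynchronously through a bounded in-memory buffer — can be sketched in a few lines. This is a minimal single-node sketch, not the dissertation's multi-tier architecture; the file name, buffer depth, and the numpy stand-in for the computation are assumptions.

```python
import threading, queue
import numpy as np

def async_writer(q, path):
    """Background thread: drain staged blocks and write them to disk,
    overlapping the write latency with the main compute loop."""
    with open(path, "wb") as f:
        while True:
            block = q.get()
            if block is None:          # sentinel: no more data
                break
            f.write(block.tobytes())

stage = queue.Queue(maxsize=8)         # bounded staging buffer (one cache tier)
writer = threading.Thread(target=async_writer, args=(stage, "out.bin"))
writer.start()

data = np.zeros((1024, 1024))
for step in range(100):
    data = np.sin(data) + 0.1          # stand-in for one simulation step
    stage.put(data.copy())             # stage a snapshot; returns immediately
                                       # unless the buffer is full
stage.put(None)
writer.join()
```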

    Belle II Technical Design Report

    The Belle detector at the KEKB electron-positron collider has collected almost 1 billion Υ(4S) events in its decade of operation. Super-KEKB, an upgrade of KEKB, is under construction to increase the luminosity by two orders of magnitude during a three-year shutdown, with an ultimate goal of 8 × 10^35 cm^-2 s^-1 luminosity. To exploit the increased luminosity, an upgrade of the Belle detector has been proposed. A new international collaboration, Belle II, is being formed. The Technical Design Report presents the physics motivation, the basic methods of the accelerator upgrade, and the key improvements of the detector.
    Comment: Edited by Z. Doležal and S. Uno

    Predictive analysis and optimisation of pipelined wavefront applications using reusable analytic models

    Pipelined wavefront computations are a ubiquitous class of high-performance parallel algorithms used in the solution of many scientific and engineering applications. In order to aid the design and optimisation of these applications, and to ensure that during procurement the platforms best suited to these codes are chosen, there has been considerable research into analysing and evaluating their operational performance. Wavefront codes exhibit complex computation, communication, and synchronisation patterns, and as a result there exists a large variety of such codes and possible optimisations. The problem is compounded by each new generation of high performance computing system, which has often introduced a previously unexplored architectural trait, requiring previous performance models to be rewritten and reevaluated. In this thesis, we address the performance modelling and optimisation of this class of application as a whole. This differs from previous studies, in which bespoke models are applied to specific applications. The analytic performance models are generalised and reusable, and we demonstrate their application to the predictive analysis and optimisation of pipelined wavefront computations running on modern high performance computing systems. The performance model is based on the LogGP parameterisation and uses a small number of input parameters to specify the particular behaviour of most wavefront codes. The new parameters and model equations capture the key structural and behavioural differences among wavefront application codes, providing a succinct summary of the operations of each application and insights into alternative wavefront application designs. The models are applied to three industry-strength wavefront codes and are validated on several systems, including a Cray XT3/XT4 and an InfiniBand commodity cluster. Model predictions show high quantitative accuracy (less than 20% error) for all high performance configurations, and excellent qualitative accuracy. The thesis presents applications, projections and insights for optimisations using the model, which show the utility of reusable analytic models for the performance engineering of high performance computing codes. In particular, we demonstrate the use of the model for: (1) evaluating application configuration and the resulting performance; (2) evaluating hardware platform issues, including platform sizing and configuration; (3) exploring hardware platform design alternatives and system procurement; and (4) considering possible code and algorithmic optimisations.
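    For illustration, the shape of such a reusable analytic model can be sketched as follows: under the LogGP parameterisation (L: latency, o: per-message overhead, G: gap per byte; the per-message gap g is ignored in this simplification), a pipelined sweep over a Px × Py logical processor grid costs a pipeline-fill phase, until the wavefront reaches the farthest processor, plus that processor's own steady-state work. The parameter names (Wg for per-tile compute time, k for boundary message size) and the simplified equations are illustrative assumptions, not the thesis's actual model.

```python
def send_cost(k, L, o, G):
    """End-to-end LogGP time for one k-byte message: o + (k - 1) * G + L + o."""
    return o + (k - 1) * G + L + o

def wavefront_time(Px, Py, n_tiles, Wg, k, L, o, G):
    """Predicted sweep runtime on a Px x Py logical processor grid.

    Wg: compute time per tile; k: boundary message size in bytes.
    The corner processor farthest from the sweep origin starts only after
    (Px + Py - 2) pipeline-fill steps, then executes its own n_tiles steps.
    """
    step = Wg + send_cost(k, L, o, G)   # one tile: compute, then forward boundary
    fill = (Px + Py - 2) * step         # time for the wavefront to cross the grid
    return fill + n_tiles * step

# Example: 16 x 16 grid, 100 tiles per processor, plausible network constants.
print(wavefront_time(Px=16, Py=16, n_tiles=100, Wg=1e-3,
                     k=8192, L=5e-6, o=2e-6, G=1e-9))
```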

    Development and deployment of an Inner Detector Minimum Bias Trigger and analysis of minimum bias data of the ATLAS experiment at the Large Hadron Collider

    Soft inelastic QCD processes are the dominant proton-proton interaction type at the LHC. More than 20 such collisions pile up within a single bunch crossing at ATLAS when the LHC is operated at its design luminosity of L = 10^34 cm^-2 s^-1, colliding proton bunches at a centre-of-mass energy of √s = 14 TeV. Inelastic interactions are characterised by a small transverse momentum transfer and can only be approximated by phenomenological models that need experimental data as input. The initial phase of LHC beam operation in 2009, with luminosities ranging from L = 10^27 to 10^31 cm^-2 s^-1, offered an ideal period to select single proton-proton interactions and study general aspects of their properties. As the first part of this thesis, a Minimum Bias trigger was developed and used for data-taking in ATLAS. This trigger, mbSpTrk, processes signals from the silicon tracking detectors of ATLAS and was designed to efficiently reject empty events while reducing possible biases in the selection of proton-proton collisions to a minimum. The trigger is flexible enough to cope with changing background conditions, retaining low-pT events while machine background is highly suppressed. As the second part, measurements of inelastic charged particles were performed in two kinematically defined phase-space regions. Centrally produced charged particles were considered, with pseudorapidity |η| < 0.8 and a transverse momentum of pT > 0.5 or 1 GeV. Four characteristic minimum-bias distributions were measured at two centre-of-mass energies of √s = 0.9 and 7 TeV.
The results are presented with minimal model dependence so that they can be compared to predictions of different Monte Carlo models for soft particle production. This analysis also represents the ATLAS contribution to the first common LHC analysis, agreed upon by the ATLAS, CMS and ALICE collaborations. The pseudorapidity distributions for both energies and phase-space regions are compared to the respective results of ALICE and CMS.

    Simulation driven machine learning methods to optimise design of physical experiments and enhance data analysis for testing of fusion energy heat exchanger components

    Plasma facing components (PFCs) must be designed to routinely withstand the harsh environment of a fusion device, where temperatures at the core of the plasma exceed 150,000,000 °C. The Heat by Induction to Verify Extremes (HIVE) experimental facility was established to replicate the thermal loads a PFC is subjected to during normal operation of a fusion device. To maximise its impact on the design of PFCs, HIVE must deliver smarter testing and improved component insight. Currently, the experimental parameters required to deliver a certain response in the component are decided at the point of testing through a combination of previous experience, intuition, and trial and error, which is both time-consuming and unreliable. To assess a PFC's suitability, knowledge of its mechanical performance while operating at high temperatures is desirable; however, HIVE only records pointwise temperature measurements on the component's surface using thermocouples, and it currently has no method of inferring a component's mechanical response from these temperature measurements. Both challenges, smarter testing and improved component insight, can be addressed through the identification of inverse solutions. A popular approach to solving engineering inverse problems is surrogate-assisted optimisation, where a machine learning model is trained using finite element (FE) simulation data. Much of the work in the literature uses single-value surrogate models on quite simplistic problems; HIVE, however, is a real-world, multi-physics problem which requires full-field (FF) surrogate models to solve its multitude of inverse problems. A method which can easily construct FE-data-driven FF surrogates would be invaluable for a variety of engineering tasks, as well as for solving inverse problems. This work demonstrates that such a method provides a much more robust and comprehensive way of characterising a PFC's strengths and limitations, enabling more informed decisions to be made during its design cycle.
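    For illustration, the surrogate-assisted inverse-solution loop can be sketched on a toy problem: sample an expensive simulation offline, train a cheap surrogate on the samples, then optimise over the surrogate to find the experimental parameters that produce a target response. Everything named below (the two parameters, the fe_model stand-in, the target value) is hypothetical, standing in for HIVE's actual FE models, and a single-value surrogate is used here where the thesis argues for full-field ones.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.gaussian_process import GaussianProcessRegressor

def fe_model(x):
    """Stand-in for an expensive FE run: coil power and frequency -> peak temperature."""
    power, freq = x
    return 300.0 + 2.5 * power + 40.0 * np.log1p(freq)

# 1. Sample the simulation to build training data (the offline, expensive step).
rng = np.random.default_rng(0)
X = rng.uniform([0.0, 1.0], [100.0, 50.0], size=(40, 2))
y = np.array([fe_model(x) for x in X])

# 2. Train the surrogate, which is cheap to evaluate thereafter.
surrogate = GaussianProcessRegressor(normalize_y=True).fit(X, y)

# 3. Inverse problem: find parameters that produce a target temperature.
target = 600.0
loss = lambda x: (surrogate.predict(x.reshape(1, -1))[0] - target) ** 2
res = minimize(loss, x0=np.array([50.0, 25.0]), bounds=[(0, 100), (1, 50)])
print("suggested experimental parameters:", res.x)
```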

    Portable Checkpointing for Parallel Applications

    High Performance Computing (HPC) systems represent the peak of modern computational capability. As ever-increasing demands for computational power have fuelled the construction of ever-larger computing systems, modern HPC systems have grown to incorporate hundreds, thousands, or as many as 130,000 processors. At these scales, the huge number of individual components in a single system makes the probability that some component will fail quite high, with today's large HPC systems featuring mean times between failures on the order of hours or a few days. As many modern computational tasks require days or months to complete, fault tolerance becomes critical to HPC system design. The past three decades have seen significant amounts of research on parallel system fault tolerance. However, as most of it has been either theoretical or has focused on low-level solutions embedded into a particular operating system or type of hardware, this work has had little impact on real HPC systems. This thesis attempts to address this lack of impact by describing a high-level approach for implementing checkpoint/restart functionality that decouples the fault tolerance solution from the details of the operating system, system libraries and the hardware, and instead connects it to the APIs implemented by those components. The resulting solution enables applications that use these APIs to become self-checkpointing and self-restarting regardless of the software/hardware platform that implements the APIs. The particular focus of this thesis is the problem of checkpoint/restart of parallel applications. It presents two theoretical checkpointing protocols, one for the message passing communication model and one for the shared memory model. The former is the first protocol to be compatible with application-level checkpointing of individual processes, while the latter is the first protocol compatible with arbitrary shared memory models, APIs, implementations and consistency protocols. These checkpointing protocols are used to implement checkpointing systems for applications that use the MPI and OpenMP parallel APIs, respectively, and are the first to provide checkpoint/restart for arbitrary implementations of these popular APIs. Both checkpointing systems are extensively evaluated on multiple software/hardware platforms and are shown to feature low overheads.
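    For illustration, the application-level flavour of checkpoint/restart — where the application itself saves and restores its state at safe points, rather than relying on the OS or hardware — can be sketched for a single process as follows. This is a minimal sketch under stated assumptions (a pickleable state dictionary, an illustrative file name and interval); the coordination across MPI processes or OpenMP threads that the thesis's protocols provide is not shown.

```python
import os, pickle

CKPT = "solver.ckpt"               # illustrative checkpoint file name

def save_checkpoint(state):
    """Write the checkpoint atomically: a crash mid-write leaves the old one intact."""
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)          # atomic rename: old checkpoint or new, never half

def load_checkpoint():
    """Self-restart: resume from the saved state if one exists, else start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "x": 0.0}

state = load_checkpoint()
while state["step"] < 1000:
    state["x"] += 0.001            # stand-in for one solver iteration
    state["step"] += 1
    if state["step"] % 100 == 0:   # checkpoint at application-chosen safe points
        save_checkpoint(state)
```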