
    At the Locus of Performance: A Case Study in Enhancing CPUs with Copious 3D-Stacked Cache

    Over the last three decades, innovations in the memory subsystem have primarily targeted overcoming the data movement bottleneck. In this paper, we focus on a specific market trend in memory technology: 3D-stacked memory and caches. We investigate the impact of extending the on-chip memory capabilities of future HPC-focused processors, particularly via 3D-stacked SRAM. First, we propose a method, oblivious to the memory subsystem, to gauge the upper bound on performance improvements when data movement costs are eliminated. Then, using the gem5 simulator, we model two variants of LARC, a processor fabricated in 1.5 nm and enriched with high-capacity 3D-stacked cache. With a large volume of experiments involving a broad set of proxy applications and benchmarks, we aim to reveal where HPC CPU performance could be circa 2028, and conclude an average boost of 9.77x for cache-sensitive HPC applications on a per-chip basis. Additionally, we exhaustively document our methodological exploration to motivate HPC centers to drive their own technological agenda through enhanced co-design.
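
    The upper-bound idea mentioned in the abstract can be illustrated with a simple Amdahl-style calculation: if some fraction of a workload's cycles is attributable to data movement, eliminating those cycles bounds the attainable speedup. The sketch below is only a back-of-envelope illustration with made-up numbers; it is not the paper's memory-subsystem-oblivious methodology or its gem5 setup.

```python
def upper_bound_speedup(total_cycles: float, mem_stall_cycles: float) -> float:
    """Amdahl-style bound: speedup if all memory-stall cycles were eliminated.

    Assumes a simple additive cycle model (compute + memory stalls), which is
    a deliberate simplification rather than the paper's actual methodology.
    """
    compute_cycles = total_cycles - mem_stall_cycles
    return total_cycles / compute_cycles

# Hypothetical profile: 60% of all cycles are spent waiting on data movement.
print(upper_bound_speedup(total_cycles=1e9, mem_stall_cycles=0.6e9))  # -> 2.5
```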

    Communication-Efficient Probabilistic Algorithms: Selection, Sampling, and Checking

    This dissertation addresses three fundamental classes of problems in big-data systems, for which we develop communication-efficient probabilistic algorithms. In the first part we consider various selection problems, in the second part weighted sampling, and in the third part probabilistic checking of the correctness of basic operations in big-data frameworks. This work is motivated by a growing need for communication efficiency, which arises because the network and its use account for an ever larger share of both the acquisition cost and the energy consumption of supercomputers, as well as of the running time of distributed applications. Surprisingly few communication-efficient algorithms are known for fundamental big-data problems; in this work we close some of these gaps.

    First, we consider various selection problems, beginning with the distributed version of the classical selection problem, i.e., finding the element of rank k in a large distributed input. We show how this problem can be solved communication-efficiently without assuming that the elements of the input are randomly distributed. To this end, we replace the pivot-selection method in a long-known algorithm and show that this suffices. We then show that selection from locally sorted sequences, known as multisequence selection, can be solved considerably faster if the exact rank of the output element may vary within a certain range. We use this to construct a distributed priority queue with bulk operations, which we later employ to draw weighted samples from data streams (reservoir sampling). Finally, we consider the problem of identifying, with a sampling-based approach, the globally most frequent objects as well as those whose associated values have the largest sums.

    The chapter on weighted sampling first presents new construction algorithms for a classical data structure for this problem, so-called alias tables. We begin by presenting the first linear-time construction algorithm for this data structure that requires only constant auxiliary memory. We then parallelize this algorithm for shared memory, obtaining the first parallel construction algorithm for alias tables. Afterwards, we show how the problem can be tackled for distributed systems with a two-level algorithm. Next, we present an output-sensitive algorithm for weighted sampling with replacement; output-sensitive means that the running time of the algorithm depends on the number of unique elements in the output rather than on the size of the sample. This algorithm can be used sequentially, on shared-memory machines, and on distributed systems, and is the first such algorithm in all three settings. We then adapt it to weighted sampling without replacement by combining it with an estimator for the number of unique elements in a sample drawn with replacement. Poisson sampling, a generalization of Bernoulli sampling to weighted elements, can be reduced to integer sorting, and we show how an existing approach can be parallelized. For sampling from data streams, we adapt a sequential algorithm and show how it can be parallelized in a mini-batch model using the bulk priority queue introduced in the selection chapter. The chapter closes with an extensive evaluation of our alias-table construction algorithms, our output-sensitive algorithm for weighted sampling with replacement, and our algorithm for weighted reservoir sampling.

    To verify the correctness of distributed algorithms probabilistically, we propose checkers for basic operations of big-data frameworks. We show that checking numerous operations can be reduced to two "core" checkers, namely checking aggregations and checking whether one sequence is a permutation of another. While several approaches to the latter problem have been known for some time and are easy to parallelize, our sum-aggregation checker is a novel application of the same data structure that underlies counting Bloom filters and the count-min sketch. We implemented both checkers in Thrill, a big-data framework. Experiments with deliberately introduced errors confirm the detection accuracy predicted by our theoretical analysis, even when we use commonly employed fast hash functions with theoretically suboptimal properties. Scaling experiments on a supercomputer show that our checkers incur very little runtime overhead, on the order of 2%, while the correctness of the result is almost guaranteed.
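
    As a concrete reference point for the alias tables discussed above, the following is a minimal sketch of the classic Walker/Vose construction together with the O(1) sampling routine. It is the textbook linear-time construction with linear auxiliary space, not the dissertation's constant-extra-memory, shared-memory-parallel, or distributed variants.

```python
import random

def build_alias_table(weights):
    """Classic Vose construction: O(n) time, O(n) auxiliary space.

    Returns (prob, alias) such that each draw takes O(1) time.
    """
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]
    prob, alias = [0.0] * n, [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]          # move the excess of l into slot s
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:                   # leftovers are numerically ~1
        prob[i] = 1.0
    return prob, alias

def sample(prob, alias):
    """Draw one index with probability proportional to its original weight."""
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

prob, alias = build_alias_table([0.1, 0.2, 0.3, 0.4])
counts = [0, 0, 0, 0]
for _ in range(100_000):
    counts[sample(prob, alias)] += 1
print(counts)  # roughly proportional to 1:2:3:4
```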

    SKIRT: hybrid parallelization of radiative transfer simulations

    We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modeling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behavior of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures. (21 pages, 20 figures)
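
    The hybrid scheme described above (MPI between processes, threads sharing one copy of the data within each process) can be sketched with mpi4py and Python threads. The sketch is a loose illustration under simplifying assumptions: each thread writes to a disjoint slice of the shared array instead of using the lock-free atomic updates SKIRT relies on, and the random "photon package" contributions are a stand-in for the actual Monte Carlo transport.

```python
# Hybrid MPI + shared-memory sketch: run with e.g. `mpiexec -n 4 python hybrid.py`.
import threading
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

NTHREADS, NROWS, NCOLS = 4, 64, 64
local_image = np.zeros((NROWS, NCOLS))  # one copy per process, shared by its threads

def worker(tid: int) -> None:
    # Each thread fills a disjoint set of rows, so no locking is needed here.
    rng = np.random.default_rng(1000 * rank + tid)
    for r in range(tid, NROWS, NTHREADS):
        local_image[r, :] += rng.random(NCOLS)  # stand-in for photon-package contributions

threads = [threading.Thread(target=worker, args=(t,)) for t in range(NTHREADS)]
for t in threads: t.start()
for t in threads: t.join()

# Distributed-memory step: combine the per-process images on rank 0.
global_image = np.zeros_like(local_image) if rank == 0 else None
comm.Reduce(local_image, global_image, op=MPI.SUM, root=0)
if rank == 0:
    print("total flux:", global_image.sum())
```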

    All-Atom Modeling of Protein Folding and Aggregation

    Theoretical investigations of biorelevant processes in life-science research require highly optimized simulation methods. Therefore, massively parallel Monte Carlo algorithms, namely MTM, were successfully developed and applied to the field of reversible protein folding, allowing the thermodynamic characterization of proteins at an atomistic level. Furthermore, the formation process of trans-membrane pores in the TatA system could be elucidated and the structure of the complex could be predicted.

    Scalability Engineering for Parallel Programs Using Empirical Performance Models

    Performance engineering is a fundamental task in high-performance computing (HPC). By definition, HPC applications should strive for maximum performance. As HPC systems grow larger and more complex, the scalability of an application has become a primary concern. Scalability is the ability of an application to show satisfactory performance even when the number of processors or the problem size is increased. Although various analysis techniques for scalability have been suggested in the past, engineering applications for extreme-scale systems still occurs in an ad hoc fashion. The challenge is to provide techniques that explicitly target scalability throughout the whole development cycle, thereby allowing developers to uncover bottlenecks earlier in the development process. In this work, we develop a number of fundamental approaches in which we use empirical performance models to gain insights into code behavior at larger scales.

    In the first contribution, we propose a new software engineering approach for extreme-scale systems. Specifically, we develop a framework that validates asymptotic scalability expectations of programs against their actual behavior. The most important applications of this method, which is especially well suited for libraries encapsulating well-studied algorithms, include initial validation, regression testing, and benchmarking to compare implementation and platform alternatives. We supply a tool-chain that automates large parts of the framework, thus allowing it to be applied continuously throughout the development cycle with very little effort. We evaluate the framework with MPI collective operations, a data-mining code, and various OpenMP constructs. In addition to revealing unexpected scalability bottlenecks, the results also show that it is a viable approach for systematic validation of performance expectations.

    As the second contribution, we show how the isoefficiency function of a task-based program can be determined empirically and used in practice to control the efficiency. Isoefficiency, a concept borrowed from theoretical algorithm analysis, binds efficiency, core count, and input size in one analytical expression, thereby allowing the latter two to be adjusted according to given (realistic) efficiency objectives. Moreover, we analyze resource contention by modeling the efficiency of contention-free execution. This allows poor scaling to be attributed either to excessive resource contention overhead or to structural conflicts related to task dependencies or scheduling. Our results, obtained with applications from two benchmark suites, demonstrate that our approach provides insights into fundamental scalability limitations or excessive resource overhead and can help answer critical co-design questions.

    Our contributions to better scalability engineering can be used not only in the traditional software development cycle, but also in related fields such as algorithm engineering, which uses the software engineering cycle to produce algorithms that can be utilized in applications more easily. Using our contributions, algorithm engineers can make informed design decisions, get better insights, and save experimentation time.
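
    The kind of empirical performance model referred to above can be illustrated with a small fit: measure runtimes at several process counts, fit a simple analytical model, and compare its growth against an asymptotic expectation. The sketch below fits a single power-law term with numpy and checks the fitted exponent; the actual framework generates and validates much richer models automatically, so the synthetic measurements and the tolerance here are purely illustrative.

```python
import numpy as np

# Hypothetical strong-scaling measurements: process counts and runtimes (seconds).
procs    = np.array([2, 4, 8, 16, 32, 64])
runtimes = np.array([10.1, 5.3, 2.9, 1.7, 1.1, 0.9])

# Fit a simple power-law model t(p) ~ a * p^b via least squares in log-log space.
b, log_a = np.polyfit(np.log(procs), np.log(runtimes), deg=1)
a = np.exp(log_a)
print(f"fitted model: t(p) ~ {a:.2f} * p^{b:.2f}")

# Validate against an asymptotic expectation, e.g. ideal strong scaling (b close to -1).
EXPECTED_EXPONENT, TOLERANCE = -1.0, 0.2
if abs(b - EXPECTED_EXPONENT) > TOLERANCE:
    print("scalability expectation violated: runtime shrinks more slowly than expected")
```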

    Numerical optimization design of advanced transonic wing configurations

    A computationally efficient and versatile technique for use in the design of advanced transonic wing configurations has been developed. A reliable and fast transonic wing flow-field analysis program, TWING, has been coupled with a modified quasi-Newton unconstrained optimization algorithm, QNMDIF, to create a new design tool. Fully three-dimensional wing designs utilizing both specified wing pressure distributions and drag-to-lift ratio minimization as design objectives are demonstrated. Because of the high computational efficiency of each of the components of the design code, in particular the vectorization of TWING and the high speed of the Cray X-MP vector computer, the computer time required for a typical wing design is reduced by approximately an order of magnitude compared with previous methods. In the results presented here, the computed wave drag has been used as the quantity to be optimized (minimized), with great success, yielding wing designs with nearly shock-free (zero wave drag) pressure distributions and very reasonable wing section shapes.
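
    To make the coupling of a flow solver with a quasi-Newton optimizer concrete, here is a minimal sketch using SciPy's BFGS implementation on a toy objective. TWING and QNMDIF are not available here, so the wave_drag function and the three design variables are purely hypothetical stand-ins for an expensive aerodynamic evaluation.

```python
import numpy as np
from scipy.optimize import minimize

def wave_drag(x):
    """Toy stand-in for an expensive flow-solver evaluation (e.g. TWING).

    x holds a few wing-section design variables; the quadratic-plus-quartic
    form merely gives the optimizer a smooth bowl with an off-origin minimum.
    """
    target = np.array([0.3, -0.1, 0.05])
    return float(np.sum((x - target) ** 2) + 0.1 * np.sum(x ** 4))

x0 = np.zeros(3)                                  # initial design variables
result = minimize(wave_drag, x0, method="BFGS")   # quasi-Newton search
print("optimal design variables:", result.x)
print("minimized 'wave drag':", result.fun)
```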

    Seismic full-waveform inversion of the crust-mantle structure beneath China and adjacent regions

    Since the late 1970s, seismic tomography has emerged as the pre-eminent tool for imaging the Earth's interior from the meter to the global scale. Significant recent advances in seismic data acquisition, high-performance computing, and modern numerical methods have drastically progressed tomographic methods. Today it is technically feasible to accurately simulate seismic wave propagation through realistically heterogeneous Earth models across a range of scales. When seismic waves propagate inside the Earth and encounter structural heterogeneities of a certain scale, the wave propagation speed changes, reflection and scattering phenomena occur, and interconversions between compressional and shear waves take place. The combined effect of multiple heterogeneities produces a highly complicated wavefield recorded in the form of three-component (vertical, radial, transverse) seismograms. The ultimate objective of seismic imaging is to utilize the full information of waveforms recorded at seismic stations distributed around the globe, over a broad frequency range, to characterize detailed tomographic images of the Earth's interior by fitting synthetic seismograms to recorded seismograms. The full-waveform inversion technique, based on adjoint and spectral-element methods, can be employed to maximally exploit the information contained in these seismic wavefield complexities and to determine the fine-scale structural heterogeneities from which they originated, across various orders of magnitude in frequency and wavelength. Over the past two decades, the three-dimensional crust and mantle structure beneath the broad Asian region has attracted much attention in seismic studies due to its complicated and unique geological setting involving active lithospheric deformation, intracontinental rifting, intraplate seismotectonics, volcanism and magmatism, continent-continent collision, deep subduction of oceanic plates, and mantle dynamics. To enhance our understanding of the subsurface behavior of cold subducting slabs and hot mantle flows and their dynamic relation to the tectonic evolution of the overriding plates, we construct the first-generation full-waveform tomographic model SinoScope 1.0 for the crust-mantle structure beneath China and adjacent regions. Three-component seismograms from 410 earthquakes recorded at 2,427 seismographic stations are employed in iterative gradient-based inversions for three successively broadened period bands of 70-120 s, 50-120 s, and 30-120 s. Synthetic seismograms were computed using GPU-accelerated spectral-element simulations of seismic wave propagation in 3-D anelastic models, and Fréchet derivatives were calculated with an adjoint-state method facilitated by a checkpointing algorithm. The inversion involved 352 iterations, which required 18,600 wavefield simulations. The simulations were performed on the Xeon Platinum 'SuperMUC-NG' at the Leibniz Supercomputing Center and the Xeon E5 'Piz Daint' at the Swiss National Supercomputing Center, at a total cost of ~8 million CPU hours. SinoScope 1.0 is described in terms of isotropic P-wave velocity (Vp), horizontally and vertically polarized S-wave velocities (Vsh and Vsv), and mass density (ρ) variations, which are independently constrained with the same data set coupled with a stochastic L-BFGS quasi-Newton optimization scheme. The inversion systematically reduced the differences between observed and synthetic full-length seismograms.
    We performed a detailed resolution analysis by recovering input random parametric perturbations, indicating that resolution lengths can approach half the propagated wavelength within the well-covered areas. SinoScope 1.0 exhibits strong lateral heterogeneities in the crust and uppermost mantle, and its features correlate well with geological observations, such as sedimentary basins, Holocene volcanoes, the Tibetan Plateau, the Philippine Sea Plate, orogenic belts, and subduction zones. Estimating lithospheric thickness from seismic velocity reductions at depth reveals significant variations underneath the different tectonic units: ~50 km in Northeast and North China, ~70 km in South China, ~90 km in the South China Sea, Philippine Sea Plate, and Caroline Plate, and 150-250 km in the Indian Plate. The thickest (200-250 km) cratonic roots are present beneath the Sichuan, Ordos, and Tarim basins. Large-scale lithospheric deformation is imaged as low-velocity zones from the Himalayan orogen to the Baikal rift system in central Asia. A thin horizontal layer of ~100-200 km depth extent below the lithosphere points toward the existence of the asthenosphere beneath East and Southeast Asia, with heterogeneous anisotropy indicative of channel flows. This provides independent, high-resolution evidence for a low-viscosity asthenosphere that partially decouples the plates from the mantle flow beneath and allows plate tectonics to work above. Beneath the Indo-Australian Plate, we observe distinct low-velocity anomalies from a depth of ~200 km to the bottom of the mantle transition zone (MTZ), continuously extending northward below western China from the lower MTZ down to the top of the lower mantle. Furthermore, we observe an enhanced image of well-known slabs along strongly curved subduction zones, including Kurile, Japan, Izu-Bonin, Mariana, Ryukyu, Philippines, Indonesia, and Burma. Broad high-velocity bodies persist from the lower MTZ to 1000 km depth or deeper beneath the northern part of the Indo-Australian Plate; they might be pieces of the ancient Tethyan slab that sank into the lower mantle before the collision of the Indo-Australian and Eurasian Plates. The deep geodynamic processes controlling the large-scale tectonic activity of the broad Asian region are very complicated, not yet well understood, and remain a source of much debate. In this thesis, the main focus is on deciphering the three-dimensional seismic structure and dynamics of the lithosphere and mantle beneath China and adjacent regions with the help of the high-resolution full-waveform tomographic model. More importantly, for subsequent geodynamic inversion work, it provides an improved seismological basis for a sequential reconstruction of the late Mesozoic and Cenozoic plate motion history of the broad Asian region and for present-day mantle heterogeneity state estimates that, in turn, allow one to track material back in time from any given sampling location by retrodicting past mantle states.
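
    One ingredient mentioned above, the checkpointing algorithm that makes the adjoint-state computation of Fréchet derivatives feasible, can be sketched in a few lines: the forward wavefield is stored only at selected time steps, and the segments in between are recomputed during the reverse pass. The sketch assumes uniform checkpoint spacing and uses a trivial scalar "wavefield" update; production workflows like the one described use optimized checkpoint schedules and full spectral-element solvers.

```python
def forward_step(state, t):
    """Trivial stand-in for one time step of a wave solver."""
    return state + 0.01 * t

def adjoint_with_checkpoints(nsteps: int, spacing: int, state0=0.0):
    """Reverse-time loop that recomputes forward states between checkpoints."""
    # Forward pass: keep only every `spacing`-th state (plus the initial one).
    checkpoints = {0: state0}
    state = state0
    for t in range(nsteps):
        state = forward_step(state, t)
        if (t + 1) % spacing == 0:
            checkpoints[t + 1] = state

    gradient = 0.0
    for t in reversed(range(nsteps)):
        # Recompute the forward state at time t from the nearest earlier checkpoint.
        t0 = (t // spacing) * spacing
        fstate = checkpoints[t0]
        for s in range(t0, t):
            fstate = forward_step(fstate, s)
        adjoint_source = 1.0                 # stand-in for the adjoint wavefield
        gradient += adjoint_source * fstate  # stand-in for the gradient accumulation
    return gradient

print(adjoint_with_checkpoints(nsteps=100, spacing=10))
```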
    • 
