20 research outputs found

    HARE: Final Report

    Get PDF
    This report documents the results of work done over a six-year period under the FAST-OS programs. The first effort, Right-Weight Kernels (RWK), was concerned with improving measurements of OS noise so that it could be treated quantitatively, and with evaluating the use of two operating systems, Linux and Plan 9, on HPC systems to determine how these operating systems needed to be extended or changed for HPC while still retaining their general-purpose nature. The second program, HARE, built on RWK to explore the creation of alternative runtime models. All of the HARE work was done on Plan 9; the HARE researchers were mindful of the very good Linux and LWK work being done at other labs and saw no need to recreate it.

    Even with limited funding, the two efforts had outsized impact. They:
    - Helped Cray decide to use Linux instead of a custom kernel, and provided the tools needed to make Linux perform well
    - Created a successor operating system to Plan 9, NIX, which has been taken up by Bell Labs for further development
    - Created a standard system measurement tool, Fixed Time Quantum (FTQ), which is widely used for measuring operating-system impact on applications (the core idea is illustrated in the sketch below)
    - Spurred the use of the 9p protocol in several organizations, including IBM
    - Built software in use at many companies, including IBM, Cray, and Google
    - Spurred the creation of alternative runtimes for use on HPC systems
    - Demonstrated that, with proper modifications, a general-purpose operating system can provide communications up to three times as effective as user-level libraries

    Open source was a key part of this work. The code developed for this project is in wide use and available in many places; the core Blue Gene code is available at https://bitbucket.org/ericvh/hare. We describe details of these impacts in the following sections. The rest of this report is organized as follows: first, we describe commercial impact; next, we describe the FTQ benchmark and its impact in more detail; operating systems and runtime research follows; we then discuss infrastructure software; and we close with a description of the new NIX operating system, future work, and conclusions.
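    FTQ's core idea is simple: instead of timing a fixed amount of work, it fixes the time quantum and counts how much work completes in each quantum; dips and periodic patterns in the per-quantum counts reveal OS noise in a form that can be analyzed quantitatively, for example in the frequency domain. The sketch below illustrates that loop under the assumption of a POSIX monotonic clock; the quantum length, sample count, and work unit are illustrative choices, not those of the actual FTQ sources.

```c
/* Minimal FTQ-style probe: count fixed-cost work units per fixed time
 * quantum.  Illustrative sketch only; the real FTQ benchmark adds
 * calibration, threading, and careful output handling. */
#include <stdio.h>
#include <time.h>

#define SAMPLES    1000
#define QUANTUM_NS 1000000L          /* 1 ms quantum (illustrative) */

static long now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000L + ts.tv_nsec;
}

int main(void) {
    static long counts[SAMPLES];
    for (int i = 0; i < SAMPLES; i++) {
        long start = now_ns(), work = 0;
        volatile long sink = 0;
        while (now_ns() - start < QUANTUM_NS) {
            for (int k = 0; k < 64; k++) sink += k;   /* fixed-cost work unit */
            work++;
        }
        counts[i] = work;            /* lower counts indicate stolen cycles */
    }
    for (int i = 0; i < SAMPLES; i++)
        printf("%d %ld\n", i, counts[i]);
    return 0;
}
```

    Plotting or Fourier-transforming the resulting count series is the usual way to expose periodic noise sources such as timer ticks or system daemons.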

    A Discrete Adapted Hierarchical Basis Solver For Radial Basis Function Interpolation

    Full text link
    In this paper we develop a discrete Hierarchical Basis (HB) to efficiently solve the Radial Basis Function (RBF) interpolation problem with variable polynomial order. The HB forms an orthogonal set and is adapted to the kernel seed function and the placement of the interpolation nodes. Moreover, this basis is orthogonal to a set of polynomials up to a given order defined on the interpolation nodes. We are thus able to decouple the RBF interpolation problem for any order of the polynomial interpolation and solve it in two steps: (1) the polynomial-orthogonal RBF interpolation problem is efficiently solved in the transformed HB basis with a GMRES iteration and a diagonal or block SSOR preconditioner; (2) the residual is then projected onto an orthonormal polynomial basis. We apply our approach to several test cases to study its effectiveness, including an application to the Best Linear Unbiased Estimator regression problem.
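    For context, the underlying problem is the standard polynomially augmented RBF interpolation system; the block below records it in generic notation (the symbols are not necessarily those of the paper), showing the polynomial side constraints that the adapted hierarchical basis is designed to satisfy by construction.

```latex
% Polynomially augmented RBF interpolant and the resulting saddle-point
% system (generic notation; f_i are the prescribed data values at x_i).
s(x) = \sum_{j=1}^{N} c_j\,\phi\bigl(\lVert x - x_j \rVert\bigr)
     + \sum_{k=1}^{M} d_k\, p_k(x),
\qquad
\sum_{j=1}^{N} c_j\, p_k(x_j) = 0,\quad k = 1,\dots,M,
\\[6pt]
\begin{pmatrix} A & P \\ P^{\mathsf T} & 0 \end{pmatrix}
\begin{pmatrix} c \\ d \end{pmatrix}
=
\begin{pmatrix} f \\ 0 \end{pmatrix},
\qquad
A_{ij} = \phi\bigl(\lVert x_i - x_j \rVert\bigr),\quad
P_{ik} = p_k(x_i).
```

    Working in a basis that is orthogonal to the polynomials up to the chosen order decouples the coefficients c from d, which is what permits the two-step solve described in the abstract.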

    Scalable system software for high performance large-scale applications

    Get PDF
    In the last decades, high-performance large-scale systems have been a fundamental tool for scientific discovery and engineering advances. The sustained growth of supercomputing performance and the concurrent reduction in cost have made this technology available to a large number of scientists and engineers working on many different problems. The design of next-generation supercomputers will include traditional HPC requirements as well as new requirements to handle data-intensive computations. Data-intensive applications will hence play an important role in a variety of fields, and they are the current focus of several research trends in HPC. Due to the challenges of scalability and power efficiency, the next generation of supercomputers needs a redesign of the whole software stack. Being at the bottom of the software stack, system software is expected to change drastically to support the upcoming hardware and to meet new application requirements.

    This PhD thesis addresses the scalability of system software. The thesis starts at the operating-system level, first studying general-purpose OSs (e.g., Linux) and then lightweight kernels (e.g., CNK). It then focuses on the runtime system, implementing a runtime system for distributed-memory systems that includes many of the system services required by next-generation applications. Finally, it focuses on hardware features that can be exploited at user level to improve application performance and that could potentially be included in such an advanced runtime system. The thesis contributions are the following:

    Operating system scalability: We provide an accurate study of the scalability problems of modern operating systems for HPC. We design and implement a methodology whereby detailed quantitative information may be obtained for each OS noise event. We validate our approach by comparing it to other well-known standard techniques for analyzing OS noise, such as FTQ (Fixed Time Quantum).

    Evaluation of address translation management for a lightweight kernel: We provide a performance evaluation of different TLB management approaches, namely dynamic memory mapping, static memory mapping with replaceable TLB entries, and static memory mapping with fixed TLB entries (no TLB misses), on an IBM Blue Gene/P system.

    Runtime system scalability: We show that a runtime system can efficiently incorporate system services and improve scalability for a specific class of applications. We design and implement a full-featured runtime system and programming model to execute irregular applications on a commodity cluster. The runtime library, called Global Memory and Threading library (GMT), integrates a locality-aware Partitioned Global Address Space communication model with a fork/join program structure. It supports massive lightweight multi-threading, overlapping of communication and computation, and aggregation of small messages to tolerate network latencies (the sketch following this entry illustrates the aggregation idea). We compare GMT to other PGAS models, hand-optimized MPI code, and custom architectures (Cray XMT) on a set of large-scale irregular applications: breadth-first search, random walk, and concurrent hash-map access. Our runtime system shows performance orders of magnitude higher than the other solutions on commodity clusters and competitive with custom architectures.

    User-level scalability exploiting hardware features: We show the high complexity of low-level hardware optimizations for single applications as a motivation to incorporate this logic into an adaptive runtime system. We evaluate the effects of a controllable hardware-thread priority mechanism that controls the rate at which each hardware thread decodes instructions on IBM POWER5 and POWER6 processors. Finally, we show how to effectively exploit cache locality and the network-on-chip on the Tilera many-core architecture to improve intra-core scalability.
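    The message-aggregation idea mentioned for GMT can be illustrated in a few lines of C. This is not the GMT API: the buffer size, message layout, and the net_send() placeholder are assumptions made only for this sketch; the point is that many tiny remote operations share the fixed per-transfer latency by travelling in one aggregated buffer.

```c
/* Illustrative sketch of small-message aggregation.  NOT the GMT API:
 * buffer size, message layout, and net_send() are placeholders. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define AGG_BUF_SIZE 4096            /* illustrative aggregation buffer size */

typedef struct {
    uint8_t data[AGG_BUF_SIZE];
    size_t  used;
} agg_buffer;

/* Placeholder for the real network send; here it just reports the flush. */
static void net_send(int dest, const void *buf, size_t len) {
    (void)buf;
    printf("flush %zu aggregated bytes to node %d\n", len, dest);
}

/* Queue one small message; flush the whole buffer only when it fills up,
 * so many small messages amortize the fixed per-transfer latency. */
static void agg_send(agg_buffer *b, int dest, const void *msg, size_t len) {
    if (b->used + len > AGG_BUF_SIZE) {
        net_send(dest, b->data, b->used);
        b->used = 0;
    }
    memcpy(b->data + b->used, msg, len);
    b->used += len;
}

int main(void) {
    agg_buffer buf = { .used = 0 };
    for (int i = 0; i < 2000; i++)
        agg_send(&buf, 1, &i, sizeof i);   /* 2000 tiny messages, few flushes */
    if (buf.used) net_send(1, buf.data, buf.used);
    return 0;
}
```

    A real runtime would typically also flush on a timeout or at a scheduling point of its lightweight threads, so that aggregation does not add unbounded latency to individual messages.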

    Capturing the impact of external interference on HPC application performance

    Get PDF
    HPC applications are large software packages with high computation and storage requirements. To meet these requirements, the architectures of supercomputers are continuously evolving and their capabilities are continuously increasing. Present-day supercomputers have achieved petaflops of computational power by utilizing thousands to millions of compute cores connected through specialized communication networks, and they are equipped with petabytes of storage using a centralized I/O subsystem. While fulfilling the high resource demands of HPC applications, such a design also entails its own challenges. Applications running on these systems own the computation resources exclusively, but they share the communication interconnect and the I/O subsystem with other concurrently running applications. Simultaneous access to these shared resources causes contention and inter-application interference, leading to degraded application performance.

    Inter-application interference is one of the sources of run-to-run variation. While other sources of variation, such as operating system jitter, have been investigated before, this doctoral thesis specifically focuses on inter-application interference and studies it from the perspective of an application. Variation in execution time not only causes uncertainty and affects user expectations (especially during performance analysis), but also causes suboptimal usage of HPC resources. Therefore, this thesis aims to evaluate inter-application interference, establish trends among applications under contention, and approximate the impact of external influences on the runtime of an application.

    To this end, this thesis first presents a method to correlate the performance of applications running side by side. The method divides the runtime of a system into globally synchronized, fine-grained time slices for which application performance data is recorded separately. The evaluation of the method demonstrates that correlating application performance data can identify inter-application interference. The thesis further uses the method to study I/O interference and shows that file access patterns are a significant factor in determining the interference potential of an application. This thesis also presents a technique to estimate the impact of external influences on an application run. The technique introduces the concept of intrinsic performance characteristics to cluster similar application execution segments; anomalies within a cluster are attributed to external interference. An evaluation with several benchmarks shows high accuracy in estimating the impact of interference from a single application run.

    The contributions of this thesis will help establish interference trends and devise interference-mitigation techniques. Similarly, estimating the impact of external interference will help manage user expectations and allow performance analysts to separate application performance from external influence.
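    The correlation step can be pictured with a toy example: each co-running application records one metric per globally synchronized time slice, and the per-slice series are then correlated. The sketch below uses a plain Pearson coefficient over made-up per-slice I/O throughput values; the metric, the data, and the interpretation threshold are illustrative assumptions, not the thesis's actual analysis pipeline.

```c
/* Correlate per-time-slice performance series of two co-running
 * applications.  Data values are synthetic and purely illustrative. */
#include <math.h>
#include <stdio.h>

static double pearson(const double *x, const double *y, int n) {
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx += x[i]; sy += y[i];
        sxx += x[i] * x[i]; syy += y[i] * y[i]; sxy += x[i] * y[i];
    }
    double cov = sxy - sx * sy / n;       /* unnormalized covariance */
    double vx  = sxx - sx * sx / n;       /* unnormalized variances  */
    double vy  = syy - sy * sy / n;
    return cov / sqrt(vx * vy);
}

int main(void) {
    /* Per-slice I/O throughput (arbitrary units) of two co-running apps. */
    double app_a[] = { 9.8, 9.7, 4.1, 3.9, 9.6, 9.9, 4.2, 9.8 };
    double app_b[] = { 1.0, 1.1, 6.0, 6.2, 1.2, 0.9, 5.9, 1.0 };
    int n = (int)(sizeof app_a / sizeof app_a[0]);
    printf("correlation = %.3f\n", pearson(app_a, app_b, n));
    return 0;
}
```

    A strongly negative coefficient between the throughput series of two applications sharing an I/O subsystem is one indication of contention between them.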

    Multigrid methods for structured grids and their application in particle simulation

    Get PDF
    Multigrid methods are optimal, i.e. linearly scaling, methods for the solution of a broad range of problems, including elliptic partial differential equations. Many of these applications, e.g. partial differential equations discretized on structured grids, possess a lot of structure that allows an efficient implementation of multigrid methods on serial and parallel computers. Besides the standard geometric multigrid theory, a variational theory was developed by a number of authors, where the variational theory is based on the use of the Galerkin coarse-grid operator.

    In this work, multigrid methods for the solution of linear systems of equations with structured coefficient matrices are analyzed. A modified discretization scheme for the solution of the Poisson equation in unbounded domains is presented and its error behavior is analyzed in detail. The method involves hierarchically coarsening and extending the discretization grid finitely often and imposing special boundary conditions on the coarsest level. An FAS-type multigrid method is well suited to solve this problem. Partial differential equations with constant coefficients and periodic boundary conditions that are discretized on equispaced grids lead to circulant coefficient matrices, so the theory of multigrid methods for circulant matrices is applicable to that case. This theory is based on the classical algebraic multigrid theory, which is extended in this context to include non-Galerkin coarse-grid operators as well. A parallelization strategy based on domain decomposition is presented for circulant matrices.

    The developed methods are applied in particle simulation methods, as they are needed in various fields of physics. The problem is introduced and consistently reformulated in a way that allows it to be treated by solving the Poisson equation in either unbounded or periodic domains together with a near-field correction. Motivated by methods like P3M, this reformulation is based on replacing the point charges by finitely supported distributions. No additional errors are introduced by the reformulation, so the error of the resulting method is caused only by the discretization error and the interpolation error of the numerical schemes used. Numerical examples are presented for the FAS-type multigrid solver on the hierarchically coarsened grid, for the multigrid solver for circulant matrices including the replacement of the Galerkin coarse-grid operator, for its parallel scaling behavior on Blue Gene/L and Blue Gene/P, and for the particle simulation method that uses the multigrid solver.
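    The reformulation sketched in the abstract is the classical P3M-style splitting; the block below states it in generic notation (the symbols are mine, not necessarily the thesis's): point charges are replaced by finitely supported unit distributions, the smooth part is obtained from a grid Poisson solve (here by multigrid), and only close pairs need an analytic near-field correction.

```latex
% P3M-style charge smearing and potential splitting (generic notation,
% Gaussian units).  \gamma is a spherically symmetric unit distribution
% with support radius R; \phi_\gamma is its potential.
\rho(\mathbf r) = \sum_i q_i\,\delta(\mathbf r - \mathbf r_i)
\;\longrightarrow\;
\tilde\rho(\mathbf r) = \sum_i q_i\,\gamma(\mathbf r - \mathbf r_i),
\qquad \operatorname{supp}\gamma \subseteq B_R(0),
\\[6pt]
-\Delta\tilde\phi = 4\pi\,\tilde\rho
\quad\text{(unbounded or periodic domain, solved on the grid)},
\\[6pt]
\phi(\mathbf r_i) \;=\; \tilde\phi(\mathbf r_i)
  + \sum_{0 < \lVert \mathbf r_i - \mathbf r_j \rVert < 2R} q_j
    \Bigl[\tfrac{1}{\lVert \mathbf r_i - \mathbf r_j \rVert}
          - \phi_\gamma\bigl(\lVert \mathbf r_i - \mathbf r_j \rVert\bigr)\Bigr].
```

    Because the shape function has compact support, pairs farther apart than 2R interact exactly as point charges, so the correction sum is short-ranged and the splitting itself introduces no additional error, consistent with the statement in the abstract.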

    A multi-tier cached I/O architecture for massively parallel supercomputers

    Get PDF
    Recent advances in storage technologies and high-performance interconnects have made it possible in recent years to build more and more powerful storage systems that serve thousands of nodes. The majority of the storage systems of clusters and supercomputers from the Top 500 list are managed by one of three scalable parallel file systems: GPFS, PVFS, and Lustre. Most large-scale scientific parallel applications are written using the Message Passing Interface (MPI), which has become the de facto standard for scalable distributed-memory machines. One part of the MPI standard is related to I/O and has among its main goals the portability and efficiency of file-system accesses. All of the above-mentioned parallel file systems may also be accessed through the MPI-IO interface.

    The I/O access patterns of scientific parallel applications often consist of accesses to a large number of small, non-contiguous pieces of data. For small file accesses the performance is dominated by the latency of network transfers and disks. Parallel scientific applications lead to interleaved file access patterns with high interprocess spatial locality at the I/O nodes. Additionally, scientific applications exhibit repetitive behaviour when a loop or a function with loops issues I/O requests. When I/O access patterns are repetitive, caching and prefetching can effectively mask their access latency. These characteristics of the access patterns have motivated several researchers to propose parallel I/O optimizations at both the library and file-system levels. However, these optimizations are not always integrated across the different layers of the system.

    In this dissertation we propose a novel generic parallel I/O architecture for clusters and supercomputers. Our design is aimed at large-scale parallel architectures with thousands of compute nodes. Besides acting as middleware for existing parallel file systems, our architecture provides on-line virtualization of storage resources. Another objective of this thesis is to factor out the common parallel I/O functionality of clusters and supercomputers into generic modules, in order to facilitate porting scientific applications across these platforms. Our solution is based on a multi-tier cache architecture, collective I/O, and asynchronous data-staging strategies that hide the latency of data transfer between cache tiers. The thesis aims to reduce the file-access latency perceived by data-intensive parallel scientific applications by means of multi-layer asynchronous data transfers. To accomplish this objective, our techniques leverage multi-core architectures by overlapping computation with communication and I/O in parallel threads. Prototypes of our solutions have been deployed on both clusters and Blue Gene supercomputers. Performance evaluation shows that combining collective strategies with the overlapping of computation, communication, and I/O can bring a substantial performance benefit for access patterns common to parallel scientific applications.
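    The overlap of computation with I/O that the architecture relies on can be sketched with a simple write-behind staging thread: one buffer is filled while the other is flushed. This is only an illustration under POSIX threads and stdio, with an invented file name and buffer size; the actual architecture stages data between cache tiers and parallel file systems rather than to a local file.

```c
/* Double-buffered write-behind staging: the main thread fills one buffer
 * while a helper thread flushes the other, overlapping compute with I/O.
 * Buffer size, file name, and step count are illustrative. */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE (1 << 20)           /* 1 MiB per staging buffer */

static char bufs[2][BUF_SIZE];

struct flush_job { FILE *f; int idx; };

/* Helper thread: drain one buffer to the file while the main thread
 * keeps computing into the other buffer. */
static void *flush_thread(void *arg) {
    struct flush_job *job = arg;
    fwrite(bufs[job->idx], 1, BUF_SIZE, job->f);
    return NULL;
}

int main(void) {
    FILE *f = fopen("staged.out", "wb");
    if (!f) { perror("fopen"); return 1; }

    pthread_t tid;
    struct flush_job jobs[2];
    int in_flight = 0;                       /* is a flush currently running? */

    for (int step = 0; step < 8; step++) {
        int cur = step % 2;
        memset(bufs[cur], step, BUF_SIZE);       /* stand-in for compute phase */
        if (in_flight) pthread_join(tid, NULL);  /* previous flush finished    */
        jobs[cur] = (struct flush_job){ f, cur };
        pthread_create(&tid, NULL, flush_thread, &jobs[cur]);
        in_flight = 1;
    }
    if (in_flight) pthread_join(tid, NULL);
    fclose(f);
    return 0;
}
```

    The same pattern generalizes to staging between cache tiers: the compute thread never waits for the slower tier as long as the flush of the previous buffer completes within one compute phase.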

    Thread assignment in multicore/multithreaded processors: A statistical approach

    Get PDF
    The introduction of multicore/multithreaded processors, comprised of a large number of hardware contexts (virtual CPUs) that share resources at multiple levels, has made process scheduling, in particular the assignment of running threads to available hardware contexts, an important aspect of system performance. Nevertheless, thread assignment for applications running on state-of-the-art processors is an NP-complete problem. Over the years, numerous studies have proposed heuristic-based algorithms for thread assignment. Since the thread assignment problem is intractable, it is in general impossible to know the performance of the optimal assignment, so the room for improvement of a given algorithm is also unknown. It is therefore hard to decide whether to invest more effort and time in improving an algorithm that may already be close to optimal.

    In this paper, we present a statistical approach to the thread assignment problem. First, we present a method that predicts the performance of the optimal thread assignment based on the observed performance of each thread assignment in a random sample. The method is based on Extreme Value Theory (EVT), a branch of statistics that analyses extreme deviations from the population mean. We also propose sample pruning, a method that significantly reduces the time required to apply the statistical method by reducing the number of candidate solutions that need to be measured. Finally, we show that, if no suitable heuristic-based algorithm is available, a sample of several thousand random thread assignments is enough to obtain, with high confidence, an assignment with performance close to optimal.

    The presented approach is architecture- and application-independent, and it can be used to address the thread assignment problem in various domains. It is especially well suited for systems in which the workload seldom changes. An example is network systems, which typically provide a constant set of services that are known in advance, with network applications performing a similar processing algorithm for each packet in the system. In this paper, we validate our methods with an industrial case study for a set of multithreaded network applications on an UltraSPARC T2 processor. This article is an extension of our previous work [44], which was published in the Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-2012).

    This work has been supported by the Spanish Ministry of Science and Innovation under grant TIN2012-34557, the HiPEAC Network of Excellence, and the European Research Council under the European Union's 7th FP, ERC Grant Agreement number 321253. Miquel Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047.
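    The extreme value machinery behind such optimum estimates is classical; the block below records the standard Generalized Extreme Value (GEV) form for sample maxima in generic notation (the paper's exact estimator and fitting procedure may differ). When the shape parameter fitted to the best assignments in a random sample is negative, the distribution has a finite upper endpoint, and that endpoint serves as a prediction of the performance of the optimal assignment.

```latex
% Generalized Extreme Value distribution of sample maxima (generic notation).
G(x) = \exp\!\left\{ -\left[ 1 + \xi\,\frac{x - \mu}{\sigma} \right]^{-1/\xi} \right\},
\qquad 1 + \xi\,\frac{x - \mu}{\sigma} > 0,
\\[6pt]
\xi < 0 \;\Longrightarrow\; x_{\max} = \mu - \frac{\sigma}{\xi}
\quad\text{(finite upper endpoint, i.e. the estimated optimum).}
```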

    Efficient Task-Local I/O Operations of Massively Parallel Applications

    Get PDF
    Applications on current large-scale HPC systems use enormous numbers of processing elements for their computation and have access to large amounts of main memory for their data. Nevertheless, they still need file-system access to maintain program and application data persistently. Characteristic I/O patterns that produce a high load on the file system often occur during access to checkpoint and restart files, which have to be stored frequently so that the application can be restarted after program termination or system failure. On large-scale HPC systems with distributed memory, each application task will often perform such I/O individually by creating task-local file objects on the file system. At large scale, these I/O patterns impose substantial stress on the metadata-management components of the I/O subsystem. For example, the simultaneous creation of thousands of task-local files in the same directory can cause delays of several minutes. Similar metadata contention occurs at the startup of dynamically linked applications while searching for library files, and it induces a comparably high metadata load on the file system; even mid-scale applications can suffer startup delays of ten minutes or more in such load scenarios. Therefore, dynamic linking and loading is currently not applied on large HPC systems, although dynamic linking has many advantages for managing large code bases. The reason for these limitations is that POSIX I/O and the dynamic loader are implemented as serial components of the operating system and do not take advantage of the parallel nature of the I/O operations.

    To avoid the above bottlenecks, this work describes two novel approaches for the integration of locality awareness (e.g., through aggregation or caching) into the serial I/O operations of parallel applications. The underlying methods are implemented in two tools, SIONlib and Spindle, which exploit knowledge of application parallelism to coordinate access to file-system objects. In addition, the applied methods use knowledge of the underlying I/O-subsystem structure, the parallel file-system configuration, and the network between the HPC system and the I/O system to optimize application I/O. Both tools add layers between the parallel application and the POSIX-based standard interfaces of the operating system for I/O and dynamic loading, eliminating the need to modify the underlying system software.

    SIONlib is already applied in several applications, including PEPC, muphi, and MP2C, to implement efficient checkpointing, and it is integrated into the performance-analysis tools Scalasca and Score-P to efficiently store and read trace data. Recent benchmarks on the Blue Gene/Q in Jülich demonstrate that SIONlib solves the metadata problem at large scale by running efficiently with up to 1.8 million tasks while maintaining high I/O bandwidths of 60-80% of file-system peak with negligible file-creation time. The scalability of Spindle was demonstrated by running the Pynamic benchmark, a proxy benchmark for a real application, at large scale on a cluster at Lawrence Livermore National Laboratory. The results show that the startup of dynamically linked applications is now feasible on more than 15000 tasks, while the overhead of Spindle remains nearly constant and low. With SIONlib and Spindle, this work demonstrates how the scalability of operating-system components can be improved without modifying them and without changing the I/O patterns of applications. In this way, SIONlib and Spindle represent prototype implementations of functionality needed by next-generation runtime systems.
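    The metadata problem SIONlib addresses comes from the naive task-local pattern, which the MPI sketch below reproduces: every rank creates its own file in one shared directory, and at scale those simultaneous creates serialize on the metadata service. The file name, directory, and payload are illustrative; the sketch deliberately shows the problematic pattern, not the SIONlib API, which maps such logical task-local files onto a small number of physical files.

```c
/* Naive task-local checkpointing: one file per MPI rank in a shared
 * directory.  At large scale the simultaneous creates stress the file
 * system's metadata service; this is the pattern SIONlib replaces with a
 * few shared physical files.  Assumes a "checkpoint/" directory exists. */
#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char path[256];
    snprintf(path, sizeof path, "checkpoint/rank-%06d.dat", rank);

    double state[1024];                  /* illustrative per-task state */
    memset(state, 0, sizeof state);
    state[0] = rank;

    FILE *f = fopen(path, "wb");         /* thousands of creates, one directory */
    if (f) {
        fwrite(state, sizeof(double), 1024, f);
        fclose(f);
    }

    MPI_Finalize();
    return 0;
}
```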

    Software for Exascale Computing - SPPEXA 2016-2019

    Get PDF
    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer's series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA's first funding phase, and provides an overview of SPPEXA's contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

    Optimisation of the first principle code Octopus for massive parallel architectures: application to light harvesting complexes

    Get PDF
    Computer simulation has become a powerful technique for assisting scientists in developing novel insights into the basic phenomena underlying a wide variety of complex physical systems. The work reported in this thesis is concerned with the use of massively parallel computers to simulate, at the electronic-structure level, the fundamental features that control the initial stages of the harvesting and transfer of solar energy in green plants, which initiate the photosynthetic process. Currently available supercomputer facilities offer the possibility of using hundreds of thousands of computing cores. However, obtaining a linear speed-up from HPC systems is far from trivial; great efforts must be devoted to understanding the nature of the scientific code, the methods of parallel execution, the data-communication requirements of multi-process calculations, the efficient use of the available memory, and so on. This thesis deals with all of these themes, with a clear objective in mind: the electronic-structure simulation of complete macro-molecular complexes, namely the Light Harvesting Complex II, with the aim of understanding its physical behaviour.

    In order to simulate this complex, we have used (with the assistance of the PRACE consortium) some of the most powerful supercomputers in Europe to run Octopus, a scientific software package for Density Functional Theory and Time-Dependent Density Functional Theory calculations. Results obtained with Octopus have been analysed in depth in order to identify the main obstacles to optimal scaling on thousands of cores. Many problems emerged, mainly the poor performance of the Poisson solver, high memory requirements, and the transfer of large quantities of complex data structures among processes. All of these problems have been overcome, and the new version reaches very high performance on massively parallel systems. Tests run efficiently on up to 128K processors, and we have thus been able to complete the largest TDDFT calculations performed to date. At the conclusion of this work it has been possible to study the Light Harvesting Complex II as originally envisioned.

    University of the Basque Country (UPV/EHU), University of Coimbra, Red Española de Supercomputación (RES), Jülich Supercomputing Centre (JSC), Rechenzentrum Garching, Cineca, Barcelona Supercomputing Center (BSC), CeSViMa, European Research Council Advanced Grant DYNamo (ERC-2010-AdG-267374), Spanish Grant (FIS2013-46159-C3-1-P), Grupos Consolidados UPV/EHU del Gobierno Vasco (IT578-13 and IT395-10), European Community FP7 project CRONOS (Grant number 280879-2), COST Actions CM1204 (XLIC) and MP1306 (EUSpec). The ALDAPA research group belongs to the Basque Advanced Informatics Laboratory (BAILab) supported by the University of the Basque Country UPV/EHU (grant UFI11/45).
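    For context on why the Poisson solver dominates: in DFT and TDDFT codes such as Octopus, the Hartree potential must be recomputed from the electronic density at every self-consistency or time-propagation step, which in atomic units amounts to the Poisson problem below, so its parallel performance gates the scaling of the whole calculation. The notation is the textbook one, not necessarily that of the thesis.

```latex
% Hartree potential of the electronic density n(r), atomic units.
\nabla^{2} v_{\mathrm H}(\mathbf r) = -4\pi\, n(\mathbf r)
\qquad\Longleftrightarrow\qquad
v_{\mathrm H}(\mathbf r) = \int \frac{n(\mathbf r')}{\lVert \mathbf r - \mathbf r' \rVert}\, \mathrm d\mathbf r'.
```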