84 research outputs found

    A 2D algorithm with asymmetric workload for the UPC conjugate gradient method

    This is a post-peer-review, pre-copyedit version of an article published in Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-014-1300-0 [Abstract] This paper examines four different strategies, each one with its own data distribution, for implementing the parallel conjugate gradient (CG) method and how they impact communication and overall performance. Firstly, typical 1D and 2D distributions of the matrix involved in CG computations are considered. Then, a new 2D version of the CG method with asymmetric workload, based on leaving some threads idle during part of the computation to reduce communication, is proposed. The four strategies are independent of sparse storage schemes and are implemented using Unified Parallel C (UPC), a Partitioned Global Address Space (PGAS) language. The strategies are evaluated on two different platforms through a set of matrices that exhibit distinct sparse patterns, demonstrating that our asymmetric proposal outperforms the others except for one matrix on one platform. Ministerio de Economía y Competitividad; TIN2013-42148-P. Xunta de Galicia; GRC2013/055. United States. Department of Energy; DEAC03-76SF0009

    Heterogeneous parallel algorithms for computational fluid dynamics on unstructured meshes

    The frontiers of computational fluid dynamics (CFD) are constantly expanding and eagerly demanding more computational resources. Currently, we are experiencing a rapid evolution in high performance computing systems driven by power consumption constraints. New HPC nodes incorporate accelerators that are used as math co-processors for increasing the throughput and the FLOP per watt ratio. On the other hand, multi-core CPUs have turned into energy efficient system-on-chip architectures. By doing so, the main components of the node are fused and integrated into a single chip, reducing the energy costs. Nowadays, several institutions and governments are investing in the research and development of different aspects of HPC that could lead to the next generations of supercomputers. These initiatives have named the problem the exascale challenge. This goal can only be achieved by incorporating major changes in computer architecture, memory design and network interfaces. The CFD community faces an important challenge: keeping pace with the rapid changes in HPC resources. The codes and formulations need to be redesigned in order to exploit the different levels of parallelism and complex memory hierarchies of the new heterogeneous systems. The main characteristics demanded of the new CFD software are: memory awareness, extreme concurrency, modularity and portability. This thesis is devoted to the study of a CFD algorithm refactoring for the adoption of new technologies. Our application context is the solution of incompressible flows (DNS or LES) on unstructured meshes. The first approach was to use GPUs for accelerating the Poisson solver, which is the most computationally intensive part of our application. The positive results obtained in this first step motivated us to port the complete time integration phase of our application. This requires a major redesign of the code. We propose a portable implementation model for CFD applications.
The main idea was to replace stencil data structures and kernels with algebraic storage formats and operators. By doing so, the algorithm was restructured into a minimal set of algebraic operations. The implementation strategy consisted of the creation of a low-level algebraic layer for computations on CPUs and GPUs, and a high-level user-friendly discretization layer for CPUs that is fully localized at the preprocessing stage, where performance does not play an important role. As a result, at the time-integration phase the code relies only on three algebraic kernels: sparse-matrix-vector product (SpMV), linear combination of two vectors (AXPY) and dot product (DOT). Such a simple set of basic linear algebra operations naturally provides the desired portability to any computing architecture. Special attention was paid to the development of data structures compatible with the stream processing model. A detailed performance analysis was carried out for both sequential and parallel execution, engaging up to 128 GPUs in a hybrid CPU/GPU supercomputer. Moreover, we tested the portable implementation model of the TermoFluids code on the Mont-Blanc mobile-based supercomputer. The redesign of the kernels exploits a heterogeneous execution model using both computing devices, CPU and GPU, of the ARM-based nodes. The load balancing between the two computing devices exploits a tabu search strategy that tunes the workload distribution during the preprocessing stage. A comparison of the Mont-Blanc prototypes with high-end supercomputers in terms of the achieved net performance and energy consumption provided some guidelines on the behavior of CFD applications on ARM-based architectures. Finally, we present a memory aware auto-tuned Poisson solver for problems with one Fourier diagonalizable direction.
This work was developed and tested on the BlueGene/Q Vesta supercomputer, and aims at demonstrating the relevance of vectorization and memory awareness for fully exploiting modern energy efficient CPUs. [Abstract, translated from the Spanish] The frontiers of computational fluid dynamics (CFD) are constantly expanding and demand more and more computational resources. We are currently experiencing an evolution in high-performance computing (HPC) systems driven by power consumption constraints. New HPC nodes incorporate accelerators that are used as co-processors to increase throughput and the FLOP per watt ratio. Meanwhile, multi-core CPUs have turned into system-on-chip architectures. Nowadays, several institutions and governments are investing in research and development of different aspects of HPC that could lead to the next generations of supercomputers. These initiatives have named the problem the "exascale challenge". This goal can only be achieved through major changes in computer architecture, memory design and network interfaces. The CFD community faces an important challenge: keeping pace with the rapid changes in HPC infrastructures. Codes and formulations need to be redesigned to exploit the different levels of parallelism and the complex memory hierarchies of the new heterogeneous systems. The main characteristics demanded of new CFD software are: data structures, extreme concurrency, modularity and portability. This thesis is devoted to the study of a CFD implementation model for the adoption of new technologies. Our application context is the solution of incompressible flows (DNS or LES) on unstructured meshes. The first approach was based on using GPUs to accelerate the Poisson solver.
The positive results obtained in this first step motivated us to port the complete time integration phase of our application. This requires a major redesign of the code. We propose a portable implementation model for CFD applications. The main idea is to replace stencil data structures and kernels with algebraic storage formats and operators. The implementation strategy consisted of creating a low-level algebraic layer for computations on CPUs and GPUs, and a high-level, user-friendly discretization layer for CPUs. As a result, the time integration phase of the code relies on only three algebraic functions: sparse matrix-vector product (SpMV), linear combination of two vectors (AXPY) and dot product (DOT). In addition, special attention was paid to the development of data structures compatible with the stream processing model. A detailed performance analysis was carried out for both sequential and parallel execution using up to 128 GPUs on a hybrid CPU/GPU supercomputer. Furthermore, we tested the new TermoFluids model on the Mont-Blanc supercomputer, which is based on mobile technology. The redesign of the functions exploits a heterogeneous execution model using both the CPU and the GPU of the ARM-based nodes. Load balancing between the two computing units relies on a tabu search strategy that tunes the workload distribution during the preprocessing stage. A comparison of the Mont-Blanc prototypes with high-end supercomputers in terms of performance and energy consumption gave us some guidelines on the behavior of CFD applications on ARM-based architectures. Finally, we present an auto-tuned data structure for the Poisson solver in problems with one direction that is diagonalizable by a Fourier decomposition.
This work was developed and tested on the BlueGene/Q Vesta supercomputer, and aims to demonstrate the relevance of vectorization and data structures for fully exploiting the CPUs of modern supercomputers.
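
The abstract above reduces the time-integration phase to three kernels: SpMV, AXPY and DOT. As a rough illustration of how a whole solver can be written on top of only those three operations, here is a minimal pure-Python sketch of a conjugate gradient iteration. CG is chosen only because it is a familiar solver with this property; the abstract does not say which solver the thesis uses, and the CSR layout, the function names and the small test matrix are all assumptions made for this example.

```python
from math import sqrt


def spmv(row_ptr, col_idx, vals, x):
    """Sparse matrix-vector product y = A x, with A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[k] * x[col_idx[k]]
    return y


def axpy(alpha, x, y):
    """Linear combination alpha * x + y, returned as a new vector."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def dot(x, y):
    """Dot product of two vectors."""
    return sum(xi * yi for xi, yi in zip(x, y))


def conjugate_gradient(row_ptr, col_idx, vals, b, tol=1e-10, max_iter=100):
    """Solve A x = b for a symmetric positive definite A using only
    spmv, axpy and dot (hypothetical helper names for this sketch)."""
    x = [0.0] * len(b)
    r = list(b)          # residual b - A x for the initial guess x = 0
    p = list(r)          # first search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if sqrt(rr) < tol:
            break
        ap = spmv(row_ptr, col_idx, vals, p)
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)        # x <- x + alpha * p
        r = axpy(-alpha, ap, r)      # r <- r - alpha * A p
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)  # p <- r + beta * p
        rr = rr_new
    return x


# Illustrative 3x3 matrix [[4,-1,0],[-1,4,-1],[0,-1,4]] in CSR form.
ROW_PTR = [0, 2, 5, 7]
COL_IDX = [0, 1, 0, 1, 2, 1, 2]
VALS = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]

if __name__ == "__main__":
    print(conjugate_gradient(ROW_PTR, COL_IDX, VALS, [2.0, 4.0, 10.0]))
```

Because the solver body touches the matrix and vectors only through `spmv`, `axpy` and `dot`, porting it to another device would mean reimplementing those three functions and nothing else, which is the portability argument the abstract makes.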

    Large Scale Computing and Storage Requirements for High Energy Physics

    The National Energy Research Scientific Computing Center (NERSC) is the leading scientific computing facility for the Department of Energy's Office of Science, providing high-performance computing (HPC) resources to more than 3,000 researchers working on about 400 projects. NERSC provides large-scale computing resources and, crucially, the support and expertise needed for scientists to make effective use of them. In November 2009, NERSC, DOE's Office of Advanced Scientific Computing Research (ASCR), and DOE's Office of High Energy Physics (HEP) held a workshop to characterize the HPC resources needed at NERSC to support HEP research through the next three to five years. The effort is part of NERSC's legacy of anticipating users' needs and deploying resources to meet those demands. The workshop revealed several key points, in addition to achieving its goal of collecting and characterizing computing requirements. The chief findings: (1) Science teams need access to a significant increase in computational resources to meet their research goals; (2) Research teams need to be able to read, write, transfer, store online, archive, analyze, and share huge volumes of data; (3) Science teams need guidance and support to implement their codes on future architectures; and (4) Projects need predictable, rapid turnaround of their computational jobs to meet mission-critical time constraints. This report expands upon these key points and includes others. It also presents a number of case studies as representative of the research conducted within HEP. Workshop participants were asked to codify their requirements in this case study format, summarizing their science goals, methods of solution, current and three-to-five year computing requirements, and software and support needs. Participants were also asked to describe their strategy for computing in the highly parallel, multi-core environment that is expected to dominate HPC architectures over the next few years.
The report includes a section that describes efforts already underway or planned at NERSC that address requirements collected at the workshop. NERSC has many initiatives in progress that address key workshop findings and are aligned with NERSC's strategic plans

    Parallel architectures and runtime systems co-design for task-based programming models

    The increasing level of parallelism in modern computing systems has heightened the need for a holistic vision when designing multiprocessor architectures, one that takes into account the needs of programming models and applications. Nowadays, system design consists of several layers on top of each other, from the architecture up to the application software. Although this design allows a separation of concerns, where layers can be changed independently thanks to a well-known interface between them, it is hampering future systems design as Moore's Law comes to an end. Current performance improvements in computer architecture are driven by the shrinkage of the transistor channel width, allowing faster and more power-efficient chips to be made. However, technology is reaching physical limits where the transistor size can no longer be reduced, which requires a change of paradigm in systems design. This thesis proposes to break this layered design, and advocates a system where the architecture and the programming model runtime system are able to exchange information towards a common goal: improving performance and reducing power consumption. By making the architecture aware of runtime information such as a Task Dependency Graph (TDG) in the case of dataflow task-based programming models, it is possible to improve power consumption by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to create such a graph in order to reduce the runtime overheads and make possible the execution of fine-grained tasks to increase the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system in order to perform more efficient communication scheduling and to create new opportunities for computation and communication overlap that were not possible before.
An evaluation of the proposals introduced in this thesis is provided and a methodology to simulate and characterize the application behavior is also presented. [Abstract, translated from the Spanish] The increase in parallelism provided by modern computing systems has created the need for a holistic vision in the design of multiprocessor architectures that takes into account the needs of programming models and applications. Today, computer design consists of different abstraction layers with a well-defined interface between them. The limitations of this approach, together with the end of Moore's Law, limit the potential of future computers. Most current improvements in computer design come fundamentally from shrinking the transistor channel size, which allows faster and more power-efficient chips with hardly any fundamental changes in the architecture design. However, current technology is reaching physical limits where it will no longer be possible to reduce transistor size, which motivates a paradigm shift in how computers are built. This thesis proposes to break this layered design and advocates a system where the architecture and the runtime system of the programming model are able to exchange information to reach a common goal: improving performance and reducing energy consumption. By making the architecture aware of the information available in the programming model, such as the task dependency graph in dataflow programming models, it is possible to reduce energy consumption by exploiting the critical path of the graph. In addition, the architecture can provide hardware support for creating this graph in order to reduce the overhead of building it when task granularity is too fine.
Finally, the state of inter-node communications can be exposed to the runtime system to achieve better communication scheduling, creating new opportunities for overlapping computation and communication that were not possible before. This thesis provides an evaluation of all these proposals, as well as a methodology to simulate and characterize application behavior. Postprint (published version
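
The abstract above refers to exploiting the critical path of a Task Dependency Graph. As a small illustration of what that path is, the following sketch computes the longest cost-weighted chain of dependent tasks in a DAG. The dictionary-based graph representation, the function name and the four-task example are assumptions for this sketch only; they are not taken from the thesis or from any particular runtime system.

```python
def critical_path(costs, deps):
    """Return (length, path) of the longest chain of dependent tasks.

    costs: dict mapping task name -> execution time
    deps:  dict mapping task name -> list of tasks it depends on
    Both structures are hypothetical and chosen for readability.
    """
    finish = {}      # earliest possible finish time of each task
    best_pred = {}   # predecessor that determines each task's start time

    def visit(task):
        if task in finish:
            return finish[task]
        start, pred = 0.0, None
        for d in deps.get(task, []):
            t = visit(d)
            if t > start:
                start, pred = t, d
        finish[task] = start + costs[task]
        best_pred[task] = pred
        return finish[task]

    for task in costs:
        visit(task)

    # Walk back from the task that finishes last.
    last = max(finish, key=finish.get)
    path = []
    node = last
    while node is not None:
        path.append(node)
        node = best_pred[node]
    return finish[last], path[::-1]


if __name__ == "__main__":
    costs = {"a": 2, "b": 3, "c": 1, "d": 4}
    deps = {"b": ["a"], "c": ["a"], "d": ["b", "c"]}
    print(critical_path(costs, deps))
```

In this example task `c` is not on the critical path, so it has slack: it could finish up to two time units later without delaying `d`. Tasks with slack are the ones a scheduler has freedom to treat differently from critical ones.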

    Study for the numerical resolution of conservation equations of mass, momentum and energy and a first approach to large problems using computational performance enhancement techniques

    [Abstract, translated from the Catalan] The conservation equations of mass, momentum and energy are of great importance for understanding a physical system. From these equations derives, for example, the Navier-Stokes system, one of the most studied in physics for understanding the behaviour of any fluid. The great complexity of these systems of non-linear partial differential equations and the lack of analytical solutions have forced scientists and engineers to approach their study using numerical methods, that is, to convert a continuous physical reality into a discrete system that can be processed efficiently by a computer. The first part of this project presents a series of academic thermophysical problems of heat transfer and of pressure and velocity distributions in a fluid. The equations governing the behaviour of these properties are discretised to be implemented in C++ programs. The results obtained are analysed, contrasted and verified. The second block of the work presents an approach towards solving much larger and more complex problems: scenarios where the discretisations are of such magnitude that a computer cannot process the result in an acceptable period of time. In these cases it is necessary to know how computation works on a computer in order to maximise the use of its resources. Some of the optimisation techniques used today are presented, and the performance improvement in solving real problems is assessed. [Abstract] The conservation equations of mass, momentum and energy are of great importance in comprehending a physical system. From these equations derives, for example, the Navier-Stokes system, one of the most studied in the field of physics to determine the motion and properties of any fluid.
The great complexity of these systems of non-linear partial differential equations and the lack of analytical solutions have forced scientists and engineers to focus their study using numerical methods. That is, to convert a physical reality of a continuous environment into a discrete system that can be efficiently processed by a computer. The first part of this project presents a set of academic thermophysical problems of heat transfer and pressure and velocity distribution in a fluid. The equations governing these properties are discretised to be implemented in C++ programs. The results obtained are analysed, contrasted and verified by making use of the existing literature. The second half of the thesis presents an approach towards solving larger and more complex problems that can describe reality more accurately: scenarios where the discretisation meshes are of such magnitude that a computer cannot process the result in an acceptable period of time. In these cases, it is necessary to be aware of how calculations are carried out by a computer in order to maximise the use of its resources. Some of the optimisation techniques used today are discussed and the performance improvement in solving real problems is evaluated

    Parallel Asynchronous Matrix Multiplication for a Distributed Pipelined Neural Network

    Machine learning is an approach to devising algorithms that compute an output without a given rule set, based instead on a self-learning concept. This approach is of great importance for several fields of application in science and industry where traditional programming methods are not sufficient. In neural networks, a popular subclass of machine learning algorithms, previous experience is commonly used to train the network and produce good outputs for newly introduced inputs. By increasing the size of the network, more complex problems can be solved, which in turn rely on a huge amount of training data. Increasing the complexity also leads to higher computational demand and storage requirements and to the need for parallelization. Several parallelization approaches for neural networks have already been considered. Most approaches use special purpose hardware whilst other work focuses on using standard hardware. Often these approaches target the problem by parallelizing the training data. In this work a new parallelization method named poadSGD is proposed for the parallelization of fully-connected, large-scale feedforward networks on a compute cluster with standard hardware. poadSGD is based on the stochastic gradient descent algorithm. A block-wise distribution of the network's layers to groups of processes and a pipelining scheme for batches of the training samples are used. The network is updated asynchronously without interrupting ongoing computations of subsequent batches. For this task a one-sided communication scheme is used. A main algorithmic part of the batch-wise pipelined version consists of matrix multiplications which occur for a special distributed setup, where each matrix is held by a different process group. GASPI, a parallel programming model from the field of "Partitioned Global Address Spaces" (PGAS) models, is introduced and compared to other models from this class.
As it mainly relies on one-sided and asynchronous communication, it is a perfect candidate for the asynchronous update task in the poadSGD algorithm. Therefore, the matrix multiplication is also implemented based on GASPI. In order to efficiently handle upcoming synchronizations within the process groups and achieve a good workload distribution, a two-dimensional block-cyclic data distribution is applied for the matrices. Based on this distribution, the multiplication algorithm is computed by diagonally iterating over the sub blocks of the resulting matrix and computing the sub blocks in subgroups of the processes. The sub blocks are computed by sharing the workload between the process groups and communicating mostly in pairs or in subgroups. The communication in pairs is set up to be overlapped by other ongoing computations. The implementations provide a special challenge, since the asynchronous communication routines must be handled with care as to which processor is working at what point in time with which data in order to prevent an unintentional dual use of data. The theoretical analysis shows the matrix multiplication to be superior to a naive implementation when the dimension of the sub blocks of the matrices exceeds 382. The performance achieved in the test runs did not meet the expectations that the theoretical analysis predicted. The algorithm is executed on up to 512 cores and for matrices up to a size of 131,072 x 131,072. The implementation using the GASPI API was found not to be straightforward but to provide a good potential for overlapping communication with computations whenever the data dependencies of an application allow for it. The matrix multiplication was successfully implemented and can be used within an implementation of the poadSGD method that is yet to come.
The poadSGD method seems to be very promising, especially as nowadays, with the larger amount of data and the increased complexity of the applications, the approaches to parallelization of neural networks are increasingly of interest
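
The abstract above applies a two-dimensional block-cyclic data distribution to the matrices. The mapping itself is simple index arithmetic, sketched below. The function names, the single square block size and the 2 x 2 process grid in the example are assumptions for illustration; the abstract does not give the parameters used in the thesis, and nothing here uses the GASPI API.

```python
def block_cyclic_owner(i, j, block, grid_rows, grid_cols):
    """Return the (process_row, process_col) that owns matrix element (i, j)
    when block x block tiles are dealt out cyclically over the process grid."""
    return ((i // block) % grid_rows, (j // block) % grid_cols)


def owned_blocks(proc, n_blocks_r, n_blocks_c, grid_rows, grid_cols):
    """List the (block_row, block_col) tiles held by one process."""
    pr, pc = proc
    return [
        (bi, bj)
        for bi in range(pr, n_blocks_r, grid_rows)
        for bj in range(pc, n_blocks_c, grid_cols)
    ]


if __name__ == "__main__":
    # 8 x 8 matrix, 2 x 2 tiles, 2 x 2 process grid (illustrative sizes).
    for i in range(0, 8, 2):
        print([block_cyclic_owner(i, j, 2, 2, 2) for j in range(0, 8, 2)])
    print(owned_blocks((0, 1), 4, 4, 2, 2))
```

Each process ends up with tiles spread across the whole matrix rather than one contiguous region, which is why this layout tends to balance the workload when the computation sweeps over sub blocks in some order.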

    Optimization of communication intensive applications on HPC networks

    Communication is a necessary but overhead-inducing component of parallel programming. Its impact on application design and performance is due to several related aspects of a parallel job execution: network topology, routing protocol, suitability of the algorithm being used to the network, job placement, etc. This thesis is aimed at developing an understanding of how communication plays out on networks of high performance computing systems and exploring methods that can be used to improve communication performance of large scale applications. Broadly speaking, three topics have been studied in detail in this thesis. The first of these topics is task mapping and job placement on practical installations of torus and dragonfly networks. Next, the use of supervised learning algorithms for conducting diagnostic studies of how communication evolves on networks is explored. Finally, the efficacy of packet-level simulations for prediction-based studies of communication performance on different networks using different network parameters is analyzed. The primary contribution of this thesis is the development of scalable diagnostic and prediction methods that can assist in the process of network design, adapting applications to future systems, and optimizing execution of applications on existing systems. These methods include a supervised learning approach, a functional modeling tool (called Damselfly), and a PDES-based packet level simulator (called TraceR), all of which are described in this thesis

    Generating and auto-tuning parallel stencil codes

    In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and performance), and achieving a high performance on the target platform. A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation. The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology. The Patus stencil specification DSL allows the programmer to express a stencil computation in a concise way independently of hardware architecture-specific details. Thus, it increases the programmer productivity by disburdening her or him of low level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance. Auto-tuning provides performance and performance portability by automated adaptation of implementation-specific parameters to the characteristics of the hardware on which the code will run. 
By automating the process of parameter tuning (which essentially amounts to solving an integer programming problem in which the objective function is the number representing the code's performance as a function of the parameter configuration), the system can also be used more productively than if the programmer had to fine-tune the code manually. We show performance results for a variety of stencils, for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and ability to produce high performance code
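
The abstract above describes a stencil computation (each grid point updated from its neighbours) and auto-tuning of implementation parameters. The sketch below shows both ideas in their simplest form: a 5-point averaging stencil traversed in tiles, and a tuner that times a few candidate tile sizes and keeps the fastest. This is not Patus code and does not use its DSL; the function names, the candidate list and the use of exhaustive timing in place of the integer-programming formulation mentioned in the abstract are all assumptions made for this illustration.

```python
import time


def jacobi_step(grid, block=8):
    """One 5-point stencil sweep over the interior of a 2D grid,
    traversed in block x block tiles. Boundary values are kept as-is."""
    n, m = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for bi in range(1, n - 1, block):
        for bj in range(1, m - 1, block):
            for i in range(bi, min(bi + block, n - 1)):
                for j in range(bj, min(bj + block, m - 1)):
                    out[i][j] = 0.25 * (
                        grid[i - 1][j] + grid[i + 1][j]
                        + grid[i][j - 1] + grid[i][j + 1]
                    )
    return out


def autotune_block(grid, candidates=(2, 4, 8, 16)):
    """Exhaustive search: time one sweep per candidate tile size and
    return the fastest. Real tuners use smarter search strategies."""
    best, best_time = None, float("inf")
    for block in candidates:
        start = time.perf_counter()
        jacobi_step(grid, block)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = block, elapsed
    return best


if __name__ == "__main__":
    size = 64
    # Boundary held at 1.0, interior initialised to 0.0.
    grid = [
        [1.0 if i in (0, size - 1) or j in (0, size - 1) else 0.0
         for j in range(size)]
        for i in range(size)
    ]
    print("chosen tile size:", autotune_block(grid))
```

The tile size changes only the traversal order, not the result, which is what makes it a safe parameter to tune automatically. In interpreted Python the timing differences are mostly noise; the point of the sketch is the structure of the search, not the numbers it produces.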

    Coordinated power management in heterogeneous processors

    Coordinated Power Management in Heterogeneous Processors. Indrani Paul. 164 pages. Directed by Dr. Sudhakar Yalamanchili. With the end of Dennard scaling, the scaling of device feature size by itself no longer guarantees sustaining the performance improvement predicted by Moore’s Law. As industry moves to increasingly small feature sizes, performance scaling will become dominated by the physics of the computing environment and in particular by the transient behavior of interactions between power delivery, power management and thermal fields. Consequently, performance scaling must be improved by managing interactions between physical properties, which we refer to as processor physics, and system level performance metrics, thereby improving the overall efficiency of the system. The industry shift towards heterogeneous computing is in large part motivated by energy efficiency. While such tightly coupled systems benefit from reduced latency and improved performance, they also give rise to new management challenges due to phenomena such as physical asymmetry in thermal and power signatures between the diverse elements and functional asymmetry in performance. Power-performance tradeoffs in heterogeneous processors are determined by coupled behaviors between major components due to i) on-die integration, ii) the programming model and iii) processor physics. Towards this end, this thesis demonstrates the need for coordinated management of functional and physical resources of a heterogeneous system across all major compute and memory elements. It shows that the interactions among performance, power delivery and different types of coupling phenomena are not an artifact of an architecture instance, but are fundamental to the operation of many-core and heterogeneous architectures. Managing such coupling effects is a central focus of this dissertation.
This awareness has the potential to exert significant influence over the design of future power and performance management algorithms. The high-level contributions of this thesis are i) in-depth examination of characteristics and performance demands of emerging applications using hardware measurements and analysis from state-of-the-art heterogeneous processors and high-performance GPUs, ii) analysis of the effects of processor physics such as power and thermals on system level performance, iii) identification of a key set of run-time metrics that can be used to manage these effects, and iv) development and detailed evaluation of online coordinated power management techniques to optimize system level global metrics in heterogeneous CPU-GPU-memory processors. Ph.D