15 research outputs found

    Energy Efficient Seismic Wave Propagation Simulation on a Low-power Manycore Processor.

    No full text
    International audienceLarge-scale simulation of seismic wave propagation is an active research topic. Its high demand for processing power makes it a good match for High Performance Computing (HPC). Although we have observed a steady increase on the processing capabilities of HPC platforms, their energy efficiency is still lacking behind. In this paper, we analyze the use of a low-power manycore processor, the MPPA-256, for seismic wave propagation simulations. First we look at its peculiar characteristics such as limited amount of on-chip memory and describe the intricate solution we brought forth to deal with this processor's idiosyncrasies. Next, we compare the performance and energy efficiency of seismic wave propagation on MPPA-256 to other commonplace platforms such as general-purpose processors and a GPU. Finally, we wrap up with the conclusion that, even if MPPA-256 presents an increased software development complexity, it can indeed be used as an energy efficient alternative to current HPC platforms, resulting in up to 71% and 5.18x less energy than a GPU and a general-purpose processor, respectively

    Otimização do framework PSkel para o processador manycore MPPA-256

    Get PDF
    TCC(graduação) - Universidade Federal de Santa Catarina. Centro Tecnológico. Ciências da Computação.A new class of highly parallel low-power chips that deal with a energy restriction was developed. The Sunway SW26010 and Kalray processors are some exemples of them, giving more than two hundred processing cores in a single low-power chip. Despite presenting a better energy efficiency than the general purpose multi core processors, the architects features such as the limited amount of distributed memory on the chip makes the development of efficient scientific applications a challenging task. In this term paper were proposed optimizations to the framework PSkel MPPA, which provides an unique, high-level abstraction for stencil programming in the MPPA-256 processor, exempting programmers from being responsible for the task of explicitly handling with communication and with the parallel hybrid programming model of the MPPA-256.Uma nova classe de chips altamente paralelos de baixo consumo energético que lidam com a restrição de energia foi desenvolvida. Os processadores Sunway SW26010 e Kalray MPPA-256 são exemplos deles, entregando mais de duzentos núcleos de processamento em um único chip. Apesar de apresentarem melhor eficiência energética do que os processadores multicore de propósito geral, características arquiteturais como a limitada quantidade de memória distribuída no chip torna o desenvolvimento de aplicações científicas paralelas eficientes uma tarefa desafiadora. Neste projeto foram propostas otimizações ao framework PSkelMPPA, que provê uma abstração única e de alto nível para programação estêncil no processador MPPA-256, eximindo os programadores de serem responsáveis pela tarefa de explicitamente lidar com a comunicação e com o modelo de programação paralela híbrida do MPPA-256

    CAP Bench: a benchmark suite for performance and energy evaluation of low-power many-core processors

    Get PDF
    International audienceSUMMARY The constant need for faster and more energy-efficient processors has been stimulating the development of new architectures, such as low-power many-core architectures. Researchers aiming to study these architectures are challenged by peculiar characteristics of some components such as Networks-on-Chip and lack of specific tools to evaluate their performance. In this context, the goal of this paper is to present a benchmark suite to evaluate state-of-the-art low-power many-core architectures such as the Kalray MPPA-256 low-power processor, which features 256 compute cores in a single chip. The benchmark was designed and used to highlight important aspects and details that need to be considered when developing parallel applications for emerging low-power many-core architectures. As a result, this paper demonstrates that the benchmark offers a diverse suite of programs with regard to parallel patterns, job types, communication intensity and task load strategies, suitable for a broad understanding of performance and energy consumption of MPPA-256 and upcoming many-core architectures

    PSkel-MPPA: Uma Adaptação do Framework PSkel para o Processador Manycore MPPA-256

    Get PDF
    TCC(graduação) - Universidade Federal de Santa Catarina. Centro Tecnológico. Ciências da Computação.Aplicações paralelas podem ser classificadas de acordo com o padrão de computação e coordenação. Dentre os padrões mais conhecidos destacam-se o map, reduce, pipeline, scan e stencil. Este último é muito utilizado em diversas áreas, como simulação física de partículas, previsão meteorológica, termodinâmica, resolução de funções diferenciais, manipulação de imagens, entre outras. O PSkel é um framework de programação paralela desenvolvido para simplificar o desenvolvimento de aplicações que seguem o padrão stencil. Utilizando uma abstração de alto nível, programador define o kernel da computação, enquanto o framework se encarrega de executar a computação paralela em multicores e em Graphics Processing Units (GPUs) de maneira eficiente. O objetivo deste trabalho é propor uma adaptação do framework PSkel para o processador manycore emergente MPPA-256, batizada de PSkel-MPPA. A motivação para tal adaptação está relacionada à dificuldade de desenvolvimento de aplicações do padrão stencil para o MPPA-256, tendo em vista as suas características arquiteturais intrínsecas que tornam o desenvolvimento de aplicações paralelas onerosas e suscetíveis a erros. A adaptação do framework permite simplificar o desenvolvimento de aplicações stencil para o MPPA-256, escondendo do desenvolvedor detalhes de implementação, tais como a necessidade de comunicação explícita entre as memórias do chip e a distribuição de computações entre os núcleos de processamento. Diversos experimentos foram efetuados com a solução proposta para o MPPA-256, comparando-a com a solução para multicores já existente. Os resultados mostraram que a solução proposta para o MPPA-256 permite reduzir o consumo de energia das aplicações stencil em até 1.45x apesar de apresentar uma perda de desempenho de até 3.3x

    A Simple MPI Library for Lightweight Manycore Processors

    Get PDF
    TCC(graduação) - Universidade Federal de Santa Catarina. Centro Tecnológico. Ciências da Computação.Nas últimas décadas, melhorar o desempenho de núcleos individuais e aumentar o nú- mero de núcleos de alta potência por chip foram as principais tendências na construção de processadores. No entanto, esta combinação levou não apenas a um aumento no poder computacional, mas também a um aumento considerável no seu consumo de energia. Há uma preocupação crescente entre a comunidade científica a respeito da eficiência ener- gética dos supercomputadores modernos. Nos últimos anos, muitos esforços têm sido feitos em pesquisas, buscando soluções alternativas capazes de resolver este problema de escalabilidade e eficiência energética. O desempenho e a eficiência energética providos pelos manycores leves são inegáveis. Contudo, a falta de suporte avançado e portátil para esses processadores, como interfaces padrão de alto desempenho para o desenvolvi- mento de código portável, torna o desenvolvimento de software um desafio. Atualmente, duas abordagens são empregadas tentando aumentar a programabilidade em manycores leves: Sistemas operacionais (SOs) e sistemas de execução (runtimes). A primeira fornece portabilidade mas expõe interfaces de programação complexas no nível do SO aos desen- volvedores. Já a segunda se concentra em fornecer interfaces ricas e de alto desempenho, as quais são específicas do fabricante e resultam em software não portável. Portanto, as soluções existentes forçam os desenvolvedores a escolher entre a portabilidade do software ou um processo de desenvolvimento mais rápido. Para resolver esse dilema, neste traba- lho é proposta uma biblioteca MPI leve e portável (LWMPI) projetada do zero para lidar com as restrições e complexidades dos manycores leves. A LWMPI foi integrada a um SO direcionado a esses processadores, oferecendo assim uma melhor programabilidade e portabilidade implícita para manycores leves, sem incorrer em sobrecargas de desempe- nho excessivas que inviabilizariam o seu uso. Para fornecer uma avaliação abrangente da LWMPI, foram utilizadas três aplicações de uma suíte de benchmarking representativa, usada para avaliar o desempenho de manycores leves, além de um benchmark sintético. Os resultados obtidos no processador Kalray MPPA-256 revelaram que a LWMPI atinge uma performance e uma escalabilidade de desempenho melhor do que uma solução feita especificamente para essa análise e que se utiliza puramente das abstrações de IPC do Nanvix, ao mesmo tempo em que oferece uma interface de programação mais rica.In the last decades, improving the performance of individual cores and increasing the number of high power cores per chip were the main trends in the construction of proces- sors. However, this combination led not only to an increase in the computing capacity, but also to a considerable growth in energy consumption. There is a crescent concern among the scientific community about the energy efficiency of modern supercomputers. In the last years, many efforts have been made in research, searching for alternative solutions capable of solving this problem of scalability and energy efficiency. The performance and energy efficiency provided by lightweight manycores is undeniable. Although, the lack of rich and portable support for these processors, such as high-performance standard inter- faces that deliver portable source codes, makes software development a challenging task. Currently, two approaches are employed trying to improve programmability in lightweight manycores: Operating Systems (OSes) and baremetal runtime systems. The former pro- vides portability but exposes complex OS-level programming interfaces to developers. The latter focuses on providing rich and high performance interfaces, which are vendor- specific and yield to non-portable software. Thus, the existing solutions force software engineers to choose between software portability or a faster development process. To address this dilemma, we propose a portable and lightweight MPI library (LWMPI) de- signed from scratch to cope with restrictions and intricacies of lightweight manycores. We integrated LWMPI into a distributed OS that targets these processors, thus featuring bet- ter programmability and implicit portability for lightweight manycores, without incurring excessive performance overheads that could hinder its use. To deliver a comprehensive evaluation of LWMPI, we relied on three applications from a representative benchmark suite used to assess the performance of lightweight manycores, and a synthetic benchmark. Our results obtained on the Kalray MPPA-256 processor unveiled that LWMPI present better performance and scalability when compared with a specifically made solution that uses the raw Nanvix Inter-Process Communication (IPC) abstractions, while exposing a richer programming interface

    High performance scientific computing in applications with direct finite element simulation

    Get PDF
    xiii, 133 p.La predicción del flujo separado, incluida la pérdida de un avión completo mediantela dinámica de fluidos computacional (CFD) se considera uno de los grandes desaf¿¿os que seresolverán en 2030, según NASA. Las ecuaciones no lineales de Navier-Stokes proporcionan laformulación matemática para flujo de fluidos en espacios tridimensionales. Sin embargo, todaviafaltan soluciones clásicas, existencia y singularidad. Ya que el cálculo de la fuerza bruta esintratable para realizar simulación predictiva para un avión completo, uno puede usar la simulaciónnumérica directa (DNS); sin embargo, prohibitivamente caro ya que necesita resolver laturbulencia a escala de magnitud Re power (9/4). Considerando otros métodos como el estad¿¿sticopromedio Reynolds¿s Average Navier Stokes (RANS), spatial average Large Eddy Simulation(LES), y Hybrid Detached Eddy Simulation (DES), que requieren menos cantidad de grados delibertad. Todos estos métodos deben ajustarse a los problemas de referencia y, además, cerca las paredes, la malla tieneque ser muy fina para resolver las capas l¿¿mite (lo cual significa que el costo computacional es muycostoso). Por encima de todo, los resultados son sensibles a, por ejemplo, parámetros expl¿¿citos enel método, la malla, etc.Como una solución al desaf¿¿o, aqu¿¿ presentamos la adaptación Metodolog¿¿a de solución directa deFEM (DFS) con resolución numérica disparo, como una familia predictiva, libre de parámetros demétodos para flujo turbulento. Resolvimos el modelo de avión JAXA Standard Model (JSM) ennúmero realista de Reynolds, presentado como parte del High Lift Taller de predicción 3.Predijimos un aumento de Cl dentro de un error de 5 % vs experimento, arrastre Cd dentro de 10 %error y detenga 1 ¿ dentro del ángulo de ataque.El taller identificó un probable experimento error depedido 10 % para los resultados de arrastre. La simulación es 10 veces más rápido y más barato encomparación con CFD tradicional o existente enfoques. La eficiencia proviene principalmente dell¿¿mite de deslizamiento condición que permite mallas gruesas cerca de las paredes, orientada aobjetivos control de error adaptativo que refina la malla solo donde es necesario y grandes pasos detiempo utilizando un método de iteración de punto fijo tipo Schur, sin comprometer la precisión delos resultados de la simulación.También presentamos una generalización de DFS a densidad variable y validado contra el problemade referencia MARIN bien establecido. los Los resultados muestran un buen acuerdo con losresultados experimentales en forma de sensores de presión. Más tarde, usamos esta metodolog¿¿apara resolver dos aplicaciones en problemas de flujo multifásico. Uno tiene que ver con un flashtanque de almacenamiento de agua de lluvia (consorcio de agua de Bilbao), y el segundo es sobre eldiseño de una boquilla para impresión 3D. En el agua de lluvia tanque de almacenamiento,predijimos que la altura del agua en el tanque tiene un influencia significativa sobre cómo secomporta el flujo aguas abajo de la puerta del tanque (válvula). Para la impresión 3D,desarrollamos un diseño eficiente con El flujo de chorro enfocado para evitar la oxidación y elcalentamiento en la punta del boquilla durante un proceso de fusión.Finalmente, presentamos aqu¿¿ el paralelismo en múltiples GPU y el incrustado sistema dearquitectura Kalray. Casi todas las supercomputadoras de hoy tienen arquitecturas heterogéneas,1 See the UNESCO Internacional Standard nomenclature for fields of Science and Technologyacomo CPU+GPU u otros aceleradores, y, por lo tanto, es esencial desarrollar marcoscomputacionales para aprovecha de ellos. Como lo hemos visto antes, se comienza a desarrollar eseCFD más tarde en la década de 1060 cuando podemos tener poder computacional, por lo tanto, Esesencial utilizar y probar estos aceleradores para los cálculos de CFD. Las GPU tienen unaarquitectura diferente en comparación con las CPU tradicionales. Técnicamente, la GPU tienemuchos núcleos en comparación con las CPU que hacen de la GPU una buena opción para elcómputo paralelo.Para múltiples GPU, desarrollamos un cálculo de plantilla, aplicado a simulación depliegues geológicos. Exploramos la computación de halo y utilizamos Secuencias CUDA paraoptimizar el tiempo de computación y comunicación. La ganancia de rendimiento resultante fue de23 % para cuatro GPU con arquitectura Fermi, y la mejora correspondiente obtenida en cuatro LasGPU Kepler fueron de 47 %.This research was carried out at the Basque Center for Applied Mathematics (BCAM) within the CFD Computational Technology (CFDCT) and also at the School of Electrical Engineering and Computer Science(Royal Institue of Technology, Stockholm, Sweden). Which is suported by Fundacion Obra Social “la Caixa“, Severo Ochoa Excellence research centre 2014-2018 SEV-2013-0323, Severo Ochoa Excellence research centre 2018-2022 SEV-2017-0718, BERC program 2014-2017, BERC program 2018-2021, MSO4SC European project, Elkartek. This work has been performed using the computing infrastructure from SNIC (Swedish National Infrastructure for Computing)

    High Performance Scientific Computing in Applications with Direct Finite Element Simulation

    Get PDF
    To predict separated flow including stall of a full aircraft with Computational Fluid Dynamics (CFD) is considered one of the problems of the grand challenges to be solved by 2030, according to NASA [1]. The nonlinear Navier- Stokes equations provide the mathematical formulation for fluid flow in 3- dimensional spaces. However, classical solutions, existence, and uniqueness are still missing. Since brute-force computation is intractable, to perform predictive simulation for a full aircraft, one can use Direct Numerical Simulation (DNS); however, it is prohibitively expensive as it needs to resolve the turbulent scales of order Re4 . Considering other methods such as statistical average Reynolds’s Average Navier Stokes (RANS), spatial average Large Eddy Simulation (LES), and hybrid Detached Eddy Simulation (DES), which require less number of degrees of freedom. All of these methods have to be tuned to benchmark problems, and moreover, near the walls, the mesh has to be very fine to resolve boundary layers (which means the computational cost is very expensive). Above all, the results are sensitive to, e.g. explicit parameters in the method, the mesh, etc. As a resolution to the challenge, here we present the adaptive time- resolved Direct FEM Solution (DFS) methodology with numerical tripping, as a predictive, parameter-free family of methods for turbulent flow. We solved the JAXA Standard Model (JSM) aircraft model at realistic Reynolds number, presented as part of the High Lift Prediction Workshop 3. We predicted lift Cl within 5% error vs. experiment, drag Cd within 10% error and stall 1◦ within the angle of attack. The workshop identified a likely experimental error of order 10% for the drag results. The simulation is 10 times faster and cheaper when compared to traditional or existing CFD approaches. The efficiency mainly comes from the slip boundary condition that allows coarse meshes near walls, goal-oriented adaptive error control that refines the mesh only where needed and large time steps using a Schur-type fixed-point iteration method, without compromising the accuracy of the simulation results. As a follow-up, we were invited to the Fifth High Order CFD Workshop, where the approach was validated for a tandem sphere problem (low Reynolds number turbulent flow) wherein a second sphere is placed a certain distance downstream from a first sphere. The results capture the expected slipstream phenomenon, with appx. 2% error. A comparison with the higher-order frameworks Nek500 and PyFR was done. The PyFR framework has demonstrated high effectiveness for GPUs with an unstructured mesh, which is a hard problem in this field. This is achieved by an explicit time-stepping approach. Our study showed that our large time step approach enabled appx. 3 orders of magnitude larger time steps than the explicit time steps in PyFR, which made our method more effective for solving the whole problem. We also presented a generalization of DFS to variable density and validated against the well-established MARIN benchmark problem. The results show good agreement with experimental results in the form of pressure sensors. Later, we used this methodology to solve two applications in multiphase flow problems. One has to do with a flash rainwater storage tank (Bilbao water consortium), and the second is about designing a nozzle for 3D printing. In the flash rainwater storage tank, we predicted that the water height in the tank has a significant influence on how the flow behaves downstream of the tank door (valve). For the 3D printing, we developed an efficient design with the focused jet flow to prevent oxidation and heating at the tip of the nozzle during a melting process. Finally, we presented here the parallelism on multiple GPUs and the embedded system Kalray architecture. Almost all supercomputers today have heterogeneous architectures, such as CPU+GPU or other accelerators, and it is, therefore, essential to develop computational frameworks to take advantage of them. For multiple GPUs, we developed a stencil computation, applied to geological folds simulation. We explored halo computation and used CUDA streams to optimize computation and communication time. The resulting performance gain was 23% for four GPUs with Fermi architecture, and the corresponding improvement obtained on four Kepler GPUs were 47%. The Kalray architecture is designed to have low energy consumption. Here we tested the Jacobi method with different communication strategies. Additionally, visualization is a crucial area when we do scientific simulations. We developed an automated visualization framework, where we could see that task parallelization is more than 10 times faster than data parallelization. We have also used our DFS in the cloud computing setting to validate the simulation against the local cluster simulation. Finally, we recommend the easy pre-processing tool to support DFS simulation.La Caixa 201

    Etude de l'adéquation des machines Exascale pour les algorithmes implémentant la méthode du Reverse Time Migation

    Get PDF
    As we are expecting Exascale systems for the 2018-2020 time frame, performance analysis and characterization of applications for new processor architectures and large scale systems are important tasks that permit to anticipate the required changes to efficiently exploit the future HPC systems. This thesis focuses on seismic imaging applications used for modeling complex physical phenomena, in particular the depth imaging application called Reverse Time Migration (RTM). My first contribution consists in characterizing and modeling the performance of the computational core of RTM which is based on finite-difference time-domain (FDTD) computations. I identify and explore the major tuning parameters influencing performance and the interaction between the architecture and the application. The second contribution is an analysis to identify the challenges for a hybrid and heterogeneous implementation of FDTD for manycore architectures. We target Intel’s first Xeon Phi co-processor, the Knights Corner. This architecture is an interesting proxy for our study since it contains some of the expected features of an Exascale system: concurrency and heterogeneity.My third contribution is an extension of the performance analysis and modeling to the full RTM. This adds communications and IOs to the computation part. RTM is a data intensive application and requires the storage of intermediate values of the computational field resulting in expensive IO accesses. My fourth contribution is the final measurement and model validation of my hybrid RTM implementation on a large system. This has been done on Stampede, a machine of the Texas Advanced Computing Center (TACC), which allows us to test the scalability up to 64 nodes each containing one 61-core Xeon Phi and two 8-core CPUs for a total close to 5000 heterogeneous coresLa caractérisation des applications en vue de les préparer pour les nouvelles architectures et les porter sur des systèmes très étendus est une étape importante pour pouvoir anticiper les modifications nécessaires. Comme les machines Exascale sont prévues pour la période 2018-2020, l'étude des applications et leur préparation pour ces machines s'avèrent donc essentielles. Nous nous intéressons aux applications d'imagerie sismique et en particulier à l'application Reverse Time Migration (RTM) car elle est très utilisée par les pétroliers dans le cadre de l'exploration sismique.La première partie de nos travaux a porté sur l'étude du cœur de calcul de l'application RTM qui consiste en un calcul de différences finies dans le domaine temporel (FDTD). Nous avons caractérisé cette partie de l'application en soulevant les aspects architecturaux des machines actuelles ayant un fort impact sur la performance, notamment les caches, les bandes passantes et le prefetching. Cette étude a abouti à l'élaboration d'un modèle de performance permettant de prédire le trafic DRAM des FDTD. La deuxième partie de la thèse se focalise sur l'impact de l'hétérogénéité et le parallélisme sur la FDTD et sur RTM. Nous avons choisi l'architecture manycore d’Intel, Xeon Phi, et nous avons étudié une implémentation "native" et une implémentation hétérogène et hybride, la version "symmetric". Enfin, nous avons porté l'application RTM sur un cluster hétérogène, Stampede du Texas Advanced Computing Center (TACC), où nous avons effectué des tests de scalabilité allant jusqu'à 64 nœuds contenant des coprocesseurs Xeon Phi et des processeurs Sandy Bridge ce qui correspond à presque 5000 cœur

    Optimization Techniques for Parallel Programming of Embedded Many-Core Computing Platforms

    Get PDF
    Nowadays many-core computing platforms are widely adopted as a viable solution to accelerate compute-intensive workloads at different scales, from low-cost devices to HPC nodes. It is well established that heterogeneous platforms including a general-purpose host processor and a parallel programmable accelerator have the potential to dramatically increase the peak performance/Watt of computing architectures. However the adoption of these platforms further complicates application development, whereas it is widely acknowledged that software development is a critical activity for the platform design. The introduction of parallel architectures raises the need for programming paradigms capable of effectively leveraging an increasing number of processors, from two to thousands. In this scenario the study of optimization techniques to program parallel accelerators is paramount for two main objectives: first, improving performance and energy efficiency of the platform, which are key metrics for both embedded and HPC systems; second, enforcing software engineering practices with the aim to guarantee code quality and reduce software costs. This thesis presents a set of techniques that have been studied and designed to achieve these objectives overcoming the current state-of-the-art. As a first contribution, we discuss the use of OpenMP tasking as a general-purpose programming model to support the execution of diverse workloads, and we introduce a set of runtime-level techniques to support fine-grain tasks on high-end many-core accelerators (devices with a power consumption greater than 10W). Then we focus our attention on embedded computer vision (CV), with the aim to show how to achieve best performance by exploiting the characteristics of a specific application domain. To further reduce the power consumption of parallel accelerators beyond the current technological limits, we describe an approach based on the principles of approximate computing, which implies modification to the program semantics and proper hardware support at the architectural level
    corecore