14 research outputs found

    Devito: Towards a generic Finite Difference DSL using Symbolic Python

    Domain-specific languages (DSLs) have been used in a variety of fields to express complex scientific problems in a concise manner and to provide automated performance optimization for a range of computational architectures. As such, DSLs provide a powerful mechanism to speed up scientific Python computation that goes beyond traditional vectorization and pre-compilation approaches, while allowing domain scientists to build applications within the comforts of the Python software ecosystem. In this paper we present Devito, a new finite-difference DSL that provides optimized stencil computation from high-level problem specifications based on symbolic Python expressions. We demonstrate Devito's symbolic API and performance advantages over traditional Python acceleration methods before highlighting its use in the scientific context of seismic inversion problems. Comment: pyHPC 2016 conference submission.
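The symbolic-specification-to-stencil path that such a DSL automates can be illustrated with a minimal sketch. This is not Devito's actual API; it uses SymPy to derive a finite-difference stencil from a symbolic derivative and then applies a hand-lowered NumPy kernel corresponding to it:

```python
# Illustrative sketch (not Devito's API): derive a 1-D second-derivative
# stencil symbolically, then apply the corresponding vectorized NumPy kernel.
import numpy as np
import sympy as sp

x, h = sp.symbols("x h")
u = sp.Function("u")

# Symbolic central-difference approximation of d2u/dx2 on points x-h, x, x+h.
stencil = u(x).diff(x, 2).as_finite_difference([x - h, x, x + h])
# stencil == (u(x-h) - 2*u(x) + u(x+h)) / h**2

# Hand-lowered NumPy kernel matching the symbolic stencil above.
def d2dx2(f, dx):
    out = np.zeros_like(f)
    out[1:-1] = (f[:-2] - 2.0 * f[1:-1] + f[2:]) / dx**2
    return out

# Check against an exact solution: d2/dx2 sin(x) = -sin(x).
dx = 1e-3
grid = np.arange(0.0, 2.0 * np.pi, dx)
err = np.max(np.abs(d2dx2(np.sin(grid), dx)[1:-1] + np.sin(grid)[1:-1]))
```

A DSL like the one described automates exactly this lowering step, and additionally applies loop-level and architecture-specific optimizations to the generated stencil code.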

    Architectural explorations for streaming accelerators with customized memory layouts

    The basic concept behind the architecture of a general-purpose CPU core conforms well to a serial programming model. The integration of more cores on a single chip enables CPUs to run parts of a program in parallel. However, the huge parallelism available in many high-performance applications and their data is hard to exploit on these general-purpose multi-cores. Streaming accelerators and the corresponding programming models improve upon this situation by providing throughput-oriented architectures. The basic idea behind the design of these architectures matches the ever-increasing requirements of processing huge data sets. These high-performance, throughput-oriented devices enable fast processing of data through efficient parallel computation and streaming-based communication. Throughput-oriented streaming accelerators, like other processors, consist of numerous micro-architectural components, including memory structures, compute units, control units, I/O channels and I/O controls. However, the throughput requirements add some special features and impose other restrictions that affect performance. These devices normally offer a large number of compute resources, but require applications to arrange parallel and maximally independent data sets that feed the compute resources in the form of streams. Arranging data into independent sets of parallel streams is not a simple task: it may require changing the structure of an algorithm as a whole, or even rewriting the algorithm from scratch for the target application. Even then, all these efforts to rearrange an application's data-access patterns may not suffice to achieve optimal performance, because of micro-architectural constraints of the target platform: the hardware pre-fetching mechanisms, the size and granularity of the local storage, and the flexibility of data marshaling inside the local storage. The constraints a general-purpose streaming platform places on pre-fetching, storing and marshaling data into parallel, independent streams can be removed by employing micro-architectural design approaches, including the use of application-specific customized memories in the front-end of a streaming architecture. The focus of this thesis is to present architectural explorations for streaming accelerators using customized memory layouts. The thesis covers three main aspects of such accelerators: i) design of application-specific accelerators with customized memory layouts, ii) template-based design support for customized-memory accelerators, and iii) design-space explorations for throughput-oriented devices with standard and customized memories. The thesis concludes with a conceptual proposal for a Blacksmith Streaming Architecture (BSArc). The Blacksmith computing model allows the hardware-level adoption of an application-specific front-end with a GPU-like streaming back-end. This gives an opportunity to exploit maximal data locality and data-level parallelism from an application while providing a powerful, throughput-oriented back-end. We consider that the design of these specialized memory layouts for the front-end of the device should be provided by application-domain experts in the form of templates, adjustable to the device and the problem size at the device's configuration time. The physical availability of such an architecture may still take time; however, a simulation framework supports the architectural explorations, gives insight into the proposal and predicts potential performance benefits of such an architecture.
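The data-marshaling problem the abstract describes, rearranging records into independent parallel streams, can be sketched as a minimal array-of-structs to struct-of-arrays transform. Field counts and shapes here are illustrative, not taken from the thesis:

```python
# Illustrative sketch: rearranging interleaved records (array-of-structs) into
# independent parallel streams (struct-of-arrays), the kind of data marshaling
# a streaming front-end performs so each compute lane reads a contiguous stream.
import numpy as np

def aos_to_soa(records, n_fields):
    """Split an interleaved 1-D buffer [a0,b0,c0,a1,b1,c1,...] into
    n_fields contiguous streams [[a0,a1,...],[b0,b1,...],...]."""
    return [records[i::n_fields].copy() for i in range(n_fields)]

interleaved = np.arange(12.0)         # 4 records x 3 fields, interleaved
streams = aos_to_soa(interleaved, 3)  # 3 independent contiguous streams
```

On a real streaming device this gather step is exactly what the customized front-end memories are meant to absorb, so the compute back-end sees only contiguous, independent streams.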

    Green Wave : A Semi Custom Hardware Architecture for Reverse Time Migration

    Over the course of the last few decades, the scientific community has greatly benefited from steady advances in compute performance. Until the early 2000s this improvement was achieved through rising clock rates, which enabled plug-and-play performance gains for all codes. In 2005 the stagnation of CPU clock rates drove computing hardware manufacturers to pursue future performance through explicit parallelism. Now the HPC community faces a new, even bigger challenge. So far, performance gains have been achieved through replication of general-purpose cores and nodes. Unfortunately, rising cluster sizes have resulted in skyrocketing energy costs; a paradigm change in HPC architecture design is inevitable. In combination with the increasing costs of data movement, this has driven the HPC community to explore alternatives like GPUs and large arrays of simple, low-power cores (e.g. BlueGene) that offer better performance per watt and greater scalability. Like science in general, the seismic community faces large-scale, complex computational challenges that can only partially be solved with available compute capabilities. Such challenges include the physically correct modeling of subsurface rock layers. This thesis analyzes the requirements and performance of isotropic (ISO), vertical transverse isotropic (VTI) and tilted transverse isotropic (TTI) wave-propagation kernels as they appear in the Reverse Time Migration (RTM) imaging method. It finds that even with leading-edge, commercial off-the-shelf hardware, large-scale surveys cannot be imaged within reasonable time and power constraints. This thesis uses a novel architecture design method for HPC that leverages a hardware/software co-design approach adopted from the mobile and embedded markets. The methodology tailors an architecture design to a class of applications without the loss of generality incurred by full-custom designs. This approach was first applied in the Green Flash project, which proved that co-design has the potential for high energy-efficiency gains. This thesis presents the novel Green Wave architecture, derived from the Green Flash project. Rather than focusing on climate codes, like Green Flash, Green Wave targets RTM wave-propagation kernels. The goal of the application-driven, co-designed Green Wave approach is to enable full programmability while allowing greater computational efficiency than general-purpose processors or GPUs, by offering custom extensions to the processor's ISA, correctly sized software-managed memories, and an efficient on-chip network interconnect. The lowest-level building blocks of the Green Wave design are pre-verified IP components. This minimizes the amount of custom logic in the design, which in turn reduces verification costs and design uncertainty. This thesis introduces three Green Wave architecture designs derived from the ISO, VTI and TTI kernel analysis. Further, a programming model is proposed that is capable of hiding all communication latencies. Using production-strength, cycle-accurate hardware simulators, Green Wave is benchmarked and its performance compared to leading on-market systems from Intel, AMD and NVidia. Based on a large-scale example survey, the results show that Green Wave has the potential for a 5x energy-efficiency improvement over x86 clusters and 1.4x-4x over GPU-based clusters for ISO, VTI and TTI kernels.
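The structure of the ISO kernel class analyzed above can be sketched with a minimal 1-D reference time step. This is an illustrative leapfrog scheme for the acoustic wave equation, not Green Wave's actual implementation:

```python
# Minimal sketch of an isotropic (ISO) wave-propagation time step: one
# leapfrog update of u_tt = c^2 u_xx with a 3-point Laplacian. Illustrative
# 1-D reference only; production RTM kernels are 3-D with high-order stencils.
import numpy as np

def iso_step(u_prev, u_curr, c2dt2_over_dx2):
    """Advance the wavefield one step; boundary points are left at zero."""
    u_next = np.zeros_like(u_curr)
    lap = u_curr[:-2] - 2.0 * u_curr[1:-1] + u_curr[2:]
    u_next[1:-1] = 2.0 * u_curr[1:-1] - u_prev[1:-1] + c2dt2_over_dx2 * lap
    return u_next

n = 64
u0, u1 = np.ones(n), np.ones(n)
u2 = iso_step(u0, u1, 0.25)  # a constant field is preserved in the interior
```

The key hardware-relevant property visible even in this sketch is the low arithmetic intensity: each output point needs several neighboring loads per handful of flops, which is why memory layout and software-managed local storage dominate the design trade-offs discussed in the thesis.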

    Software for Exascale Computing - SPPEXA 2016-2019

    This open access book summarizes the research done and results obtained in the second funding phase of the Priority Program 1648 "Software for Exascale Computing" (SPPEXA) of the German Research Foundation (DFG), presented at the SPPEXA Symposium in Dresden during October 21-23, 2019. In that respect, it both represents a continuation of Vol. 113 in Springer's series Lecture Notes in Computational Science and Engineering, the corresponding report of SPPEXA's first funding phase, and provides an overview of SPPEXA's contributions towards exascale computing in today's supercomputer technology. The individual chapters address one or more of the research directions (1) computational algorithms, (2) system software, (3) application software, (4) data management and exploration, (5) programming, and (6) software tools. The book has an interdisciplinary appeal: scholars from computational sub-fields in computer science, mathematics, physics, or engineering will find it of particular interest.

    Seismic Waves

    The importance of seismic wave research lies not only in our ability to understand and predict earthquakes and tsunamis, but also in the information it reveals about the Earth's composition and features, much as it led to the discovery of the Mohorovičić discontinuity. As our theoretical understanding of the physics behind seismic waves has grown, physical and numerical modeling have greatly advanced and now augment applied seismology for better prediction and engineering practices. This has led to some novel applications, such as using artificially induced shocks for exploration of the Earth's subsurface and seismic stimulation for increasing the productivity of oil wells. This book demonstrates the latest techniques and advances in seismic wave analysis, from theoretical approaches, data acquisition and interpretation, to analyses and numerical simulations, as well as research applications. A review process was conducted with the sincere support of Drs. Hiroshi Takenaka, Yoshio Murai, Jun Matsushima, and Genti Toyokuni.

    Overhauling Sound Diffusion in Auditoria Using Deep-Subwavelength Acoustic Metamaterials

    The reduced amount of space available in critical listening environments, such as orchestra pits, rehearsal rooms or even recording studios, often impairs the installation of helpful, but sizeable, acoustic treatments on their boundaries. This is a problem because such acoustic treatments, mainly used for sound absorption and diffusion, are key to controlling the physical aspects of sound propagation in the environment. This research therefore studies, experimentally and numerically, a cutting-edge metamaterial-inspired approach designed to provide ultra-thin and adaptable alternatives to traditional acoustic treatments, with a particular focus on sound diffusion, and examines how these can be integrated into practical computational frameworks. These novel deep-subwavelength acoustic metamaterials, termed metadiffusers, provide efficient sound diffusion at 1/10th to 1/20th the thickness of ordinary sound diffusers. Moreover, the optimization potential of metadiffusers offers a vast range of configurations adaptable to the requirements of each situation. Results presented throughout this thesis outline several of these configurations, with experimental and/or numerical validations in free-field scattering scenarios as well as numerical room-acoustic applications. Very good agreement is found throughout between the analytical and experimental/numerical scattering and diffusion datasets, demonstrating the outstanding and versatile potential of metadiffusers for many critical listening environments where space is at a premium, such as orchestra pits or recording studios.
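The diffusion performance the abstract refers to is commonly quantified with the directional diffusion coefficient of AES-4id-2001, computed from the polar response of scattered sound-pressure levels. A minimal sketch, using synthetic level data rather than any measurement from the thesis:

```python
# Directional diffusion coefficient (AES-4id-2001 style): 1.0 for a perfectly
# uniform polar response, approaching 0.0 as energy concentrates in one lobe.
# The sound-pressure levels below are synthetic illustration data.
import numpy as np

def diffusion_coefficient(levels_db):
    p = 10.0 ** (np.asarray(levels_db, dtype=float) / 10.0)  # energies
    n = p.size
    return (p.sum() ** 2 - (p ** 2).sum()) / ((n - 1) * (p ** 2).sum())

uniform = diffusion_coefficient([60.0] * 19)         # identical levels -> 1.0
peaked = diffusion_coefficient([60.0] + [0.0] * 18)  # one dominant lobe -> ~0
```

Optimizing a metadiffuser amounts to tuning its sub-wavelength geometry so that a coefficient of this kind approaches 1 across the frequency band of interest.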