110 research outputs found

    Energy Transparency for Deeply Embedded Programs

    Get PDF
    Energy transparency is a concept that makes a program's energy consumption visible, from hardware up to software, through the different system layers. Such transparency can enable energy optimizations at each layer and between layers, and help both programmers and operating systems make energy-aware decisions. In this paper, we focus on deeply embedded devices, typically used for Internet of Things (IoT) applications, and demonstrate how to enable energy transparency through existing Static Resource Analysis (SRA) techniques and a new target-agnostic profiling technique, without hardware energy measurements. Our novel mapping technique enables software energy consumption estimations at a higher level than the Instruction Set Architecture (ISA), namely the LLVM Intermediate Representation (IR) level, and therefore introduces energy transparency directly to the LLVM optimizer. We apply our energy estimation techniques to a comprehensive set of benchmarks, including single- and also multi-threaded embedded programs from two commonly used concurrency patterns, task farms and pipelines. Using SRA, our LLVM IR results demonstrate a high accuracy with a deviation in the range of 1% from the ISA SRA. Our profiling technique captures the actual energy consumption at the LLVM IR level with an average error of 3%.Comment: 33 pages, 7 figures. arXiv admin note: substantial text overlap with arXiv:1510.0709

    OTAWA: An Open Toolbox for Adaptive WCET Analysis

    Get PDF
    International audienceThe analysis of worst-case execution times has become mandatory in the design of hard real-time systems: it is absolutely necessary to know an upper bound of the execution time of each task to determine a task schedule that insures that deadlines will all be met. The OTAWA toolbox presented in this paper has been designed to host algorithms resulting from research in the domain of WCET analysis so that they can be combined to compute tight WCET estimates. It features an abstraction layer that decouples the analyses from the target hardware and from the instruction set architecture, as well as a set of functionalities that facilitate the implementation of new approaches

    ACOTES project: Advanced compiler technologies for embedded streaming

    Get PDF
    Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version

    Astaroth: Ohjelmistokirjasto stensiililaskentaan grafiikkasuorittimilla

    Get PDF
    Graphics processing units (GPUs) are coprocessors, which offer higher throughput and better power efficiency than central processing units in dataparallel tasks. For this reason, graphics processors provide a good platform for high-performance computing. However, programming GPUs such that all the available performance is utilized requires in-depth knowledge of the architecture of the hardware. Additionally, the problem of high-order stencil computations on GPUs in challenging multiphysics applications has not been adequately explored in previous work. In this thesis, we address these issues by presenting a library, an efficient algorithm and a domain-specific language for solving stencil computations within a structured grid. We tested our implementation by simulating magnetohydrodynamics, which involved the computation of first, second, and cross partial derivatives using second-, fourth-, sixth-, and eight-order finite differences with single and double precision. The running time of our integration kernel was 2.8–9.1 times slower than the theoretical minimum time, which it would take to read the computational domain and write it back to device memory exactly once, without taking into account the effects of finite caches or arithmetic operations on performance. Additionally, we made a performance comparison with a CPU solver widely used for scientific computations, which we benchmarked on a total of 24 cores of two Intel Xeon E5-2690 v3 processors. Our solver, benchmarked on a Tesla P100 PCIe GPU, outperformed the CPU solver by factors of 6.7 and 10.4 when using single and double precision, respectively.Grafiikkasuorittimet ovat apusuorittimia, jotka tarjoavat rinnakkain laskettavissa tehtävissä parempaa suoritus- ja energiatehokkuutta kuin keskussuorittimet. Tästä syystä grafiikkasuorittimet tarjoavat hyvän alustan suurteholaskennan tarpeisiin. Toisaalta grafiikkasuorittimen ohjelmointi siten, että kaikki tarjolla oleva suorituskyky saadaan hyödynnettyä, vaatii syvällistä asiantuntemusta ohjelmoitavan laitteiston arkkitehtuurista. Korkean asteen stensiililaskentaa haastavissa fysiikkasovelluksissa ei ole myöskään tutkittu laajalti aiemmissa julkaisuissa. Tässä työssä otamme kantaa näihin ongelmiin esittelemällä ohjelmistokirjaston, tehokkaan algoritmin, sekä tehtävään räätälöidyn ohjelmointikielen stensiililaskujen ratkaisemiseen säännöllisessä hilassa. Testasimme toteutustamme simuloimalla magnetohydrodynamiikkaa, johon kuului ensimmäisen ja toisen kertaluvun derivaattojen lisäksi ristiderivaattojen ratkaisutoisen, neljännen, kuudennen ja kahdeksannen kertaluvun differenssimenetelmällä käyttäen sekä 32- että 64-bittisiä liukulukuja. Integrointifunktiomme suoritusaika oli 2.8–9.1 kertaa hitaampi kuin teoreettinen vähimmäisajoaika, joka menisi laskennallisen alueen lukemiseen ja kirjoittamiseen apusuorittimen muistista täsmälleen kerran, ottamatta huomioon äärellisen välimuistin tai laskennan vaikutusta suoritusaikaan. Vertasimme kirjastomme suoritusaikaa laajalti tieteellisessä laskennassa käytettyyn keskussuorittimille tarkoitettuun ratkaisijaan, jonka ajoimme kokonaisuudessaan 24:llä ytimellä kahdella Intel Xeon E5-2690 v3 -suorittimella. Tähän ratkaisijaan verrattuna Tesla P100 PCIe -grafiikkasuorittimella ajettu ratkaisijamme oli 6.7 ja 10.4 kertaa nopeampi 32- ja 64-bittisillä liukuluvuilla laskettaessa, tässä järjestyksessä

    ASAM : Automatic Architecture Synthesis and Application Mapping; dl. 3.2: Instruction set synthesis

    Get PDF
    No abstract

    Un framework pour l'exécution efficace d'applications sur GPU et CPU+GPU

    Get PDF
    Technological limitations faced by the semi-conductor manufacturers in the early 2000's restricted the increase in performance of the sequential computation units. Nowadays, the trend is to increase the number of processor cores per socket and to progressively use the GPU cards for highly parallel computations. Complexity of the recent architectures makes it difficult to statically predict the performance of a program. We describe a reliable and accurate parallel loop nests execution time prediction method on GPUs based on three stages: static code generation, offline profiling, and online prediction. In addition, we present two techniques to fully exploit the computing resources at disposal on a system. The first technique consists in jointly using CPU and GPU for executing a code. In order to achieve higher performance, it is mandatory to consider load balance, in particular by predicting execution time. The runtime uses the profiling results and the scheduler computes the execution times and adjusts the load distributed to the processors. The second technique, puts CPU and GPU in a competition: instances of the considered code are simultaneously executed on CPU and GPU. The winner of the competition notifies its completion to the other instance, implying the termination of the latter.Les verrous technologiques rencontrés par les fabricants de semi-conducteurs au début des années deux-mille ont abrogé la flambée des performances des unités de calculs séquentielles. La tendance actuelle est à la multiplication du nombre de cœurs de processeur par socket et à l'utilisation progressive des cartes GPU pour des calculs hautement parallèles. La complexité des architectures récentes rend difficile l'estimation statique des performances d'un programme. Nous décrivons une méthode fiable et précise de prédiction du temps d'exécution de nids de boucles parallèles sur GPU basée sur trois étapes : la génération de code, le profilage offline et la prédiction online. En outre, nous présentons deux techniques pour exploiter l'ensemble des ressources disponibles d'un système pour la performance. La première consiste en l'utilisation conjointe des CPUs et GPUs pour l'exécution d'un code. Afin de préserver les performances il est nécessaire de considérer la répartition de charge, notamment en prédisant les temps d'exécution. Le runtime utilise les résultats du profilage et un ordonnanceur calcule des temps d'exécution et ajuste la charge distribuée aux processeurs. La seconde technique présentée met le CPU et le GPU en compétition : des instances du code cible sont exécutées simultanément sur CPU et GPU. Le vainqueur de la compétition notifie sa complétion à l'autre instance, impliquant son arrêt

    Generating and auto-tuning parallel stencil codes

    Get PDF
    In this thesis, we present a software framework, Patus, which generates high performance stencil codes for different types of hardware platforms, including current multicore CPU and graphics processing unit architectures. The ultimate goals of the framework are productivity, portability (of both the code and performance), and achieving a high performance on the target platform. A stencil computation updates every grid point in a structured grid based on the values of its neighboring points. This class of computations occurs frequently in scientific and general purpose computing (e.g., in partial differential equation solvers or in image processing), justifying the focus on this kind of computation. The proposed key ingredients to achieve the goals of productivity, portability, and performance are domain specific languages (DSLs) and the auto-tuning methodology. The Patus stencil specification DSL allows the programmer to express a stencil computation in a concise way independently of hardware architecture-specific details. Thus, it increases the programmer productivity by disburdening her or him of low level programming model issues and of manually applying hardware platform-specific code optimization techniques. The use of domain specific languages also implies code reusability: once implemented, the same stencil specification can be reused on different hardware platforms, i.e., the specification code is portable across hardware architectures. Constructing the language to be geared towards a special purpose makes it amenable to more aggressive optimizations and therefore to potentially higher performance. Auto-tuning provides performance and performance portability by automated adaptation of implementation-specific parameters to the characteristics of the hardware on which the code will run. By automating the process of parameter tuning — which essentially amounts to solving an integer programming problem in which the objective function is the number representing the code's performance as a function of the parameter configuration, — the system can also be used more productively than if the programmer had to fine-tune the code manually. We show performance results for a variety of stencils, for which Patus was used to generate the corresponding implementations. The selection includes stencils taken from two real-world applications: a simulation of the temperature within the human body during hyperthermia cancer treatment and a seismic application. These examples demonstrate the framework's flexibility and ability to produce high performance code

    Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture

    Full text link
    • …
    corecore