14 research outputs found

    A Formally Verified Compiler for Lustre

    Get PDF
    The correct compilation of block-diagram languages like Lustre, Scade, and a discrete subset of Simulink is important since they are used to program critical embedded control software. We describe the specification and verification, in an interactive theorem prover, of a compilation chain that treats the key aspects of Lustre: sampling, nodes, and delays. Building on CompCert, we show that repeated execution of the generated assembly code faithfully implements the dataflow semantics of source programs. We resolve two key technical challenges. The first is the change from a synchronous dataflow semantics, where programs manipulate streams of values, to an imperative one, where computations manipulate memory sequentially. The second is the verified compilation of an imperative language with encapsulated state to C code where the state is realized by nested records. We also treat a standard control optimization that eliminates unnecessary conditional statements.
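    As a concrete, hypothetical illustration of the dataflow-to-imperative step described above, consider a small counter node in Lustre-like syntax, n = 0 -> pre n + (if tick then 1 else 0), and a hand-written C sketch of how such a node can compile to a state record plus a step function. The scheme (a delay becomes a field, the initialization operator becomes a flag, one call per stream instant) follows the abstract's description; the node and all names are our own example, not the verified compiler's actual output.

```c
#include <stdio.h>
#include <stdbool.h>

/* Hypothetical state record for a Lustre-like node
 *   node count(tick : bool) returns (n : int)
 *     n = 0 -> (pre n) + (if tick then 1 else 0);
 * The delay (pre) becomes a field; each stream instant becomes
 * one call to the step function. */
typedef struct {
    bool init;   /* first instant? (realizes the `->` operator) */
    int  pre_n;  /* memory for `pre n` */
} count_state;

static void count_reset(count_state *s) {
    s->init = true;
    s->pre_n = 0;
}

static int count_step(count_state *s, bool tick) {
    int n = s->init ? 0 : s->pre_n + (tick ? 1 : 0);
    s->init = false;
    s->pre_n = n;  /* update the delay for the next instant */
    return n;
}

int main(void) {
    count_state s;
    count_reset(&s);
    bool ticks[] = { true, true, false, true };
    for (int i = 0; i < 4; i++)
        printf("n = %d\n", count_step(&s, ticks[i]));  /* 0 1 1 2 */
    return 0;
}
```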

    Improving Model-Based Software Synthesis: A Focus on Mathematical Structures

    Get PDF
    Computer hardware keeps increasing in complexity. Software design needs to keep up with this. The right models and abstractions empower developers to leverage the novelties of modern hardware. This thesis deals primarily with Models of Computation, as a basis for software design, in a family of methods called software synthesis. We focus on Kahn Process Networks and dataflow applications as abstractions, both for programming and for deriving an efficient execution on heterogeneous multicores. The latter we accomplish by exploring the design space of possible mappings of computation and data to hardware resources. Mapping algorithms are not at the center of this thesis, however. Instead, we examine the mathematical structure of the mapping space, leveraging its inherent symmetries or geometric properties to improve mapping methods in general. This thesis thoroughly explores the process of model-based design, aiming to go beyond the more established software synthesis on dataflow applications. We start with the problem of assessing these methods through benchmarking and go on to formally examine the general goals of benchmarks. In this context, we also consider the role modern machine-learning methods play in benchmarking. We explore different established semantics, stretching the limits of Kahn Process Networks. We also discuss novel models, like Reactors, which are designed to be a deterministic, adaptive model with time as a first-class citizen. By investigating abstractions and transformations in the Ohua language for implicit dataflow programming, we also focus on programmability. The focus of the thesis is on the models and methods, but we evaluate them in diverse use cases, generally centered around Cyber-Physical Systems. These include the 5G telecommunication standard and the automotive and signal-processing domains. We even go beyond embedded systems and discuss use cases in GPU programming and microservice-based architectures.
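    To make the symmetry argument concrete: on identical cores, two mappings that differ only by a relabeling of the cores have the same cost, so the mapping space can be explored up to that symmetry. The toy C sketch below (our own example with made-up sizes, not an algorithm from the thesis) canonicalizes mappings of 3 tasks onto 2 identical cores and counts the equivalence classes.

```c
#include <stdio.h>
#include <string.h>

/* With identical cores, permuting core labels does not change a
 * mapping's cost, so mappings can be explored up to symmetry.
 * Canonical form: relabel cores in order of first appearance. */
#define TASKS 3
#define CORES 2

static void canonicalize(const int *m, int *out) {
    int relabel[CORES], next = 0;
    memset(relabel, -1, sizeof relabel);
    for (int t = 0; t < TASKS; t++) {
        if (relabel[m[t]] < 0) relabel[m[t]] = next++;
        out[t] = relabel[m[t]];
    }
}

int main(void) {
    int seen[8][TASKS], nseen = 0;
    for (int code = 0; code < 8; code++) {   /* 2^3 raw mappings */
        int m[TASKS] = { code & 1, (code >> 1) & 1, (code >> 2) & 1 };
        int c[TASKS], dup = 0;
        canonicalize(m, c);
        for (int i = 0; i < nseen; i++)
            if (!memcmp(seen[i], c, sizeof c)) { dup = 1; break; }
        if (!dup) memcpy(seen[nseen++], c, sizeof c);
    }
    printf("8 raw mappings, %d up to core symmetry\n", nseen); /* 4 */
    return 0;
}
```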

    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    Get PDF
    The increasing demand for processing a higher number of applications and related data on computing platforms has resulted in reliance on multi-/many-core chips, as they facilitate parallel processing. However, these platforms must also be energy-efficient and reliable, and must perform secure computations in the interest of the whole community. This book provides perspectives on the aforementioned aspects from leading researchers in terms of state-of-the-art contributions and upcoming trends.

    Design and Code Optimization for Systems with Next-generation Racetrack Memories

    Get PDF
    With the rise of computationally expensive application domains such as machine learning, genomics, and fluids simulation, the quest for performance and energy-efficient computing has gained unprecedented momentum. The significant increase in computing and memory devices in modern systems has resulted in an unsustainable surge in energy consumption, a substantial portion of which is attributed to the memory system. The scaling of conventional memory technologies and their suitability for next-generation systems is also questionable. This has led to the emergence and rise of nonvolatile memory (NVM) technologies. Today, in different development stages, several NVM technologies are competing for rapid access to the market. Racetrack memory (RTM) is one such nonvolatile memory technology that promises SRAM-comparable latency, reduced energy consumption, and unprecedented density compared to other technologies. However, RTM is sequential in nature, i.e., data in an RTM cell needs to be shifted to an access port before it can be accessed. These shift operations incur performance and energy penalties. An ideal RTM, requiring at most one shift per access, can easily outperform SRAM. However, in the worst-case shifting scenario, RTM can be an order of magnitude slower than SRAM. This thesis presents an overview of the RTM device physics, its evolution, strengths and challenges, and its application in the memory subsystem. We develop tools that allow the programmability and modeling of RTM-based systems. For shift minimization, we propose a set of techniques including optimal, near-optimal, and evolutionary algorithms for efficient scalar and instruction placement in RTMs. For array accesses, we explore schedule and layout transformations that eliminate the longer overhead shifts in RTMs. We present an automatic compilation framework that analyzes static control-flow programs and transforms the loop traversal order and memory layout to maximize accesses to consecutive RTM locations and minimize shifts. We develop a simulation framework called RTSim that models various RTM parameters and enables accurate architectural-level simulation. Finally, to demonstrate the RTM potential in non-von-Neumann in-memory computing paradigms, we exploit its device attributes to implement logic and arithmetic operations. As a concrete use case, we implement an entire hyperdimensional computing framework in RTM to accelerate the language recognition problem. Our evaluation shows considerable performance and energy improvements compared to conventional von Neumann models and state-of-the-art accelerators.
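    The shift penalty is easy to model. The toy C sketch below (our own model, far simpler than the RTSim framework mentioned above) charges one shift per position moved between the access port and the accessed item, and shows how data placement changes the total shift count for the same access sequence.

```c
#include <stdio.h>
#include <stdlib.h>

/* Toy RTM shift model: a track holds data items and one access port;
 * reading the item at track position p costs |p - port| shifts, after
 * which the track is aligned so that the port sits at p. */
static int shift_cost(const int *pos, const int *accesses, int n) {
    int port = 0, cost = 0;
    for (int i = 0; i < n; i++) {
        cost += abs(pos[accesses[i]] - port);
        port = pos[accesses[i]];
    }
    return cost;
}

int main(void) {
    /* Access sequence over variables a=0, b=1, c=2: a c a c b. */
    int seq[] = { 0, 2, 0, 2, 1 };
    int n = sizeof seq / sizeof seq[0];
    int decl_order[] = { 0, 1, 2 };  /* a@0 b@1 c@2, declaration order */
    int affinity[]   = { 0, 2, 1 };  /* a@0 c@1 b@2, hot pair adjacent */
    printf("declaration order: %d shifts\n", shift_cost(decl_order, seq, n)); /* 7 */
    printf("affinity-based:    %d shifts\n", shift_cost(affinity, seq, n));   /* 4 */
    return 0;
}
```

    Placing the frequently co-accessed pair a and c next to each other nearly halves the shift count here; the placement techniques in the thesis automate exactly this kind of decision at scale.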

    Parallel Processes in HPX: Designing an Infrastructure for Adaptive Resource Management

    Get PDF
    Advancements in cutting-edge technologies have enabled better energy efficiency as well as scaling of computational power for the latest High Performance Computing (HPC) systems. However, the complexity of hybrid architectures, as well as emerging classes of applications, has led to poor computational scalability under conventional execution models. Thus, alternative means of computation that address these bottlenecks are warranted. More precisely, a dynamic, adaptive resource-management capability, from both the system's and the application's perspective, is essential for better computational scalability and efficiency. This research presents and expands the notion of Parallel Processes as a placeholder for procedure definitions, targeted at one or more synchronous domains, as metadata for computation and resource management, and as infrastructure for dynamic policy deployment. In addition, the research presents guidelines for a resource-management framework in the HPX runtime system. Further, this research lists design principles for the scalability of the Active Global Address Space (AGAS), a necessary feature for Parallel Processes. To verify the usefulness of Parallel Processes, a preliminary performance evaluation of different task-scheduling policies is carried out using two different applications: Unbalanced Tree Search, a reference dynamic graph application implemented in HPX as part of this research, and MiniGhost, a reference stencil-based application using the bulk-synchronous parallel model. The results show that different scheduling policies provide better performance for different classes of applications, and that for the same application class one policy fared better than the others in certain instances and worse in others, supporting the hypothesis that a dynamic, adaptive resource-management infrastructure is needed for deploying different policies and task granularities in scalable distributed computing.
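    As a sketch of what "dynamic policy deployment" can look like at the code level, the following C example (our own illustration, not HPX's actual API) routes scheduling decisions through a pluggable policy table, so that a FIFO or LIFO policy can be swapped per application or phase.

```c
#include <stdio.h>

/* Minimal pluggable scheduling policy: the runtime calls through a
 * policy table, so a different policy can be deployed at run time. */
typedef struct { int tasks[16]; int head, tail; } queue;

typedef struct {
    const char *name;
    int (*pick)(queue *q);   /* choose the next task to run */
} sched_policy;

static int pick_fifo(queue *q) { return q->tasks[q->head++]; }
static int pick_lifo(queue *q) { return q->tasks[--q->tail]; }

static const sched_policy FIFO = { "fifo", pick_fifo };
static const sched_policy LIFO = { "lifo", pick_lifo };

static void run(queue *q, const sched_policy *p) {
    printf("%s:", p->name);
    while (q->head < q->tail) printf(" t%d", p->pick(q));
    printf("\n");
}

int main(void) {
    queue q1 = { {1, 2, 3, 4}, 0, 4 }, q2 = q1;
    run(&q1, &FIFO);   /* t1 t2 t3 t4 */
    run(&q2, &LIFO);   /* t4 t3 t2 t1 */
    return 0;
}
```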

    Instruction-set architecture synthesis for VLIW processors

    Get PDF

    3rd Many-core Applications Research Community (MARC) Symposium. (KIT Scientific Reports ; 7598)

    Get PDF
    This manuscript includes recent scientific work regarding the Intel Single-chip Cloud Computer and describes novel approaches to programming and run-time organization.

    Interaction-aware analysis and optimization of real-time application and operating system

    Get PDF
    Mechanical and electronic automation was a key component of the technological advances of the last two hundred years. With the use of special-purpose machines, manual labor was replaced by mechanical motion, leaving workers with the operation of these machines, before this task, too, was conquered by embedded control systems. With the advances of general-purpose computing, the development of these control systems shifted more and more from a problem-specific approach to a one-size-fits-all mentality, as the trade-off between per-instance overheads and development costs favored flexible and reusable implementations. However, with a scaling factor of thousands, if not millions, of deployed devices, overheads and inefficiencies accumulate, calling for a higher degree of specialization. In the area of real-time operating systems (RTOSs), which form the base layer for many of these computerized control systems, we deploy far more flexibility than the applications running on top of them actually require. Since only the solution, but not the problem, became less specific to the control problem at hand, we have the chance to cut away inefficiencies, improve system-analysis results, and optimize resource consumption. However, such tailoring will only be favorable if it can be performed without much developer interaction and in an automated fashion. Here, real-time systems are a good starting point, since we already have to have a large degree of static knowledge in order to guarantee their timeliness. Until now, this static nature has not been exploited to its full extent, and optimization potential is left unused. The requirements of a system with regard to the RTOS manifest in the interactions between the application and the kernel. Threads request resources from the RTOS, which in return determines and enforces a scheduling order that ensures the timely completion of all necessary computations. Since the RTOS runs only in exception contexts, its reaction to requests from the application (or from the environment) is its defining feature. In this thesis, I grasp these interactions, and thereby the required RTOS semantics, in a control-flow-sensitive fashion. Extracted automatically, this knowledge about the reciprocal influence allows me to fit the implementation of a system more closely to its actual requirements. The result is a system that is a special-purpose system not only in its usage, but also in its implementation and in the guarantees it provides. In the development of my approach, it became clear that the focus on these interactions is highly fruitful not only for the optimization of a system, but also for its end-to-end analysis. Therefore, this thesis not only provides methods to reduce the kernel-execution overhead and a system's memory consumption, but also includes methods to calculate tighter response-time bounds and to give guarantees about the correct behavior of the kernel. All these contributions are enabled by the proposed interaction-aware methodology that takes the whole system, RTOS and application, into account. With this thesis, I show that a control-flow-sensitive whole-system view on the interactions is feasible and highly rewarding. With this approach, we can overcome many inefficiencies that arise from analyses that focus on individual system components in isolation. Furthermore, the interaction-aware methods stay close to the actual implementation and are therefore able to consider the behavioral patterns of the finally deployed real-time computing system.
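    A minimal C sketch of the tailoring idea (our own illustration, not code generated by the thesis's toolchain): when static analysis of the application/kernel interactions proves that only a few successor states are possible at a given system call, the generic scheduler's scan over all threads collapses into a precomputed decision.

```c
#include <stdio.h>

/* Generic path: scan the ready list of N threads for the highest priority. */
static int schedule_generic(const int *prio, const int *ready, int n) {
    int best = -1;
    for (int t = 0; t < n; t++)
        if (ready[t] && (best < 0 || prio[t] > prio[best])) best = t;
    return best;
}

/* Specialized path for one interaction point: suppose analysis has
 * proven that after this semaphore-post either thread 2 preempts or
 * the caller (thread 0) continues. The scan disappears. */
static int schedule_at_post(int thread2_ready) {
    return thread2_ready ? 2 : 0;
}

int main(void) {
    int prio[] = { 1, 2, 3 }, ready[] = { 1, 0, 1 };
    printf("generic:     run thread %d\n", schedule_generic(prio, ready, 3));
    printf("specialized: run thread %d\n", schedule_at_post(ready[2]));
    return 0;
}
```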

    Aspects of Code Generation and Data Transfer Techniques for Modern Parallel Architectures

    Get PDF
    In the field of processor architectures, the focus of new developments has shifted from ever-higher clock frequencies to ever more cores on a chip. A high core count makes it possible to offer cores with different levels of performance, and even dedicated cores with special instruction sets. Developing for such heterogeneous platforms is challenging and requires corresponding support from development tools, such as compilers. Besides their heterogeneous core structure, there is a second dimension that makes development for such architectures demanding: their memory structure. Maintaining global cache coherence makes it harder to reach high core counts. Hardware-based cache-coherence protocols either scale poorly, or are complicated and lead to problems in execution time and energy efficiency. A radical solution to this problem is to abolish global cache coherence. However, it is difficult to map existing programming models efficiently onto such a hardware architecture with weak guarantees. The first part of this dissertation deals with data-transfer techniques for non-cache-coherent architectures with shared memory. These architectures offer a shared physical address space but do not implement hardware-based coherence between all caches in the system. Logically partitioning the shared memory makes it possible to program such a platform safely. In general, this creates the need to copy data between memory partitions. We study compilation for invasive architectures, a family of non-cache-coherent many-core architectures. We consider the efficient implementation of data transfers of both simple and complex data structures on invasive architectures. In particular, we propose a novel technique for copying complex pointer-based data structures that works without serialization. To this end, we generalize the object-cloning approach with compiler-directed automatic software-based coherence so that it also works in the context of non-coherent caches. We present implementations of several data-transfer techniques within an existing compiler and its runtime system. We conduct a detailed evaluation of these implementations on an FPGA-based prototype of an invasive architecture. Finally, we propose adding hardware support for range-based cache operations, and we describe and evaluate possible implementations and their costs. The second part of this dissertation deals with accelerating the shuffle code that arises during register allocation by using permutation instructions. The task of register allocation during compilation is to map program variables to machine registers. During register allocation, the compiler generates shuffle code, consisting of copy and swap instructions, to transfer values between registers. Depending on the quality of the register allocation and the number of available registers, a large amount of shuffle code may be generated. We propose to accelerate the execution of shuffle code with the help of novel permutation instructions that arbitrarily permute the contents of a small number of registers within one clock cycle. To demonstrate the feasibility of this idea, we first extend an existing RISC instruction format with permutation instructions. We then describe how the proposed permutation instructions can be implemented in an existing RISC architecture. We then develop two code-generation approaches that exploit the permutation instructions to speed up shuffle code: a fast heuristic and an optimal approach based on dynamic programming. We prove quality and correctness properties of both approaches and show the optimality of the second. We then implement both code-generation approaches in a compiler and examine and compare their code quality in detail using standardized benchmarks. We first measure the exact number of dynamically executed instructions, which we then validate by measuring program runtimes on an FPGA-based prototype implementation of the RISC architecture extended with permutation instructions. Finally, we argue that permutation instructions can be implemented with little effort on modern out-of-order processor architectures that already support register renaming.
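    The shuffle-code argument can be made concrete with a small C sketch (our own example, not the thesis's algorithms): a register permutation decomposes into cycles, a cycle of length k costs k-1 swap instructions, while a single permutation instruction would perform the whole permutation at once.

```c
#include <stdio.h>

/* Shuffle code realizes a register permutation. Emitted as swaps,
 * each cycle of length k costs k-1 swap instructions; a permutation
 * instruction applies the whole permutation in one shot. */
static int swaps_needed(const int *perm, int n) {
    int visited[16] = { 0 }, swaps = 0;
    for (int i = 0; i < n; i++) {
        if (visited[i] || perm[i] == i) continue;
        int len = 0;
        for (int j = i; !visited[j]; j = perm[j]) { visited[j] = 1; len++; }
        swaps += len - 1;
    }
    return swaps;
}

int main(void) {
    /* perm[i]: register whose value must end up in register i.
     * Here: the 3-cycle r0->r1->r2->r0 and the swap r3<->r4. */
    int perm[] = { 1, 2, 0, 4, 3 };
    printf("swap-based shuffle code: %d swaps\n", swaps_needed(perm, 5)); /* 3 */
    printf("with a permutation instruction: 1 instruction\n");
    return 0;
}
```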

    Compilation of Parametric Dataflow Applications for MPSoCs Dedicated to Software-Defined Radio

    Get PDF
    The emergence of software-defined radio follows the rapidly evolving telecommunication domain. The requirements in both performance and dynamicity have engendered software-defined-radio-dedicated MPSoCs. The specialization of these MPSoCs makes them difficult to program and verify. Dataflow models of computation have been suggested as a way to mitigate this complexity. Moreover, the need for flexible yet verifiable models has led to the development of new parametric dataflow models. In this thesis, I study the compilation of parametric dataflow applications targeting software-defined-radio platforms. After a hardware and software state of the art of this field, I propose a new refinement of dataflow scheduling and outline its application to the verification of buffer sizes. Then, I introduce a new high-level format to define dataflow actors and graphs, together with the associated compilation flow. I apply these concepts to optimized code generation for the Magali software-defined-radio platform. The compilation of parts of the LTE protocol is used to evaluate the performance of the proposed compilation flow.
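    To illustrate the buffer-size verification mentioned above on the simplest possible case, the C sketch below (our own toy model, not the thesis's parametric analysis) simulates one iteration of a two-actor synchronous dataflow edge where A produces P tokens per firing and B consumes C, and reports the peak buffer occupancy under a consume-eagerly schedule.

```c
#include <stdio.h>

/* Single SDF edge A --(P produced, C consumed)--> B.
 * Firing B eagerly whenever enough tokens are available yields a
 * buffer bound for one graph iteration. */
static int buffer_bound(int P, int C, int firings_of_A) {
    int tokens = 0, peak = 0;
    for (int i = 0; i < firings_of_A; i++) {
        tokens += P;                       /* fire A */
        if (tokens > peak) peak = tokens;
        while (tokens >= C) tokens -= C;   /* fire B while possible */
    }
    return peak;
}

int main(void) {
    /* P = 3, C = 2: repetition vector (2, 3), since 2*3 = 3*2 tokens. */
    printf("buffer bound: %d tokens\n", buffer_bound(3, 2, 2));  /* 4 */
    return 0;
}
```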