77 research outputs found

    Worst-Case Communication Overhead in a Many-Core based Shared-Memory Model

    Get PDF
    National audienceWith emerging many-core architectures, using on-chip shared memories is an interesting approach because it provides high bandwidth and high throughput data exchange. Such a feature is usually implemented as a multi-bus multi-banked memory. Since predicting timing behavior is key to efficient design and verification of embedded real-time systems, the question that arises is how to evaluate the access time for one memory access of a given task while others may concurrently access the same memory-bank at t the same time. In this paper, we give the answers for a subset of streaming applications modeled like CSDF Model of Computation and implemented in Kalray’s MPPA chip

    Ordonnancement hybride des applications flots de données sur des systèmes embarqués multi-coeurs

    Get PDF
    Les systèmes embarqués sont de plus en plus présents dans l'industrie comme dans la vie quotidienne. Une grande partie de ces systèmes comprend des applications effectuant du traitement intensif des données: elles utilisent de nombreux filtres numériques, où les opérations sur les données sont répétitives et ont un contrôle limité. Les graphes "flots de données", grâce à leur déterminisme fonctionnel inhérent, sont très répandus pour modéliser les systèmes embarqués connus sous le nom de "data-driven". L'ordonnancement statique et périodique des graphes flot de données a été largement étudié, surtout pour deux modèles particuliers: SDF et CSDF. Dans cette thèse, on s'intéresse plus particulièrement à l'ordonnancement périodique des graphes CSDF. Le problème consiste à identifier des séquences périodiques infinies d'actionnement des acteurs qui aboutissent à des exécutions complètes à buffers bornés. L'objectif est de pouvoir aborder ce problème sous des angles différents : maximisation de débit, minimisation de la latence et minimisation de la capacité des buffers. La plupart des travaux existants proposent des solutions pour l'optimisation du débit et négligent le problème d'optimisation de la latence et propose même dans certains cas des ordonnancements qui ont un impact négatif sur elle afin de conserver les propriétés de périodicité. On propose dans cette thèse un ordonnancement hybride, nommé Self-Timed Périodique (STP), qui peut conserver les propriétés d'un ordonnancement périodique et à la fois améliorer considérablement sa performance en terme de latence.One of the most important aspects of parallel computing is its close relation to the underlying hardware and programming models. In this PhD thesis, we take dataflow as the basic model of computation, as it fits the streaming application domain. Cyclo-Static Dataflow (CSDF) is particularly interesting because this variant is one of the most expressive dataflow models while still being analyzable at design time. Describing the system at higher levels of abstraction is not sufficient, e.g. dataflow have no direct means to optimize communication channels generally based on shared buffers. Therefore, we need to link the dataflow MoCs used for performance analysis of the programs, the real time task models used for timing analysis and the low-level model used to derive communication times. This thesis proposes a design flow that meets these challenges, while enabling features such as temporal isolation and taking into account other challenges such as predictability and ease of validation. To this end, we propose a new scheduling policy noted Self-Timed Periodic (STP), which is an execution model combining Self-Timed Scheduling (STS) with periodic scheduling. In STP scheduling, actors are no longer strictly periodic but self-timed assigned to periodic levels: the period of each actor under periodic scheduling is replaced by its worst-case execution time. Then, STP retains some of the performance and flexibility of self-timed schedule, in which execution times of actors need only be estimates, and at the same time makes use of the fact that with a periodic schedule we can derive a tight estimation of the required performance metrics

    Efficient Adaptive Hard Real-time Multi-processor Systems

    Get PDF
    Modern computing systems are based on multi-processor systems, i.e. multiple cores on the same chip. Hard real-time systems are required to perform particular tasks within certain amount of time; failure to do so characterises an unaccepted behavior. Hard real-time systems are found in safety-critical applications, e.g. airbag control software, flight control software, etc. In safety-critical applications, failure to meet the real-time constraints can have catastrophic effects. The safe and, at the same time, efficient deployment of applications, with hard real-time constraints, on multi-processors is a challenging task. Scheduling methods and Models of Computation, that provide safe deployments, require a realistic estimation of the Worst-Case Execution Time (WCET) of tasks. The simultaneous access of shared resources by parallel tasks, causes interference delays due to hardware arbitration. Interference delays can be accounted for, with the pessimistic assumption that all possible interference can happen. The resulting schedules would be exceedingly conservative, thus the benefits of multi-processor would be significantly negated. Producing less pessimistic schedules is challenging due to the inter-dependency between WCET estimation and deployment optimisation. Accurate estimation of interference delays -and thus estimation of task WCET- depends on the way an application is deployed; deployment is an optimisation problem that depends on the estimation of task WCET. Another efficiency gap, which is of consequence in several systems (e.g. airbag control), stems from the fact that rarely tasks execute with their WCET. Safe runtime adaptation based on the Actual Execution Times, can yield additional improvements in terms of latency (more responsive systems). To achieve efficiency and retain adaptability, we propose that interference analysis should be coupled with the deployment process. The proposed interference analysis method estimates the possible amount of interference, based on an architecture and an application model. As more information is provided, such as scheduling, memory mapping, etc, the per-task interference estimation becomes more accurate. Thus, the method computes interference-sensitive WCET estimations (isWCET). Based on the isWCET method, we propose a method to break the inter-dependency between WCET estimation and deployment optimisation. Initially, the isWCETs are over-approximated, by assuming worst-case interference, and a safe deployment is derived. Subsequently, the proposed method computes accurate isWCETs by spatio-temporal exclusion, i.e. excluding interferences from non-overlapping tasks that share resources (space). Based on accurate isWCETs, the deployment solution is improved to provide better latency guarantees. We also propose a distributed runtime adaptation technique, that aims to improve run-time latency. Using isWCET estimations restricts the possible adaptations, as an adaptation might increase the interference and violate the safety guarantees. The proposed technique introduces statically scheduling dependencies between tasks that prevent additional interference. At runtime, a self-timed scheduling policy that respects these dependencies, is applied, proven to be safe, and with minimal overhead. Experimental evaluation on Kalray MPPA-256 shows that our methods improve isWCET up to 36%, guaranteed latency up to 46%, runtime performance up to 42%, with a consolidated performance gain of 50%

    Bridging the Gap between Application and Solid-State-Drives

    Get PDF
    Data storage is one of the important and often critical parts of the computing system in terms of performance, cost, reliability, and energy. Numerous new memory technologies, such as NAND flash, phase change memory (PCM), magnetic RAM (STT-RAM) and Memristor, have emerged recently. Many of them have already entered the production system. Traditional storage optimization and caching algorithms are far from optimal because storage I/Os do not show simple locality. To provide optimal storage we need accurate predictions of I/O behavior. However, the workloads are increasingly dynamic and diverse, making the long and short time I/O prediction challenge. Because of the evolution of the storage technologies and the increasing diversity of workloads, the storage software is becoming more and more complex. For example, Flash Translation Layer (FTL) is added for NAND-flash based Solid State Disks (NAND-SSDs). However, it introduces overhead such as address translation delay and garbage collection costs. There are many recent studies aim to address the overhead. Unfortunately, there is no one-size-fits-all solution due to the variety of workloads. Despite rapidly evolving in storage technologies, the increasing heterogeneity and diversity in machines and workloads coupled with the continued data explosion exacerbate the gap between computing and storage speeds. In this dissertation, we improve the data storage performance from both top-down and bottom-up approach. First, we will investigate exposing the storage level parallelism so that applications can avoid I/O contentions and workloads skew when scheduling the jobs. Second, we will study how architecture aware task scheduling can improve the performance of the application when PCM based NVRAM are equipped. Third, we will develop an I/O correlation aware flash translation layer for NAND-flash based Solid State Disks. Fourth, we will build a DRAM-based correlation aware FTL emulator and study the performance in various filesystems

    PROFILE- AND INSTRUMENTATION- DRIVEN METHODS FOR EMBEDDED SIGNAL PROCESSING

    Get PDF
    Modern embedded systems for digital signal processing (DSP) run increasingly sophisticated applications that require expansive performance resources, while simultaneously requiring better power utilization to prolong battery-life. Achieving such conflicting objectives requires innovative software/hardware design space exploration spanning a wide-array of techniques and technologies that offer trade-offs among performance, cost, power utilization, and overall system design complexity. To save on non-recurring engineering (NRE) costs and in order to meet shorter time-to-market requirements, designers are increasingly using an iterative design cycle and adopting model-based computer-aided design (CAD) tools to facilitate analysis, debugging, profiling, and design optimization. In this dissertation, we present several profile- and instrumentation-based techniques that facilitate design and maintenance of embedded signal processing systems: 1. We propose and develop a novel, translation lookaside buffer (TLB) preloading technique. This technique, called context-aware TLB preloading (CTP), uses a synergistic relationship between the (1) compiler for application specific analysis of a task's context, and (2) operating system (OS), for run-time introspection of the context and efficient identification of TLB entries for current and future usage. CTP works by (1) identifying application hotspots using compiler-enabled (or manual) profiling, and (2) exploiting well-understood memory access patterns, typical in signal processing applications, to preload the TLB at context switch time. The benefits of CTP in eliminating inter-task TLB interference and preemptively allocating TLB entries during context-switch are demonstrated through extensive experimental results with signal processing kernels. 2. We develop an instrumentation-driven approach to facilitate the conversion of legacy systems, not designed as dataflow-based applications, to dataflow semantics by automatically identifying the behavior of the core actors as instances of well-known dataflow models. This enables the application of powerful dataflow-based analysis and optimization methods to systems to which these methods have previously been unavailable. We introduce a generic method for instrumenting dataflow graphs that can be used to profile and analyze actors, and we use this instrumentation facility to instrument legacy designs being converted and then automatically detect the dataflow models of the core functions. We also present an iterative actor partitioning process that can be used to partition complex actors into simpler entities that are more prone to analysis. We demonstrate the utility of our proposed new instrumentation-driven dataflow approach with several DSP-based case studies. 3. We extend the instrumentation technique discussed in (2) to introduce a novel tool for model-based design validation called dataflow validation framework (DVF). DVF addresses the problem of ensuring consistency between (1) dataflow properties that are declared or otherwise assumed as part of dataflow-based application models, and (2) the dataflow behavior that is exhibited by implementations that are derived from the models. The ability of DVF to identify disparities between an application's formal dataflow representation and its implementation is demonstrated through several signal processing application development case studies

    An Efficient NoC-based Framework To Improve Dataflow Thread Management At Runtime

    Get PDF
    This doctoral thesis focuses on how the application threads that are based on dataflow execution model can be managed at Network-on-Chip (NoC) level. The roots of the dataflow execution model date back to the early 1970’s. Applications adhering to such program execution model follow a simple producer-consumer communication scheme for synchronising parallel thread related activities. In dataflow execution environment, a thread can run if and only if all its required inputs are available. Applications running on a large and complex computing environment can significantly benefit from the adoption of dataflow model. In the first part of the thesis, the work is focused on the thread distribution mechanism. It has been shown that how a scalable hash-based thread distribution mechanism can be implemented at the router level with low overheads. To enhance the support further, a tool to monitor the dataflow threads’ status and a simple, functional model is also incorporated into the design. Next, a software defined NoC has been proposed to manage the distribution of dataflow threads by exploiting its reconfigurability. The second part of this work is focused more on NoC microarchitecture level. Traditional 2D-mesh topology is combined with a standard ring, to understand how such hybrid network topology can outperform the traditional topology (such as 2D-mesh). Finally, a mixed-integer linear programming based analytical model has been proposed to verify if the application threads mapped on to the free cores is optimal or not. The proposed mathematical model can be used as a yardstick to verify the solution quality of the newly developed mapping policy. It is not trivial to provide a complete low-level framework for dataflow thread execution for better resource and power management. However, this work could be considered as a primary framework to which improvements could be carried out

    CONTINUOUS OPTIMIZATION OF DISTRIBUTED STREAM PROGRAMS

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    High level design and control of adaptive multiprocessor system-on-chips

    Get PDF
    The design of modern embedded systems is getting more and more complex, as more func- tionality is integrated into these systems. At the same time, in order to meet the compu- tational requirements while keeping a low level power consumption, MPSoCs have emerged as the main solutions for such embedded systems. Furthermore, embedded systems are be- coming more and more adaptive, as the adaptivity can bring a number of benefits, such as software flexibility and energy efficiency. This thesis targets the safe design of such adaptive MPSoCs. First, each system configuration must be analyzed concerning its functional and non- functional properties. We present an abstract design and analysis framework, which allows for faster and cost-effective implementation decisions. This framework is intended as an intermediate reasoning support for system level software/hardware co-design environments. It can prune the design space at its largest, and identify candidate design solutions in a fast and efficient way. In the framework, we use an abstract clock-based encoding to model system behaviors. Different mapping and scheduling scenarios of applications on MPSoCs are analyzed via clock traces representing system simulations. Among properties of interest are functional behavioral correctness, temporal performance and energy consumption. Second, the reconfiguration management of adaptive MPSoCs must be addressed. We are specially interested in MPSoCs implemented on reconfigurable hardware architectures (i.e., FPGA fabrics), which provide a good flexibility and computational efficiency for adap- tive MPSoCs. We propose a general design framework based on the discrete controller syn- thesis (DCS) technique to address this issue. The main advantage of this technique is that it allows the automatic controller synthesis w.r.t. a given specification of control objectives. In the framework, the system reconfiguration behavior is modeled in terms of synchronous parallel automata. The reconfiguration management computation problem w.r.t. multiple objectives regarding e.g., resource usages, performance and power consumption is encoded as a DCS problem. The existing BZR programming language and Sigali tool are employed to perform DCS and generate a controller that satisfies the system requirements. Finally, we investigate two different ways of combining the two proposed design frame- works for adaptive MPSoCs. Firstly, they are combined to construct a complete design flow for adaptive MPSoCs. Secondly, they are combined to present how the designed run-time manager by the second framework can be integrated into the first framework so that high level simulations can be performed to assess the run-time manager.La conception de systèmes embarqués modernes est de plus en plus complexe, car plus de fonctionnalités sont intégrées dans ces systèmes. En même temps, afin de répondre aux exigences de calcul tout en conservant une consommation d'énergie de faible niveau, MPSoCs sont apparus comme les principales solutions pour tels systèmes embarqués. En outre, les systèmes embarqués sont de plus en plus adaptatifs, comme l’adaptabilité peut apporter un certain nombre d'avantages, tels que la flexibilité du logiciel et l'efficacité énergétique. Cette thèse vise la conception sécuritaire de ces MPSoCs adaptatifs. Tout d'abord, chaque configuration de système doit être analysée en ce qui concerne ses propriétés fonctionnelles et non fonctionnelles. Nous présentons un cadre abstraite de conception et d’analyse qui permet des décisions d’implémentation plus rapide et plus rentable. Ce cadre est conçu comme un support de raisonnement intermédiaire pour les environnements de co-conception de logiciel / matériel au niveau de système. Il peut élaguer l'espace de conception à sa plus grande portée, et identifier les candidats de solutions de conception de manière rapide et efficace. Dans ce cadre, nous utilisons un codage basé sur l’horloge abstrait pour modéliser les comportements du système. Différents scénarios d'applications de mapping et de planification sur MPSoCs sont analysés via les traces d'horloge qui représentent les simulations du système. Les propriétés d'intérêt sont l’exactitude du comportement fonctionnel, la performance temporelle et la consommation d'énergie. Deuxièmement, la gestion de la reconfiguration de MPSoCs adaptatifs doit être abordée. Nous sommes particulièrement intéressés par les MPSoCs implémentés sur des architectures reconfigurables de hardware (ex. FPGA tissus) qui offrent une bonne flexibilité et une efficacité de calcul pour les MPSoCs adaptatifs. Nous proposons un cadre général de conception basésur la technique de la synthèse de contrôleurs discrets (SCD) pour résoudre ce problème. L’avantage principal de cette technique est qu'elle permet une synthèse d'un contrôleur automatique vis-à-vis d’une spécification donnée des objectifs de contrôle. Dans ce cadre, le comportement de reconfiguration du système est modélisé en termes d'automates synchrones en parallèle. Le problème de calcul de la gestion reconfiguration vis-à-vis de multiples objectifs concernant, par exemple, les usages des ressources, la performance et la consommation d’énergie est codé comme un problème de SCD . Le langage de programmation BZR existant et l’outil Sigali sont employés pour effectuer SCD et générer un contrôleur qui satisfait aux exigences du système. Finalement, nous étudions deux façons différentes de combiner les deux cadres de conception proposées pour MPSoCs adaptatifs. Tout d'abord, ils sont combinés pour construire un flot de conception complet pour MPSoCs adaptatifs. Deuxièmement, ils sont combinés pour présenter la façon dont le gestionnaire d'exécution conçu dans le second cadre peut être intégré dans le premier cadre de sorte que les simulations de haut niveau peuvent être effectuées pour évaluer le gestionnaire d'exécution
    corecore