412 research outputs found

    Dynamic Tile Free Scheduling for Code with Acyclic Inter-Tile Dependence Graphs

    Get PDF
    Free scheduling is a task ordering technique under which instructions are executedas soon as their operands become available. Coarsening the grain ofcomputations under the free schedule, by means of using groups of loop neststatement instances (tiles) in place of single statement instances, increases thelocality of data accesses and reduces the number of synchronization events, andas a consequence improves program performance. The paper presents an approachfor code generation allowing for the free schedule for tiles of arbitrarilynested affine loops at run-time. The scope of the applicability of the introducedalgorithms is limited to tiled loop nests whose inter-tile dependence graphs arecycle-free. The approach is based on the Polyhedral Model. Results of experimentswith the PolyBench benchmark suite, demonstrating significant tiledcode speed-up, are discussed

    Mapping and Scheduling of Directed Acyclic Graphs on An FPFA Tile

    Get PDF
    An architecture for a hand-held multimedia device requires components that are energy-efficient, flexible, and provide high performance. In the CHAMELEON [4] project we develop a coarse grained reconfigurable device for DSP-like algorithms, the so-called Field Programmable Function Array (FPFA). The FPFA devices are reminiscent to FPGAs, but with a matrix of Processing Parts (PP) instead of CLBs. The design of the FPFA focuses on: (1) Keeping each PP small to maximize the number of PPs that can fit on a chip; (2) providing sufficient flexibility; (3) Low energy consumption; (4) Exploiting the maximum amount of parallelism; (5) A strong support tool for FPFA-based applications. The challenge in providing compiler support for the FPFA-based design stems from the flexibility of the FPFA structure. If we do not use the characteristics of the FPFA structure properly, the advantages of an FPFA may become its disadvantages. The GECKO1project focuses on this problem. In this paper, we present a mapping and scheduling scheme for applications running on one FPFA tile. Applications are written in C and C code is translated to a Directed Acyclic Graphs (DAG) [4]. This scheme can map a DAG directly onto the reconfigurable PPs of an FPFA tile. It tries to achieve low power consumption by exploiting locality of reference and high performance by exploiting maximum parallelism

    Analyzable dataflow executions with adaptive redundancy

    Get PDF
    Increasing performance requirements in the embedded systems domain have encouraged a drift from singlecore to multicore processors, and thus multicore processors are widely used in embedded systems today. Cars are an example for complex embedded systems in which the use of multicore processors is continuously increasing. A major reason for this is to consolidate different software components on one chip and thus reduce the number of electronic control units. However, the de facto standard in the automotive industry, AUTOSAR (AUTomotive Open System ARchitecture), was originally designed for singlecore processors. Although basic support for multicore processors was added, more complex architectures are currently not compatible with the software stack. Regarding the software components running on the ECUS of modern cars, requirements are diverse. On the one hand, there are safety-critical tasks, like the airbag control, anti-lock braking system, electronic stability control and emergency brake assist, and on the other hand, tasks which do not have any safety-related requirements at all, for example tasks controlling the infotainment system. Trends like autonomous driving lead to even more demanding tasks in the system since such tasks are both safety-critical and data-intensive. As embedded applications, like those in the automotive domain, become more complex, new approaches are necessary. Data-intensive tasks are usually tackled with large-scale computing frameworks. In this thesis, some major concepts of such frameworks are transferred to the high-performance embedded systems domain. For this purpose, the thesis describes a runtime environment (RTE) that is suitable for different kinds of multi- and manycore hardware architectures. The RTE follows a dataflow execution model based on directed acyclic graphs (DAGs). Graphs are divided into sections which are scheduled separately. For each section, the RTE uses a DAG scheduling heuristic to compute multiple schedules covering different redundancy configurations. This allows the RTE to dynamically change the redundancy of parts of the graph at runtime despite the use of fixed schedules. Alternatively, the RTE also provides an online scheduler. To specify suitable graphs, the RTE also provides a programming model which shares similarities with common large-scale computing frameworks, for example Apache Spark. Using this programming model, three common distributed algorithms, namely Cannon's algorithm, the Cooley-Tukey algorithm and bitonic sort, were implemented. With these three programs, the performance of the RTE was evaluated for a variety of configurations on two different hardware architectures. The results show that the proposed RTE is able to reach the performance of established parallel computation frameworks and that for suitable graphs with reasonable sectionings the negative influence on the runtime is either small or non-existent.Aufgrund steigender Anforderungen an die Leistungsfähigkeit von eingebetteten Systemen finden Mehrkernprozessoren mittlerweile auch in eingebetteten Systemen Verwendung. Autos sind ein Beispiel für eingebettete Systeme, in denen die Verbreitung von Mehrkernprozessoren kontinuierlich zunimmt. Ein Hauptgrund ist, dass es dadurch möglich wird, mehrere Applikationen, für die ursprünglich mehrere Electronic Control Units (ECUs) notwendig waren, auf ein und demselben Chip auszuführen und dadurch die Anzahl der ECUs im Gesamtsystem zu verringern. Der De-facto-Standard AUTOSAR (AUTomotive Open System ARchitecture) wurde jedoch ursprünglich nur im Hinblick auf Einkernprozessoren entworfen und, obwohl der Softwarestack um grundlegende Unterstützung für Mehrkernprozessoren erweitert wurde, sind komplexere Architekturen nicht damit kompatibel. Die Anforderungen der Softwarekomponenten von modernen Autos sind vielfältig. Einerseits gibt es hochgradig sicherheitskritische Tasks, die beispielsweise die Airbags, das Antiblockiersystem, die Fahrdynamikregelung oder den Notbremsassistenten steuern und andererseits Tasks, die keinerlei sicherheitskritische Anforderungen aufweisen, wie zum Beispiel Tasks zur Steuerung des Infotainment-Systems. Neue Trends wie autonomes Fahren führen zu weiteren anspruchsvollen Tasks, die sowohl hohe Leistungs- als auch Sicherheitsanforderungen aufweisen. Da die Komplexität eingebetteter Anwendungen, beispielsweise im Automobilbereich, stetig zunimmt, sind neue Ansätze erforderlich. Für komplexe, datenintensive Aufgaben werden in der Regel Cluster-Computing-Frameworks eingesetzt. In dieser Arbeit werden Konzepte solcher Frameworks auf den Bereich der eingebetteten Systeme übertragen. Dazu beschreibt die Arbeit eine Laufzeitumgebung (RTE) für eingebettete Mehrkernarchitekturen. Die RTE folgt einem Datenfluss-Ausführungsmodell, das auf gerichteten azyklischen Graphen basiert. Graphen können in Abschnitte eingeteilt werden, für welche separat mehrere unterschiedlich redundante Schedules mit Hilfe einer Scheduling-Heuristik berechnet werden. Dieser Ansatz erlaubt es, die Redundanz von Teilen der Anwendung zur Laufzeit zu verändern. Alternativ unterstützt die RTE auch Scheduling zur Laufzeit. Zur Erzeugung von Graphen stellt die RTE ein Programmiermodell bereit, welches sich an etablierten Frameworks, insbesondere Apache Spark, orientiert. Damit wurden drei Beispielanwendungen implementiert, die auf gängigen Algorithmen basieren. Konkret handelt es sich um Cannon's Algorithmus, den Cooley-Tukey-Algorithmus und bitonisches Sortieren. Um die Leistungsfähigkeit der RTE zu ermitteln, wurden diese drei Anwendungen mehrfach mit verschiedenen Konfigurationen auf zwei Hardware-Architekturen ausgeführt. Die Ergebnisse zeigen, dass die RTE in ihrer Leistungsfähigkeit mit etablierten Systemen vergleichbar ist und die Laufzeit bei einer sinnvollen Graphaufteilung im besten Fall nur geringfügig beeinflusst wird

    Predictable multi-processor system on chip design for multimedia applications

    Get PDF
    The design of multimedia systems has become increasingly complex due to consumer requirements. Consumers demand the functionalities offered by a huge desktop from these systems. Many of these systems are mobile. Therefore, power consumption and size of these devices should be small. These systems are increasingly becoming multi-processor based (MPSoCs) for the reasons of power and performance. Applications execute on these systems in different combinations also known as use-cases. Applications may have different performance requirements in each use-case. Currently, verification of all these use-cases takes bulk of the design effort. There is a need for analysis based techniques so that the platforms have a predictable behaviour and in turn provide guarantees on performance without expending precious man hours on verification. In this dissertation, techniques and architectures have been developed to design and manage these multi-processor based systems efficiently. The dissertation presents predictable architectural components for MPSoCs, a Predictable MPSoC design strategy, automatic platform synthesis tool, a run-time system and an MPSoC simulation technique. The introduction of predictability helps in rapid design of MPSoC platforms. Chapter 1 of the thesis studies the trends in modern multimedia applications and processor architectures. The chapter further highlights the problems in the design of MPSoC platforms and emphasizes the need of predictable design techniques. Predictable design techniques require predictable application and architectural components. The chapter further elaborates on Synchronous Data Flow Graphs which are used to model the applications throughout this thesis. The chapter presents the architecture template used in this thesis and enlists the contributions of the thesis. One of the contributions of this thesis is the design of a predictable component called communication assist. Chapter 2 of the thesis describes the architecture of this communication assist. The communication assist presented in this thesis not only decouples the communication from computation but also provides timing guarantees. Based on this communication assist, an MPSoC platform generation technique has been presented that can design MPSoC platforms capable of satisfying the throughput constraints of multiple applications in all use-cases. The technique is presented in Chapter 3. The design strategy uses three simple steps for platform design. In the first step it finds the required number of processors. The second step minimizes the communication interconnect between the processors and the third step minimizes the communication memory requirement of the platform. Further in Chapter 4, a tool has been developed to generate CA-based platforms for FPGAs. The output of this tool can be used to synthesize platforms on real hardware with the help of FPGA synthesis tools. The applications executing on these platforms often exhibit dynamism e.g. variation in task execution times and change in application throughput requirements. Further, new applications may often be added by consumers at run-time. Resource managers have been presented in literature to handle such dynamic situations. However, the scalability of these resource managers becomes an issue with the increase in number of processors and applications. Chapter 5 presents distributed run-time resource management techniques. Two versions of distributed resource managers have been presented which are scalable with the number of applications and processors. MPSoC platforms for real-time applications are designed assuming worst-case task execution times. It is known that the difference between average-case and worst-case behaviour can be quite large. Therefore, knowing the average case performance is also important for the system designer, and software simulation is often employed to estimate this. However, simulation in software is slow and does not scale with the number of applications and processing elements. In Chapter 6, a fast and scalable simulation methodology is introduced that can simulate the execution of multiple applications on an MPSoC platform. It is based on parallel execution of SDF (Synchronous Data Flow) models of applications. The simulation methodology uses Parallel Discrete Event Simulation (PDES) primitives and it is termed as "Smart Conservative PDES". The methodology generates a parallel simulator which is synthesizable on FPGAs. The framework can also be used to model dynamic arbitration policies which are difficult to analyse using models. The generated platform is also useful in carrying out Design Space Exploration as shown in the thesis. Finally, Chapter 7 summarizes the main findings and (practical) implications of the studies described in previous chapters of this dissertation. Using the contributions mentioned in the thesis, a designer can design and implement predictable multiprocessor based systems capable of satisfying throughput constraints of multiple applications in given set of use-cases, and employ resource management strategies to deal with dynamism in the applications. The chapter also describes the main limitations of this dissertation and makes suggestions for future research

    Decoupling algorithms from schedules for easy optimization of image processing pipelines

    Get PDF
    Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm, with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism. We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high-performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions. Imaging pipelines are compositions of functions. Programmers separately specify scheduling strategies for the various functions composing the algorithm, which allows them to efficiently explore different optimizations without changing the algorithmic code. We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain specific language called Halide, and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations.National Science Foundation (U.S.) (Grant 0964004)National Science Foundation (U.S.) (Grant 0964218)National Science Foundation (U.S.) (Grant 0832997)United States. Dept. of Energy (Award DE-SC0005288)Cognex CorporationAdobe System

    MULTI-SCALE SCHEDULING TECHNIQUES FOR SIGNAL PROCESSING SYSTEMS

    Get PDF
    A variety of hardware platforms for signal processing has emerged, from distributed systems such as Wireless Sensor Networks (WSNs) to parallel systems such as Multicore Programmable Digital Signal Processors (PDSPs), Multicore General Purpose Processors (GPPs), and Graphics Processing Units (GPUs) to heterogeneous combinations of parallel and distributed devices. When a signal processing application is implemented on one of those platforms, the performance critically depends on the scheduling techniques, which in general allocate computation and communication resources for competing processing tasks in the application to optimize performance metrics such as power consumption, throughput, latency, and accuracy. Signal processing systems implemented on such platforms typically involve multiple levels of processing and communication hierarchy, such as network-level, chip-level, and processor-level in a structural context, and application-level, subsystem-level, component-level, and operation- or instruction-level in a behavioral context. In this thesis, we target scheduling issues that carefully address and integrate scheduling considerations at different levels of these structural and behavioral hierarchies. The core contributions of the thesis include the following. Considering both the network-level and chip-level, we have proposed an adaptive scheduling algorithm for wireless sensor networks (WSNs) designed for event detection. Our algorithm exploits discrepancies among the detection accuracy of individual sensors, which are derived from a collaborative training process, to allow each sensor to operate in a more energy efficient manner while the network satisfies given constraints on overall detection accuracy. Considering the chip-level and processor-level, we incorporated both temperature and process variations to develop new scheduling methods for throughput maximization on multicore processors. In particular, we studied how to process a large number of threads with high speed and without violating a given maximum temperature constraint. We targeted our methods to multicore processors in which the cores may operate at different frequencies and different levels of leakage. We develop speed selection and thread assignment schedulers based on the notion of a core's steady state temperature. Considering the application-level, component-level and operation-level, we developed a new dataflow based design flow within the targeted dataflow interchange format (TDIF) design tool. Our new multiprocessor system-on-chip (MPSoC)-oriented design flow, called TDIF-PPG, is geared towards analysis and mapping of embedded DSP applications on MPSoCs. An important feature of TDIF-PPG is its capability to integrate graph level parallelism and actor level parallelism into the application mapping process. Here, graph level parallelism is exposed by the dataflow graph application representation in TDIF, and actor level parallelism is modeled by a novel model for multiprocessor dataflow graph implementation that we call the Parallel Processing Group (PPG) model. Building on the contribution above, we formulated a new type of parallel task scheduling problem called Parallel Actor Scheduling (PAS) for chip-level MPSoC mapping of DSP systems that are represented as synchronous dataflow (SDF) graphs. In contrast to traditional SDF-based scheduling techniques, which focus on exploiting graph level (inter-actor) parallelism, the PAS problem targets the integrated exploitation of both intra- and inter-actor parallelism for platforms in which individual actors can be parallelized across multiple processing units. We address a special case of the PAS problem in which all of the actors in the DSP application or subsystem being optimized can be parallelized. For this special case, we develop and experimentally evaluate a two-phase scheduling framework with three work flows --- particle swarm optimization with a mixed integer programming formulation, particle swarm optimization with a simulated annealing engine, and particle swarm optimization with a fast heuristic based on list scheduling. Then, we extend our scheduling framework to support general PAS problem which considers the actors cannot be parallelized

    Spatial instruction scheduling for raw machines

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2002.Includes bibliographical references (leaves 89-91).Instruction scheduling on software exposed architectures, such as Raw, must be performed in both time and space. The complexity and variance of application scheduling regions dictates that the space-time scheduling task be divided into phases. Unfortunately, the interaction of phases presents a phase ordering problem. In this thesis, the structure of program scheduling regions is studied. The scheduling regions are shown to have varying characteristics that are too diverse for a single simple algorithm to cover. A new scheduling technique is proposed to cope with this diversity and minimize the phase ordering problem. First, rather than maintaining exact mappings of instructions to time and space, the internal state of the scheduler maintains probabilities for different assignments of instructions to time and space resources. Second, a set of small scheduling heuristics cooperatively iterate over the probabilistic assignments many times in order to minimize the effects of phase ordering. A simple spatial instruction scheduler for Raw machines based on this technique is implemented and shown to outperform existing spatial scheduling systems on average.by Shane Michael Swenson.M.Eng

    High Performance Embedded Computing

    Get PDF
    Nowadays, the prevalence of computing systems in our lives is so ubiquitous that we live in a cyber-physical world dominated by computer systems, from pacemakers to cars and airplanes. These systems demand for more computational performance to process large amounts of data from multiple data sources with guaranteed processing times. Actuating outside of the required timing bounds may cause the failure of the system, being vital for systems like planes, cars, business monitoring, e-trading, etc. High-Performance and Time-Predictable Embedded Computing presents recent advances in software architecture and tools to support such complex systems, enabling the design of embedded computing devices which are able to deliver high-performance whilst guaranteeing the application required timing bounds. Technical topics discussed in the book include: Parallel embedded platforms Programming models Mapping and scheduling of parallel computations Timing and schedulability analysis Runtimes and operating systemsThe work reflected in this book was done in the scope of the European project P SOCRATES, funded under the FP7 framework program of the European Commission. High-performance and time-predictable embedded computing is ideal for personnel in computer/communication/embedded industries as well as academic staff and master/research students in computer science, embedded systems, cyber-physical systems and internet-of-things
    corecore