7 research outputs found

    Minimizing Buffer Requirements under Rate-Optimal Schedule in Regular Dataflow Networks

    Abstract. Large-grain synchronous dataflow graphs or multi-rate graphs have the distinct feature that the nodes of the dataflow graph fire at different rates. Such multi-rate large-grain dataflow graphs have been widely regarded as a powerful programming model for DSP applications. In this paper we propose a method to minimize buffer storage requirement in constructing rate-optimal compile-time (MBRO) schedules for multi-rate dataflow graphs. We demonstrate that the constraints to minimize buffer storage while executing at the optimal computation rate (i.e. the maximum possible computation rate without storage constraints) can be formulated as a unified linear programming problem in our framework. A novel feature of our method is that in constructing the rate-optimal schedule, it directly minimizes the memory requirement by choosing the schedule time of nodes appropriately. Lastly, a new circular-arc interval graph coloring algorithm has been proposed to further reduce the memory requirement by allowing buffer sharing among the arcs of the multi-rate dataflow graph. We have constructed an experimental testbed which implements our MBRO scheduling algorithm as well as (i) the widely used periodic admissible parallel schedules (also known as block schedules) proposed by Lee an
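
    The rate mismatch described above is captured by the standard SDF balance equations: for an arc from node u to node v that produces p tokens and consumes c tokens per firing, a steady-state schedule must satisfy q(u) * p = q(v) * c, where q is the repetition vector. The sketch below computes that vector; it is background shared by any rate-optimal schedule, not the paper's MBRO linear program, and the example graph and rates are hypothetical.

        # Minimal sketch: repetition vector of a multi-rate SDF graph.
        # Assumes the graph is connected and consistent (a solution to the
        # balance equations exists); a real tool would verify this.
        from fractions import Fraction
        from math import lcm

        def repetition_vector(nodes, arcs):
            """nodes: list of names; arcs: list of (src, dst, prod, cons)."""
            q = {nodes[0]: Fraction(1)}
            changed = True
            while changed:  # propagate firing rates across arcs
                changed = False
                for src, dst, prod, cons in arcs:
                    if src in q and dst not in q:
                        q[dst] = q[src] * prod / cons
                        changed = True
                    elif dst in q and src not in q:
                        q[src] = q[dst] * cons / prod
                        changed = True
            # scale to the smallest positive integer solution
            scale = lcm(*(r.denominator for r in q.values()))
            return {n: int(r * scale) for n, r in q.items()}

        # A -> B produces 3 and consumes 2; B -> C produces 1 and consumes 3.
        print(repetition_vector(["A", "B", "C"],
                                [("A", "B", 3, 2), ("B", "C", 1, 3)]))
        # {'A': 2, 'B': 3, 'C': 1}: one iteration moves 6 tokens over A -> B.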

    Verification of the Performance Properties of Embedded Streaming Applications via Constraint-Based Scheduling

    Abstract. The capabilities, and accordingly the design complexity, of embedded systems have expanded enormously in recent years, riding the wave of Moore's law. Meanwhile, time to market has shrunk, forcing challenges onto designers, who in turn adopt new design methods to increase their productivity. In response to these pressures, modern systems have moved towards on-chip multiprocessing technologies. New on-chip multiprocessing architectures have emerged to exploit the tremendous advances in fabrication technology, and Multiprocessor Systems-on-Chip (MPSoCs) have been adopted as suitable platforms for executing complex embedded applications. To reduce the cost of the hardware platform, applications share resources, which may result in inter-application timing interference due to resource request conflicts. The features of a typical SoC impose great challenges on SoC verification in two respects. First, the large scale of hardware integration leads to sophisticated hardware-hardware interactions: since an SoC has multiple components, the interactions between them can give rise to emergent properties that are not present in any single component. Second, the introduction of software into hardware behaviour leads to sophisticated hardware-software interactions: since an SoC has at least one processor, software forms a new dimension of the SoC's behaviour and hence brings a new dimension to verification. This makes verification a challenging task, in particular for communication and multimedia applications, owing to the non-functional constraints of hardware and software modules, such as processor speed, buffer size, energy budget, and scheduling policy, and to the combination of multiple applications. This thesis advocates Constraint Programming (CP) as a powerful tool for the verification of performance metrics of MPSoCs. In this work, we model the mapping of streaming applications onto a target Multi-Processor System-on-Chip (MPSoC) architecture as a constraint-based scheduling problem, and we test it both in isolation and in interaction with other application types. The idea is to create a system-level scenario that takes into account the system-level workflow with respect to system resources and performance requirements, namely task deadlines, response time, CPU and memory usage, and buffer size. Specifically, we investigate whether the behaviour of the different interactions among system components executing different tasks can be effectively expressed as a constraint-based scheduling problem over the space of possible inputs to the system, in order to determine whether similar failure cases can be addressed with this model. Solving this problem amounts to finding a better way to inspect the system under verification at a very early design stage and in a much more reasonable time.
Our proposed approach was tested with various applications, different input streams, and different architectures. We built our model around existing commercial architectures running selected applications and compared its results with those obtained by running the actual applications on the system. The results show that the methodology can identify system failure conditions in a fraction of the time needed by simulation-based verification. It gives the test engineer the ability to explore the design space and deduce the best policy, and it also helps in choosing an appropriate architecture for the running applications.
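
As a rough illustration of casting performance verification as constraint-based scheduling, the sketch below exhaustively searches task start times against deadline and mutual-exclusion constraints on one shared processor. It is a toy stand-in for the thesis's CP model, which also constrains buffer sizes, memory, and CPU utilization across an MPSoC; the task durations, deadlines, and horizon are hypothetical.

    # Toy constraint-based feasibility check: find start times such that
    # every task meets its deadline and no two tasks overlap on the CPU.
    from itertools import product

    def feasible(tasks, horizon):
        """tasks: list of (duration, deadline); returns start times or None."""
        for starts in product(range(horizon + 1), repeat=len(tasks)):
            if any(s + d > dl for s, (d, dl) in zip(starts, tasks)):
                continue  # a deadline constraint is violated
            ivs = sorted((s, s + d) for s, (d, _) in zip(starts, tasks))
            if all(a[1] <= b[0] for a, b in zip(ivs, ivs[1:])):
                return starts  # all constraints satisfied
        return None  # unsatisfiable: a candidate failure condition

    print(feasible([(2, 4), (3, 8), (1, 3)], horizon=8))  # e.g. (0, 3, 2)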

    An accurate analysis for guaranteed performance of multiprocessor streaming applications

    For more than a decade, consumer electronic devices have been available for entertainment, educational, or telecommunication tasks based on multimedia streaming applications, i.e., applications that process streams of audio and video samples in digital form. Multimedia capabilities are expected to become more and more commonplace in portable devices. This leads to challenges with respect to cost efficiency and quality. This thesis contributes models and analysis techniques for improving the cost efficiency, and therefore also the quality, of multimedia devices. Portable consumer electronic devices should feature flexible functionality on the one hand and low power consumption on the other hand. Those two requirements are conflicting. Therefore, we focus on a class of hardware that represents a good trade-off between those two requirements, namely domain-specific multiprocessor systems-on-chip (MP-SoCs). Our research work contributes to dynamic (i.e., run-time) optimization of MP-SoC system metrics. The central question in this area is how to ensure that real-time constraints are satisfied and the metric of interest, such as perceived multimedia quality or power consumption, is optimized. In these cases, we speak of quality-of-service (QoS) and power management, respectively. In this thesis, we pursue real-time constraint satisfaction that is guaranteed by the system by construction and proven mainly by analytical reasoning. That approach is often taken in real-time systems to ensure reliable performance. The performance analysis therefore has to be conservative, i.e., it has to use pessimistic assumptions about the unknown conditions that can negatively influence system performance. We adopt this hypothesis as the foundation of this work. The subject of this thesis is therefore the analysis of guaranteed performance for multimedia applications running on multiprocessors. It is very important to note that our conservative approach is essentially different from considering only the worst-case state of the system. Unlike the worst-case approach, our approach is dynamic, i.e., it makes use of run-time characteristics of the input data and the environment of the application. The main purpose of our performance analysis method is to guide run-time optimization. Typically, a resource or quality manager predicts the execution time, i.e., the time it takes the system to process a certain number of input data samples. When the execution times shrink, due to the dependency of execution time on the input data, the manager can switch the control parameter for the metric of interest such that the metric improves but the system gets slower. For power optimization, that means switching to a low-power mode. If execution times grow, the manager can set parameters so that the system gets faster. For QoS management, for example, the application can be switched to a different quality mode with some degradation in perceived quality. The real-time constraints are then never violated, and the metrics of interest are kept as good as possible. Unfortunately, maintaining system metrics such as power and quality at the optimal level conflicts with our main requirement of providing performance guarantees, because doing so requires giving up some quality or power consumption. Therefore, the performance analysis approach developed in this thesis is not only conservative, but also accurate, so that the optimization of the metric of interest does not suffer too much from conservativity.
This is not trivial to realize when two factors are combined: parallel execution on multiple processors and dynamic variation of the data-dependent execution delays. We achieve the goal of conservative and accurate performance estimation for an important class of multiprocessor platforms and multimedia applications. Our performance analysis technique is realizable in practice in QoS or power management setups. We consider a generic MP-SoC platform that runs a dynamic set of applications, each application possibly using multiple processors. We assume that the applications are independent, although it is possible to relax this requirement in the future. To support real-time constraints, we require that the platform can provide guaranteed computation, communication and memory budgets for applications. Following important trends in system-on-chip communication, we support both global buses and networks-on-chip. We represent every application as a homogeneous synchronous dataflow (HSDF) graph, where the application tasks are modeled as graph nodes, called actors. We allow dynamic data-dependent actor execution delays, which makes HSDF graphs very useful for expressing modern streaming applications. Our reason for considering HSDF graphs is that they provide a good basic foundation for analytical performance estimation. In this setup, this thesis provides three major contributions: 1. Given an application mapped to an MP-SoC platform, given the performance guarantees for the individual computation units (the processors) and the communication unit (the network-on-chip), and given constant actor execution delays, we derive the throughput and the execution time of the system as a whole. 2. Given a mapped application and platform performance guarantees as in the previous item, we extend our approach for constant actor execution delays to dynamic data-dependent actor delays. 3. We propose a global implementation trajectory that starts from the application specification and goes through design-time and run-time phases. It uses an extension of the HSDF model of computation to reflect the design decisions made along the trajectory. We present our model and trajectory not only to put the first two contributions into the right context, but also to present our vision of the different parts of the trajectory, making a complete and consistent story. Our first contribution uses the idea of so-called IPC (inter-processor communication) graphs known from the literature, whereby a single model of computation (i.e., HSDF graphs) is used to model not only the computation units, but also the communication unit (the global bus or the network-on-chip) and the FIFO (first-in-first-out) buffers that form the 'glue' between the computation and communication units. We were the first to propose HSDF graph structures for modeling bounded FIFO buffers and guaranteed-throughput network connections for network-on-chip communication in MP-SoCs. As a result, our HSDF models enable the formalization of the on-chip FIFO buffer capacity minimization problem under a throughput constraint as a graph-theoretic problem. Using HSDF graphs to formalize that problem helps to find the performance bottlenecks in a given solution to this problem and to improve this solution. To demonstrate this, we use the JPEG decoder application case study. Also, we show that, assuming actor delays that are constant (worst-case for the given JPEG image), we can predict execution times of JPEG decoding on two processors with an accuracy of 21%.
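
The first contribution builds on simulating self-timed execution of an HSDF (IPC) graph with constant actor delays, where a bounded FIFO between two actors is modeled as a backward edge carrying as many initial tokens as the buffer capacity. The sketch below uses illustrative actors, delays, and capacities, not the thesis's JPEG model, and derives the execution time of n iterations and hence the throughput.

    # Self-timed execution of an HSDF graph with constant actor delays.
    # A FIFO of capacity c from u to v appears as edge v -> u with c
    # initial tokens (the IPC-graph construction). All numbers invented.
    from functools import lru_cache

    actors = {"src": 2, "dec": 5, "snk": 1}          # actor -> delay
    edges = [("src", "dec", 0), ("dec", "snk", 0),   # (from, to, tokens)
             ("dec", "src", 2),                      # FIFO src->dec, cap 2
             ("snk", "dec", 1)]                      # FIFO dec->snk, cap 1

    @lru_cache(maxsize=None)
    def finish(v, k):
        """Completion time of firing k (k = 0, 1, ...) of actor v."""
        if k < 0:
            return 0  # initial tokens are available at time 0
        start = max(finish(u, k - d) for u, w, d in edges if w == v)
        return start + actors[v]

    n = 100
    t = finish("snk", n - 1)
    print(f"{n} iterations take {t} time units; throughput = {n / t:.3f}")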
Our second contribution is based on an extension of the scenario approach. This approach is based on the observation that the dynamic behavior of an application is typically composed of a limited number of sub-behaviors, i.e., scenarios, that have similar resource requirements, i.e., similar actor execution delays in the context of this thesis. Previous work on scenarios treats only single-processor applications or multiprocessor applications that do not exploit all the flexibility of the HSDF model of computation. We develop new scenario-based techniques in the context of HSDF graphs to derive the timing overlap between different scenarios, which is very important for achieving good accuracy for general HSDF graphs executing on multiprocessors. We exploit this idea in an application case study, the MPEG-4 arbitrarily-shaped video decoder, and demonstrate execution time prediction with an average accuracy of 11%. To the best of our knowledge, for the given setup, no other existing performance analysis technique can provide comparable accuracy while also providing performance guarantees.
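
Scenario-based dataflow timing is often formalized in max-plus algebra, where each scenario contributes a matrix that advances the vector of token availability times, and composing the matrices of a scenario sequence yields a conservative completion-time bound that captures the overlap between consecutive scenarios. The sketch below illustrates that formalization with invented 2x2 matrices; it is an assumption about the machinery, not the thesis's actual MPEG-4 model.

    # Max-plus sketch: (M x)_i = max_j (M_ij + x_j). The state vector x
    # holds availability times of the graph's initial tokens; applying a
    # scenario's matrix advances them by that scenario's timing.
    def maxplus_apply(M, x):
        return [max(m + xj for m, xj in zip(row, x)) for row in M]

    scenarios = {            # hypothetical per-scenario timing matrices
        "cheap":  [[3, 1], [2, 3]],   # e.g. a small decoded frame
        "costly": [[7, 5], [6, 7]],   # e.g. a large decoded frame
    }

    x = [0, 0]  # all initial tokens available at time 0
    for s in ["cheap", "costly", "costly", "cheap"]:
        x = maxplus_apply(scenarios[s], x)
    print("conservative token availability times:", x)  # [20, 20]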

    Compiler techniques for scalable performance of stream programs on multicore architectures

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010. Cataloged from the PDF version of the thesis. Includes bibliographical references (p. 211-222). Given the ubiquity of multicore processors, there is an acute need to enable the development of scalable parallel applications without unduly burdening programmers. Currently, programmers are asked not only to explicitly expose parallelism but also to concern themselves with issues of granularity, load balancing, synchronization, and communication. This thesis demonstrates that when algorithmic parallelism is expressed in the form of a stream program, a compiler can effectively and automatically manage the parallelism. Our compiler assumes responsibility for low-level architectural details, transforming implicit algorithmic parallelism into a mapping that achieves scalable parallel performance for a given multicore target. Stream programming is characterized by regular processing of sequences of data, and it is a natural expression of algorithms in the areas of audio, video, digital signal processing, networking, and encryption. Streaming computation is represented as a graph of independent computation nodes that communicate explicitly over data channels. Our techniques operate on contiguous regions of the stream graph where the input and output rates of the nodes are statically determinable. Within a static region, the compiler first automatically adjusts the granularity and then exploits data, task, and pipeline parallelism in a holistic fashion. We introduce techniques that data-parallelize nodes that operate on overlapping sliding windows of their input, translating serializing state into minimal and parametrized inter-core communication. Finally, for nodes that cannot be data-parallelized due to state, we are the first to automatically apply software-pipelining techniques at a coarse granularity to exploit pipeline parallelism between stateful nodes. Our framework is evaluated in the context of the StreamIt programming language. StreamIt is a high-level stream programming language that has been shown to improve programmer productivity in implementing streaming algorithms. We employ the StreamIt Core benchmark suite of 12 real-world applications to demonstrate the effectiveness of our techniques for varying multicore architectures. For a 16-core distributed memory multicore, we achieve a 14.9x mean speedup. For benchmarks that include sliding-window computation, our sliding-window data-parallelization techniques are required to enable scalable performance for a 16-core SMP multicore (14x mean speedup) and a 64-core distributed shared memory multicore (52x mean speedup). by Michael I. Gordon. Ph.D.
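
    The sliding-window data parallelization described above can be pictured as follows: a node that reads a window of w inputs per output is split across cores by assigning each core a disjoint slice of the outputs plus the w - 1 overlapping boundary inputs, which is the only inter-core communication required. Below is a toy FIR example of the idea, not the StreamIt compiler's actual transformation.

        # Splitting a sliding-window (FIR) actor across cores: each "core"
        # computes a disjoint output range; only taps - 1 boundary inputs
        # are shared between neighbours. Sizes and coefficients invented.
        def fir_segment(xs, coeffs, lo, hi):
            """Outputs lo..hi-1; output i reads window xs[i : i + len(coeffs)]."""
            return [sum(c * xs[i + j] for j, c in enumerate(coeffs))
                    for i in range(lo, hi)]

        def parallel_fir(xs, coeffs, cores):
            n_out = len(xs) - len(coeffs) + 1
            chunk = (n_out + cores - 1) // cores
            parts = [fir_segment(xs, coeffs, c * chunk, min((c + 1) * chunk, n_out))
                     for c in range(cores)]
            return [y for part in parts for y in part]

        xs = list(range(16))
        coeffs = [0.5, 0.25, 0.25]
        # the parallel split agrees with the sequential reference
        assert parallel_fir(xs, coeffs, 4) == fir_segment(xs, coeffs, 0, 14)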

    Language and compiler support for stream programs

    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2009. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Includes bibliographical references (p. 153-166). Stream programs represent an important class of high-performance computations. Defined by their regular processing of sequences of data, stream programs appear most commonly in the context of audio, video, and digital signal processing, though also in networking, encryption, and other areas. Stream programs can be naturally represented as a graph of independent actors that communicate explicitly over data channels. In this work we focus on programs where the input and output rates of actors are known at compile time, enabling aggressive transformations by the compiler; this model is known as synchronous dataflow. We develop a new programming language, StreamIt, that empowers both programmers and compiler writers to leverage the unique properties of the streaming domain. StreamIt offers several new abstractions, including hierarchical single-input single-output streams, composable primitives for data reordering, and a mechanism called teleport messaging that enables precise event handling in a distributed environment. We demonstrate the feasibility of developing applications in StreamIt via a detailed characterization of our 34,000-line benchmark suite, which spans from MPEG-2 encoding/decoding to GMTI radar processing. We also present a novel dynamic analysis for migrating legacy C programs into a streaming representation. The central premise of stream programming is that it enables the compiler to perform powerful optimizations. We support this premise by presenting a suite of new transformations. We describe the first translation of stream programs into the compressed domain, enabling programs written for uncompressed data formats to automatically operate directly on compressed data formats (based on LZ77). This technique offers a median speedup of 15x on common video editing operations. We also review other optimizations developed in the StreamIt group, including automatic parallelization (offering an 11x mean speedup on the 16-core Raw machine), optimization of linear computations (offering a 5.5x average speedup on a Pentium 4), and cache-aware scheduling (offering a 3.5x mean speedup on a StrongARM 1100). While these transformations are beyond the reach of compilers for traditional languages such as C, they become tractable given the abundant parallelism and regular communication patterns exposed by the stream programming model. by William Thies. Ph.D.
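
    To make the abstractions concrete, below is a Python analogue, not StreamIt syntax, of hierarchical single-input single-output streams: each filter declares static pop/push rates, and a pipeline composes filters into a stream that can itself be composed further, which is what lets a compiler reason about rates statically. The filters and rates are invented for illustration.

        # A StreamIt-flavoured sketch: filters with static rates, composed
        # hierarchically into a pipeline that is itself a stream.
        class Filter:
            def __init__(self, pop, push, work):
                self.pop, self.push, self.work = pop, push, work  # static rates

            def run(self, stream):
                out = []
                for i in range(0, len(stream) - self.pop + 1, self.pop):
                    out.extend(self.work(stream[i:i + self.pop]))
                return out

        class Pipeline:
            """Hierarchical composition: a pipeline of streams is a stream."""
            def __init__(self, *stages):
                self.stages = stages

            def run(self, stream):
                for stage in self.stages:
                    stream = stage.run(stream)
                return stream

        avg = Filter(pop=2, push=1, work=lambda w: [sum(w) / 2])
        scale = Filter(pop=1, push=1, work=lambda w: [10 * w[0]])
        print(Pipeline(avg, scale).run(list(range(8))))  # [5.0, 25.0, 45.0, 65.0]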