71 research outputs found

    Techniques for Crafting Customizable MPSoCS

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    Towards Efficient Resource Allocation for Embedded Systems

    Get PDF
    Das Hauptthema ist die dynamische Ressourcenverwaltung in eingebetteten Systemen, insbesondere die Verwaltung von Rechenzeit und Netzwerkverkehr auf einem MPSoC. Die Idee besteht darin, eine Pipeline für die Verarbeitung von Mobiler Kommunikation auf dem Chip dynamisch zu schedulen, um die Effizienz der Hardwareressourcen zu verbessern, ohne den Ressourcenverbrauch des dynamischen Schedulings dramatisch zu erhöhen. Sowohl Software- als auch Hardwaremodule werden auf Hotspots im Ressourcenverbrauch untersucht und optimiert, um diese zu entfernen. Da Applikationen im Bereich der Signalverarbeitung normalerweise mit Hilfe von SDF-Diagrammen beschrieben werden können, wird deren dynamisches Scheduling optimiert, um den Ressourcenverbrauch gegenüber dem üblicherweise verwendeten statischen Scheduling zu verbessern. Es wird ein hybrider dynamischer Scheduler vorgestellt, der die Vorteile von Processing-Networks und der Planung von Task-Graphen kombiniert. Es ermöglicht dem Scheduler, ein Gleichgewicht zwischen der Parallelisierung der Berechnung und der Zunahme des dynamischen Scheduling-Aufands optimal abzuwägen. Der resultierende dynamisch erstellte Schedule reduziert den Ressourcenverbrauch um etwa 50%, wobei die Laufzeit im Vergleich zu einem statischen Schedule nur um 20% erhöht wird. Zusätzlich wird ein verteilter dynamischer SDF-Scheduler vorgeschlagen, der das Scheduling in verschiedene Teile zerlegt, die dann zu einer Pipeline verbunden werden, um mehrere parallele Prozessoren einzubeziehen. Jeder Scheduling-Teil wird zu einem Cluster mit Load-Balancing erweitert, um die Anzahl der parallel laufenden Scheduling-Jobs weiter zu erhöhen. Auf diese Weise wird dem vorhandene Engpass bei dem dynamischen Scheduling eines zentralisierten Schedulers entgegengewirkt, sodass 7x mehr Prozessoren mit dem Pipelined-Clustered-Dynamic-Scheduler für eine typische Signalverarbeitungsanwendung verwendet werden können. Das neue dynamische Scheduling-System setzt das Vorhandensein von drei verschiedenen Kommunikationsmodi zwischen den Verarbeitungskernen voraus. Bei der Emulation auf Basis des häufig verwendeten RDMA-Protokolls treten Leistungsprobleme auf. Sehr gut kann RDMA für einmalige Punkt-zu-Punkt-Datenübertragungen verwendet werden, wie sie bei der Ausführung von Task-Graphen verwendet werden. Process-Networks verwenden normalerweise Datenströme mit hohem Volumen und hoher Bandbreite. Es wird eine FIFO-basierte Kommunikationslösung vorgestellt, die einen zyklischen Puffer sowohl im Sender als auch im Empfänger implementiert, um diesen Bedarf zu decken. Die Pufferbehandlung und die Datenübertragung zwischen ihnen erfolgen ausschließlich in Hardware, um den Software-Overhead aus der Anwendung zu entfernen. Die Implementierung verbessert die Zugriffsverwaltung mehrerer Nutzer auf flächen-effiziente Single-Port Speichermodule. Es werden 0,8 der theoretisch möglichen Bandbreite, die normalerweise nur mit flächenmäßig teureren Dual-Port-Speichern erreicht wird. Der dritte Kommunikationsmodus definiert eine einfache Message-Passing-Implementierung, die ohne einen Verbindungszustand auskommt. Dieser Modus wird für eine effiziente prozessübergreifende Kommunikation des verteilten Scheduling-Systems und der engen Ansteuerung der restlichen Prozessoren benötigt. Eine Flusskontrolle in Hardware stellt sicher, dass eine große Anzahl von Sendern Nachrichten an denselben Empfänger senden kann. Dabei wird garantiert, dass alle Nachrichten korrekt empfangen werden, ohne dass eine Verbindung hergestellt werden muss und die Nachrichtenlaufzeit gering bleibt. Die Arbeit konzentriert sich auf die Optimierung des Codesigns von Hardware und Software, um die kompromisslose Ressourceneffizienz der dynamischen SDF-Graphen-Planung zu erhöhen. Besonderes Augenmerk wird auf die Abhängigkeiten zwischen den Ebenen eines verteilten Scheduling-Systems gelegt, das auf der Verfügbarkeit spezifischer hardwarebeschleunigter Kommunikationsmethoden beruht.:1 Introduction 1.1 Motivation 1.2 The Multiprocessor System on Chip Architecture 1.3 Concrete MPSoC Architecture 1.4 Representing LTE/5G baseband processing as Static Data Flow 1.5 Compuation Stack 1.6 Performance Hotspots Addressed 1.7 State of the Art 1.8 Overview of the Work 2 Hybrid SDF Execution 2.1 Addressed Performance Hotspot 2.2 State of the Art 2.3 Static Data Flow Graphs 2.4 Runtime Environment 2.5 Overhead of Deloying Tasks to a MPSoC 2.6 Interpretation of SDF Graphs as Task Graphs 2.7 Interpreting SDF Graphs as Process Networks 2.8 Hybrid Interpretation 2.9 Graph Topology Considerations 2.10 Theoretic Impact of Hybrid Interpretation 2.11 Simulating Hybrid Execution 2.12 Pipeline SDF Graph Example 2.13 Random SDF Graphs 2.14 LTE-like SDF Graph 2.15 Key Lernings 3 Distribution of Management 3.1 Addressed Performance Hotspot 3.2 State of the Art 3.3 Revising Deployment Overhead 3.4 Distribution of Overhead 3.5 Impact of Management Distribution to Resource Utilization 3.6 Reconfigurability 3.7 Key Lernings 4 Sliced FIFO Hardware 4.1 Addressed Performance Hotspot 4.2 State of the Art 4.3 System Environment 4.4 Sliced Windowed FIFO buffer 4.5 Single FIFO Evaluation 4.6 Multiple FIFO Evalutaion 4.7 Hardware Implementation 4.8 Key Lernings 5 Message Passing Hardware 5.1 Addressed Performance Hotspot 5.2 State of the Art 5.3 Message Passing Regarded as Queueing 5.4 A Remote Direct Memory Access Based Implementation 5.5 Hardware Implementation Concept 5.6 Evalutation of Performance 5.7 Key Lernings 6 SummaryThe main topic is the dynamic resource allocation in embedded systems, especially the allocation of computing time and network traffic on an multi processor system on chip (MPSoC). The idea is to dynamically schedule a mobile communication signal processing pipeline on the chip to improve hardware resource efficiency while not dramatically improve resource consumption because of dynamic scheduling overhead. Both software and hardware modules are examined for resource consumption hotspots and optimized to remove them. Since signal processing can usually be described with the help of static data flow (SDF) graphs, the dynamic handling of those is optimized to improve resource consumption over the commonly used static scheduling approach. A hybrid dynamic scheduler is presented that combines benefits from both processing networks and task graph scheduling. It allows the scheduler to optimally balance parallelization of computation and addition of dynamic scheduling overhead. The resulting dynamically created schedule reduces resource consumption by about 50%, with a runtime increase of only 20% compared to a static schedule. Additionally, a distributed dynamic SDF scheduler is proposed that splits the scheduling into different parts, which are then connected to a scheduling pipeli ne to incorporate multiple parallel working processors. Each scheduling stage is reworked into a load-balanced cluster to increase the number of parallel scheduling jobs further. This way, the still existing dynamic scheduling bottleneck of a centralized scheduler is widened, allowing handling 7x more processors with the pipelined, clustered dynamic scheduler for a typical signal processing application. The presented dynamic scheduling system assumes the presence of three different communication modes between the processing cores. When emulated on top of the commonly used remote direct memory access (RDMA) protocol, performance issues are encountered. Firstly, RDMA can neatly be used for single-shot point-to-point data transfers, like used in task graph scheduling. Process networks usually make use of high-volume and high-bandwidth data streams. A first in first out (FIFO) communication solution is presented that implements a cyclic buffer on both sender and receiver to serve this need. The buffer handling and data transfer between them are done purely in hardware to remove software overhead from the application. The implementation improves the multi-user access to area-efficient single port on-chip memory modules. It achieves 0.8 of the theoretically possible bandwidth, usually only achieved with area expensive dual-port memories. The third communication mode defines a lightweight message passing (MP) implementation that is truly connectionless. It is needed for efficient inter-process communication of the distributed and clustered scheduling system and the worker processing units’ tight coupling. A hardware flow control assures that an arbitrary number of senders can spontaneously start sending messages to the same receiver. Yet, all messages are guaranteed to be correctly received while eliminating the need for connection establishment and keeping a low message delay. The work focuses on the hardware-software codesign optimization to increase the uncompromised resource efficiency of dynamic SDF graph scheduling. Special attention is paid to the inter-level dependencies in developing a distributed scheduling system, which relies on the availability of specific hardwareaccelerated communication methods.:1 Introduction 1.1 Motivation 1.2 The Multiprocessor System on Chip Architecture 1.3 Concrete MPSoC Architecture 1.4 Representing LTE/5G baseband processing as Static Data Flow 1.5 Compuation Stack 1.6 Performance Hotspots Addressed 1.7 State of the Art 1.8 Overview of the Work 2 Hybrid SDF Execution 2.1 Addressed Performance Hotspot 2.2 State of the Art 2.3 Static Data Flow Graphs 2.4 Runtime Environment 2.5 Overhead of Deloying Tasks to a MPSoC 2.6 Interpretation of SDF Graphs as Task Graphs 2.7 Interpreting SDF Graphs as Process Networks 2.8 Hybrid Interpretation 2.9 Graph Topology Considerations 2.10 Theoretic Impact of Hybrid Interpretation 2.11 Simulating Hybrid Execution 2.12 Pipeline SDF Graph Example 2.13 Random SDF Graphs 2.14 LTE-like SDF Graph 2.15 Key Lernings 3 Distribution of Management 3.1 Addressed Performance Hotspot 3.2 State of the Art 3.3 Revising Deployment Overhead 3.4 Distribution of Overhead 3.5 Impact of Management Distribution to Resource Utilization 3.6 Reconfigurability 3.7 Key Lernings 4 Sliced FIFO Hardware 4.1 Addressed Performance Hotspot 4.2 State of the Art 4.3 System Environment 4.4 Sliced Windowed FIFO buffer 4.5 Single FIFO Evaluation 4.6 Multiple FIFO Evalutaion 4.7 Hardware Implementation 4.8 Key Lernings 5 Message Passing Hardware 5.1 Addressed Performance Hotspot 5.2 State of the Art 5.3 Message Passing Regarded as Queueing 5.4 A Remote Direct Memory Access Based Implementation 5.5 Hardware Implementation Concept 5.6 Evalutation of Performance 5.7 Key Lernings 6 Summar

    A Survey and Comparative Study of Hard and Soft Real-time Dynamic Resource Allocation Strategies for Multi/Many-core Systems

    Get PDF
    Multi-/many-core systems are envisioned to satisfy the ever-increasing performance requirements of complex applications in various domains such as embedded and high-performance computing. Such systems need to cater to increasingly dynamic workloads, requiring efficient dynamic resource allocation strategies to satisfy hard or soft real-time constraints. This article provides an extensive survey of hard and soft real-time dynamic resource allocation strategies proposed since the mid-1990s and highlights the emerging trends for multi-/many-core systems. The survey covers a taxonomy of the resource allocation strategies and considers their various optimization objectives, which have been used to provide comprehensive comparison. The strategies employ various principles, such as market and biological concepts, to perform the optimizations. The trend followed by the resource allocation strategies, open research challenges, and likely emerging research directions have also been provided

    A Probabilistic Approach for the System-Level Design of Multi-ASIP Platforms

    Get PDF

    Ordonnancement hybride des applications flots de données sur des systèmes embarqués multi-coeurs

    Get PDF
    Les systèmes embarqués sont de plus en plus présents dans l'industrie comme dans la vie quotidienne. Une grande partie de ces systèmes comprend des applications effectuant du traitement intensif des données: elles utilisent de nombreux filtres numériques, où les opérations sur les données sont répétitives et ont un contrôle limité. Les graphes "flots de données", grâce à leur déterminisme fonctionnel inhérent, sont très répandus pour modéliser les systèmes embarqués connus sous le nom de "data-driven". L'ordonnancement statique et périodique des graphes flot de données a été largement étudié, surtout pour deux modèles particuliers: SDF et CSDF. Dans cette thèse, on s'intéresse plus particulièrement à l'ordonnancement périodique des graphes CSDF. Le problème consiste à identifier des séquences périodiques infinies d'actionnement des acteurs qui aboutissent à des exécutions complètes à buffers bornés. L'objectif est de pouvoir aborder ce problème sous des angles différents : maximisation de débit, minimisation de la latence et minimisation de la capacité des buffers. La plupart des travaux existants proposent des solutions pour l'optimisation du débit et négligent le problème d'optimisation de la latence et propose même dans certains cas des ordonnancements qui ont un impact négatif sur elle afin de conserver les propriétés de périodicité. On propose dans cette thèse un ordonnancement hybride, nommé Self-Timed Périodique (STP), qui peut conserver les propriétés d'un ordonnancement périodique et à la fois améliorer considérablement sa performance en terme de latence.One of the most important aspects of parallel computing is its close relation to the underlying hardware and programming models. In this PhD thesis, we take dataflow as the basic model of computation, as it fits the streaming application domain. Cyclo-Static Dataflow (CSDF) is particularly interesting because this variant is one of the most expressive dataflow models while still being analyzable at design time. Describing the system at higher levels of abstraction is not sufficient, e.g. dataflow have no direct means to optimize communication channels generally based on shared buffers. Therefore, we need to link the dataflow MoCs used for performance analysis of the programs, the real time task models used for timing analysis and the low-level model used to derive communication times. This thesis proposes a design flow that meets these challenges, while enabling features such as temporal isolation and taking into account other challenges such as predictability and ease of validation. To this end, we propose a new scheduling policy noted Self-Timed Periodic (STP), which is an execution model combining Self-Timed Scheduling (STS) with periodic scheduling. In STP scheduling, actors are no longer strictly periodic but self-timed assigned to periodic levels: the period of each actor under periodic scheduling is replaced by its worst-case execution time. Then, STP retains some of the performance and flexibility of self-timed schedule, in which execution times of actors need only be estimates, and at the same time makes use of the fact that with a periodic schedule we can derive a tight estimation of the required performance metrics

    On the simulation and design of manycore CMPs

    Get PDF
    The progression of Moore’s Law has resulted in both embedded and performance computing systems which use an ever increasing number of processing cores integrated in a single chip. Commercial systems are now available which provide hundreds of cores, and academics have proposed architectures for up to 1024 cores. Embedded multicores are increasingly popular as it is easier to guarantee hard-realtime constraints using individual cores dedicated for tasks, than to use traditional time-multiplexed processing. However, finding the optimal hardware configuration to meet these requirements at minimum cost requires extensive trial and error approaches to investigate the design space. This thesis tackles the problems encountered in the design of these large scale multicore systems by first addressing the problem of fast, detailed micro-architectural simulation. Initially addressing embedded systems, this work exploits the lack of hardware cache-coherence support in many deeply embedded systems to increase the available parallelism in the simulation. Then, through partitioning the NoC and using packet counting and cycle skipping reduces the amount of computation required to accurately model the NoC interconnect. In combination, this enables simulation speeds significantly higher than the state of the art, while maintaining less error, when compared to real hardware, than any similar simulator. Simulation speeds reach up to 370MIPS (Million (target) Instructions Per Second), or 110MHz, which is better than typical FPGA prototypes, and approaching final ASIC production speeds. This is achieved while maintaining an error of only 2.1%, significantly lower than other similar simulators. The thesis continues by scaling the simulator past large embedded systems up to 64-1024 core processors, adding support for coherent architectures using the same packet counting techniques along with low overhead context switching to enable the simulation of such large systems with stricter synchronisation requirements. The new interconnect model was partitioned to enable parallel simulation to further improve simulation speeds in a manner which did not sacrifice any accuracy. These innovations were leveraged to investigate significant novel energy saving optimisations to the coherency protocol, processor ISA, and processor micro-architecture. By introducing a new instruction, with the name wait-on-address, the energy spent during spin-wait style synchronisation events can be significantly reduced. This functions by putting the core into a low-power idle state while the cache line of the indicated address is monitored for coherency action. Upon an update or invalidation (or traditional timer or external interrupts) the core will resume execution, but the active energy of running the core pipeline and repeatedly accessing the data and instruction caches is effectively reduced to static idle power. The thesis also shows that existing combined software-hardware schemes to track data regions which do not require coherency can adequately address the directory-associativity problem, and introduces a new coherency sharer encoding which reduces the energy consumed by sharer invalidations when sharers are grouped closely together, such as would be the case with a system running many tasks with a small degree of parallelism in each. The research concludes by using the extremely fast simulation speeds developed to produce a large set of training data, collecting various runtime and energy statistics for a wide range of embedded applications on a huge diverse range of potential MPSoC designs. This data was used to train a series of machine learning based models which were then evaluated on their capacity to predict performance characteristics of unseen workload combinations across the explored MPSoC design space, using only two sample simulations, with promising results from some of the machine learning techniques. The models were then used to produce a ranking of predicted performance across the design space, and on average Random Forest was able to predict the best design within 89% of the runtime performance of the actual best tested design, and better than 93% of the alternative design space. When predicting for a weighted metric of energy, delay and area, Random Forest on average produced results within 93% of the optimum result. In summary this thesis improves upon the state of the art for cycle accurate multicore simulation, introduces novel energy saving changes the the ISA and microarchitecture of future multicore processors, and demonstrates the viability of machine learning techniques to significantly accelerate the design space exploration required to bring a new manycore design to market

    Design Space Exploration for MPSoC Architectures

    Get PDF
    Multiprocessor system-on-chip (MPSoC) designs utilize the available technology and communication architectures to meet the requirements of the upcoming applications. In MPSoC, the communication platform is both the key enabler, as well as the key differentiator for realizing efficient MPSoCs. It provides product differentiation to meet a diverse, multi-dimensional set of design constraints, including performance, power, energy, reconfigurability, scalability, cost, reliability and time-to-market. The communication resources of a single interconnection platform cannot be fully utilized by all kind of applications, such as the availability of higher communication bandwidth for computation but not data intensive applications is often unfeasible in the practical implementation. This thesis aims to perform the architecture-level design space exploration towards efficient and scalable resource utilization for MPSoC communication architecture. In order to meet the performance requirements within the design constraints, careful selection of MPSoC communication platform, resource aware partitioning and mapping of the application play important role. To enhance the utilization of communication resources, variety of techniques such as resource sharing, multicast to avoid re-transmission of identical data, and adaptive routing can be used. For implementation, these techniques should be customized according to the platform architecture. To address the resource utilization of MPSoC communication platforms, variety of architectures with different design parameters and performance levels, namely Segmented bus (SegBus), Network-on-Chip (NoC) and Three-Dimensional NoC (3D-NoC), are selected. Average packet latency and power consumption are the evaluation parameters for the proposed techniques. In conventional computing architectures, fault on a component makes the connected fault-free components inoperative. Resource sharing approach can utilize the fault-free components to retain the system performance by reducing the impact of faults. Design space exploration also guides to narrow down the selection of MPSoC architecture, which can meet the performance requirements with design constraints.Siirretty Doriast
    corecore