
    Improving the Flexibility of the Deficit Table Scheduler

    A key component of networks with Quality of Service (QoS) support is the egress link scheduler. Table-based schedulers are simple to implement and can offer good latency bounds. Some recent network technologies, such as Advanced Switching and InfiniBand, define one of these schedulers in their specifications. However, these schedulers do not work properly with variable packet sizes and face the problem of coupled bandwidth and latency assignments. We have proposed a new table-based scheduler, the Deficit Table (DTable) scheduler, that works properly with variable packet sizes. Moreover, we have proposed a configuration methodology for this table-based scheduler that partially decouples the bandwidth and latency assignments. In this paper we propose a method to improve the flexibility of the decoupling methodology. Moreover, we compare the latency performance of this strategy with two well-known scheduling algorithms: Self-Clocked Weighted Fair Queuing (SCFQ) and Deficit Round Robin (DRR).
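    For context, the Deficit Round Robin (DRR) baseline referenced above can be sketched in a few lines. This is a minimal illustration of the classic algorithm, not the paper's DTable scheduler; the queue contents and quantum values are made-up examples.

```python
from collections import deque

def drr_schedule(queues, quanta, rounds):
    """Deficit Round Robin: each non-empty queue earns a quantum of byte
    credit per round and may send packets while its credit lasts.
    `queues` maps a flow id to a deque of packet sizes (bytes);
    `quanta` maps a flow id to its per-round quantum (bytes)."""
    deficit = {flow: 0 for flow in queues}
    sent = []  # (flow, packet_size) in transmission order
    for _ in range(rounds):
        for flow, q in queues.items():
            if not q:
                deficit[flow] = 0  # idle flows accumulate no credit
                continue
            deficit[flow] += quanta[flow]
            while q and q[0] <= deficit[flow]:
                pkt = q.popleft()
                deficit[flow] -= pkt
                sent.append((flow, pkt))
    return sent

# Example: flow 'a' is assigned twice the quantum of flow 'b',
# so it receives roughly twice the bandwidth.
queues = {'a': deque([300, 300, 300]), 'b': deque([300, 300, 300])}
print(drr_schedule(queues, {'a': 600, 'b': 300}, rounds=2))
```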

    Predictable and composable system-on-chip memory controllers

    Contemporary Systems-on-Chip (SoCs) become increasingly complex, as growing integration results in a larger number of concurrently executing applications. These applications consist of tasks that are mapped on heterogeneous multi-processor platforms with distributed memory hierarchies, where SRAMs and SDRAMs are shared using a variety of arbiters. Some applications have real-time requirements, meaning that they must perform a particular computation before a deadline to guarantee functional correctness or to prevent quality degradation. Mapping the applications on the platform such that all real-time requirements are satisfied is very challenging. The number of possible mappings of tasks to processing elements and of data structures to memories may be large, and appropriate configuration settings must be determined once the mapping is chosen. Verifying that a particular mapping satisfies all application requirements is typically done by system-level simulation. However, resource sharing causes interference between applications, making their temporal behaviors interdependent. All concurrently executing applications must hence be verified together, causing the verification complexity of the system to increase exponentially with the number of applications. Together, these factors make the integration and verification process a dominant part of SoC development, both in terms of time and money.

    Predictable and composable systems are proposed to manage the increasing verification complexity. Predictable systems provide lower bounds on application performance, while applications in composable systems are completely isolated and cannot affect each other's temporal behavior by even a single clock cycle. Predictable systems enable formal verification that covers all possible interactions with the platform. However, this assumes that the behavior of an application is captured in a performance model, which is not the case for many applications. Composability offers a complementary verification approach by letting these applications be verified independently by simulation with linear verification complexity. A limitation of current predictable and composable systems is that no memory controllers support the concepts in a general way. Current SRAM controllers can be shared in a predictable way with a variety of arbiters, but are only composable if statically scheduled or shared using time-division multiplexing. Existing SDRAM controllers are not composable, and are either unpredictable or limited to applications that are statically scheduled.

    This thesis addresses the limitations of current predictable and composable systems by proposing a general predictable and composable memory controller, thereby addressing the mapping and verification problem in embedded systems. The proposed memory controller is divided into a front-end and a back-end. The back-end is specific to DDR2/DDR3 SDRAM and makes the memory behave in a predictable manner using precomputed memory patterns that are dynamically combined at run time. The front-end contains buffering and an arbiter in the class of Latency-Rate (LR) servers, a class that contains many well-known predictable arbiters. We extend this class with a Credit-Controlled Static-Priority (CCSP) arbiter, developed specifically for shared resources with latency-critical requestors and high loads, such as memories.
    Three key features of CCSP are: 1) it accommodates latency-critical requestors with low bandwidth requirements without wasting bandwidth; 2) over-allocated bandwidth can be made negligible at an increased area cost, without affecting latency; 3) it has a small implementation that runs fast enough to keep up with most DDR2/DDR3 memories. The proposed front-end is general and can be used with other predictable resources, such as SRAM controllers. The proposed memory controller hence supports multiple arbiter and memory types, thus addressing the diversity in modern SoCs. The combination of front-end and predictable memory behaves like an LR server, which is the shared resource abstraction used in this work. In essence, an LR server guarantees a requestor a minimum bandwidth and a maximum latency, enabling formal verification of real-time requirements. The LR server model is compatible with several commonly used formal analysis frameworks, such as network calculus and data-flow analysis. Our memory controller hence allows any combination of predictable memory and LR arbiter to be used transparently for formal verification of applications with any of these frameworks.

    Sharing a predictable memory at run time results in interference between requestors, making the memory controller non-composable. This is addressed by adding a Delay Block to the front-end that delays all signals sent from the front-end to a requestor to always emulate worst-case interference. This makes requestors unable to affect each other's temporal behavior, which is sufficient to guarantee composability at the level of applications. Our predictable memory controller hence offers composable service with a variety of memory and arbiter types, which widely extends the scope of composable platforms. Another benefit of this approach is that composable service can be dynamically enabled and disabled, allowing requestors that do not require composable service to use slack bandwidth to improve performance.

    The predictable and composable memory controller is supported by a configuration flow that automatically computes memory patterns and arbiter settings to satisfy given bandwidth and latency requirements. The flow uses abstraction to separate the configuration of the memory and the arbiter, enabling settings to be computed in a streamlined fashion for all supported memories and arbiters.
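    The LR server abstraction used in this thesis comes with a simple worst-case service guarantee: a requestor allocated rate rho (service units per cycle) and service latency Theta has a request of size s finished no later than Theta + s/rho after the start of a busy period. A minimal sketch of checking a latency requirement against this bound; the parameter values are illustrative assumptions, not numbers from the thesis.

```python
def lr_worst_case_finish(theta_cycles, rho, request_size):
    """Worst-case finishing time (in cycles) of a request of
    `request_size` service units on a Latency-Rate server with
    service latency `theta_cycles` and allocated rate `rho`
    (service units per cycle), measured from the start of a
    busy period: Theta + s / rho."""
    return theta_cycles + request_size / rho

# Example: a requestor allocated 10% of the memory bandwidth
# (rho = 0.1 words/cycle) with a service latency of 40 cycles
# sees a 16-word read finished within 40 + 16/0.1 = 200 cycles,
# so a 250-cycle latency requirement would be satisfied.
bound = lr_worst_case_finish(theta_cycles=40, rho=0.1, request_size=16)
print(bound, bound <= 250)
```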

Achieving Performance in Networks-on-Chip for Real-Time Systems

    In many new applications, such as automated driving, high performance requirements have reached safety-critical real-time systems. Consequently, Networks-on-Chip (NoCs) must efficiently host new, highly dynamic workloads, e.g., high-resolution sensor fusion and data processing, or autonomous decision making combined with machine learning. The static platform management used in current safety-critical systems is no longer sufficient to provide the needed level of service. Dynamic platform management could meet the challenge, but it usually lacks the predictability and simplicity necessary for certification of safety and real-time properties.

    In this work, we propose a novel, global and dynamic arbitration scheme for NoCs with real-time QoS requirements. The mechanism decouples admission control from arbitration in routers, thereby simplifying dynamic adaptation and real-time analysis. Consequently, the proposed solution allows the deployment of sophisticated contract-based QoS provisioning without introducing the complicated and hard-to-maintain schemes known from frequently applied static arbiters. The presented work introduces an overlay network that synchronizes transmissions using arbitration units called Resource Managers (RMs), which allows global and work-conserving scheduling. The description of resource allocation strategies is supplemented by a protocol design and a verification methodology, bringing adaptive control to NoC communication in setups with different QoS requirements and traffic classes. To this end, a formal worst-case timing analysis of the mechanism is proposed, demonstrating that this solution not only achieves higher performance in simulation but, more importantly, consistently reaches smaller formally guaranteed worst-case latencies than other strategies at realistic levels of system utilization. The approach is not limited to a specific network architecture or topology: the mechanism requires no modifications of routers and can therefore be used with the majority of existing manycore systems. The evaluation used generic performance-optimized router designs as well as two systems-on-chip focused on real-time deployments. The results confirm that the proposed approach exhibits significantly higher average performance in both simulation and execution.
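    To illustrate the idea of decoupling admission control from router arbitration, the following toy Resource Manager grants a transmission only while the sum of admitted rates stays within a contracted capacity. The class and its interface are hypothetical simplifications of the mechanism described above; real RMs arbitrate over an overlay network and handle multiple traffic classes.

```python
class ResourceManager:
    """Toy global admission controller for a shared NoC resource.
    Senders ask the RM before injecting traffic; the routers
    themselves stay unmodified, as in the scheme described above."""

    def __init__(self, capacity):
        self.capacity = capacity  # e.g. flits/cycle on the contended link
        self.admitted = {}        # client id -> granted injection rate

    def request(self, client, rate):
        """Grant the rate if the contract still fits, else refuse."""
        used = sum(self.admitted.values())
        if used + rate <= self.capacity:
            self.admitted[client] = self.admitted.get(client, 0) + rate
            return True   # transmission may start
        return False      # client must wait or renegotiate

    def release(self, client):
        """Client finished; free its admitted rate for others."""
        self.admitted.pop(client, None)

rm = ResourceManager(capacity=1.0)
print(rm.request("sensor_fusion", 0.6))   # True
print(rm.request("ml_inference", 0.5))    # False: would exceed capacity
rm.release("sensor_fusion")
print(rm.request("ml_inference", 0.5))    # True after release
```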

    Temporal analysis and scheduling of hard real-time radios running on a multi-processor

    On a multi-radio baseband system, multiple independent transceivers must share the resources of a multi-processor while each meets its own hard real-time requirements. Not all possible combinations of transceivers are known at compile time, so a solution must be found that either allows for independent timing analysis or relies on run-time timing analysis. This thesis proposes a design flow and software architecture that meets these challenges, while enabling features such as independent transceiver compilation and dynamic loading, and taking into account other concerns such as ease of programming, efficiency, and ease of validation.

    We take data flow as the basic model of computation, as it fits the application domain, and several static variants (such as Single-Rate, Multi-Rate and Cyclo-Static) have been shown to possess strong analytical properties. Traditional temporal analysis of data flow can provide minimum throughput guarantees for a self-timed implementation of data flow. Since transceivers may need to guarantee strictly periodic execution and meet latency requirements, we extend the analysis techniques to show that we can enforce strict periodicity for an actor in the graph; we also provide maximum latency analysis techniques for periodic, sporadic and bursty sources.

    We propose a scheduling strategy and an automatic scheduling flow that enable the simultaneous execution of multiple transceivers with hard real-time requirements, described as Single-Rate Data Flow (SRDF) graphs. Each transceiver has its own execution rate and starts and stops independently of other transceivers, at times unknown at compile time, on a multi-processor. We show how to combine scheduling and mapping decisions with the input application data flow graph to generate a worst-case temporal analysis graph. We propose algorithms to find a mapping per transceiver in the form of clusters of statically-ordered actors, and a budget for either a Time Division Multiplex (TDM) or Non-Preemptive Non-Blocking Round Robin (NPNBRR) scheduler per cluster per transceiver. The budget is computed such that, if the platform can provide it, the desired minimum throughput and maximum latency of the transceiver are guaranteed, while the required processing resources are minimized. We illustrate the use of these techniques by mapping a combination of WLAN and TDS-CDMA receivers onto a prototype Software-Defined Radio platform.

    The functionality of transceivers for standards with very dynamic behavior, such as WLAN, cannot be conveniently modeled as an SRDF graph, since SRDF cannot express variations of actor firing rules depending on the values of input data. Because of this, we propose a restricted, customized data flow model of computation, Mode-Controlled Data Flow (MCDF), that can capture the data-value-dependent behavior of a transceiver while allowing rigorous temporal analysis and tight resource budgeting. We develop a number of analysis techniques to characterize the temporal behavior of MCDF graphs in terms of maximum latencies and throughput. We also extend our SRDF scheduling strategy to MCDF. The capabilities of MCDF are then illustrated with a WLAN 802.11a receiver model.

    Having computed budgets for each transceiver, we propose a way to use these budgets for run-time resource mapping and admissibility analysis. At transceiver start time, the budget for each cluster of statically-ordered actors is allocated by a resource manager to platform resources.
    The resource manager enforces strict admission control to prevent transceivers from interfering with each other's worst-case temporal behavior. We propose algorithms adapted from Vector Bin-Packing to map transceivers to the multi-processor architecture at start time, considering also the case where the processors are connected by a network-on-chip with resource reservation guarantees, in which case we also find a routing and resource allocation on the network-on-chip. In our experiments, our resource allocation algorithms can keep 95% of the system resources occupied while suffering an allocation failure rate of less than 5%.

    An implementation of the framework was carried out on a prototype board. We present performance and memory utilization figures for this implementation, as they provide insight into the costs of adopting our approach. It turns out that, for an unoptimized implementation of the framework with no hardware support for synchronization, the scheduling and synchronization overhead is 16.3% of the cycle budget for a WLAN receiver on an EVP processor at 320 MHz. However, this overhead is less than 1% for mobile standards such as TDS-CDMA or LTE, which have lower rates and thus larger cycle budgets. Considering that clock speeds will increase and that the synchronization primitives can be optimized to exploit the addressing modes available in the EVP, these results are very promising.
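    A core piece of the SRDF temporal analysis mentioned above is the standard result that a self-timed execution is guaranteed at least one graph iteration per maximum cycle mean (MCM), where the MCM maximizes the sum of actor execution times over the sum of initial tokens along each cycle. A brute-force sketch of that computation; the two-actor graph is a made-up example, and a real tool would use Karp's algorithm rather than enumerating simple cycles.

```python
def max_cycle_mean(nodes, edges, exec_time):
    """MCM of an SRDF graph: max over cycles of
    (sum of actor execution times) / (sum of initial tokens).
    `edges` is a list of (src, dst, tokens). Brute-force simple-cycle
    enumeration; adequate only for tiny illustrative graphs."""
    adj = {}
    for src, dst, tok in edges:
        adj.setdefault(src, []).append((dst, tok))
    best = 0.0

    def dfs(start, node, visited, t_sum, tok_sum):
        nonlocal best
        for nxt, tok in adj.get(node, []):
            if nxt == start:
                if tok_sum + tok > 0:  # live cycles carry tokens
                    best = max(best, t_sum / (tok_sum + tok))
            elif nxt not in visited:
                dfs(start, nxt, visited | {nxt},
                    t_sum + exec_time[nxt], tok_sum + tok)

    for n in nodes:
        dfs(n, n, {n}, exec_time[n], 0)
    return best

# Two actors in a feedback loop with one initial token:
# MCM = (2 + 3) / 1 = 5 time units, so the self-timed execution
# is guaranteed a throughput of at least 1/5 iterations per unit.
nodes = ["src", "dec"]
edges = [("src", "dec", 0), ("dec", "src", 1)]
mcm = max_cycle_mean(nodes, edges, {"src": 2, "dec": 3})
print(mcm, 1.0 / mcm)
```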

    Queueing-Theoretic End-to-End Latency Modeling of Future Wireless Networks

    The fifth generation (5G) of mobile communication networks is envisioned to enable a variety of novel applications. These applications place diverse and challenging requirements on the network. Consequently, the mobile network must not only be capable of meeting the demands of any one of these applications, but also be flexible enough to be tailored to the needs of various services. Among these new applications are use cases that require low latency as well as ultra-high reliability, e.g., to ensure unobstructed production in factory automation or road safety for (autonomous) transportation. In these domains, the requirements are crucial, since violating them may lead to financial or even human damage. Hence, an ultra-low probability of failure is necessary.

    From this, two major questions arise that motivate this thesis. First, how can ultra-low failure probabilities be evaluated, given that experiments or simulations would require a tremendous number of runs and are thus infeasible? Second, given a network that can be configured differently for different applications through the concept of network slicing, what performance can be expected for different parameter settings, and what is their optimal choice, particularly in the presence of other applications?

    In this thesis, both questions are answered by appropriate mathematical modeling of the radio interface and the radio access network. The aim is to find the distribution of the (end-to-end) latency, which allows extracting stochastic measures such as the mean and the variance, but also ultra-high percentiles at the distribution tail. The percentile analysis eventually leads to the desired evaluation of worst-case scenarios at ultra-low probabilities. To this end, queueing theory is used to study video streaming performance together with one or multiple (low-latency) applications. One of the key contributions is the development of a numeric algorithm to obtain the latency of general queueing systems for homogeneous as well as prioritized heterogeneous traffic. This provides the foundation for analyzing and improving the end-to-end latency of applications with known traffic distributions in arbitrary network topologies consisting of one or multiple network slices.
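    The thesis targets general queueing systems, but the role of ultra-high percentiles can be illustrated with the analytically solvable M/M/1 queue, where the sojourn time is exponentially distributed with rate mu - lambda. The traffic values below are illustrative assumptions; note how the 1 - 10^-9 percentile lies an order of magnitude above the mean, which is why simulation-based evaluation of such tails is infeasible.

```python
import math

def mm1_latency_percentile(lam, mu, p):
    """In an M/M/1 queue with arrival rate `lam` and service rate `mu`
    (lam < mu), the sojourn time T is exponential with rate mu - lam,
    so P(T > t) = exp(-(mu - lam) * t) and the p-percentile is
    t_p = -ln(1 - p) / (mu - lam)."""
    assert lam < mu, "queue must be stable"
    return -math.log(1.0 - p) / (mu - lam)

# Mean latency vs. an 'ultra-reliable' tail percentile:
lam, mu = 800.0, 1000.0  # packets/s, 80% load
print(1.0 / (mu - lam))                            # mean: 0.005 s
print(mm1_latency_percentile(lam, mu, 1 - 1e-9))   # tail: ~0.104 s
```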