    Integrating Abstract NoC Models within MPSoC Design

    Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip

    The sustained demand for faster, more powerful chips has been met by chip manufacturing processes that allow increasing numbers of computation units to be integrated onto a single die. The resulting devices, especially in the embedded domain, are often called systems-on-chip (SoC) or multi-processor systems-on-chip (MPSoC). MPSoC design brings a large number of challenges to the foreground, one of the most prominent being the design of the chip interconnect. With the number of on-chip blocks presently in the tens and quickly approaching the hundreds, the question of how best to provide on-chip communication resources is keenly felt. Networks-on-chip (NoCs) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they provide a structured answer to present and future communication requirements. The point-to-point connections and packet-switching paradigm they rely on also help minimize wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline, and several areas require deep investigation before NoCs become viable solutions:
    • The NoC architecture must strike the best trade-off among performance, features and the tight area and power constraints of the on-chip domain.
    • Simulation and verification infrastructure must be put in place to explore, validate and optimize NoC performance.
    • NoCs offer a huge design space, thanks to their extreme customizability in terms of topology and architectural parameters; design tools are needed to prune this space and pick the best solutions.
    • Given their global, distributed nature, it is essential to evaluate the physical implementation of NoCs to assess their suitability for next-generation designs and their area and power costs.
    This dissertation performs a design space exploration of network-on-chip architectures in order to point out the trade-offs associated with the design of each individual network building block and with the design of the network topology overall. The exploration is preceded by a comparative analysis of state-of-the-art interconnect fabrics, both against each other and against early network-on-chip prototypes. The ultimate objective is to highlight the key advantages that NoC realizations provide over state-of-the-art communication infrastructures and the challenges that must be addressed to make this new interconnect technology a reality. Among the latter, technology-related challenges are emerging that call for dedicated design techniques at all levels of the design hierarchy, in particular leakage power dissipation and the containment of process variations and their effects. These objectives were achieved by means of a NoC simulation environment for cycle-accurate modelling and simulation, and a back-end facility for studying NoC physical implementation effects. All the results of this work have been validated on actual silicon layout.
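
    As an illustration of the packet-switching paradigm mentioned above (an illustrative sketch, not material from the dissertation), the fragment below implements dimension-ordered XY routing, a common routing policy for 2D-mesh NoCs: a packet is forwarded along the X dimension until it reaches the destination column, then along Y. The mesh coordinates, port names and the small trace in main are assumptions made only for this example.

        #include <stdio.h>

        /* Output ports of a 2D-mesh NoC router (names are illustrative). */
        typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

        /* Dimension-ordered (XY) routing decision taken at router (rx, ry) for a
         * packet addressed to (dx, dy): travel along X first, then along Y.
         * Never turning from the Y dimension back to X keeps a mesh deadlock-free. */
        static port_t route_xy(int rx, int ry, int dx, int dy) {
            if (dx > rx) return PORT_EAST;
            if (dx < rx) return PORT_WEST;
            if (dy > ry) return PORT_NORTH;
            if (dy < ry) return PORT_SOUTH;
            return PORT_LOCAL;                    /* packet has reached its tile */
        }

        int main(void) {
            /* Trace the hops of a packet from tile (0,0) to tile (2,3). */
            static const char *names[] = { "local", "east", "west", "north", "south" };
            int x = 0, y = 0;
            const int dx = 2, dy = 3;
            for (;;) {
                port_t p = route_xy(x, y, dx, dy);
                printf("router (%d,%d) -> %s\n", x, y, names[p]);
                if (p == PORT_LOCAL) break;
                if (p == PORT_EAST) x++; else if (p == PORT_WEST) x--;
                else if (p == PORT_NORTH) y++; else y--;
            }
            return 0;
        }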

    CoreVA-MPSoC: A Many-core Architecture with Tightly Coupled Shared and Local Data Memories

    Ax J, Sievers G, Daberkow J, et al. CoreVA-MPSoC: A Many-core Architecture with Tightly Coupled Shared and Local Data Memories. IEEE Transactions on Parallel and Distributed Systems. 2018;29(5):1030-1043

    Multicore platforms and three-dimensional integration: architectural analysis and optimization (Piattaforme multicore e integrazione tri-dimensionale: analisi architetturale e ottimizzazione)

    Modern embedded systems embrace many-core shared-memory designs. Due to constrained power and area budgets, most of them feature software-managed scratchpad memories instead of data caches to increase data locality. It is therefore the programmers’ responsibility to explicitly manage memory transfers, and this makes programming these platforms cumbersome. Moreover, complex modern applications must be adequately parallelized before they can turn the parallel potential of the platform into actual performance. To support this, programming models that work at a high level of abstraction have been proposed; they rely on a runtime whose cost hinders performance, especially in embedded systems, where resources and power budgets are constrained. This dissertation explores the applicability of the shared-memory paradigm to modern many-core systems, focusing on ease of programming. It focuses on OpenMP, the de-facto standard for shared-memory programming. In the first part, the cost of algorithms for synchronization and data partitioning is analyzed, and these algorithms are adapted to modern embedded many-cores. Then, the original design of an OpenMP runtime library is presented, which supports complex forms of parallelism such as multi-level and irregular parallelism. The second part of the thesis focuses on heterogeneous systems, where hardware accelerators are coupled to (many-)cores to implement key functional kernels with orders-of-magnitude improvements in speed and energy efficiency compared to the “pure software” version. However, three main issues arise, namely i) platform design complexity, ii) architectural scalability and iii) programmability. To tackle them, a template for a generic hardware processing unit (HWPU) is proposed, which shares memory banks with the cores, and a template for a scalable architecture that integrates the accelerators through the shared-memory system is shown. Then, a full software stack and toolchain are developed to support platform design and to let programmers exploit the accelerators of the platform; the OpenMP frontend is extended to interact with them.
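
    To make the programming burden described above concrete, the sketch below (illustrative only, not code from the thesis) contrasts the explicit data movement a cache-less, scratchpad-based many-core requires with the OpenMP construct used to parallelize the work. A thread-local buffer stands in for a per-core scratchpad and memcpy stands in for the DMA transfer the programmer would have to issue; the array and tile sizes are assumptions of the example.

        #include <stdio.h>
        #include <string.h>
        #include <omp.h>

        #define N    4096
        #define TILE  256                /* assumed scratchpad tile size */

        static int input[N], output[N];

        int main(void) {
            for (int i = 0; i < N; i++) input[i] = i;

            /* Each thread processes whole tiles. The local array models a per-core
             * scratchpad; memcpy models the explicit DMA-in/DMA-out transfers that
             * the programmer must manage on a scratchpad-based embedded many-core. */
            #pragma omp parallel for schedule(static)
            for (int t = 0; t < N / TILE; t++) {
                int spm[TILE];                                  /* "scratchpad" buffer   */
                memcpy(spm, &input[t * TILE], sizeof(spm));     /* DMA in  (modelled)    */
                for (int i = 0; i < TILE; i++)
                    spm[i] = spm[i] * 2 + 1;                    /* compute on local data */
                memcpy(&output[t * TILE], spm, sizeof(spm));    /* DMA out (modelled)    */
            }

            printf("output[10] = %d (threads used: %d)\n", output[10], omp_get_max_threads());
            return 0;
        }

    The cheaper the runtime makes the parallel-for construct and its implicit barrier, the less of this fine-grained parallelism is lost to overhead, which is exactly the concern the thesis addresses.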

    Many-Core Scheduling of Data Parallel Applications Using SMT Solvers

    Abstract—To program recently developed many-core systems-on-chip, two traditionally separate performance optimization problems have to be solved together: first, parallel scheduling on a shared-memory multi-core system; second, the co-scheduling of network communication and processor computation. This is because many-core systems are networks of multi-core clusters. In this paper, we demonstrate the applicability of modern constraint solvers to efficiently schedule parallel applications on many-cores and validate the results by running benchmarks on a real many-core platform. Index Terms—task graph, scheduling, multiprocessor, DMA
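
    As a rough indication of how such a problem can be handed to an SMT or constraint solver (a generic textbook encoding, not necessarily the formulation used in the paper), each task i of the task graph gets a start-time variable s_i and a core variable p_i, with known execution time c_i, and the solver must satisfy

        \begin{aligned}
        & s_j \ge s_i + c_i && \text{for every edge } (i, j) \text{ of the task graph (precedence)}\\
        & p_i = p_j \;\Rightarrow\; (s_j \ge s_i + c_i) \,\lor\, (s_i \ge s_j + c_j) && \text{tasks mapped to the same core must not overlap}\\
        & s_i + c_i \le M && \text{every task finishes within the makespan bound } M.
        \end{aligned}

    A DMA transfer over the on-chip network can be modelled as an additional task occupying the link resource, and querying the solver with progressively smaller bounds M minimizes the schedule length.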

    On-Chip Network Architectures for Embedded Hierarchical Multiprocessors

    Ax J. On-Chip-Netzwerk-Architekturen für eingebettete hierarchische Multiprozessoren. Bielefeld: Universität Bielefeld; 2019. The goal of this work is the realization and analysis of a scalable interconnect structure for a multi-processor system-on-chip (MPSoC). With increasing digitalization, microelectronic systems are being deployed in ever more devices in everyday life and in industry. These are frequently energy-constrained systems that nevertheless exhibit a steadily growing demand for computing power. One trend for meeting this demand is the integration of ever more processor cores on a single microchip. Many-core systems with many hundreds to thousands of resource-efficient CPU cores promise particularly high energy efficiency: compared to systems with a few powerful but more complex CPUs, many-cores achieve their computing power through massive parallelism. The Cognitronics and Sensor Systems group at Bielefeld University is developing the CoreVA-MPSoC for this purpose. To integrate hundreds of CPUs on a single chip, the CoreVA-MPSoC uses a hierarchical interconnect: an on-chip network (NoC) couples a large number of CPU clusters, and within each cluster several resource-efficient VLIW processor cores are connected by a tightly coupled bus structure. The focus of this work is the development and design-space exploration of a resource-efficient NoC architecture for use in the CoreVA-MPSoC. The exploration takes place on several levels. At the level of the NoC interconnect, different topologies and flow-control mechanisms are examined. Furthermore, the development and analysis of a synchronous, a mesochronous and an asynchronous NoC are presented in order to investigate the scalability and energy efficiency of these approaches. A further level is the interface to the processor system or CPU cluster, which has a decisive influence on software development and on the overall performance of the system. Finally, at the system level, the connection of different memory architectures to the NoC is presented and its impact on performance and energy consumption is analyzed. An abstract model of the CoreVA-MPSoC, focused on the NoC, allows area, performance and energy to be estimated for the system and for the execution of streaming applications; this model can be used in the CoreVA-MPSoC compiler for the automatic mapping of applications onto the MPSoC. Ten streaming applications, mainly from signal and image processing, show an average speedup by a factor of 24 when mapped onto a CoreVA-MPSoC with 32 CPUs compared to execution on a single CPU. A CoreVA-MPSoC with 64 CPUs and a total of 3 MB of memory occupies 14.4 mm² in a prototype implementation in a 28-nm FD-SOI standard-cell library; at a clock frequency of 700 MHz its average power consumption is 2 W. An FPGA-based emulation on a cluster of Xilinx Virtex-5 FPGAs additionally enables scalable verification of a CoreVA-MPSoC with an almost arbitrary number of CPUs.
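
    A quick consistency check on the figures quoted above, using only the numbers from the abstract: an average speedup of 24 on 32 CPUs corresponds to a parallel efficiency of

        E = \frac{S}{P} = \frac{24}{32} = 0.75,

    i.e. the ten streaming applications sustain on average about 75 % of ideal linear speedup.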

    Low-Power Embedded Design Solutions and Low-Latency On-Chip Interconnect Architecture for System-On-Chip Design

    This dissertation presents three design solutions to support several key system-on-chip (SoC) issues to achieve low power and high performance. These are: 1) joint source and channel decoding (JSCD) schemes for low-power SoCs used in portable multimedia systems, 2) efficient on-chip interconnect architecture for massive multimedia data streaming on multiprocessor SoCs (MPSoCs), and 3) data processing architecture for low-power SoCs in distributed sensor network (DSS) systems and its implementation. The first part includes a low-power embedded low density parity check code (LDPC) - H.264 joint decoding architecture to lower the baseband energy consumption of a channel decoder using joint source decoding and dynamic voltage and frequency scaling (DVFS). A low-power multiple-input multiple-output (MIMO) and H.264 video joint detector/decoder design that minimizes energy for portable, wireless embedded systems is also designed. In the second part, a link-level quality of service (QoS) scheme using unequal error protection (UEP) for low-power network-on-chip (NoC) and low-latency on-chip network designs for MPSoCs is proposed. This part contains WaveSync, a low-latency-focused network-on-chip architecture for globally-asynchronous locally-synchronous (GALS) designs, and a simultaneous dual-path routing (SDPR) scheme utilizing the path diversity present in typical mesh-topology networks-on-chip. SDPR is akin to having a higher link width but without the significant hardware overhead associated with simple bus width scaling. The last part shows data processing unit designs for embedded SoCs. We propose a data processing and control logic design for a new radiation detection sensor system generating data at or above the Peta-bits-per-second level. Implementation results show that the intended clock rate is achieved within the power target of less than 200 mW. We also present a digital signal processing (DSP) accelerator supporting configurable MAC, FFT, FIR, and 3-D cross product operations for embedded SoCs. It consumes 12.35 mW along with 0.167 mm² area at 333 MHz.
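
    The energy leverage behind the DVFS scheme used in the first part can be seen from the standard CMOS dynamic-power relation (a textbook relation quoted here for illustration, not a result of the dissertation):

        P_{\mathrm{dyn}} \approx \alpha\, C_{\mathrm{eff}}\, V_{dd}^{2}\, f
        \qquad\Longrightarrow\qquad
        E_{\mathrm{op}} \approx \alpha\, C_{\mathrm{eff}}\, V_{dd}^{2}.

    When joint source/channel information indicates that a frame can be decoded more slowly, lowering the clock frequency together with the supply voltage therefore reduces the energy per decoded bit roughly quadratically with V_dd.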

    A Model-based Design Framework for Application-specific Heterogeneous Systems

    The increasing heterogeneity of computing systems enables higher performance and power efficiency. However, these improvements come at the cost of increasing the overall complexity of designing such systems. These complexities include constructing implementations for various types of processors, setting up and configuring communication protocols, and efficiently scheduling the computational work. The process for developing such systems is iterative and time consuming, with no well-defined performance goal. Current performance estimation approaches use source code implementations that require experienced developers and time to produce. We present a framework to aid in the design of heterogeneous systems and the performance tuning of applications. Our framework supports system construction: integrating custom hardware accelerators with existing cores into processors, integrating processors into cohesive systems, and mapping computations to processors to achieve overall application performance and efficient hardware usage. It also facilitates effective design space exploration using processor models (for both existing and future processors) that do not require source code implementations to estimate performance. We evaluate our framework using a variety of applications and implement them in systems ranging from low-power embedded systems-on-chip (SoC) to high-performance systems consisting of commercial-off-the-shelf (COTS) components. We show how the design process is improved, reducing the number of design iterations and unnecessary source code development, ultimately leading to higher-performing, more efficient systems.

    Real-Time Systems Compilation (Compilation de systèmes temps réel)

    I introduce and advocate for the concept of Real-Time Systems Compilation. By analogy with classical compilation, real-time systems compilation consists in the fully automatic construction of running, correct-by-construction implementations from functional and non-functional specifications of embedded control systems. As in a classical compiler, the whole process must be fast (thus enabling a trial-and-error design style) and produce reasonably efficient code. This requires the use of fast heuristics, and the use of fine-grain platform and application models. Unlike a classical compiler, a real-time systems compiler must take into account non-functional properties of a system and ensure the respect of non-functional requirements (in addition to functional correctness). I also present Lopht, a real-time systems compiler for statically scheduled real-time systems that we built by combining techniques and concepts from real-time scheduling, compilation, and synchronous languages.

    State of the art baseband DSP platforms for Software Defined Radio: A survey

    Software Defined Radio (SDR) is an innovative approach which is becoming a more and more promising technology for future mobile handsets. Several proposals in the field of embedded systems have been introduced by different universities and industries to support SDR applications. This article presents an overview of current platforms and analyzes the related architectural choices, the current issues in SDR, as well as potential future trends.