18 research outputs found

    Low-Power Embedded Design Solutions and Low-Latency On-Chip Interconnect Architecture for System-On-Chip Design

    Get PDF
    This dissertation presents three design solutions to support several key system-on-chip (SoC) issues to achieve low-power and high performance. These are: 1) joint source and channel decoding (JSCD) schemes for low-power SoCs used in portable multimedia systems, 2) efficient on-chip interconnect architecture for massive multimedia data streaming on multiprocessor SoCs (MPSoCs), and 3) data processing architecture for low-power SoCs in distributed sensor network (DSS) systems and its implementation. The first part includes a low-power embedded low density parity check code (LDPC) - H.264 joint decoding architecture to lower the baseband energy consumption of a channel decoder using joint source decoding and dynamic voltage and frequency scaling (DVFS). A low-power multiple-input multiple-output (MIMO) and H.264 video joint detector/decoder design that minimizes energy for portable, wireless embedded systems is also designed. In the second part, a link-level quality of service (QoS) scheme using unequal error protection (UEP) for low-power network-on-chip (NoC) and low latency on-chip network designs for MPSoCs is proposed. This part contains WaveSync, a low-latency focused network-on-chip architecture for globally-asynchronous locally-synchronous (GALS) designs and a simultaneous dual-path routing (SDPR) scheme utilizing path diversity present in typical mesh topology network-on-chips. SDPR is akin to having a higher link width but without the significant hardware overhead associated with simple bus width scaling. The last part shows data processing unit designs for embedded SoCs. We propose a data processing and control logic design for a new radiation detection sensor system generating data at or above Peta-bits-per-second level. Implementation results show that the intended clock rate is achieved within the power target of less than 200mW. We also present a digital signal processing (DSP) accelerator supporting configurable MAC, FFT, FIR, and 3-D cross product operations for embedded SoCs. It consumes 12.35mW along with 0.167mm2 area at 333MHz

    A survey on scheduling and mapping techniques in 3D Network-on-chip

    Full text link
    Network-on-Chips (NoCs) have been widely employed in the design of multiprocessor system-on-chips (MPSoCs) as a scalable communication solution. NoCs enable communications between on-chip Intellectual Property (IP) cores and allow those cores to achieve higher performance by outsourcing their communication tasks. Mapping and Scheduling methodologies are key elements in assigning application tasks, allocating the tasks to the IPs, and organising communication among them to achieve some specified objectives. The goal of this paper is to present a detailed state-of-the-art of research in the field of mapping and scheduling of applications on 3D NoC, classifying the works based on several dimensions and giving some potential research directions

    Run-time management of many-core SoCs: A communication-centric approach

    Get PDF
    The single core performance hit the power and complexity limits in the beginning of this century, moving the industry towards the design of multi- and many-core system-on-chips (SoCs). The on-chip communication between the cores plays a criticalrole in the performance of these SoCs, with power dissipation, communication latency, scalability to many cores, and reliability against the transistor failures as the main design challenges. Accordingly, we dedicate this thesis to the communicationcentered management of the many-core SoCs, with the goal to advance the state-ofthe-art in addressing these challenges. To this end, we contribute to on-chip communication of many-core SoCs in three main directions. First, we start with a synthesizable SoC with full system simulation. We demonstrate the importance of the networking overhead in a practical system, and propose our sophisticated network interface (NI) that offloads the work from SW to HW. Our results show around 5x and up to 50x higher network performance, compared to previous works. As the second direction of this thesis, we study the significance of run-time application mapping. We demonstrate that contiguous application mapping not only improves the network latency (by 23%) and power dissipation (by 50%), but also improves the system throughput (by 3%) and quality-of-service (QoS) of soft real-time applications (up to 100x less deadline misses). Also our hierarchical run-time application mapping provides 99.41% successful mapping when up to 8 links are broken. As the final direction of the thesis, we propose a fault-tolerant routing algorithm, the maze-routing. It is the first-in-class algorithm that provides guaranteed delivery, a fully-distributed solution, low area overhead (by 16x), and instantaneous reconfiguration (vs. 40K cycles down time of previous works), all at the same time. Besides the individual goals of each contribution, when applicable, we ensure that our solutions scale to extreme network sizes like 12x12 and 16x16. This thesis concludes that the communication overhead and its optimization play a significant role in the performance of many-core SoC

    Exploration architecturale et étude des performances des réseaux sur puce 3D partiellement connectés verticalement

    Get PDF
    Utilization of the third dimension can lead to a significant reduction in power and average hop-count in Networks- on-Chip (NoC). TSV technology, as the most promising technology in 3D integration, offers short and fast vertical links which copes with the long wire problem in 2D NoCs. Nonetheless, TSVs are huge and their manufacturing process is still immature, which reduces the yield of 3D NoC based SoC. Therefore, Vertically-Partially-Connected 3D-NoC has been introduced to benefit from both 3D technology and high yield. Moreover, Vertically-Partially-Connected 3D-NoC is flexible, due to the fact that the number, placement, and assignment of the vertical links in each layer can be decided based on the limitations and requirements of the design. However, there are challenges to present a feasible and high-performance Vertically-Partially-Connected Mesh-based 3D-NoC due to the removed vertical links between the layers. This thesis addresses the challenges of Vertically-Partially-Connected Mesh-based 3D-NoC: Routing is the major problem of the Vertically-Partially-Connected 3D-NoC. Since some vertical links are removed, some of the routers do not have up or/and down ports. Therefore, there should be a path to send a packet to upper or lower layer which obviously has to be determined by a routing algorithm. The suggested paths should not cause deadlock through the network. To cope with this problem we explain and evaluate a deadlock- and livelock-free routing algorithm called Elevator First. Fundamentally, the NoC performance is affected by both 1) micro-architecture of routers and 2) architecture of interconnection. The router architecture has a significant effect on the performance of NoC, as it is a part of transportation delay. Therefore, the simplicity and efficiency of the design of NoC router micro architecture are the critical issues, especially in Vertically-Partially-Connected 3D-NoC which has already suffered from high average latency due to some removed vertical links. Therefore, we present the design and implementation the micro-architecture of a router which not only exactly and quickly transfers the packets based on the Elevator First routing algorithm, but it also consumes a reasonable amount of area and power. From the architecture point of view, the number and placement of vertical links have a key role in the performance of the Vertically-Partially-Connected Mesh-based 3D-NoC, since they affect the average hop-count and link and buffer utilization in the network. Furthermore, the assignment of the vertical links to the routers which do not have up or/and down port(s) is an important issue which influences the performance of the 3D routers. Therefore, the architectural exploration of Vertically-Partially-Connected Mesh-based 3D-NoC is both important and non-trivial. We define, study, and evaluate the parameters which describe the behavior of the network. The parameters can be helpful to place and assign the vertical links in the layers effectively. Finally, we propose a quadratic-based estimation method to anticipate the saturation threshold of the network's average latency.L'utilisation de la troisième dimension peut entraîner une réduction significative de la puissance et de la latence moyenne du trafic dans les réseaux sur puce (Network-on-Chip). La technologie des vias à travers le substrat (ou Through-Silicon Via) est la technologie la plus prometteuse pour l'intégration 3D, car elle offre des liens verticaux courts qui remédient au problème des longs fils dans les NoCs-2D. Les TSVs sont cependant énormes et les processus de fabrication sont immatures, ce qui réduit le rendement des systèmes sur puce à base de NoC-3D. Par conséquent, l'idée de réseaux sur puce 3D partiellement connectés verticalement a été introduite pour bénéficier de la technologie 3D tout en conservant un haut rendement. En outre, de tels réseaux sont flexibles, car le nombre, l'emplacement et l'affectation des liens verticaux dans chaque couche peuvent être décidés en fonction des exigences de l'application. Cependant, ce type de réseaux pose un certain nombre de défis : Le routage est le problème majeur, car l'élimination de certains liens verticaux fait que l'on ne peut utiliser les algorithmes classiques qui suivent l'ordre des dimensions. Pour répondre à cette question nous expliquons et évaluons un algorithme de routage déterministe appelé “Elevator First”, qui garanti d'une part que si un chemin existe, alors on le trouve, et que d'autre part il n'y aura pas d'interblocages. Fondamentalement, la performance du NoC est affecté par a) la micro architecture des routeurs et b) l'architecture d'interconnexion. L'architecture du routeur a un effet significatif sur la performance du NoC, à cause de la latence qu'il induit. Nous présentons la conception et la mise en œuvre de la micro-architecture d'un routeur à faible latence implantant​​l'algorithme de routage Elevator First, qui consomme une quantité raisonnable de surface et de puissance. Du point de vue de l'architecture, le nombre et le placement des liens verticaux ont un rôle important dans la performance des réseaux 3D partiellement connectés verticalement, car ils affectent le nombre moyen de sauts et le taux d'utilisation des FIFOs dans le réseau. En outre, l'affectation des liens verticaux vers les routeurs qui n'ont pas de ports vers le haut ou/et le bas est une question importante qui influe fortement sur les performances. Par conséquent, l'exploration architecturale des réseaux sur puce 3D partiellement connectés verticalement est importante. Nous définissons, étudions et évaluons des paramètres qui décrivent le comportement du réseau, de manière à déterminer le placement et l'affectation des liens verticaux dans les couches de manière simple et efficace. Nous proposons une méthode d'estimation quadratique visantà anticiper le seuil de saturation basée sur ces paramètres

    Spatial parallelism in the routers of asynchronous on-chip networks

    Get PDF
    State-of-the-art multi-processor systems-on-chip use on-chip networks as their communication fabric. Although most on-chip networks are implemented synchronously, asynchronous on-chip networks have several advantages over their synchronous counterparts. Timing division multiplexing (TDM) flow control methods have been utilized in asynchronous on-chip networks extensively. The synchronization required by TDM leads to significant speed penalties. Compared with using TDM methods, spatial parallelism methods, such as the spatial division multiplexing (SDM) flow control method, achieve better network throughput with less area overhead.This thesis proposes several techniques to increase spatial parallelism in the routers of asynchronous on-chip networks.Channel slicing is a new pipeline structure that alleviates the speed penalty by removing the synchronization among bit-level data pipelines. It is also found out that the lookahead pipeline using early evaluated acknowledgement can be used in routers to further improve speed.SDM is a new flow control method proposed for asynchronous on-chip networks. It improves network throughput without introducing synchronization among buffers of different frames, which is required by TDM methods. It is also found that the area overhead of SDM is smaller than the virtual channel (VC) flow control method -- the most used TDM method. The major design problem of SDM is the area consuming crossbars. A novel 2-stage Clos switch structure is proposed to replace the crossbar in SDM routers, which significantly reduces the area overhead. This Clos switch is dynamically reconfigured by a new asynchronous Clos scheduler.Several asynchronous SDM routers are implemented using these new techniques. An asynchronous VC router is also reproduced for comparison. Performance analyses show that the SDM routers outperform the VC router in throughput, area overhead and energy efficiency.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    Spatial parallelism in the routers of asynchronous on-chip networks

    Get PDF
    State-of-the-art multi-processor systems-on-chip use on-chip networks as their communication fabric. Although most on-chip networks are implemented synchronously, asynchronous on-chip networks have several advantages over their synchronous counterparts. Timing division multiplexing (TDM) flow control methods have been utilized in asynchronous on-chip networks extensively. The synchronization required by TDM leads to significant speed penalties. Compared with using TDM methods, spatial parallelism methods, such as the spatial division multiplexing (SDM) flow control method, achieve better network throughput with less area overhead.This thesis proposes several techniques to increase spatial parallelism in the routers of asynchronous on-chip networks.Channel slicing is a new pipeline structure that alleviates the speed penalty by removing the synchronization among bit-level data pipelines. It is also found out that the lookahead pipeline using early evaluated acknowledgement can be used in routers to further improve speed.SDM is a new flow control method proposed for asynchronous on-chip networks. It improves network throughput without introducing synchronization among buffers of different frames, which is required by TDM methods. It is also found that the area overhead of SDM is smaller than the virtual channel (VC) flow control method -- the most used TDM method. The major design problem of SDM is the area consuming crossbars. A novel 2-stage Clos switch structure is proposed to replace the crossbar in SDM routers, which significantly reduces the area overhead. This Clos switch is dynamically reconfigured by a new asynchronous Clos scheduler.Several asynchronous SDM routers are implemented using these new techniques. An asynchronous VC router is also reproduced for comparison. Performance analyses show that the SDM routers outperform the VC router in throughput, area overhead and energy efficiency.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    On Energy Efficient Computing Platforms

    Get PDF
    In accordance with the Moore's law, the increasing number of on-chip integrated transistors has enabled modern computing platforms with not only higher processing power but also more affordable prices. As a result, these platforms, including portable devices, work stations and data centres, are becoming an inevitable part of the human society. However, with the demand for portability and raising cost of power, energy efficiency has emerged to be a major concern for modern computing platforms. As the complexity of on-chip systems increases, Network-on-Chip (NoC) has been proved as an efficient communication architecture which can further improve system performances and scalability while reducing the design cost. Therefore, in this thesis, we study and propose energy optimization approaches based on NoC architecture, with special focuses on the following aspects. As the architectural trend of future computing platforms, 3D systems have many bene ts including higher integration density, smaller footprint, heterogeneous integration, etc. Moreover, 3D technology can signi cantly improve the network communication and effectively avoid long wirings, and therefore, provide higher system performance and energy efficiency. With the dynamic nature of on-chip communication in large scale NoC based systems, run-time system optimization is of crucial importance in order to achieve higher system reliability and essentially energy efficiency. In this thesis, we propose an agent based system design approach where agents are on-chip components which monitor and control system parameters such as supply voltage, operating frequency, etc. With this approach, we have analysed the implementation alternatives for dynamic voltage and frequency scaling and power gating techniques at different granularity, which reduce both dynamic and leakage energy consumption. Topologies, being one of the key factors for NoCs, are also explored for energy saving purpose. A Honeycomb NoC architecture is proposed in this thesis with turn-model based deadlock-free routing algorithms. Our analysis and simulation based evaluation show that Honeycomb NoCs outperform their Mesh based counterparts in terms of network cost, system performance as well as energy efficiency.Siirretty Doriast

    A time-predictable many-core processor design for critical real-time embedded systems

    Get PDF
    Critical Real-Time Embedded Systems (CRTES) are in charge of controlling fundamental parts of embedded system, e.g. energy harvesting solar panels in satellites, steering and breaking in cars, or flight management systems in airplanes. To do so, CRTES require strong evidence of correct functional and timing behavior. The former guarantees that the system operates correctly in response of its inputs; the latter ensures that its operations are performed within a predefined time budget. CRTES aim at increasing the number and complexity of functions. Examples include the incorporation of \smarter" Advanced Driver Assistance System (ADAS) functionality in modern cars or advanced collision avoidance systems in Unmanned Aerial Vehicles (UAVs). All these new features, implemented in software, lead to an exponential growth in both performance requirements and software development complexity. Furthermore, there is a strong need to integrate multiple functions into the same computing platform to reduce the number of processing units, mass and space requirements, etc. Overall, there is a clear need to increase the computing power of current CRTES in order to support new sophisticated and complex functionality, and integrate multiple systems into a single platform. The use of multi- and many-core processor architectures is increasingly seen in the CRTES industry as the solution to cope with the performance demand and cost constraints of future CRTES. Many-cores supply higher performance by exploiting the parallelism of applications while providing a better performance per watt as cores are maintained simpler with respect to complex single-core processors. Moreover, the parallelization capabilities allow scheduling multiple functions into the same processor, maximizing the hardware utilization. However, the use of multi- and many-cores in CRTES also brings a number of challenges related to provide evidence about the correct operation of the system, especially in the timing domain. Hence, despite the advantages of many-cores and the fact that they are nowadays a reality in the embedded domain (e.g. Kalray MPPA, Freescale NXP P4080, TI Keystone II), their use in CRTES still requires finding efficient ways of providing reliable evidence about the correct operation of the system. This thesis investigates the use of many-core processors in CRTES as a means to satisfy performance demands of future complex applications while providing the necessary timing guarantees. To do so, this thesis contributes to advance the state-of-the-art towards the exploitation of parallel capabilities of many-cores in CRTES contributing in two different computing domains. From the hardware domain, this thesis proposes new many-core designs that enable deriving reliable and tight timing guarantees. From the software domain, we present efficient scheduling and timing analysis techniques to exploit the parallelization capabilities of many-core architectures and to derive tight and trustworthy Worst-Case Execution Time (WCET) estimates of CRTES.Los sistemas críticos empotrados de tiempo real (en ingles Critical Real-Time Embedded Systems, CRTES) se encargan de controlar partes fundamentales de los sistemas integrados, e.g. obtención de la energía de los paneles solares en satélites, la dirección y frenado en automóviles, o el control de vuelo en aviones. Para hacerlo, CRTES requieren fuerte evidencias del correcto comportamiento funcional y temporal. El primero garantiza que el sistema funciona correctamente en respuesta de sus entradas; el último asegura que sus operaciones se realizan dentro de unos limites temporales establecidos previamente. El objetivo de los CRTES es aumentar el número y la complejidad de las funciones. Algunos ejemplos incluyen los sistemas inteligentes de asistencia a la conducción en automóviles modernos o los sistemas avanzados de prevención de colisiones en vehiculos aereos no tripulados. Todas estas nuevas características, implementadas en software,conducen a un crecimiento exponencial tanto en los requerimientos de rendimiento como en la complejidad de desarrollo de software. Además, existe una gran necesidad de integrar múltiples funciones en una sóla plataforma para así reducir el número de unidades de procesamiento, cumplir con requisitos de peso y espacio, etc. En general, hay una clara necesidad de aumentar la potencia de cómputo de los actuales CRTES para soportar nueva funcionalidades sofisticadas y complejas e integrar múltiples sistemas en una sola plataforma. El uso de arquitecturas multi- y many-core se ve cada vez más en la industria CRTES como la solución para hacer frente a la demanda de mayor rendimiento y las limitaciones de costes de los futuros CRTES. Las arquitecturas many-core proporcionan un mayor rendimiento explotando el paralelismo de aplicaciones al tiempo que proporciona un mejor rendimiento por vatio ya que los cores se mantienen más simples con respecto a complejos procesadores de un solo core. Además, las capacidades de paralelización permiten programar múltiples funciones en el mismo procesador, maximizando la utilización del hardware. Sin embargo, el uso de multi- y many-core en CRTES también acarrea ciertos desafíos relacionados con la aportación de evidencias sobre el correcto funcionamiento del sistema, especialmente en el ámbito temporal. Por eso, a pesar de las ventajas de los procesadores many-core y del hecho de que éstos son una realidad en los sitemas integrados (por ejemplo Kalray MPPA, Freescale NXP P4080, TI Keystone II), su uso en CRTES aún precisa de la búsqueda de métodos eficientes para proveer evidencias fiables sobre el correcto funcionamiento del sistema. Esta tesis ahonda en el uso de procesadores many-core en CRTES como un medio para satisfacer los requisitos de rendimiento de aplicaciones complejas mientras proveen las garantías de tiempo necesarias. Para ello, esta tesis contribuye en el avance del estado del arte hacia la explotación de many-cores en CRTES en dos ámbitos de la computación. En el ámbito del hardware, esta tesis propone nuevos diseños many-core que posibilitan garantías de tiempo fiables y precisas. En el ámbito del software, la tesis presenta técnicas eficientes para la planificación de tareas y el análisis de tiempo para aprovechar las capacidades de paralelización en arquitecturas many-core, y también para derivar estimaciones de peor tiempo de ejecución (Worst-Case Execution Time, WCET) fiables y precisas

    Vorhersagbares und zur Laufzeit adaptierbares On-Chip Netzwerk für gemischt kritische Echtzeitsysteme

    Get PDF
    The industry of safety-critical and dependable embedded systems calls for even cheaper, high performance platforms that allow flexibility and an efficient verification of safety and real-time requirements. To cope with the increasing complexity of interconnected functions and to reduce the cost and power consumption of the system, multicore systems are used to efficiently integrate different processing units in the same chip. Networks-on-chip (NoCs), as a modular interconnect, are used as a promising solution for such multiprocessor systems on chip (MPSoCs), due to their scalability and performance. For safety-critical systems, a major goal is the avoidance of hazards. For this, safety-critical systems are qualified or even certified to prove the correctness of the functioning under all possible cases. A predictable behaviour of the NoC can help to ease the qualification process of the system. To achieve the required predictability, designers have two classes of solutions: quality of service mechanisms and (formal) analysis. For mixed-criticality systems, isolation and analysis approaches must be combined to efficiently achieve the desired predictability. Traditional NoC analysis and architecture concepts tackle only a subpart of the challenges: they focus on either performance or predictability. Existing, predictable NoCs are deemed too expensive and inflexible to host a variety of applications with opposing constraints. And state-of-the-art analyses neglect certain platform properties to verify the behaviour. Together this leads to a high over-provisioning of the hardware resources as well as adverse impacts on system performance, and on the flexibility of the system. In this work we tackle these challenges and develop a predictable and runtime-adaptable NoC architecture that efficiently integrates mixed-critical applications with opposing constraints. Additionally, we present a modelling and analysis framework for NoCs that accounts for backpressure. This framework enables to evaluate the performance and reliability early at design time. Hence, the designer can assess multiple design decisions by using abstract models and formal approaches.Die Industrie der sicherheitskritischen und zuverlässigen eingebetteten Systeme verlangt nach noch günstigeren, leistungsfähigeren Plattformen, welche Flexibilität und eine effiziente Überprüfung der Sicherheits- und Echtzeitanforderungen ermöglichen. Um der zunehmenden Komplexität der zunehmend vernetzten Funktionen gerecht zu werden und die Kosten und den Stromverbrauch eines Systems zu reduzieren, werden Mehrkern-Systeme eingesetzt. On-Chip Netzwerke werden aufgrund ihrer Skalierbarkeit und Leistung als vielversprechende Lösung für solch Mehrkern-Systeme eingesetzt. Bei sicherheitskritischen Systemen ist die Vermeidung von Gefahren ein wesentliches Ziel. Dazu werden sicherheitskritische Systeme qualifiziert oder zertifiziert, um die Funktionsfähigkeit in allen möglichen Fällen nachzuweisen. Ein vorhersehbares Verhalten des on-Chip Netzwerks kann dabei helfen, den Qualifizierungsprozess des Systems zu erleichtern. Um die erforderliche Vorhersagbarkeit zu erreichen, gibt es zwei Klassen von Lösungen: Quality of Service Mechanismen und (formale) Analyse. Für Systeme mit gemischter Relevanz müssen Isolationsmechanismen und Analyseansätze kombiniert werden, um die gewünschte Vorhersagbarkeit effizient zu erreichen. Traditionelle Analyse- und Architekturkonzepte für on-Chip Netzwerke lösen nur einen Teil dieser Herausforderungen: sie konzentrieren sich entweder auf Leistung oder Vorhersagbarkeit. Existierende vorhersagbare on-Chip Netzwerke werden als zu teuer und unflexibel erachtet, um eine Vielzahl von Anwendungen mit gegensätzlichen Anforderungen zu integrieren. Und state-of-the-art Analysen vernachlässigen bzw. vereinfachen bestimmte Plattformeigenschaften, um das Verhalten überprüfen zu können. Dies führt zu einer hohen Überbereitstellung der Hardware-Ressourcen als auch zu negativen Auswirkungen auf die Systemleistung und auf die Flexibilität des Systems. In dieser Arbeit gehen wir auf diese Herausforderungen ein und entwickeln eine vorhersehbare und zur Laufzeit anpassbare Architektur für on-Chip Netzwerke, welche gemischt-kritische Anwendungen effizient integriert. Zusätzlich stellen wir ein Modellierungs- und Analyseframework für on-Chip Netzwerke vor, das den Paketrückstau berücksichtigt. Dieses Framework ermöglicht es, Designentscheidungen anhand abstrakter Modelle und formaler Ansätze frühzeitig beurteilen
    corecore