628 research outputs found

    High-Level Debugging and Verification for FPGA-Based Multicore Architectures

    Get PDF
    Simulators are key tools for computer architecture research. However, multicore architectures represent a highly complex challenge for software simulators, which may suffer from fidelity loss and long execution times. FPGAs can simulate multicore architectures with scalable performance and high accuracy, but the difficulty of debugging could hinder their adoption. In this paper we propose several techniques for inspection, debugging and verification of multicore architectures, both for software-based and FPGA-based simulations. These debugging extensions are cycle-accurate and unobtrusive. As a proof of concept, we have developed a 24-core RISC multiprocessor that runs the Linux Kernel, for which we provide three simulation modes: a fast, functional simulation; a detailed, cycle-accurate simulation; and a FPGA-based simulation. Our platform can run up to 24 cores and perform full-system verification at 17 million instructions per second.Peer ReviewedPostprint (author's final draft

    Performance and resource modeling for FPGAs using high-level synthesis tools

    Get PDF
    High-performance computing with FPGAs is gaining momentum with the advent of sophisticated High-Level Synthesis (HLS) tools. The performance of a design is impacted by the input-output bandwidth, the code optimizations and the resource consumption, making the performance estimation a challenge. This paper proposes a performance model which extends the roofline model to take into account the resource consumption and the parameters used in the HLS tools. A strategy is developed which maximizes the performance and the resource utilization within the area of the FPGA. The model is used to optimize the design exploration of a class of window-based image processing application

    Contention-aware performance monitoring counter support for real-time MPSoCs

    Get PDF
    Tasks running in MPSoCs experience contention delays when accessing MPSoC’s shared resources, complicating task timing analysis and deriving execution time bounds. Understanding the Actual Contention Delay (ACD) each task suffers due to other corunning tasks, and the particular hardware shared resources in which contention occurs, is of prominent importance to increase confidence on derived execution time bounds of tasks. And, whenever those bounds are violated, ACD provides information on the reasons for overruns. Unfortunately, existing MPSoC designs considered in real-time domains offer limited hardware support to measure tasks’ ACD losing all these potential benefits. In this paper we propose the Contention Cycle Stack (CCS), a mechanism that extends performance monitoring counters to track specific events that allow estimating the ACD that each task suffers from every contending task on every hardware shared resource. We build the CCS using a set of specialized low-overhead Performance Monitoring Counters for the Cobham Gaisler GR740 (NGMP) MPSoC – used in the space domain – for which we show CCS’s benefits.The research leading to these results has received funding from the European Space Agency under contracts 4000109680, 4000110157 and NPI 4000102880, and the Ministry of Science and Technology of Spain under contract TIN-2015-65316-P. Jaume Abella has been partially supported by the Ministry of Economy and Competitiveness under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717.Peer ReviewedPostprint (author's final draft

    Automated Hardware Prototyping for 3D Network on Chips

    Get PDF
    Vor mehr als 50 Jahren stellte Intel® Mitbegründer Gordon Moore eine Prognose zum Entwicklungsprozess der Transistortechnologie auf. Er prognostizierte, dass sich die Zahl der Transistoren in integrierten Schaltungen alle zwei Jahre verdoppeln wird. Seine Aussage ist immer noch gültig, aber ein Ende von Moores Gesetz ist in Sicht. Mit dem Ende von Moore’s Gesetz müssen neue Aspekte untersucht werden, um weiterhin die Leistung von integrierten Schaltungen zu steigern. Zwei mögliche Ansätze für "More than Moore” sind 3D-Integrationsverfahren und heterogene Systeme. Gleichzeitig entwickelt sich ein Trend hin zu Multi-Core Prozessoren, basierend auf Networks on chips (NoCs). Neben dem Ende des Mooreschen Gesetzes ergeben sich bei immer kleiner werdenden Technologiegrößen, vor allem jenseits der 60 nm, neue Herausforderungen. Eine Schwierigkeit ist die Wärmeableitung in großskalierten integrierten Schaltkreisen und die daraus resultierende Überhitzung des Chips. Um diesem Problem in modernen Multi-Core Architekturen zu begegnen, muss auch die Verlustleistung der Netzwerkressourcen stark reduziert werden. Diese Arbeit umfasst eine durch Hardware gesteuerte Kombination aus Frequenzskalierung und Power Gating für 3D On-Chip Netzwerke, einschließlich eines FPGA Prototypen. Dafür wurde ein Takt-synchrones 2D Netzwerk auf ein dreidimensionales asynchrones Netzwerk mit mehreren Frequenzbereichen erweitert. Zusätzlich wurde ein skalierbares Online-Power-Management System mit geringem Ressourcenaufwand entwickelt. Die Verifikation neuer Hardwarekomponenten ist einer der zeitaufwendigsten Schritte im Entwicklungsprozess hochintegrierter digitaler Schaltkreise. Um diese Aufgabe zu beschleunigen und um eine parallele Softwareentwicklung zu ermöglichen, wurde im Rahmen dieser Arbeit ein automatisiertes und benutzerfreundliches Tool für den Entwurf neuer Hardware Projekte entwickelt. Eine grafische Benutzeroberfläche zum Erstellen des gesamten Designablaufs, vom Erstellen der Architektur, Parameter Deklaration, Simulation, Synthese und Test ist Teil dieses Werkzeugs. Zudem stellt die Größe der Architektur für die Erstellung eines Prototypen eine besondere Herausforderung dar. Frühere Arbeiten haben es versäumt, eine schnelles und unkompliziertes Prototyping, insbesondere von Architekturen mit mehr als 50 Prozessorkernen, zu realisieren. Diese Arbeit umfasst eine Design Space Exploration und FPGA-basierte Prototypen von verschiedenen 3D-NoC Implementierungen mit mehr als 80 Prozessoren

    Pre-validation of SoC via hardware and software co-simulation

    Get PDF
    Abstract. System-on-chips (SoCs) are complex entities consisting of multiple hardware and software components. This complexity presents challenges in their design, verification, and validation. Traditional verification processes often test hardware models in isolation until late in the development cycle. As a result, cooperation between hardware and software development is also limited, slowing down bug detection and fixing. This thesis aims to develop, implement, and evaluate a co-simulation-based pre-validation methodology to address these challenges. The approach allows for the early integration of hardware and software, serving as a natural intermediate step between traditional hardware model verification and full system validation. The co-simulation employs a QEMU CPU emulator linked to a register-transfer level (RTL) hardware model. This setup enables the execution of software components, such as device drivers, on the target instruction set architecture (ISA) alongside cycle-accurate RTL hardware models. The thesis focuses on two primary applications of co-simulation. Firstly, it allows software unit tests to be run in conjunction with hardware models, facilitating early communication between device drivers, low-level software, and hardware components. Secondly, it offers an environment for using software in functional hardware verification. A significant advantage of this approach is the early detection of integration errors. Software unit tests can be executed at the IP block level with actual hardware models, a task previously only possible with costly system-level prototypes. This enables earlier collaboration between software and hardware development teams and smoothens the transition to traditional system-level validation techniques.Järjestelmäpiirin esivalidointi laitteiston ja ohjelmiston yhteissimulaatiolla. Tiivistelmä. Järjestelmäpiirit (SoC) ovat monimutkaisia kokonaisuuksia, jotka koostuvat useista laitteisto- ja ohjelmistokomponenteista. Tämä monimutkaisuus asettaa haasteita niiden suunnittelulle, varmennukselle ja validoinnille. Perinteiset varmennusprosessit testaavat usein laitteistomalleja eristyksissä kehityssyklin loppuvaiheeseen saakka. Tämän myötä myös yhteistyö laitteisto- ja ohjelmistokehityksen välillä on vähäistä, mikä hidastaa virheiden tunnistamista ja korjausta. Tämän diplomityön tavoitteena on kehittää, toteuttaa ja arvioida laitteisto-ohjelmisto-yhteissimulointiin perustuva esivalidointimenetelmä näiden haasteiden ratkaisemiseksi. Menetelmä mahdollistaa laitteiston ja ohjelmiston varhaisen integroinnin, toimien luonnollisena välietappina perinteisen laitteistomallin varmennuksen ja koko järjestelmän validoinnin välillä. Yhteissimulointi käyttää QEMU suoritinemulaattoria, joka on yhdistetty rekisterinsiirtotason (RTL) laitteistomalliin. Tämä mahdollistaa ohjelmistokomponenttien, kuten laiteajureiden, suorittamisen kohdejärjestelmän käskysarja-arkkitehtuurilla (ISA) yhdessä kellosyklitarkkojen RTL laitteistomallien kanssa. Työ keskittyy kahteen yhteissimulaation pääsovellukseen. Ensinnäkin se mahdollistaa ohjelmiston yksikkötestien suorittamisen laitteistomallien kanssa, varmistaen kommunikaation laiteajurien, matalan tason ohjelmiston ja laitteistokomponenttien välillä. Toiseksi se tarjoaa ympäristön ohjelmiston käyttämiseen toiminnallisessa laitteiston varmennuksessa. Merkittävä etu tästä lähestymistavasta on integraatiovirheiden varhainen havaitseminen. Ohjelmiston yksikkötestejä voidaan suorittaa jo IP-lohkon tasolla oikeilla laitteistomalleilla, mikä on aiemmin ollut mahdollista vain kalliilla järjestelmätason prototyypeillä. Tämä mahdollistaa aikaisemman ohjelmisto- ja laitteistokehitystiimien välisen yhteistyön ja helpottaa siirtymistä perinteisiin järjestelmätason validointimenetelmiin

    High-Integrity Performance Monitoring Units in Automotive Chips for Reliable Timing V&V

    Get PDF
    As software continues to control more system-critical functions in cars, its timing is becoming an integral element in functional safety. Timing validation and verification (V&V) assesses softwares end-to-end timing measurements against given budgets. The advent of multicore processors with massive resource sharing reduces the significance of end-to-end execution times for timing V&V and requires reasoning on (worst-case) access delays on contention-prone hardware resources. While Performance Monitoring Units (PMU) support this finer-grained reasoning, their design has never been a prime consideration in high-performance processors - where automotive-chips PMU implementations descend from - since PMU does not directly affect performance or reliability. To meet PMUs instrumental importance for timing V&V, we advocate for PMUs in automotive chips that explicitly track activities related to worst-case (rather than average) softwares behavior, are recognized as an ISO-26262 mandatory high-integrity hardware service, and are accompanied with detailed documentation that enables their effective use to derive reliable timing estimatesThis work has also been partially supported by the Spanish Ministry of Economy and Competitiveness (MINECO) under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Jaume Abella has been partially supported by the MINECO under Ramon y Cajal postdoctoral fellowship number RYC-2013-14717. Enrico Mezzet has been partially supported by the Spanish Ministry of Economy and Competitiveness under Juan de la Cierva-Incorporación postdoctoral fellowship number IJCI-2016- 27396.Peer ReviewedPostprint (author's final draft

    Multiprocessor platform using LEON3 processor

    Get PDF
    The recent advances in embedded systems world, lead us to more complex systems with application specific blocks (IP cores), the System on Chip (SoC) devices. A good example of these complex devices can be encountered in the cell phones that can have image processing cores, communication cores, memory card cores, and others. The need of augmenting systems’ processing performance with lowest power, leads to a concept of Multiprocessor System on Chip (MSoC) in which the execution of multiple tasks can be distributed along various processors. This thesis intends to address the creation of a synthesizable multiprocessing system to be placed in a FPGA device, providing a good flexibility to tailor the system to a specific application. To deliver a multiprocessing system, will be used the synthesisable 32-bit SPARC V8 compliant, LEON3 processor.Os avanços recentes no mundo dos sistemas embebidos levam-nos a sistemas mais complexos com blocos para aplicações específicas (IP cores), os dispositivos System on Chip (SoC). Um bom exemplo destes complexos dispositivos pode ser encontrado nos telemóveis, que podem conter cores de processamento de imagem, cores de comunicações, cores para cartões de memória, entre outros. A necessidade de aumentar o desempenho dos sistemas de processamento com o menor consumo possível, leva ao conceito de Multiprocessor System on Chip (MSoC) em que a execução de múltiplas tarefas pode ser distribuída por vários processadores. Esta Tese pretende abordar a criação de um sistema de multiprocessamento sintetizável para ser colocado numa FPGA, proporcionando uma boa flexibilidade para a adaptação do sistema a uma aplicação específica. Para obter o sistema multiprocessamento, irá ser utilizado o processador sintetizável SPARC V8 de 32-bit, LEON3

    Analysis and optimization of a debug post-silicon hardware architecture

    Get PDF
    The goal of this thesis is to analyze the post-silicon validation hardware infrastructure implemented on multicore systems taking as an example Esperanto Technologies SoC, which has thousands of RISC-V processors and targets specific software applications. Then, based on the conclusions of the analysis, the project proposes a new post-silicon debug architecture that can fit on any System on-Chip without depending on its target application or complexity and that optimizes the options available on the market for multicore systems

    Cycle-accurate modeling of multicore processors on FPGAs

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Department of Electrical Engineering and Computer Science, 2013.Cataloged from PDF version of thesis.Includes bibliographical references (pages 169-176).We present a novel modeling methodology which enables the generation of a high-performance, cycle-accurate simulator from a cycle-level specification of the target design. We describe Arete, a full-system multicore processor simulator, developed using our modeling methodology. We provide details on Arete's resource-efficient and high-performance implementation on multiple FPGA platforms, and the architectural experiments performed using it. We present clear evidence that the use of simplified models in architectural studies can lead to wrong conclusions. Through two experiments performed using both cycle-accurate and simplified models, we show that on one hand there are substantial quantitative and qualitative differences in results, and on the other, the results match quite well.by Asif Imtiaz Khan.Ph.D
    corecore