8 research outputs found

    Evaluating the impact of future memory technologies in the design of multicore processors

    "It’s the Memory, Stupid!" In 1996, Richard Sites, one of the fathers of Computer Architecture and lead designer of the DEC alpha, wrote a paper [36] with the title above. In that paper he realized that the only important design issue for microprocessors in the next decade would be the memory subsystems design. After more than a decade later, the community of researchers started to digest and internalize this quote. Now, after more than two decades, it can be said that a lot of progress has been done since 1996 but the expectations of the enormous data sets that software is going to handle in the followings years tells that more aggressive designs are needed. Another reason new memory technologies are needed is because of the multicore architecture which has increased the required memory bandwidth. This architecture completely extended across the main computer sectors was the result of continuing the Moore’s law in exchange of adding more difficulties for software and hardware developers. All of this has promoted this project. First, it has been decided to create a bridge of the state of the art DRAM simulator Ramulator [30] with the micro-architecture simulator TaskSim [34]. Once the bridge has been completed, the second goal of this project has been to make an evaluation of the impact of the current and future memory technologies in multicore architectures. As a first approach, this new infrastructure has been used to evaluate the behavior of several parallel applications concluding that the execution time of the applications varies significantly across different memory technologies which even increase the differences while simulating different processors. The doubtless winner among all the memory technologies evaluated has been HBM which in some cases has achieved the best expected memory cycle response time."És la Memòria, Estúpid!" 

    Fast behavioural RTL simulation of 10B transistor SoC designs with Metro-MPI

    Chips with tens of billions of transistors have become today's norm. These designs are straining our electronic design automation tools throughout the design process, requiring ever more computational resources. In many tools, parallelisation has improved both latency and throughput for the designer's benefit. However, tools largely remain restricted to a single machine, and in the case of RTL simulation, we believe that this leaves much potential performance on the table. We introduce Metro-MPI to improve RTL simulation for modern 10-billion-transistor-scale chips. Metro-MPI exploits the natural boundaries present in chip designs to partition RTL simulations and leverages High Performance Computing (HPC) techniques to extract parallelism. For chip designs that scale in size by exploiting latency-insensitive interfaces like networks-on-chip and AXI, Metro-MPI offers a new paradigm for RTL simulation scalability. Our implementation of Metro-MPI in OpenPiton+Ariane delivers 2.7 MIPS of RTL simulation throughput for the first time on a design with more than 10 billion transistors and 1,024 Linux-capable cores, opening new avenues for distributed RTL simulation of emerging system-on-chip designs. Compared to sequential and multithreaded RTL simulations of smaller designs, Metro-MPI achieves up to 135.98× and 9.29× speedups. Similarly, for a representative regression run, Metro-MPI reduces energy consumption by up to 2.53× and 2.91×. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (contract PID2019-107255GB-C21), by the Generalitat de Catalunya (contract 2017-SGR-1328), by the European Union within the framework of the ERDF of Catalonia 2014-2020 under the DRAC project [001-P-001723], and by the Arm-BSC Center of Excellence. G. Lopez-Paradís has been supported by the Generalitat de Catalunya through a FI fellowship 2021FI-B00994 and GSoC 2021, and M. Moreto by a Ramon y Cajal fellowship no. RYC-2016-21104. A. Armejach is a Serra Hunter Fellow. Peer Reviewed. Postprint (author's final draft).
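    The partition-and-exchange idea the abstract describes can be sketched in a few lines. This is an illustrative stand-in, not Metro-MPI code: the Tile class, the ring topology, and the traffic pattern are all invented for the example, and plain Python objects stepping in lockstep stand in for MPI ranks.

    ```python
    class Tile:
        """One partition of the design: a core plus its NoC router stub."""
        def __init__(self, tile_id, num_tiles):
            self.tile_id = tile_id
            self.num_tiles = num_tiles
            self.inbox = []      # flits delivered by neighbouring partitions

        def step(self, cycle):
            """Advance one simulated cycle; emit a flit on this tile's turn."""
            self.inbox.clear()   # consume anything delivered last cycle
            if cycle % self.num_tiles == self.tile_id:   # staggered traffic
                return [{"src": self.tile_id, "cycle": cycle}]
            return []

    def simulate(num_tiles, cycles):
        """Lockstep stand-in for the per-rank simulation loop."""
        tiles = [Tile(i, num_tiles) for i in range(num_tiles)]
        delivered = 0
        for cycle in range(cycles):
            # In Metro-MPI each tile's step() runs on its own MPI rank, and
            # the flit hand-off below becomes point-to-point MPI messages
            # across the design's latency-insensitive interfaces.
            for tile in tiles:
                for flit in tile.step(cycle):
                    tiles[(flit["src"] + 1) % num_tiles].inbox.append(flit)
                    delivered += 1
        return delivered

    print(simulate(num_tiles=4, cycles=16))  # one flit per cycle → 16
    ```

    Because tiles interact only through these explicit messages, each rank can simulate its partition independently between exchanges, which is where the reported speedups come from.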

    Sargantana: A 1 GHz+ in-order RISC-V processor with SIMD vector extensions in 22nm FD-SOI

    The RISC-V open Instruction Set Architecture (ISA) has proven to be a solid alternative to licensed ISAs. In the past 5 years, a plethora of industrial and academic cores and accelerators have been developed implementing this open ISA. In this paper, we present Sargantana, a 64-bit processor based on RISC-V that implements the RV64G ISA, a subset of the vector instructions extension (RVV 0.7.1), and custom application-specific instructions. Sargantana features a highly optimized 7-stage pipeline implementing out-of-order write-back, register renaming, and a non-blocking memory pipeline. Moreover, Sargantana features a Single Instruction Multiple Data (SIMD) unit that accelerates domain-specific applications. Sargantana achieves a 1.26 GHz frequency in the typical corner, and up to 1.69 GHz in the fast corner, using 22nm FD-SOI commercial technology. As a result, Sargantana delivers 1.77× higher Instructions Per Cycle (IPC) than our previous 5-stage in-order DVINO core, reaching 2.44 CoreMark/MHz. Our core design delivers comparable or even higher performance than other state-of-the-art academic cores under the EEMBC AutoBench benchmark suite. This way, Sargantana lays the foundations for future RISC-V based core designs able to meet industrial-class performance requirements for scientific, real-time, and high-performance computing applications. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (contract PID2019-107255GB-C21), by the Generalitat de Catalunya (contract 2017-SGR-1328), by the European Union within the framework of the ERDF of Catalonia 2014-2020 under the DRAC project [001-P-001723], and by the Lenovo-BSC Contract-Framework (2020). The Spanish Ministry of Economy, Industry and Competitiveness has partially supported M. Doblas and V. Soria-Pardos through FPU fellowships no. FPU20-04076 and FPU20-02132, respectively. G. Lopez-Paradis has been supported by the Generalitat de Catalunya through a FI fellowship 2021FI-B00994. S. Marco-Sola was supported by Juan de la Cierva fellowship grant IJC2020-045916-I funded by MCIN/AEI/10.13039/501100011033 and by "European Union NextGenerationEU/PRTR", and M. Moretó through a Ramon y Cajal fellowship no. RYC-2016-21104. Peer Reviewed. Postprint (author's final draft).

    DVINO: A RISC-V vector processor implemented in 65nm technology

    This paper describes the design, verification, implementation and fabrication of the Drac Vector IN-Order (DVINO) processor, a RISC-V vector processor capable of booting Linux, jointly developed by BSC, CIC-IPN, IMB-CNM (CSIC), and UPC. The DVINO processor includes an internally developed two-lane vector processing unit as well as a Phase-Locked Loop (PLL) and an Analog-to-Digital Converter (ADC). The paper summarizes the design from the architectural level through logic synthesis and physical design in CMOS 65nm technology. The DRAC project is co-financed by the European Union Regional Development Fund within the framework of the ERDF Operational Program of Catalonia 2014-2020 with a grant of 50% of total eligible cost. The authors are part of RedRISCV, which promotes activities around open hardware. The Lagarto Project is supported by the Research and Graduate Secretary (SIP) of the Instituto Politecnico Nacional (IPN) from Mexico, and by the CONACyT scholarship for the Center for Research in Computing (CIC-IPN). Peer Reviewed. Article signed by 43 authors: Guillem Cabo∗, Gerard Candón∗, Xavier Carril∗, Max Doblas∗, Marc Domínguez∗, Alberto González∗, Cesar Hernández†, Víctor Jiménez∗, Vatistas Kostalampros∗, Rubén Langarita∗, Neiel Leyva†, Guillem López-Paradís∗, Jonnatan Mendoza∗, Francesco Minervini∗, Julian Pavón∗, Cristobal Ramírez∗, Narcís Rodas∗, Enrico Reggiani∗, Mario Rodríguez∗, Carlos Rojas∗, Abraham Ruiz∗, Víctor Soria∗, Alejandro Suanes‡, Iván Vargas∗, Roger Figueras∗, Pau Fontova∗, Joan Marimon∗, Víctor Montabes∗, Adrián Cristal∗, Carles Hernández∗, Ricardo Martínez‡, Miquel Moretó∗§, Francesc Moll∗§, Oscar Palomar∗§, Marco A. Ramírez†, Antonio Rubio§, Jordi Sacristán‡, Francesc Serra-Graells‡, Nehir Sonmez∗, Lluís Terés‡, Osman Unsal∗, Mateo Valero∗§, Luís Villa† // ∗Barcelona Supercomputing Center (BSC), Barcelona, Spain. Email: [email protected]; †Centro de Investigación en Computación, Instituto Politécnico Nacional (CIC-IPN), Mexico City, Mexico; ‡Institut de Microelectronica de Barcelona, IMB-CNM (CSIC), Spain. Email: [email protected]; §Universitat Politecnica de Catalunya (UPC), Barcelona, Spain. Email: [email protected]. Postprint (author's final draft).

    Towards the simulation and emulation of large-scale hardware designs

    The legacy of Moore's law has converged on heterogeneous processors that combine a many-core with various application- or domain-specific accelerators. With the benefits of Dennard scaling also exhausted, we have ended up with chips whose large area cannot be fully powered at the same time but leaves room to improve performance. As a result, there are no more big performance gains to be had from technology alone, and the most promising solutions are very smart designs of existing modules or the exploration of new specialized architectures. It is already a reality to see commercial products with many accelerators integrated on the System-on-Chip (SoC), and future chips are expected to keep increasing the complexity and number of hardware modules on the SoC. Consequently, the complexity of verifying such systems has grown over the last decades and will keep growing in the near future. This has resulted in multiple proposals to speed up verification in both academia and industry, and it is the main focus of this thesis, which makes two contributions. In the first contribution, we explore a solution to emulate a big Network-on-Chip (NoC) on an emulation platform such as an FPGA or a hardware emulator. Emulating a 16-core NoC can be infeasible even on a hardware emulation platform, depending on the size of the cores. For this reason, we replace the cores with a trace-based packet injector that mimics the behavior of an Out-of-Order (OoO) core running a benchmark. This contribution has materialized in the design of the trace specification and the implementation of the trace generator in a full-system simulator: gem5. In addition, a preliminary study with a simple NoC has been carried out to validate the traces, with successful results.
    In the second contribution, we have developed a tool to perform functional testing and early design exploration of Register-Transfer Level (RTL) models inside a full-system simulator: gem5. We enable early performance studies of RTL models in an environment that models an entire SoC able to boot Linux and run complex multi-threaded and multi-programmed workloads. The framework is open source and unifies gem5 with an HDL simulator: Verilator. Finally, we have evaluated two use cases: a functional debug of an in-house Performance Monitoring Unit (PMU), and a design-space exploration of the type of memory to use with a Machine Learning (ML) accelerator: the NVIDIA Deep Learning Accelerator (NVDLA).
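    The trace-based injector idea can be illustrated with a toy sketch. The trace format, field names, and the one-dimensional mesh latency model below are invented for this example; they are not the trace specification developed in the thesis.

    ```python
    # Instead of simulating a full out-of-order core, replay a pre-recorded
    # trace of network requests at the cycles the (simulated) core issued them.
    trace = [
        {"cycle": 2, "src": 0, "dst": 1, "size": 64},   # cacheline request
        {"cycle": 2, "src": 3, "dst": 2, "size": 64},
        {"cycle": 7, "src": 1, "dst": 0, "size": 64},
    ]

    def inject(trace, cycles, link_latency=3):
        """Replay the trace into a trivial NoC model; return delivery cycles."""
        deliveries = []
        for cycle in range(cycles):
            for pkt in trace:
                if pkt["cycle"] == cycle:
                    hops = abs(pkt["dst"] - pkt["src"])   # 1-D mesh distance
                    deliveries.append(cycle + hops * link_latency)
        return deliveries

    print(inject(trace, cycles=10))   # → [5, 5, 10]
    ```

    The point of the substitution is that the injector consumes a few bytes per packet instead of the emulation area of a full OoO core, which is what makes a 16-core NoC fit on the platform.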

    gem5 + rtl: A framework to enable RTL models inside a full-system simulator

    In recent years there has been a surge of interest in designing custom accelerators for power-efficient high-performance computing. However, available tools to simulate low-level RTL designs often neglect the target system in which the design will operate. This hinders proper testing and debugging of functionality, and does not allow co-designing the accelerator to obtain a balanced and efficient architecture. In this paper, we introduce gem5 + rtl, a flexible framework that enables simulation of RTL models inside a full-system software simulator. We present the framework's functionality, which allows easy integration of RTL models on a simulated system-on-chip (SoC) that is able to boot Linux and run complex multi-threaded and multi-programmed workloads. We demonstrate the framework with two relevant use cases that integrate a multi-core SoC with a Performance Monitoring Unit (PMU) and the NVIDIA Deep Learning Accelerator (NVDLA), showcasing how the framework enables testing RTL model features and how it can enable co-design taking into account the entire SoC. This research was supported by the European Union Regional Development Fund within the framework of the ERDF Operational Program of Catalonia 2014-2020 with a grant of 50% of total eligible cost under the DRAC project [001-P-001723], by the Spanish government (grant RTI2018-095094-B-C21 CONSENT), by the Spanish Ministry of Science and Innovation (contract PID2019-107255GB-C21), and by the Catalan Government (contracts 2017-SGR-1414, 2017-SGR-705). This work has also been supported by the European Community's Horizon 2020 Framework Programme under the Mont-Blanc 2020 and EPI projects (grant agreements n. 779877 and n. 826647), and by the Arm-BSC Center of Excellence. G. López-Paradís has been partially supported by the Agency for Management of University and Research Grants (AGAUR) of the Government of Catalonia under an Ajuts per a la contractació de personal investigador novell fellowship no. 2021FI-B00994. A. Armejach has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva postdoctoral fellowship number IJCI-2017-33945. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship no. RYC-2016-21104. Peer Reviewed. Postprint (author's final draft).
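    The co-simulation loop this kind of framework enables can be pictured as follows. This is a hand-written stand-in, not the framework's code: RTLModel only mimics the eval()-per-cycle interface of a Verilator-generated model, and FullSystemSim stands in for the full-system simulator's event loop; neither NVDLA nor the real PMU is modeled.

    ```python
    class RTLModel:
        """Toy 'RTL' counter that increments while enable is high."""
        def __init__(self):
            self.enable = 0
            self.count = 0
        def eval(self):
            # Verilator-generated models expose a similar eval() that
            # recomputes the design's state from its current inputs.
            if self.enable:
                self.count += 1

    class FullSystemSim:
        """Stand-in for the full-system simulator driving the RTL model."""
        def __init__(self, model):
            self.model = model
        def run(self, schedule):
            for cycle, enable in enumerate(schedule):
                self.model.enable = enable   # drive inputs from the SoC side
                self.model.eval()            # advance the RTL model one cycle
            return self.model.count          # sample outputs back into the sim

    sim = FullSystemSim(RTLModel())
    print(sim.run([1, 0, 1, 1, 0]))   # → 3
    ```

    The key property is that the software simulator owns the clock, so the cycle-accurate RTL model and the architectural SoC model stay in sync while full workloads run on top.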

    Characterization of a coherent hardware accelerator framework for SoCs

    Accelerator-rich architectures have become the standard in today's SoCs. With the slowing of Moore's law, it is common to dedicate only a fraction of the SoC area to traditional cores and leave the rest for specialized hardware. This motivates the need for better interconnects and interfaces between accelerators and the SoC, both in hardware and software. Recent proposals from industry have focused on coherent interconnects for big external accelerators. However, there are still many cases where accelerators benefit from being directly connected to the memory hierarchy of the CPU inside the same chip. In this work, we demonstrate the usability of these interfaces with a characterization of a framework that connects accelerators that benefit from having coherent access to the memory hierarchy. We have evaluated several kernels from the MachSuite benchmark suite in an FPGA environment, obtaining performance and area numbers. We obtain speedups while requiring only 45k LUTs for the accelerator framework. We conclude that many accelerators can benefit from this access to the memory hierarchy, and that more work is needed for a generic framework. This work has been partially supported by the Spanish Ministry of Economy and Competitiveness (PID2019-107255GB-C21 and TED2021-132634A-I00), by the Generalitat de Catalunya (2021-SGR-00763), and by Arm through the Arm-BSC Center of Excellence. G. López-Paradís has been supported by the Generalitat de Catalunya through a FI fellowship 2021FI-B00994, M. Moretó by a Ramon y Cajal fellowship no. RYC-2016-21104, and A. Armejach is a Serra Hunter Fellow. Peer Reviewed. Postprint (author's final draft).

    An academic RISC-V silicon implementation based on open-source components

    ©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. The design presented in this paper, called preDRAC, is a RISC-V general-purpose processor capable of booting Linux, jointly developed by BSC, CIC-IPN, IMB-CNM (CSIC), and UPC. The preDRAC processor is the first RISC-V processor designed and fabricated by a Spanish or Mexican academic institution, and will be the basis of future RISC-V designs jointly developed by these institutions. This paper summarizes the design tasks, first for FPGA and later for SoC, from high-level architectural descriptions down to RTL, and then through logic synthesis and physical design to get the layout ready for its final tapeout in CMOS 65nm technology. The DRAC project is co-financed by the European Union Regional Development Fund within the framework of the ERDF Operational Program of Catalonia 2014-2020 with a grant of 50% of total eligible cost. The authors are part of RedRISCV, which promotes activities around open hardware. The Lagarto Project is supported by the Research and Graduate Secretary (SIP) of the Instituto Politecnico Nacional (IPN) from Mexico, and by the CONACyT scholarship for the Center for Research in Computing (CIC-IPN). Peer Reviewed. Postprint (author's final draft).