756 research outputs found

    Null convention logic circuits for asynchronous computer architecture

    Get PDF
    For most of its history, computer architecture has been able to benefit from a rapid scaling in semiconductor technology, resulting in continuous improvements to CPU design. During that period, synchronous logic has dominated because of its inherent ease of design and abundant tools. However, with the scaling of semiconductor processes into deep sub-micron and then to nano-scale dimensions, computer architecture is hitting a number of roadblocks such as high power and increased process variability. Asynchronous techniques can potentially offer many advantages compared to conventional synchronous design, including average case vs. worse case performance, robustness in the face of process and operating point variability and the ready availability of high performance, fine grained pipeline architectures. Of the many alternative approaches to asynchronous design, Null Convention Logic (NCL) has the advantage that its quasi delay-insensitive behavior makes it relatively easy to set up complex circuits without the need for exhaustive timing analysis. This thesis examines the characteristics of an NCL based asynchronous RISC-V CPU and analyses the problems with applying NCL to CPU design. While a number of university and industry groups have previously developed small 8-bit microprocessor architectures using NCL techniques, it is still unclear whether these offer any real advantages over conventional synchronous design. A key objective of this work has been to analyse the impact of larger word widths and more complex architectures on NCL CPU implementations. The research commenced by re-evaluating existing techniques for implementing NCL on programmable devices such as FPGAs. The little work that has been undertaken previously on FPGA implementations of asynchronous logic has been inconclusive and seems to indicate that asynchronous systems cannot be easily implemented in these devices. However, most of this work related to an alternative technique called bundled data, which is not well suited to FPGA implementation because of the difficulty in controlling and matching delays in a 'bundle' of signals. On the other hand, this thesis clearly shows that such applications are not only possible with NCL, but there are some distinct advantages in being able to prototype complex asynchronous systems in a field-programmable technology such as the FPGA. A large part of the value of NCL derives from its architectural level behavior, inherent pipelining, and optimization opportunities such as the merging of register and combina- tional logic functions. In this work, a number of NCL multiplier architectures have been analyzed to reveal the performance trade-offs between various non-pipelined, 1D and 2D organizations. Two-dimensional pipelining can easily be applied to regular architectures such as array multipliers in a way that is both high performance and area-efficient. It was found that the performance of 2D pipelining for small networks such as multipliers is around 260% faster than the equivalent non-pipelined design. However, the design uses 265% more transistors so the methodology is mainly of benefit where performance is strongly favored over area. A pipelined 32bit x 32bit signed Baugh-Wooley multiplier with Wallace-Tree Carry Save Adders (CSA), which is representative of a real design used for CPUs and DSPs, was used to further explore this concept as it is faster and has fewer pipeline stages compared to the normal array multiplier using Ripple-Carry adders (RCA). It was found that 1D pipelining with ripple-carry chains is an efficient implementation option but becomes less so for larger multipliers, due to the completion logic for which the delay time depends largely on the number of bits involved in the completion network. The average-case performance of ripple-carry adders was explored using random input vectors and it was observed that it offers little advantage on the smaller multiplier blocks, but this particular timing characteristic of asynchronous design styles be- comes increasingly more important as word size grows. Finally, this research has resulted in the development of the first 32-Bit asynchronous RISC-V CPU core. Called the Redback RISC, the architecture is a structure of pipeline rings composed of computational oscillations linked with flow completeness relationships. It has been written using NELL, a commercial description/synthesis tool that outputs standard Verilog. The Redback has been analysed and compared to two approximately equivalent industry standard 32-Bit synchronous RISC-V cores (PicoRV32 and Rocket) that are already fabricated and used in industry. While the NCL implementation is larger than both commercial cores it has similar performance and lower power compared to the PicoRV32. The implementation results were also compared against an existing NCL design tool flow (UNCLE), which showed how much the results of these implementation strategies differ. The Redback RISC has achieved similar level of throughput and 43% better power and 34% better energy compared to one of the synchronous cores with the same benchmark test and test condition such as input sup- ply voltage. However, it was shown that area is the biggest drawback for NCL CPU design. The core is roughly 2.5× larger than synchronous designs. On the other hand its area is still 2.9× smaller than previous designs using UNCLE tools. The area penalty is largely due to the unavoidable translation into a dual-rail topology when using the standard NCL cell library

    IXIAM: ISA EXtension for Integrated Accelerator Management

    Get PDF
    During the last few years, hardware accelerators have been gaining popularity thanks to their ability to achieve higher performance and efficiency than classic general-purpose solutions. They are fundamentally shaping the current generations of Systems-on-Chip (SoCs), which are becoming increasingly heterogeneous. However, despite their widespread use, a standard, general solution to manage them while providing speed and consistency has not yet been found. Common methodologies rely on OS mediation and a mix of user-space and kernel-space drivers, which can be inefficient, especially for fine-grained tasks. This paper addresses these sources of inefficiencies by proposing an ISA eXtension for Integrated Accelerator Management (IXIAM), a cost-effective HW-SW framework to control a wide variety of accelerators in a standard way, and directly from the cores. The proposed instructions include reservation, work offloading, data transfer, and synchronization. They can be wrapped in a high-level software API or even integrated into a compiler. IXIAM features also a user-space interrupt mechanism to signal events directly to the user process. We implement it as a RISC-V extension in the gem5 simulator and demonstrate detailed support for complex accelerators, as well as the ability to specify sequences of memory transfers and computations directly from the ISA and with significantly lower overhead than driver-based schemes. IXIAM provides a performance advantage that is more evident for small and medium workloads, reaching around 90x in the best case. This way, we enlarge the set of workloads that would benefit from hardware acceleration

    BRISC-V: An Open-Source Architecture Design Space Exploration Toolbox

    Full text link
    In this work, we introduce a platform for register-transfer level (RTL) architecture design space exploration. The platform is an open-source, parameterized, synthesizable set of RTL modules for designing RISC-V based single and multi-core architecture systems. The platform is designed with a high degree of modularity. It provides highly-parameterized, composable RTL modules for fast and accurate exploration of different RISC-V based core complexities, multi-level caching and memory organizations, system topologies, router architectures, and routing schemes. The platform can be used for both RTL simulation and FPGA based emulation. The hardware modules are implemented in synthesizable Verilog using no vendor-specific blocks. The platform includes a RISC-V compiler toolchain to assist in developing software for the cores, a web-based system configuration graphical user interface (GUI) and a web-based RISC-V assembly simulator. The platform supports a myriad of RISC-V architectures, ranging from a simple single cycle processor to a multi-core SoC with a complex memory hierarchy and a network-on-chip. The modules are designed to support incremental additions and modifications. The interfaces between components are particularly designed to allow parts of the processor such as whole cache modules, cores or individual pipeline stages, to be modified or replaced without impacting the rest of the system. The platform allows researchers to quickly instantiate complete working RISC-V multi-core systems with synthesizable RTL and make targeted modifications to fit their needs. The complete platform (including Verilog source code) can be downloaded at https://ascslab.org/research/briscv/explorer/explorer.html.Comment: In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '19

    A General Framework for Accelerator Management Based on ISA Extension

    Get PDF
    Thanks to the promised improvements in performance and energy efficiency, hardware accelerators are taking momentum in many computing contexts, both in terms of variety and relative weight in the silicon area of many chips. Commonly, the way an application interacts with these hardware modules has many accelerator-specific traits and requires ad-hoc drivers that usually rely on potentially expensive system calls to manage accelerator resources and access orchestration. As a consequence, driver-based interfacing is far from uniform and can expose high latency, limiting the set of tasks suitable for acceleration. In this paper, we propose a uniform and low-latency interface based on Instruction Set Architecture (ISA) extension. All the previous studies that proposed extensions, were deeply tailored to address a single accelerator. One of the biggest disadvantages of those methods is their inability to scale. Adding more of these accelerators to one System-on-Chip (SoC) would result in ISA bloat, increasing power consumption and complexifying the decoding phase proportionally. Our proposed framework consists of a six-instruction ISA extension and the corresponding architectural support that implements the interface abstraction and the reservation logic at the hardware level. Our proposal allows controlling a broad class of integrated accelerators directly from the CPU. The proposed framework is ISA-independent, which means that it is applicable to all the existing ISAs. We implement it on the gem5 simulator by extending the RISC-V ISA. We evaluate it by simulating three compute-intensive accelerators and comparing our interfacing with a conventional driver-based one. The benchmarks highlight the performance benefits brought by our framework, with up to 10.38x speed up, as well as the ability to seamlessly support different accelerators with the same interface. The speed up advantage of our technique diminishes as the granularity of the workloads increases and the overhead for driver-based accelerators becomes less important. We also show that the impact of its hardware components on chip area and power consumption is limited

    Neural network computing using on-chip accelerators

    Get PDF
    The use of neural networks, machine learning, or artificial intelligence, in its broadest and most controversial sense, has been a tumultuous journey involving three distinct hype cycles and a history dating back to the 1960s. Resurgent, enthusiastic interest in machine learning and its applications bolsters the case for machine learning as a fundamental computational kernel. Furthermore, researchers have demonstrated that machine learning can be utilized as an auxiliary component of applications to enhance or enable new types of computation such as approximate computing or automatic parallelization. In our view, machine learning becomes not the underlying application, but a ubiquitous component of applications. This view necessitates a different approach towards the deployment of machine learning computation that spans not only hardware design of accelerator architectures, but also user and supervisor software to enable the safe, simultaneous use of machine learning accelerator resources. In this dissertation, we propose a multi-transaction model of neural network computation to meet the needs of future machine learning applications. We demonstrate that this model, encompassing a decoupled backend accelerator for inference and learning from hardware and software for managing neural network transactions can be achieved with low overhead and integrated with a modern RISC-V microprocessor. Our extensions span user and supervisor software and data structures and, coupled with our hardware, enable multiple transactions from different address spaces to execute simultaneously, yet safely. Together, our system demonstrates the utility of a multi-transaction model to increase energy efficiency improvements and improve overall accelerator throughput for machine learning applications

    Optimal kernel development for real-time communications

    Get PDF
    The purpose of this research is to develop an optimal kernel which would be used in a real-time engineering and communications system. Since the application is a real-time system, relevant real-time issues are studied in conjunction with kernel related issues. The emphasis of the research is the development of a kernel which would not only adhere to the criteria of a real-time environment, namely determinism and performance, but also provide the flexibility and portability associated with non-real-time environments. The essence of the research is to study how the features found in non-real-time systems could be applied to the real-time system in order to generate an optimal kernel which would provide flexibility and architecture independence while maintaining the performance needed by most of the engineering applications. Traditionally, development of real-time kernels has been done using assembly language. By utilizing the powerful constructs of the C language, a real-time kernel was developed which addressed the goals of flexibility and portability while still meeting the real-time criteria. The implementation of the kernel is carried out using the powerful 68010/20/30/40 microprocessor based systems

    Systems Support for Trusted Execution Environments

    Get PDF
    Cloud computing has become a default choice for data processing by both large corporations and individuals due to its economy of scale and ease of system management. However, the question of trust and trustoworthy computing inside the Cloud environments has been long neglected in practice and further exacerbated by the proliferation of AI and its use for processing of sensitive user data. Attempts to implement the mechanisms for trustworthy computing in the cloud have previously remained theoretical due to lack of hardware primitives in the commodity CPUs, while a combination of Secure Boot, TPMs, and virtualization has seen only limited adoption. The situation has changed in 2016, when Intel introduced the Software Guard Extensions (SGX) and its enclaves to the x86 ISA CPUs: for the first time, it became possible to build trustworthy applications relying on a commonly available technology. However, Intel SGX posed challenges to the practitioners who discovered the limitations of this technology, from the limited support of legacy applications and integration of SGX enclaves into the existing system, to the performance bottlenecks on communication, startup, and memory utilization. In this thesis, our goal is enable trustworthy computing in the cloud by relying on the imperfect SGX promitives. To this end, we develop and evaluate solutions to issues stemming from limited systems support of Intel SGX: we investigate the mechanisms for runtime support of POSIX applications with SCONE, an efficient SGX runtime library developed with performance limitations of SGX in mind. We further develop this topic with FFQ, which is a concurrent queue for SCONE's asynchronous system call interface. ShieldBox is our study of interplay of kernel bypass and trusted execution technologies for NFV, which also tackles the problem of low-latency clocks inside enclave. The two last systems, Clemmys and T-Lease are built on a more recent SGXv2 ISA extension. In Clemmys, SGXv2 allows us to significantly reduce the startup time of SGX-enabled functions inside a Function-as-a-Service platform. Finally, in T-Lease we solve the problem of trusted time by introducing a trusted lease primitive for distributed systems. We perform evaluation of all of these systems and prove that they can be practically utilized in existing systems with minimal overhead, and can be combined with both legacy systems and other SGX-based solutions. In the course of the thesis, we enable trusted computing for individual applications, high-performance network functions, and distributed computing framework, making a <vision of trusted cloud computing a reality

    EKKO: an open-source RISC-V soft-core microcontroller

    Get PDF
    Dissertação de mestrado em Engenharia Eletrónica Industrial e Computadores (especialização em Sistemas Embebidos e Computadores)Com o surgimento da Internet das Coisas (IoT em inglês) nos últimos anos, o número de “coisas” conectadas está a crescer a um ritmo bastante rápido. Estes dispositivos tornaram-se rapidamente parte do nosso dia a dia e já podem ser encontrados nos mais diversos domínios de aplicação, tais como, telecomunicações, saúde, agricultura, e automação industrial. Devido a este crescimento exponencial, a demanda por sistemas embebidos é cada vez maior, trazendo assim diversos desafios no seu desenvolvimento. De todos os desafios, o time-to-market e os custos de desenvolvimento são de inegável importância, logo, a escolha de uma plataforma de desenvolvimento adequada é essencial no desenho destes sistemas. Devido a este novo paradigma, o grupo de investigação da Universidade do Minho onde esta dissertação se insere tem desenvolvido aplicações neste domínio. No entanto, as atuais plataformas de desenvolvimento utilizadas são complexas, têm custos associados e são de código fechado. Por estas razões, o grupo de investigação tem interesse em ter a sua própria plataforma de desenvolvimento. De modo a solucionar os problemas enumerados acima, esta dissertação tem como objetivo desenvolver uma plataforma de desenvolvimento tanto para hardware como para software. A plataforma deve ser simples de utilizar e open-source, reduzindo assim os custos e a tornando a gestão de licenças mais simples. Para além disto, o facto de o sistema ser de código aberto faz também com que este possa ser facilmente estendido e customizado de acordo com os requisitos da aplicação. Neste sentido, esta dissertação apresenta um soft-core microcontroller, o qual contem um processador RISC-V, uma RAM, uma unidade de depuração, um temporizador, um periférico I2C e um barra mento AXI. Em adição, este contem também um kit de desenvolvimento de software (SDK em inglês), o qual inclui um depurador, a opção de utilizar o sistema operativo Azure RTOS ThreadX, e outras ferramentas importantes, tornando o ciclo de desenvolvimento mais fácil, rápido e seguro.With the advent of the Internet of Things (IoT) in most recent years, the number of connected “things” is increasing quickly. These devices rapidly became part of our daily lives and can be found in the most different applications domains, such as telecommunications, health care, agriculture and industrial automation. With this exponential growth, the demand for embedded devices is increasing, bringing several challenges to the development of these systems. From these challenges, the time-to-market and development costs are undeniable extremely important. Thus, choosing a suitable development platform is essential when designing an embedded system. Due to this new paradigm, the University of Minho research group where this dissertation fits has been developing applications in this domain. However, the current development platforms are complex, have associated costs and are closed-source. For these reasons, the research group has interesting in having its development platform. To solve these problems, this dissertation aims to build a development platform for both hardware and software. The platform must be simple and open-source, reducing development costs and simplifying license management. Besides, due to its open nature, it will also be easier to extend and modify the system according to the application’s needs. In this context, this dissertation presents EKKO, an open-source soft-core microcontroller that contains a RISC-V core, an on-chip RAM, a debug unit, a timer and an I2C peripheral, and an AXI bus. In addition, it also contains a Software Development Kit (SDK), which includes a debugger, the option to use Azure RTOS ThreadX, and other crucial tools, turning the development cycle more accessible, faster and safer

    Digital System Design - Use of Microcontroller

    Get PDF
    Embedded systems are today, widely deployed in just about every piece of machinery from toasters to spacecraft. Embedded system designers face many challenges. They are asked to produce increasingly complex systems using the latest technologies, but these technologies are changing faster than ever. They are asked to produce better quality designs with a shorter time-to-market. They are asked to implement increasingly complex functionality but more importantly to satisfy numerous other constraints. To achieve the current goals of design, the designer must be aware with such design constraints and more importantly, the factors that have a direct effect on them.One of the challenges facing embedded system designers is the selection of the optimum processor for the application in hand; single-purpose, general-purpose or application specific. Microcontrollers are one member of the family of the application specific processors.The book concentrates on the use of microcontroller as the embedded system?s processor, and how to use it in many embedded system applications. The book covers both the hardware and software aspects needed to design using microcontroller.The book is ideal for undergraduate students and also the engineers that are working in the field of digital system design.Contents• Preface;• Process design metrics;• A systems approach to digital system design;• Introduction to microcontrollers and microprocessors;• Instructions and Instruction sets;• Machine language and assembly language;• System memory; Timers, counters and watchdog timer;• Interfacing to local devices / peripherals;• Analogue data and the analogue I/O subsystem;• Multiprocessor communications;• Serial Communications and Network-based interfaces
    corecore