53 research outputs found
Design of asynchronous microprocessor for power proportionality
PhD ThesisMicroprocessors continue to get exponentially cheaper for end users following Moore’s
law, while the costs involved in their design keep growing, also at an exponential rate.
The reason is the ever increasing complexity of processors, which modern EDA tools
struggle to keep up with. This makes further scaling for performance subject to a high
risk in the reliability of the system. To keep this risk low, yet improve the performance,
CPU designers try to optimise various parts of the processor. Instruction Set Architecture
(ISA) is a significant part of the whole processor design flow, whose optimal design
for a particular combination of available hardware resources and software requirements
is crucial for building processors with high performance and efficient energy utilisation.
This is a challenging task involving a lot of heuristics and high-level design decisions.
Another issue impacting CPU reliability is continuous scaling for power consumption. For
the last decades CPU designers have been mainly focused on improving performance, but
“keeping energy and power consumption in mind”. The consequence of this was a development
of energy-efficient systems, where energy was considered as a resource whose
consumption should be optimised. As CMOS technology was progressing, with feature
size decreasing and power delivered to circuit components becoming less stable, the
energy resource turned from an optimisation criterion into a constraint, sometimes a critical
one. At this point power proportionality becomes one of the most important aspects
in system design. Developing methods and techniques which will address the problem
of designing a power-proportional microprocessor, capable to adapt to varying operating
conditions (such as low or even unstable voltage levels) and application requirements in
the runtime, is one of today’s grand challenges. In this thesis this challenge is addressed
by proposing a new design flow for the development of an ISA for microprocessors, which
can be altered to suit a particular hardware platform or a specific operating mode. This
flow uses an expressive and powerful formalism for the specification of processor instruction
sets called the Conditional Partial Order Graph (CPOG). The CPOG model captures
large sets of behavioural scenarios for a microarchitectural level in a computationally
efficient form amenable to formal transformations for synthesis, verification and automated
derivation of asynchronous hardware for the CPU microcontrol. The feasibility of
the methodology, novel design flow and a number of optimisation techniques was proven
in a full size asynchronous Intel 8051 microprocessor and its demonstrator silicon. The
chip showed the ability to work in a wide range of operating voltage and environmental
conditions. Depending on application requirements and power budget our ASIC supports
several operating modes: one optimised for energy consumption and the other one for
performance. This was achieved by extending a traditional datapath structure with an
auxiliary control layer for adaptable and fault tolerant operation. These and other optimisations
resulted in a reconfigurable and adaptable implementation, which was proven
by measurements, analysis and evaluation of the chip.EPSR
HAL-ASOS - Linux com aceleração em hardware para sistemas operativos dedicados à aplicação
Programa doutoral em Engenharia Eletrónica e de Computadores (PDEEC) (especialidade de Informática Industrial e Sistemas Embebidos)O ecossistema de sistemas embebidos de hoje tornou-se enorme, cobrindo vários e diferentes sistemas,
exigindo desempenho e mobilidade completa enquanto atingem autonomias de bateria cada vez maiores.
Mas a crescente frequência de relógio que resultou em dispositivos cada vez mais rápidos começou a
estagnar antes dos transístores pararem de encolher. Plataformas Field Programmable Gate Array (FPGA)
são uma solução alternativa para a implementação de sistemas completos e reconfiguráveis. Fornecem
desempenho e eficiência computacional para satisfazer requisitos da aplicação e do sistema embebido.
Vários Sistemas Operativos (SO) assistidos por FPGA foram propostos, mas ao estreitar seu foco na síntese
do datapath do acelerador de hardware, a grande maioria ignora a integração semântica destes no
SO. Ambientes de síntese de alto nível (HLS) elevaram a abstração além da linguagem de transferência de
registo (RTL), seguindo uma abordagem específica de domínio enquanto misturam software e abstrações
de hardware ad hoc, que dificultam as otimizações. Além disso, os modelos de programação para software
e hardware reconfigurável carecem de semelhanças, o que com o tempo dificultará a Exploração
do Ambiente de Design (DSE) e diminuirá o potencial de reutilização de código. Para responder a estas
necessidades, propomos HAL-ASOS, uma ferramenta para implementar sistemas embebidos baseados
em Linux que fornece (1) elasticidade no design em conformidade com a natureza evolutiva deste SO, (2)
integração semântica profunda de tarefas de hardware nos modelos de programação do Linux, (3) facilidade
na gestão de complexidade através de metodologia e ferramentas para apoiar o design, verificação
e implementação, (4) orientada por princípios de design híbridos e eficiência no sistema. Para avaliar as
funcionalidades da ferramenta, foi implementado um aplicativo criptográfico que demonstra alcance de
desempenho enquanto se emprega a metodologia de design. Novos níveis de desempenho são atingidos
numa aplicação de Visão por Computador que explora recursos de programação assíncrona-síncrona. Os
resultados demonstram uma abordagem flexível na reconfiguração entre hardware e software, e desempenho
que aumenta consistentemente com acréscimo de recursos ou frequência de relógio.Today’s embedded systems ecosystem became huge while covering several and different computer-based
systems, demanding for performance and complete mobility while experiencing longer battery lives. But
the rampant frequency that resulted in faster devices began hitting a wall even before transistors stopped
shrinking. Field Programmable Gate Array (FPGA) platforms are an alternative solution towards implementing
complete reconfigurable systems. They provide computational power, efficiency, in a lightweight
solution to serve the application requirements and increase performance in the overall system. Several
FPGA-assisted Operating Systems (OS) have been proposed, but by narrowing their focus on datapath
synthesis of the hardware accelerator, they completely ignore the deep semantic integration of these accelerators
into the OS. State-of-the-art High-Level Synthesis (HLS) environments have raised the level of
abstraction beyond Register Transfer Language (RTL) by following a domain-specific approach while mixing
ad hoc software and hardware abstractions, making harder for performance optimizations. Furthermore,
the programming models for software and reconfigurable hardware lack commonalities, which in time will
hinder the Design Space Exploration (DSE) and lower the potential for code reuse. To overcome these
issues, we propose HAL-ASOS, a framework to implement Linux-based Embedded systems which provides
(1) elasticity by design to comply with the evolutive nature of Linux, (2) deep semantic integration of the
hardware tasks in the Linux programming models, (3) easy complexity management using methodology
and tools to fully support design, verification and deployment, (4) hybrid and efficiency-oriented design
principles. To evaluate the framework functionalities, a cryptographic application was implemented and
demonstrates performance achievements while using the promoted application-driven design methodology.
To demonstrate new levels of performance that can be achieved, a Computer Vision application
explores several mixed asynchronous-synchronous programming features. Experiments demonstrate a
flexible design approach in terms of hardware and software reconfiguration, and significant performance
that increases consistently with the rising in processing resources or clock frequencies.Financial support received from Portuguese Foundation for Science and Technology (FCT) with the PhD grant SFRH/BD/82732/2011
Image Processing Using FPGAs
This book presents a selection of papers representing current research on using field programmable gate arrays (FPGAs) for realising image processing algorithms. These papers are reprints of papers selected for a Special Issue of the Journal of Imaging on image processing using FPGAs. A diverse range of topics is covered, including parallel soft processors, memory management, image filters, segmentation, clustering, image analysis, and image compression. Applications include traffic sign recognition for autonomous driving, cell detection for histopathology, and video compression. Collectively, they represent the current state-of-the-art on image processing using FPGAs
Generation of Application Specific Hardware Extensions for Hybrid Architectures: The Development of PIRANHA - A GCC Plugin for High-Level-Synthesis
Architectures combining a field programmable gate array (FPGA) and a general-purpose processor on a single chip became increasingly popular in recent years. On the one hand, such hybrid architectures facilitate the use of application specific hardware accelerators that improve the performance of the software on the host processor. On the other hand, it obliges system designers to handle the whole process of hardware/software co-design. The complexity of this process is still one of the main reasons, that hinders the widespread use of hybrid architectures. Thus, an automated process that aids programmers with the hardware/software partitioning and the generation of application specific accelerators is an important issue. The method presented in this thesis neither requires restrictions of the used high-level-language nor special source code annotations. Usually, this is an entry barrier for programmers without deeper understanding of the underlying hardware platform.
This thesis introduces a seamless programming flow that allows generating hardware accelerators for unrestricted, legacy C code. The implementation consists of a GCC plugin that automatically identifies application hot-spots and generates hardware accelerators accordingly. Apart from the accelerator implementation in a hardware description language, the compiler plugin provides the generation of a host processor interfaces and, if necessary, a prototypical integration with the host operating system. An evaluation with typical embedded applications shows general benefits of the approach, but also reveals limiting factors that hamper possible performance improvements
Methodology and Ecosystem for the Design of a Complex Network ASIC
Performance of HPC systems has risen steadily. While the 10 Petaflop/s barrier has been breached in the year 2011 the next large step into the exascale era is expected sometime between the years 2018 and 2020. The EXTOLL project will be an integral part in this venture. Originally designed as a research project on FPGA basis it will make the transition to an ASIC to improve its already excelling performance even further. This transition poses many challenges that will be presented in this thesis. Nowadays, it is not enough to look only at single components in a system. EXTOLL is part of complex ecosystem which must be optimized overall since everything is tightly interwoven and disregarding some aspects can cause the whole system either to work with limited performance or even to fail.
This thesis examines four different aspects in the design hierarchy and proposes efficient solutions or improvements for each of them. At first it takes a look at the design implementation and the differences between FPGA and ASIC design. It introduces a methodology to equip all on-chip memory with ECC logic automatically without the user’s input and in a transparent way so that the underlying code that uses the memory does not have to be changed. In the next step the floorplanning process is analyzed and an iterative solution is worked out based on physical and logical constraints of the EXTOLL design. Besides, a work flow for collaborative design is presented that allows multiple users to work on the design concurrently. The third part concentrates on the high-speed signal path from the chip to the connector and how it is affected by technological limitations. All constraints are analyzed and a package layout for the EXTOLL chip is proposed that is seen as the optimal solution. The last part develops a cost model for wafer and package level test and raises technological concerns that will affect the testing methodology. In order to run testing internally it proposes the development of a stand-alone test platform that is able to test packaged EXTOLL chips in every aspect
Vector-thread architecture and implementation
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections.Includes bibliographical references (p. 181-186).This thesis proposes vector-thread architectures as a performance-efficient solution for all-purpose computing. The VT architectural paradigm unifies the vector and multithreaded compute models. VT provides the programmer with a control processor and a vector of virtual processors. The control processor can use vector-fetch commands to broadcast instructions to all the VPs or each VP can use thread-fetches to direct its own control flow. A seamless intermixing of the vector and threaded control mechanisms allows a VT architecture to flexibly and compactly encode application parallelism and locality. VT architectures can efficiently exploit a wide variety of loop-level parallelism, including non-vectorizable loops with cross-iteration dependencies or internal control flow. The Scale VT architecture is an instantiation of the vector-thread paradigm designed for low-power and high-performance embedded systems. Scale includes a scalar RISC control processor and a four-lane vector-thread unit that can execute 16 operations per cycle and supports up to 128 simultaneously active virtual processor threads. Scale provides unit-stride and strided-segment vector loads and stores, and it implements cache refill/access decoupling. The Scale memory system includes a four-port, non-blocking, 32-way set-associative, 32 KB cache. A prototype Scale VT processor was implemented in 180 nm technology using an ASIC-style design flow. The chip has 7.1 million transistors and a core area of 16.6 mm2, and it runs at 260 MHz while consuming 0.4-1.1 W. This thesis evaluates Scale using a diverse selection of embedded benchmarks, including example kernels for image processing, audio processing, text and data processing, cryptography, network processing, and wireless communication.(cont.) Larger applications also include a JPEG image encoder and an IEEE 802.11 la wireless transmitter. Scale achieves high performance on a range of different types of codes, generally executing 3-11 compute operations per cycle. Unlike other architectures which improve performance at the expense of increased energy consumption, Scale is generally even more energy efficient than a scalar RISC processor.by Ronny Meir Krashinsky.Ph.D
The 1991 3rd NASA Symposium on VLSI Design
Papers from the symposium are presented from the following sessions: (1) featured presentations 1; (2) very large scale integration (VLSI) circuit design; (3) VLSI architecture 1; (4) featured presentations 2; (5) neural networks; (6) VLSI architectures 2; (7) featured presentations 3; (8) verification 1; (9) analog design; (10) verification 2; (11) design innovations 1; (12) asynchronous design; and (13) design innovations 2
Built-In Self-Test (BIST) for Multi-Threshold NULL Convention Logic (MTNCL) Circuits
This dissertation proposes a Built-In Self-Test (BIST) hardware implementation for Multi-Threshold NULL Convention Logic (MTNCL) circuits. Two different methods are proposed: an area-optimized topology that requires minimal area overhead, and a test-performance-optimized topology that utilizes parallelism and internal hardware to reduce the overall test time through additional controllability points. Furthermore, an automated software flow is proposed to insert, simulate, and analyze an input MTNCL netlist to obtain a desired fault coverage, if possible, through iterative digital and fault simulations. The proposed automated flow is capable of producing both area-optimized and test-performance-optimized BIST circuits and scripts for digital and fault simulation using commercial software that may be utilized to manually verify or adjust further, if desired
- …