190 research outputs found
Code optimizations for narrow bitwidth architectures
This thesis takes a HW/SW collaborative approach to tackle the problem of computational inefficiency in a holistic manner.
The hardware is redesigned by restraining the datapath to merely 16-bit datawidth (integer datapath only) to provide an
extremely simple, low-cost, low-complexity execution core which is best at executing the most common case efficiently. This
redesign, referred to as the Narrow Bitwidth Architecture, is unique in that although the datapath is squeezed to 16-bits, it
continues to offer the advantage of higher memory addressability like the contemporary wider datapath architectures. Its
interface to the outside (software) world is termed as the Narrow ISA. The software is responsible for efficiently mapping the
current stack of 64-bit applications onto the 16-bit hardware. However, this HW/SW approach introduces a non-negligible
penalty both in dynamic code-size and performance-impact even with a reasonably smart code-translator that maps the 64-
bit applications on to the 16-bit processor.
The goal of this thesis is to design a software layer that harnesses the power of compiler optimizations to assuage this
negative performance penalty of the Narrow ISA. More specifically, this thesis focuses on compiler optimizations targeting the
problem of how to compile a 64-bit program to a 16-bit datapath machine from the perspective of Minimum Required
Computations (MRC). Given a program, the notion of MRC aims to infer how much computation is really required to generate
the same (correct) output as the original program.
Approaching perfect MRC is an intrinsically ambitious goal and it requires oracle predictions of program behavior. Towards
this end, the thesis proposes three heuristic-based optimizations to closely infer the MRC. The perspective of MRC unfolds
into a definition of productiveness - if a computation does not alter the storage location, it is non-productive and hence, not
necessary to be performed. In this research, the definition of productiveness has been applied to different granularities of the
data-flow as well as control-flow of the programs.
Three profile-based, code optimization techniques have been proposed :
1. Global Productiveness Propagation (GPP) which applies the concept of productiveness at the granularity of a function.
2. Local Productiveness Pruning (LPP) applies the same concept but at a much finer granularity of a single instruction.
3. Minimal Branch Computation (MBC) is an profile-based, code-reordering optimization technique which applies the
principles of MRC for conditional branches.
The primary aim of all these techniques is to reduce the dynamic code footprint of the Narrow ISA. The first two optimizations
(GPP and LPP) perform the task of speculatively pruning the non-productive (useless) computations using profiles. Further,
these two optimization techniques perform backward traversal of the optimization regions to embed checks into the nonspeculative
slices, hence, making them self-sufficient to detect mis-speculation dynamically.
The MBC optimization is a use case of a broader concept of a lazy computation model. The idea behind MBC is to reorder the
backslices containing narrow computations such that the minimal necessary computations to generate the same (correct)
output are performed in the most-frequent case; the rest of the computations are performed only when necessary.
With the proposed optimizations, it can be concluded that there do exist ways to smartly compile a 64-bit application to a 16-
bit ISA such that the overheads are considerably reduced.Esta tesis deriva su motivaciĂłn en la inherente ineficiencia computacional de los procesadores actuales: a pesar de que
muchas aplicaciones contemporáneas tienen unos requisitos de ancho de bits estrechos (aplicaciones de enteros, de red y
multimedia), el hardware acaba utilizando el camino de datos completo, utilizando más recursos de los necesarios y
consumiendo más energĂa.
Esta tesis utiliza una aproximaciĂłn HW/SW para atacar, de forma Ăntegra, el problema de la ineficiencia computacional. El
hardware se ha rediseñado para restringir el ancho de bits del camino de datos a sólo 16 bits (únicamente el de enteros) y
ofrecer asà un núcleo de ejecución simple, de bajo consumo y baja complejidad, el cual está diseñado para ejecutar de
forma eficiente el caso común. El rediseño, llamado en esta tesis Arquitectura de Ancho de Bits Estrecho (narrow bitwidth
en inglés), es único en el sentido que aunque el camino de datos se ha estrechado a 16 bits, el sistema continúa
ofreciendo las ventajas de direccionar grandes cantidades de memoria tal como procesadores con caminos de datos más
anchos (64 bits actualmente). Su interface con el mundo exterior se denomina ISA estrecho. En nuestra propuesta el
software es responsable de mapear eficientemente la actual pila software de las aplicaciones de 64 bits en el hardware de
16 bits. Sin embargo, esta aproximación HW/SW introduce penalizaciones no despreciables tanto en el tamaño del código
dinámico como en el rendimiento, incluso con un traductor de código inteligente que mapea las aplicaciones de 64 bits en
el procesador de 16 bits.
El objetivo de esta tesis es el de diseñar una capa software que aproveche la capacidad de las optimizaciones para reducir
el efecto negativo en el rendimiento del ISA estrecho. Concretamente, esta tesis se centra en optimizaciones que tratan el
problema de como compilar programas de 64 bits para una máquina de 16 bits desde la perspectiva de las MĂnimas
Computaciones Requeridas (MRC en inglés). Dado un programa, la noción de MRC intenta deducir la cantidad de cómputo
que realmente se necesita para generar la misma (correcta) salida que el programa original.
Aproximarse al MRC perfecto es una meta intrĂnsecamente ambiciosa y que requiere predicciones perfectas de
comportamiento del programa. Con este fin, la tesis propone tres heurĂsticas basadas en optimizaciones que tratan de
inferir el MRC. La utilización de MRC se desarrolla en la definición de productividad: si un cálculo no altera el dato que ya
habĂa almacenado, entonces no es productivo y por lo tanto, no es necesario llevarlo a cabo.
Se han propuesto tres optimizaciones del cĂłdigo basadas en profile:
1. Propagación Global de la Productividad (GPP en inglés) aplica el concepto de productividad a la granularidad de función.
2. Poda Local de Productividad (LPP en inglés) aplica el mismo concepto pero a una granularidad mucho más fina, la de
una Ăşnica instrucciĂłn.
3. ComputaciĂłn MĂnima del Salto (MBC en inglĂ©s) es una tĂ©cnica de reordenaciĂłn de cĂłdigo que aplica los principios de
MRC a los saltos condicionales.
El objetivo principal de todas esta técnicas es el de reducir el tamaño dinámico del código estrecho. Las primeras dos
optimizaciones (GPP y LPP) realizan la tarea de podar especulativamente las computaciones no productivas (innecesarias)
utilizando profiles. Además, estas dos optimizaciones realizan un recorrido hacia atrás de las regiones a optimizar para
añadir chequeos en el código no especulativo, haciendo de esta forma la técnica autosuficiente para detectar,
dinámicamente, los casos de fallo en la especulación.
La idea de la optimizaciĂłn MBC es reordenar las instrucciones que generan el salto condicional tal que las mĂnimas
computaciones que general la misma (correcta) salida se ejecuten en la mayorĂa de los casos; el resto de las
computaciones se ejecutarán sólo cuando sea necesario
Chapter One – An Overview of Architecture-Level Power- and Energy-Efficient Design Techniques
Power dissipation and energy consumption became the primary design constraint for almost all computer systems in the last 15 years. Both computer architects and circuit designers intent to reduce power and energy (without a performance degradation) at all design levels, as it is currently the main obstacle to continue with further scaling according to Moore's law. The aim of this survey is to provide a comprehensive overview of power- and energy-efficient “state-of-the-art” techniques. We classify techniques by component where they apply to, which is the most natural way from a designer point of view. We further divide the techniques by the component of power/energy they optimize (static or dynamic), covering in that way complete low-power design flow at the architectural level. At the end, we conclude that only a holistic approach that assumes optimizations at all design levels can lead to significant savings.Peer ReviewedPostprint (published version
SWIFT: A Low-Power Network-On-Chip Implementing the Token Flow Control Router Architecture With Swing-Reduced Interconnects
A 64-bit, 8 Ă— 8 mesh network-on-chip (NoC) is presented that uses both new architectural and circuit design techniques to improve on-chip network energy-efficiency, latency, and throughput. First, we propose token flow control, which enables bypassing of flit buffering in routers, thereby reducing buffer size and their power consumption. We also incorporate reduced-swing signaling in on-chip links and crossbars to minimize datapath interconnect energy. The 64-node NoC is experimentally validated with a 2 Ă— 2 test chip in 90 nm, 1.2 V CMOS that incorporates traffic generators to emulate the traffic of the full network. Compared with a fully synthesized baseline 8 Ă— 8 NoC architecture designed to meet the same peak throughput, the fabricated prototype reduces network latency by 20% under uniform random traffic, when both networks are run at their maximum operating frequencies. When operated at the same frequencies, the SWIFT NoC reduces network power by 38% and 25% at saturation and low loads, respectively
Profile-directed specialisation of custom floating-point hardware
We present a methodology for generating
floating-point arithmetic hardware
designs which are, for suitable applications, much reduced in size, while still
retaining performance and IEEE-754 compliance. Our system uses three
key parts: a profiling tool, a set of customisable
floating-point units and a
selection of system integration methods.
We use a profiling tool for
floating-point behaviour to identify arithmetic
operations where fundamental elements of IEEE-754
floating-point may be
compromised, without generating erroneous results in the common case.
In the uncommon case, we use simple detection logic to determine when
operands lie outside the range of capabilities of the optimised hardware.
Out-of-range operations are handled by a separate, fully capable,
floatingpoint
implementation, either on-chip or by returning calculations to a host
processor. We present methods of system integration to achieve this errorcorrection.
Thus the system suffers no compromise in IEEE-754 compliance,
even when the synthesised hardware would generate erroneous results.
In particular, we identify from input operands the shift amounts required
for input operand alignment and post-operation normalisation. For operations
where these are small, we synthesise hardware with reduced-size
barrel-shifters. We also propose optimisations to take advantage of other
profile-exposed behaviours, including removing the hardware required to
swap operands in a floating-point adder or subtractor, and reducing the
exponent range to fit observed values.
We present profiling results for a range of applications, including a selection
of computational science programs, Spec FP 95 benchmarks and the
FFMPEG media processing tool, indicating which would be amenable to
our method. Selected applications which demonstrate potential for optimisation
are then taken through to a hardware implementation. We show up
to a 45% decrease in hardware size for a
floating-point datapath, with a
correctable error-rate of less then 3%, even with non-profiled datasets
A configurable vector processor for accelerating speech coding algorithms
The growing demand for voice-over-packer (VoIP) services and multimedia-rich
applications has made increasingly important the efficient, real-time implementation of
low-bit rates speech coders on embedded VLSI platforms. Such speech coders are
designed to substantially reduce the bandwidth requirements thus enabling dense multichannel
gateways in small form factor. This however comes at a high computational cost
which mandates the use of very high performance embedded processors.
This thesis investigates the potential acceleration of two major ITU-T speech coding
algorithms, namely G.729A and G.723.1, through their efficient implementation on a
configurable extensible vector embedded CPU architecture. New scalar and vector ISAs
were introduced which resulted in up to 80% reduction in the dynamic instruction count
of both workloads. These instructions were subsequently encapsulated into a parametric,
hybrid SISD (scalar processor)–SIMD (vector) processor. This work presents the research
and implementation of the vector datapath of this vector coprocessor which is tightly-coupled
to a Sparc-V8 compliant CPU, the optimization and simulation methodologies
employed and the use of Electronic System Level (ESL) techniques to rapidly design
SIMD datapaths
Generation of Application Specific Hardware Extensions for Hybrid Architectures: The Development of PIRANHA - A GCC Plugin for High-Level-Synthesis
Architectures combining a field programmable gate array (FPGA) and a general-purpose processor on a single chip became increasingly popular in recent years. On the one hand, such hybrid architectures facilitate the use of application specific hardware accelerators that improve the performance of the software on the host processor. On the other hand, it obliges system designers to handle the whole process of hardware/software co-design. The complexity of this process is still one of the main reasons, that hinders the widespread use of hybrid architectures. Thus, an automated process that aids programmers with the hardware/software partitioning and the generation of application specific accelerators is an important issue. The method presented in this thesis neither requires restrictions of the used high-level-language nor special source code annotations. Usually, this is an entry barrier for programmers without deeper understanding of the underlying hardware platform.
This thesis introduces a seamless programming flow that allows generating hardware accelerators for unrestricted, legacy C code. The implementation consists of a GCC plugin that automatically identifies application hot-spots and generates hardware accelerators accordingly. Apart from the accelerator implementation in a hardware description language, the compiler plugin provides the generation of a host processor interfaces and, if necessary, a prototypical integration with the host operating system. An evaluation with typical embedded applications shows general benefits of the approach, but also reveals limiting factors that hamper possible performance improvements
Vega: A Ten-Core SoC for IoT Endnodes with DNN Acceleration and Cognitive Wake-Up from MRAM-Based State-Retentive Sleep Mode
The Internet-of-Things (IoT) requires endnodes with ultra-low-power always-on capability for a long battery lifetime, as well as high performance, energy efficiency, and extreme flexibility to deal with complex and fast-evolving near-sensor analytics algorithms (NSAAs). We present Vega, an IoT endnode system on chip (SoC) capable of scaling from a 1.7- ÎĽW fully retentive cognitive sleep mode up to 32.2-GOPS (at 49.4 mW) peak performance on NSAAs, including mobile deep neural network (DNN) inference, exploiting 1.6 MB of state-retentive SRAM, and 4 MB of non-volatile magnetoresistive random access memory (MRAM). To meet the performance and flexibility requirements of NSAAs, the SoC features ten RISC-V cores: one core for SoC and IO management and a nine-core cluster supporting multi-precision single instruction multiple data (SIMD) integer and floating-point (FP) computation. Vega achieves the state-of-the-art (SoA)-leading efficiency of 615 GOPS/W on 8-bit INT computation (boosted to 1.3 TOPS/W for 8-bit DNN inference with hardware acceleration). On FP computation, it achieves the SoA-leading efficiency of 79 and 129 GFLOPS/W on 32- and 16-bit FP, respectively. Two programmable machine learning (ML) accelerators boost energy efficiency in cognitive sleep and active states
HW/SW mechanisms for instruction fusion, issue and commit in modern u-processors
In this thesis we have explored the co-designed paradigm to show alternative processor design points. Specifically, we have provided HW/SW mechanisms for instruction fusion, issue and commit for modern processors. We have implemented a co-designed virtual machine monitor that binary translates x86 instructions into RISC like micro-ops. Moreover, the translations are stored as superblocks, which are a trace of basic blocks. These superblocks are further optimized using speculative and non-speculative optimizations. Hardware mechanisms exists in-order to take corrective action in case of misspeculations. During the course of this PhD we have made following contributions.
Firstly, we have provided a novel Programmable Functional unit, in-order to speed up general-purpose applications. The PFU consists of a grid of functional units, similar to CCA, and a distributed internal register file. The inputs of the macro-op are brought from the Physical Register File to the internal register file using a set of moves and a set of loads. A macro-op fusion algorithm fuses micro-ops at runtime. The fusion algorithm is based on a scheduling step that indicates whether the current fused instruction is beneficial or not. The micro-ops corresponding to the macro-ops are stored as control signals in a configuration. The macro-op consists of a configuration ID which helps in locating the configurations. A small configuration cache is present inside the Programmable Functional unit, that holds these configurations. In case of a miss in the configuration cache configurations are loaded from I-Cache. Moreover, in-order to support bulk commit of atomic superblocks that are larger
than the ROB we have proposed a speculative commit mechanism. For this we have proposed a Speculative commit register map table that holds the mappings of the speculatively committed instructions. When all the instructions of the superblock have committed the speculative state is copied to Backend Register Rename Table.
Secondly, we proposed a co-designed in-order processor with with two kinds of accelerators. These FU based accelerators run a pair of fused instructions. We have considered two kinds of instruction fusion. First, we fused a pair of independent loads together into vector loads and execute them on vector load units. For the second kind of instruction fusion we have fused a pair of dependent simple ALU instructions and execute them in Interlock Collapsing ALUs (ICALU). Moreover, we have evaluated performance of various code optimizations such as list-scheduling, load-store telescoping and load hoisting among others. We have compared our co-designed processor with small instruction window out-of-order processors.
Thirdly, we have proposed a co-designed out-of-order processor. Specifically we have reduced complexity in two areas. First
of all, we have co-designed the commit mechanism, that enable bulk commit of atomic superblocks. In this solution we got rid of the conventional ROB, instead we introduce the Superblock Ordering Buffer (SOB). SOB ensures program order is maintained at the granularity of the superblock, by bulk committing the program state. The program state consists of the register state and the memory state. The register state is held in a per superblock register map table, whereas the memory state is held in gated store buffer and updated in bulk. Furthermore, we have tackled the complexity of Out-of-Order issue logic by using FIFOs. We have proposed an enhanced steering heuristic that fixes the inefficiencies of the existing dependence-based heuristic. Moreover, a mechanism to release the FIFO entries earlier is also proposed that further improves the performance of the steering heuristic.En aquesta tesis hem explorat el paradigma de les mà quines issue i commit per processadors actuals. Hem implementat una mà quina virtual que tradueix binaris x86 a micro-ops de tipus RISC. Aquestes traduccions es guarden com a superblocks, que en realitat no és més que una traça de virtuals co-dissenyades. En particular, hem proposat mecanismes hw/sw per a la fusió d’instruccions, blocs bà sics. Aquests superblocks s’optimitzen utilitzant optimizacions especualtives i d’altres no speculatives. En cas de les optimizations especulatives es consideren mecanismes per a la gestió de errades en l’especulació. Al llarg d’aquesta tesis s’han fet les següents contribucions:
Primer, hem proposat una nova unitat functional programmable (PFU) per tal de millorar l’execuciĂł d’aplicacions de proposit general. La PFU estĂ formada per un conjunt d’unitats funcionals, similar al CCA, amb un banc de registres intern a la PFU distribuĂŻt a les unitats funcionals que la composen. Les entrades de la macro-operaciĂł que s’executa en la PFU es mouen del banc de registres fĂsic convencional al intern fent servir un conjunt de moves i loads. Un algorisme de fusiĂł combina mĂ©s micro-operacions en temps d’execuciĂł. Aquest algorisme es basa en un pas de planificaciĂł que mesura el benefici de les decisions de fusiĂł. Les micro operacions corresponents a la macro operaciĂł s’emmagatzemen com a senyals de control en una configuraciĂł. Les macro-operacions tenen associat un identificador de configuraciĂł que ajuda a localitzar d’aquestes. Una petita cache de configuracions estĂ present dintre de la PFU per tal de guardar-les. En cas de que la configuraciĂł no estigui a la cache, les configuracions es carreguen de la cache d’instruccions. Per altre banda, per tal de donar support al commit atòmic dels superblocks que sobrepassen el tamany del ROB s’ha proposat un mecanisme de commit especulatiu. Per aquest mecanisme hem proposat una taula de mapeig especulativa dels registres, que es copia a la taula no especulativa quan totes les instruccions del superblock han comitejat.
Segon, hem proposat un processador en order co-dissenyat que combina dos tipus d’acceleradors. Aquests acceleradors executen un parell d’instruccions fusionades. S’han considerat dos tipus de fusió d’instructions. Primer, combinem un parell de loads independents formant loads vectorials i els executem en una unitat vectorial. Segon, fusionem parells d’instruccions simples d’alu que són dependents i que s’executaran en una Interlock Collapsing ALU (ICALU). Per altra aquestes tecniques les hem evaluat conjuntament amb diverses optimizacions com list scheduling, load-store telescoping i hoisting de loads, entre d’altres. Aquesta proposta ha estat comparada amb un processador fora d’ordre.
Tercer, hem proposat un processador fora d’ordre co-dissenyat efficient reduint-ne la complexitat en dos areas principals. En primer lloc, hem co-disenyat el mecanisme de commit per tal de permetre un eficient commit atòmic del superblocks. En aquesta soluciĂł hem substituĂŻt el ROB convencional, i en lloc hem introduĂŻt el Superblock Ordering Buffer (SOB). El SOB mantĂ© l’odre de programa a granularitat de superblock. L’estat del programa consisteix en registres i memòria. L’estat dels registres es mantĂ© en una taula per superblock, mentre que l’estat de memòria es guarda en un buffer i s’actulitza atòmicament. La segona gran area de reducciĂł de complexitat considerarada Ă©s l’ús de FIFOs a la lògica d’issue. En aquest Ăşltim Ă mbit hem proposat una heurĂstica de distribuciĂł que solventa les ineficiències de l’heurĂstica basada en dependències anteriorment proposada. Finalment, i junt amb les FIFOs, s’ha proposat un mecanisme per alliberar les entrades de la FIFO anticipadament
- …