6 research outputs found

    Practical advances in asynchronous design and in asynchronous/synchronous interfaces

    Get PDF
    Journal ArticleAsynchronous systems are being viewed as an increasingly viable alternative to purely synchronous systems. This paper gives an overview of the current state of the art in practical asynchronous circuit and system design in four areas: controllers, datapaths, processors, and the design of asynchronous/synchronous interfaces

    Increasing rendering performance of graphics hardware

    Get PDF
    Graphics Processing Unit (GPU) performance is increasing faster than central processing unit (CPU) performance. This growth is driven by performance improvements that can be divided into the following three categories: algorithmic improvements, architectural improvements, and circuit-level improvements. In this dissertation I present techniques that improve the rendering performance of graphics hardware measured in speed, power consumption or image quality in each of these three areas. At the algorithmic level, I introduce a method for using graphics hardware to rapidly and efficiently generate summed-area tables, which are data structures that hold pre-computed two-dimensional integrals of subsets of a given image, and present several novel rendering techniques that take advantage of summed-area tables to produce dynamic, high-quality images at interactive frame rates. These techniques improve the visual quality of images rendered on current commodity GPUs without requiring modifications to the underlying hardware or architecture. At the architectural level, I propose modifications to the architecture of current GPUs that add conditional streaming capabilities. I describe a novel GPU-based ray-tracing algorithm that takes advantage of conditional output streams to reduce the memory bandwidth requirements by over an order of magnitude times when compared to previous techniques. At the circuit level, I propose a compute-on-demand paradigm for the design of high-speed and energy-efficient graphics components. The goal of the compute-on-demand paradigm is to only perform computation at the bit-level when needed. The compute-on-demand paradigm exploits the data-dependent nature of computation, and thereby obtains speed and energy improvements by optimizing designs for the common case. This approach is illustrated with the design of a high-speed Z-comparator that is implemented using asynchronous logic. Asynchronous or "clockless" circuits were chosen for my implementations since they allow for data-dependent completion times and reduced power consumption by disabling inactive components. The resulting circuit-level implementation runs over 1.5 times faster while on dissipating 25% the energy of a comparable synchronous comparator for the average case. Also at the circuit-level, I introduce a novel implementation of counterflow pipelining, which allows two streams of data to flow in opposite directions within the same pipeline without the need for complex arbitration. The advantages of this implementation are demonstrated by the design of a high-speed asynchronous Booth multiplier. While both the comparator and the multiplier are useful components of a graphics pipeline, the objective of this work was to propose the new design paradigm as a promising alternative to current graphics hardware design practices

    SPARC16 : a new compression approach for SPARC processors

    Get PDF
    Orientadores: Rodolfo Jardim de Azevedo, Paulo César CentoducatteDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Processadores RISC podem ser usados para atender a crescente demanda por desempenho requerida por sistemas embarcados. Entretanto, essas arquiteturas têm como desvantagem uma densidade de código ruim. Recodificações do conjunto de instruções, como o MIPS16 e o Thumb, representam uma abordagem eficiente para lidar com esse problema. Esse trabalho propõe uma codificação alternativa para a arquitetura SPARCv8. A nova codificação, chamada SPARC16, foi projetada com a ajuda de um modelo de programação linear inteira. As novas instruções utilizam 16 bits para serem codificadas e são facilmente traduzidas para suas correspondentes no conjunto de instruções original em tempo de execução, tornando possível posicionar um descompressor antes do estágio de decode de um processador SPARC e usar o restante do pipeline de forma transparente. O descompressor foi projetado e integrado no processador Leon 3 (SPARCv8) e ocasionou um acréscimo de 24% na área e nenhuma penalização na freqüência. Apenas um montador foi implementado para a extensão SPARC16. O descompressor foi validado através de programas que exercitam todas as instruções SPARC16 escritos diretamente em linguagem de montagem. As razões de compressão dos programas dos benchmarks Mediabench e Mibench foram obtidas inferindo como código SPARCv8 seria representado com instruções SPARC16. Através desse método, razões de compressão de até 58% foram atingidas (para o programa cjpeg) com uma média de 61.27% para os programas do Mediabench e 60.77% para os programas do Mibench. Utilizando a mesma abordagem, uma avaliação da mudança trazida pelo uso de SPARC16 nos padrões de acesso à cachê de instruções foi feita e mostrou reduções no número de misses até superiores a 50%Abstract: RISC processors can be used to face the ever increasing demand for performance required by embedded systems. Nevertheless, these architectures have as drawback a poor code density. Alternate encodings for instruction sets, such as MIPS16 and Thumb, represent an effective approach to deal with this problem. This work proposes an alternate encoding for the SPARCv8 architecture. The new encoding, called SPARC16, was designed with the aid of an integer linear programming model. The new instructions are 16-bits wide and are easily translated to its 32-bit counterparts during execution time, making it possible to place a decompressor engine before the decode stage of a SPARC processor and use the remaining of the pipeline transparently. The decompressor engine was designed and integrated into the Leon 3 processor (SPARCv8) and caused an increase of 24% in area and no timing overhead. Only an assembler was implemented for the SPARC16 extension. The decompressor engine was validated using programs that cover all the SPARC16 instructions written directly in assembly language. The compression ratios for the programs belonging to the Mediabench and Mibench benchmarks were obtained inferring how SPARCv8 code would be represented with SPARC16 instructions. Through this method, compression ratios as low as 58% were achieved (for the cjpeg program) with an average of 61.27% for the Mediabench programs and 60.77% for the Mibench programs. Using the same approach, an evaluation of the change brought by the use of SPARC16 in the instruction cache access patterns was performed and showed reductions in the number of misses even greater than 50%MestradoCiência da ComputaçãoMestre em Ciência da Computaçã

    Estudo e avaliação de conjuntos de instruções compactos

    Get PDF
    Orientador: Rodolfo Jardim de AzevedoTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Sistemas embarcados modernos são compostos de SoC heterogêneos, variando entre processadores de baixo e alto custo. Apesar de processadores RISC serem o padrão para estes dispositivos, a situação mudou recentemente: fabricantes estão construindo sistemas embarcados utilizando processadores RISC - ARM e MIPS - e CISC (x86). A adição de novas funcionalidades em software embarcados requer maior utilização da memória, um recurso caro e escasso em SoCs. Assim, o tamanho de código executável é crítico, porque afeta diretamente o número de misses na cache de instruções. Processadores CISC costumavam possuir maior densidade de código do que processadores RISC, uma vez que a codificação de instruções com tamanho variável beneficia as instruções mais usadas, os programas são menores. No entanto, com a adição de novas extensões e instruções mais longas, a densidade do CISC em aplicativos recentes tornou-se similar ao RISC. Nesta tese de doutorado, investigamos a compressibilidade de processadores RISC e CISC; SPARC e x86. Nós propomos uma extensão de 16-bits para o processador SPARC, o SPARC16. Apresentamos também, a primeira metodologia para gerar ISAs de 16-bits e avaliamos a compressão atingida em comparação com outras extensões de 16-bits. Programas do SPARC16 podem atingir taxas de compressão melhores do que outros ISAs, atingindo taxas de até 67%. O SPARC16 também reduz taxas de cache miss em até 9%, podendo usar caches menores do que processadores SPARC mas atingindo o mesmo desempenho; a redução pode chegar à um fator de 16. Estudamos também como novas extensões constantemente introduzem novas funcionalidades para o x86, levando ao inchaço do ISA - com o total de 1300 instruções em 2013. Alem disso, 57 instruções se tornam inutilizadas entre 1995 e 2012. Resolvemos este problema propondo um mecanismo de reciclagem de opcodes utilizando emulação de instruções legadas, sem quebrar compatibilidade com softwares antigos. Incluímos um estudo de caso onde instruções x86 da extensão AVX são recodificadas usando codificações menores, oriundas de instruções inutilizadas, atingindo até 14% de redução no tamanho de código e 53% de diminuição do número de cache misses. Os resultados finais mostram que usando nossa técnica, até 40% das instruções do x86 podem ser removidas com menos de 5% de perda de desempenhoAbstract: Modern embedded devices are composed of heterogeneous SoC systems ranging from low to high-end processor chips. Although RISC has been the traditional processor for these devices, the situation changed recently; manufacturers are building embedded systems using both RISC - ARM and MIPS - and CISC processors (x86). New functionalities in embedded software require more memory space, an expensive and rare resource in SoCs. Hence, executable code size is critical since performance is directly affected by instruction cache misses. CISC processors used to have a higher code density than RISC since variable length encoding benefits most used instructions, yielding smaller programs. However, with the addition of new extensions and longer instructions, CISC density in recent applications became similar to RISC. In this thesis, we investigate compressibility of RISC and CISC processors, namely SPARC and x86. We propose a 16-bit extension to the SPARC processor, the SPARC16. Additionally, we provide the first methodology for generating 16-bit ISAs and evaluate compression among different 16-bit extensions. SPARC16 programs can achieve better compression ratios than other ISAs, attaining results as low as 67%. SPARC16 also reduces cache miss rates up to 9%, requiring smaller caches than SPARC processors to achieve the same performance; a cache size reduction that can reach a factor of 16. Furthermore, we study how new extensions are constantly introducing new functionalities to x86, leading to the ISA bloat at the cost a complex microprocessor front-end design, area and energy consumption - the x86 ISA reached over 1300 different instructions in 2013. Moreover, analyzed x86 code from 5 Windows versions and 7 Linux distributions in the range from 1995 to 2012 shows that up to 57 instructions get unused with time. To solve this problem, we propose a mechanism to recycle instruction opcodes through legacy instruction emulation without breaking backward software compatibility. We present a case study of the AVX x86 SIMD instructions with shorter instruction encodings from other unused instructions to yield up to 14% code size reduction and 53% instruction cache miss reduction in SPEC CPU2006 floating-point programs. Finally, our results show that up to 40% of the x86 instructions can be removed with less than 5% of overhead through our technique without breaking any legacy codeDoutoradoCiência da ComputaçãoDoutor em Ciência da Computaçã

    Asynchronous Circuits and Systems ” (“Async98 ” Symposium),San Diego,CA A Fast Asynchronous Huffman Decoder for Compressed-Code Embedded Processors

    No full text
    This paper presents the architecture and design of a high-performance asynchronous Huffman decoder for compressed-code embedded processors. In such processors, embedded programs are stored in compressed form in instruction ROM, then are decompressed on demand during instruction cache refill. The Huffman decoder is used as a code decompression engine. The circuit is non-pipelined, and is implemented as an iterative self-timed ring. It achieves a high-speed decode rate with very low area overhead. Simulations using Lsim show an average throughput of 32 bits/25 ns on the output side (or 163 MBytes/sec, or 1303 Mbit/sec), corresponding to about 889 Mbit/sec on the input side. The area of the design is extremely small: under 1 mm 2 in a 0.8 micron fullcustom layout. The decoder is estimated to have higher throughput than any comparable synchronous Huffman decoder (after normalizing for feature size and voltage), yet is much smaller than synchronous designs. Its performance is also 83% faster than a recently published asynchronous Huffman decoder using the same technology.
    corecore