Search CORE

32 research outputs found

SPARC16 : a new compression approach for SPARC processors

Author: Ecco Leonardo Luiz
Publication venue: [s.n.]
Publication date: 17/08/2018
Field of study

Orientadores: Rodolfo Jardim de Azevedo, Paulo César CentoducatteDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Processadores RISC podem ser usados para atender a crescente demanda por desempenho requerida por sistemas embarcados. Entretanto, essas arquiteturas têm como desvantagem uma densidade de código ruim. Recodificações do conjunto de instruções, como o MIPS16 e o Thumb, representam uma abordagem eficiente para lidar com esse problema. Esse trabalho propõe uma codificação alternativa para a arquitetura SPARCv8. A nova codificação, chamada SPARC16, foi projetada com a ajuda de um modelo de programação linear inteira. As novas instruções utilizam 16 bits para serem codificadas e são facilmente traduzidas para suas correspondentes no conjunto de instruções original em tempo de execução, tornando possível posicionar um descompressor antes do estágio de decode de um processador SPARC e usar o restante do pipeline de forma transparente. O descompressor foi projetado e integrado no processador Leon 3 (SPARCv8) e ocasionou um acréscimo de 24% na área e nenhuma penalização na freqüência. Apenas um montador foi implementado para a extensão SPARC16. O descompressor foi validado através de programas que exercitam todas as instruções SPARC16 escritos diretamente em linguagem de montagem. As razões de compressão dos programas dos benchmarks Mediabench e Mibench foram obtidas inferindo como código SPARCv8 seria representado com instruções SPARC16. Através desse método, razões de compressão de até 58% foram atingidas (para o programa cjpeg) com uma média de 61.27% para os programas do Mediabench e 60.77% para os programas do Mibench. Utilizando a mesma abordagem, uma avaliação da mudança trazida pelo uso de SPARC16 nos padrões de acesso à cachê de instruções foi feita e mostrou reduções no número de misses até superiores a 50%Abstract: RISC processors can be used to face the ever increasing demand for performance required by embedded systems. Nevertheless, these architectures have as drawback a poor code density. Alternate encodings for instruction sets, such as MIPS16 and Thumb, represent an effective approach to deal with this problem. This work proposes an alternate encoding for the SPARCv8 architecture. The new encoding, called SPARC16, was designed with the aid of an integer linear programming model. The new instructions are 16-bits wide and are easily translated to its 32-bit counterparts during execution time, making it possible to place a decompressor engine before the decode stage of a SPARC processor and use the remaining of the pipeline transparently. The decompressor engine was designed and integrated into the Leon 3 processor (SPARCv8) and caused an increase of 24% in area and no timing overhead. Only an assembler was implemented for the SPARC16 extension. The decompressor engine was validated using programs that cover all the SPARC16 instructions written directly in assembly language. The compression ratios for the programs belonging to the Mediabench and Mibench benchmarks were obtained inferring how SPARCv8 code would be represented with SPARC16 instructions. Through this method, compression ratios as low as 58% were achieved (for the cjpeg program) with an average of 61.27% for the Mediabench programs and 60.77% for the Mibench programs. Using the same approach, an evaluation of the change brought by the use of SPARC16 in the instruction cache access patterns was performed and showed reductions in the number of misses even greater than 50%MestradoCiência da ComputaçãoMestre em Ciência da Computaçã

Repositorio da Producao Cientifica e Intelectual da Unicamp

Estudo e avaliação de conjuntos de instruções compactos

Author: Lopes Bruno Cardoso, 1985-
Publication venue: [s.n.]
Publication date: 24/08/2018
Field of study

Orientador: Rodolfo Jardim de AzevedoTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Sistemas embarcados modernos são compostos de SoC heterogêneos, variando entre processadores de baixo e alto custo. Apesar de processadores RISC serem o padrão para estes dispositivos, a situação mudou recentemente: fabricantes estão construindo sistemas embarcados utilizando processadores RISC - ARM e MIPS - e CISC (x86). A adição de novas funcionalidades em software embarcados requer maior utilização da memória, um recurso caro e escasso em SoCs. Assim, o tamanho de código executável é crítico, porque afeta diretamente o número de misses na cache de instruções. Processadores CISC costumavam possuir maior densidade de código do que processadores RISC, uma vez que a codificação de instruções com tamanho variável beneficia as instruções mais usadas, os programas são menores. No entanto, com a adição de novas extensões e instruções mais longas, a densidade do CISC em aplicativos recentes tornou-se similar ao RISC. Nesta tese de doutorado, investigamos a compressibilidade de processadores RISC e CISC; SPARC e x86. Nós propomos uma extensão de 16-bits para o processador SPARC, o SPARC16. Apresentamos também, a primeira metodologia para gerar ISAs de 16-bits e avaliamos a compressão atingida em comparação com outras extensões de 16-bits. Programas do SPARC16 podem atingir taxas de compressão melhores do que outros ISAs, atingindo taxas de até 67%. O SPARC16 também reduz taxas de cache miss em até 9%, podendo usar caches menores do que processadores SPARC mas atingindo o mesmo desempenho; a redução pode chegar à um fator de 16. Estudamos também como novas extensões constantemente introduzem novas funcionalidades para o x86, levando ao inchaço do ISA - com o total de 1300 instruções em 2013. Alem disso, 57 instruções se tornam inutilizadas entre 1995 e 2012. Resolvemos este problema propondo um mecanismo de reciclagem de opcodes utilizando emulação de instruções legadas, sem quebrar compatibilidade com softwares antigos. Incluímos um estudo de caso onde instruções x86 da extensão AVX são recodificadas usando codificações menores, oriundas de instruções inutilizadas, atingindo até 14% de redução no tamanho de código e 53% de diminuição do número de cache misses. Os resultados finais mostram que usando nossa técnica, até 40% das instruções do x86 podem ser removidas com menos de 5% de perda de desempenhoAbstract: Modern embedded devices are composed of heterogeneous SoC systems ranging from low to high-end processor chips. Although RISC has been the traditional processor for these devices, the situation changed recently; manufacturers are building embedded systems using both RISC - ARM and MIPS - and CISC processors (x86). New functionalities in embedded software require more memory space, an expensive and rare resource in SoCs. Hence, executable code size is critical since performance is directly affected by instruction cache misses. CISC processors used to have a higher code density than RISC since variable length encoding benefits most used instructions, yielding smaller programs. However, with the addition of new extensions and longer instructions, CISC density in recent applications became similar to RISC. In this thesis, we investigate compressibility of RISC and CISC processors, namely SPARC and x86. We propose a 16-bit extension to the SPARC processor, the SPARC16. Additionally, we provide the first methodology for generating 16-bit ISAs and evaluate compression among different 16-bit extensions. SPARC16 programs can achieve better compression ratios than other ISAs, attaining results as low as 67%. SPARC16 also reduces cache miss rates up to 9%, requiring smaller caches than SPARC processors to achieve the same performance; a cache size reduction that can reach a factor of 16. Furthermore, we study how new extensions are constantly introducing new functionalities to x86, leading to the ISA bloat at the cost a complex microprocessor front-end design, area and energy consumption - the x86 ISA reached over 1300 different instructions in 2013. Moreover, analyzed x86 code from 5 Windows versions and 7 Linux distributions in the range from 1995 to 2012 shows that up to 57 instructions get unused with time. To solve this problem, we propose a mechanism to recycle instruction opcodes through legacy instruction emulation without breaking backward software compatibility. We present a case study of the AVX x86 SIMD instructions with shorter instruction encodings from other unused instructions to yield up to 14% code size reduction and 53% instruction cache miss reduction in SPEC CPU2006 floating-point programs. Finally, our results show that up to 40% of the x86 instructions can be removed with less than 5% of overhead through our technique without breaking any legacy codeDoutoradoCiência da ComputaçãoDoutor em Ciência da Computaçã

Repositorio da Producao Cientifica e Intelectual da Unicamp

Uma arquitetura para execução de codigo comprimido em sistemas dedicados

Author: Azevedo Rodolfo Jardim de, 1974-
Publication venue: [s.n.]
Publication date: 01/08/2018
Field of study

Orientador : Guido Costa Souza de AraujoTese (doutorado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Projetos de sistemas dedicados modernos têm exigido cada vez mais memória de programa para incluir novas funcionalidades como interface com o usuário, suporte a novos componentes, etc. O aumento no tamanho dos programas tem feito com que a área ocupada pela memória em um circuito integrado moderno seja um dos fatores determinantes no seu custo final bem como um dos maiores responsáveis pelo consumo de potência nestes dispositivos. A compressão de código de programa vem sendo considerada como uma estratégia importante na minimização deste problema. Esta tese trata da compressão de programas para execução em sistemas dedicados baseados em arquiteturas RISC. Um amplo estudo demonstra que a utilização do método proposto neste trabalho, Instruction Based Compression (IBC), resulta em boas razões de compressão e implementações eficientes de descompressores. Para a arquitetura MIPS foi obtida a melhor razão de compressão (tamanho final do programa comprimido e do descompressor em relação ao programa original) conhecida (53,6%) utilizando como benchmark programas do SPEC CINT'95. Uma arquitetura pipelined para o descompressor é proposta e um protótipo foi implementado para o processador Leon (SPARC V8). Esta é a primeira implementação em hardware de um descompressor para a arquitetura SPARC, tendo produzido uma razão de compressão de 61,8% para o mesmo benchmark e uma queda de apenas 5,89% no desempenho médio do sistemaAbstract: The demand for program memory in embedded systems has grown considerably in recent years, as a result of the need to accommodate new system functionalities such as novel user interfaces, additional hardware devices, etc. The increase in program size has turned memory into the largest single factor in the total area and power dissipation of a modern System-on-a-Chíp (SoC). Program code compression has been considered recently a central technique in reducing the cost of memory in such systems. This thesis studies the code compression problem for RISC architectures. A thorough experimental study shows that the Instructíon Based Compressíon (IBC) technique proposed herein results in very good compression ratios and efficient decompressor engine implementations. For the MIPS architecture this approach results in the best compression ratio (size of the compressed program divided by the size of the original program) known in the literature (53.6%), when it is evaluated using the SPEC CINT'95 benchmark programs. A decompressor pipelined architecture was developed and prototyped for the Leon (SPARC V8) processor. This is the first implementation of a hardware decompressor on the SPARC architecture, having resulted in a 61.8% compression ratio for the same benchmark, at the expense of a fairly small performance overhead (5.89% on average)DoutoradoDoutor em Ciência da Computaçã

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio da Producao Cientifica e Intelectual da Unicamp

Industry/University Collaboration at the University of Michigan-Dearborn: A Focus on Relevant Technology

Author: Argento Alan
Awad Selim S.
Bartlett Andrew
Chen Yubao
Cherng John
Chow C.L.
Kampfner Roberto
Kamrani Ali
Lakshmanan Sridhar
Mallick Pankaj K.
Maxim Bruce
Miller John
Modesitt Kenneth
Murphey Yi Lu
Murtuza Syed
Orady Elsayed
Ratts Eric
Sengupta Subrata
Shridhar Malayappan
Varde Keshav
Yoon David
Zhang Yi
Zhao Dongming
Zhu Qiang
Publication venue
Publication date: 19/06/1996
Field of study

https://deepblue.lib.umich.edu/bitstream/2027.42/154104/1/kampfner1996.pd

Deep Blue Documents at the University of Michigan

A novel approach for the hardware implementation of a PPMC statistical data compressor

Author: Claudia Feregrino Uribe (7203002)
Publication venue
Publication date: 01/01/2001
Field of study

This thesis aims to understand how to design high-performance compression algorithms suitable for hardware implementation and to provide hardware support for an efficient compression algorithm. Lossless data compression techniques have been developed to exploit the available bandwidth of applications in data communications and computer systems by reducing the amount of data they transmit or store. As the amount of data to handle is ever increasing, traditional methods for compressing data become· insufficient. To overcome this problem, more powerful methods have been developed. Among those are the so-called statistical data compression methods that compress data based on their statistics. However, their high complexity and space requirements have prevented their hardware implementation and the full exploitation of their potential benefits. This thesis looks into the feasibility of the hardware implementation of one of these statistical data compression methods by exploring the potential for reorganising and restructuring the method for hardware implementation and investigating ways of achieving efficient and effective designs to achieve an efficient and cost-effective algorithm. [Continues.

Loughborough University Institutional Repository

Dynamic Binary Translation for Embedded Systems with Scratchpad Memory

Author: Baiocchi Paredes Jose Americo
Publication venue
Publication date: 01/01/2011
Field of study

Embedded software development has recently changed with advances in computing. Rather than fully co-designing software and hardware to perform a relatively simple task, nowadays embedded and mobile devices are designed as a platform where multiple applications can be run, new applications can be added, and existing applications can be updated. In this scenario, traditional constraints in embedded systems design (i.e., performance, memory and energy consumption and real-time guarantees) are more difficult to address. New concerns (e.g., security) have become important and increase software complexity as well. In general-purpose systems, Dynamic Binary Translation (DBT) has been used to address these issues with services such as Just-In-Time (JIT) compilation, dynamic optimization, virtualization, power management and code security. In embedded systems, however, DBT is not usually employed due to performance, memory and power overhead. This dissertation presents StrataX, a low-overhead DBT framework for embedded systems. StrataX addresses the challenges faced by DBT in embedded systems using novel techniques. To reduce DBT overhead, StrataX loads code from NAND-Flash storage and translates it into a Scratchpad Memory (SPM), a software-managed on-chip SRAM with limited capacity. SPM has similar access latency as a hardware cache, but consumes less power and chip area. StrataX manages SPM as a software instruction cache, and employs victim compression and pinning to reduce retranslation cost and capture frequently executed code in the SPM. To prevent performance loss due to excessive code expansion, StrataX minimizes the amount of code inserted by DBT to maintain control of program execution. When a hardware instruction cache is available, StrataX dynamically partitions translated code among the SPM and main memory. With these techniques, StrataX has low performance overhead relative to native execution for MiBench programs. Further, it simplifies embedded software and hardware design by operating transparently to applications without any special hardware support. StrataX achieves sufficiently low overhead to make it feasible to use DBT in embedded systems to address important design goals and requirements

CiteSeerX

D-Scholarship@Pitt

Parallelism and the software-hardware interface in embedded systems

Author: Chouliaras V A
Publication venue
Publication date: 01/01/2005
Field of study

This thesis by publications addresses issues in the architecture and microarchitecture of next generation, high performance streaming Systems-on-Chip through quantifying the most important forms of parallelism in current and emerging embedded system workloads. The work consists of three major research tracks, relating to data level parallelism, thread level parallelism and the software-hardware interface which together reflect the research interests of the author as they have been formed in the last nine years. Published works confirm that parallelism at the data level is widely accepted as the most important performance leverage for the efficient execution of embedded media and telecom applications and has been exploited via a number of approaches the most efficient being vectorlSIMD architectures. A further, complementary and substantial form of parallelism exists at the thread level but this has not been researched to the same extent in the context of embedded workloads. For the efficient execution of such applications, exploitation of both forms of parallelism is of paramount importance. This calls for a new architectural approach in the software-hardware interface as its rigidity, manifested in all desktop-based and the majority of embedded CPU's, directly affects the performance ofvectorized, threaded codes. The author advocates a holistic, mature approach where parallelism is extracted via automatic means while at the same time, the traditionally rigid hardware-software interface is optimized to match the temporal and spatial behaviour of the embedded workload. This ultimate goal calls for the precise study of these forms of parallelism for a number of applications executing on theoretical models such as instruction set simulators and parallel RAM machines as well as the development of highly parametric microarchitectural frameworks to encapSUlate that functionality.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

Loughborough University Institutional Repository

OpenGrey Repository

Harnessing Simulation Acceleration to Solve the Digital Design Verification Challenge.

Author: Chatterjee Debapriya
Publication venue
Publication date: 01/01/2013
Field of study

Today, design verification is by far the most resource and time-consuming activity of any new digital integrated circuit development. Within this area, the vast majority of the verification effort in industry relies on simulation platforms, which are implemented either in hardware or software. A "simulator" includes a model of each component of a design and has the capability of simulating its behavior under any input scenario provided by an engineer. Thus, simulators are deployed to evaluate the behavior of a design under as many input scenarios as possible and to identify and debug all incorrect functionality. Two features are critical in simulators for the validation effort to be effective: performance and checking/debugging capabilities. A wide range of simulator platforms are available today: on one end of the spectrum there are software-based simulators, providing a very rich software infrastructure for checking and debugging the design's functionality, but executing only at 1-10 simulation cycles per second (while actual chips operate at GHz speeds). At the other end of the spectrum, there are hardware-based platforms, such as accelerators, emulators and even prototype silicon chips, providing higher performances by 4 to 9 orders of magnitude, at the cost of very limited or non-existent checking/debugging capabilities. As a result, today, simulation-based validation is crippled: one can either have satisfactory performance on hardware-accelerated platforms or critical infrastructures for checking/debugging on software simulators, but not both. This dissertation brings together these two ends of the spectrum by presenting solutions that offer high-performance simulation with effective checking and debugging capabilities. Specifically, it addresses the performance challenge of software simulators by leveraging inexpensive off-the-shelf graphics processors as massively parallel execution substrates, and then exposing the parallelism inherent in the design model to that architecture. For hardware-based platforms, the dissertation provides solutions that offer enhanced checking and debugging capabilities by abstracting the relevant data to be logged during simulation so to minimize the cost of collection, transfer and processing. Altogether, the contribution of this dissertation has the potential to solve the challenge of digital design verification by enabling effective high-performance simulation-based validation.PHDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/99781/1/dchatt_1.pd

Deep Blue Documents at the University of Michigan

Recommended from our members

JavaFlow : a Java DataFlow Machine

Author: Ascott Robert John
Publication venue
Publication date: 10/02/2015
Field of study

textThe JavaFlow, a Java DataFlow Machine is a machine design concept implementing a Java Virtual Machine aimed at addressing technology roadmap issues along with the ability to effectively utilize and manage very large numbers of processing cores. Specific design challenges addressed include: design complexity through a common set of repeatable structures; low power by featuring unused circuits and ability to power off sections of the chip; clock propagation and wire limits by using locality to bring data to processing elements and a Globally Asynchronous Locally Synchronous (GALS) design; and reliability by allowing portions of the design to be bypassed in case of failures. A Data Flow Architecture is used with multiple heterogeneous networks to connect processing elements capable of executing a single Java ByteCode instruction. Whole methods are cached in this DataFlow fabric, and the networks plus distributed intelligence are used for their management and execution. A mesh network is used for the DataFlow transfers; two ordered networks are used for management and control flow mapping; and multiple high speed rings are used to access the storage subsystem and a controlling General Purpose Processor (GPP). Analysis of benchmarks demonstrates the potential for this design concept. The design process was initiated by analyzing SPEC JVM benchmarks which identified a small number methods contributing to a significant percentage of the overall ByteCode operations. Additional analysis established static instruction mixes to prioritize the types of processing elements used in the DataFlow Fabric. The overall objective of the machine is to provide multi-threading performance for Java Methods deployed to this DataFlow fabric. With advances in technology it is envisioned that from 1,000 to 10,000 cores/instructions could be deployed and managed using this structure. This size of DataFlow fabric would allow all the key methods from the SPEC benchmarks to be resident. A baseline configuration is defined with a compressed dataflow structure and then compared to multiple configurations of instruction assignments and clock relationships. Using a series of methods from the SPEC benchmark running independently, IPC (Instructions per Cycle) performance of the sparsely populated heterogeneous structure is 40% of the baseline. The average ratio of instructions to required nodes is 3.5. Innovative solutions to the loading and management of Java methods along with the translation from control flow to DataFlow structure are demonstrated.Electrical and Computer Engineerin

Texas ScholarWorks