20 research outputs found

    Alocação global de registradores de endereçamento para referencias a vetores em DSPs

    Advisor: Guido Costa Souza de Araujo. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação. Abstract: Technological advances in computing systems have stimulated the growth of the embedded systems market, and embedded systems are becoming increasingly common in everyday life, for example in mobile phones, palmtops and automotive control systems. Because of their characteristics, these new applications demand a combination of low cost, high performance and low power consumption. One way to meet these constraints is through the design of specialized processors. However, processor specialization imposes new challenges on the development of software for these systems. In particular, compilers, which are generally responsible for code optimization, need to be adapted in order to produce efficient code for these new processors. In the digital signal processing arena, such as in cellular telephones, specialized processors known as DSPs (Digital Signal Processors) are widely used. DSPs typically have few general-purpose registers and very restricted addressing modes. In addition, many DSP applications process large data streams, which are usually stored in arrays. As a result, the study of array reference optimization techniques has become an important task in compiling for DSPs. This work studies this problem, known as Global Array Reference Allocation (GARA). The central GARA subproblem consists of determining, for a given set of array references to be allocated to the same address register, the minimum cost of the instructions required to keep this register holding the correct address at all program points. In this work, this subproblem is modeled as a graph-theoretical problem and proved to be NP-hard. In addition, an efficient algorithm, based on dynamic programming, is proposed to solve this subproblem optimally under some restrictions. Based on this algorithm, two techniques to solve GARA are proposed. Experimental results, obtained by implementing these techniques in the GCC compiler, compare them with previous work in the literature. The results show the effectiveness of the techniques proposed in this work. Degree: Master's in Computer Science.
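The cost that the GARA subproblem minimizes can be illustrated with a deliberately simplified sketch. It is not the dissertation's algorithm: it assumes a single straight-line loop body (no control flow), references of the form `a[i + d]` listed by their constant offsets `d`, and an address register that can move by at most `auto_range` for free through auto-increment or auto-decrement. The function name and parameters are hypothetical.

```python
from typing import List


def update_cost(offsets: List[int], step: int = 1, auto_range: int = 1) -> int:
    """Count explicit address-register updates for references a[i + d].

    Assumption: a move of at most `auto_range` between consecutive
    references is free; any larger move costs one extra instruction.
    """
    cost = 0
    # Transitions between consecutive references inside one iteration.
    for prev, curr in zip(offsets, offsets[1:]):
        if abs(curr - prev) > auto_range:
            cost += 1
    # Loop back edge: last reference of this iteration to the first
    # reference of the next one, where i has advanced by `step`.
    if offsets and abs(offsets[0] + step - offsets[-1]) > auto_range:
        cost += 1
    return cost


in_order = update_cost([0, 1, 2])   # a[i], a[i+1], a[i+2]
scattered = update_cost([0, 3, 1])  # a[i], a[i+3], a[i+1]
```

Under these assumptions the in-order reference pattern needs no explicit updates, while the scattered one needs two. The global problem described in the abstract additionally has to account for control flow between basic blocks.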

    Custom Integrated Circuits

    Contains table of contents for Part III, table of contents for Section 1 and reports on eleven research projects. Sponsors: IBM Corporation; MIT School of Engineering; National Science Foundation Grant MIP 94-23221; Defense Advanced Research Projects Agency/U.S. Army Intelligence Center Contract DABT63-94-C-0053; Mitsubishi Corporation; National Science Foundation Young Investigator Award Fellowship MIP 92-58376; Joint Industry Program on Offshore Structure Analysis; Analog Devices; Defense Advanced Research Projects Agency; Cadence Design Systems; MAFET Consortium; Consortium for Superconducting Electronics; National Defense Science and Engineering Graduate Fellowship; Digital Equipment Corporation; MIT Lincoln Laboratory; Semiconductor Research Corporation; Multiuniversity Research Initiative; National Science Foundation

    Optimierende Compiler für DSPs: Was ist verfügbar?

    Software development for embedded processors still takes place largely at the assembly level today. The reason for this situation, which is probably untenable in the long run, is the lack of good C compilers. In recent years, however, substantial progress has been made in code optimization, especially for DSPs, that has so far been only insufficiently carried over into commercial products. This contribution identifies the principal sources of optimization and summarizes the state of the art. The central methods are complex optimization techniques that go beyond traditional compiler technology, together with the exploitation of DSP-specific hardware architectures for the efficient translation of C language constructs into DSP machine instructions. Some of these techniques can also be applied more generally to assembly programs, whether compiler-generated or hand-written.

    Memory optimization techniques for embedded systems

    Embedded systems have become ubiquitous, and as a result the optimization of the design and performance of programs that run on these systems remains a significant challenge for the computer systems research community. This dissertation addresses several key problems in the optimization of programs for embedded systems that include digital signal processors as the core processor. Chapter 2 develops an efficient and effective algorithm to construct a worm partition graph by finding the longest worm available at each step while maintaining the legality of scheduling. Proper assignment of offsets to variables in embedded DSPs plays a key role in determining the execution time and the amount of program memory needed. Chapter 3 proposes a new approach that introduces a weight adjustment function and shows experimentally that its results are slightly better than, and never worse than, those of previous work. Our solutions address several problems such as handling fragmented paths resulting from graph-based solutions, dealing with modify registers, and the effective utilization of multiple address registers. In addition to offset assignment, address register allocation is important for embedded DSPs. Chapter 4 develops a lower bound and an algorithm that can eliminate the explicit use of address register instructions in loops with array references. Scheduling of computations and the associated memory requirements are closely interrelated for loop computations. In Chapter 5, we develop a general framework for studying the trade-off between scheduling and storage requirements in nested loops that access multi-dimensional arrays. Tiling has long been used to improve the memory performance of loops. Only a sufficient condition for the legality of tiling was known previously. While it was conjectured that the sufficient condition would also become necessary for large enough tiles, there had been no precise characterization of what is large enough. Chapter 6 develops a new framework for characterizing tiling by viewing tiles as points on a lattice. This also leads to the development of conditions under which the legality condition for tiling is both necessary and sufficient.

    Heuristics for offset assignment in embedded processors

    This thesis deals with the optimization of program size and performance in current-generation embedded digital signal processors (DSPs) by the design of optimal memory layouts for data. Given the tight constraints on the size, power consumption, cost and performance of these processors, the minimization of code size in terms of the number of instructions required and the associated reduction in execution time are important. Several DSPs provide limited addressing modes, and the layout of data, known as offset assignment, plays a critical role in determining the code size and performance. Even the simplest variant of the offset assignment problem is NP-complete. Research effort in this area has focused on the design, implementation and evaluation of effective heuristics for several variants of the offset assignment problem. One of the most important factors in determining the size, and hence the execution time, of a program is the number of instructions required to access the variables stored in the processor memory. The indirect addressing mode common in DSPs requires memory accesses to be realized through address registers that hold the address of the memory location to be accessed. The architecture provides instructions for adding to and subtracting from the values of the address registers to compute the addresses of subsequent data that need to be accessed. In addition, some DSPs include multiple memory banks that allow increased parallelism in memory access. Proper partitioning of variables across memory banks is critical to effectively using the increased parallelism. The work reported in this thesis aims to develop efficient methods for designing memory layouts when one address register is available (SOA) or when multiple address registers are available (GOA). It also proposes a novel technique for choosing the assignment of variables to the memory banks. This thesis motivates, proposes and evaluates heuristics for all three of these problems. For the SOA and GOA problems, the heuristics are implemented and tested on different random sample inputs, and the results obtained are compared to those obtained by prior heuristics. In addition, this thesis provides some insight into the SOA, GOA and variable partitioning problems.
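To make the single-address-register (SOA) cost concrete, here is a minimal sketch of how a memory layout can be scored against an access sequence. It is not one of the thesis's heuristics. It assumes the commonly used cost model in which moving the address register by at most one location is free (post-increment or post-decrement) and any larger jump costs one explicit instruction. The function names and the example sequence are hypothetical.

```python
from typing import Dict, List


def layout_to_offsets(layout: List[str]) -> Dict[str, int]:
    """Map each variable to its position in the memory layout."""
    return {name: i for i, name in enumerate(layout)}


def soa_cost(access_sequence: List[str], offsets: Dict[str, int]) -> int:
    """Count explicit address-register updates for one address register.

    Assumption: a move of -1, 0 or +1 between consecutive accesses is
    free; any larger move needs one extra instruction.
    """
    cost = 0
    for prev, curr in zip(access_sequence, access_sequence[1:]):
        if abs(offsets[curr] - offsets[prev]) > 1:
            cost += 1
    return cost


sequence = ["a", "b", "c", "a", "d", "b"]
naive_cost = soa_cost(sequence, layout_to_offsets(["a", "b", "c", "d"]))
better_cost = soa_cost(sequence, layout_to_offsets(["c", "a", "d", "b"]))
```

For this sequence the declaration-order layout costs three explicit updates, while the layout `c, a, d, b` costs two. An offset assignment heuristic searches for layouts of the second kind without enumerating every permutation.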

    Address optimizations for embedded processors

    Embedded processors, which are common in electronic devices, perform a limited set of tasks compared to general-purpose processor systems. They have limited resources, which must be used efficiently. Optimal utilization of program memory requires a reduction in code size, which can be achieved by eliminating unnecessary address computations, i.e., by generating an optimal offset assignment that exploits the built-in addressing modes. Single offset assignment (SOA) solutions, used for processors with one address register, start from the access sequence of variables to determine the optimal assignment. This research applies commutative transformations to statements within a basic block to alter the access sequence. Edges in the access graph are classified as breakable or unbreakable. Unbreakable edges are preferred when selecting edges for the assignment. Breakable edges are used to commutatively transform statements such that the assignment cost is reduced. The use of a modify register (MR) in some processors allows the address to be modified by the value in MR in addition to the post-increment/decrement modes. Though finding the most beneficial value of MR is common practice, this research shows that modifying the access sequence using edge fold, node swap, and path interleave techniques for an MR value of two has significant benefit. General offset assignment requires the variables in the access sequence to be partitioned among the address registers. Using the node degree in the access graph demonstrates greater benefit than using edge weights and variable frequencies. The Static Single Assignment (SSA) form of the basic block introduces new variables to the access graph, making it sparser. Sparser access graphs usually have lower assignment costs. The SSA form allows reuse of variable space based on variable lifetimes. Offset assignment solutions may be improved by incremental assignment based on uncovered edges, providing the best cost improvement. This heuristic considers improvements due to all uncovered edges. Optimization techniques have primarily been edge-based. A node-based SOA technique has been tested with commutative transformations and shown to be better than edge-based heuristics. The heuristics developed in this research perform address optimizations for embedded processors, employing new techniques that lower address computation costs.

    A systematic integration of register allocation and instruction scheduling

    In order to achieve high performance, processor architectures have become more and more complicated. As a result, compile-time optimizations have become more and more important for the effective use of a complex processor. One of the promising compile-time optimizations is the integration of register allocation and instruction scheduling based on register-reuse chains. In the previous approach, however, the generation of register-reuse chains was not completely systematic and consequently created many unnecessary dependencies that restrict instruction scheduling. This research proposes a new register allocation technique based on a systematic generation of register-reuse chains. The first phase of the proposed technique generates register-reuse chains that are optimal in the sense that no additional dependencies are created. Thus, register allocation can be done without restricting instruction scheduling. When the optimal register-reuse chains require more registers than are available, the second phase reduces the number of required registers by merging register-reuse chains. A heuristic is developed for the second phase in order to reduce the additional dependencies created by merging chains. The first step of the second phase is to derive a conflict graph in which each node corresponds to a register-reuse chain and an edge indicates that the corresponding two chains cannot be merged. By applying a graph-coloring algorithm to the conflict graph, the number of chains can be effectively reduced. The final step of the second phase is to run the 0-1 knapsack algorithm to make the number of chains exactly the same as the number of available registers. The proposed register allocator is implemented in LCC (Local C Compiler). An instruction scheduler is also implemented in LCC and then integrated with the proposed register allocator. Evaluation results show that the proposed algorithm and heuristic effectively reduce the number of necessary registers.
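The conflict-graph step described above can be sketched as follows. This is an illustration only, not the heuristic from the thesis: it uses a plain greedy coloring with a highest-degree-first ordering (an assumption), and then merges all chains that received the same color. The chain names and the conflict graph are hypothetical.

```python
from typing import Dict, List, Set


def greedy_coloring(conflicts: Dict[str, Set[str]]) -> Dict[str, int]:
    """Assign each chain the smallest color not used by a conflicting chain."""
    color: Dict[str, int] = {}
    # Visit chains with the most conflicts first; ties broken by name.
    for node in sorted(conflicts, key=lambda n: (-len(conflicts[n]), n)):
        used = {color[nb] for nb in conflicts[node] if nb in color}
        c = 0
        while c in used:
            c += 1
        color[node] = c
    return color


def merge_by_color(color: Dict[str, int]) -> List[List[str]]:
    """Group chains by color; each group becomes one merged chain."""
    groups: Dict[int, List[str]] = {}
    for chain, c in color.items():
        groups.setdefault(c, []).append(chain)
    return [sorted(members) for _, members in sorted(groups.items())]


# An edge means the two register-reuse chains cannot be merged.
conflicts = {
    "c1": {"c2"},
    "c2": {"c1", "c3"},
    "c3": {"c2"},
    "c4": set(),
}
merged = merge_by_color(greedy_coloring(conflicts))
```

In this small example four chains collapse into two merged chains, and no merged chain contains a conflicting pair. Greedy coloring does not guarantee the minimum number of colors in general, which is consistent with the abstract's use of a further knapsack step to reach exactly the number of available registers.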

    Address Register Assignment for Reducing Code Size


    Compilation and Scheduling Techniques for Embedded Systems

    Embedded applications are constantly increasing in size, which has resulted in increasing demand on designers of digital signal processors (DSPs) to meet tight memory, size and cost constraints. With this trend, the reduction of memory requirements through code compaction and variable coalescing techniques is gaining more ground. Also, as the use of multiprocessor systems-on-chip (MPSoCs) in complex embedded systems grows, problems like mapping, memory management and scheduling are gaining more attention. The first part of the dissertation deals with problems related to digital signal processors. Most modern DSPs provide multiple address registers and a dedicated address generation unit (AGU) that performs address generation in parallel with instruction execution. A careful placement of variables in memory is important in decreasing the number of address arithmetic instructions, leading to compact and efficient code. Chapters 2 and 3 present effective heuristics for the simple and the general offset assignment problems with variable coalescing. A solution based on simulated annealing is also presented. Chapter 4 presents an optimal integer linear programming (ILP) solution to the offset assignment problem with variable coalescing and operand permutation. A new approach to the general offset assignment problem is introduced. Chapter 5 presents an optimal ILP formulation and a genetic algorithm solution to the address register allocation (ARA) problem with code transformation techniques. The ARA problem is used to generate compact code for array-intensive embedded applications. In the second part of the dissertation, we study problems related to MPSoCs. MPSoCs provide the flexibility to meet the performance requirements of multimedia applications while respecting the tight embedded system constraints. MPSoC-based embedded systems often employ software-managed memories called scratch-pad memories (SPMs). Scheduling the tasks of an application on the processors and partitioning the available SPM budget among those processors are two critical issues in reducing the overall computation time. Traditionally, the task scheduling step is applied separately from the memory partitioning step. Such a decoupled approach may miss better-quality schedules. Chapters 6 and 7 present effective heuristics that integrate task allocation and SPM partitioning to further reduce the execution time of embedded applications for single- and multi-application scenarios.

    Geração e vetorização de instruções de multiplicação e acumulação para processadores DSP SIMD

    Advisor: Guido Costa Souza de Araujo. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação. Abstract: Application-specific processors, as opposed to general-purpose processors, account for an ever-increasing percentage of the processors sold each year. These processors are widely used in electronic devices such as cellular phones and digital cameras, medical monitoring devices, modems, military radar systems, electronic components in vehicles and set-top boxes, to name a few. The applications that usually run on these processors demand high performance, reduced code size and low power consumption. This thesis addresses one of the issues that arise when generating code for a class of these processors, the digital signal processors (DSPs): how the compiler can take advantage of their specialized instructions in order to reduce the size and improve the performance of the generated code. A method is proposed for the detection and generation of multiply-and-accumulate instructions, which are very common in the applications of these processors. Also presented in this work is a method that makes it possible to exploit the parallel execution of code on two or more functional units when these are capable of operating simultaneously on different data. The methods presented here allow for an aggressive use of multiply-and-accumulate instructions; to accomplish this goal they rely on data flow analysis algorithms and on loop restructuring techniques. No other work is known that addresses this problem in the way it is dealt with in this thesis. Degree: Master's in Computer Science.