7 research outputs found

    Survey on Combinatorial Register Allocation and Instruction Scheduling

    Register allocation (mapping variables to processor registers or memory) and instruction scheduling (reordering instructions to increase instruction-level parallelism) are essential tasks for generating efficient assembly code in a compiler. In the last three decades, combinatorial optimization has emerged as an alternative to traditional, heuristic algorithms for these two tasks. Combinatorial optimization approaches can deliver optimal solutions according to a model, can precisely capture trade-offs between conflicting decisions, and are more flexible, at the expense of increased compilation time. This paper provides an exhaustive literature review and a classification of combinatorial optimization approaches to register allocation and instruction scheduling, with a focus on the techniques most applied in this context: integer programming, constraint programming, partitioned Boolean quadratic programming, and enumeration. Researchers in compilers and combinatorial optimization can benefit from identifying developments, trends, and challenges in the area; compiler practitioners may discern opportunities and grasp the potential benefit of applying combinatorial optimization.
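To make the "optimal according to a model" idea concrete, here is a minimal sketch of the enumeration technique named in the abstract. It is not taken from the survey: the model (minimize the number of spilled variables, subject to interfering variables not sharing a register), the function name `allocate_registers`, and the example interference graph are all assumptions chosen for illustration.

```python
from itertools import product


def allocate_registers(variables, interference, num_registers):
    """Exhaustively search for an assignment that spills the fewest variables.

    Each variable receives a register index in 0..num_registers-1, or None
    when it is spilled to memory. Two interfering variables may not share
    a register. (Hypothetical toy model, exponential in len(variables).)
    """
    choices = list(range(num_registers)) + [None]
    best, best_spills = None, len(variables) + 1
    for assignment in product(choices, repeat=len(variables)):
        mapping = dict(zip(variables, assignment))
        # Reject assignments where interfering variables share a register.
        if any(mapping[a] is not None and mapping[a] == mapping[b]
               for a, b in interference):
            continue
        spills = sum(1 for reg in assignment if reg is None)
        if spills < best_spills:
            best, best_spills = mapping, spills
    return best, best_spills


# Assumed example: a, b, c are live at the same time; d only overlaps with c.
variables = ["a", "b", "c", "d"]
interference = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
mapping, spills = allocate_registers(variables, interference, num_registers=2)
print(spills)  # 1: three mutually interfering variables cannot fit in two registers
```

The search is exact but exponential, which mirrors the trade-off the abstract describes: an optimal answer for the chosen model in exchange for longer compilation time.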

    tlCell: a software transactional memory for the cell broadband engine architecture

    Dissertation presented at the Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa for the degree of Master in Computer Engineering (Mestre em Engenharia Informática). Computers have evolved exponentially over the last decade. Performance has been the main goal, pursued by raising processor frequencies, an approach that is no longer feasible because of the excessive power consumption of current processors. The Cell Broadband Engine architecture set out to provide high computational capacity with low power consumption. The result is an architecture with heterogeneous multiprocessors and a unique memory distribution, aimed at high performance and at reducing hardware complexity in order to lower production cost. Concurrency and parallelism techniques are expected to increase the performance of this architecture; however, the high-performance solutions presented so far are always very specific, and because of the architecture's innovative design and memory distribution it is still difficult to provide tools that expose concurrency and parallelism as an abstraction layer. Software Transactional Memory is a programming model that proposes this level of abstraction and has been gaining popularity, with several implementations already performing close to specific fine-grained solutions. The possibility of using Software Transactional Memory on this innovative architecture, developing a tool that abstracts the programmer from memory consistency and management, is appealing. This document specifies a deferred-update Software Transactional Memory platform for the Cell Broadband Engine architecture that takes advantage of the computational capacity of the Synergistic Processing Elements (SPEs) using commit-time locks. Two different models are proposed, fully local and multi-buffered, in order to study the implications of the choices made in the platform's design.
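The deferred-update, commit-time-locking scheme mentioned in this abstract can be illustrated with a small sketch. This is not tlCell's code and does not model the Cell architecture or its SPEs; `TVar`, `Transaction` and `atomically` are hypothetical names, and the design is a simplified, generic variant in which writes are buffered privately and locks are taken only while committing.

```python
import threading


class TVar:
    """A transactional variable: a value, a version counter, and a lock."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()


class Transaction:
    def __init__(self):
        self.read_versions = {}  # TVar -> version seen at first read
        self.write_buffer = {}   # TVar -> pending value (the deferred update)

    def read(self, tvar):
        if tvar in self.write_buffer:  # read our own pending write
            return self.write_buffer[tvar]
        self.read_versions.setdefault(tvar, tvar.version)
        return tvar.value

    def write(self, tvar, value):
        self.write_buffer[tvar] = value  # shared memory is not touched yet

    def commit(self):
        # Locks are acquired only now, in a fixed order to avoid deadlock.
        locked = sorted(self.write_buffer, key=id)
        for tvar in locked:
            tvar.lock.acquire()
        try:
            # Validate: abort if anything we read has changed since.
            # Simplification: read-only variables are checked without a lock.
            for tvar, seen in self.read_versions.items():
                if tvar.version != seen:
                    return False
            # Publish the buffered writes.
            for tvar, value in self.write_buffer.items():
                tvar.value = value
                tvar.version += 1
            return True
        finally:
            for tvar in locked:
                tvar.lock.release()


def atomically(body):
    """Run body(tx) repeatedly until its transaction commits."""
    while True:
        tx = Transaction()
        result = body(tx)
        if tx.commit():
            return result


# Usage: move 30 units between two shared variables.
a, b = TVar(100), TVar(0)

def transfer(tx):
    tx.write(a, tx.read(a) - 30)
    tx.write(b, tx.read(b) + 30)

atomically(transfer)
print(a.value, b.value)  # 70 30
```

Because updates stay in the private write buffer until commit, an aborted transaction needs no rollback; it is simply discarded and retried.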

    Performance analysis and optimizations of the ArchC simulators

    Advisors: Edson Borin, Rodolfo Jardim de Azevedo. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: Automatic generation has the great advantage of automating a process, which reduces the time spent on that step and avoids common mistakes. However, what is the advantage of reducing the time of one step if doing so may increase the time of the remaining steps? In digital circuit design, architecture description languages emerged and made possible tools that automatically generate simulators, compilers, and other tools, which are used to evaluate an architecture before any hardware for it exists. Automatically generated simulators run applications to examine their behavior and that of the architecture under design. But if the generated simulator is not efficient, simulation time increases and can exceed the gain achieved by automatic generation, canceling its benefits. How, then, can the efficiency of the generated simulator be checked? A common option is to compare the generated simulator with other existing simulators; the alternative is to write a simulator by hand for comparison. The first option requires that the simulators be similar, and the second defeats the purpose of automatic generation. In this context, we developed a methodology for evaluating automatically generated simulators through code profiling. This allowed the identification of performance bottlenecks and, consequently, the development of optimizations in code generation. With the optimizations, we generated a MIPS simulator 1.48 times better.
    Degree: Mestrado (master's) in Ciência da Computação (Computer Science). 01-P-3951/2011, 01-P-1965/2012, CAPE

    Design of Digital SoC for Operation at High Temperatures

    There is a growing demand for Systems-on-Chip that integrate microprocessors, on-chip memories, data converters and a variety of sensors and that operate reliably at high temperatures. For instance, the modern aircraft industry demands microcontrollers and electric motors that operate at high temperatures in order to replace present hydraulic structures. This thesis explains how to design digital SoCs that are capable of reliable operation at high temperatures. The essential part of this thesis focuses on the design, implementation, fabrication and high-temperature measurements of an on-chip latch-based SRAM, a PowerPC e200 based microcontroller, a digital temperature sensor and a Flash A/D converter. Embedded on-chip SRAM modules are among the most important components in a modern SoC. We analyze thermally caused failures in the 6T SRAM cell and elaborate on its reliability. Further, we show that latch-based SRAM modules are preferable to 6T SRAM for reliable operation beyond 150 °C, by comparing two 1 kB SRAM modules implemented in a standard 0.18 µm SOI CMOS process. We demonstrate reliable SRAM operation at 275 °C (fmax = 10 MHz, Ptot = 400 mW), which is by far the highest reported operating temperature for a digital on-chip SRAM module. Designing SoCs for reliable operation at elevated temperatures is a significant challenge, due to increased static leakage current, reduced carrier mobility, and increased electromigration. We propose to customize a PowerPC e200 based SoC by using a dynamically reconfigurable clock frequency, exhaustive clock gating, and an electromigration-resistant power distribution network. We fabricated a 20x9 mm² chip implementing this design in a 0.35 µm bulk CMOS process. We present the world's first PowerPC based SoC for reliable operation at 225 °C (fmax = 30 MHz, Ptot = 1.2 W). This design outperforms previously reported PowerPC based SoCs, which are not operational at temperatures beyond 125 °C.
    On-chip measurements of the p-n junction temperature allow reliability improvements for an SoC that operates at high temperatures. Low-resolution temperature measurements are efficiently used for adjusting the optimal operating frequency and supply voltage. We used the Time-to-Digital conversion technique to design a fully digital temperature sensor. We designed and simulated a fully digital 5-bit temperature sensor for temperature measurements with 10 °C resolution between Tj,min = -45 °C and Tj,max = 125 °C. Further, using a single-clock-cycle Time-to-Digital conversion technique, we present a fully digital 5-bit pulse-based Flash ADC implemented in a 0.18 µm bulk CMOS process. Measurement results demonstrate a state-of-the-art power efficiency of 450 fJ/conv (fmax = 83 MHz, Ptot = 900 µW).

    Design and implementation of WCET analyses : including a case study on multi-core processors with shared buses

    For safety-critical real-time embedded systems, worst-case execution time (WCET) analysis, which determines an upper bound on the possible execution times of a program, is an important part of system verification. Multi-core processors share resources (e.g. buses and caches) between multiple processor cores and thus complicate WCET analysis, because the execution time of a program on one core depends significantly on the programs executed in parallel on the other cores. We refer to this phenomenon as shared-resource interference. This thesis proposes a novel way of modeling shared-resource interference during WCET analysis. It enables an efficient analysis, as it considers only one processor core at a time, and it is sound for hardware platforms exhibiting timing anomalies. Moreover, this thesis demonstrates how to realize a timing-compositional verification on top of the proposed modeling scheme. In this way, this thesis closes the gap between modern hardware platforms, which exhibit timing anomalies, and existing schedulability analyses, which rely on timing compositionality. In addition, this thesis proposes a novel method for calculating an upper bound on the amount of interference that a given processor core can generate in any time interval of at most a given length. Our experiments demonstrate that the novel method is more precise than existing methods.
    Deutsche Forschungsgemeinschaft (DFG) as part of the Transregional Collaborative Research Centre SFB/TR 14 (AVACS)

    Modeling and automated synthesis of reconfigurable interfaces

    Stefan Ihmor. Paderborn, Univ., Diss., 200

    Optimizing Simulation In Multiprocessor Platforms Using Dynamic-compiled Simulation

    Contemporary SoC design involves the proper selection of cores from a reference platform. Such selection implies the design exploration of CPUs, which requires simulation platforms with high performance and flexibility. Applying retargetable instruction-set simulation tools in this environment can simplify the design of new architectures. Increasing system complexity makes the traditional approach to simulation inefficient for today's architectures. Dynamic-compiled instruction-set simulation compiles blocks of application code at runtime to accelerate the simulation. This paper presents a retargetable dynamic-compiled simulator that improves performance in multiprocessor platforms. Three architectures were modeled (MIPS, SPARC and PowerPC) and tested in platforms with 1, 2, 4 and 8 processors. For large programs, performance on platforms with dynamic-compiled simulators was 3 times better than with interpreted simulators. Outside the platforms, dynamic-compiled simulators running single-core programs reached 139 million instructions per second on average. © 2012 IEEE.
    Brazilian Computer Society (SBC), CAPES, FAPERJ, Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq), Ministerio da Ciencia, Tecnologia e Inovacao
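As a rough illustration of the dynamic-compiled simulation idea in this entry, the sketch below translates each basic block of a toy program into a Python function the first time it is reached and reuses it from a cache afterwards. It is not the paper's simulator: the four-instruction toy ISA, the tuple encoding, and the names `translate_block` and `run` are assumptions made purely for the example.

```python
def translate_block(program, pc):
    """Translate the basic block starting at pc into a Python function.

    The generated function updates the register list in place and returns
    the next program counter, or None when the program halts.
    """
    lines = ["def block(regs):"]
    while True:
        op, *args = program[pc]
        if op == "addi":            # rd = rs + immediate
            rd, rs, imm = args
            lines.append(f"    regs[{rd}] = regs[{rs}] + {imm}")
        elif op == "add":           # rd = rs + rt
            rd, rs, rt = args
            lines.append(f"    regs[{rd}] = regs[{rs}] + regs[{rt}]")
        elif op == "bne":           # branch if rs != rt; ends the block
            rs, rt, target = args
            lines.append(
                f"    return {target} if regs[{rs}] != regs[{rt}] else {pc + 1}")
            break
        elif op == "halt":          # ends the block and the program
            lines.append("    return None")
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc += 1
    namespace = {}
    exec(compile("\n".join(lines), f"<block@{pc}>", "exec"), namespace)
    return namespace["block"]


def run(program, num_regs=4):
    """Execute the program block by block, compiling each block only once."""
    regs = [0] * num_regs
    cache = {}                      # start pc -> compiled block
    pc = 0
    while pc is not None:
        if pc not in cache:
            cache[pc] = translate_block(program, pc)
        pc = cache[pc](regs)
    return regs, len(cache)


# Toy program: sum the integers 1..5 into r2 (r0 is never written, so it stays 0).
PROGRAM = [
    ("addi", 3, 0, 5),   # r3 = 5 (loop limit)
    ("addi", 1, 1, 1),   # r1 += 1
    ("add", 2, 2, 1),    # r2 += r1
    ("bne", 1, 3, 1),    # loop back to pc 1 while r1 != r3
    ("halt",),
]
regs, blocks = run(PROGRAM)
print(regs, blocks)  # [0, 5, 15, 5] 3
```

The loop body is translated once and then executed from the cache on every later iteration, whereas an interpreted simulator would decode each instruction again every time it runs.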