749 research outputs found
Enlarging instruction streams
The stream fetch engine is a high-performance fetch architecture based on the concept of an instruction stream. We call a sequence of instructions from the target of a taken branch to the next taken branch, potentially containing multiple basic blocks, a stream. The long length of instruction streams makes it possible for the stream fetch engine to provide a high fetch bandwidth and to hide the branch predictor access latency, leading to performance results close to a trace cache at a lower implementation cost and complexity. Therefore, enlarging instruction streams is an excellent way to improve the stream fetch engine. In this paper, we present several hardware and software mechanisms focused on enlarging those streams that finalize at particular branch types. However, our results point out that focusing on particular branch types is not a good strategy due to Amdahl's law. Consequently, we propose the multiple-stream predictor, a novel mechanism that deals with all branch types by combining single streams into long virtual streams. This proposal tolerates the prediction table access latency without requiring the complexity caused by additional hardware mechanisms like prediction overriding. Moreover, it provides high-performance results which are comparable to state-of-the-art fetch architectures but with a simpler design that consumes less energy.Peer ReviewedPostprint (published version
Worst-Case Execution Time Analysis of Predicated Architectures
The time-predictable design of computer architectures for the use in (hard) real-time systems is becoming more and more important, due to the increasing complexity of modern computer architectures. The design of predictable processor pipelines recently received considerable attention. The goal here is to find a trade-off between predictability and computing power.
Branches and jumps are particularly problematic for high-performance processors. For one, branches are executed late in the pipeline. This either leads to high branch penalties (flushing) or complex software/hardware techniques (branch predictors). Another side-effect of branches is that they make it difficult to exploit instruction-level parallelism due to control dependencies.
Predicated computer architectures allow to attach a predicate to the instructions in a program. An instruction is then only executed when the predicate evaluates to true and otherwise behaves like a simple nop instruction. Predicates can thus be used to convert control dependencies into data dependencies, which helps to address both of the aforementioned problems.
A downside of predicated instructions is the precise worst-case execution time (WCET) analysis of programs making use of them. Predicated memory accesses, for instance, may or may not have an impact on the processor\u27s cache and thus need to be considered by the cache analysis. Predication potentially has an impact on all analysis phases of a WCET analysis tool. We thus explore a preprocessing step that explicitly unfolds the control-flow graph, which allows us to apply standard analyses that are themselves not aware of predication
Recommended from our members
JavaFlow : a Java DataFlow Machine
textThe JavaFlow, a Java DataFlow Machine is a machine design concept implementing a Java Virtual Machine aimed at addressing technology roadmap issues along with the ability to effectively utilize and manage very large numbers of processing cores. Specific design challenges addressed include: design complexity through a common set of repeatable structures; low power by featuring unused circuits and ability to power off sections of the chip; clock propagation and wire limits by using locality to bring data to processing elements and a Globally Asynchronous Locally Synchronous (GALS) design; and reliability by allowing portions of the design to be bypassed in case of failures. A Data Flow Architecture is used with multiple heterogeneous networks to connect processing elements capable of executing a single Java ByteCode instruction. Whole methods are cached in this DataFlow fabric, and the networks plus distributed intelligence are used for their management and execution. A mesh network is used for the DataFlow transfers; two ordered networks are used for management and control flow mapping; and multiple high speed rings are used to access the storage subsystem and a controlling General Purpose Processor (GPP). Analysis of benchmarks demonstrates the potential for this design concept. The design process was initiated by analyzing SPEC JVM benchmarks which identified a small number methods contributing to a significant percentage of the overall ByteCode operations. Additional analysis established static instruction mixes to prioritize the types of processing elements used in the DataFlow Fabric. The overall objective of the machine is to provide multi-threading performance for Java Methods deployed to this DataFlow fabric. With advances in technology it is envisioned that from 1,000 to 10,000 cores/instructions could be deployed and managed using this structure. This size of DataFlow fabric would allow all the key methods from the SPEC benchmarks to be resident. A baseline configuration is defined with a compressed dataflow structure and then compared to multiple configurations of instruction assignments and clock relationships. Using a series of methods from the SPEC benchmark running independently, IPC (Instructions per Cycle) performance of the sparsely populated heterogeneous structure is 40% of the baseline. The average ratio of instructions to required nodes is 3.5. Innovative solutions to the loading and management of Java methods along with the translation from control flow to DataFlow structure are demonstrated.Electrical and Computer Engineerin
An evaluation of the TRIPS computer system
The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model in which blocks are composed of dataflow instructions. The goal of the TRIPS design is to mine concurrency for high performance while tolerating emerging technology scaling challenges, such as increasing wire delays and power consumption. This paper evaluates how well TRIPS meets this goal through a detailed ISA and performance analysis. We compare performance, using cycles counts, to commercial processors. On SPEC CPU2000, the Intel Core 2 outperforms compiled TRIPS code in most cases, although TRIPS matches a Pentium 4. On simple benchmarks, compiled TRIPS code outperforms the Core 2 by 10% and hand-optimized TRIPS code outperforms it by factor of 3. Compared to conventional ISAs, the block-atomic model provides a larger instruction window, increases concurrency at a cost of more instructions executed, and replaces register and memory accesses with more efficient direct instruction-to-instruction communication. Our analysis suggests ISA, microarchitecture, and compiler enhancements for addressing weaknesses in TRIPS and indicates that EDGE architectures have the potential to exploit greater concurrency in future technologies
Stack-less SIMT reconvergence at low cost
Parallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose a technique to handle SIMT control divergence that operates in constant space and handles indirect jumps and recursion. We describe a possible implementation which leverage the existing memory divergence management unit, ensuring a low hardware cost. In terms of performance, this solution is at least as efficient as existing techniques
Low level conditional move optimization
The high level optimizations are becoming more and more sophisticated, the importance of low level optimizations should not be underestimated. Due to the changes in the inner architecture of modern processors, some optimization techniques may become more or less effective. Existing techniques need, from time to time, to be reconsidered, and new techniques, targeting these modern architectures, may emerge. Due to the growing instruction pipeline of modern processors, recovering after branch mis-predictions is becoming more expensive, and so avoiding that is becoming more critical. In this paper we introduce a novel approach to branch elimination using conditional move operations, namely the CMOVcc instruction group. The inappropriate use of these instructions may result in sensible performance regression, but in many cases they outperform the sequence of a conditional jump and an unconditional move instruction. Our goal is to analyze the usage of CMOVcc in different contexts on modern processors, and based on these results, propose a technique to automatically decide whether the conditional move or the sequence of a conditional jump and an unconditional move should be performed in a given situation
Design and evaluation of a VLIW processor for real-time systems
Tese (doutorado) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia de Automação e Sistemas, Florianópolis, 2016.Atualmente, aplicações de tempo estão tornando-se cada vez mais complexas e, conforme os requisitos destes sistemas aumentam, maior é a demanda por capacidade de processamento. Contudo, o correto funcionamento destas aplicações não está em função somente da correta resposta lógica, mas também no tempo que ela é produzida. O projeto de processadores de propósito geral gera dificuldades para análises de tempo real devido ao seu comportamento não determinista causado pelo uso de memórias cache, previsores de fluxo dinâmicos, execução especulativa e fora de ordem. Nesta tese, investiga-se uma arquitetura de processador Very-Long Instruction Word (VLIW) especificamente projetada para sistemas de tempo real considerando sua análise do pior tempo de computação (Worst-case Execution Time WCET). Técnicas para obtenção do WCET para máquinas VLIW são consideradas e quantifica-se a importância de técnicas de hardware como previsor de fluxo estático, predicação, bem como velocidade do processador para instruções complexas como acesso a memória e multiplicação. Arquitetura de memória não faz parte do escopo deste trabalho e para tal utilizamos uma estrutura determinista formada por uma memória cache com mapeamento direto para instruções e uma memória de rascunho (scratchpad) para dados. Nós também consideramos a implementação em VHDL do protótipo para inferir suas características temporais mantendo compatibilidade com o conjunto de instruções (ISA) HP VLIW ST231. Em termos de avaliação, foi utilizado um conjunto representativo de código exemplos da Universidade de Mälardalen que é amplamente utilizado em avaliações de sistemas de tempo real.Abstract : Nowadays, many real-time applications are very complex and as the complexity and the requirements of those applications become more demanding, more hardware processing capacity is necessary. The correct functioning of real-time systems depends not only on the logically correct response, but also on the time when it is produced. General purpose processor design fails to deliver analyzability due to their non-deterministic behavior caused by the use of cache memories, dynamic branch prediction, speculative execution and out-of-order pipelines. In this thesis, we design and evaluate the performance of VLIW (Very Long Instruction Word) architectures for real-time systems with an in-order pipeline considering WCET (Worst-case Execution Time) performance. Techniques on obtaining the WCET of VLIW machines are also considered and we make a quantification on how important are hardware techniques such as static branch prediction, predication, pipeline speed of complex operations such as memory access and multiplication for high-performance real-time systems. The memory hierarchy is out of scope of this thesis and we used a classic deterministic structure formed by a direct mapped instruction cache and a data scratchpad memory. A VLIW prototype was implemented in VHDL from scratch considering the HP VLIW ST231 ISA. We also show some compiler insights and we use a representative subset of the Mälardalen s WCET benchmarks for validation and performance quantification. Supporting our objective to investigate and evaluate hardware features which reconcile determinism and performance, we made the following contributions: design space investigation and evaluation regarding VLIW processors, complete WCET analysis for the proposed design, complete VHDL design and timing characterization, detailed branch architecture, low-overhead full-predication system for VLIW processors
The Impact of Java Applications at Microarchitectural Level from Branch Prediction Perspective
The portability, the object-oriented and distributed programming models, multithreading support and automatic garbage collection are features that make Java very attractive for application developers. The main goal of this paper consists in pointing out the impact of Java applications at microarchitectural level from two perspectives: unbiased branches and indirect jumps/calls, such branches limiting the ceiling of dynamic branch prediction and causing significant performance degradation. Therefore, accurately predicting this kind of branches remains an open problem. The simulation part of the paper mainly refers to determining the context length influence on the percentage of unbiased branches from Java applications, the prediction accuracy and the usage degree obtained using a Fast Path-Based Perceptron predictor. We realize a comparison with C/C++ application behavior from unbiased branches perspective. We also analyze some Java testing programs, built using design patterns or including inheritance, polymorphism, backtracking and recursivity, in order to determine the features of indirect branches, the arity of each indirect jump and the prediction accuracy using the Target Cache predictor
- …