3,420 research outputs found
Improving latency tolerance of multithreading through decoupling
The increasing hardware complexity of dynamically scheduled superscalar processors may compromise the scalability of this organization to make an efficient use of future increases in transistor budget. SMT processors, designed over a superscalar core, are therefore directly concerned by this problem. The article presents and evaluates a novel processor microarchitecture which combines two paradigms: simultaneous multithreading and access/execute decoupling. Since its decoupled units issue instructions in order, this architecture is significantly less complex, in terms of critical path delays, than a centralized out-of-order design, and it is more effective for future growth in issue-width and clock speed. We investigate how both techniques complement each other. Since decoupling features an excellent memory latency hiding efficiency, the large amount of parallelism exploited by multithreading may be used to hide the latency of functional units and keep them fully utilized. The study shows that, by adding decoupling to a multithreaded architecture, fewer threads are needed to achieve maximum throughput. Therefore, in addition to the obvious hardware complexity reduction, it places lower demands on the memory system. The study also reveals that multithreading by itself exhibits little memory latency tolerance. Results suggest that most of the latency hiding effectiveness of SMT architectures comes from the dynamic scheduling. On the other hand, decoupling is very effective at hiding memory latency. An increase in the cache miss penalty from 1 to 32 cycles reduces the performance of a 4-context multithreaded decoupled processor by less than 2 percent. For the nondecoupled multithreaded processor, the loss of performance is about 23 percent.Peer ReviewedPostprint (published version
Instruction fetch architectures and code layout optimizations
The design of higher performance processors has been following two major trends: increasing the pipeline depth to allow faster clock rates, and widening the pipeline to allow parallel execution of more instructions. Designing a higher performance processor implies balancing all the pipeline stages to ensure that overall performance is not dominated by any of them. This means that a faster execution engine also requires a faster fetch engine, to ensure that it is possible to read and decode enough instructions to keep the pipeline full and the functional units busy. This paper explores the challenges faced by the instruction fetch stage for a variety of processor designs, from early pipelined processors, to the more aggressive wide issue superscalars. We describe the different fetch engines proposed in the literature, the performance issues involved, and some of the proposed improvements. We also show how compiler techniques that optimize the layout of the code in memory can be used to improve the fetch performance of the different engines described Overall, we show how instruction fetch has evolved from fetching one instruction every few cycles, to fetching one instruction per cycle, to fetching a full basic block per cycle, to several basic blocks per cycle: the evolution of the mechanism surrounding the instruction cache, and the different compiler optimizations used to better employ these mechanisms.Peer ReviewedPostprint (published version
Software trace cache
We explore the use of compiler optimizations, which optimize the layout of instructions in memory. The target is to enable the code to make better use of the underlying hardware resources regardless of the specific details of the processor/architecture in order to increase fetch performance. The Software Trace Cache (STC) is a code layout algorithm with a broader target than previous layout optimizations. We target not only an improvement in the instruction cache hit rate, but also an increase in the effective fetch width of the fetch engine. The STC algorithm organizes basic blocks into chains trying to make sequentially executed basic blocks reside in consecutive memory positions, then maps the basic block chains in memory to minimize conflict misses in the important sections of the program. We evaluate and analyze in detail the impact of the STC, and code layout optimizations in general, on the three main aspects of fetch performance; the instruction cache hit rate, the effective fetch width, and the branch prediction accuracy. Our results show that layout optimized, codes have some special characteristics that make them more amenable for high-performance instruction fetch. They have a very high rate of not-taken branches and execute long chains of sequential instructions; also, they make very effective use of instruction cache lines, mapping only useful instructions which will execute close in time, increasing both spatial and temporal locality.Peer ReviewedPostprint (published version
The synergy of multithreading and access/execute decoupling
This work presents and evaluates a novel processor microarchitecture which combines two paradigms: access/execute decoupling and simultaneous multithreading. We investigate how both techniques complement each other: while decoupling features an excellent memory latency hiding efficiency, multithreading supplies the in-order issue stage with enough ILP to hide the functional unit latencies. Its partitioned layout, together with its in-order issue policy makes it potentially less complex, in terms of critical path delays, than a centralized out-of-order design, to support future growths in issue-width and clock speed. The simulations show that by adding decoupling to a multithreaded architecture, its miss latency tolerance is sharply increased and in addition, it needs fewer threads to achieve maximum throughput, especially for a large miss latency. Fewer threads result in a hardware complexity reduction and lower demands on the memory system, which becomes a critical resource for large miss latencies, since bandwidth may become a bottleneck.Peer ReviewedPostprint (published version
Quantifying the benefits of SPECint distant parallelism in simultaneous multithreading architectures
We exploit the existence of distant parallelism that future compilers could detect and characterise its performance under simultaneous multithreading architectures. By distant parallelism we mean parallelism that cannot be captured by the processor instruction window and that can produce threads suitable for parallel execution in a multithreaded processor. We show that distant parallelism can make feasible wider issue processors by providing more instructions from the distant threads, thus better exploiting the resources from the processor in the case of speeding up single integer applications. We also investigate the necessity of out-of-order processors in the presence of multiple threads of the same program. It is important to notice at this point that the benefits described are totally orthogonal to any other architectural techniques targeting a single thread.Peer ReviewedPostprint (published version
Fast simulation techniques for microprocessor design space exploration
Designing a microprocessor is extremely time-consuming. Computer architects heavily rely on architectural simulators, e.g., to drive high-level design decisions during early stage design space exploration. The benefit of architectural simulators is that they yield relatively accurate performance results, are highly parameterizable and are very flexible to use. The downside, however, is that they are at least three or four orders of magnitude slower than real hardware execution. The current trend towards multicore processors exacerbates the problem; as the number of cores on a multicore processor increases, simulation speed has become a major concern in computer architecture research and development.
In this dissertation, we propose and evaluate two simulation techniques that reduce the simulation time significantly: statistical simulation and interval simulation. Statistical simulation speeds up the simulation by reducing the number of dynamically executed instructions. First, we collect a number of program execution characteristics into a statistical profile. From this profile we can generate a synthetic trace that exhibits the same execution behavior but which has a much shorter trace length as compared to the original trace. Simulating this synthetic trace then yields a performance estimate. Interval simulation raises the level of abstraction in architectural simulation; it replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model builds on insights from interval analysis: miss events divide the smooth streaming of instructions into so called intervals. The model drives the timing by analyzing the type of the miss events and their latencies, instead of tracking the individual instructions as they propagate through the pipeline stages
DIA: A complexity-effective decoding architecture
Fast instruction decoding is a true challenge for the design of CISC microprocessors implementing variable-length instructions. A well-known solution to overcome this problem is caching decoded instructions in a hardware buffer. Fetching already decoded instructions avoids the need for decoding them again, improving processor performance. However, introducing such special--purpose storage in the processor design involves an important increase in the fetch architecture complexity. In this paper, we propose a novel decoding architecture that reduces the fetch engine implementation cost. Instead of using a special-purpose hardware buffer, our proposal stores frequently decoded instructions in the memory hierarchy. The address where the decoded instructions are stored is kept in the branch prediction mechanism, enabling it to guide our decoding architecture. This makes it possible for the processor front end to fetch already decoded instructions from the memory instead of the original nondecoded instructions. Our results show that using our decoding architecture, a state-of-the-art superscalar processor achieves competitive performance improvements, while requiring less chip area and energy consumption in the fetch architecture than a hardware code caching mechanism.Peer ReviewedPostprint (published version
Recommended from our members
Instruction history management for high-performance microprocessors
textHistory-driven dynamic optimization is an important factor in improving
instruction throughput in future high-performance microprocessors. Historybased
techniques have the ability to improve instruction-level parallelism by
breaking program dependencies, eliminating long-latency microarchitecture
operations, and improving prioritization within the microarchitecture. However,
a combination of factors, such as wider issue widths, smaller transistors,
larger die area, and increasing clock frequency, has led to microprocessors that
are sensitive to both wire delays and energy consumption. In this environment,
the global structures and long-distance communications that characterize current
history data management are limiting instruction throughput.
This dissertation proposes the ScatterFlow Framework for Instruction
History Management. Execution history management tasks, such as history
data storage, access, distribution, collection, and modification, are partitioned
and dispersed throughout the instruction execution pipeline. History data
packets are then associated with active instructions and flow with the instructions
as they execute, encountering the history management tasks along the
way. Between dynamic instances of the instructions, the history data packets
reside in trace-based history storage that is synchronized with the instruction
trace cache. Compared to traditional history data management, this ScatterFlow
method improves instruction coverage, increases history data access
bandwidth, shortens communication distances, improves history data accuracy
in many cases, and decreases the effective history data access time.
A comparison of general history management effectiveness between the
ScatterFlow Framework and traditional hardware tables shows that the ScatterFlow
Framework provides superior history maturity and instruction coverage.
The unique properties that arise due to trace-based history storage and
partitioned history management are analyzed, and novel design enhancements
are presented to increase the usefulness of instruction history data within the
ScatterFlow Framework.
To demonstrate the potential of the proposed framework, specific dynamic
optimization techniques are implemented using the ScatterFlow Framework.
These illustrative examples combine the history capture advantages
with the access latency improvements while exhibiting desirable dynamic energy
consumption properties. Compared to a traditional table-based predictor,
performing ScatterFlow value prediction improves execution time and reduces
dynamic energy consumption. In other detailed examples, ScatterFlowenabled
cluster assignment demonstrates improved execution time over previous
cluster assignment schemes, and ScatterFlow instruction-level profiling
detects more useful execution traits than traditional fixed-size and infinite-size
hardware tables.Electrical and Computer Engineerin
- …