41 research outputs found
An evaluation of the TRIPS computer system
The TRIPS system employs a new instruction set architecture (ISA) called Explicit Data Graph Execution (EDGE) that renegotiates the boundary between hardware and software to expose and exploit concurrency. EDGE ISAs use a block-atomic execution model in which blocks are composed of dataflow instructions. The goal of the TRIPS design is to mine concurrency for high performance while tolerating emerging technology scaling challenges, such as increasing wire delays and power consumption. This paper evaluates how well TRIPS meets this goal through a detailed ISA and performance analysis. We compare performance, using cycles counts, to commercial processors. On SPEC CPU2000, the Intel Core 2 outperforms compiled TRIPS code in most cases, although TRIPS matches a Pentium 4. On simple benchmarks, compiled TRIPS code outperforms the Core 2 by 10% and hand-optimized TRIPS code outperforms it by factor of 3. Compared to conventional ISAs, the block-atomic model provides a larger instruction window, increases concurrency at a cost of more instructions executed, and replaces register and memory accesses with more efficient direct instruction-to-instruction communication. Our analysis suggests ISA, microarchitecture, and compiler enhancements for addressing weaknesses in TRIPS and indicates that EDGE architectures have the potential to exploit greater concurrency in future technologies
Recommended from our members
Scalable hardware memory disambiguation
This dissertation deals with one of the long-standing problems in Computer Architecture
– the problem of memory disambiguation. Microprocessors typically reorder
memory instructions during execution to improve concurrency. Such microprocessors
use hardware memory structures for memory disambiguation, known as LoadStore
Queues (LSQs), to ensure that memory instruction dependences are satisfied
even when the memory instructions execute out-of-order. A typical LSQ implementation
(circa 2006) holds all in-flight memory instructions in a physically centralized
LSQ and performs a fully associative search on all buffered instructions to ensure
that memory dependences are satisfied. These LSQ implementations do not scale
because they use large, fully associative structures, which are known to be slow and
power hungry. The increasing trend towards distributed microarchitectures further
exacerbates these problems. As on-chip wire delays increase and high-performance
processors become necessarily distributed, centralized structures such as the LSQ
can limit scalability.
This dissertation describes techniques to create scalable LSQs in both centralized
and distributed microarchitectures. The problems and solutions described
in this thesis are motivated and validated by real system designs. The dissertation
starts with a description of the partitioned primary memory system of the TRIPS
processor, of which the LSQ is an important component, and then through a series
of optimizations describes how the power, area, and centralization problems
of the LSQ can be solved with minor performance losses (if at all) even for large
number of in flight memory instructions. The four solutions described in this dissertation
— partitioning, filtering, late binding and efficient overflow management —
enable power-, area-efficient, distributed and scalable LSQs, which in turn enable
aggressive large-window processors capable of simultaneously executing thousands
of instructions.
To mitigate the power problem, we replaced the power-hungry, fully associative
search with a power-efficient hash table lookup using a simple address-based
Bloom filter. Bloom filters are probabilistic data structures used for testing set
membership and can be used to quickly check if an instruction with the same data
address is likely to be found in the LSQ without performing the associative search.
Bloom filters typically eliminate more than 80% of the associative searches and they
are highly effective because in most programs, it is uncommon for loads and stores
to have the same data address and be in execution simultaneously.
To rectify the area problem, we observe the fact that only a small fraction
of all memory instructions are dependent, that only such dependent instructions
need to be buffered in the LSQ, and that these instructions need to be in the LSQ
only for certain parts of the pipelined execution. We propose two mechanisms to
exploit these observations. The first mechanism, area filtering, is a hardware mechanism
that couples Bloom filters and dependence predictors to dynamically identify
and buffer only those instructions which are likely to be dependent. The second
mechanism, late binding, reduces the occupancy and hence size of the LSQ. Both of
these optimizations allows the number of LSQ slots to be reduced by up to one-half
compared to a traditional organization without any performance degradation.
Finally, we describe a new decentralized LSQ design for handling LSQ structural
hazards in distributed microarchitectures. Decentralization of LSQs, and to
a large extent distributed microarchitectures with memory speculation, has proved
to be impractical because of the high performance penalties associated with the
mechanisms for dealing with hazards. To solve this problem, we applied classic
flow-control techniques from interconnection networks for handling resource con-
flicts. The first method, memory-side buffering, buffers the overflowing instructions
in a separate buffer near the LSQs. The second scheme, execution-side NACKing,
sends the overflowing instruction back to the issue window from which it is later
re-issued. The third scheme, network buffering, uses the buffers in the interconnection
network between the execution units and memory to hold instructions when the
LSQ is full, and uses virtual channel flow control to avoid deadlocks. The network
buffering scheme is the most robust of all the overflow schemes and shows less than
1% performance degradation due to overflows for a subset of SPEC CPU 2000 and
EEMBC benchmarks on a cycle-accurate simulator that closely models the TRIPS
processor.
The techniques proposed in this dissertation are independent, architectureneutral
and their cumulative benefits result in LSQs that can be partitioned at a
fine granularity and have low design complexity. Each of these partitions selectively
buffers only memory instructions with true dependences and can be closely coupled
with the execution units thus minimizing power, area, and latency. Such LSQ
designs with near-ideal characteristics are well suited for microarchitectures with
thousands of instructions in-flight and may enable even more aggressive microarchitectures
in the future.Computer Science
Recommended from our members
Combined branch target and predicate prediction
Embodiments provide methods, apparatus, systems, and computer readable media associated with predicting predicates and branch targets during execution of programs using combined branch target and predicate predictions. The predictions may be made using one or more prediction control flow graphs which represent predicates in instruction blocks and branches between blocks in a program. The prediction control flow graphs may be structured as trees such that each node in the graphs is associated with a predicate instruction, and each leaf associated with a branch target which jumps to another block. During execution of a block, a prediction generator may take a control point history and generate a prediction. Following the path suggested by the prediction through the tree, both predicate values and branch targets may be predicted. Other embodiments may be described and claimed.Board of Regents, University of Texas Syste
Recommended from our members
Teaching Microarchitecture through Metaphors
Students traditionally learn microarchitecture by studying textual descriptions with diagrams but few analogies. Several popular textbooks on this topic introduce concepts such as pipelining and caching in the context of simple paper-only architectures. While this instructional style allows important concepts to be covered within a given class period, students have difficulty bridging the gap between what is covered in classes and real-world implementations. Discussing concrete implementations and complications would, however, take too much time. In this paper, we propose a technique of representing microarchitecture building blocks with animated metaphors to accelerate the process of learning about complex microarchitectures. We represent hardware implementations as road networks that include specific patterns of traffic flow found in microarchitectural behavior. Our experiences indicate an 83% improvement to understanding memory system microarchitecture. We believe the mental models developed by these students will serve them in remembering microarchitectural behavior and extend to learning new microarchitectures more easily
Design and implementation of in-network coherence
Thesis (S.M.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2013.Title as it appears in MIT Commencement Exercises program, June 2013: Design and implementation of in-network coherence. Cataloged from PDF version of thesis.Includes bibliographical references (p. 101-104).CMOS technology scaling has enabled increasing transistor density on chip. At the same time, multi-core processors that provide increased performance, vis-a'-vis power efficiency, have become prevalent in a power constrained environment. The shared memory model is a predominant paradigm in such systems, easing programmability and increasing portability. However with memory being shared by an increasing number of cores, a scalable coherence mechanism is imperative for these systems. Snoopy coherence has been a favored coherence scheme owing to its high performance and simplicity. However there are few viable proposals to extend snoopy coherence to unordered interconnects - specifically, modular packet-switched interconnects that have emerged as a scalable solution to the communication challenges in the CMP era. This thesis proposes a distributed in-network global ordering scheme that enables snoopy coherence on unordered interconnects. The proposed scheme is realized on a two-dimensional mesh interconnection network, referred to as OMNI (Ordered Mesh Network Interconnect). OMNI is an enabling solution for the SCORPIO processor prototype developed at MIT - a 36-core chip multi-processor supporting snoopy coherence, and fabricated in a commercial 45nm technology. OMNI is shown to be effective, reducing runtime by 36% in comparison to directory and Hammer coherence protocol implementations. The OMNI network achieves an operating frequency of 833 MHz post-layout, occupies 10% of the chip area, and consumes less than 100mW of power.by Suvinay Subramanian.S.M
Tiled microprocessors
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007.Includes bibliographical references (p. 251-258).Current-day microprocessors have reached the point of diminishing returns due to inherent scalability limitations. This thesis examines the tiled microprocessor, a class of microprocessor which is physically scalable but inherits many of the desirable properties of conventional microprocessors. Tiled microprocessors are composed of an array of replicated tiles connected by a special class of network, the Scalar Operand Network (SON), which is optimized for low-latency, low-occupancy communication between remote ALUs on different tiles. Tiled microprocessors can be constructed to scale to 100's or 1000's of functional units. This thesis identifies seven key criteria for achieving physical scalability in tiled microprocessors. It employs an archetypal tiled microprocessor to examine the challenges in achieving these criteria and to explore the properties of Scalar Operand Networks. The thesis develops the field of SONs in three major ways: it introduces the 5-tuple performance metric, it describes a complete, high-frequency SON implementation, and it proposes a taxonomy, called AsTrO, for categorizing them.(cont.) To develop these ideas, the thesis details the design, implementation and analysis of a tiled microprocessor prototype, the Raw Microprocessor, which was implemented at MIT in 180 nm technology. Overall, compared to Raw, recent commercial processors with half the transistors required 30x as many lines of code, occupied 100x as many designers, contained 50x as many pre-tapeout bugs, and resulted in 33x as many post-tapeout bugs. At the same time, the Raw microprocessor proves to be more versatile in exploiting ILP, stream, and server-farm workloads with modest to large amounts of parallelism.by Michael Bedford Taylor.Ph.D
High performance dense linear algebra on a spatially distributed processor
As technology trends have limited the performance scaling of conventional processors, industry and academic research has turned to parallel architectures on a single chip, including distributed uniprocessors and multicore chips. This paper examines how to extend the archtypical operation of dense linear algebra, matrix multiply, to an emerging class of uniprocessor architectures characterized by a large number of independent functional units, register banks, and cache banks connected by a 2-D on-chip network. We extend the well known algorithm for matrix multiplication by Goto to this spatially distributed class of uniprocessor and describe the optimizations of the innermost kernel, a systolic-like algorithm running on a general purpose uniprocessor. The resulting implementation yields the first demonstration of high-performance in an application executing on the TRIPS processor hardware, a next-generation distributed processor core. We show that such processors are indeed capable of substantial improvements in single threaded performance provided their spatial topography is taken into account
The Execution Migration Machine: Directoryless Shared-Memory Architecture
For certain applications involving chip multiprocessors with more than 16 cores, a directoryless architecture with fine-grained and partial-context thread migration can outperform directory-based coherence, providing lighter on-chip traffic and reduced verification complexity
Summarizing multiprocessor program execution with versatile, microarchitecture-independent snapshots
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2006.Includes bibliographical references (p. 131-137).Computer architects rely heavily on software simulation to evaluate, refine, and validate new designs before they are implemented. However, simulation time continues to increase as computers become more complex and multicore designs become more common. This thesis investigates software structures and algorithms for quickly simulating modern cache-coherent multiprocessors by amortizing the time spent to simulate the memory system and branch predictors. The Memory Timestamp Record (MTR) summarizes the directory and cache state of a multiprocessor system in a compact data structure. A single MTR snapshot is versatile enough to reconstruct the microarchitectural state resulting from various coherence protocols and cache organizations. The MTR may be quickly updated by each simulated processor during a fast-forwarding phase and optionally stored off-line for reuse. To fill large branch prediction tables, we introduce Branch Predictor-based Compression (BPC) which compactly stores a branch trace so that it may be used to fill in any branch predictor structure. An entire BPC trace requires less space than single discrete predictor snapshots, and it may be decompressed 3-6x faster than performing functional simulation.by Kenneth C. Barr.Ph.D