26 research outputs found

Memory Access Scheduling

Author: Dally William J.
Kapasi Ujval J.
Mattson Peter
Owens John D.
Rixner Scott
Publication date: 28/03/2002

Conference Paper

The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the "3-D" structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scarce memory bandwidth.

DSpace at Rice University

Register Organization for Media Processing

Author: Dally William J.
Kapasi Ujval J.
Khailany Brucek
Mattson Peter
Owens John D.
Rixner Scott
Publication date: 28/03/2002

Conference Paper

Processor architectures with tens to hundreds of arithmetic units are emerging to handle media processing applications. These applications, such as image coding, image synthesis, and image understanding, require arithmetic rates of up to 10^11 operations per second. As the number of arithmetic units in a processor increases to meet these demands, register storage and communication between the arithmetic units dominate the area, delay, and power of the arithmetic units. In this paper we show that partitioning the register file along three axes reduces the cost of register storage and communication without significantly impacting performance. We develop a taxonomy of register architectures by partitioning across the data-parallel, instruction-level parallel, and memory hierarchy axes, and by optimizing the hierarchical register organization to operate on streams of data. Compared to a centralized global register file, the most compact of these organizations reduces the register file area, delay, and power dissipation of a media processor by factors of 195, 20, and 430, respectively. This reduction in cost is achieved with a performance degradation of only 8% on a representative set of media processing benchmarks.

DSpace at Rice University

Memory Access Scheduling

Author: John D. Owens
Peter Mattson
Scott Rixner
Ujval J. Kapasi
William J. Dally
Publication date: 01/01/2000

The bandwidth and latency of a memory system are strongly dependent on the manner in which accesses interact with the "3-D" structure of banks, rows, and columns characteristic of contemporary DRAM chips. There is nearly an order of magnitude difference in bandwidth between successive references to different columns within a row and different rows within a bank. This paper introduces memory access scheduling, a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure. Conservative reordering, in which the first ready reference in a sequence is performed, improves bandwidth by 40% for traces from five media benchmarks. Aggressive reordering, in which operations are scheduled to optimize memory bandwidth, improves bandwidth by 93% for the same set of applications. Memory access scheduling is particularly important for media processors where it enables the processor to make the most efficient use of scar..

CiteSeerX

Communication Scheduling

Author: John D. Owens
Peter Mattson
Scott Rixner
Ujval J. Kapasi
William J. Dally

The high arithmetic rates of media processing applications require architectures with tens to hundreds of functional units, multiple register files, and explicit interconnect between functional units and register files. Communication scheduling enables scheduling to these emerging architectures, including those that use shared buses and register file ports. Scheduling to these shared interconnect architectures is difficult because it requires simultaneously allocating functional units to operations and buses and register file ports to the communications between operations. Prior VLIW scheduling algorithms are limited to clustered register file architectures with no shared buses or register file ports. Communication scheduling extends the range of target architectures by making each communication explicit and decomposing it into three components: a write stub, zero or more copy operations, and a read stub. On a distributed register file architecture, communication scheduling allows media processing kernels to achieve 98% of the performance of a central register file architecture with only 9% of the area, 6% of the power consumption, and 37% of the access delay. Relative to a clustered register file architecture, the distributed register file architecture achieves 120% of the performance with 56% of the area and 50% of the power consumption.

CiteSeerX

VLSI Design and Verification of the Imagine Processor

Author: Andrew Chang
Brian Towles
Brucek Khailany
Jinyung Namkoong
Ujval J. Kapasi
William J. Dally
Publication date: 01/01/2002

The Imagine stream processor is a 21 million transistor chip implemented by a collaboration between Stanford University and Texas Instruments in a 1.5V 0.15 micron process with five layers of aluminum metal. The VLSI design, clocking, and verification methodologies for the Imagine processor are presented. These methodologies enabled a small team of graduate students with limited resources to design a high-performance media processor in a modern ASIC flow.

CiteSeerX

VLSI Design and Verification of the Imagine Processor

Author: Chang Andrew
Dally William J.
Kapasi Ujval J.
Khailany Brucek
Namkoong Jinyung
Towles Brian
Publication venue: eScholarship, University of California
Publication date: 01/01/2002

The Imagine stream processor is a 21 million transistor chip implemented by a collaboration between Stanford University and Texas Instruments in a 1.5V 0.15 micron process with five layers of aluminum metal. The VLSI design, clocking, and verification methodologies for the Imagine processor are presented. These methodologies enabled a small team of graduate students with limited resources to design a high-performance media processor in a modern ASIC flow.

eScholarship - University of California