Combined branch target and predicate prediction for instruction blocks
Embodiments provide methods, apparatus, systems, and computer readable media associated with predicting predicates and branch targets during execution of programs using combined branch target and predicate predictions. The predictions may be made using one or more prediction control flow graphs which represent predicates in instruction blocks and branches between blocks in a program. The prediction control flow graphs may be structured as trees such that each node in the graphs is associated with a predicate instruction, and each leaf is associated with a branch target which jumps to another block. During execution of a block, a prediction generator may take a control point history and generate a prediction. Following the path suggested by the prediction through the tree, both predicate values and branch targets may be predicted. Other embodiments may be described and claimed.
Board of Regents, University of Texas System
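
The tree walk the abstract describes lends itself to a compact illustration. Below is a minimal Python sketch, assuming a toy tree whose internal nodes stand for predicate instructions and whose leaves hold branch targets; the Node/predict names and the one-bit-per-predicate history are illustrative assumptions, not the patent's actual structures.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    predicate_id: Optional[int] = None     # set on internal (predicate) nodes
    branch_target: Optional[int] = None    # set on leaves (next-block address)
    taken: Optional["Node"] = None         # child if predicate predicted true
    not_taken: Optional["Node"] = None     # child if predicate predicted false

def predict(root, history):
    """Walk the tree using a control-point history (one bit per predicate,
    a stand-in for a real history-indexed predictor). Returns the predicted
    predicate values and the predicted branch target in a single lookup."""
    predicates, node, i = {}, root, 0
    while node.branch_target is None:
        bit = history[i % len(history)]
        predicates[node.predicate_id] = bit
        node = node.taken if bit else node.not_taken
        i += 1
    return predicates, node.branch_target

# A block with two predicates and three possible exit targets.
tree = Node(predicate_id=0,
            taken=Node(branch_target=0x400),
            not_taken=Node(predicate_id=1,
                           taken=Node(branch_target=0x480),
                           not_taken=Node(branch_target=0x500)))
print(predict(tree, history=[0, 1]))  # ({0: 0, 1: 1}, 1152), i.e. target 0x480
```

Because one walk yields the whole predicate path and the exit target together, a misprediction anywhere on the path implies a single, combined repair rather than separate predicate and branch-target fixups.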
vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design
The most widely used machine learning frameworks require users to carefully tune their memory usage so that the deep neural network (DNN) fits into the DRAM capacity of a GPU. This restriction hampers a researcher's flexibility to study different machine learning algorithms, forcing them to either use a less desirable network architecture or parallelize the processing across multiple GPUs. We propose a runtime memory manager that virtualizes the memory usage of DNNs such that both GPU and CPU memory can simultaneously be utilized for training larger DNNs. Our virtualized DNN (vDNN) reduces the average GPU memory usage of AlexNet by up to 89%, OverFeat by 91%, and GoogLeNet by 95%, a significant reduction in the memory requirements of DNNs. Similar experiments on VGG-16, one of the deepest and most memory-hungry DNNs to date, demonstrate the memory efficiency of our proposal. vDNN enables VGG-16 with batch size 256 (requiring 28 GB of memory) to be trained on a single NVIDIA Titan X GPU card containing 12 GB of memory, with 18% performance loss compared to a hypothetical, oracular GPU with enough memory to hold the entire DNN.
Comment: Published as a conference paper at the 49th IEEE/ACM International Symposium on Microarchitecture (MICRO-49), 2016
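
The core mechanism, parking each layer's feature maps in host memory after its forward pass and bringing them back just before its backward pass, can be sketched compactly. The following is a minimal, self-contained Python sketch with GPU and CPU memory simulated by plain dicts; the VDNNManager name and its methods are illustrative assumptions, not the paper's implementation.

```python
class VDNNManager:
    def __init__(self):
        self.gpu_mem = {}  # layer id -> feature maps resident on the GPU
        self.cpu_mem = {}  # layer id -> feature maps offloaded to host DRAM

    def forward_done(self, layer_id, feature_maps):
        """After layer_id's forward pass, offload its outputs to host memory;
        they are not needed again until the matching backward pass, so the
        GPU allocation can be freed."""
        self.cpu_mem[layer_id] = feature_maps   # device-to-host copy
        self.gpu_mem.pop(layer_id, None)        # release the GPU buffer

    def backward_start(self, layer_id):
        """Before layer_id's backward pass, prefetch its feature maps back.
        In vDNN this copy overlaps the backward computation of deeper layers,
        hiding most of the PCIe transfer latency."""
        self.gpu_mem[layer_id] = self.cpu_mem.pop(layer_id)
        return self.gpu_mem[layer_id]

# Usage: the forward pass offloads layer by layer, the backward pass
# prefetches in reverse order, so peak GPU residency stays near one layer.
mgr = VDNNManager()
for layer in range(4):
    mgr.forward_done(layer, feature_maps=f"activations-{layer}")
for layer in reversed(range(4)):
    mgr.backward_start(layer)
```

The design works because feature maps have a long reuse distance: produced early in the forward pass, consumed only during the corresponding backward pass, which is exactly the window the offload exploits.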
Method and apparatus for congestion-aware routing in a computer interconnection network
The present disclosure relates to an example of a method for a first router to adaptively determine status within a network. The network may include the first router, a second router and a third router. The method for the first router may comprise determining status information regarding the second router located in the network, and transmitting the status information to the third router located in the network. The second router and the third router may be indirectly coupled to one another.
Board of Regents, University of Texas System
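
A toy model of this gossip-style status propagation may help: in the Python sketch below, each router measures its neighbors' congestion and forwards everything it knows, so routers that are only indirectly coupled still learn each other's status. The class, method names, and the congestion metric are illustrative assumptions, not the patent's terms.

```python
class Router:
    def __init__(self, name, congestion=0):
        self.name = name
        self.congestion = congestion  # stand-in metric, e.g. buffer occupancy
        self.neighbors = []           # directly coupled routers
        self.known = {}               # router name -> last heard congestion

    def gossip(self):
        # Determine status information about each directly coupled router...
        for nbr in self.neighbors:
            self.known[nbr.name] = nbr.congestion
        # ...then transmit everything known so far onward, so routers that
        # are only indirectly coupled still receive the status.
        for nbr in self.neighbors:
            nbr.known.update(self.known)

# Usage on a line a - b - c: after a and b gossip, router c knows a's
# congestion even though a and c share no direct link.
a, b, c = Router("a", congestion=7), Router("b"), Router("c")
a.neighbors, b.neighbors, c.neighbors = [b], [a, c], [b]
a.gossip()
b.gossip()
print(c.known["a"])  # 7
```

Spreading status beyond immediate neighbors is what makes the routing congestion-aware rather than merely locally adaptive: a route choice at c can react to a hotspot at a before traffic ever reaches it.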
Computing nodes for executing groups of instructions
A computation node according to various embodiments of the invention includes at least one input port capable of being coupled to at least one first other computation node, a first store coupled to the input port(s) to store input data, a second store to receive and store instructions, an instruction wakeup unit to match the input data to the instructions, at least one execution unit to execute the instructions using the input data to produce output data, and at least one output port capable of being coupled to at least one second other computation node. The node may also include a router to direct the output data from the output port(s) to the second other node. A system according to various embodiments of the invention includes an external instruction sequencer to fetch a group of instructions, and one or more interconnected, preselected computational nodes. An article according to an embodiment of the invention includes a medium having instructions which are capable of causing a machine to partition a program into a plurality of groups of instructions, assign one or more of the instruction groups to a plurality of interconnected preselected computation nodes, load the instruction groups onto the nodes, and execute the instruction groups as each instruction in each group receives all necessary associated operands for execution.
Board of Regents, University of Texas System
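
The instruction-wakeup behavior described above is essentially dataflow firing: an instruction executes once all of its operands have arrived. A minimal Python sketch of that matching logic follows; the class and method names are illustrative, not the patent's terminology.

```python
from dataclasses import dataclass, field

@dataclass
class Instruction:
    opcode: str
    needed: set                                   # operand tags awaited
    arrived: dict = field(default_factory=dict)   # tag -> value received

    def ready(self):
        return self.needed == set(self.arrived)

class ComputationNode:
    def __init__(self, instructions):
        self.waiting = list(instructions)   # second store: instruction buffer

    def deliver(self, tag, value):
        """First store: an operand arrives on an input port; the wakeup unit
        matches it against waiting instructions and fires any that are now
        complete (in hardware these would go to an execution unit, and the
        router would forward the results)."""
        fired = []
        for instr in self.waiting:
            if tag in instr.needed:
                instr.arrived[tag] = value
            if instr.ready():
                fired.append(instr)
        self.waiting = [i for i in self.waiting if i not in fired]
        return fired

node = ComputationNode([Instruction("add", needed={"x", "y"})])
node.deliver("x", 1)                 # add still waits on y
print(node.deliver("y", 2))          # add fires once both operands arrive
```

Firing on operand arrival, rather than on a program counter, is what lets a grid of such nodes execute a whole instruction group with no centralized issue logic.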
Netrace: Dependency-Driven Trace-Based Network-on-Chip Simulation
Chip multiprocessors (CMPs) and systems-on-chip (SOCs) are expected to grow in core count from a few today to hundreds or more. Since efficient on-chip communication is a primary factor in the performance of large core-count systems, the research community has directed substantial attention to networks-on-chip (NOCs). Current NOC evaluation methodologies include analytical modeling, network simulation, and full-system simulation. However, as core count and system complexity grow, the deficiencies of each of these methods will limit their ability to meet the demands of developers and researchers. Developing efficient NOCs requires high-fidelity, low-overhead NOC evaluation techniques and metrics. To address these challenges, this paper describes a new trace-based network simulation methodology that captures dependencies between network messages observed in full-system simulation of multithreaded applications. We also introduce Netrace, a library of tools and traces that enables targeted NOC simulators to track and replay network messages and their dependencies.
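
The dependency-driven replay idea can be illustrated with a toy model: each trace message carries the ids of the messages it depends on, and the replayer injects a message only after its dependencies have been delivered. The Python sketch below uses illustrative names and ignores network latency; it is not the actual Netrace API.

```python
from dataclasses import dataclass

@dataclass
class TraceMessage:
    msg_id: int
    src: int
    dst: int
    deps: tuple  # ids of messages that must be delivered before this one

def replay(trace):
    delivered, pending, waves = set(), list(trace), 0
    while pending:
        # Inject every message whose dependencies are all delivered; a real
        # simulator would also model per-message network latency here.
        ready = [m for m in pending if all(d in delivered for d in m.deps)]
        if not ready:
            raise ValueError("cyclic or missing dependency in trace")
        for m in ready:
            delivered.add(m.msg_id)
            pending.remove(m)
        waves += 1
    return waves  # dependency depth of the trace in this toy model

trace = [
    TraceMessage(0, src=0, dst=1, deps=()),      # request
    TraceMessage(1, src=1, dst=2, deps=(0,)),    # forwarded after 0 arrives
    TraceMessage(2, src=2, dst=0, deps=(1,)),    # reply after 1 arrives
]
print(replay(trace))  # 3: each message waits on its predecessor
```

Honoring dependencies is what separates this methodology from naive trace replay: injecting message 1 before message 0 has been delivered would let the simulated network appear faster than the application could ever drive it.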