
    Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips

    The trend in industry is toward heterogeneous multicore processors (HMCs), including chips that combine CPUs with massively-threaded throughput-oriented processors (MTTOPs) such as GPUs. Although current homogeneous chips tightly couple their cores with cache-coherent shared virtual memory (CCSVM), no current HMC uses this communication paradigm. In this paper, we present a CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads programming model, called xthreads, for programming this HMC. Our goal is to evaluate the potential performance benefits of tightly coupling heterogeneous cores with CCSVM.
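    The abstract does not show the xthreads API itself, so the sketch below is only a hypothetical analogue of the tight CPU/GPU coupling that CCSVM aims at, written with CUDA unified memory: cudaMallocManaged gives host and device code a single shared virtual address space, though coherence there is driver-managed rather than the hardware cache coherence the paper proposes.

        // Hypothetical analogue only: not the paper's xthreads API.
        // CUDA unified memory gives CPU and GPU one virtual address space,
        // so data passes between them without explicit copies.
        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void scale(float *data, float factor, int n) {
            int i = blockIdx.x * blockDim.x + threadIdx.x;
            if (i < n) data[i] *= factor;  // GPU writes through the shared mapping
        }

        int main() {
            const int n = 1 << 20;
            float *data;
            cudaMallocManaged(&data, n * sizeof(float));  // visible to CPU and GPU
            for (int i = 0; i < n; ++i) data[i] = 1.0f;   // CPU initializes in place

            scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n);
            cudaDeviceSynchronize();            // wait before the CPU reads back

            printf("data[0] = %f\n", data[0]);  // prints 2.000000; no memcpy needed
            cudaFree(data);
            return 0;
        }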

    Exploration of GPU Cache Architectures Targeting Machine Learning Applications

    The computational power of graphics processing units (GPUs) has become prevalent in many fields of computer engineering. Massively parallel workloads and large data set capabilities make GPUs an essential asset in tackling today's computationally intensive problems. One field that benefited greatly from the introduction of GPUs is machine learning. Many machine learning applications use algorithms that show a significant speedup on a GPU compared to other processors because the problem set is massively parallel. The existing cache architecture, however, may not be ideal for these applications. The goal of this thesis is to determine whether the GPU's cache architecture can be redesigned to better fit the needs of this increasingly popular field of computer engineering. This work uses a cycle-accurate GPU simulator, Multi2Sim, to analyze NVIDIA GPU architectures. The architectures are based on the Kepler series, but the flexibility of the simulator allows newer features to be emulated. The simulator's source code was modified to record additional metrics and deepen the analysis of the cache architecture. Two benchmark suites were used: one of general-purpose algorithms and another of machine learning algorithms. Running the benchmarks with various cache configurations gave insight into the effects the cache architecture had on each. Analysis of the results shows that the cache architecture, while beneficial to the general-purpose algorithms, does not need to be as complex for machine learning algorithms. A large contributor to this complexity is the cache coherence protocol used by GPUs. Because machine learning problems exhibit high spatial locality, the overhead of implementing the coherence protocol yields little benefit, and simplifying the architecture can lead to smaller, cheaper, and more efficient designs.
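    To make the spatial-locality point concrete, below is a hedged, illustrative CUDA kernel (the matrix-vector workload, layout, and sizes are invented here, not taken from the thesis). Threads in a warp read adjacent addresses, so each cache line is fetched once and consumed whole, and every output element is written by exactly one thread; with so little cross-thread sharing, a coherence protocol has almost nothing to track.

        // Illustrative only: an ML-style kernel with high spatial locality.
        #include <cstdio>
        #include <cuda_runtime.h>

        __global__ void matvec(const float *M, const float *x, float *y,
                               int rows, int cols) {
            int r = blockIdx.x * blockDim.x + threadIdx.x;
            if (r >= rows) return;
            float acc = 0.0f;
            for (int c = 0; c < cols; ++c)
                // Column-major layout: adjacent threads touch adjacent
                // addresses, so lines are streamed once and never revisited.
                acc += M[c * rows + r] * x[c];
            y[r] = acc;  // written by one thread only; no cross-thread sharing
        }

        int main() {
            const int rows = 1024, cols = 1024;
            float *M, *x, *y;
            cudaMallocManaged(&M, rows * cols * sizeof(float));
            cudaMallocManaged(&x, cols * sizeof(float));
            cudaMallocManaged(&y, rows * sizeof(float));
            for (int i = 0; i < rows * cols; ++i) M[i] = 1.0f;
            for (int i = 0; i < cols; ++i) x[i] = 1.0f;

            matvec<<<(rows + 255) / 256, 256>>>(M, x, y, rows, cols);
            cudaDeviceSynchronize();
            printf("y[0] = %f\n", y[0]);  // expect 1024.000000
            cudaFree(M); cudaFree(x); cudaFree(y);
            return 0;
        }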