Evaluating Cache Coherent Shared Virtual Memory for Heterogeneous Multicore Chips
The trend in industry is towards heterogeneous multicore processors (HMCs),
including chips with CPUs and massively-threaded throughput-oriented processors
(MTTOPs) such as GPUs. Although current homogeneous chips tightly couple the
cores with cache-coherent shared virtual memory (CCSVM), this is not the
communication paradigm used by any current HMC. In this paper, we present a
CCSVM design for a CPU/MTTOP chip, as well as an extension of the pthreads
programming model, called xthreads, for programming this HMC. Our goal is to
evaluate the potential performance benefits of tightly coupling heterogeneous
cores with CCSVM.
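As a rough illustration of the programming model this abstract targets, the sketch below shows, in plain pthreads, the shared-pointer pattern that CCSVM makes possible: the worker thread operates directly on heap memory allocated by the main thread, with no explicit data transfer. The xthreads API itself is not detailed in the abstract, so ordinary pthreads calls stand in here; under the proposed design, the same kind of kernel could be dispatched to MTTOP cores unchanged.

```c
/* Minimal pthreads sketch of the shared-virtual-memory pattern: one
 * allocation, passed by pointer, no copies. xthreads, as described,
 * extends this model to MTTOP cores; plain pthreads is used here only
 * for illustration since the xthreads API is not shown in the abstract. */
#include <pthread.h>
#include <stdlib.h>

#define N 4096

struct task { float *data; int lo, hi; };

static void *scale_kernel(void *arg) {
    struct task *t = arg;
    for (int i = t->lo; i < t->hi; i++)
        t->data[i] *= 2.0f;            /* operates directly on shared memory */
    return NULL;
}

int main(void) {
    float *v = malloc(N * sizeof *v);  /* one allocation, visible to all threads */
    for (int i = 0; i < N; i++) v[i] = (float)i;

    struct task t = { v, 0, N };
    pthread_t tid;
    pthread_create(&tid, NULL, scale_kernel, &t);  /* no explicit data transfer */
    pthread_join(tid, NULL);

    free(v);
    return 0;
}
```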
TransForm: Formally Specifying Transistency Models and Synthesizing Enhanced Litmus Tests
Memory consistency models (MCMs) specify the legal ordering and visibility of
shared memory accesses in a parallel program. Traditionally, instruction set
architecture (ISA) MCMs assume that relevant program-visible memory ordering
behaviors only result from shared memory interactions that take place between
user-level program instructions. This assumption fails to account for virtual
memory (VM) implementations that may result in additional shared memory
interactions between user-level program instructions and both 1) system-level
operations (e.g., address remappings and translation lookaside buffer
invalidations initiated by system calls) and 2) hardware-level operations
(e.g., hardware page table walks and dirty bit updates) during a user-level
program's execution. These additional shared memory interactions can impact the
observable memory ordering behaviors of user-level programs. Thus, memory
transistency models (MTMs) have been coined as a superset of MCMs to
additionally articulate VM-aware consistency rules. However, no prior work has
enabled formal MTM specifications, nor methods to support their automated
analysis.
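For reference, a minimal example of the kind of behavior an MCM governs is the classic store-buffering litmus test, written here in plain C purely for illustration (it is not taken from the paper). No virtual-memory operations are involved, which is exactly the scope that MTMs extend.

```c
/* Store-buffering litmus test: an MCM specifies whether the outcome
 * r0 == 0 && r1 == 0 is permitted (on x86 TSO it is, because stores
 * may be buffered past later loads). Illustrative only. */
#include <pthread.h>
#include <stdio.h>

static volatile int x, y;
static volatile int r0, r1;

static void *t0(void *arg) { (void)arg; x = 1; r0 = y; return NULL; }
static void *t1(void *arg) { (void)arg; y = 1; r1 = x; return NULL; }

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r0=%d r1=%d\n", r0, r1);   /* r0=0, r1=0 is the interesting outcome */
    return 0;
}
```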
To fill the above gap, this paper presents the TransForm framework. First,
TransForm features an axiomatic vocabulary for formally specifying MTMs.
Second, TransForm includes a synthesis engine to support the automated
generation of litmus tests enhanced with MTM features (i.e., enhanced litmus
tests, or ELTs) when supplied with a TransForm MTM specification. As a case
study, we formally define an estimated MTM for Intel x86 processors, called
x86t_elt, that is based on observations made by an ELT-based evaluation of an
Intel x86 MTM implementation from prior work and available public
documentation. Given x86t_elt and a synthesis bound as input, TransForm's
synthesis engine successfully produces a set of ELTs including relevant ELTs
from prior work.
Comment: This is an updated version of the TransForm paper that features updated results reflecting performance optimizations and software bug fixes. 14 pages, 11 figures, Proceedings of the 47th Annual International Symposium on Computer Architecture (ISCA).
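As a hedged illustration of what an enhanced litmus test adds over an ordinary one, the user-space C sketch below interleaves ordinary loads and stores with a virtual-memory operation: an mmap(MAP_FIXED) remapping stands in for a system-initiated remap plus TLB invalidation. It is only an analogue of the ELT shape described above, not one of TransForm's synthesized tests, and the thread structure and outcome comments are illustrative assumptions.

```c
/* Rough user-space analogue of an enhanced litmus test (ELT): ordinary
 * accesses plus a VM operation. The MTM question is which writes the
 * final load may observe once the translation for x changes. */
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>

static volatile int *x;      /* shared virtual address under test */
static volatile int flag;

static void *writer(void *arg) {
    (void)arg;
    *x = 1;                  /* user-level store to VA x (old mapping) */
    /* system-level operation: remap VA x to a fresh zero-filled page;
     * the kernel invalidates stale TLB entries as part of this call */
    mmap((void *)x, 4096, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    flag = 1;
    return NULL;
}

static void *reader(void *arg) {
    (void)arg;
    while (!flag) ;          /* wait until the writer reports the remap */
    /* 0 once the remap is visible; the ELT-style question is whether a
     * stale translation could still yield 1 */
    printf("observed *x = %d\n", *x);
    return NULL;
}

int main(void) {
    x = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    pthread_t t0, t1;
    pthread_create(&t0, NULL, writer, NULL);
    pthread_create(&t1, NULL, reader, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```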
Addressing Memory Bottlenecks for Emerging Applications
There has been a recent emergence of applications from the domains of machine learning, data mining, numerical analysis, and image processing. These applications are becoming the primary algorithms behind many important user-facing services and are becoming pervasive in our daily lives. Due to their increasing usage in both mobile and datacenter workloads, it is necessary to understand their software and hardware demands and to design techniques that match their growing needs.
This dissertation studies the performance bottlenecks that arise when we try to improve the performance of these applications on current hardware systems. We observe that most of these applications are data-intensive, i.e., they operate on large amounts of data. Consequently, these applications put significant pressure on the memory. Interestingly, we notice that this pressure is not limited to one memory structure. Instead, different applications stress different levels of the memory hierarchy. For example, training Deep Neural Networks (DNNs), an emerging machine learning approach, is currently limited by the size of the GPU main memory. At the other end of the spectrum, improving DNN inference on CPUs is bottlenecked by Physical Register File (PRF) bandwidth. Concretely, this dissertation tackles four such memory bottlenecks for these emerging applications across the memory hierarchy (off-chip memory, on-chip memory, and the physical register file), presenting hardware and software techniques to address these bottlenecks and improve the performance of the emerging applications.
For on-chip memory, we present two scenarios where emerging applications achieve sub-optimal performance. First, many applications carry a large number of marginal bits that do not contribute to application accuracy, wasting space and transfer bandwidth. We present ACME, an asymmetric compute-memory paradigm that removes marginal bits from the memory hierarchy while performing the computation in full precision. Second, we tackle the contention in shared caches for these emerging applications that arises in datacenters, where multiple applications can share the same cache capacity. We present ShapeShifter, a runtime system that continuously monitors the runtime environment, detects changes in cache availability, and recompiles the application on the fly to use the available cache capacity efficiently.
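A minimal sketch of the marginal-bits idea, assuming a simple interpretation of ACME's asymmetric compute-memory split (the dissertation's actual mechanism may differ): values are stored with their low-order mantissa bits cleared, shrinking what the memory hierarchy must hold and move, while arithmetic on the loaded values still runs at full fp32 precision.

```c
/* Illustrative only: clear marginal mantissa bits on the memory side,
 * compute in full precision on the loaded values. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Clear the `drop` least-significant mantissa bits of an fp32 value. */
static float truncate_marginal_bits(float v, int drop) {
    uint32_t bits;
    memcpy(&bits, &v, sizeof bits);        /* safe type pun             */
    bits &= ~((1u << drop) - 1u);          /* zero the marginal bits    */
    memcpy(&v, &bits, sizeof v);
    return v;
}

int main(void) {
    float weight     = 0.123456789f;
    float stored     = truncate_marginal_bits(weight, 12); /* 12 of 23 bits dropped */
    float activation = 3.14159f;

    /* Full-precision multiply on the (truncated) stored operand. */
    printf("full:      %.9f\n", weight * activation);
    printf("truncated: %.9f\n", stored * activation);
    return 0;
}
```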
For the physical register file, we observe that DNN inference on CPUs is primarily limited by PRF bandwidth. Increasing the number of compute units in the CPU requires more read ports in the PRF, which quickly reaches a point where its latency targets can no longer be met. To solve this problem, we present LEDL, locality extensions for deep learning on CPUs, which entails a rearchitected FMA unit and PRF design tailored to the heavy data reuse inherent in DNN inference.
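The reuse LEDL targets can be seen in a register-blocked micro-kernel like the one below. This is a software illustration of the data-reuse pattern, not LEDL's hardware design, and the kernel shape is an assumption for illustration: each loaded input feeds several accumulators held in registers, so most FMA operands avoid fresh register-file reads.

```c
/* Register-blocked dot-product micro-kernel: y[c] += sum_k a[k] * w[k][c].
 * Each a[k] is loaded once and reused across four accumulators that stay
 * resident in registers. */
#include <stdio.h>

#define COLS 4

static void microkernel(const float *a, float (*w)[COLS], int n, float y[COLS]) {
    float y0 = y[0], y1 = y[1], y2 = y[2], y3 = y[3];  /* accumulators in registers */
    for (int k = 0; k < n; k++) {
        float ak = a[k];            /* one load feeds four FMAs */
        y0 += ak * w[k][0];
        y1 += ak * w[k][1];
        y2 += ak * w[k][2];
        y3 += ak * w[k][3];
    }
    y[0] = y0; y[1] = y1; y[2] = y2; y[3] = y3;
}

int main(void) {
    float a[2] = { 1.0f, 2.0f };
    float w[2][COLS] = { { 1, 2, 3, 4 }, { 5, 6, 7, 8 } };
    float y[COLS] = { 0 };
    microkernel(a, w, 2, y);
    printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);  /* 11 14 17 20 */
    return 0;
}
```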
Finally, a significant challenge facing both researchers and industry practitioners is that as DNNs grow deeper and larger, DNN training becomes limited by the size of the GPU main memory, restricting the size of the networks that GPUs can train. To tackle this challenge, we first identify the primary contributors to this heavy memory footprint, finding that the feature maps (intermediate layer outputs) are the heaviest contributors in training, as opposed to the weights in inference. Then, we present Gist, a runtime system that uses three efficient data encoding techniques to reduce the footprint of DNN training.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/146016/1/anijain_1.pd
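One footprint-reducing encoding in the spirit of Gist, sketched under the assumption that ReLU's backward pass only needs the sign of each forward output (a simplification of the system described in the dissertation): the full-precision feature map is replaced by a 1-bit mask between the forward and backward passes.

```c
/* Illustrative encoding: keep a 1-bit mask instead of the fp32 feature
 * map between a ReLU layer's forward and backward passes. */
#include <stdint.h>

/* Forward ReLU: write outputs, but stash only a bitmask for backprop. */
static void relu_forward(const float *x, float *y, uint8_t *mask, int n) {
    for (int i = 0; i < n; i++) {
        int pos = x[i] > 0.0f;
        y[i] = pos ? x[i] : 0.0f;
        if (pos) mask[i / 8] |= (uint8_t)(1u << (i % 8));
        else     mask[i / 8] &= (uint8_t)~(1u << (i % 8));
    }
}

/* Backward ReLU: the mask alone recovers what backprop needs, so the
 * full-precision forward output never has to be retained. */
static void relu_backward(const float *dy, const uint8_t *mask, float *dx, int n) {
    for (int i = 0; i < n; i++) {
        int pos = (mask[i / 8] >> (i % 8)) & 1;
        dx[i] = pos ? dy[i] : 0.0f;
    }
}

int main(void) {
    float x[4] = { -1.0f, 2.0f, -3.0f, 4.0f };
    float y[4], dy[4] = { 1, 1, 1, 1 }, dx[4];
    uint8_t mask[1] = { 0 };
    relu_forward(x, y, mask, 4);
    relu_backward(dy, mask, dx, 4);   /* dx = {0, 1, 0, 1} */
    return 0;
}
```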