Multicore-optimized wavefront diamond blocking for optimizing stencil updates
The importance of stencil-based algorithms in computational science has
focused attention on optimized parallel implementations for multilevel
cache-based processors. Temporal blocking schemes leverage the large bandwidth
and low latency of caches to accelerate stencil updates and approach
theoretical peak performance. A key ingredient is the reduction of data traffic
across slow data paths, especially the main memory interface. In this work we
combine the ideas of multi-core wavefront temporal blocking and diamond tiling
to arrive at stencil update schemes that show large reductions in memory
pressure compared to existing approaches. The resulting schemes show
performance advantages in bandwidth-starved situations, which are exacerbated
by the high bytes-per-lattice-update requirement of variable-coefficient
stencils. Our thread-group concept provides a controllable trade-off between
concurrency and memory usage, shifting pressure between the memory interface
and the CPU. We present performance results on a contemporary Intel processor.
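The core idea of temporal blocking, fusing several time steps so that intermediate values stay in cache instead of round-tripping through main memory, can be sketched for a 1D three-point stencil. This is a minimal illustration of the principle only, not the paper's wavefront diamond scheme; the averaging stencil and the fusion depth of two are assumptions.

```python
def step(u):
    """One naive sweep of a 3-point averaging stencil (fixed boundaries)."""
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v

def two_step_fused(u):
    """Two time steps fused in one sweep: the intermediate time level is
    kept in a 3-value rolling window instead of a full array, so it never
    travels across the slow main memory interface."""
    n = len(u)
    out = u[:]                        # will hold time level t+2
    wm1 = u[0]                        # intermediate value w[i-1] (left boundary)
    w0 = (u[0] + u[1] + u[2]) / 3.0   # intermediate value w[i]
    for i in range(1, n - 1):
        if i + 1 <= n - 2:            # interior point of the intermediate level
            w1 = (u[i] + u[i + 1] + u[i + 2]) / 3.0
        else:                         # right boundary stays fixed
            w1 = u[n - 1]
        out[i] = (wm1 + w0 + w1) / 3.0
        wm1, w0 = w0, w1              # slide the rolling window
    return out

# The fused sweep must reproduce two naive sweeps exactly.
u = [float((3 * i) % 7) for i in range(64)]
ref = step(step(u))
fused = two_step_fused(u)
```

Production schemes generalize this to multi-dimensional diamond tiles swept by thread-group wavefronts, but the traffic-reduction mechanism is the same: intermediate time levels live in cache, not memory.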
A Survey of Techniques for Architecting TLBs
A translation lookaside buffer (TLB) caches virtual-to-physical address
translation information and is used in systems ranging from embedded devices
to high-end servers. Since the TLB is accessed very frequently and a TLB miss
is extremely costly, prudent management of the TLB is important for improving
the performance and energy efficiency of processors. In this paper, we present
a survey of techniques for architecting and managing TLBs. We characterize the
techniques across several dimensions to highlight their similarities and
distinctions. We believe that this paper will be useful for chip designers,
computer architects, and system engineers.
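As a concrete illustration of why TLB management matters, the following toy model shows a fully associative, LRU-replaced TLB that counts hits and misses. The 4 KiB page size, the two-entry capacity in the demo, and the dictionary page table are illustrative assumptions, not details from the survey.

```python
from collections import OrderedDict

PAGE_SIZE = 4096  # 4 KiB pages (illustrative assumption)

class TLB:
    """Toy fully associative TLB with LRU replacement."""
    def __init__(self, entries=64):
        self.entries = entries
        self.map = OrderedDict()      # VPN -> PFN, ordered oldest-first
        self.hits = self.misses = 0

    def translate(self, vaddr, page_table):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.map:
            self.hits += 1
            self.map.move_to_end(vpn)         # refresh LRU position
        else:
            self.misses += 1                  # costly: walk the page table
            if len(self.map) >= self.entries:
                self.map.popitem(last=False)  # evict least recently used entry
            self.map[vpn] = page_table[vpn]
        return self.map[vpn] * PAGE_SIZE + offset

# Toy page table mapping VPN -> PFN, and a tiny 2-entry TLB.
page_table = {vpn: vpn + 100 for vpn in range(8)}
tlb = TLB(entries=2)
paddr = tlb.translate(5, page_table)   # vaddr 5 -> VPN 0, offset 5: a miss
```

Real TLBs are set-associative hardware structures with approximate LRU, but the hit/miss accounting above is the behavior the surveyed techniques try to optimize.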
Using the High Productivity Language Chapel to Target GPGPU Architectures
It has been widely shown that GPGPU architectures offer large performance gains compared to their traditional CPU counterparts for many applications. The downside to these architectures is that the current programming models present numerous challenges to the programmer: lower-level languages, explicit data movement, loss of portability, and challenges in performance optimization. In this paper, we present novel methods and compiler transformations that increase productivity by enabling users to easily program GPGPU architectures using the high productivity programming language Chapel. Rather than resorting to different parallel libraries or annotations for a given parallel platform, we leverage a language that has been designed from first principles to address the challenge of programming for parallelism and locality. This also has the advantage of being portable across distinct classes of parallel architectures, including desktop multicores, distributed memory clusters, large-scale shared memory, and now CPU-GPU hybrids. We present experimental results from the Parboil benchmark suite which demonstrate that codes written in Chapel achieve performance comparable to the original versions implemented in CUDA.
NSF CCF 0702260; Cray Inc. Cray-SRA-2010-01696; 2010-2011 Nvidia Research Fellowship; unpublished; not peer reviewed
RowCore: A Processing-Near-Memory Architecture for Big Data Machine Learning
The technology-push of die stacking and application-pull of
Big Data machine learning (BDML) have created a unique
opportunity for processing-near-memory (PNM). This paper
makes four contributions: (1) While previous PNM work
explores general MapReduce workloads, we identify three
workload characteristics: (a) irregular-and-compute-light (i.e.,
perform only a few operations per input word which include
data-dependent branches and indirect memory accesses); (b)
compact (i.e., the computation has a small intermediate live
data and uses only a small amount of contiguous input data);
and (c) memory-row-dense (i.e., process the input data without
skipping over many bytes). We show that BDMLs have
or can be transformed to have these characteristics which,
except for irregularity, are necessary for bandwidth- and
energy-efficient PNM, irrespective of the architecture. (2) Based on
these characteristics, we propose RowCore, a row-oriented
PNM architecture, which (pre)fetches and operates on entire
memory rows to exploit BDMLs' row density. In contrast to this
row-centric access-and-compute schedule, traditional architectures
only opportunistically improve row locality while fetching and
operating on cache blocks. (3) RowCore employs
well-known MIMD execution to handle BDMLs’ irregularity,
and sequential prefetch of input data to hide memory
latency. In RowCore, however, one corelet prefetches
a row for all the corelets which may stray far from each
other due to their MIMD execution. Consequently, a leading
corelet may prematurely evict the prefetched data before
a lagging corelet has consumed the data. RowCore employs
novel cross-corelet flow-control to prevent such eviction. (4)
RowCore further exploits its flow-controlled prefetch for frequency
scaling based on novel coarse-grain compute-memory
rate-matching which decreases (increases) the processor clock
speed when the prefetch buffers are empty (full). Using simulations,
we show that RowCore improves performance and
energy, by 135% and 20% over a GPGPU with prefetch,
and by 35% and 34% over a multicore with prefetch, when
all three architectures use the same resources (i.e., number
of cores, and on-processor-die memory) and identical die stacking
(i.e., GPGPUs/multicores/RowCore and DRAM).
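The coarse-grain compute-memory rate matching of contribution (4) can be sketched as a simple controller: slow the core clock when the prefetch buffers run empty (compute has caught up with memory) and raise it when they fill (memory is ahead). The thresholds and frequency steps below are illustrative assumptions, not RowCore's actual parameters.

```python
# Illustrative frequency range and step, in GHz (assumptions, not RowCore's).
FREQ_MIN, FREQ_MAX, FREQ_STEP = 0.5, 2.0, 0.25

def match_rate(freq, buffer_occupancy, capacity):
    """One control step of coarse-grain compute-memory rate matching.

    An empty prefetch buffer means the corelets consume data faster than
    memory supplies it, so the clock is lowered; a full buffer means memory
    is ahead, so the clock is raised. Mid-range occupancy leaves it alone.
    """
    fill = buffer_occupancy / capacity
    if fill < 0.25:    # buffer nearly empty: slow the core down
        freq = max(FREQ_MIN, freq - FREQ_STEP)
    elif fill > 0.75:  # buffer nearly full: speed the core up
        freq = min(FREQ_MAX, freq + FREQ_STEP)
    return freq

# Nearly empty buffer at 1.0 GHz -> controller steps the clock down.
freq_after = match_rate(1.0, buffer_occupancy=1, capacity=16)
```

Because the decision uses only buffer occupancy, the controller needs no per-instruction accounting, which is what makes the rate matching coarse-grained.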
Domain-specific Architectures for Data-intensive Applications
Graphs' versatile ability to represent diverse relationships makes them effective for a wide range of applications. For instance, search engines use graph-based applications to provide high-quality search results. Medical centers use them to aid in patient diagnosis. Most recently, graphs are also being employed to support the management of viral pandemics. Looking forward, they show promise of being critical in unlocking several other opportunities, including combating the spread of fake content in social networks, detecting and preventing fraudulent online transactions in a timely fashion, and ensuring collision avoidance in autonomous vehicle navigation, to name a few. Unfortunately, all these applications require more computational power than conventional computing systems can provide. The key reason is that graph applications present large working sets that fail to fit in the small on-chip storage of existing computing systems, while at the same time they access data in seemingly unpredictable patterns, and thus cannot benefit from traditional on-chip storage.
In this dissertation, we set out to address the performance limitations of existing computing systems so as to enable emerging graph applications like those described above. To achieve this, we identified three key strategies: 1) specializing the memory architecture, 2) processing data near its storage, and 3) coalescing messages in the network. Based on these strategies, this dissertation develops several solutions: OMEGA, which employs specialized on-chip storage units with co-located specialized compute engines to accelerate the computation; MessageFusion, which coalesces messages in the interconnect; and Centaur, providing an architecture that optimizes the processing of infrequently-accessed data. Overall, these solutions provide 2x performance improvements, with negligible hardware overheads, across a wide range of applications.
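The message-coalescing strategy behind MessageFusion can be sketched at a high level: partial updates headed to the same destination vertex are combined in flight with a commutative, associative reduction, so fewer messages cross the interconnect. The sum reduction and the (vertex, value) message format here are illustrative assumptions, not the dissertation's actual protocol.

```python
def coalesce(messages, reduce_op=lambda a, b: a + b):
    """Combine in-flight messages that target the same destination vertex.

    `messages` is a list of (dest_vertex, value) pairs. The reduction must
    be commutative and associative for coalescing to preserve the final
    per-vertex result regardless of arrival order.
    """
    combined = {}
    for dest, value in messages:
        combined[dest] = reduce_op(combined[dest], value) if dest in combined else value
    return combined

# Six partial updates collapse into three messages, one per destination.
msgs = [(0, 1.0), (1, 2.0), (0, 3.0), (2, 5.0), (1, 1.0), (0, 1.0)]
out = coalesce(msgs)
```

Fewer messages means less interconnect traffic and less serialization at hot destination vertices, which is where irregular graph workloads typically bottleneck.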
Finally, we demonstrate the applicability of our strategies to other data-intensive domains by exploring an acceleration solution for MapReduce applications, which achieves a 4x performance speedup, also with negligible area and power overheads.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/163186/1/abrahad_1.pd