Optimizing cache utilization in modern cache hierarchies
The memory wall is one of the major performance bottlenecks in modern computer
systems. SRAM caches have been used successfully to bridge the performance gap
between the processor and memory. However, an SRAM cache's latency grows with
its size, so simply increasing the size of a cache can hurt performance. To solve
this problem, modern processors employ multiple levels of caches, each of a
different size, forming the so-called memory hierarchy. On a miss, the processor
looks up the data level by level, from the highest level (the L1 cache) down to
the lowest level (main memory). Such a design effectively avoids the negative
performance impact of simply using one large cache. However, because SRAM
has lower storage density than other volatile memories, the size of an SRAM
cache is restricted by the available on-chip area. With modern applications requiring
more and more memory, researchers continue to look for techniques that increase
the effective cache capacity. In general, researchers approach this problem
from two angles: maximizing the utilization of current SRAM caches, or exploiting
new technologies to support larger capacities in cache hierarchies.
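The level-by-level lookup described above can be sketched as a small simulation. This is our own illustration, not the thesis's model; the per-level latencies and the `lookup` helper are assumptions chosen for the example.

```python
# Hypothetical sketch of the hierarchy lookup path: on a miss at one level,
# the request falls through to the next, larger, slower level.
# Latencies are illustrative, not measured values.
LEVELS = [("L1", 4), ("L2", 12), ("L3", 40), ("DRAM", 200)]  # (name, cycles)

def lookup(addr, contents):
    """Walk the hierarchy from L1 down; return (level that hit, total cycles)."""
    cycles = 0
    for name, latency in LEVELS:
        cycles += latency
        if addr in contents.get(name, set()):
            return name, cycles
    return "DRAM", cycles  # main memory always backs the data

contents = {"L1": {0x100}, "L2": {0x100, 0x200}, "L3": set()}
print(lookup(0x100, contents))  # hits in L1 after 4 cycles
print(lookup(0x200, contents))  # misses L1, hits L2 after 4 + 12 cycles
```

The accumulated cycle count makes the design trade-off visible: each extra level adds latency to requests that miss it, which is why tag and lookup latency matter throughout the rest of the thesis.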
The first part of this thesis focuses on how to maximize the utilization of existing
SRAM caches. In our first work, we observe that not all words in a cache
block are accessed at around the same time. In fact, a subset of words is consistently
accessed sooner than the others. We call this subset the critical words. In our
study, we found that these critical words can be predicted using the access footprint. Based
on this observation, we propose the critical-words-only cache (co-cache). Unlike a conventional
cache, which stores all words belonging to a block, the co-cache stores only the
words predicted to be critical. In this work, we convert an L2 cache into a co-cache
and use the L1's access-footprint information to predict critical words. Our experiments
show that the co-cache outperforms a conventional L2 cache on workloads whose
working-set sizes exceed the L2 cache size. To handle workloads whose
working sets fit in the conventional L2, we propose the adaptive co-cache (aco-cache),
which allows the co-cache to be reconfigured back into a conventional cache.
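The co-cache idea can be illustrated with a minimal sketch. The class, word count, and footprint interface below are our own simplifications, not the thesis's design: each installed block keeps only the words a footprint predictor marked critical, and a request for any other word counts as a miss.

```python
# Illustrative sketch of a critical-words-only cache (names are ours).
WORDS_PER_BLOCK = 8

class CoCache:
    def __init__(self):
        self.blocks = {}  # block address -> set of critical word indices held

    def fill(self, block_addr, footprint):
        """Install a block, keeping only words the footprint predicts critical."""
        self.blocks[block_addr] = set(footprint)

    def access(self, addr):
        block_addr, word = addr // WORDS_PER_BLOCK, addr % WORDS_PER_BLOCK
        held = self.blocks.get(block_addr)
        return held is not None and word in held  # hit only on a stored word

cache = CoCache()
cache.fill(5, footprint={0, 1})     # predictor says words 0 and 1 are critical
print(cache.access(5 * 8 + 0))      # True: a stored critical word
print(cache.access(5 * 8 + 3))      # False: word 3 was not predicted critical
```

The sketch shows why prediction accuracy matters: a mispredicted footprint turns what would have been a hit in a conventional cache into a miss, which is what the adaptive aco-cache guards against for workloads that already fit in L2.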
The second part of this thesis focuses on how to efficiently enable a large-capacity
on-chip cache. In the near future, 3D stacking technology will allow one or
more DRAM chips to be stacked onto the processor. The total size of these chips is expected
to be on the order of hundreds of megabytes or even a few gigabytes. Recent works
have proposed using this space as an on-chip DRAM cache. However, the tags of the
DRAM cache create a classic space/time trade-off. On the one hand, we
would like the latency of a tag access to be small, as it contributes to both hit
and miss latencies; accordingly, we would like to store these tags in a faster medium
such as SRAM. On the other hand, with hundreds of megabytes of die-stacked DRAM cache,
the space overhead of the tags would be huge. For example, it would cost around 12
MB of SRAM to store all the tags of a 256 MB DRAM cache (with conventional
64 B blocks). This is clearly too large, considering that some current chip
multiprocessors have an L3 cache that is smaller. Prior works have proposed storing these
tags along with the data in the stacked DRAM array (tags-in-DRAM). However, this
scheme increases the access latency of the DRAM cache. To optimize access latency
in the DRAM cache, we propose the aggressive tag cache (ATCache). Like a conventional
cache, the ATCache caches recently accessed tags to exploit temporal locality;
it exploits spatial locality by prefetching tags from nearby cache sets. In addition,
we address the high miss latency and cache pollution caused by excessive
prefetching. To reduce this overhead, we propose cost-effective prefetching, a
combination of dynamic prefetching-granularity tuning and hit-prefetching, to
throttle the number of sets prefetched. Our proposed ATCache (which consumes 0.4%
of the overall tag size) can satisfy over 60% of DRAM cache tag accesses on average.
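The 12 MB figure quoted above can be checked with back-of-the-envelope arithmetic. The per-block tag size of roughly 3 bytes is our assumption for the illustration (tag bits plus state bits, rounded), not a number stated in the text.

```python
# Back-of-the-envelope check of the SRAM tag overhead for a die-stacked
# DRAM cache, assuming ~3 bytes of tag and metadata per 64 B block.
dram_cache_bytes = 256 * 2**20           # 256 MB DRAM cache
block_bytes = 64                         # conventional 64 B blocks
tag_entry_bytes = 3                      # assumed tag + state bits, rounded

num_blocks = dram_cache_bytes // block_bytes   # one tag per block
tag_store_bytes = num_blocks * tag_entry_bytes
print(num_blocks, tag_store_bytes / 2**20)     # ~4 M blocks, 12.0 MB of SRAM
```

Four million blocks at a few bytes each lands on the order of 12 MB, which is indeed comparable to (or larger than) the L3 of some chip multiprocessors, motivating both tags-in-DRAM and the ATCache.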
The last work proposed in this thesis is a DRAM-Cache-Aware (DCA) DRAM controller.
In this work, we first address the challenge of scheduling requests in the DRAM
cache. Many recent DRAM cache works build their techniques on a tags-in-DRAM
scheme; storing the tags in the DRAM array, however, increases the complexity
of a DRAM cache request. In contrast to a conventional request to DRAM
main memory, a request to the DRAM cache now translates into multiple DRAM
cache accesses (tag and data). In this work, we address how to schedule
these DRAM cache accesses. We start by exploring whether a conventional
DRAM controller works well in this scenario: we introduce two potential designs
and study their limitations. From this study, we derive a set of design principles that
an ideal DRAM cache controller must satisfy. We then propose a DRAM-cache-aware
(DCA) DRAM controller based on these design principles. Our experimental
results show that DCA outperforms the baseline by over 14%.
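The request expansion that complicates scheduling can be sketched as follows. This is our own simplification, not the DCA design itself: under tags-in-DRAM, one cache request becomes a tag access plus a dependent data access, and the controller must decide how to order these compound requests.

```python
# Sketch (our simplification) of why DRAM cache scheduling differs from
# conventional DRAM scheduling: one cache request expands into dependent
# tag and data accesses that the controller must order.
from collections import deque

def expand(request_addr):
    """A DRAM cache request becomes a tag read, then a dependent data read."""
    return [("tag", request_addr), ("data", request_addr)]

sched_queue = deque()
for addr in (0x10, 0x20):
    sched_queue.extend(expand(addr))   # a cache-aware controller can keep the
                                       # pair together instead of interleaving
print(list(sched_queue))
# [('tag', 16), ('data', 16), ('tag', 32), ('data', 32)]
```

A conventional controller that reorders freely may separate a tag access from its dependent data access; keeping such constraints visible to the scheduler is the kind of design principle a DRAM-cache-aware controller must satisfy.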
Tuning and optimization for a variety of many-core architectures without changing a single line of implementation code using the Alpaka library
We present an analysis of optimizing the performance of a single C++11 source
code using the Alpaka hardware abstraction library. For this we use the general
matrix multiplication (GEMM) algorithm in order to show that compilers can
optimize Alpaka code effectively when tuning key parameters of the algorithm.
We do not intend to rival existing, highly optimized DGEMM versions, but merely
choose this example to prove that Alpaka allows for platform-specific tuning
with a single source code. In addition, we analyze the optimization potential
available with vendor-specific compilers when confronted with the heavily
templated abstractions of Alpaka. We specifically test the code on bleeding-edge
architectures such as Nvidia's Tesla P100, Intel's Knights Landing (KNL)
and Haswell architectures, as well as IBM's Power8 system. On some of these we
are able to reach almost 50% of the peak floating-point performance
using the aforementioned means. When adding compiler-specific #pragmas we are
able to reach 5 TFLOP/s on a P100 and over 1 TFLOP/s on a KNL system.
Comment: Accepted paper for the P^3MA workshop at ISC 2017 in Frankfurt.
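The tuning idea, one source code with a platform-specific knob, can be reduced to a plain-Python sketch. Alpaka's real C++ API and the compiler-level tuning are well beyond this illustration; only the tile-size parameter of a blocked GEMM is shown.

```python
# Blocked GEMM with the tile size left as the tunable, platform-specific
# parameter: the same source runs unchanged for any tile choice.
def gemm_blocked(A, B, n, tile):
    """Compute C = A @ B for n x n lists of lists, iterating in tiles."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C

n = 4
A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # identity
B = [[float(i * n + j) for j in range(n)] for i in range(n)]
assert gemm_blocked(A, B, n, tile=2) == B  # I @ B == B for any tile size
```

The result is identical for every tile size; only the memory-access pattern changes, which is exactly the kind of knob that is tuned per architecture while the implementation code stays untouched.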
Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models
The upcoming many-core architectures require software developers to exploit
concurrency to utilize available computational power. Today's high-level
language virtual machines (VMs), which are a cornerstone of software
development, do not provide sufficient abstraction for concurrency concepts. We
analyze concrete and abstract concurrency models and identify the challenges
they impose for VMs. To provide sufficient concurrency support in VMs, we
propose to integrate concurrency operations into VM instruction sets.
Since there will always be VMs optimized for special purposes, our goal is to
develop a methodology to design instruction sets with concurrency support.
Therefore, we also propose a list of trade-offs that have to be investigated to
advise the design of such instruction sets.
As a first experiment, we implemented one instruction-set extension for
shared-memory and one for non-shared-memory concurrency. From our experimental
results, we derived a list of requirements for a full-fledged experimental
environment for further research.
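The idea of folding concurrency operations into a VM instruction set can be illustrated with a toy stack machine. The opcodes and interpreter below are entirely our own, not the paper's: SEND and RECV model a non-shared-memory (message-passing) concurrency operation as a first-class instruction.

```python
# Toy stack VM whose instruction set includes message-passing operations.
import queue

def run(program, mailbox):
    """Execute a list of (opcode, arg) pairs; return the final stack."""
    stack = []
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
        elif op == "SEND":              # concurrency op baked into the ISA
            mailbox.put(stack.pop())
        elif op == "RECV":
            stack.append(mailbox.get())
    return stack

mbox = queue.Queue()
run([("PUSH", 42), ("SEND", None)], mbox)      # producer bytecode
print(run([("RECV", None)], mbox))             # consumer bytecode -> [42]
```

Making SEND/RECV instructions rather than library calls is what gives the VM visibility into the concurrency model, which is the prerequisite for the trade-off analysis the paper proposes.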
Runtime Optimizations for Prediction with Tree-Based Models
Tree-based models have proven to be an effective solution for web ranking as
well as other problems in diverse domains. This paper focuses on optimizing the
runtime performance of applying such models to make predictions, given an
already-trained model. Although exceedingly simple conceptually, most
implementations of tree-based models do not efficiently utilize modern
superscalar processor architectures. By laying out data structures in memory in
a more cache-conscious fashion, removing branches from the execution flow using
a technique called predication, and micro-batching predictions using a
technique called vectorization, we are able to better exploit modern processor
architectures and significantly improve the speed of tree-based models over
hard-coded if-else blocks. Our work contributes to the exploration of
architecture-conscious runtime implementations of machine learning algorithms.
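The two layout ideas named above can be sketched under a simplified tree encoding of our own: nodes live in flat arrays (cache-conscious layout), and traversal uses the comparison result as an index (predication) instead of an if-else branch.

```python
# Branch-free traversal of a flattened decision tree. Node i's children sit
# at indices 2*i+1 and 2*i+2; a negative feature id marks a leaf.
feature = [0,   1,   -1,  -1,  -1]     # -1: leaf node
thresh  = [0.5, 0.3, 0.0, 0.0, 0.0]
value   = [0.0, 0.0, 3.0, 1.0, 2.0]    # leaf predictions

def predict(x):
    i = 0
    while feature[i] >= 0:
        # predication: the comparison result (0 or 1) selects the child,
        # so no data-dependent branch appears in the loop body
        i = 2 * i + 1 + (x[feature[i]] > thresh[i])
    return value[i]

print(predict([0.9, 0.0]))  # x[0] > 0.5 -> right child (leaf): 3.0
print(predict([0.1, 0.7]))  # left, then x[1] > 0.3 -> right leaf: 2.0
```

The paper's vectorization step would additionally run this loop over a micro-batch of inputs at each depth; the sketch shows only the memory layout and the predicated step.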
A Classification and Survey of Computer System Performance Evaluation Techniques