DSPatch: Dual Spatial Pattern Prefetcher
High main memory latency continues to limit performance of modern
high-performance out-of-order cores. While DRAM latency has remained nearly the
same over many generations, DRAM bandwidth has grown significantly due to
higher frequencies, newer architectures (DDR4, LPDDR4, GDDR5) and 3D-stacked
memory packaging (HBM). Yet current state-of-the-art prefetchers struggle to
extract higher performance when more DRAM bandwidth is available.
Prefetchers need the ability to dynamically adapt to available bandwidth,
boosting prefetch count and prefetch coverage when headroom exists and
throttling down to achieve high accuracy when the bandwidth utilization is
close to peak. To this end, we present the Dual Spatial Pattern Prefetcher
(DSPatch) that can be used as a standalone prefetcher or as a lightweight
adjunct spatial prefetcher to the state-of-the-art delta-based Signature
Pattern Prefetcher (SPP). DSPatch builds on a novel and intuitive use of
modulated spatial bit-patterns. The key idea is to: (1) represent program
accesses on a physical page as a bit-pattern anchored to the first "trigger"
access, (2) learn two spatial access bit-patterns: one biased towards coverage
and another biased towards accuracy, and (3) select one bit-pattern at run-time
based on the DRAM bandwidth utilization to generate prefetches. Across a
diverse set of workloads, using only 3.6KB of storage, DSPatch improves
performance over an aggressive baseline with a PC-based stride prefetcher at
the L1 cache and the SPP prefetcher at the L2 cache by 6% (9% in
memory-intensive workloads and up to 26%). Moreover, the performance of
DSPatch+SPP scales with increasing DRAM bandwidth, growing from 6% over SPP to
10% when DRAM bandwidth is doubled.
Comment: This work is to appear in MICRO 2019.
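To make the selection mechanism concrete, here is a minimal C sketch of the two-pattern idea, assuming 4 KB pages tracked at 64 B cache-line granularity; the entry layout, function names, and the 0.75 bandwidth threshold are illustrative simplifications rather than the paper's actual design.

/* Minimal sketch of DSPatch's dual-pattern selection (simplified). */
#include <stdbool.h>
#include <stdint.h>

#define LINES_PER_PAGE 64           /* 4 KB page / 64 B cache lines */

typedef struct {
    uint64_t cov_pattern;           /* coverage-biased: OR of observed patterns   */
    uint64_t acc_pattern;           /* accuracy-biased: AND of observed patterns  */
} dspatch_entry_t;

/* Fold a completed page's access bit-pattern (anchored to its trigger
 * access) into both learned patterns. */
static void dspatch_train(dspatch_entry_t *e, uint64_t observed, bool first)
{
    if (first) {                    /* seed both patterns on first sighting */
        e->cov_pattern = e->acc_pattern = observed;
        return;
    }
    e->cov_pattern |= observed;     /* grow coverage: any line ever touched    */
    e->acc_pattern &= observed;     /* keep accuracy: lines touched every time */
}

/* Pick a pattern at run time based on DRAM bandwidth utilization. */
static uint64_t dspatch_select(const dspatch_entry_t *e, double bw_util)
{
    const double BW_HIGH = 0.75;    /* assumed threshold, not from the paper */
    return (bw_util < BW_HIGH) ? e->cov_pattern : e->acc_pattern;
}

/* Issue a prefetch for every set bit, relative to the page base. */
static void dspatch_prefetch(uint64_t pattern, uintptr_t page_base)
{
    for (int i = 0; i < LINES_PER_PAGE; i++)
        if (pattern & (1ULL << i))
            __builtin_prefetch((void *)(page_base + (uintptr_t)i * 64));
}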
Telescope: Telemetry at Terabyte Scale
Data-hungry applications that require terabytes of memory have become
widespread in recent years. To meet the memory needs of these applications,
data centers are embracing tiered memory architectures with near and far memory
tiers. Precise, efficient, and timely identification of hot and cold data and
their placement in appropriate tiers is critical for performance in such
systems. Unfortunately, the existing state-of-the-art telemetry techniques for
hot and cold data detection are ineffective at the terabyte scale.
We propose Telescope, a novel technique that profiles different levels of the
application's page table tree for fast and efficient identification of hot and
cold data. Telescope is based on the observation that, for a memory- and
TLB-intensive workload, higher levels of a page table tree are also frequently
accessed during a hardware page table walk. Hence, the hotness of the higher
levels of the page table tree essentially captures the hotness of its subtrees
or address space sub-regions at a coarser granularity. We exploit this insight
to quickly converge on even a few megabytes of hot data and efficiently
identify several gigabytes of cold data in terabyte-scale applications.
Importantly, such a technique can seamlessly scale to petabyte-scale
applications.
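As a rough illustration of this insight, the following C sketch classifies subtrees as hot or cold by consulting accessed bits high in a page-table-like tree, pruning cold subtrees without ever visiting their leaves; the node structure and report callback are hypothetical stand-ins for the kernel's real page tables and telemetry path.

/* Level-skipping hot/cold classification over a page-table-like tree. */
#include <stdbool.h>
#include <stddef.h>

#define FANOUT 512                  /* 512 entries per x86-64 table level */

typedef struct pt_node {
    bool accessed;                  /* accessed bit, set by HW page walks */
    struct pt_node *child[FANOUT];  /* all NULL below the leaf level */
} pt_node;

/* An accessed bit high in the tree means "something in this subtree is
 * hot": descend only into accessed subtrees and report untouched ones
 * as cold at coarse granularity, skipping all of their leaves. */
static void classify(pt_node *n, void (*report)(pt_node *, bool hot))
{
    if (!n->accessed) {             /* cold: prune the entire subtree */
        report(n, false);
        return;
    }
    n->accessed = false;            /* clear so the next scan re-samples */
    bool leaf = true;
    for (size_t i = 0; i < FANOUT; i++) {
        if (n->child[i]) {
            leaf = false;
            classify(n->child[i], report);
        }
    }
    if (leaf)
        report(n, true);            /* hot page found at the leaf level */
}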
Telescope's telemetry achieves 90%+ precision and recall at just 0.009%
single CPU utilization for microbenchmarks with a 5 TB memory footprint. Memory
tiering based on Telescope results in 5.6% to 34% throughput improvement for
real-world benchmarks with a 1-2 TB memory footprint compared to other
state-of-the-art telemetry techniques.
ApHMM: Accelerating Profile Hidden Markov Models for Fast and Energy-Efficient Genome Analysis
Profile hidden Markov models (pHMMs) are widely employed in various
bioinformatics applications to identify similarities between biological
sequences, such as DNA or protein sequences. In pHMMs, sequences are
represented as graph structures annotated with transition and emission
probabilities. These probabilities are subsequently used to compute the
similarity score between a sequence and a pHMM graph.
Baum-Welch algorithm, a prevalent and highly accurate method, utilizes these
probabilities to optimize and compute similarity scores. However, the
Baum-Welch algorithm is computationally intensive, and existing solutions offer
either software-only or hardware-only approaches with fixed pHMM designs. We
identify an urgent need for a flexible, high-performance, and energy-efficient
HW/SW co-design to address the major inefficiencies in the Baum-Welch algorithm
for pHMMs.
We introduce ApHMM, the first flexible acceleration framework designed to
significantly reduce both computational and energy overheads associated with
the Baum-Welch algorithm for pHMMs. ApHMM tackles the major inefficiencies in
the Baum-Welch algorithm by 1) designing flexible hardware to accommodate
various pHMM designs, 2) exploiting predictable data dependency patterns
through on-chip memory with memoization techniques, 3) rapidly filtering out
negligible computations using a hardware-based filter, and 4) minimizing
redundant computations.
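The C sketch below illustrates two of these ideas in software form for a single Baum-Welch forward step: filtering out negligible terms and hoisting a factor that would otherwise be recomputed per edge; the matrix layout and the EPSILON threshold are our assumptions, not ApHMM's hardware design.

/* One forward step:
 * forward[t][j] = emit[j][obs[t]] * sum_i forward[t-1][i] * trans[i][j] */
#include <stddef.h>

#define EPSILON 1e-12               /* assumed negligibility threshold */

static void forward_step(size_t n_states,
                         const double *prev,     /* forward[t-1][*]       */
                         double *cur,            /* forward[t][*] (out)   */
                         const double *trans,    /* n_states x n_states,
                                                    row-major [i][j]      */
                         const double *emit_col) /* emit[*][obs[t]]       */
{
    for (size_t j = 0; j < n_states; j++) {
        const double e = emit_col[j];   /* hoisted once per state, not per edge */
        if (e < EPSILON) {              /* filter: state cannot emit the symbol */
            cur[j] = 0.0;
            continue;
        }
        double acc = 0.0;
        for (size_t i = 0; i < n_states; i++) {
            const double term = prev[i] * trans[i * n_states + j];
            if (term < EPSILON)         /* filter: negligible contribution */
                continue;
            acc += term;
        }
        cur[j] = acc * e;
    }
}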
ApHMM achieves substantial speedups of 15.55x - 260.03x, 1.83x - 5.34x, and
27.97x when compared to CPU, GPU, and FPGA implementations of the Baum-Welch
algorithm, respectively. ApHMM outperforms state-of-the-art CPU implementations
in three key bioinformatics applications: 1) error correction, 2) protein
family search, and 3) multiple sequence alignment, by 1.29x - 59.94x, 1.03x -
1.75x, and 1.03x - 1.95x, respectively, while improving their energy efficiency
by 64.24x - 115.46x, 1.75x, and 1.96x, respectively.
Comment: Accepted to ACM TACO.