Architectural Techniques to Enable Reliable and Scalable Memory Systems
High capacity and scalable memory systems play a vital role in enabling our
desktops, smartphones, and pervasive technologies like the Internet of Things
(IoT). Unfortunately, memory systems are becoming increasingly prone to faults.
This is because we rely on technology scaling to improve memory density, and at
small feature sizes, memory cells tend to break easily. Today, memory
reliability is seen as the key impediment towards using high-density devices,
adopting new technologies, and even building the next Exascale supercomputer.
To ensure even a bare-minimum level of reliability, present-day solutions tend
to have high performance, power, and area overheads. Ideally, we would like
memory systems to remain robust, scalable, and implementable while keeping the
overheads to a minimum. This dissertation describes how simple cross-layer
architectural techniques can provide orders of magnitude higher reliability and
enable seamless scalability for memory systems while incurring negligible
overheads.
Comment: PhD thesis, Georgia Institute of Technology (May 2017).
The Dirty Secret of SSDs: Embodied Carbon
Scalable Solid-State Drives (SSDs) have revolutionized the way we store and
access our data across datacenters and handheld devices. Unfortunately, scaling
technology can have a significant environmental impact. Across the globe, most
semiconductor manufacturing uses electricity that is generated from coal and
natural gas. For instance, manufacturing a Gigabyte of Flash emits 0.16 Kg of
CO2, which is a significant fraction of the total carbon emission in the
system. We estimate that manufacturing storage devices resulted in 20
million metric tonnes of CO2 emissions in 2021 alone. To better understand
this concern, this paper compares the sustainability trade-offs between Hard
Disk Drives (HDDs) and SSDs and recommends methodologies to estimate the
embodied carbon costs of the storage system. In this paper, we outline four
possible strategies to make storage systems sustainable. First, this paper
recommends directions that help select the right medium of storage (SSD vs
HDD). Second, this paper proposes lifetime extension techniques for SSDs.
Third, this paper advocates for effective and efficient recycling and reuse of
high-density multi-level cell-based SSDs. Fourth, specifically for hand-held
devices, this paper recommends leveraging elasticity in cloud storage.
Comment: In the proceedings of the 1st Workshop on Sustainable Computer
Systems Design and Implementation (HotCarbon 2022).
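To make the per-gigabyte figure concrete, here is a minimal back-of-the-envelope sketch of the embodied-carbon estimate in Python. The 0.16 Kg CO2/GB Flash factor comes from the abstract above; the HDD factor, capacities, and lifetimes are illustrative assumptions, not values from the paper.

```python
# Back-of-the-envelope embodied-carbon estimate for storage devices.
# The Flash factor (0.16 kg CO2 per GB) comes from the abstract above;
# the HDD factor and lifetimes below are illustrative assumptions only.

FLASH_KG_CO2_PER_GB = 0.16   # from the abstract
HDD_KG_CO2_PER_GB = 0.02     # assumed value, for illustration only

def embodied_kg_co2(capacity_gb: float, kg_co2_per_gb: float) -> float:
    """Total manufacturing (embodied) emissions for one device."""
    return capacity_gb * kg_co2_per_gb

def amortized_kg_co2_per_year(capacity_gb: float, kg_co2_per_gb: float,
                              lifetime_years: float) -> float:
    """Embodied emissions amortized over the device's service life.
    Extending device lifetime lowers this number, which is the intuition
    behind the lifetime-extension strategy mentioned above."""
    return embodied_kg_co2(capacity_gb, kg_co2_per_gb) / lifetime_years

if __name__ == "__main__":
    ssd = amortized_kg_co2_per_year(1024, FLASH_KG_CO2_PER_GB, lifetime_years=5)
    hdd = amortized_kg_co2_per_year(1024, HDD_KG_CO2_PER_GB, lifetime_years=5)
    print(f"1 TB SSD: {ssd:.1f} kg CO2/year embodied")
    print(f"1 TB HDD: {hdd:.1f} kg CO2/year embodied (assumed factor)")
```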
FLuID: Mitigating Stragglers in Federated Learning using Invariant Dropout
Federated Learning (FL) allows machine learning models to train locally on
individual mobile devices, synchronizing model updates via a shared server.
This approach safeguards user privacy; however, it also generates a
heterogeneous training environment due to the varying performance capabilities
across devices. As a result, straggler devices with lower performance often
dictate the overall training time in FL. In this work, we aim to alleviate this
performance bottleneck due to stragglers by dynamically balancing the training
load across the system. We introduce Invariant Dropout, a method that extracts
a sub-model based on the weight update threshold, thereby minimizing potential
impacts on accuracy. Building on this dropout technique, we develop an adaptive
training framework, Federated Learning using Invariant Dropout (FLuID). FLuID
offers a lightweight sub-model extraction to regulate computational intensity,
thereby reducing the load on straggler devices without affecting model quality.
Our method leverages neuron updates from non-straggler devices to construct a
tailored sub-model for each straggler based on client performance profiling.
Furthermore, FLuID can dynamically adapt to changes in stragglers as runtime
conditions shift. We evaluate FLuID using five real-world mobile clients. The
evaluations show that Invariant Dropout maintains baseline model efficiency
while alleviating the performance bottleneck of stragglers through a dynamic,
runtime approach.
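A minimal sketch of the sub-model extraction idea, assuming we have one layer's most recent weight update: neurons whose updates are smallest (the "invariant" ones) are dropped to form the straggler's sub-model. The keep-fraction thresholding and single-layer view are illustrative simplifications, not FLuID's exact procedure.

```python
import numpy as np

def invariant_dropout_mask(weight_update: np.ndarray,
                           keep_fraction: float = 0.7) -> np.ndarray:
    """Return a boolean mask over the output neurons of one layer.

    Neurons whose aggregate update magnitude is smallest ("invariant")
    are dropped; the rest are kept for the straggler's sub-model.
    `keep_fraction` stands in for a performance-driven sizing decision.
    """
    # Aggregate the update magnitude per output neuron (row of the matrix).
    per_neuron = np.abs(weight_update).sum(axis=1)
    threshold = np.quantile(per_neuron, 1.0 - keep_fraction)
    return per_neuron >= threshold

def extract_submodel(weights: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Slice out the rows (output neurons) selected by the mask."""
    return weights[mask]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 128))           # full layer on a fast client
    update = rng.normal(size=w.shape) * 0.01  # last round's weight update
    mask = invariant_dropout_mask(update, keep_fraction=0.5)
    sub_w = extract_submodel(w, mask)
    print(sub_w.shape)  # roughly half the neurons -> lighter straggler model
```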
Scalable and Secure Row-Swap: Efficient and Safe Row Hammer Mitigation in Memory Systems
As Dynamic Random Access Memories (DRAM) scale, they are becoming
increasingly susceptible to Row Hammer. By rapidly activating rows of DRAM
cells (aggressor rows), attackers can exploit inter-cell interference through
Row Hammer to flip bits in neighboring rows (victim rows). A recent work,
called Randomized Row-Swap (RRS), proposed proactively swapping aggressor rows
with randomly selected rows before an aggressor row can cause Row Hammer.
Our paper observes that RRS is neither secure nor scalable. We first propose
the "Juggernaut attack pattern" that breaks RRS in under 1 day. Juggernaut
exploits the fact that the mitigative action of RRS, a swap operation, can
itself induce additional target row activations, defeating such a defense.
Second, this paper proposes Secure Row-Swap, a new defense mechanism that avoids
the additional activations from swap (and unswap) operations and protects
against Juggernaut. Furthermore, this paper extends Secure Row-Swap with attack
detection to defend against even future attacks. While this provides better
security, it also allows for securely reducing the frequency of swaps, thereby
enabling Scalable and Secure Row-Swap. The Scalable and Secure Row-Swap
mechanism provides years of Row Hammer protection with 3.3X lower storage
overheads as compared to the RRS design. It incurs only a 0.7% slowdown
compared to a non-secure baseline for a Row Hammer threshold of 1200.
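A toy Python sketch of the underlying swap idea: per-row activation counters trigger a swap with a randomly chosen row once a threshold is crossed. The counter organization, threshold handling, and random selection are simplified stand-ins, and the extra bookkeeping Secure Row-Swap uses to avoid swap-induced activations is omitted.

```python
import random

class RowSwapMitigator:
    """Toy sketch of threshold-based row swapping (in the spirit of RRS).

    Activations per row are counted; once a row's count crosses the
    threshold, the row is swapped with a randomly chosen row and its
    counter is reset. Real designs track counters and the indirection
    table in hardware; sizes and thresholds here are illustrative.
    """

    def __init__(self, num_rows: int, threshold: int = 1200):
        self.threshold = threshold
        self.counts = [0] * num_rows
        # Indirection table: logical row -> physical row.
        self.remap = list(range(num_rows))

    def activate(self, logical_row: int) -> int:
        phys = self.remap[logical_row]
        self.counts[phys] += 1
        if self.counts[phys] >= self.threshold:
            self._swap(logical_row)
            self.counts[phys] = 0
        return self.remap[logical_row]

    def _swap(self, logical_row: int):
        other = random.randrange(len(self.remap))
        self.remap[logical_row], self.remap[other] = (
            self.remap[other], self.remap[logical_row])

if __name__ == "__main__":
    m = RowSwapMitigator(num_rows=1 << 16)
    for _ in range(5000):          # hammer one logical row
        m.activate(42)
    print(m.remap[42])             # likely no longer the original physical row
```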
Accelerating Recommendation System Training by Leveraging Popular Choices
Recommender models are commonly used to suggest relevant items to a user for
e-commerce and online advertisement-based applications. These models use
massive embedding tables to store numerical representations of items' and users'
categorical variables (memory intensive) and employ neural networks (compute
intensive) to generate final recommendations. Training these large-scale
recommendation models is evolving to require increasing data and compute
resources. The highly parallel neural-network portion of these models can
benefit from GPU acceleration; however, large embedding tables often cannot fit
in the limited-capacity GPU device memory. Hence, this paper deep dives into
the semantics of training data and obtains insights about the feature access,
transfer, and usage patterns of these models. We observe that, due to the
popularity of certain inputs, the accesses to the embeddings are highly skewed
with a few embedding entries being accessed up to 10000x more. This paper
leverages this asymmetrical access pattern to offer a framework, called FAE,
and proposes a hot-embedding aware data layout for training recommender models.
This layout utilizes the scarce GPU memory for storing the highly accessed
embeddings, thereby reducing data transfers from CPU to GPU. At the same time,
FAE engages the GPU to accelerate the executions of these hot embedding
entries. Experiments on production-scale recommendation models with real
datasets show that FAE reduces the overall training time by 2.3x and 1.52x in
comparison to XDL CPU-only and XDL CPU-GPU execution while maintaining baseline
accuracy.
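The skewed-access insight can be sketched as a simple placement policy: count embedding accesses over a sample of training data and pin the most frequently accessed rows in GPU memory. The sampling, the Zipf-like synthetic trace, and the capacity budget below are illustrative; FAE's actual layout and runtime scheduling are more involved.

```python
import numpy as np

def classify_hot_embeddings(access_log: np.ndarray,
                            num_entries: int,
                            gpu_budget_entries: int) -> np.ndarray:
    """Return the indices of embedding rows to pin in GPU memory.

    `access_log` is a 1-D array of embedding indices observed in a
    sample of training data; the most frequently accessed rows (the
    "hot" entries) are chosen, up to the GPU capacity budget.
    """
    counts = np.bincount(access_log, minlength=num_entries)
    hot = np.argsort(counts)[::-1][:gpu_budget_entries]
    return hot

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Zipf-like skew: a few entries are accessed orders of magnitude more.
    log = rng.zipf(a=1.2, size=1_000_000) % 100_000
    hot = classify_hot_embeddings(log, num_entries=100_000,
                                  gpu_budget_entries=4_096)
    frac = np.isin(log, hot).mean()
    print(f"{len(hot)} hot rows cover {frac:.0%} of all accesses")
```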
Ad-Rec: Advanced Feature Interactions to Address Covariate-Shifts in Recommendation Networks
Recommendation models are vital in delivering personalized user experiences
by leveraging the correlation between multiple input features. However, deep
learning-based recommendation models often face challenges due to evolving user
behaviour and item features, leading to covariate shifts. Effective
cross-feature learning is crucial for handling data distribution drift and
adapting to changing user behaviour. Traditional feature interaction techniques
have limitations in achieving optimal performance in this context.
This work introduces Ad-Rec, an advanced network that leverages feature
interaction techniques to address covariate shifts. This helps eliminate
irrelevant interactions in recommendation tasks. Ad-Rec leverages masked
transformers to enable the learning of higher-order cross-features while
mitigating the impact of data distribution drift. Our approach improves model
quality, as measured by the Area Under the Curve (AUC) metric, accelerates
convergence, and reduces training time. We demonstrate the scalability of
Ad-Rec and its ability to achieve superior model quality through comprehensive
ablation studies.
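The masked-transformer idea can be illustrated with a single self-attention pass over per-feature embeddings, where a boolean mask suppresses interactions treated as irrelevant. The dimensions and the mask construction below are placeholders, not Ad-Rec's actual architecture.

```python
import numpy as np

def masked_feature_attention(feat_emb: np.ndarray,
                             mask: np.ndarray) -> np.ndarray:
    """One self-attention pass over feature embeddings.

    feat_emb: (num_features, dim) -- one embedding per input feature.
    mask:     (num_features, num_features) boolean; False entries mark
              interactions that should not contribute (masked out).
    Returns cross-feature representations of the same shape as feat_emb.
    """
    d = feat_emb.shape[1]
    scores = feat_emb @ feat_emb.T / np.sqrt(d)
    scores = np.where(mask, scores, -1e9)          # suppress masked pairs
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ feat_emb

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(8, 16))                 # 8 features, 16-dim each
    mask = np.ones((8, 8), dtype=bool)
    mask[0, 5] = mask[5, 0] = False                # drop one irrelevant pair
    out = masked_feature_attention(emb, mask)
    print(out.shape)
```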
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference
Transformers have emerged as the underpinning architecture for Large Language
Models (LLMs). In generative language models, the inference process involves
two primary phases: prompt processing and token generation. Token generation,
which constitutes the majority of the computational workload, primarily entails
vector-matrix multiplications and interactions with the Key-Value (KV) Cache.
This phase is constrained by memory bandwidth due to the overhead of
transferring weights and KV cache values from the memory system to the
computing units. This memory bottleneck becomes particularly pronounced in
applications that require long-context and extensive text generation, both of
which are increasingly crucial for LLMs.
This paper introduces "Keyformer", an innovative inference-time approach, to
mitigate the challenges associated with KV cache size and memory bandwidth
utilization. Keyformer leverages the observation that approximately 90% of the
attention weight in generative inference focuses on a specific subset of
tokens, referred to as "key" tokens. Keyformer retains only the key tokens in
the KV cache by identifying these crucial tokens using a novel score function.
This approach effectively reduces both the KV cache size and memory bandwidth
usage without compromising model accuracy. We evaluate Keyformer's performance
across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ
various positional embedding algorithms. Our assessment encompasses a variety
of tasks, with a particular emphasis on summarization and conversation tasks
involving extended contexts. Keyformer's KV cache reduction lowers
inference latency by 2.1x and improves token generation throughput by 2.4x,
while preserving the model's accuracy.
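A minimal sketch of key-token selection: tokens are scored (here simply by accumulated attention mass, a stand-in for Keyformer's score function) and only the top-scoring tokens' entries are retained in the KV cache. Shapes and the scoring window are illustrative assumptions.

```python
import numpy as np

def prune_kv_cache(keys: np.ndarray, values: np.ndarray,
                   attn_weights: np.ndarray, keep: int):
    """Keep only the `keep` highest-scoring tokens in the KV cache.

    keys, values:  (seq_len, dim) cached projections per token.
    attn_weights:  (num_queries, seq_len) attention from recent decode
                   steps; column sums serve as an illustrative token
                   score (a stand-in for Keyformer's score function).
    """
    scores = attn_weights.sum(axis=0)                # attention mass per token
    kept = np.sort(np.argsort(scores)[::-1][:keep])  # top tokens, in order
    return keys[kept], values[kept], kept

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq, dim = 1024, 64
    k, v = rng.normal(size=(seq, dim)), rng.normal(size=(seq, dim))
    attn = rng.dirichlet(np.ones(seq), size=16)      # 16 recent query steps
    k2, v2, kept = prune_kv_cache(k, v, attn, keep=256)
    print(k2.shape, v2.shape)                        # 4x smaller KV cache
```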
Touché: Towards Ideal and Efficient Cache Compression by Mitigating Tag Area Overheads
Compression is seen as a simple technique to increase the effective cache
capacity. Unfortunately, compression techniques either incur tag area overheads
or restrict data placement to only include neighboring compressed cache blocks
to mitigate tag area overheads. Ideally, we should be able to place arbitrary
compressed cache blocks without any placement restrictions and tag area
overheads.
This paper proposes Touché, a framework that enables storing multiple
arbitrary compressed cache blocks within a physical cacheline without any tag
area overheads. The Touché framework consists of three components. The first
component, called the "Signature" (SIGN) engine, creates shortened signatures
from the tag addresses of compressed blocks. Due to this, the SIGN engine can
store multiple signatures in each tag entry. On a cache access, the physical
cacheline is accessed only if there is a signature match (which has a
negligible probability of false positive). The second component, called the
"Tag Appended Data" (TADA) mechanism, stores the full tag addresses with
data. TADA enables Touché to detect false-positive signature matches by
ensuring that the actual tag address is available for comparison. The third
component, called the "Superblock Marker" (SMARK) mechanism, uses a unique
marker in the tag entry to indicate the occurrence of compressed cache blocks
from neighboring physical addresses in the same cacheline. Touché is
completely hardware-based and achieves an average speedup of 12% (ideal 13%)
when compared to an uncompressed baseline.
Comment: Keywords: Compression, Caches, Tag Array, Data Array, Hashing
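The interplay of SIGN and TADA can be sketched in a few lines of Python: a short hash of the tag address serves as the signature checked first, and the full tag address stored with the data resolves the occasional false-positive match. The signature width and hash function are illustrative choices, not the hardware design.

```python
def signature(tag_addr: int, bits: int = 8) -> int:
    """Shortened signature of a tag address (illustrative hash and width)."""
    return hash(tag_addr) & ((1 << bits) - 1)

class CompressedLineSketch:
    """Toy model of one physical cacheline holding compressed blocks."""

    def __init__(self):
        self.signatures = []   # per-block signatures in the tag entry (SIGN)
        self.full_tags = []    # full tag addresses stored with the data (TADA)
        self.blocks = []       # compressed data blocks

    def insert(self, tag_addr: int, block: bytes):
        self.signatures.append(signature(tag_addr))
        self.full_tags.append(tag_addr)
        self.blocks.append(block)

    def lookup(self, tag_addr: int):
        sig = signature(tag_addr)
        for i, s in enumerate(self.signatures):
            if s != sig:
                continue                       # cheap tag-entry check
            if self.full_tags[i] == tag_addr:  # TADA resolves false positives
                return self.blocks[i]
        return None                            # miss

if __name__ == "__main__":
    line = CompressedLineSketch()
    line.insert(0xDEAD000, b"compressed-A")
    line.insert(0xBEEF000, b"compressed-B")
    print(line.lookup(0xBEEF000))   # hit
    print(line.lookup(0xCAFE000))   # signature may match, full tag catches it
```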
Can bounded and self-interested agents be teammates? Application to planning in ad hoc teams
Planning for ad hoc teamwork is challenging because it involves agents collaborating without any prior coordination or communication. The focus is on principled methods for a single agent to cooperate with others. This motivates investigating the ad hoc teamwork problem in the context of self-interested decision-making frameworks. Agents engaged in individual decision making in multiagent settings face the task of having to reason about other agents' actions, which may in turn involve reasoning about others, leading to a potentially infinite nesting of reasoning. An established approximation that operationalizes this approach is to bound the infinite nesting from below by introducing level 0 models. For the purposes of this study, individual, self-interested decision making in multiagent settings is modeled using interactive dynamic influence diagrams (I-DID). These are graphical models with the benefit that they naturally offer a factored representation of the problem, allowing agents to ascribe dynamic models to others and reason about them. We demonstrate that an implication of bounded, finitely-nested reasoning by a self-interested agent is that it may not obtain optimal team solutions in cooperative settings when it is part of a team. We address this limitation by including models at level 0 whose solutions involve reinforcement learning. We show how the learning is integrated into planning in the context of I-DIDs. This facilitates optimal teammate behavior, and we demonstrate its applicability to ad hoc teamwork on several problem domains and configurations.
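A drastically simplified sketch of the level-0 idea: the bottom of the finite nesting is a teammate model obtained by reinforcement learning (tabular Q-learning here), against which a level-1 agent best-responds. The toy environment, reward, and planning step are placeholders, not the paper's I-DID machinery.

```python
import random
from collections import defaultdict

def learn_level0_policy(states, actions, step, episodes=2000,
                        alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning for a level-0 teammate model.

    `step(s, a) -> (next_state, reward, done)` is a toy environment
    standing in for the teammate's individual decision problem.
    """
    q = defaultdict(float)
    for _ in range(episodes):
        s = random.choice(states)
        done = False
        while not done:
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda x: q[(s, x)]))
            s2, r, done = step(s, a)
            best_next = 0.0 if done else max(q[(s2, x)] for x in actions)
            q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
            s = s2
    return {s: max(actions, key=lambda x: q[(s, x)]) for s in states}

def level1_best_response(state, my_actions, teammate_policy, joint_reward):
    """Level-1 agent: best-respond assuming the teammate follows the
    learned level-0 policy (the bottom of the finite nesting)."""
    teammate_action = teammate_policy[state]
    return max(my_actions, key=lambda a: joint_reward(state, a, teammate_action))

if __name__ == "__main__":
    states, actions = [0, 1], [0, 1]
    def step(s, a):  # toy 2-state task where action 1 is always better
        return a, (1.0 if a == 1 else 0.0), random.random() < 0.2
    pol = learn_level0_policy(states, actions, step)
    joint = lambda s, mine, theirs: 1.0 if mine == theirs else 0.0
    print(pol, level1_best_response(0, actions, pol, joint))
```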