Efficient Time-Energy Execution of Data-Parallel Applications on Heterogeneous Systems with GPU
Ph.D. (Doctor of Philosophy)
Federated NLP in Few-shot Scenarios
Natural language processing (NLP) powers many rich mobile applications. To
support various language understanding tasks, a foundation NLP model is often
fine-tuned in a federated, privacy-preserving setting (FL). This process
currently relies on at least hundreds of thousands of labeled training samples
from mobile clients; yet mobile users are often unwilling, or lack the
expertise, to label their data. Such a shortage of data labels is known as a
few-shot scenario, and it has become the key blocker for mobile NLP
applications.
For the first time, this work investigates federated NLP in the few-shot
scenario (FedFSL). By retrofitting algorithmic advances in pseudo labeling and
prompt learning, we first establish a training pipeline that delivers
competitive accuracy when only 0.05% (fewer than 100) of the training samples
are labeled and the rest are unlabeled. To instantiate the workflow, we further
present FFNLP, a system that addresses the high execution cost with novel
designs.
(1) Curriculum pacing, which injects pseudo labels into the training workflow
at a rate commensurate with the learning progress; (2) Representational
diversity, a mechanism for selecting the most learnable unlabeled data, for
which alone pseudo labels are generated; (3) Co-planning of a model's training
depth and layer capacity. Together, these designs reduce the training delay,
client energy, and network traffic by up to 46.0×, 41.2×, and 3000.0×,
respectively. Through algorithm/system co-design, FFNLP demonstrates that FL
can be applied to challenging settings where most training samples are
unlabeled.
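
Designs (1) and (2) can be made concrete with a short sketch. The Python below
is a hypothetical illustration, not the authors' FFNLP code: every name
(pacing_fraction, select_diverse, the stand-in CentroidModel) is invented, and
the diversity heuristic shown is a generic farthest-point selection standing in
for whatever selection mechanism FFNLP actually uses.

import numpy as np

def pacing_fraction(val_acc, lo=0.1, hi=1.0):
    # Design (1), curriculum pacing: admit pseudo labels at a rate that
    # grows with the model's current learning progress (here, accuracy).
    return lo + (hi - lo) * val_acc

def select_diverse(emb, k):
    # Design (2), representational diversity: greedy farthest-point
    # selection spreads the chosen samples across distinct regions of
    # embedding space (a generic stand-in heuristic).
    chosen = [0]
    while len(chosen) < k:
        dists = np.linalg.norm(emb[:, None, :] - emb[chosen][None, :, :], axis=-1)
        chosen.append(int(np.argmax(dists.min(axis=1))))
    return chosen

class CentroidModel:
    # Toy stand-in for a fine-tuned NLP encoder (hypothetical).
    def __init__(self, dim=8, classes=2, seed=0):
        self.centroids = np.random.default_rng(seed).normal(size=(classes, dim))
    def embed(self, xs):
        return np.asarray(xs)
    def predict(self, x):
        return int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))
    def fit(self, pairs):
        for c in range(len(self.centroids)):
            xs = [x for x, y in pairs if y == c]
            if xs:
                self.centroids[c] = np.mean(xs, axis=0)

def training_round(model, labeled, unlabeled, val_acc):
    emb = model.embed(unlabeled)
    budget = max(1, int(pacing_fraction(val_acc) * len(unlabeled)))
    picked = select_diverse(emb, budget)
    # Pseudo labels are generated only for the selected samples.
    pseudo = [(unlabeled[i], model.predict(unlabeled[i])) for i in picked]
    model.fit(labeled + pseudo)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    labeled = [(rng.normal(2 * c, 1, 8), c) for c in (0, 1) for _ in range(5)]
    unlabeled = [rng.normal(2 * rng.integers(2), 1, 8) for _ in range(200)]
    training_round(CentroidModel(), labeled, unlabeled, val_acc=0.6)

The point of the sketch is the control flow: the pacing function throttles how
many pseudo labels enter training each round, and the diversity selector
decides which unlabeled samples earn them.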
An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System
Training machine learning (ML) algorithms is a computationally intensive
process, which is frequently memory-bound due to repeatedly accessing large
training datasets. As a result, processor-centric systems (e.g., CPU, GPU)
suffer from costly data movement between memory units and processing units,
which consumes large amounts of energy and execution cycles. Memory-centric
computing systems, i.e., with processing-in-memory (PIM) capabilities, can
alleviate this data movement bottleneck.
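
A rough roofline-style estimate illustrates why such a training pass is
memory-bound. The figures below are a generic sketch with assumed sizes and
idealized operation counting, not measurements from the paper:

# Back-of-envelope arithmetic intensity of one full-batch gradient pass of
# linear regression over n samples with d float32 features.
n, d = 10_000_000, 16
bytes_moved = n * d * 4   # every feature value streams in from memory once
flops = 4 * n * d         # ~2nd FLOPs for predictions X@w, ~2nd for gradient X.T@r
intensity = flops / bytes_moved
print(intensity)          # 1.0 FLOP/byte: well below the tens of FLOPs per byte
                          # modern CPUs/GPUs need to stay compute-bound, so the
                          # pass is dominated by data movement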
Our goal is to understand the potential of modern general-purpose PIM
architectures to accelerate ML training. To do so, we (1) implement several
representative classic ML algorithms (namely, linear regression, logistic
regression, decision tree, K-Means clustering) on a real-world general-purpose
PIM architecture, (2) rigorously evaluate and characterize them in terms of
accuracy, performance and scaling, and (3) compare to their counterpart
implementations on CPU and GPU. Our evaluation on a real memory-centric
computing system with more than 2500 PIM cores shows that general-purpose PIM
architectures can greatly accelerate memory-bound ML workloads, when the
necessary operations and datatypes are natively supported by PIM hardware. For
example, our PIM implementation of decision tree is faster than a
state-of-the-art CPU version on an 8-core Intel Xeon, and faster
than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering
on PIM is likewise faster than state-of-the-art CPU and GPU versions.
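
The data-parallel pattern such PIM implementations follow (partition the
training set across PIM cores, compute partial results near memory, reduce
only small partials on the host) can be emulated in a few lines. This is a
NumPy sketch of the concept under assumed simplifications, not the paper's
code; the real implementations target actual PIM hardware and its native
programming SDK, and all names here are hypothetical.

import numpy as np

def pim_core_gradient(X_local, y_local, w):
    # Runs "inside" one emulated PIM core: touches only the data that lives
    # in that core's local memory bank.
    residual = X_local @ w - y_local
    return X_local.T @ residual          # partial gradient: just d values

def train_linear_regression(X, y, n_cores=2560, lr=0.5, epochs=20):
    # Shard the dataset once across the (emulated) PIM cores.
    shards = list(zip(np.array_split(X, n_cores), np.array_split(y, n_cores)))
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Each epoch, only d floats per core cross the core/host boundary,
        # instead of the whole training set moving to a CPU/GPU.
        partials = [pim_core_gradient(Xl, yl, w) for Xl, yl in shards]
        w -= lr * np.sum(partials, axis=0) / len(y)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 16))
    w_true = rng.normal(size=16)
    w = train_linear_regression(X, X @ w_true)
    print(np.allclose(w, w_true, atol=1e-3))   # True: recovers the weights

The design point the sketch highlights is that per-epoch gradient traffic
scales with model size (d) rather than dataset size (n x d), which is exactly
the data movement the abstract identifies as the bottleneck.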
To our knowledge, this is the first work to evaluate ML training on a
real-world PIM architecture. We conclude with key observations, takeaways, and
recommendations that can inspire users of ML workloads, programmers of PIM
architectures, and hardware designers and architects of future memory-centric
computing systems.