Efficient Time-Energy Execution of Data-Parallel Applications on Heterogeneous Systems with GPU
Ph.D. (Doctor of Philosophy)
Federated NLP in Few-shot Scenarios
Natural language processing (NLP) powers many rich mobile applications. To
support various language understanding tasks, a foundation NLP model is often
fine-tuned in a federated, privacy-preserving setting (FL). This process
currently relies on at least hundreds of thousands of labeled training samples
from mobile clients; yet mobile users are often unwilling, or lack the
expertise, to label their data. Such a shortage of data labels is known as a
few-shot scenario, and it has become the key blocker for mobile NLP
applications.
For the first time, this work investigates federated NLP in the few-shot
scenario (FedFSL). By retrofitting algorithmic advances in pseudo labeling and
prompt learning, we first establish a training pipeline that delivers
competitive accuracy when only 0.05% (fewer than 100) of the training samples
are labeled and the rest are unlabeled. To instantiate the workflow, we further
present FFNLP, a system that addresses the high execution cost with novel
designs.
(1) Curriculum pacing, which injects pseudo labels into the training workflow
at a rate commensurate with the learning progress; (2) Representational
diversity, a mechanism for selecting the most learnable unlabeled data, for
which alone pseudo labels are generated; (3) Co-planning of a model's training
depth and layer capacity. Together, these designs reduce the training delay,
client energy, and network traffic by up to 46.0×, 41.2×, and 3000.0×,
respectively. Through algorithm/system co-design, FFNLP demonstrates that FL
can be applied to challenging settings where most training samples are
unlabeled.
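
Designs (1) and (2) can be made concrete with a short sketch. The Python below
is a hypothetical illustration, not the authors' FFNLP code: every name
(pacing_fraction, select_diverse, the stand-in CentroidModel) is invented, and
the diversity heuristic shown is a generic farthest-point selection standing in
for whatever selection mechanism FFNLP actually uses.

import numpy as np

def pacing_fraction(val_acc, lo=0.1, hi=1.0):
    # Design (1), curriculum pacing: admit pseudo labels at a rate that
    # grows with the model's current learning progress (here, accuracy).
    return lo + (hi - lo) * val_acc

def select_diverse(emb, k):
    # Design (2), representational diversity: greedy farthest-point
    # selection spreads the chosen samples across distinct regions of
    # embedding space (a generic stand-in heuristic).
    chosen = [0]
    while len(chosen) < k:
        dists = np.linalg.norm(emb[:, None, :] - emb[chosen][None, :, :], axis=-1)
        chosen.append(int(np.argmax(dists.min(axis=1))))
    return chosen

class CentroidModel:
    # Toy stand-in for a fine-tuned NLP encoder (hypothetical).
    def __init__(self, dim=8, classes=2, seed=0):
        self.centroids = np.random.default_rng(seed).normal(size=(classes, dim))
    def embed(self, xs):
        return np.asarray(xs)
    def predict(self, x):
        return int(np.argmin(np.linalg.norm(self.centroids - x, axis=1)))
    def fit(self, pairs):
        for c in range(len(self.centroids)):
            xs = [x for x, y in pairs if y == c]
            if xs:
                self.centroids[c] = np.mean(xs, axis=0)

def training_round(model, labeled, unlabeled, val_acc):
    emb = model.embed(unlabeled)
    budget = max(1, int(pacing_fraction(val_acc) * len(unlabeled)))
    picked = select_diverse(emb, budget)
    # Pseudo labels are generated only for the selected samples.
    pseudo = [(unlabeled[i], model.predict(unlabeled[i])) for i in picked]
    model.fit(labeled + pseudo)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    labeled = [(rng.normal(2 * c, 1, 8), c) for c in (0, 1) for _ in range(5)]
    unlabeled = [rng.normal(2 * rng.integers(2), 1, 8) for _ in range(200)]
    training_round(CentroidModel(), labeled, unlabeled, val_acc=0.6)

The point of the sketch is the control flow: the pacing function throttles how
many pseudo labels enter training each round, and the diversity selector
decides which unlabeled samples earn them.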
An Experimental Evaluation of Machine Learning Training on a Real Processing-in-Memory System
Training machine learning (ML) algorithms is a computationally intensive
process, which is frequently memory-bound due to repeatedly accessing large
training datasets. As a result, processor-centric systems (e.g., CPU, GPU)
suffer from costly data movement between memory units and processing units,
which consumes large amounts of energy and execution cycles. Memory-centric
computing systems, i.e., with processing-in-memory (PIM) capabilities, can
alleviate this data movement bottleneck.
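
A rough roofline-style estimate illustrates why such a training pass is
memory-bound. The figures below are a generic sketch with assumed sizes and
idealized operation counting, not measurements from the paper:

# Back-of-envelope arithmetic intensity of one full-batch gradient pass of
# linear regression over n samples with d float32 features.
n, d = 10_000_000, 16
bytes_moved = n * d * 4   # every feature value streams in from memory once
flops = 4 * n * d         # ~2nd FLOPs for predictions X@w, ~2nd for gradient X.T@r
intensity = flops / bytes_moved
print(intensity)          # 1.0 FLOP/byte: well below the tens of FLOPs per byte
                          # modern CPUs/GPUs need to stay compute-bound, so the
                          # pass is dominated by data movement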
Our goal is to understand the potential of modern general-purpose PIM
architectures to accelerate ML training. To do so, we (1) implement several
representative classic ML algorithms (namely, linear regression, logistic
regression, decision tree, K-Means clustering) on a real-world general-purpose
PIM architecture, (2) rigorously evaluate and characterize them in terms of
accuracy, performance and scaling, and (3) compare to their counterpart
implementations on CPU and GPU. Our evaluation on a real memory-centric
computing system with more than 2500 PIM cores shows that general-purpose PIM
architectures can greatly accelerate memory-bound ML workloads, when the
necessary operations and datatypes are natively supported by PIM hardware. For
example, our PIM implementation of decision tree is faster than a
state-of-the-art CPU version on an 8-core Intel Xeon, and faster
than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering
on PIM is likewise faster than state-of-the-art CPU and GPU versions.
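
The data-parallel pattern such PIM implementations follow (partition the
training set across PIM cores, compute partial results near memory, reduce
only small partials on the host) can be emulated in a few lines. This is a
NumPy sketch of the concept under assumed simplifications, not the paper's
code; the real implementations target actual PIM hardware and its native
programming SDK, and all names here are hypothetical.

import numpy as np

def pim_core_gradient(X_local, y_local, w):
    # Runs "inside" one emulated PIM core: touches only the data that lives
    # in that core's local memory bank.
    residual = X_local @ w - y_local
    return X_local.T @ residual          # partial gradient: just d values

def train_linear_regression(X, y, n_cores=2560, lr=0.5, epochs=20):
    # Shard the dataset once across the (emulated) PIM cores.
    shards = list(zip(np.array_split(X, n_cores), np.array_split(y, n_cores)))
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        # Each epoch, only d floats per core cross the core/host boundary,
        # instead of the whole training set moving to a CPU/GPU.
        partials = [pim_core_gradient(Xl, yl, w) for Xl, yl in shards]
        w -= lr * np.sum(partials, axis=0) / len(y)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 16))
    w_true = rng.normal(size=16)
    w = train_linear_regression(X, X @ w_true)
    print(np.allclose(w, w_true, atol=1e-3))   # True: recovers the weights

The design point the sketch highlights is that per-epoch gradient traffic
scales with model size (d) rather than dataset size (n x d), which is exactly
the data movement the abstract identifies as the bottleneck.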
To our knowledge, this is the first work to evaluate ML training on a
real-world PIM architecture. We conclude with key observations, takeaways, and
recommendations that can inspire users of ML workloads, programmers of PIM
architectures, and hardware designers and architects of future memory-centric
computing systems.