RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure
We present RecD (Recommendation Deduplication), a suite of end-to-end
infrastructure optimizations across the Deep Learning Recommendation Model
(DLRM) training pipeline. RecD addresses immense storage, preprocessing, and
training overheads caused by feature duplication inherent in industry-scale
DLRM training datasets. Feature duplication arises because DLRM datasets are
generated from interactions. While each user session can generate multiple
training samples, many features' values do not change across these samples. We
demonstrate how RecD exploits this property, end-to-end, across a deployed
training pipeline. RecD optimizes data generation pipelines to decrease dataset
storage and preprocessing resource demands and to maximize duplication within a
training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors
(IKJTs), to deduplicate feature values in each batch. We show how DLRM model
architectures can leverage IKJTs to drastically increase training throughput.
RecD improves the training and preprocessing throughput and storage efficiency
by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM
training system.
Comment: Published in the Proceedings of the Sixth Conference on Machine Learning and Systems (MLSys 2023).
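The core deduplication idea can be illustrated schematically: process only the unique feature values in a batch and keep an inverse index mapping each sample back to its value. This is a minimal NumPy sketch of that pattern, not the actual IKJT format or TorchRec API; the function names are hypothetical.

```python
import numpy as np

def dedup_batch(feature_values: np.ndarray):
    """Deduplicate repeated feature values within a batch.

    Returns the unique values plus an inverse index mapping each
    original sample back to its unique value, so downstream work
    (e.g. embedding lookups) runs only on the unique set.
    (Hypothetical sketch of the deduplication pattern, not RecD's
    actual IKJT implementation.)
    """
    unique_vals, inverse = np.unique(feature_values, return_inverse=True)
    return unique_vals, inverse

def restore_batch(unique_vals: np.ndarray, inverse: np.ndarray) -> np.ndarray:
    """Scatter results computed on unique values back to batch order."""
    return unique_vals[inverse]

# Many samples from one user session share feature values:
batch = np.array([7, 7, 7, 3, 3, 9])
uniq, inv = dedup_batch(batch)
# Only 3 distinct values need processing instead of 6 samples.
assert np.array_equal(restore_batch(uniq, inv), batch)
```

In the paper's setting the same principle applies per feature within a training batch, which is why maximizing duplication within a batch (as RecD's data generation pipeline does) directly increases the savings.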
Data Synchronization Method Between FLPs and EPNs
In 2018, the ALICE detectors will be upgraded and the amount of data produced by the detectors will increase, with a throughput of approximately 1 TB per second. For this reason, the whole ALICE system has to be improved. The constraint on the new ALICE experiment is that the system must be able to handle the large number of particle collision events produced by the detectors. The data acquisition comprises two computer clusters, First Level Processors (FLPs) and Event Processing Nodes (EPNs), which serve different purposes. The FLP cluster is responsible for reading out the detectors and reducing the data by a factor of 5 before event reconstruction, because the data have to be streamed to permanent storage. The data are then sent through the network to the EPNs for event reconstruction and further processing. In the new system, a scheduler between the two clusters is required to efficiently propagate jobs and data from FLPs to EPNs. Each FLP node receives collision events that are grouped into time frames spanning 0.1 second, and each time frame must be forwarded immediately to its assigned EPN node. Scheduling and data synchronization must be fast, because any delay may cause a severe bottleneck and provoke side effects such as increasing the buffer space needed on the FLPs. In this work, we adopt ZooKeeper, a data synchronization tool, to distribute the event schedule created by the scheduler to the FLPs.
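The scheduler's central task is to map each 0.1 s time frame to one EPN node so that every FLP forwards its share of that frame to the same destination. A minimal round-robin sketch of that mapping, in plain Python (the function and node names are hypothetical; the abstract's system would publish such a mapping to FLPs via ZooKeeper rather than compute it locally):

```python
from itertools import cycle

def schedule_time_frames(frame_ids, epn_nodes):
    """Assign each 0.1 s time frame to an EPN node round-robin.

    Hypothetical sketch: in the described system, the scheduler would
    write this frame-to-EPN mapping into ZooKeeper so that all FLPs
    see a consistent assignment and forward a given frame's data to
    the same EPN node.
    """
    assignment = {}
    nodes = cycle(epn_nodes)
    for frame_id in frame_ids:
        assignment[frame_id] = next(nodes)
    return assignment

schedule = schedule_time_frames(range(6), ["epn0", "epn1", "epn2"])
# Frames 0 and 3 map to epn0, frames 1 and 4 to epn1, frames 2 and 5 to epn2.
```

A real scheduler would also weight assignments by EPN load and backlog; round-robin is used here only to make the frame-to-node mapping concrete.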
A runtime estimation framework for ALICE
The European Organization for Nuclear Research (CERN) is the largest research organization for particle physics. ALICE, short for A Large Ion Collider Experiment, serves as one of the main detectors at CERN and produces approximately 15 petabytes of data each year. The computing associated with an ALICE experiment consists of both online and offline processing. An online cluster retrieves data while an offline cluster farm performs a broad range of data analysis. Online processing occurs as collision events are streamed from the detector to the online cluster. This process compresses and calibrates the data before storing it in a data storage system for subsequent offline processing, e.g., event reconstruction. Due to the large volume of stored data to process, offline processing seeks to minimize execution time and data-staging time of the applications via a two-tier offline cluster: the Event Processing Node (EPN) as the first tier and the Worldwide LHC Computing Grid (WLCG) as the second tier. This two-tier cluster requires a smart job scheduler to efficiently manage the running applications. Thus, we propose a runtime estimation method for this offline processing in the ALICE environment.
Our approach exploits application profiles to predict the runtime of a high-performance computing (HPC) application without the need for any additional metadata. To evaluate our proposed framework, we performed our experiments on actual ALICE applications. In addition, we also tested the efficacy of our runtime estimation method in predicting the run times of HPC applications on the Amazon EC2 cloud. The results show that our approach generally delivers accurate predictions, i.e., low error percentages.
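Profile-based runtime estimation can be made concrete with a toy model: fit a relationship between a profile feature (here, input size) and observed runtime over historical runs, then use the fit to predict new runs. This least-squares sketch is purely illustrative and does not reproduce the paper's framework, which derives its features from application profiles rather than a single size metric; all names below are hypothetical.

```python
def fit_runtime_model(profiles):
    """Fit runtime ~= a * input_size + b by ordinary least squares
    over historical (input_size, runtime) profile pairs.

    Hypothetical sketch of profile-based estimation, not the
    paper's actual framework.
    """
    n = len(profiles)
    sx = sum(x for x, _ in profiles)
    sy = sum(y for _, y in profiles)
    sxx = sum(x * x for x, _ in profiles)
    sxy = sum(x * y for x, y in profiles)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def estimate_runtime(model, input_size):
    """Predict runtime for a new job from the fitted model."""
    a, b = model
    return a * input_size + b

# Historical profiles: (input size in GB, observed runtime in seconds).
model = fit_runtime_model([(1, 12.0), (2, 22.0), (4, 42.0)])
predicted = estimate_runtime(model, 8)  # extrapolate to an 8 GB job
```

Prediction error would then be reported as a percentage of the observed runtime, matching the "low error percentages" evaluation criterion in the abstract.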