RecD: Deduplication for End-to-End Deep Learning Recommendation Model Training Infrastructure
We present RecD (Recommendation Deduplication), a suite of end-to-end
infrastructure optimizations across the Deep Learning Recommendation Model
(DLRM) training pipeline. RecD addresses immense storage, preprocessing, and
training overheads caused by feature duplication inherent in industry-scale
DLRM training datasets. Feature duplication arises because DLRM datasets are
generated from interactions. While each user session can generate multiple
training samples, many features' values do not change across these samples. We
demonstrate how RecD exploits this property, end-to-end, across a deployed
training pipeline. RecD optimizes data generation pipelines to decrease dataset
storage and preprocessing resource demands and to maximize duplication within a
training batch. RecD introduces a new tensor format, InverseKeyedJaggedTensors
(IKJTs), to deduplicate feature values in each batch. We show how DLRM model
architectures can leverage IKJTs to drastically increase training throughput.
RecD improves the training and preprocessing throughput and storage efficiency
by up to 2.48x, 1.79x, and 3.71x, respectively, in an industry-scale DLRM
training system.
Comment: Published in the Proceedings of the Sixth Conference on Machine Learning and Systems (MLSys 2023).
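The core deduplication idea can be illustrated schematically: process only the unique feature values in a batch and keep an inverse index mapping each sample back to its value. This is a minimal NumPy sketch of that pattern, not the actual IKJT format or TorchRec API; the function names are hypothetical.

```python
import numpy as np

def dedup_batch(feature_values: np.ndarray):
    """Deduplicate repeated feature values within a batch.

    Returns the unique values plus an inverse index mapping each
    original sample back to its unique value, so downstream work
    (e.g. embedding lookups) runs only on the unique set.
    (Hypothetical sketch of the deduplication pattern, not RecD's
    actual IKJT implementation.)
    """
    unique_vals, inverse = np.unique(feature_values, return_inverse=True)
    return unique_vals, inverse

def restore_batch(unique_vals: np.ndarray, inverse: np.ndarray) -> np.ndarray:
    """Scatter results computed on unique values back to batch order."""
    return unique_vals[inverse]

# Many samples from one user session share feature values:
batch = np.array([7, 7, 7, 3, 3, 9])
uniq, inv = dedup_batch(batch)
# Only 3 distinct values need processing instead of 6 samples.
assert np.array_equal(restore_batch(uniq, inv), batch)
```

In the paper's setting the same principle applies per feature within a training batch, which is why maximizing duplication within a batch (as RecD's data generation pipeline does) directly increases the savings.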
Data Synchronization Method Between FLPs and EPNs
In 2018, the ALICE detectors will be upgraded and the amount of data produced by the detectors will increase, with a throughput of approximately 1 TB per second. For this reason, the whole ALICE system has to be improved. The constraint on the new ALICE experiment is that the system must be able to handle the large number of particle collision events produced by the detectors. The data acquisition comprises two computer clusters, First Level Processors (FLPs) and Event Processing Nodes (EPNs), which serve different purposes. The FLP cluster is responsible for reading out the detectors and reducing the data by a factor of 5 before event reconstruction, because the data have to be streamed to permanent storage. The data are then sent through the network to the EPNs for event reconstruction and further processing. In the new system, a scheduler between the two clusters is required to efficiently propagate jobs and data from FLPs to EPNs. Each FLP node receives collision events that are grouped into time frames spanning 0.1 second, and each time frame must be forwarded immediately to its assigned EPN node. Scheduling and data synchronization must be fast, because any delay may cause a severe bottleneck and provoke side effects such as increasing the buffer space needed on the FLPs. In this work, we adopt ZooKeeper, a data synchronization tool, to distribute the event schedule created by the scheduler to the FLPs.
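The scheduler's central task is to map each 0.1 s time frame to one EPN node so that every FLP forwards its share of that frame to the same destination. A minimal round-robin sketch of that mapping, in plain Python (the function and node names are hypothetical; the abstract's system would publish such a mapping to FLPs via ZooKeeper rather than compute it locally):

```python
from itertools import cycle

def schedule_time_frames(frame_ids, epn_nodes):
    """Assign each 0.1 s time frame to an EPN node round-robin.

    Hypothetical sketch: in the described system, the scheduler would
    write this frame-to-EPN mapping into ZooKeeper so that all FLPs
    see a consistent assignment and forward a given frame's data to
    the same EPN node.
    """
    assignment = {}
    nodes = cycle(epn_nodes)
    for frame_id in frame_ids:
        assignment[frame_id] = next(nodes)
    return assignment

schedule = schedule_time_frames(range(6), ["epn0", "epn1", "epn2"])
# Frames 0 and 3 map to epn0, frames 1 and 4 to epn1, frames 2 and 5 to epn2.
```

A real scheduler would also weight assignments by EPN load and backlog; round-robin is used here only to make the frame-to-node mapping concrete.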
A runtime estimation framework for ALICE
The European Organization for Nuclear Research (CERN) is the largest research organization for particle physics. ALICE, short for A Large Ion Collider Experiment, serves as one of the main detectors at CERN and produces approximately 15 petabytes of data each year. The computing associated with an ALICE experiment consists of both online and offline processing. An online cluster retrieves data while an offline cluster farm performs a broad range of data analysis. Online processing occurs as collision events are streamed from the detector to the online cluster. This process compresses and calibrates the data before storing it in a data storage system for subsequent offline processing, e.g., event reconstruction. Due to the large volume of stored data to process, offline processing seeks to minimize execution time and data-staging time of the applications via a two-tier offline cluster: the Event Processing Node (EPN) as the first tier and the Worldwide LHC Computing Grid (WLCG) as the second tier. This two-tier cluster requires a smart job scheduler to efficiently manage the running applications. Thus, we propose a runtime estimation method for this offline processing in the ALICE environment.
Our approach exploits application profiles to predict the runtime of a high-performance computing (HPC) application without the need for any additional metadata. To evaluate our proposed framework, we performed our experiments on actual ALICE applications. In addition, we also tested the efficacy of our runtime estimation method in predicting the run times of HPC applications on the Amazon EC2 cloud. The results show that our approach generally delivers accurate predictions, i.e., low error percentages.
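Profile-based runtime estimation can be made concrete with a toy model: fit a relationship between a profile feature (here, input size) and observed runtime over historical runs, then use the fit to predict new runs. This least-squares sketch is purely illustrative and does not reproduce the paper's framework, which derives its features from application profiles rather than a single size metric; all names below are hypothetical.

```python
def fit_runtime_model(profiles):
    """Fit runtime ~= a * input_size + b by ordinary least squares
    over historical (input_size, runtime) profile pairs.

    Hypothetical sketch of profile-based estimation, not the
    paper's actual framework.
    """
    n = len(profiles)
    sx = sum(x for x, _ in profiles)
    sy = sum(y for _, y in profiles)
    sxx = sum(x * x for x, _ in profiles)
    sxy = sum(x * y for x, y in profiles)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return a, b

def estimate_runtime(model, input_size):
    """Predict runtime for a new job from the fitted model."""
    a, b = model
    return a * input_size + b

# Historical profiles: (input size in GB, observed runtime in seconds).
model = fit_runtime_model([(1, 12.0), (2, 22.0), (4, 42.0)])
predicted = estimate_runtime(model, 8)  # extrapolate to an 8 GB job
```

Prediction error would then be reported as a percentage of the observed runtime, matching the "low error percentages" evaluation criterion in the abstract.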