1,580 research outputs found

    DALiuGE: A Graph Execution Framework for Harnessing the Astronomical Data Deluge

    Full text link
    The Data Activated Liu Graph Engine - DALiuGE - is an execution framework for processing large astronomical datasets at a scale required by the Square Kilometre Array Phase 1 (SKA1). It includes an interface for expressing complex data reduction pipelines consisting of both data sets and algorithmic components and an implementation run-time to execute such pipelines on distributed resources. By mapping the logical view of a pipeline to its physical realisation, DALiuGE separates the concerns of multiple stakeholders, allowing them to collectively optimise large-scale data processing solutions in a coherent manner. The execution in DALiuGE is data-activated, where each individual data item autonomously triggers the processing on itself. Such decentralisation also makes the execution framework very scalable and flexible, supporting pipeline sizes ranging from less than ten tasks running on a laptop to tens of millions of concurrent tasks on the second fastest supercomputer in the world. DALiuGE has been used in production for reducing interferometry data sets from the Karl E. Jansky Very Large Array and the Mingantu Ultrawide Spectral Radioheliograph; and is being developed as the execution framework prototype for the Science Data Processor (SDP) consortium of the Square Kilometre Array (SKA) telescope. This paper presents a technical overview of DALiuGE and discusses case studies from the CHILES and MUSER projects that use DALiuGE to execute production pipelines. In a companion paper, we provide in-depth analysis of DALiuGE's scalability to very large numbers of tasks on two supercomputing facilities.Comment: 31 pages, 12 figures, currently under review by Astronomy and Computin

    Hive on spark and MapReduce : a methodology for parameter tuning

    Get PDF
    Project Work presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Information Systems and Technologies ManagementAs the era of “big data” has arrived, more and more companies start using distributed file systems to manage and process their data streams like the Hadoop distributed file system framework (HDFS). This software library offers a way to store large files across multiple machines. Large data sets are processed by using its inherent programming model MapReduce. Apache Spark is a relatively new alternative to Hadoop MapReduce and claims to offer a performance boost up to 10 times for certain applications, while maintaining its automatic fault tolerance. To leverage the Data Warehouse capabilities of Hadoop Apache Hive was introduced. It is a concept for Big Data analytics that works on top of Hadoop and provides data analysis tools and most importantly translates queries to MapReduce and Spark jobs. Therefore, it exploits the scalability of Hadoop and offers data exploration and mining capabilities to non-developers. However, it is difficult for users to utilize the full potential of the Apache Spark execution engine. This results in very long execution times. Therefore, this project work gives researches and companies a tuning methodology that significantly can improve the execution time of queries. As a result, this tuning methodology could optimize a real-world batch-processing query by 5 times. Moreover, it gives insides in the underlying reasons of this big improvement by using Apache Spark Monitoring tools. The result can be helpful for many practitioners and researchers that would like to optimise the performance of Spark and MapReduce queries executed in Hive on top of an Apache Hadoop cluster

    Private Summation in the Multi-Message Shuffle Model

    Full text link
    The shuffle model of differential privacy (Erlingsson et al. SODA 2019; Cheu et al. EUROCRYPT 2019) and its close relative encode-shuffle-analyze (Bittau et al. SOSP 2017) provide a fertile middle ground between the well-known local and central models. Similarly to the local model, the shuffle model assumes an untrusted data collector who receives privatized messages from users, but in this case a secure shuffler is used to transmit messages from users to the collector in a way that hides which messages came from which user. An interesting feature of the shuffle model is that increasing the amount of messages sent by each user can lead to protocols with accuracies comparable to the ones achievable in the central model. In particular, for the problem of privately computing the sum of nn bounded real values held by nn different users, Cheu et al. showed that O(n)O(\sqrt{n}) messages per user suffice to achieve O(1)O(1) error (the optimal rate in the central model), while Balle et al. (CRYPTO 2019) recently showed that a single message per user leads to Θ(n1/3)\Theta(n^{1/3}) MSE (mean squared error), a rate strictly in-between what is achievable in the local and central models. This paper introduces two new protocols for summation in the shuffle model with improved accuracy and communication trade-offs. Our first contribution is a recursive construction based on the protocol from Balle et al. mentioned above, providing poly(loglogn)\mathrm{poly}(\log \log n) error with O(loglogn)O(\log \log n) messages per user. The second contribution is a protocol with O(1)O(1) error and O(1)O(1) messages per user based on a novel analysis of the reduction from secure summation to shuffling introduced by Ishai et al. (FOCS 2006) (the original reduction required O(logn)O(\log n) messages per user).Comment: Published at CCS'2

    InTune: Reinforcement Learning-based Data Pipeline Optimization for Deep Recommendation Models

    Full text link
    Deep learning-based recommender models (DLRMs) have become an essential component of many modern recommender systems. Several companies are now building large compute clusters reserved only for DLRM training, driving new interest in cost- and time- saving optimizations. The systems challenges faced in this setting are unique; while typical deep learning training jobs are dominated by model execution, the most important factor in DLRM training performance is often online data ingestion. In this paper, we explore the unique characteristics of this data ingestion problem and provide insights into DLRM training pipeline bottlenecks and challenges. We study real-world DLRM data processing pipelines taken from our compute cluster at Netflix to observe the performance impacts of online ingestion and to identify shortfalls in existing pipeline optimizers. We find that current tooling either yields sub-optimal performance, frequent crashes, or else requires impractical cluster re-organization to adopt. Our studies lead us to design and build a new solution for data pipeline optimization, InTune. InTune employs a reinforcement learning (RL) agent to learn how to distribute the CPU resources of a trainer machine across a DLRM data pipeline to more effectively parallelize data loading and improve throughput. Our experiments show that InTune can build an optimized data pipeline configuration within only a few minutes, and can easily be integrated into existing training workflows. By exploiting the responsiveness and adaptability of RL, InTune achieves higher online data ingestion rates than existing optimizers, thus reducing idle times in model execution and increasing efficiency. We apply InTune to our real-world cluster, and find that it increases data ingestion throughput by as much as 2.29X versus state-of-the-art data pipeline optimizers while also improving both CPU & GPU utilization.Comment: Accepted at RecSys 2023. 11 pages, 2 pages of references. 8 figures with 2 table
    corecore