263 research outputs found

    Efficient Fault Tolerance for Pipelined Query Engines via Write-ahead Lineage

    Modern distributed pipelined query engines either do not support intra-query fault tolerance or employ high-overhead approaches such as persisting intermediate outputs or checkpointing state. In this work, we present write-ahead lineage, a novel fault recovery technique that combines Spark's lineage-based replay and write-ahead logging. Unlike Spark, where the lineage is determined before query execution, write-ahead lineage persistently logs lineage at runtime to support dynamic task dependencies in pipelined query engines. Since only KB-sized lineages are persisted instead of MB-sized intermediate outputs, the normal execution overhead is minimal compared to spooling- or checkpointing-based approaches. To ensure fast fault recovery times, tasks only consume intermediate outputs with persisted lineage, preventing global rollbacks upon failure. In addition, lost tasks from different stages can be recovered in a pipelined parallel manner. We implement write-ahead lineage in a distributed pipelined query engine called Quokka. We show that Quokka is around 2x faster than SparkSQL on the TPC-H benchmark with similar fault recovery performance.
    Comment: ICDE 2024 (copyright IEEE)
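
    A minimal sketch of the write-ahead idea, in Python, assuming hypothetical names (LineageLog, run_task, recover); Quokka's actual log format, scheduler, and recovery protocol are not described in this abstract. The point being illustrated is that a task's lineage is persisted before its inputs are consumed, so recovery replays only the logged tasks instead of rolling back globally.

    import json, os

    class LineageLog:
        """Append-only durable log of KB-sized lineage records (hypothetical)."""
        def __init__(self, path):
            self.path = path

        def append(self, record):
            # Persist the lineage *before* the task consumes its inputs.
            with open(self.path, "a") as f:
                f.write(json.dumps(record) + "\n")
                f.flush()
                os.fsync(f.fileno())

        def replay(self):
            if not os.path.exists(self.path):
                return []
            with open(self.path) as f:
                return [json.loads(line) for line in f]

    def run_task(task_id, stage, input_ids, compute, log):
        # Downstream tasks only consume outputs whose lineage is already logged,
        # which is what avoids a global rollback on failure.
        log.append({"task": task_id, "stage": stage, "inputs": input_ids})
        return compute(input_ids)

    def recover(log, compute):
        # Re-run only the tasks recorded in the log; tasks from different
        # stages could be replayed in a pipelined, parallel fashion.
        for rec in log.replay():
            compute(rec["inputs"])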

    Training with Mixed-Precision Floating-Point Assignments

    When training deep neural networks, keeping all tensors in high precision (e.g., 32-bit or even 16-bit floats) is often wasteful. However, keeping all tensors in low precision (e.g., 8-bit floats) can lead to unacceptable accuracy loss. Hence, it is important to use a precision assignment -- a mapping from all tensors (arising in training) to precision levels (high or low) -- that keeps most of the tensors in low precision and leads to sufficiently accurate models. We provide a technique that explores this memory-accuracy tradeoff by generating precision assignments for convolutional neural networks that (i) use less memory and (ii) lead to more accurate convolutional networks at the same time, compared to the precision assignments considered by prior work in low-precision floating-point training. We evaluate our technique on image classification tasks by training convolutional networks on CIFAR-10, CIFAR-100, and ImageNet. Our method typically provides > 2x memory reduction over a baseline precision assignment while preserving training accuracy, and gives further reductions by trading off accuracy. Compared to other baselines which sometimes cause training to diverge, our method provides similar or better memory reduction while avoiding divergence.
    Comment: Published in TML
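
    An illustrative sketch of what a precision assignment is, using NumPy with float16 standing in for the low-precision level (the paper targets formats such as 8-bit floats); the tensor names and helper functions below are assumptions for illustration, not the paper's method for choosing assignments.

    import numpy as np

    HIGH, LOW = np.float32, np.float16   # stand-ins for the two precision levels

    def apply_assignment(tensors, assignment):
        # Cast each named training tensor to the precision assigned to it.
        return {name: t.astype(assignment.get(name, HIGH))
                for name, t in tensors.items()}

    def footprint(tensors, assignment):
        # Memory footprint (bytes) of the tensors under a given assignment.
        return sum(t.size * np.dtype(assignment.get(name, HIGH)).itemsize
                   for name, t in tensors.items())

    tensors = {
        "conv1.weight":     np.random.randn(64, 3, 3, 3),
        "conv1.activation": np.random.randn(128, 64, 32, 32),
    }
    # Keep most tensors low precision; promote only those that hurt accuracy.
    assignment = {"conv1.weight": HIGH, "conv1.activation": LOW}
    casted = apply_assignment(tensors, assignment)
    print(footprint(tensors, assignment), "bytes vs",
          footprint(tensors, {name: HIGH for name in tensors}), "bytes all-high")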

    Optimizing Mixture of Experts using Dynamic Recompilations

    The Mixture of Experts architecture allows for outrageously large neural networks by scaling model parameter size independently from computational demand (FLOPs). However, current DNN frameworks cannot effectively support the dynamic data flow in Mixture of Experts, and implementations on top of these frameworks need to use workarounds that introduce significant overheads. To address the limitation of these frameworks, we present DynaMoE, a DNN library that uses dynamic recompilations to optimize and adapt the use of computational resources to the dynamic needs of Mixture of Experts models. Our evaluation shows that DynaMoE achieves a 1.8x speedup and supports 2.3x larger model sizes when compared to existing MoE systems, even when not using recompilations. We then present further optimizations enabled by dynamic recompilations that yield an additional 1.7x speedup while simultaneously reducing memory pressure and improving model quality.
    Comment: 13 pages, 15 figures
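
    A toy sketch of shape-driven recompilation for MoE dispatch, in plain Python/NumPy; compile_expert and the (expert, token-count) cache key are hypothetical stand-ins, since DynaMoE operates inside a DNN framework rather than on NumPy arrays. The point is that routing decisions change per batch, so a specialized kernel is (re)compiled only when an expert's token count actually changes.

    import numpy as np

    _compiled = {}   # (expert_id, num_tokens) -> kernel specialized to that shape

    def compile_expert(expert_id, num_tokens, weight):
        # Stand-in for an expensive, shape-specialized compilation step.
        print(f"recompiling expert {expert_id} for {num_tokens} tokens")
        return lambda x: x @ weight

    def dispatch(tokens, routes, weights):
        # Group tokens by the expert the router picked, recompiling an
        # expert's kernel only when its per-batch token count changes.
        out = np.empty((tokens.shape[0], weights[0].shape[1]))
        for expert_id, weight in enumerate(weights):
            idx = np.where(routes == expert_id)[0]
            if len(idx) == 0:
                continue
            key = (expert_id, len(idx))
            if key not in _compiled:
                _compiled[key] = compile_expert(expert_id, len(idx), weight)
            out[idx] = _compiled[key](tokens[idx])
        return out

    tokens  = np.random.randn(8, 4)
    weights = [np.random.randn(4, 4) for _ in range(2)]
    routes  = np.random.randint(0, 2, size=8)   # dynamic routing decision
    dispatch(tokens, routes, weights)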

    Collaboration Versus Cheating

    We outline how we detected programming plagiarism in an introductory online course for a Master of Science in Computer Science program, how we achieved a statistically significant reduction in programming plagiarism by providing a clear explanation of university and class policy on academic honesty reinforced with a short but formal assessment, and how we evaluated plagiarism rates before and after implementing our policy and assessment.
    Comment: 7 pages, 1 figure, 5 tables, SIGCSE 201

    Language support for regions


    Language support for dynamic, hierarchical data partitioning
