Efficient Fault Tolerance for Pipelined Query Engines via Write-ahead Lineage
Modern distributed pipelined query engines either do not support intra-query
fault tolerance or employ high-overhead approaches such as persisting
intermediate outputs or checkpointing state. In this work, we present
write-ahead lineage, a novel fault recovery technique that combines Spark's
lineage-based replay and write-ahead logging. Unlike Spark, where the lineage
is determined before query execution, write-ahead lineage persistently logs
lineage at runtime to support dynamic task dependencies in pipelined query
engines. Since only KB-sized lineages are persisted instead of MB-sized
intermediate outputs, the normal execution overhead is minimal compared to
spooling- or checkpointing-based approaches. To ensure fast fault recovery
times, tasks only consume intermediate outputs with persisted lineage,
preventing global rollbacks upon failure. In addition, lost tasks from
different stages can be recovered in a pipelined parallel manner. We implement
write-ahead lineage in a distributed pipelined query engine called Quokka. We
show that Quokka is around 2x faster than SparkSQL on the TPC-H benchmark with
similar fault recovery performance.
Comment: ICDE 2024 (copyright IEEE)
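A minimal sketch may help make the mechanism concrete. The Python below is a hypothetical illustration, not Quokka's actual API: the names LineageLog and Executor and the JSON record format are assumptions. It shows the essential contract described in the abstract: lineage records are made durable before downstream consumption, consumers refuse outputs whose lineage is not yet persisted (so a failure never forces a global rollback), and recovery replays only the lost task from its logged inputs.

```python
import json
from dataclasses import dataclass, field

@dataclass
class LineageLog:
    """Append-only, durable log of lineage records; a local file stands in
    for replicated storage in this sketch."""
    path: str

    def append(self, record: dict) -> None:
        # The KB-sized lineage record is made durable *before* any downstream
        # task is allowed to consume the corresponding intermediate output.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")

    def records(self) -> list[dict]:
        with open(self.path) as f:
            return [json.loads(line) for line in f]

@dataclass
class Executor:
    log: LineageLog
    outputs: dict = field(default_factory=dict)   # in-memory intermediate outputs

    def run_task(self, task_id: str, inputs: list[str], compute) -> None:
        result = compute(inputs)
        # Write-ahead lineage: which inputs produced this output is only known
        # at runtime (dynamic dependencies), so it is logged here rather than
        # planned before query execution as in Spark.
        self.log.append({"task": task_id, "inputs": inputs})
        self.outputs[task_id] = result

    def consume(self, task_id: str):
        # Downstream tasks only read outputs whose lineage is persisted, so a
        # failure never requires a global rollback.
        if task_id not in {r["task"] for r in self.log.records()}:
            raise RuntimeError(f"lineage for {task_id} is not durable yet")
        return self.outputs[task_id]

    def recover(self, lost_task: str, compute) -> None:
        # Replay only the lost task from its logged inputs; lost tasks in
        # different stages could be replayed in a pipelined parallel fashion.
        record = next(r for r in self.log.records() if r["task"] == lost_task)
        self.outputs[lost_task] = compute(record["inputs"])
```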
Training with Mixed-Precision Floating-Point Assignments
When training deep neural networks, keeping all tensors in high precision
(e.g., 32-bit or even 16-bit floats) is often wasteful. However, keeping all
tensors in low precision (e.g., 8-bit floats) can lead to unacceptable accuracy
loss. Hence, it is important to use a precision assignment -- a mapping from
all tensors (arising in training) to precision levels (high or low) -- that
keeps most of the tensors in low precision and leads to sufficiently accurate
models. We provide a technique that explores this memory-accuracy tradeoff by
generating precision assignments for convolutional neural networks that (i) use
less memory and (ii) lead to more accurate convolutional networks at the same
time, compared to the precision assignments considered by prior work in
low-precision floating-point training. We evaluate our technique on image
classification tasks by training convolutional networks on CIFAR-10, CIFAR-100,
and ImageNet. Our method typically provides > 2x memory reduction over a
baseline precision assignment while preserving training accuracy, and gives
further reductions by trading off accuracy. Compared to other baselines which
sometimes cause training to diverge, our method provides similar or better
memory reduction while avoiding divergence.
Comment: Published in TMLR
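To make the notion of a precision assignment concrete, the sketch below maps each tensor in a tiny, made-up convolutional network to a precision level and computes the memory footprint the assignment implies. All names and shapes are assumptions, float16/float32 stand in for the low/high floating-point formats, and the sketch does not reproduce the paper's technique for generating assignments.

```python
import numpy as np

# A precision assignment maps each tensor arising in training (weights,
# activations, gradients, ...) to a precision level. "Low" here is float16
# purely for illustration; the paper targets lower-precision formats.
LOW, HIGH = np.float16, np.float32

# Hypothetical assignment: keep most tensors low, but keep a numerically
# sensitive tensor (the final layer's weights/activations) in high precision.
precision_assignment = {
    "conv1.weight": LOW,  "conv1.activation": LOW,
    "conv2.weight": LOW,  "conv2.activation": LOW,
    "fc.weight":    HIGH, "fc.activation":    HIGH,
}

def cast(name: str, tensor: np.ndarray) -> np.ndarray:
    """Cast a tensor to the precision assigned to it."""
    return tensor.astype(precision_assignment[name])

def memory_bytes(shapes: dict[str, tuple]) -> int:
    """Memory footprint implied by the assignment (2 vs. 4 bytes per element)."""
    return sum(
        int(np.prod(shape)) * np.dtype(precision_assignment[name]).itemsize
        for name, shape in shapes.items()
    )

shapes = {
    "conv1.weight": (32, 3, 3, 3),   "conv1.activation": (128, 32, 32, 32),
    "conv2.weight": (64, 32, 3, 3),  "conv2.activation": (128, 64, 16, 16),
    "fc.weight": (64 * 16 * 16, 10), "fc.activation": (128, 10),
}

x = np.random.randn(128, 32, 32, 32)      # a batch of conv1 activations
x_low = cast("conv1.activation", x)       # stored in the assigned (low) precision
print(f"assigned footprint: {memory_bytes(shapes) / 2**20:.1f} MiB")
```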
Optimizing Mixture of Experts using Dynamic Recompilations
The Mixture of Experts architecture allows for outrageously large neural
networks by scaling model parameter size independently from computational
demand (FLOPs). However, current DNN frameworks cannot effectively support the
dynamic data flow in Mixture of Experts, and implementations on top of these
frameworks need to use workarounds that introduce significant overheads. To
address the limitation of these frameworks, we present DynaMoE, a DNN library
that uses dynamic recompilations to optimize and adapt the use of computational
resources to the dynamic needs of Mixture of Experts models. Our evaluation
shows that DynaMoE achieves a 1.8x speedup and supports 2.3x larger model sizes
when compared to existing MoE systems, even when not using recompilations. We
then present further optimizations enabled by dynamic recompilations that yield
an additional 1.7x speedup while simultaneously reducing memory pressure and
improving model quality.
Comment: 13 pages, 15 figures
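The sketch below illustrates, in simplified form, why Mixture of Experts induces dynamic data flow and how recompilation can help: the number of tokens routed to each expert changes every batch, so kernels compiled for a fixed tokens-per-expert capacity either overflow or waste memory and FLOPs. The class name, the top-1 routing, and the slack threshold are assumptions for illustration, not DynaMoE's implementation.

```python
import numpy as np

class RecompilingMoE:
    """Toy model of dynamic recompilation for Mixture of Experts routing."""

    def __init__(self, num_experts: int, init_capacity: int, slack: float = 0.2):
        self.num_experts = num_experts
        self.capacity = init_capacity   # tokens-per-expert the compiled kernels assume
        self.slack = slack
        self.recompiles = 0

    def _compile_experts(self, capacity: int) -> None:
        # Stand-in for re-specializing expert kernels to a new fixed
        # tokens-per-expert shape (e.g., re-tracing a computation graph).
        self.capacity = capacity
        self.recompiles += 1

    def route(self, gate_logits: np.ndarray) -> np.ndarray:
        # Top-1 routing: each token goes to its highest-scoring expert.
        assignment = gate_logits.argmax(axis=-1)
        loads = np.bincount(assignment, minlength=self.num_experts)
        peak = int(loads.max())
        # Recompile if the compiled capacity is too small (tokens would be
        # dropped) or far too large (wasted memory and FLOPs).
        if peak > self.capacity or peak < self.capacity * (1 - self.slack):
            self._compile_experts(int(peak * (1 + self.slack)))
        return assignment

moe = RecompilingMoE(num_experts=8, init_capacity=64)
for _ in range(5):
    logits = np.random.randn(512, 8)    # gate scores for one batch of tokens
    moe.route(logits)
print("recompilations triggered:", moe.recompiles)
```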
Collaboration Versus Cheating
We outline how we detected programming plagiarism in an introductory online
course for a Master of Science in Computer Science program, how we achieved a
statistically significant reduction in programming plagiarism by combining a
clear explanation of university and class policy on academic honesty reinforced
with a short but formal assessment, and how we evaluated plagiarism rates
before and after implementing our policy and assessment.
Comment: 7 pages, 1 figure, 5 tables, SIGCSE 2019
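As a worked illustration of how such a before/after comparison of plagiarism rates can be checked for statistical significance, the sketch below runs a two-proportion z-test on placeholder counts. The numbers are invented for illustration and are not the paper's data, and the paper may use a different test.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(x1: int, n1: int, x2: int, n2: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in proportions, e.g., the plagiarism
    rate before vs. after a policy/assessment intervention."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Placeholder counts (NOT the paper's data): 40 of 300 submissions flagged
# before the intervention vs. 15 of 320 after.
z, p = two_proportion_z_test(40, 300, 15, 320)
print(f"z = {z:.2f}, p = {p:.4f}")
```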