22 research outputs found
Relaxed Schedulers Can Efficiently Parallelize Iterative Algorithms
There has been significant progress in understanding the parallelism inherent
to iterative sequential algorithms: for many classic algorithms, the depth of
the dependence structure is now well understood, and scheduling techniques have
been developed to exploit this shallow dependence structure for efficient
parallel implementations. A related, applied research strand has studied
methods by which certain iterative task-based algorithms can be efficiently
parallelized via relaxed concurrent priority schedulers. These allow for high
concurrency when inserting and removing tasks, at the cost of executing
superfluous work due to the relaxed semantics of the scheduler.
In this work, we take a step towards unifying these two research directions,
by showing that there exists a family of relaxed priority schedulers that can
efficiently and deterministically execute classic iterative algorithms such as
greedy maximal independent set (MIS) and matching. Our primary result shows
that, given a randomized scheduler with an expected relaxation factor of in
terms of the maximum allowed priority inversions on a task, and any graph on
vertices, the scheduler is able to execute greedy MIS with only an additive
factor of poly() expected additional iterations compared to an exact (but
not scalable) scheduler. This counter-intuitive result demonstrates that the
overhead of relaxation when computing MIS is not dependent on the input size or
structure of the input graph. Experimental results show that this overhead can
be clearly offset by the gain in performance due to the highly scalable
scheduler. In sum, we present an efficient method to deterministically
parallelize iterative sequential algorithms, with provable runtime guarantees
in terms of the number of executed tasks to completion.Comment: PODC 2018, pages 377-386 in proceeding
The Power of Choice in Priority Scheduling
Consider the following random process: we are given queues, into which
elements of increasing labels are inserted uniformly at random. To remove an
element, we pick two queues at random, and remove the element of lower label
(higher priority) among the two. The cost of a removal is the rank of the label
removed, among labels still present in any of the queues, that is, the distance
from the optimal choice at each step. Variants of this strategy are prevalent
in state-of-the-art concurrent priority queue implementations. Nonetheless, it
is not known whether such implementations provide any rank guarantees, even in
a sequential model.
We answer this question, showing that this strategy provides surprisingly
strong guarantees: Although the single-choice process, where we always insert
and remove from a single randomly chosen queue, has degrading cost, going to
infinity as we increase the number of steps, in the two choice process, the
expected rank of a removed element is while the expected worst-case
cost is . These bounds are tight, and hold irrespective of the
number of steps for which we run the process.
The argument is based on a new technical connection between "heavily loaded"
balls-into-bins processes and priority scheduling.
Our analytic results inspire a new concurrent priority queue implementation,
which improves upon the state of the art in terms of practical performance
Inherent Limitations of Hybrid Transactional Memory
Several Hybrid Transactional Memory (HyTM) schemes have recently been
proposed to complement the fast, but best-effort, nature of Hardware
Transactional Memory (HTM) with a slow, reliable software backup. However, the
fundamental limitations of building a HyTM with nontrivial concurrency between
hardware and software transactions are still not well understood.
In this paper, we propose a general model for HyTM implementations, which
captures the ability of hardware transactions to buffer memory accesses, and
allows us to formally quantify and analyze the amount of overhead
(instrumentation) of a HyTM scheme. We prove the following: (1) it is
impossible to build a strictly serializable HyTM implementation that has both
uninstrumented reads and writes, even for weak progress guarantees, and (2)
under reasonable assumptions, in any opaque progressive HyTM, a hardware
transaction must incur instrumentation costs linear in the size of its data
set. We further provide two upper bound implementations whose instrumentation
costs are optimal with respect to their progress guarantees. In sum, this paper
captures for the first time an inherent trade-off between the degree of
concurrency a HyTM provides between hardware and software transactions, and the
amount of instrumentation overhead the implementation must incur
Recursed Is Not Recursive: A Jarring Result
Recursed is a 2D puzzle platform video game featuring treasure chests that,
when jumped into, instantiate a room that can later be exited (similar to
function calls), optionally generating a jar that returns back to that room
(similar to continuations). We prove that Recursed is RE-complete and thus
undecidable (not recursive) by a reduction from the Post Correspondence
Problem. Our reduction is "practical": the reduction from PCP results in fully
playable levels that abide by all constraints governing levels (including the
15x20 room size) designed for the main game. Our reduction is also "efficient":
a Turing machine can be simulated by a Recursed level whose size is linear in
the encoding size of the Turing machine and whose solution length is polynomial
in the running time of the Turing machine.Comment: Submitted to MFCS2020, 21 page
Path Puzzles: Discrete Tomography with a Path Constraint is Hard
We prove that path puzzles with complete row and column information--or
equivalently, 2D orthogonal discrete tomography with Hamiltonicity
constraint--are strongly NP-complete, ASP-complete, and #P-complete. Along the
way, we newly establish ASP-completeness and #P-completeness for 3-Dimensional
Matching and Numerical 3-Dimensional Matching.Comment: 16 pages, 8 figures. Revised proof of Theorem 2.4. 2-page abstract
appeared in Abstracts from the 20th Japan Conference on Discrete and
Computational Geometry, Graphs, and Games (JCDCGGG 2017
Who witnesses The Witness? Finding witnesses in The Witness is hard and sometimes impossible
We analyze the computational complexity of the many types of
pencil-and-paper-style puzzles featured in the 2016 puzzle video game The
Witness. In all puzzles, the goal is to draw a simple path in a rectangular
grid graph from a start vertex to a destination vertex. The different puzzle
types place different constraints on the path: preventing some edges from being
visited (broken edges); forcing some edges or vertices to be visited
(hexagons); forcing some cells to have certain numbers of incident path edges
(triangles); or forcing the regions formed by the path to be partially
monochromatic (squares), have exactly two special cells (stars), or be singly
covered by given shapes (polyominoes) and/or negatively counting shapes
(antipolyominoes). We show that any one of these clue types (except the first)
is enough to make path finding NP-complete ("witnesses exist but are hard to
find"), even for rectangular boards. Furthermore, we show that a final clue
type (antibody), which necessarily "cancels" the effect of another clue in the
same region, makes path finding -complete ("witnesses do not exist"),
even with a single antibody (combined with many anti/polyominoes), and the
problem gets no harder with many antibodies. On the positive side, we give a
polynomial-time algorithm for monomino clues, by reducing to hexagon clues on
the boundary of the puzzle, even in the presence of broken edges, and solving
"subset Hamiltonian path" for terminals on the boundary of an embedded planar
graph in polynomial time.Comment: 72 pages, 59 figures. Revised proof of Lemma 3.5. A short version of
this paper appeared at the 9th International Conference on Fun with
Algorithms (FUN 2018
Infinite All-Layers Simple Foldability
We study the problem of deciding whether a crease pattern can be folded by
simple folds (folding along one line at a time) under the infinite all-layers
model introduced by [Akitaya et al., 2017], in which each simple fold is
defined by an infinite line and must fold all layers of paper that intersect
this line. This model is motivated by folding in manufacturing such as
sheet-metal bending. We improve on [Arkin et al., 2004] by giving a
deterministic -time algorithm to decide simple foldability of 1D crease
patterns in the all-layers model. Then we extend this 1D result to 2D, showing
that simple foldability in this model can be decided in linear time for
unassigned axis-aligned orthogonal crease patterns on axis-aligned 2D
orthogonal paper. On the other hand, we show that simple foldability is
strongly NP-complete if a subset of the creases have a mountain-valley
assignment, even for an axis-aligned rectangle of paper
Who witnesses The Witness? Finding witnesses in The Witness is hard and sometimes impossible
We analyze the computational complexity of the many types of pencil-and-paper-style puzzles featured in the 2016 puzzle video game The Witness. In all puzzles, the goal is to draw a path in a rectangular grid graph from a start vertex to a destination vertex. The different puzzle types place different constraints on the path: preventing some edges from being visited (broken edges); forcing some edges or vertices to be visited (hexagons); forcing some cells to have certain numbers of incident path edges (triangles); or forcing the regions formed by the path to be partially monochromatic (squares), have exactly two special cells (stars), or be singly covered by given shapes (polyominoes) and/or negatively counting shapes (antipolyominoes). We show that any one of these clue types (except the first) is enough to make path finding NP-complete ("witnesses exist but are hard to find"), even for rectangular boards. Furthermore, we show that a final clue type (antibody), which necessarily "cancels" the effect of another clue in the same region, makes path finding Sigma_2-complete ("witnesses do not exist"), even with a single antibody (combined with many anti/polyominoes), and the problem gets no harder with many antibodies
Inducing and exploiting activation sparsity for fast neural network inference
Optimizing convolutional neural networks for fast inference has recently become an extremely active area of research. One of the go-to solutions in this context is weight pruning, which aims to reduce computational and memory footprint by removing large subsets of the connections in a neural network. Surprisingly, much less attention has been given to exploiting sparsity in the activation maps, which tend to be naturally sparse in many settings thanks to the structure of rectified linear (ReLU) activation functions. In this paper, we present an in-depth analysis of methods for maximizing the sparsity of the activations in a trained neural network, and show that, when coupled with an efficient sparse-input convolution algorithm, we can leverage this sparsity for significant performance gains. To induce highly sparse activation maps without accuracy loss, we introduce a new regularization technique, coupled with a new threshold-based sparsification method based on a parameterized activation function called Forced-Activation-Threshold Rectified Linear Unit (FATReLU). We examine the impact of our methods on popular image classification models, showing that most architectures can adapt to significantly sparser activation maps without any accuracy loss. Our second contribution is showing that these these compression gains can be translated into inference speedups: we provide a new algorithm to enable fast convolution operations over networks with sparse activations, and show that it can enable significant speedups for end-to-end inference on a range of popular models on the large-scale ImageNet image classification task on modern Intel CPUs, with little or no retraining cost