35 research outputs found
IST Austria Thesis
The scalability of concurrent data structures and distributed algorithms strongly depends on
reducing the contention for shared resources and the costs of synchronization and communication. We show how such cost reductions can be attained by relaxing the strict consistency conditions required by sequential implementations. In the first part of the thesis, we consider relaxation in the context of concurrent data structures. Specifically, in data structures
such as priority queues, imposing strong semantics renders scalability impossible, since a correct implementation of the remove operation must return only the element with the highest priority. Intuitively, invoking remove operations concurrently creates a race condition. This bottleneck can be circumvented by relaxing the semantics of the affected data structure, so that removed elements are no longer required to have the highest priority. We prove that randomized implementations of relaxed data structures provide guarantees on the priority of the removed elements even under concurrency. Additionally, we show that in some cases relaxed data structures can be used to scale classical algorithms that are usually implemented with exact ones. In the second part, we study parallel variants of the stochastic gradient descent (SGD) algorithm, which distribute computation among multiple processors, thus reducing the running time. Unfortunately, for standard parallel SGD to succeed, each processor has to maintain a local copy of the necessary model parameter that is identical to the local copies of the other processors; the communication and synchronization overheads of this perfect consistency can negate the speedup gained by distributing the computation. We show that the consistency conditions required by SGD can be relaxed, allowing the algorithm to tolerate quantized communication, asynchrony, or even crash faults, while its convergence remains asymptotically the same.
Relaxed Schedulers Can Efficiently Parallelize Iterative Algorithms
There has been significant progress in understanding the parallelism inherent
to iterative sequential algorithms: for many classic algorithms, the depth of
the dependence structure is now well understood, and scheduling techniques have
been developed to exploit this shallow dependence structure for efficient
parallel implementations. A related, applied research strand has studied
methods by which certain iterative task-based algorithms can be efficiently
parallelized via relaxed concurrent priority schedulers. These allow for high
concurrency when inserting and removing tasks, at the cost of executing
superfluous work due to the relaxed semantics of the scheduler.
In this work, we take a step towards unifying these two research directions,
by showing that there exists a family of relaxed priority schedulers that can
efficiently and deterministically execute classic iterative algorithms such as
greedy maximal independent set (MIS) and matching. Our primary result shows
that, given a randomized scheduler with an expected relaxation factor of $k$ in
terms of the maximum allowed priority inversions on a task, and any graph on
$n$ vertices, the scheduler is able to execute greedy MIS with only an additive
factor of poly($k$) expected additional iterations compared to an exact (but
not scalable) scheduler. This counter-intuitive result demonstrates that the
overhead of relaxation when computing MIS is not dependent on the input size or
structure of the input graph. Experimental results show that this overhead can
be clearly offset by the gain in performance due to the highly scalable
scheduler. In sum, we present an efficient method to deterministically
parallelize iterative sequential algorithms, with provable runtime guarantees
in terms of the number of executed tasks to completion.
Comment: PODC 2018, pages 377-386 in proceedings
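The idea of running greedy MIS through a relaxed scheduler can be illustrated with a minimal sketch. It assumes a deliberately simple scheduler model that is not taken from the paper: each step returns a uniformly random task among the `k` highest-priority pending ones. The function name `greedy_mis_relaxed` and its parameters are hypothetical. A task whose higher-priority neighbors are not all decided yet is left in the scheduler, and the step counts as a wasted iteration.

```python
import random


def greedy_mis_relaxed(adj, order, k, seed=0):
    """Greedy MIS driven by a toy k-relaxed scheduler (illustrative model only).

    adj:   dict mapping each vertex to a list of its neighbors
    order: vertices listed from highest to lowest priority
    k:     the scheduler returns a uniformly random task among the first k pending
    Returns (mis, iterations), where iterations includes wasted steps.
    """
    rng = random.Random(seed)
    rank = {v: i for i, v in enumerate(order)}
    pending = list(order)   # kept sorted by priority, highest first
    in_mis = {}             # vertex -> True (in MIS) or False (excluded)
    iterations = 0
    while pending:
        iterations += 1
        idx = rng.randrange(min(k, len(pending)))
        v = pending[idx]
        earlier = [u for u in adj[v] if rank[u] < rank[v]]
        if any(in_mis.get(u) is True for u in earlier):
            in_mis[v] = False       # a higher-priority neighbor is already in the MIS
            pending.pop(idx)
        elif all(u in in_mis for u in earlier):
            in_mis[v] = True        # every higher-priority neighbor was excluded
            pending.pop(idx)
        # Otherwise a dependency is still undecided: v stays pending (wasted step).
    return {v for v, chosen in in_mis.items() if chosen}, iterations
```

Because a vertex is only decided once the outcome for its higher-priority neighbors is fixed, the returned set matches the one produced by processing `order` strictly in sequence; only the iteration count depends on `k` and on the random choices.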
The Power of Choice in Priority Scheduling
Consider the following random process: we are given $n$ queues, into which
elements of increasing labels are inserted uniformly at random. To remove an
element, we pick two queues at random, and remove the element of lower label
(higher priority) among the two. The cost of a removal is the rank of the label
removed, among labels still present in any of the queues, that is, the distance
from the optimal choice at each step. Variants of this strategy are prevalent
in state-of-the-art concurrent priority queue implementations. Nonetheless, it
is not known whether such implementations provide any rank guarantees, even in
a sequential model.
We answer this question, showing that this strategy provides surprisingly
strong guarantees: Although the single-choice process, where we always insert
and remove from a single randomly chosen queue, has degrading cost, going to
infinity as we increase the number of steps, in the two choice process, the
expected rank of a removed element is while the expected worst-case
cost is . These bounds are tight, and hold irrespective of the
number of steps for which we run the process.
The argument is based on a new technical connection between "heavily loaded"
balls-into-bins processes and priority scheduling.
Our analytic results inspire a new concurrent priority queue implementation,
which improves upon the state of the art in terms of practical performance
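The two-choice process described above can be simulated sequentially in a few lines. The sketch below is an illustration under assumptions that are not in the abstract: the queues are prefilled, and every later step performs one insertion followed by one removal. The function name `simulate_two_choice` and its parameters are hypothetical. A cost of 1 means the removed label was the smallest one present.

```python
import random
from bisect import bisect_left


def simulate_two_choice(num_queues, prefill, steps, seed=0):
    """Simulate two-choice removals and return the rank cost of each removal."""
    rng = random.Random(seed)
    queues = [[] for _ in range(num_queues)]  # each list is sorted by construction
    heads = [0] * num_queues                  # index of the smallest remaining label
    next_label = 0

    def insert():
        nonlocal next_label
        # Labels only increase, so appending keeps every queue sorted.
        queues[rng.randrange(num_queues)].append(next_label)
        next_label += 1

    for _ in range(prefill):
        insert()

    costs = []
    for _ in range(steps):
        insert()
        picks = {rng.randrange(num_queues), rng.randrange(num_queues)}
        nonempty = [q for q in picks if heads[q] < len(queues[q])]
        if not nonempty:
            continue  # both sampled queues were empty: no removal in this step
        # Remove the lower label (higher priority) among the sampled queue heads.
        best = min(nonempty, key=lambda q: queues[q][heads[q]])
        label = queues[best][heads[best]]
        heads[best] += 1
        # Rank = 1 + number of labels still present that are smaller than `label`.
        smaller = sum(
            bisect_left(queues[q], label, heads[q]) - heads[q]
            for q in range(num_queues)
        )
        costs.append(1 + smaller)
    return costs
```

Averaging the returned costs for several values of `num_queues` and `steps` gives an empirical view of how the removal rank behaves as the process runs longer.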
The Transactional Conflict Problem
The transactional conflict problem arises in transactional systems whenever
two or more concurrent transactions clash on a data item.
While the standard solution to such conflicts is to immediately abort one of
the transactions, some practical systems consider the alternative of delaying
conflict resolution for a short interval, which may allow one of the
transactions to commit. The challenge in the transactional conflict problem is
to choose the optimal length of this delay interval so as to minimize the
overall running time penalty for the conflicting transactions. In this paper,
we propose a family of optimal online algorithms for the transactional conflict
problem.
Specifically, we consider variants of this problem which arise in different
implementations of transactional systems, namely "requestor wins" and
"requestor aborts" implementations: in the former, the recipient of a coherence
request is aborted, whereas in the latter, it is the requestor which has to
abort. Both strategies are implemented by real systems.
We show that the requestor aborts case can be reduced to a classic instance
of the ski rental problem, while the requestor wins case leads to a new version
of this classical problem, for which we derive optimal deterministic and
randomized algorithms.
Moreover, we prove that, under a simplified adversarial model, our algorithms
are constant-competitive with the offline optimum in terms of throughput.
We validate our algorithmic results empirically through a hardware simulation
of hardware transactional memory (HTM), showing that our algorithms can lead to
non-trivial performance improvements for classic concurrent data structures
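For readers unfamiliar with ski rental, the classic deterministic break-even rule can be sketched as follows. The mapping used here is an assumption made for illustration and is not the paper's cost model: waiting costs one unit per time unit (renting), aborting costs a fixed `abort_cost` (buying), and `remaining_time` is the unknown time until the conflicting transaction would commit. Both function names are hypothetical.

```python
def break_even_cost(abort_cost, remaining_time):
    """Cost of the break-even rule: wait for up to abort_cost time units, then abort."""
    if remaining_time <= abort_cost:
        # The other transaction committed while we were still waiting.
        return remaining_time
    # We waited for abort_cost time units and then paid for the abort as well.
    return abort_cost + abort_cost


def offline_optimum(abort_cost, remaining_time):
    """Cost of an offline strategy that knows remaining_time in advance."""
    return min(abort_cost, remaining_time)
```

Under this simplified model the break-even rule never pays more than twice the offline optimum, which is the standard guarantee for deterministic ski rental.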
Dynamic Averaging Load Balancing on Cycles
We consider the following dynamic load-balancing process: given an underlying
graph with nodes, in each step , one unit of load is created,
and placed at a randomly chosen graph node. In the same step, the chosen node
picks a random neighbor, and the two nodes balance their loads by averaging
them. We are interested in the expected gap between the minimum and maximum
loads at nodes as the process progresses, and its dependence on and on the
graph structure.
Similar variants of the above graphical balanced allocation process have been
studied by Peres, Talwar, and Wieder, and by Sauerwald and Sun for regular
graphs. These authors left as open the question of characterizing the gap in
the case of \emph{cycle graphs} in the \emph{dynamic} case, where weights are
created during the algorithm's execution. For this case, the only known upper
bound is of , following from a majorization argument
due to Peres, Talwar, and Wieder, which analyzes a related graphical allocation
process.
In this paper, we provide an upper bound of
on the expected gap of the above process for cycles of length . We introduce
a new potential analysis technique, which enables us to bound the difference in
load between -hop neighbors on the cycle, for any . We
complement this with a "gap covering" argument, which bounds the maximum value
of the gap by bounding its value across all possible subsets of a certain
structure, and recursively bounding the gaps within each subset. We provide
analytical and experimental evidence that our upper bound on the gap is tight
up to a logarithmic factor
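The load-balancing process on a cycle is easy to simulate directly. The sketch below follows the description above; the function name `simulate_cycle_averaging` and its parameters are hypothetical, and loads are stored as floating-point numbers for simplicity.

```python
import random


def simulate_cycle_averaging(n, steps, seed=0):
    """Run the dynamic averaging process on a cycle of n nodes for `steps` steps."""
    rng = random.Random(seed)
    load = [0.0] * n
    for _ in range(steps):
        u = rng.randrange(n)                 # one unit of load arrives at a random node
        load[u] += 1.0
        v = (u + rng.choice((-1, 1))) % n    # u picks one of its two cycle neighbors
        average = (load[u] + load[v]) / 2.0
        load[u] = load[v] = average          # the two nodes balance by averaging
    return load


def gap(load):
    """Difference between the maximum and minimum load."""
    return max(load) - min(load)
```

Calling `gap(simulate_cycle_averaging(n, steps))` for increasing `n` gives a quick empirical look at how the gap grows with the cycle length.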
Influence of Tribological Parameters on the Railway Wheel Derailment
At present, derailment is predicted using Nadal’s formula, which contains a limited number of parameters. In addition, the laws governing the variation of these parameters are insufficiently studied, and the influence of other parameters on derailment is ignored, which complicates the solution of the problem. The sliding distance and the relative sliding velocity are the factors to which the destruction of the third body is most sensitive. Moreover, an increased friction coefficient between the steering surfaces of the wheel and rail promotes climbing of the wheel onto the rail and derailment. This work shows how the main parameters influencing the destruction of the third body, namely the sliding distance and the relative sliding velocity, depend on the rail track curvature, the difference in diameters of the wheels of the wheelset, and the non-roundness of one of the wheels. It proposes methods for estimating the degree of third body destruction and for adding to Nadal’s formula a criterion for the impossibility of the wheel rolling about the contact point of the wheel and rail steering surfaces; this criterion contains the advance of the contact point, which in turn depends on the angle of attack.
Lower Bounds for Shared-Memory Leader Election Under Bounded Write Contention
This paper gives tight logarithmic lower bounds on the solo step complexity of leader election in an asynchronous shared-memory model with single-writer multi-reader (SWMR) registers, for both deterministic and randomized obstruction-free algorithms. The approach extends to lower bounds for deterministic and randomized obstruction-free algorithms using multi-writer registers under bounded write concurrency, showing a trade-off between the solo step complexity of a leader election algorithm, and the worst-case number of stalls incurred by a processor in an execution
Communication-Efficient Federated Learning With Data and Client Heterogeneity
Federated Learning (FL) enables large-scale distributed training of machine
learning models, while still allowing individual nodes to maintain data
locally.
However, executing FL at scale comes with inherent practical challenges:
1) heterogeneity of the local node data distributions,
2) heterogeneity of node computational speeds (asynchrony),
and 3) constraints on the amount of communication between the clients
and the server.
In this work, we present the first variant of the classic federated averaging
(FedAvg) algorithm
which, at the same time, supports data heterogeneity, partial client
asynchrony, and communication compression.
Our algorithm comes with a rigorous analysis showing that, in spite of these
system relaxations,
it can provide similar convergence to FedAvg in interesting parameter
regimes.
Experimental results in the rigorous LEAF benchmark on setups of up to
nodes show that our algorithm ensures fast convergence for standard federated
tasks, improving upon prior quantized and asynchronous approaches
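As a rough illustration of combining federated averaging with communication compression, the sketch below runs FedAvg on a one-dimensional least-squares problem and applies unbiased stochastic rounding to each client update before averaging. It is a toy model under stated assumptions, not the algorithm from the paper: asynchrony is not modeled, all clients participate in every round, and the names `quantize` and `fedavg_quantized` and all parameters are hypothetical.

```python
import random


def quantize(x, step, rng):
    """Stochastically round x to a multiple of `step` (unbiased in expectation)."""
    low = (x // step) * step
    p_up = (x - low) / step
    return low + step if rng.random() < p_up else low


def fedavg_quantized(client_data, rounds, local_steps, lr, step, seed=0):
    """FedAvg on f_i(w) = mean((w - y)^2 for y in data_i), with quantized updates."""
    rng = random.Random(seed)
    w = 0.0
    for _ in range(rounds):
        updates = []
        for data in client_data:
            local = w
            for _ in range(local_steps):
                # Gradient of the client's local mean squared error.
                grad = sum(2.0 * (local - y) for y in data) / len(data)
                local -= lr * grad
            # Each client sends only a quantized model difference to the server.
            updates.append(quantize(local - w, step, rng))
        w += sum(updates) / len(updates)
    return w
```

With heterogeneous clients such as `[[1.0, 3.0], [10.0]]`, the server model settles near the average of the client means, up to an error on the order of the quantization step.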
Provably-Efficient and Internally-Deterministic Parallel Union-Find
Determining the degree of inherent parallelism in classical sequential
algorithms and leveraging it for fast parallel execution is a key topic in
parallel computing, and detailed analyses are known for a wide range of
classical algorithms. In this paper, we perform the first such analysis for the
fundamental Union-Find problem, in which we are given a graph as a sequence of
edges, and must maintain its connectivity structure under edge additions. We
prove that classic sequential algorithms for this problem are
well-parallelizable under reasonable assumptions, addressing a conjecture by
[Blelloch, 2017]. More precisely, we show via a new potential argument that,
under uniform random edge ordering, parallel union-find operations are unlikely
to interfere: concurrent threads processing the graph in parallel will
encounter memory contention times in
expectation, where and are the number of edges and nodes in the
graph, respectively. We leverage this result to design a new parallel
Union-Find algorithm that is internally deterministic, i.e., its results
are guaranteed to match those of a sequential execution, while also being
work-efficient and scalable, as long as the number of threads is
, for an arbitrarily small constant
, which holds for most large real-world graphs. We present
lower bounds which show that our analysis is close to optimal, and experimental
results suggesting that the performance cost of internal determinism is
limited
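For reference, the sequential problem being parallelized can be sketched as a standard union-find pass over the edge sequence. This is only the textbook sequential baseline, not the parallel algorithm from the paper; the function name `connected_components` and the rule of linking the larger root under the smaller one are choices made for this illustration.

```python
def connected_components(num_nodes, edges):
    """Process edges in order with union-find; return each node's component root."""
    parent = list(range(num_nodes))

    def find(x):
        # Path halving: point each visited node at its grandparent.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # Link the larger root under the smaller one, so the root of every
            # component is its minimum node id regardless of edge order.
            if root_u < root_v:
                parent[root_v] = root_u
            else:
                parent[root_u] = root_v
    return [find(x) for x in range(num_nodes)]
```

Since the root of each component is always its smallest node id, the output does not depend on the order in which the edges are processed, which makes it a convenient reference point when checking a parallel implementation.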