Bounding Cache Miss Costs of Multithreaded Computations Under General Schedulers
We analyze the caching overhead incurred by a class of multithreaded
algorithms when scheduled by an arbitrary scheduler. We obtain bounds that
match or improve upon the well-known $O(Q + S \cdot M/B)$ caching cost for the
randomized work stealing (RWS) scheduler, where $S$ is the number of steals,
$Q$ is the sequential caching cost, and $M$ and $B$ are the cache size and
block (or cache line) size, respectively.
Comment: Extended abstract in Proceedings of ACM Symp. on Parallel Alg. and
Architectures (SPAA) 2017, pp. 339-350. This revision has a few small updates
including a missing citation and the replacement of some big-Oh terms with
precise constants.
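To make the bound concrete, here is an illustrative instantiation with assumed
cache parameters (numbers chosen for exposition, not taken from the paper): a
32 KB cache ($M = 2^{15}$ bytes) with 64-byte lines ($B = 2^6$ bytes) gives
$M/B = 512$ blocks, so
\[
  O\!\left(Q + S \cdot \frac{M}{B}\right) \;=\; O(Q + 512\,S),
\]
that is, each steal can force up to $M/B$ extra misses as the thief rebuilds
the stolen subcomputation's working set in its own cache.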
Hybrid static/dynamic scheduling for already optimized dense matrix factorization
We present a hybrid static/dynamic strategy for scheduling the task
dependency graph of direct methods used in dense numerical linear algebra.
This strategy balances data locality, load balance, and low dequeue
overhead. We show that using this scheduling in communication-avoiding
dense factorization leads to significant performance gains. On a 48-core
AMD Opteron NUMA machine, our experiments show that we can achieve up to
64% improvement over a version of CALU that uses fully dynamic scheduling, and
up to 30% improvement over the version of CALU that uses fully static
scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic
scheduling approach is up to 8% faster than the version of CALU that uses
either fully static or fully dynamic scheduling. Our algorithm leads to
speedups over the corresponding routines for computing LU factorization in
well-known libraries. On the 48-core AMD NUMA machine, our best implementation
is up to 110% faster than MKL, while on the 16-core Intel Xeon machine, it is
up to 82% faster than MKL. Our approach also shows significant speedups
compared with PLASMA on both of these systems.
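Below is a minimal sketch of the hybrid static/dynamic idea in C++. It is not
the authors' CALU implementation: `run_hybrid`, `static_fraction`, the
round-robin pre-assignment, and the shared-counter dynamic phase are all
illustrative assumptions, and the dependencies of a real task graph are
ignored.

```cpp
// Hypothetical sketch: the first static_fraction of tasks is pre-assigned
// round-robin (locality, zero dequeue overhead); the rest are claimed
// dynamically from a shared atomic counter (load balance).
#include <atomic>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

void run_hybrid(const std::vector<std::function<void()>>& tasks,
                int num_threads, double static_fraction) {
    const size_t static_count =
        static_cast<size_t>(static_fraction * tasks.size());
    std::atomic<size_t> next{static_count};  // cursor for the dynamic phase

    auto worker = [&](int tid) {
        // Static phase: fixed round-robin assignment, no synchronization.
        for (size_t i = tid; i < static_count; i += num_threads)
            tasks[i]();
        // Dynamic phase: claim leftover tasks with one atomic fetch_add each.
        for (size_t i = next.fetch_add(1); i < tasks.size();
             i = next.fetch_add(1))
            tasks[i]();
    };

    std::vector<std::thread> pool;
    for (int t = 0; t < num_threads; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();
}

int main() {
    std::atomic<long> done{0};
    std::vector<std::function<void()>> tasks;
    for (int i = 0; i < 100; ++i)
        tasks.push_back([&done] { done.fetch_add(1); });
    run_hybrid(tasks, 4, 0.5);  // 50 tasks static, 50 dynamic
    std::printf("completed %ld tasks\n", done.load());
    return 0;
}
```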
Well-Structured Futures and Cache Locality
In fork-join parallelism, a sequential program is split into a directed
acyclic graph of tasks linked by directed dependency edges, and the tasks are
executed, possibly in parallel, in an order consistent with their dependencies.
A popular and effective way to extend fork-join parallelism is to allow threads
to create futures. A thread creates a future to hold the results of a
computation, which may or may not be executed in parallel. That result is
returned when some thread touches that future, blocking if necessary until the
result is ready.
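As a concrete (and hedged) illustration of the create/touch semantics just
described, C++'s `std::async`/`std::future` can stand in for the abstract
model; the choice of API is an assumption for exposition only.

```cpp
// Creating a future and touching it: .get() blocks until the result
// of the (possibly parallel) computation is ready.
#include <future>
#include <iostream>

int main() {
    // Create: spawn a task whose result is held by the future f.
    std::future<long> f = std::async(std::launch::async, [] {
        long sum = 0;
        for (long i = 1; i <= 1000000; ++i) sum += i;
        return sum;
    });

    // The creating thread may run other work here in parallel.

    // Touch: retrieve the result, blocking if it is not yet ready.
    std::cout << f.get() << '\n';  // prints 500000500000
    return 0;
}
```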
Recent research has shown that while futures can, of course, enhance
parallelism in a structured way, they can have a deleterious effect on cache
locality. In the worst case, futures can incur $\Omega(P T_\infty + t T_\infty)$
deviations, which implies $\Omega(C P T_\infty + C t T_\infty)$ additional cache
misses, where $C$ is the number of cache lines, $P$ is the number of
processors, $t$ is the number of touches, and $T_\infty$ is the
\emph{computation span}. Since cache locality has a large impact on software
performance on modern multicores, this result is troubling.
In this paper, however, we show that if futures are used in a simple,
disciplined way, then the situation is much better: if each future is touched
only once, either by the thread that created it, or by a thread to which the
future has been passed from the thread that created it, then parallel
executions with work stealing can incur at most $O(C P T_\infty^{2})$ additional
cache misses, a substantial improvement. This structured use of futures is
characteristic of many (but not all) parallel applications.
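Reading the two bounds (as reconstructed above) side by side makes the
improvement explicit:
\[
  \underbrace{\Omega(C P T_\infty + C t T_\infty)}_{\text{general futures, worst case}}
  \quad\text{vs.}\quad
  \underbrace{O(C P T_\infty^{2})}_{\text{single-touch futures}}
\]
The structured discipline removes the dependence on the touch count $t$, which
can far exceed $P\,T_\infty$, at the cost of one extra factor of $T_\infty$.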