522 research outputs found

    Bounding Cache Miss Costs of Multithreaded Computations Under General Schedulers

    Full text link
    We analyze the caching overhead incurred by a class of multithreaded algorithms when scheduled by an arbitrary scheduler. We obtain bounds that match or improve upon the well-known O(Q+S(M/B))O(Q+S \cdot (M/B)) caching cost for the randomized work stealing (RWS) scheduler, where SS is the number of steals, QQ is the sequential caching cost, and MM and BB are the cache size and block (or cache line) size respectively.Comment: Extended abstract in Proceedings of ACM Symp. on Parallel Alg. and Architectures (SPAA) 2017, pp. 339-350. This revision has a few small updates including a missing citation and the replacement of some big Oh terms with precise constant

    Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers

    Full text link
    The nested parallel (a.k.a. fork-join) model is widely used for writing parallel programs. However, the two composition constructs, i.e. "\parallel" (parallel) and ";;" (serial), are insufficient in expressing "partial dependencies" or "partial parallelism" in a program. We propose a new dataflow composition construct "\leadsto" to express partial dependencies in algorithms in a processor- and cache-oblivious way, thus extending the Nested Parallel (NP) model to the \emph{Nested Dataflow} (ND) model. We redesign several divide-and-conquer algorithms ranging from dense linear algebra to dynamic-programming in the ND model and prove that they all have optimal span while retaining optimal cache complexity. We propose the design of runtime schedulers that map ND programs to multicore processors with multiple levels of possibly shared caches (i.e, Parallel Memory Hierarchies) and provide theoretical guarantees on their ability to preserve locality and load balance. For this, we adapt space-bounded (SB) schedulers for the ND model. We show that our algorithms have increased "parallelizability" in the ND model, and that SB schedulers can use the extra parallelizability to achieve asymptotically optimal bounds on cache misses and running time on a greater number of processors than in the NP model. The running time for the algorithms in this paper is O(i=0h1Q(t;σMi)Cip)O\left(\frac{\sum_{i=0}^{h-1} Q^{*}({\mathsf t};\sigma\cdot M_i)\cdot C_i}{p}\right), where QQ^{*} is the cache complexity of task t{\mathsf t}, CiC_i is the cost of cache miss at level-ii cache which is of size MiM_i, σ(0,1)\sigma\in(0,1) is a constant, and pp is the number of processors in an hh-level cache hierarchy

    On the Factor Refinement Principle and its Implementation on Multicore Architectures

    Get PDF
    The factor refinement principle turns a partial factorization of integers (or polynomi­ als) into a more complete factorization represented by basis elements and exponents, with basis elements that are pairwise coprime. There are lots of applications of this refinement technique such as simplifying systems of polynomial inequations and, more generally, speeding up certain algebraic algorithms by eliminating redundant expressions that may occur during intermediate computations. Successive GCD computations and divisions are used to accomplish this task until all the basis elements are pairwise coprime. Moreover, square-free factorization (which is the first step of many factorization algorithms) is used to remove the repeated patterns from each input element. Differentiation, division and GCD calculation op­ erations are required to complete this pre-processing step. Both factor refinement and square-free factorization often rely on plain (quadratic) algorithms for multipli­ cation but can be substantially improved with asymptotically fast multiplication on sufficiently large input. In this work, we review the working principles and complexity estimates of the factor refinement, in case of plain arithmetic, as well as asymptotically fast arithmetic. Following this review process, we design, analyze and implement parallel adaptations of these factor refinement algorithms. We consider several algorithm optimization techniques such as data locality analysis, balancing subproblems, etc. to fully exploit modern multicore architectures. The Cilk++ implementation of our parallel algorithm based on the augment refinement principle of Bach, Driscoll and Shallit achieves linear speedup for input data of sufficiently large size
    corecore