1,056 research outputs found
Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes
The ongoing hardware evolution exhibits an escalation in the number, as well
as in the heterogeneity, of computing resources. The pressure to maintain
reasonable levels of performance and portability forces application developers
to leave the traditional programming paradigms and explore alternative
solutions. PaStiX is a parallel sparse direct solver, based on a dynamic
scheduler for modern hierarchical manycore architectures. In this paper, we
study the benefits and limits of replacing the highly specialized internal
scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and
StarPU. The tasks graph of the factorization step is made available to the two
runtimes, providing them the opportunity to process and optimize its traversal
in order to maximize the algorithm efficiency for the targeted hardware
platform. A comparative study of the performance of the PaStiX solver on top of
its native internal scheduler, PaRSEC, and StarPU frameworks, on different
execution environments, is performed. The analysis highlights that these
generic task-based runtimes achieve comparable results to the
application-optimized embedded scheduler on homogeneous platforms. Furthermore,
they are able to significantly speed up the solver on heterogeneous
environments by taking advantage of the accelerators while hiding the
complexity of their efficient manipulation from the programmer.Comment: Heterogeneity in Computing Workshop (2014
Reproducible and User-Controlled Software Environments in HPC with Guix
Support teams of high-performance computing (HPC) systems often find
themselves between a rock and a hard place: on one hand, they understandably
administrate these large systems in a conservative way, but on the other hand,
they try to satisfy their users by deploying up-to-date tool chains as well as
libraries and scientific software. HPC system users often have no guarantee
that they will be able to reproduce results at a later point in time, even on
the same system-software may have been upgraded, removed, or recompiled under
their feet, and they have little hope of being able to reproduce the same
software environment elsewhere. We present GNU Guix and the functional package
management paradigm and show how it can improve reproducibility and sharing
among researchers with representative use cases.Comment: 2nd International Workshop on Reproducibility in Parallel Computing
(RepPar), Aug 2015, Vienne, Austria. http://reppar.org
The Language of Search
This paper is concerned with a class of algorithms that perform exhaustive
search on propositional knowledge bases. We show that each of these algorithms
defines and generates a propositional language. Specifically, we show that the
trace of a search can be interpreted as a combinational circuit, and a search
algorithm then defines a propositional language consisting of circuits that are
generated across all possible executions of the algorithm. In particular, we
show that several versions of exhaustive DPLL search correspond to such
well-known languages as FBDD, OBDD, and a precisely-defined subset of d-DNNF.
By thus mapping search algorithms to propositional languages, we provide a
uniform and practical framework in which successful search techniques can be
harnessed for compilation of knowledge into various languages of interest, and
a new methodology whereby the power and limitations of search algorithms can be
understood by looking up the tractability and succinctness of the corresponding
propositional languages
Static and Dynamic Scheduling for Effective Use of Multicore Systems
Multicore systems have increasingly gained importance in high performance computers. Compared to the traditional microarchitectures, multicore architectures have a simpler design, higher performance-to-area ratio, and improved power efficiency. Although the multicore architecture has various advantages, traditional parallel programming techniques do not apply to the new architecture efficiently. This dissertation addresses how to determine optimized thread schedules to improve data reuse on shared-memory multicore systems and how to seek a scalable solution to designing parallel software on both shared-memory and distributed-memory multicore systems.
We propose an analytical cache model to predict the number of cache misses on the time-sharing L2 cache on a multicore processor. The model provides an insight into the impact of cache sharing and cache contention between threads. Inspired by the model, we build the framework of affinity based thread scheduling to determine optimized thread schedules to improve data reuse on all the levels in a complex memory hierarchy. The affinity based thread scheduling framework includes a model to estimate the cost of a thread schedule, which consists of three submodels: an affinity graph submodel, a memory hierarchy submodel, and a cost submodel. Based on the model, we design a hierarchical graph partitioning algorithm to determine near-optimal solutions. We have also extended the algorithm to support threads with data dependences. The algorithms are implemented and incorporated into a feedback directed optimization prototype system. The prototype system builds upon a binary instrumentation tool and can improve program performance greatly on shared-memory multicore architectures.
We also study the dynamic data-availability driven scheduling approach to designing new parallel software on distributed-memory multicore architectures. We have implemented a decentralized dynamic runtime system. The design of the runtime system is focused on the scalability metric. At any time only a small portion of a task graph exists in memory. We propose an algorithm to solve data dependences without process cooperation in a distributed manner. Our experimental results demonstrate the scalability and practicality of the approach for both shared-memory and distributed-memory multicore systems. Finally, we present a scalable nonblocking topology-aware multicast scheme for distributed DAG scheduling applications
Efficient Set Sharing Using ZBDDs
Set sharing is an abstract domain in which each concrete object is represented by the set of local variables from which it might be reachable. It is a useful abstraction to detect parallelism opportunities, since it contains definite information about which variables do not share in memory, i.e., about when the memory regions reachable from those variables are disjoint. Set sharing is a more precise alternative to pair sharing, in which each domain element is a set of all pairs of local variables from which a common object may be reachable. However, the exponential complexity of some set sharing operations has limited its wider application. This work introduces an efficient implementation of the set sharing domain using Zero-suppressed Binary Decision Diagrams (ZBDDs). Because ZBDDs were designed to represent sets of combinations (i.e., sets of sets), they naturally represent elements of the set sharing domain. We show how to synthesize the operations needed in the set sharing transfer functions from basic ZBDD operations. For some of the operations, we devise custom ZBDD algorithms that perform better in practice. We also compare our implementation of the abstract domain with an efficient, compact, bit set-based alternative, and show that the ZBDD version scales better in terms of both memory usage and running time
- …