10,324 research outputs found
A study in parallel rewriting systems
In this paper we study systematically three basic classes of grammars incorporating parallel rewriting: Indian parallel grammars, Russian parallel grammars and L systems. In particular by extracting basic characteristics of these systems and combining them we introduce new classes of rewriting systems (ETOL[k] systems, ETOLIP systems and ETOLRP systems) Among others, some results on the combinatorial structure of Indian parallel languages and on the combinatorial structures of the new classes of languages are proved. As far as ETOL systems are concerned we prove that every ETOL language can be generated with a fixed (equal to 8) bounded degree of parallelism
Term Rewriting on GPUs
We present a way to implement term rewriting on a GPU. We do this by letting
the GPU repeatedly perform a massively parallel evaluation of all subterms. We
find that if the term rewrite systems exhibit sufficient internal parallelism,
GPU rewriting substantially outperforms the CPU. Since we expect that our
implementation can be further optimized, and because in any case GPUs will
become much more powerful in the future, this suggests that GPUs are an
interesting platform for term rewriting. As term rewriting can be viewed as a
universal programming language, this also opens a route towards programming
GPUs by term rewriting, especially for irregular computations
Separating Dependency from Constituency in a Tree Rewriting System
In this paper we present a new tree-rewriting formalism called Link-Sharing
Tree Adjoining Grammar (LSTAG) which is a variant of synchronous TAGs. Using
LSTAG we define an approach towards coordination where linguistic dependency is
distinguished from the notion of constituency. Such an approach towards
coordination that explicitly distinguishes dependencies from constituency gives
a better formal understanding of its representation when compared to previous
approaches that use tree-rewriting systems which conflate the two issues.Comment: 7 pages, 6 Postscript figures, uses fullname.st
Extending a multi-set relational algebra to a parallel environment
Parallel database systems will very probably be the future for high-performance data-intensive applications. In the past decade, many parallel database systems have been developed, together with many languages and approaches to specify operations in these systems. A common background is still missing, however. This paper proposes an extended relational algebra for this purpose, based on the well-known standard relational algebra. The extended algebra provides both complete database manipulation language features, and data distribution and process allocation primitives to describe parallelism. It is defined in terms of multi-sets of tuples to allow handling of duplicates and to obtain a close connection to the world of high-performance data processing. Due to its algebraic nature, the language is well suited for optimization and parallelization through expression rewriting. The proposed language can be used as a database manipulation language on its own, as has been done in the PRISMA parallel database project, or as a formal basis for other languages, like SQL
Extending the Nested Parallel Model to the Nested Dataflow Model with Provably Efficient Schedulers
The nested parallel (a.k.a. fork-join) model is widely used for writing
parallel programs. However, the two composition constructs, i.e. ""
(parallel) and "" (serial), are insufficient in expressing "partial
dependencies" or "partial parallelism" in a program. We propose a new dataflow
composition construct "" to express partial dependencies in
algorithms in a processor- and cache-oblivious way, thus extending the Nested
Parallel (NP) model to the \emph{Nested Dataflow} (ND) model. We redesign
several divide-and-conquer algorithms ranging from dense linear algebra to
dynamic-programming in the ND model and prove that they all have optimal span
while retaining optimal cache complexity. We propose the design of runtime
schedulers that map ND programs to multicore processors with multiple levels of
possibly shared caches (i.e, Parallel Memory Hierarchies) and provide
theoretical guarantees on their ability to preserve locality and load balance.
For this, we adapt space-bounded (SB) schedulers for the ND model. We show that
our algorithms have increased "parallelizability" in the ND model, and that SB
schedulers can use the extra parallelizability to achieve asymptotically
optimal bounds on cache misses and running time on a greater number of
processors than in the NP model. The running time for the algorithms in this
paper is , where is the cache complexity of task ,
is the cost of cache miss at level- cache which is of size ,
is a constant, and is the number of processors in an
-level cache hierarchy
Towards an Adaptive Skeleton Framework for Performance Portability
The proliferation of widely available, but very different, parallel architectures
makes the ability to deliver good parallel performance
on a range of architectures, or performance portability, highly desirable.
Irregularly-parallel problems, where the number and size
of tasks is unpredictable, are particularly challenging and require
dynamic coordination.
The paper outlines a novel approach to delivering portable parallel
performance for irregularly parallel programs. The approach
combines declarative parallelism with JIT technology, dynamic
scheduling, and dynamic transformation.
We present the design of an adaptive skeleton library, with a task
graph implementation, JIT trace costing, and adaptive transformations.
We outline the architecture of the protoype adaptive skeleton
execution framework in Pycket, describing tasks, serialisation,
and the current scheduler.We report a preliminary evaluation of the
prototype framework using 4 micro-benchmarks and a small case
study on two NUMA servers (24 and 96 cores) and a small cluster
(17 hosts, 272 cores). Key results include Pycket delivering good
sequential performance e.g. almost as fast as C for some benchmarks;
good absolute speedups on all architectures (up to 120 on
128 cores for sumEuler); and that the adaptive transformations do
improve performance
Towards a GPU-based implementation of interaction nets
We present ingpu, a GPU-based evaluator for interaction nets that heavily
utilizes their potential for parallel evaluation. We discuss advantages and
challenges of the ongoing implementation of ingpu and compare its performance
to existing interaction nets evaluators.Comment: In Proceedings DCM 2012, arXiv:1403.757
- …