A Case Study in Coordination Programming: Performance Evaluation of S-Net vs Intel's Concurrent Collections
We present a programming methodology and runtime performance case study
comparing the declarative data flow coordination language S-Net with Intel's
Concurrent Collections (CnC). As a coordination language S-Net achieves a
near-complete separation of concerns between sequential software components
implemented in a separate algorithmic language and their parallel orchestration
in an asynchronous data flow streaming network. We investigate the merits of
S-Net and CnC with the help of a relevant and non-trivial linear algebra
problem: tiled Cholesky decomposition. We describe two alternative S-Net
implementations of tiled Cholesky factorization and compare them with two CnC
implementations, one with explicit performance tuning and one without, that
have previously been used to illustrate Intel CnC. Our experiments on a 48-core
machine demonstrate that S-Net manages to outperform CnC on this problem.
Comment: 9 pages, 8 figures, 1 table, accepted for PLC 2014 workshop
Algorithms for Large-scale Whole Genome Association Analysis
In order to associate complex traits with genetic polymorphisms, genome-wide
association studies process huge datasets involving tens of thousands of
individuals genotyped for millions of polymorphisms. When handling these
datasets, which exceed the main memory of contemporary computers, one faces two
distinct challenges: 1) Millions of polymorphisms come at the cost of hundreds
of Gigabytes of genotype data, which can only be kept in secondary storage; 2)
the relatedness of the test population is represented by a covariance matrix,
which, for large populations, can only fit in the combined main memory of a
distributed architecture. In this paper, we present solutions for both
challenges: The genotype data is streamed from and to secondary storage using a
double buffering technique, while the covariance matrix is kept across the main
memory of a distributed memory system. We show that these methods sustain
high performance and allow the analysis of enormous datasets.
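The double-buffering idea named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function `stream_blocks`, its parameters, and the in-memory `io.BytesIO` source standing in for secondary storage are all hypothetical, chosen only to show how a reader thread can fill one buffer while the other is being processed.

```python
import io
import queue
import threading


def stream_blocks(source, block_size, process):
    """Read `source` in blocks of `block_size` bytes, overlapping I/O and compute.

    Exactly two buffers circulate between the reader thread and the caller:
    while `process` works on one, the reader fills the other.
    (Hypothetical helper for illustration only.)
    """
    free = queue.Queue()    # buffers ready to be filled
    filled = queue.Queue()  # (buffer, byte_count) pairs ready to be processed
    for _ in range(2):
        free.put(bytearray(block_size))

    def reader():
        while True:
            buf = free.get()
            n = source.readinto(buf)  # number of bytes read, 0 at end of data
            filled.put((buf, n))
            if not n:
                return

    thread = threading.Thread(target=reader, daemon=True)
    thread.start()

    results = []
    while True:
        buf, n = filled.get()
        if not n:
            break
        results.append(process(memoryview(buf)[:n]))
        free.put(buf)  # hand the buffer back so the reader can refill it
    thread.join()
    return results


# Usage: sum each 4-byte block of a 10-byte in-memory "file".
data = bytes(range(10))
block_sums = stream_blocks(io.BytesIO(data), 4, lambda block: sum(block))
print(block_sums)  # [6, 22, 17]
```

Passing buffers back and forth through two queues keeps the memory footprint fixed at two blocks regardless of the size of the data, which is the property that matters when the data exceeds main memory.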
Distributed Bayesian Probabilistic Matrix Factorization
Matrix factorization is a common machine learning technique for recommender
systems. Despite its high prediction accuracy, the Bayesian Probabilistic
Matrix Factorization algorithm (BPMF) has not been widely used on large scale
data because of its high computational cost. In this paper we propose a
distributed high-performance parallel implementation of BPMF on shared memory
and distributed architectures. We show that, by using efficient load balancing
with work stealing on a single node and asynchronous communication in the
distributed version, we beat state-of-the-art implementations.
High Performance Solutions for Big-data GWAS
In order to associate complex traits with genetic polymorphisms, genome-wide
association studies process huge datasets involving tens of thousands of
individuals genotyped for millions of polymorphisms. When handling these
datasets, which exceed the main memory of contemporary computers, one faces two
distinct challenges: 1) Millions of polymorphisms and thousands of phenotypes
come at the cost of hundreds of gigabytes of data, which can only be kept in
secondary storage; 2) the relatedness of the test population is represented by
a relationship matrix, which, for large populations, can only fit in the
combined main memory of a distributed architecture. In this paper, by using
distributed resources such as Cloud or clusters, we address both challenges:
The genotype and phenotype data is streamed from secondary storage using a
double buffering technique, while the relationship matrix is kept across the
main memory of a distributed memory system. With the help of these solutions,
we develop separate algorithms for studies involving only one or a multitude of
traits. We show that these algorithms sustain high performance and allow the
analysis of enormous datasets.
Comment: Submitted to Parallel Computing. arXiv admin note: substantial text
overlap with arXiv:1304.227
Directed Transmission Method, A Fully Asynchronous approach to Solve Sparse Linear Systems in Parallel
In this paper, we propose a new distributed algorithm, called Directed
Transmission Method (DTM). DTM is a fully asynchronous and continuous-time
iterative algorithm for solving sparse SPD linear systems. As an architecture-aware
algorithm, DTM can run freely on all kinds of heterogeneous parallel
computers. We prove that DTM is convergent by making use of the final-value
theorem of the Laplace transform. Numerical experiments show that DTM is
stable and efficient.
Comment: v1: poster presented in SPAA'08; v2: full paper; v3: rename EVS to
GNBT; v4: reuse EVS. More info, see my web page at
http://weifei00.googlepages.co
Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis
Sympiler is a domain-specific code generator that optimizes sparse matrix
computations by decoupling the symbolic analysis phase from the numerical
manipulation stage in sparse codes. The computation patterns in sparse
numerical methods are guided by the input sparsity structure and the sparse
algorithm itself. In many real-world simulations, the sparsity pattern changes
little or not at all. Sympiler takes advantage of these properties to
symbolically analyze sparse codes at compile-time and to apply inspector-guided
transformations that enable applying low-level transformations to sparse codes.
As a result, the Sympiler-generated code outperforms highly-optimized matrix
factorization codes from commonly-used specialized libraries, obtaining average
speedups over Eigen and CHOLMOD of 3.8X and 1.5X respectively.
Comment: 12 pages
A Mobile Computing Architecture for Numerical Simulation
Parallelization of numerical code is common in the domain of numerical
simulation. A numerical context is defined by the configuration of resources
such as memory, processor load, and the communication graph, together with an
evolving feature: resource availability. One feature is often missing:
adaptability. Resource availability is not predictable, so adaptability is
essential. Without calling into question the existing implementations of these
codes, we create an adaptive use of them. Because execution has to be driven by
the availability of the main resources, the components of a numerical
computation have to react when their context changes. This paper offers a new
architecture, a mobile computing architecture, based on mobile agents and
JavaSpace. At the end of this paper, we apply our architecture to several case
studies and obtain our first results.