Efficient Implementation of a Synchronous Parallel Push-Relabel Algorithm
Motivated by the observation that FIFO-based push-relabel algorithms are able
to outperform highest label-based variants on modern, large maximum flow
problem instances, we introduce an efficient implementation of the algorithm
that uses coarse-grained parallelism to avoid the problems of existing parallel
approaches. We demonstrate good relative and absolute speedups of our algorithm
on a set of large graph instances taken from real-world applications. On a
modern 40-core machine, our parallel implementation outperforms existing
sequential implementations by up to a factor of 12 and other parallel
implementations by factors of up to 3.
Fast matrix multiplication techniques based on the Adleman-Lipton model
On distributed memory electronic computers, the implementation and
association of fast parallel matrix multiplication algorithms has yielded
astounding results and insights. In this discourse, we use the tools of
molecular biology to demonstrate the theoretical encoding of Strassen's fast
matrix multiplication algorithm with DNA based on an n-moduli set in the
residue number system, thereby demonstrating the viability of computational
mathematics with DNA. As a result, a general scalable implementation of this
model in the DNA computing paradigm is presented and can be generalized to the
application of all fast matrix multiplication algorithms on a DNA
computer. We also discuss the practical capabilities and issues of this
scalable implementation. Fast methods of matrix computations with DNA are
important because they also allow for the efficient implementation of other
algorithms (e.g., inversion, computing determinants, and graph theory) with DNA.
Comment: To appear in the International Journal of Computer Engineering
Research. Minor changes made to make the preprint as similar as possible to
the published version.
Distributed computing methodology for training neural networks in an image-guided diagnostic application
Distributed computing is a process through which a set of computers connected by a network is used collectively to solve a single problem. In this paper, we propose a distributed computing methodology for training neural networks for the detection of lesions in colonoscopy. Our approach is based on partitioning the training set across multiple processors using a parallel virtual machine. In this way, interconnected computers of varied architectures can be used for the distributed evaluation of the error function and gradient values and, thus, for training neural networks with various learning methods. The proposed methodology has large granularity and low synchronization, and has been implemented and tested. Our results indicate that the parallel virtual machine implementation of the training algorithms developed leads to considerable speedup, especially when large network architectures and training sets are used.
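The partitioning idea in this abstract can be illustrated with a minimal sketch. It rests on assumptions that are not from the paper: a linear model with squared error stands in for the neural network, a thread pool stands in for the parallel virtual machine workers, and all function names are hypothetical. It only shows that the error and gradient over the full training set equal the sums of the per-partition values.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_error_and_gradient(weights, samples):
    """Squared error and its gradient for a linear model on one partition."""
    error = 0.0
    grad = [0.0] * len(weights)
    for features, target in samples:
        residual = sum(w * x for w, x in zip(weights, features)) - target
        error += 0.5 * residual * residual
        for i, x in enumerate(features):
            grad[i] += residual * x
    return error, grad


def distributed_error_and_gradient(weights, samples, n_workers):
    """Split the training set, evaluate each part on a worker, sum the results."""
    # Hypothetical round-robin partitioning; threads stand in for remote workers.
    parts = [samples[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(
            pool.map(lambda part: partial_error_and_gradient(weights, part), parts)
        )
    total_error = sum(e for e, _ in results)
    total_grad = [sum(g[i] for _, g in results) for i in range(len(weights))]
    return total_error, total_grad


training_set = [
    ([1.0, 2.0], 1.0),
    ([0.5, -1.0], 0.0),
    ([2.0, 0.0], 3.0),
    ([1.5, 1.5], 2.0),
]
weights = [0.1, -0.2]
serial = partial_error_and_gradient(weights, training_set)
parallel = distributed_error_and_gradient(weights, training_set, n_workers=2)
```

Each worker needs only the current weights and its own partition, and the only synchronization point is the final summation, which is consistent with the large granularity and low synchronization the abstract describes.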
Empirical Evaluation of the Parallel Distribution Sweeping Framework on Multicore Architectures
In this paper, we perform an empirical evaluation of the Parallel External
Memory (PEM) model in the context of geometric problems. In particular, we
implement the parallel distribution sweeping framework of Ajwani, Sitchinava
and Zeh to solve the batched 1-dimensional stabbing max problem. While modern
processors consist of sophisticated memory systems (multiple levels of caches,
set associativity, TLB, prefetching), we empirically show that algorithms
designed in simple models that focus on minimizing the I/O transfers between
shared memory and single level cache, can lead to efficient software on current
multicore architectures. Our implementation exhibits significantly fewer
accesses to slow DRAM and, therefore, outperforms traditional approaches based
on plane sweep and two-way divide and conquer.
Comment: Longer version of ESA'13 paper.
Techniques for Autotuning Algorithms on Heterogenous Platforms
Proceedings of the First PhD Symposium on Sustainable Ultrascale
Computing Systems (NESUS PhD 2016), Timisoara, Romania, February 8-11, 2016.
Current GPUs (Graphics Processing Units) can obtain high computational performance in scientific applications.
Nevertheless, programmers have to use parallel algorithms suited to these architectures and have to apply
optimization techniques in the implementation in order to achieve that performance. This thesis focuses on
designing and implementing parallel prefix algorithms on GPU architectures with little effort. To that end, we have
developed a highly optimized library called BPLG (Tuning Butterfly Processing Library for GPUs), based on a set
of building blocks that make it easy to design well-known algorithms such as the FFT, tridiagonal system solvers, the scan
operator, sorting, or signal processing. The library is designed under a tuning methodology based on two stages,
identified as GPU resource analysis and operator string manipulation. Specifically, this strategy focuses on a
set of parallel prefix algorithms that can be represented according to a set of common permutations of the digits
of each of their element indices [4], denoted as Index-Digit (ID) algorithms. So far, the proposed methodology has
obtained very good results with respect to state-of-the-art libraries such as CUFFT, CUSPARSE, CUDPP, or ModernGPU.
European Cooperation in Science and Technology. COS
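As an illustration of the scan operator mentioned in this abstract, here is a minimal sketch of a step-wise parallel prefix sum (the Hillis-Steele scheme). It is not BPLG's implementation and does not use its building blocks; the function name is hypothetical, and the Python list comprehension only simulates what a GPU would compute with one thread per index.

```python
def inclusive_scan(values):
    """Inclusive prefix sum in ceil(log2(n)) data-parallel steps."""
    data = list(values)
    stride = 1
    while stride < len(data):
        # Each output of this step depends only on the previous step's array,
        # so all indices could be updated concurrently on a GPU.
        data = [
            data[i] + data[i - stride] if i >= stride else data[i]
            for i in range(len(data))
        ]
        stride *= 2
    return data


result = inclusive_scan([3, 1, 4, 1, 5, 9, 2, 6])
print(result)  # [3, 4, 8, 9, 14, 23, 25, 31]
```

The doubling stride is what gives the logarithmic number of steps. Work-efficient variants exist, but this form is the shortest to read.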
A methodology for parallel implementation of the basic operations of digital signal processing
A methodology for the parallel implementation of the basic operations of digital signal processing is considered. The methodology is built on the analysis and generalization of results obtained in constructing a model description of computation organization. It provides a set of formal transformations that convert a sequential computing system into a parallel adaptive processing mode. The methodology offers a formal basis for the concurrent exploration of algorithms and architectures, thus creating a basis for improving the efficiency of parallel computing. © 2019 Author(s).
A Parallel General Purpose Multi-Objective Optimization Framework, with Application to Beam Dynamics
Particle accelerators are invaluable tools for research in the basic and
applied sciences, in fields such as materials science, chemistry, the
biosciences, particle physics, nuclear physics and medicine. The design,
commissioning, and operation of accelerator facilities is a non-trivial task,
due to the large number of control parameters and the complex interplay of
several conflicting design goals. We propose to tackle this problem by means of
multi-objective optimization algorithms which also facilitate a parallel
deployment. In order to compute solutions in a meaningful time frame a fast and
scalable software framework is required. In this paper, we present the
implementation of such a general-purpose framework for simulation-based
multi-objective optimization methods that allows the automatic investigation of
optimal sets of machine parameters. The implementation is based on a
master/slave paradigm, employing several masters that govern a set of slaves
executing simulations and performing optimization tasks. Using evolutionary
algorithms as the optimizer and OPAL as the forward solver, validation
experiments and results of multi-objective optimization problems in the domain
of beam dynamics are presented. The high charge beam line at the Argonne
Wakefield Accelerator Facility was used as the beam dynamics model. The 3D beam
size, transverse momentum, and energy spread were optimized.