A New Parallel N-body Gravity Solver: TPM
We have developed a gravity solver based on combining the well developed
Particle-Mesh (PM) method and TREE methods. It is designed for and has been
implemented on parallel computer architectures. The new code can deal with tens
of millions of particles on current computers, with the calculation done on a
parallel supercomputer or a group of workstations. Typically, the spatial
resolution is enhanced by more than a factor of 20 over the pure PM code with
mass resolution retained at nearly the PM level. This code runs much faster
than a pure TREE code with the same number of particles and maintains almost
the same resolution in high density regions. Multiple time step integration has
also been implemented with the code, with second order time accuracy. The
performance of the code has been checked in several kinds of parallel computer
configuration, including IBM SP1, SGI Challenge and a group of workstations,
with the speedup of the parallel code on a 32 processor IBM SP2 supercomputer
nearly linear (efficiency ) in the number of processors. The
computation/communication ratio is also very high (), which means the
code spends of its CPU time in computation.
Comment: 21 Pages Latex file Figures available from anonymous ftp to
astro.princeton.edu under /xu/tpm.ps, POP-57
Architecture and Applications of DADO: A Large-Scale Parallel Computer for Artificial Intelligence
As part of our research on very high performance parallel architectures, we have been investigating machine architectures specially adapted to the highly efficient implementation of artificial intelligence (AI) software. In the course of our research we designed DADO, a highly parallel, VLSI-based, tree-structured machine, and implemented a high-speed algorithm for production systems on a simulator for DADO. Subsequent research has convinced us that DADO can support many other AI applications, including the very rapid execution of PROLOG programs and a large share of the symbolic processing typical of contemporary knowledge-based systems. In this brief report, we outline the hardware design of a moderate-size DADO prototype, comprising 1023 processing elements, which is currently under construction at Columbia University. We then sketch the software base being implemented on a small prototype system of 15 processing elements, including several applications written in PPL/M, a high-level language designed for specifying parallel computations on DADO.
High Performance Implementation of Planted Motif Problem using Suffix trees
In this paper we present a high-performance implementation of a suffix-tree-based solution to the planted motif problem on two different parallel architectures: NVIDIA GPUs and Intel multicore machines. An (l,d) planted motif problem (PMP) is defined as follows: given a set of n DNA sequences, each of length L, find M, the set of sequences (or motifs) of length l which have at least one d-neighbor in each of the n sequences. Here, a d-neighbor of a sequence is a sequence of the same length that differs in at most d positions. PMP is a well-studied problem in computational biology. It is useful in developing methods for finding transcription factor binding sites, for sequence classification, and for building phylogenetic trees. The problem is computationally challenging to solve; for example, a (19,7) PMP takes 9.9 hours on a sequential machine. Many approaches to solving the planted motif problem can be found in the literature. One approach is based on the suffix tree data structure. Though suffix-tree-based methods are the most efficient ones for solving large planted motif problems on sequential machines, they are quite difficult to parallelize. We present suffix-tree-based parallel solutions for PMP on NVIDIA GPU and Intel multicore architectures that are efficient and scalable. The solutions are based on a previously presented suffix tree algorithm but use extensive adaptation to the individual architectures to ensure that the implementations work efficiently and scale well.
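The (l,d) definition above can be made concrete with a brute-force sketch. This is not the suffix-tree method the abstract describes: it simply enumerates all 4^l candidate motifs and checks each against every sequence, so it is only practical for very small l. The function names (`hamming`, `has_d_neighbor`, `planted_motifs`) and the toy sequences in the usage note are hypothetical, chosen for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def has_d_neighbor(motif, seq, d):
    """True if some length-l window of seq differs from motif in at most d positions."""
    l = len(motif)
    return any(
        hamming(motif, seq[i:i + l]) <= d
        for i in range(len(seq) - l + 1)
    )


def planted_motifs(seqs, l, d, alphabet="ACGT"):
    """Brute-force (l, d) planted motif search (illustrative, exponential in l).

    Returns every length-l string over the alphabet that has at least
    one d-neighbor in each of the input sequences.
    """
    result = []
    for chars in product(alphabet, repeat=l):
        motif = "".join(chars)
        if all(has_d_neighbor(motif, s, d) for s in seqs):
            result.append(motif)
    return sorted(result)
```

For example, with the toy input `["ACGTAC", "TTACGA", "GACGTT"]`, `planted_motifs(seqs, 3, 0)` keeps only the 3-mers that occur exactly in all three sequences, and raising d to 1 can only enlarge that set.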
MASSIVELY PARALLEL ALGORITHMS FOR POINT CLOUD BASED OBJECT RECOGNITION ON HETEROGENEOUS ARCHITECTURE
With the advent of new commodity depth sensors, point cloud data processing plays an increasingly important role in object recognition and perception. However, the computational cost of point cloud data processing is extremely high due to the large data size, high dimensionality, and algorithmic complexity. To address the computational challenges of real-time processing, this work investigates the possibilities of using modern heterogeneous computing platforms and their supporting ecosystem, such as massively parallel architecture (MPA), computing clusters, compute unified device architecture (CUDA), and multithreaded programming, to accelerate point cloud based object recognition. These computing platforms do not yield high performance unless their specific features are properly utilized; otherwise, the result can actually be inferior performance. To achieve high-speed performance in image descriptor computing, indexing, and matching in point cloud based object recognition, this work explores both coarse- and fine-grained parallelism, identifies the acceptable levels of algorithmic approximation, and analyzes various factors that affect performance. A set of heterogeneous parallel algorithms is designed and implemented in this work. These algorithms include exact and approximate scalable massively parallel image descriptors for descriptor computing, parallel construction of the k-dimensional tree (KD-tree) and the forest of KD-trees for descriptor indexing, and parallel approximate nearest neighbor search (ANNS) and buffered ANNS (BANNS) on the KD-tree and the forest of KD-trees for descriptor matching. The results show that the proposed massively parallel algorithms on heterogeneous computing platforms can significantly improve the execution time of feature computing, indexing, and matching.
Meanwhile, this work demonstrates that heterogeneous computing architectures, with appropriate architecture-specific algorithm design and optimization, offer distinct advantages for improving the performance of multimedia applications.
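As background for the KD-tree indexing and nearest neighbor matching mentioned above, the following is a minimal sequential sketch of KD-tree construction and exact nearest neighbor search. It is not the parallel, approximate, or buffered search (ANNS/BANNS) that the work proposes; the dictionary-based node layout and the function names are assumptions made for illustration.

```python
def build_kdtree(points, depth=0):
    """Build a KD-tree by splitting on the median, cycling through the axes."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(node, query, best=None):
    """Exact nearest neighbor: descend the near side, visit the far side only if needed."""
    if node is None:
        return best
    if best is None or sq_dist(node["point"], query) < sq_dist(best, query):
        best = node["point"]
    axis = node["axis"]
    diff = query[axis] - node["point"][axis]
    if diff < 0:
        near, far = node["left"], node["right"]
    else:
        near, far = node["right"], node["left"]
    best = nearest(near, query, best)
    # The far subtree can hold a closer point only if the splitting
    # plane is closer to the query than the current best match.
    if diff ** 2 < sq_dist(best, query):
        best = nearest(far, query, best)
    return best
```

An approximate search would typically bound the number of far-side visits instead of exploring every subtree that passes the pruning test; that trade-off between accuracy and speed is what the "acceptable levels of algorithmic approximation" in the abstract refers to.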
Task-based adaptive multiresolution for time-space multi-scale reaction-diffusion systems on multi-core architectures
A new solver featuring time-space adaptation and error control has been
recently introduced to tackle the numerical solution of stiff
reaction-diffusion systems. Based on operator splitting, finite volume adaptive
multiresolution and high order time integrators with specific stability
properties for each operator, this strategy yields high computational
efficiency for large multidimensional computations on standard architectures
such as powerful workstations. However, the data structure of the original
implementation, based on trees of pointers, provides limited opportunities for
efficiency enhancements, while posing serious challenges in terms of parallel
programming and load balancing. The present contribution proposes a new
implementation of the whole set of numerical methods including Radau5 and
ROCK4, relying on a fully different data structure together with the use of a
specific library, TBB, for shared-memory, task-based parallelism with
work-stealing. The performance of our implementation is assessed in a series of
test-cases of increasing difficulty in two and three dimensions on multi-core
and many-core architectures, demonstrating high scalability
An Implementation of List Successive Cancellation Decoder with Large List Size for Polar Codes
Polar codes are the first class of forward error correction (FEC) codes with
a provably capacity-achieving capability. Using list successive cancellation
decoding (LSCD) with a large list size, the error correction performance of
polar codes exceeds other well-known FEC codes. However, the hardware
complexity of LSCD rapidly increases with the list size, which incurs high
usage of the resources on the field programmable gate array (FPGA) and
significantly impedes the practical deployment of polar codes. To alleviate the
high complexity, in this paper, two low-complexity decoding schemes and the
corresponding architectures for LSCD targeting FPGA implementation are
proposed. The architecture is implemented in an Altera Stratix V FPGA.
Measurement results show that, even with a list size of 32, the architecture is
able to decode a codeword of 4096-bit polar code within 150 us, achieving a
throughput of 27 Mbps.
Comment: 4 pages, 4 figures, 4 tables, Published in 27th International
Conference on Field Programmable Logic and Applications (FPL), 201
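The throughput figure in the abstract above is consistent with its latency figure, as a quick check shows. The sketch assumes that throughput is counted over coded bits, that one codeword is decoded at a time, and that the latency is exactly 150 us (the abstract says "within"); the function name is hypothetical.

```python
def throughput_mbps(codeword_bits, latency_us):
    """Coded throughput in Mbit/s; bits per microsecond equals Mbit/s."""
    return codeword_bits / latency_us


# 4096 coded bits decoded in 150 us gives roughly 27.3 Mbit/s.
rate = throughput_mbps(4096, 150)
```

Under these assumptions the result is about 27.3 Mbit/s, in line with the quoted 27 Mbps.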