RAxML-Cell: Parallel Phylogenetic Tree Inference on the Cell Broadband Engine
Phylogenetic tree reconstruction is one of the grand challenge problems in bioinformatics. The search for a best-scoring tree with 50 organisms, under a reasonable optimality criterion, creates a topological search space that is as large as the number of atoms in the universe. Computational phylogeny is challenging even for the most powerful supercomputers. It is also an ideal candidate for benchmarking emerging multiprocessor architectures, because it exhibits various levels of fine- and coarse-grain parallelism. In this paper, we present the porting, optimization, and evaluation of RAxML on the Cell Broadband Engine. RAxML is a provably efficient hill-climbing algorithm for computing phylogenetic trees based on the Maximum Likelihood (ML) method. The algorithm uses an embarrassingly parallel search method, which also exhibits data-level parallelism and control parallelism in the computation of the likelihood functions. We present the optimization of one of the currently fastest tree search algorithms on a real Cell blade prototype. We also investigate problems and present solutions pertaining to the optimization of floating point code, control flow, communication, scheduling, and multi-level parallelization on the Cell Broadband Engine.
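The likelihood functions mentioned above are, in ML phylogenetics, typically evaluated with Felsenstein's pruning algorithm. The following is a minimal single-site sketch under the Jukes-Cantor (JC69) substitution model; it is an illustrative stand-in, not RAxML's actual kernel:

```python
import math

def jc69(t):
    """Jukes-Cantor transition probability matrix for branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def conditional_likelihood(node):
    """Felsenstein pruning: per-state conditional likelihoods at a node.
    A node is either ('leaf', state_index) or
    ('internal', (child, branch_len), (child, branch_len))."""
    if node[0] == 'leaf':
        return [1.0 if i == node[1] else 0.0 for i in range(4)]
    _, (left, tl), (right, tr) = node
    Pl, Pr = jc69(tl), jc69(tr)
    Ll, Lr = conditional_likelihood(left), conditional_likelihood(right)
    # Probability of each subtree's data, conditioned on the parent state.
    down_l = [sum(Pl[i][j] * Ll[j] for j in range(4)) for i in range(4)]
    down_r = [sum(Pr[i][j] * Lr[j] for j in range(4)) for i in range(4)]
    return [down_l[i] * down_r[i] for i in range(4)]

def site_likelihood(root):
    # Uniform stationary base frequencies under JC69.
    return sum(0.25 * x for x in conditional_likelihood(root))
```

Production codes such as RAxML evaluate this recursion over thousands of alignment sites at once, which is the source of the data-level parallelism the abstract refers to.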
High-throughput sequence alignment using Graphics Processing Units
Background: The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies.
Results: This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from NVIDIA to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high-end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies.
Conclusion: MUMmerGPU is a low-cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.
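MUMmerGPU's suffix-tree kernel is not reproduced in the abstract. As a rough illustration of the underlying task, here is a naive dynamic-programming stand-in that finds the longest exact match of a query within a reference; MUMmer answers such queries in linear time via a suffix tree, and MUMmerGPU parallelizes across the independent queries:

```python
def longest_exact_match(reference, query):
    """Longest substring of `query` that occurs exactly in `reference`.
    Returns (length, ref_start, query_start), 0-based. This O(n*m) DP is
    only illustrative; MUMmer uses a suffix tree for the same question.
    Each query is independent of the others, which is what the GPU
    exploits by assigning queries to parallel threads."""
    n, m = len(reference), len(query)
    best = (0, 0, 0)
    prev = [0] * (m + 1)
    for i in range(1, n + 1):
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            if reference[i - 1] == query[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the current run of matches
                if cur[j] > best[0]:
                    best = (cur[j], i - cur[j], j - cur[j])
        prev = cur
    return best
```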
On the acceleration of wavefront applications using distributed many-core architectures
In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications, a ubiquitous class of parallel algorithms used in the solution of many scientific and engineering problems. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C, and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions far exceeds that of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithms on GPU-based architectures.
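In a pipelined wavefront computation of the kind accelerated here, each cell depends on its north and west neighbors, so all cells on the same anti-diagonal are mutually independent and can be processed in parallel. A minimal 2D sketch (the stencil and weights are illustrative, not the LU solver's):

```python
def wavefront_sweep(grid):
    """Wavefront sweep over a 2D grid: out[i][j] depends on out[i-1][j]
    (north) and out[i][j-1] (west). Cells sharing an anti-diagonal i+j
    are independent, which is the parallelism GPUs exploit; here we just
    visit the diagonals in dependency order."""
    n, m = len(grid), len(grid[0])
    out = [[0.0] * m for _ in range(n)]
    for d in range(n + m - 1):                      # anti-diagonal index
        for i in range(max(0, d - m + 1), min(n, d + 1)):
            j = d - i
            north = out[i - 1][j] if i > 0 else 0.0
            west = out[i][j - 1] if j > 0 else 0.0
            out[i][j] = grid[i][j] + 0.5 * (north + west)
    return out
```

The k-blocking strategy proposed in the paper amounts to grouping several such diagonal steps to amortize launch and PCIe transfer overheads.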
Scheduling Dynamic Parallelism On Accelerators
Resource management on accelerator-based systems is complicated by the disjoint nature of the main CPU and accelerator, which involves separate memory hierarchies, different degrees of parallelism, and the relatively high cost of communicating between them. For applications with irregular parallelism, where work is dynamically created based on other computations, the accelerators may both consume and produce work. To maintain load balance, the accelerators hand work back to the CPU to be scheduled. In this paper we consider multiple approaches to such scheduling problems and use the Cell BE system to demonstrate the different schedulers and the trade-offs between them. Our evaluation is done with both microbenchmarks and two bioinformatics applications (PBPI and RAxML). Our baseline approach uses a standard Linux scheduler on the CPU, possibly with more than one process per CPU. We then consider the addition of cooperative scheduling to the Linux kernel and a user-level work-stealing approach. The two cooperative approaches are able to decrease SPE idle time by 30% and 70%, respectively, relative to the baseline scheduler. In both cases we believe the changes required to application-level codes, e.g., a program written with MPI processes that use accelerator-based compute nodes, are reasonable, although the kernel-level approach provides more generality and ease of implementation, but often less performance than the work-stealing approach.
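A deterministic, single-threaded simulation can illustrate the work-stealing idea the abstract evaluates (the details below are assumptions for illustration, not the paper's implementation): each worker runs tasks from its own deque and, when its deque is empty, steals the oldest task from a busy victim instead of idling.

```python
from collections import deque
import random

def run_with_stealing(task_lists, steal=True, seed=0):
    """Round-robin simulation of work stealing. Each worker pops from the
    bottom of its own deque (LIFO, cache-friendly); when empty it steals
    from the top of a random non-empty victim's deque (FIFO). Returns
    per-worker counts of tasks executed and the total idle steps."""
    rng = random.Random(seed)
    deques = [deque(tasks) for tasks in task_lists]
    done = [0] * len(deques)
    idle = 0
    while any(deques):
        for w, dq in enumerate(deques):
            if dq:
                dq.pop()                           # run own newest task
                done[w] += 1
            elif steal:
                victims = [v for v in deques if v]
                if victims:
                    rng.choice(victims).popleft()  # steal a victim's oldest
                    done[w] += 1
            else:
                idle += 1                          # no stealing: sit idle
    return done, idle
```

With stealing enabled, a load that starts entirely on one worker spreads across all of them, which mirrors the SPE idle-time reduction reported above.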
Strengthening measurements from the edges: application-level packet loss rate estimation
Network users know much less than ISPs, Internet exchanges, and content providers about what happens inside the network. Consequently, users can neither easily detect network neutrality violations nor readily exercise their market power by knowledgeably switching ISPs. This paper contributes to the ongoing efforts to empower users by proposing two models to estimate, via application-level measurements, a key network indicator: the packet loss rate (PLR) experienced by FTP-like TCP downloads. Controlled, testbed, and large-scale experiments show that the Inverse Mathis model is simpler and more consistent across the whole PLR range, but less accurate than the more advanced Likely Rexmit model for landline connections and moderate PLR values.
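The Inverse Mathis model referred to above builds on the classic Mathis TCP throughput formula, rate ≈ (MSS/RTT) · C/√p with C ≈ √(3/2); inverting it yields a loss-rate estimate from measured goodput. A sketch under steady-state assumptions (the paper's exact formulation may differ):

```python
import math

def inverse_mathis_plr(goodput_bps, rtt_s, mss_bytes=1460, c=math.sqrt(1.5)):
    """Estimate the packet loss rate by inverting the Mathis model:
        rate ≈ (MSS / RTT) * C / sqrt(p)  =>  p ≈ (C * MSS / (RTT * rate))**2
    Inputs: goodput in bits/s, RTT in seconds, MSS in bytes. Assumes a
    long-lived, loss-limited TCP flow in steady state."""
    mss_bits = 8 * mss_bytes
    return (c * mss_bits / (rtt_s * goodput_bps)) ** 2
```

The appeal for application-level measurement is that goodput and RTT are both observable from an endpoint without any in-network support.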
Heterogeneous Cloud Systems Based on Broadband Embedded Computing
Computing systems continue to evolve from homogeneous systems of commodity-based servers within a single data-center towards modern Cloud systems that consist of numerous data-center clusters virtualized at the infrastructure and application layers to provide scalable, cost-effective and elastic services to devices connected over the Internet. There is an emerging trend towards heterogeneous Cloud systems driven by growth in wired as well as wireless devices that incorporate the potential of millions, and soon billions, of embedded devices enabling new forms of computation and service delivery. Service providers such as broadband cable operators continue to contribute towards this expansion with growing Cloud system infrastructures combined with deployments of increasingly powerful embedded devices across broadband networks. Broadband networks enable access to service provider Cloud data-centers and the Internet from numerous devices. These include home computers, smartphones, tablets, game consoles, sensor networks, and set-top box devices. With these trends in mind, I propose the concept of broadband embedded computing as the utilization of a broadband network of embedded devices for collective computation in conjunction with centralized Cloud infrastructures. I claim that this form of distributed computing results in a new class of heterogeneous Cloud systems, service delivery and application enablement. To support these claims, I present a collection of research contributions in adapting distributed software platforms that include MPI and MapReduce to support simultaneous application execution across centralized data-center blade servers and resource-constrained embedded devices. Leveraging these contributions, I develop two complete prototype system implementations to demonstrate an architecture for heterogeneous Cloud systems based on broadband embedded computing.
Each system is validated by executing experiments with applications taken from bioinformatics and image processing as well as communication and computational benchmarks. This vision, however, is not without challenges. The questions of how to adapt standard distributed computing paradigms such as MPI and MapReduce for implementation on potentially resource-constrained embedded devices, and how to adapt cluster computing runtime environments to enable heterogeneous process execution across millions of devices, remain open. This dissertation presents methods to begin addressing these open questions through the development and testing of both experimental broadband embedded computing systems and in-depth characterization of broadband network behavior. I present experimental results and comparative analysis that offer potential solutions for optimal scalability and performance for constructing broadband embedded computing systems. I also present a number of contributions enabling practical implementation of both heterogeneous Cloud systems and novel application services based on broadband embedded computing.
Porting Rodinia Applications to OmpSs@FPGA
FPGA computing is a low-power alternative to the widely used multi-core CPU and GPU computing systems. However, because FPGA devices are completely different in terms of architecture, they are quite complex to compare to other forms of computing. The Rodinia Benchmark Suite consists of a number of applications that can be used to benchmark heterogeneous computing systems. The suite has currently adapted the applications for multi-core CPU and GPU computing (using the OpenMP, CUDA, and OpenCL libraries). The objective of this project is to port some of the applications from the Rodinia Benchmark Suite to OmpSs@FPGA, a heterogeneous FPGA computing environment based on Xilinx FPGA devices. A portion of these applications will also be optimized using both OmpSs features and Xilinx tools (Vivado HLS). While the original intention was to port and test the applications on a physical FPGA device, the lack of access to the hardware during the initial porting phase encouraged the development of a simulated FPGA environment. This implied modifying the runtime to communicate with a software block running as an executable instead of trying to access the real hardware. Even though this added a significant, unplanned workload to the project, it ended up making the porting of the applications much faster than with the real hardware. Ultimately, the expected number of hours from the initial planning matched the hours it took to both develop the simulated environment and port the applications. A total of 7 applications were ported to the OmpSs@FPGA environment, 6 of which were optimized to a certain extent. Furthermore, each of the accumulated optimization stages for every optimized application was analyzed and explained using Paraver traces. After that, a sustainability report was made to evaluate the environmental, economic, and social impact of the project. The final conclusions state that the original objective of the project has been fulfilled and thus the project has been completed successfully.
Applications on emerging paradigms in parallel computing
The area of computing is seeing parallelism increasingly incorporated at various levels: from the lowest levels of vector processing units following the Single Instruction Multiple Data (SIMD) model, Simultaneous Multi-threading (SMT) architectures, and multi/many-cores with thread-level shared memory and SIMT parallelism, to the higher levels of distributed-memory parallelism in supercomputers and clusters, scaling up to large distributed systems such as server farms and clouds. Together these form a large hierarchy of parallelism. Developing high-performance parallel algorithms and efficient software tools that make use of the available parallelism is essential in order to harness the raw computational power these emerging systems have to offer. In the work presented in this thesis, we develop architecture-aware parallel techniques on such emerging paradigms in parallel computing, specifically the parallelism offered by the emerging multi- and many-core architectures, as well as the emerging area of cloud computing, to target large scientific applications.
First, we develop efficient parallel algorithms to compute optimal pairwise alignments of genomic sequences on heterogeneous multi-core processors, and demonstrate them on the IBM Cell Broadband Engine. Then, we develop parallel techniques for scheduling all-pairs computations on heterogeneous systems, including clusters of Cell processors, and NVIDIA graphics processors. We compare the performance of our strategies on Cell, GPU and Intel Nehalem multi-core processors. Further, we apply our algorithms to specific applications taken from the areas of systems biology, fluid dynamics and materials science: pairwise Mutual Information computations for reconstruction of gene regulatory networks; pairwise Lp-norm distance computations for coherent structures discovery in the design of flapping-wing Micro Air Vehicles, and construction of stochastic models for a set of properties of heterogeneous materials.
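As an illustration of the all-pairs pattern described above, pairwise mutual information between discretized expression profiles can be computed independently for every gene pair, which is what makes the workload straightforward to schedule across accelerators. A plug-in estimator sketch (not the thesis's implementation):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """MI (in bits) between two equal-length discrete sequences,
    estimated from empirical joint and marginal frequencies."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in pxy.items():
        p = c / n
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), with counts folded in
        mi += p * math.log2(p * n * n / (px[x] * py[y]))
    return mi

def all_pairs_mi(profiles):
    """All-pairs MI matrix; every (i, j) entry is independent, so the
    g*(g+1)/2 distinct computations can be distributed freely."""
    g = len(profiles)
    return [[mutual_information(profiles[i], profiles[j]) for j in range(g)]
            for i in range(g)]
```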
Lastly, in the area of cloud computing, we propose and develop an abstract framework to enable parallel computations on large tree structures, to facilitate easy development of a class of scientific applications based on trees. Our framework, in the style of Google's MapReduce paradigm, is based on two generic user-defined functions through which a user writes an application. We implement our framework as a generic programming library for a large cluster of homogeneous multi-core processors, and demonstrate its applicability through two applications: all-k-nearest-neighbors computations, and Fast Multipole Method (FMM)-based simulations.
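The two user-defined functions are not named in the abstract; a minimal sketch of what such a tree framework's API might look like, with one function evaluating leaves and one combining children's results (the names and signatures here are assumptions):

```python
from concurrent.futures import ThreadPoolExecutor

def tree_compute(node, leaf_fn, combine_fn, pool=None):
    """Generic bottom-up tree computation driven by two user-defined
    functions, in the spirit of MapReduce over trees: `leaf_fn` evaluates
    a leaf's payload, `combine_fn` merges the children's results at an
    internal node. A node is ('leaf', payload) or ('node', [children]).
    Child subtrees are independent; for simplicity only the top level is
    fanned out when a pool is supplied."""
    tag, payload = node
    if tag == 'leaf':
        return leaf_fn(payload)
    if pool is not None:
        results = list(pool.map(
            lambda ch: tree_compute(ch, leaf_fn, combine_fn), payload))
    else:
        results = [tree_compute(ch, leaf_fn, combine_fn) for ch in payload]
    return combine_fn(results)

# Usage: summing all leaf values of a small tree.
tree = ('node', [('leaf', 3), ('node', [('leaf', 1), ('leaf', 2)])])
total = tree_compute(tree, lambda x: x, sum)
```

Both applications named above fit this shape: the traversal skeleton is fixed by the framework, and only the two functions change per application.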