
    RAxML-Cell: Parallel Phylogenetic Tree Inference on the Cell Broadband Engine

    Phylogenetic tree reconstruction is one of the grand-challenge problems in bioinformatics. The search for a best-scoring tree with 50 organisms, under a reasonable optimality criterion, creates a topological search space as large as the number of atoms in the universe. Computational phylogeny is challenging even for the most powerful supercomputers. It is also an ideal candidate for benchmarking emerging multiprocessor architectures, because it exhibits various levels of fine- and coarse-grain parallelism. In this paper, we present the porting, optimization, and evaluation of RAxML on the Cell Broadband Engine. RAxML is a provably efficient hill-climbing algorithm for computing phylogenetic trees based on the Maximum Likelihood (ML) method. The algorithm uses an embarrassingly parallel search method, which also exhibits data-level and control parallelism in the computation of the likelihood functions. We present the optimization of one of the currently fastest tree-search algorithms on a real Cell blade prototype. We also investigate problems, and present solutions, pertaining to the optimization of floating-point code, control flow, communication, scheduling, and multi-level parallelization on the Cell.
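The scale of the search space the abstract alludes to can be checked directly: a standard result in phylogenetics is that there are (2n-5)!! distinct unrooted binary tree topologies for n taxa. A minimal sketch (not taken from the paper; function name is illustrative):

```python
from functools import reduce

def num_unrooted_topologies(n):
    """Number of distinct unrooted binary tree topologies for n taxa:
    the double factorial (2n-5)!! = 3 * 5 * ... * (2n-5), for n >= 3."""
    return reduce(lambda acc, k: acc * k, range(3, 2 * n - 4, 2), 1)

print(num_unrooted_topologies(4))   # 3
print(num_unrooted_topologies(50))  # roughly 2.8e74 distinct topologies
```

For 50 taxa this is on the order of 10^74, which is the comparison with the number of atoms in the universe (about 10^80) that the abstract makes.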

    High-throughput sequence alignment using Graphics Processing Units

    Background: The recent availability of new, less expensive high-throughput DNA sequencing technologies has yielded a dramatic increase in the volume of sequence data that must be analyzed. These data are being generated for several purposes, including genotyping, genome resequencing, metagenomics, and de novo genome assembly projects. Sequence alignment programs such as MUMmer have proven essential for analysis of these data, but researchers will need ever faster, high-throughput alignment tools running on inexpensive hardware to keep up with new sequence technologies. Results: This paper describes MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program that runs on commodity Graphics Processing Units (GPUs) in common workstations. MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree. By processing the queries in parallel on the highly parallel graphics card, MUMmerGPU achieves more than a 10-fold speedup over a serial CPU version of the sequence alignment kernel, and outperforms the exact alignment component of MUMmer on a high-end CPU by 3.5-fold in total application time when aligning reads from recent sequencing projects using Solexa/Illumina, 454, and Sanger sequencing technologies. Conclusion: MUMmerGPU is a low-cost, ultra-fast sequence alignment program designed to handle the increasing volume of data produced by new, high-throughput sequencing technologies. MUMmerGPU demonstrates that even memory-intensive applications can run significantly faster on the relatively low-cost GPU than on the CPU.
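The property that makes this workload map well to a GPU is that each query is matched against the reference independently. A naive stand-in for the alignment kernel, assuming simple exact seed matching rather than MUMmerGPU's actual suffix-tree walk (function and parameter names are illustrative):

```python
def exact_matches(reference, queries, min_len=10):
    """Report every position where a query's leading min_len characters
    occur exactly in the reference. MUMmerGPU instead walks a suffix tree
    of the reference; the point here is that each query is processed
    independently, which is what makes the workload data-parallel."""
    hits = {}
    for q in queries:
        seed = q[:min_len]
        hits[q] = [i for i in range(len(reference) - min_len + 1)
                   if reference[i:i + min_len] == seed]
    return hits

ref = "ACGTACGTTAGCACGTACGT"
print(exact_matches(ref, ["ACGTACGT", "TAGC"], min_len=4))
```

On the GPU, the outer loop over queries becomes one thread per query, which is where the reported 10-fold kernel speedup comes from.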

    On the acceleration of wavefront applications using distributed many-core architectures

    In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications, a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C, and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions far exceeds that of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithms on GPU-based architectures.
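The wavefront dependency pattern the abstract refers to is the classic one where cell (i, j) depends on its north and west neighbours, so all cells on an anti-diagonal i + j = d are independent and can execute in parallel (on a GPU, roughly one kernel launch per diagonal). A serial sketch of the sweep, with a toy stencil rather than the LU solver's actual kernel:

```python
def wavefront_sweep(nx, ny, cell):
    """Pipelined wavefront: cell (i, j) depends on (i-1, j) and (i, j-1).
    Sweeping anti-diagonals in order respects all dependencies; within a
    diagonal every cell is independent (the parallelism a GPU exploits)."""
    grid = [[0.0] * ny for _ in range(nx)]
    for d in range(nx + ny - 1):                  # diagonals in dependency order
        for i in range(max(0, d - ny + 1), min(nx, d + 1)):
            j = d - i
            north = grid[i - 1][j] if i > 0 else 0.0
            west = grid[i][j - 1] if j > 0 else 0.0
            grid[i][j] = cell(i, j, north, west)
    return grid

# toy stencil: each cell sums its two dependencies plus one
g = wavefront_sweep(3, 3, lambda i, j, n, w: n + w + 1.0)
print(g[2][2])  # 19.0
```

The k-blocking idea mentioned at the end amounts to grouping several diagonals per kernel launch to amortize the PCIe and launch overheads that the paper's breakdown exposes.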

    Scheduling Dynamic Parallelism On Accelerators

    Resource management on accelerator-based systems is complicated by the disjoint nature of the main CPU and accelerator, which involves separate memory hierarchies, different degrees of parallelism, and a relatively high cost of communication between them. For applications with irregular parallelism, where work is dynamically created based on other computations, the accelerators may both consume and produce work. To maintain load balance, the accelerators hand work back to the CPU to be scheduled. In this paper we consider multiple approaches to such scheduling problems and use the Cell BE system to demonstrate the different schedulers and the trade-offs between them. Our evaluation uses both microbenchmarks and two bioinformatics applications (PBPI and RAxML). Our baseline approach uses a standard Linux scheduler on the CPU, possibly with more than one process per CPU. We then consider the addition of cooperative scheduling to the Linux kernel and a user-level work-stealing approach. The two cooperative approaches decrease SPE idle time by 30% and 70%, respectively, relative to the baseline scheduler. In both cases we believe the changes required to application-level codes, e.g., a program written with MPI processes that use accelerator-based compute nodes, are reasonable; the kernel-level approach provides more generality and ease of implementation, but often lower performance than the work-stealing approach.
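The user-level work-stealing idea can be sketched with per-worker deques: each owner pushes and pops at one end (LIFO, for locality), while an idle worker steals from the other end of a victim's deque. This is a single-threaded illustration under assumed names, not the paper's Cell implementation, and a real version needs per-deque synchronization:

```python
import collections
import random

class WorkStealingPool:
    """Sketch of user-level work stealing: one deque per worker keeps
    accelerators busy instead of idling while a central scheduler catches up."""
    def __init__(self, n_workers):
        self.deques = [collections.deque() for _ in range(n_workers)]

    def push(self, worker, task):
        self.deques[worker].append(task)            # owner pushes at the tail

    def next_task(self, worker):
        if self.deques[worker]:
            return self.deques[worker].pop()        # owner pops the tail (LIFO)
        victims = [d for d in self.deques if d]
        if victims:
            return random.choice(victims).popleft() # thief steals the head (FIFO)
        return None
```

Stealing from the opposite end reduces contention between owner and thief and tends to migrate the oldest (often largest) pieces of work, which is the usual rationale for this design.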

    Strengthening measurements from the edges: application-level packet loss rate estimation

    Network users know much less than ISPs, Internet exchanges, and content providers about what happens inside the network. Consequently, users can neither easily detect network neutrality violations nor readily exercise their market power by knowledgeably switching ISPs. This paper contributes to the ongoing efforts to empower users by proposing two models to estimate -- via application-level measurements -- a key network indicator, the packet loss rate (PLR) experienced by FTP-like TCP downloads. Controlled, testbed, and large-scale experiments show that the Inverse Mathis model is simpler and more consistent across the whole PLR range, but less accurate than the more advanced Likely Rexmit model for landline connections and moderate PLRs.
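The simpler of the two models can be sketched from the well-known Mathis relation, which approximates steady-state TCP throughput as (MSS/RTT) * C / sqrt(p) with C about sqrt(3/2). Inverting it for the loss rate p from an observed download gives the estimate below; parameter names are illustrative and this is not the paper's exact formulation:

```python
from math import sqrt

def inverse_mathis_plr(goodput_bps, mss_bytes, rtt_s, c=sqrt(3.0 / 2.0)):
    """Inverse Mathis estimate: since throughput ~ (MSS/RTT) * C / sqrt(p),
    solving for the loss rate gives p ~ (C * MSS / (RTT * throughput))^2.
    goodput_bps in bits/s, mss_bytes in bytes, rtt_s in seconds."""
    mss_bits = 8.0 * mss_bytes
    return (c * mss_bits / (rtt_s * goodput_bps)) ** 2

# e.g. a 10 Mbit/s FTP-like download, 1460-byte MSS, 50 ms RTT
print(inverse_mathis_plr(10e6, 1460, 0.05))
```

The appeal of this model at the application level is that goodput and RTT are both observable from the endpoints, with no cooperation from the ISP.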

    Porting Rodinia Applications to OmpSs@FPGA

    FPGA computing is a low-power alternative to the widely used multi-core CPU and GPU computing systems. However, because FPGA devices have a completely different architecture, they are difficult to compare with other forms of computing. The Rodinia Benchmark Suite consists of a number of applications that can be used to benchmark heterogeneous computing systems. The suite currently provides versions of the applications for multi-core CPU and GPU computing (using the OpenMP, CUDA, and OpenCL libraries). The objective of this project is to port some of the applications from the Rodinia Benchmark Suite to OmpSs@FPGA, a heterogeneous FPGA computing environment based on Xilinx FPGA devices. A subset of these applications is also optimized using both OmpSs features and Xilinx tools (Vivado HLS). While the original intention was to port and test the applications on a physical FPGA device, the lack of access to the hardware during the initial porting phase motivated the development of a simulated FPGA environment. This meant modifying the runtime to communicate with a software block running as an executable instead of accessing the real hardware. Although this added a significant, unplanned workload to the project, it ultimately made porting the applications much faster than working with the real hardware would have, so the hours spent developing the simulated environment and porting the applications matched the hours originally planned for the porting alone. A total of 7 applications were ported to the OmpSs@FPGA environment, 6 of which were optimized to some extent. 
Furthermore, each accumulated optimization stage of every optimized application was analyzed using execution traces visualized with Paraver. A sustainability report was then produced to evaluate the environmental, economic, and social impact of the project. The final conclusion is that the original objective of the project was fulfilled and the project was completed successfully.

    Applications on emerging paradigms in parallel computing

    The area of computing is seeing parallelism incorporated at an increasing number of levels: from the lowest levels of vector processing units following the Single Instruction Multiple Data (SIMD) model, Simultaneous Multi-threading (SMT) architectures, and multi-/many-cores with thread-level shared memory and SIMT parallelism, to the higher levels of distributed-memory parallelism in supercomputers and clusters, scaling up to large distributed systems such as server farms and clouds. Together these form a large hierarchy of parallelism. Developing high-performance parallel algorithms and efficient software tools that make use of the available parallelism is essential to harness the raw computational power these emerging systems have to offer. In the work presented in this thesis, we develop architecture-aware parallel techniques on such emerging paradigms in parallel computing -- specifically the parallelism offered by emerging multi- and many-core architectures, as well as the emerging area of cloud computing -- to target large scientific applications. First, we develop efficient parallel algorithms to compute optimal pairwise alignments of genomic sequences on heterogeneous multi-core processors, and demonstrate them on the IBM Cell Broadband Engine. Then, we develop parallel techniques for scheduling all-pairs computations on heterogeneous systems, including clusters of Cell processors and NVIDIA graphics processors. We compare the performance of our strategies on Cell, GPU, and Intel Nehalem multi-core processors. 
Further, we apply our algorithms to specific applications taken from the areas of systems biology, fluid dynamics, and materials science: pairwise Mutual Information computations for the reconstruction of gene regulatory networks; pairwise Lp-norm distance computations for coherent-structure discovery in the design of flapping-wing Micro Air Vehicles; and the construction of stochastic models for a set of properties of heterogeneous materials. Lastly, in the area of cloud computing, we propose and develop an abstract framework to enable parallel computations on large tree structures, to facilitate easy development of a class of scientific applications based on trees. Our framework, in the style of Google's MapReduce paradigm, is based on two generic user-defined functions through which a user writes an application. We implement our framework as a generic programming library for a large cluster of homogeneous multi-core processors, and demonstrate its applicability through two applications: all-k-nearest-neighbors computations, and Fast Multipole Method (FMM) based simulations.
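The two-user-function tree framework described at the end can be sketched as a generic traversal in which the user supplies one function to produce a node's children and one to combine a node with its children's results; the framework owns the traversal (and, in the thesis's library, the distribution of subtrees across a cluster). Function names and the tuple encoding below are assumptions for illustration only:

```python
def tree_compute(root, children_of, combine):
    """MapReduce-style tree computation driven by two user-defined functions:
    children_of(node) yields the node's children, and combine(node, results)
    folds the node together with its children's results. The recursion over
    independent subtrees is what a distributed runtime can parallelize."""
    kids = children_of(root)
    return combine(root, [tree_compute(c, children_of, combine) for c in kids])

# toy use: sum the values of a nested (value, [children]) tree
tree = (1, [(2, []), (3, [(4, [])])])
total = tree_compute(tree,
                     children_of=lambda node: node[1],
                     combine=lambda node, results: node[0] + sum(results))
print(total)  # 10
```

Applications such as all-k-nearest-neighbors or FMM fit this shape because each subtree's contribution is computed independently before being combined on the way back up.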