2,981 research outputs found

    Applications on emerging paradigms in parallel computing

    Get PDF
    The area of computing is seeing parallelism increasingly being incorporated at various levels: from the lowest levels of vector processing units following Single Instruction Multiple Data (SIMD) processing, Simultaneous Multi-threading (SMT) architectures, and multi/many-cores with thread-level shared memory and SIMT parallelism, to the higher levels of distributed memory parallelism as in supercomputers and clusters, and scaling them to large distributed systems as server farms and clouds. All together these form a large hierarchy of parallelism. Developing high-performance parallel algorithms and efficient software tools, which make use of the available parallelism, is inevitable in order to harness the raw computational power these emerging systems have to offer. In the work presented in this thesis, we develop architecture-aware parallel techniques on such emerging paradigms in parallel computing, specifically, parallelism offered by the emerging multi- and many-core architectures, as well as the emerging area of cloud computing, to target large scientific applications. First, we develop efficient parallel algorithms to compute optimal pairwise alignments of genomic sequences on heterogeneous multi-core processors, and demonstrate them on the IBM Cell Broadband Engine. Then, we develop parallel techniques for scheduling all-pairs computations on heterogeneous systems, including clusters of Cell processors, and NVIDIA graphics processors. We compare the performance of our strategies on Cell, GPU and Intel Nehalem multi-core processors. Further, we apply our algorithms to specific applications taken from the areas of systems biology, fluid dynamics and materials science: pairwise Mutual Information computations for reconstruction of gene regulatory networks; pairwise Lp-norm distance computations for coherent structures discovery in the design of flapping-wing Micro Air Vehicles, and construction of stochastic models for a set of properties of heterogeneous materials. Lastly, in the area of cloud computing, we propose and develop an abstract framework to enable computations in parallel on large tree structures, to facilitate easy development of a class of scientific applications based on trees. Our framework, in the style of Google\u27s MapReduce paradigm, is based on two generic user-defined functions through which a user writes an application. We implement our framework as a generic programming library for a large cluster of homogeneous multi-core processor, and demonstrate its applicability through two applications: all-k-nearest neighbors computations, and Fast Multipole Method (FMM) based simulations

    Comparison of Data Partitioning Schema of Parallel Pairwise Alignment on Shared Memory System

    Get PDF
    The pairwise alignment (PA) algorithm is widely used in bioinformatics to analyze biological sequence. With the advance of sequencer technology, a massive amount of DNA fragments are sequenced much quicker and cheaper. The alignment algorithm needs to be parallelized to be able to align them in a shorter time. Many previous researches have parallelize PA algorithm using various data partitioning schema, but it is unclear which one is the best. The data partitioning schema is important for parallel PA performance, because this algorithm use dynamic programming technique that needs intense inter-thread communication. In this paper, we compared four partitioning schemas to find the best performing one on shared memory system. Those schemas are: blocked columnwise, rowwise, antidiagonal, and blocked columnwise with manual scheduling and loop unrolling. The last schema gave the best performance of 89% efficiency on 4 threads. This result provided fine-grain parallelism that can be used further to develop parallel multiple sequence alignment (MSA)

    MR-CUDASW - GPU accelerated Smith-Waterman algorithm for medium-length (meta)genomic data

    Get PDF
    The idea of using a graphics processing unit (GPU) for more than simply graphic output purposes has been around for quite some time in scientific communities. However, it is only recently that its benefits for a range of bioinformatics and life sciences compute-intensive tasks has been recognized. This thesis investigates the possibility of improving the performance of the overlap determination stage of an Overlap Layout Consensus (OLC)-based assembler by using a GPU-based implementation of the Smith-Waterman algorithm. In this thesis an existing GPU-accelerated sequence alignment algorithm is adapted and expanded to reduce its completion time. A number of improvements and changes are made to the original software. Workload distribution, query profile construction, and thread scheduling techniques implemented by the original program are replaced by custom methods specifically designed to handle medium-length reads. Accordingly, this algorithm is the first highly parallel solution that has been specifically optimized to process medium-length nucleotide reads (DNA/RNA) from modern sequencing machines (i.e. Ion Torrent). Results show that the software reaches up to 82 GCUPS (Giga Cell Updates Per Second) on a single-GPU graphic card running on a commodity desktop hardware. As a result it is the fastest GPU-based implemen- tation of the Smith-Waterman algorithm tailored for processing medium-length nucleotide reads. Despite being designed for performing the Smith-Waterman algorithm on medium-length nucleotide sequences, this program also presents great potential for improving heterogeneous computing with CUDA-enabled GPUs in general and is expected to make contributions to other research problems that require sensitive pairwise alignment to be applied to a large number of reads. Our results show that it is possible to improve the performance of bioinformatics algorithms by taking full advantage of the compute resources of the underlying commodity hardware and further, these results are especially encouraging since GPU performance grows faster than multi-core CPUs

    A Scientific Workflow System For Genomic Data Analysis

    Get PDF
    Scientific workflows have become increasingly popular as a new computing paradigm for scientists to design and execute complex and distributed scientific processes to enable and accelerate many scientific discoveries. Although several scientific workflow management systems (SWFMSs) have been developed, there is a great need for an integrated scientific workflow system that enables the design and execution of higher-level scientific workflows, which integrate heterogeneous scientific workflows enacted by existing SWFMSs. On one hand, science is becoming increasingly collaborative today, requiring an integrated solution that combines the features and capabilities of different SWFMSs, which are typically developed and optimized towards one single discipline. One the other hand, such an integrated environment can immediately leverage existing and emerging techniques and strengths of various SWFMSs and their supported execution environments, such as Cluster, Grid, and Cloud. The main contributions of this dissertation are: 1) We propose a scientific workflow system, called GENOMEFLOW, to design, develop, and execute higher-level scientific workflows, whose workflow tasks are themselves scientific workflows enacted by existing SWFMSs; 2) We propose a workflow scheduling algorithm, called GSA, to enable the parallel execution of such heterogeneous scientific workflows in their native heterogeneous environments; and 3) We implemented GENOMEFLOW towards the life science community and developed several GENOMEFLOW scientific workflows to demonstrate the capabilities of our system for genome data analysis applications

    Revisiting the Speed-versus-Sensitivity Tradeoff in Pairwise Sequence Search

    Get PDF
    The Smith-Waterman algorithm is a dynamic programming method for determining optimal local alignments between nucleotide or protein sequences. However, it suffers from quadratic time and space complexity. As a result, many algorithmic and architectural enhancements have been proposed to solve this problem, but at the cost of reduced sensitivity in the algorithms or significant expense in hardware, respectively. Hence, there exists a need to evaluate the tradeoffs between the different solutions. This motivation, coupled with the lack of an evaluation metric to quantify these tradeoffs leads us to formally define and quantify the sensitivity of homology search methods so that tradeoffs between sequence-search solutions can be evaluated in a quantitative manner. As an example, though the BLAST algorithm executes significantly faster than Smith-Waterman, we find that BLAST misses 80% of the significant sequence alignments. This paper then presents a highly efficient parallelization of the Smith-Waterman algorithm on the Cell Broadband Engine, a novel hybrid multicore architecture that drives the PlayStation 3 (PS3) game consoles, and emulates BLAST by repeatedly executing the parallelized Smith-Waterman algorithm to search for a query in a given sequence database. Through an innovative mapping of the optimal Smith-Waterman algorithm onto a cluster of PlayStation 3 nodes, our implementation delivers a 10-fold speed-up over a high-end multicore architecture and an 88-fold speed-up over a non-accelerated PS3. Finally, we compare the performance of our implementation of the Smith-Waterman algorithm to that of BLAST and the canonical Smith-Waterman implementation, based on a combination of three factors — execution time (speed), sensitivity, and the actual cost of de-ploying each solution. In the end, our parallelized Smith-Waterman algorithm approaches the speed of BLAST while maintaining ideal sensitivity and achieving low cost through the use of PlayStation 3 game consoles

    High performance reconfigurable architectures for biological sequence alignment

    Get PDF
    Bioinformatics and computational biology (BCB) is a rapidly developing multidisciplinary field which encompasses a wide range of domains, including genomic sequence alignments. It is a fundamental tool in molecular biology in searching for homology between sequences. Sequence alignments are currently gaining close attention due to their great impact on the quality aspects of life such as facilitating early disease diagnosis, identifying the characteristics of a newly discovered sequence, and drug engineering. With the vast growth of genomic data, searching for a sequence homology over huge databases (often measured in gigabytes) is unable to produce results within a realistic time, hence the need for acceleration. Since the exponential increase of biological databases as a result of the human genome project (HGP), supercomputers and other parallel architectures such as the special purpose Very Large Scale Integration (VLSI) chip, Graphic Processing Unit (GPUs) and Field Programmable Gate Arrays (FPGAs) have become popular acceleration platforms. Nevertheless, there are always trade-off between area, speed, power, cost, development time and reusability when selecting an acceleration platform. FPGAs generally offer more flexibility, higher performance and lower overheads. However, they suffer from a relatively low level programming model as compared with off-the-shelf microprocessors such as standard microprocessors and GPUs. Due to the aforementioned limitations, the need has arisen for optimized FPGA core implementations which are crucial for this technology to become viable in high performance computing (HPC). This research proposes the use of state-of-the-art reprogrammable system-on-chip technology on FPGAs to accelerate three widely-used sequence alignment algorithms; the Smith-Waterman with affine gap penalty algorithm, the profile hidden Markov model (HMM) algorithm and the Basic Local Alignment Search Tool (BLAST) algorithm. The three novel aspects of this research are firstly that the algorithms are designed and implemented in hardware, with each core achieving the highest performance compared to the state-of-the-art. Secondly, an efficient scheduling strategy based on the double buffering technique is adopted into the hardware architectures. Here, when the alignment matrix computation task is overlapped with the PE configuration in a folded systolic array, the overall throughput of the core is significantly increased. This is due to the bound PE configuration time and the parallel PE configuration approach irrespective of the number of PEs in a systolic array. In addition, the use of only two configuration elements in the PE optimizes hardware resources and enables the scalability of PE systolic arrays without relying on restricted onboard memory resources. Finally, a new performance metric is devised, which facilitates the effective comparison of design performance between different FPGA devices and families. The normalized performance indicator (speed-up per area per process technology) takes out advantages of the area and lithography technology of any FPGA resulting in fairer comparisons. The cores have been designed using Verilog HDL and prototyped on the Alpha Data ADM-XRC-5LX card with the Virtex-5 XC5VLX110-3FF1153 FPGA. The implementation results show that the proposed architectures achieved giga cell updates per second (GCUPS) performances of 26.8, 29.5 and 24.2 respectively for the acceleration of the Smith-Waterman with affine gap penalty algorithm, the profile HMM algorithm and the BLAST algorithm. In terms of speed-up improvements, comparisons were made on performance of the designed cores against their corresponding software and the reported FPGA implementations. In the case of comparison with equivalent software execution, acceleration of the optimal alignment algorithm in hardware yielded an average speed-up of 269x as compared to the SSEARCH 35 software. For the profile HMM-based sequence alignment, the designed core achieved speed-up of 103x and 8.3x against the HMMER 2.0 and the latest version of HMMER (version 3.0) respectively. On the other hand, the implementation of the gapped BLAST with the two-hit method in hardware achieved a greater than tenfold speed-up compared to the latest NCBI BLAST software. In terms of comparison against other reported FPGA implementations, the proposed normalized performance indicator was used to evaluate the designed architectures fairly. The results showed that the first architecture achieved more than 50 percent improvement, while acceleration of the profile HMM sequence alignment in hardware gained a normalized speed-up of 1.34. In the case of the gapped BLAST with the two-hit method, the designed core achieved 11x speed-up after taking out advantages of the Virtex-5 FPGA. In addition, further analysis was conducted in terms of cost and power performances; it was noted that, the core achieved 0.46 MCUPS per dollar spent and 958.1 MCUPS per watt. This shows that FPGAs can be an attractive platform for high performance computation with advantages of smaller area footprint as well as represent economic ‘green’ solution compared to the other acceleration platforms. Higher throughput can be achieved by redeploying the cores on newer, bigger and faster FPGAs with minimal design effort

    Parallelization of dynamic programming recurrences in computational biology

    Get PDF
    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays: FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15-130x faster than a modern dual core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3 GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms

    Many-Task Computing and Blue Waters

    Full text link
    This report discusses many-task computing (MTC) generically and in the context of the proposed Blue Waters systems, which is planned to be the largest NSF-funded supercomputer when it begins production use in 2012. The aim of this report is to inform the BW project about MTC, including understanding aspects of MTC applications that can be used to characterize the domain and understanding the implications of these aspects to middleware and policies. Many MTC applications do not neatly fit the stereotypes of high-performance computing (HPC) or high-throughput computing (HTC) applications. Like HTC applications, by definition MTC applications are structured as graphs of discrete tasks, with explicit input and output dependencies forming the graph edges. However, MTC applications have significant features that distinguish them from typical HTC applications. In particular, different engineering constraints for hardware and software must be met in order to support these applications. HTC applications have traditionally run on platforms such as grids and clusters, through either workflow systems or parallel programming systems. MTC applications, in contrast, will often demand a short time to solution, may be communication intensive or data intensive, and may comprise very short tasks. Therefore, hardware and software for MTC must be engineered to support the additional communication and I/O and must minimize task dispatch overheads. The hardware of large-scale HPC systems, with its high degree of parallelism and support for intensive communication, is well suited for MTC applications. However, HPC systems often lack a dynamic resource-provisioning feature, are not ideal for task communication via the file system, and have an I/O system that is not optimized for MTC-style applications. Hence, additional software support is likely to be required to gain full benefit from the HPC hardware
    corecore