
    The Parallelism Motifs of Genomic Data Analysis

    Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns, or motifs, that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
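    As a concrete illustration of the hashing motif singled out above, the sketch below counts k-mers by hashing reads into per-worker tables and merging them; the k value, read data, and function names are illustrative assumptions rather than anything from the paper, and a distributed version would instead route each k-mer to the process that owns its hash bucket, which is where the asynchronous-update pattern appears.

```python
# Minimal sketch of the hashing motif (assumed parameters, not from the paper):
# count k-mers by hashing substrings of each read, then merge the partial tables.
from collections import Counter
from multiprocessing import Pool

K = 21  # k-mer length; a typical choice, used here only for illustration

def count_kmers(read):
    """Hash every length-K substring of one read into a local counter."""
    return Counter(read[i:i + K] for i in range(len(read) - K + 1))

def merge_counts(reads, workers=4):
    """Map reads to local hash tables, then reduce into one global table.
    A distributed version would instead send each k-mer to the rank that
    owns its hash bucket (the asynchronous-update pattern)."""
    with Pool(workers) as pool:
        partials = pool.map(count_kmers, reads)
    total = Counter()
    for part in partials:
        total.update(part)
    return total

if __name__ == "__main__":
    reads = ["ACGTACGTACGTACGTACGTACGT", "TTGCACGTACGTACGTACGTACGTTT"]
    print(merge_counts(reads).most_common(3))
```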

    Analyzing large-scale DNA Sequences on Multi-core Architectures

    Rapid analysis of DNA sequences is important for preventing the evolution of viruses and bacteria at an early phase, for early diagnosis of genetic predispositions to certain diseases (cancer, cardiovascular diseases), and for DNA forensics. However, real-world DNA sequences may comprise several gigabytes, and the process of DNA analysis demands adequate computational resources to be completed within a reasonable time. In this paper we present a scalable approach for parallel DNA analysis that is based on Finite Automata and is suitable for analyzing very large DNA segments. We evaluate our approach on real-world DNA segments of mouse (2.7 GB), cat (2.4 GB), dog (2.4 GB), chicken (1 GB), human (3.2 GB), and turkey (0.2 GB). Experimental results on a dual-socket shared-memory system with 24 physical cores show speed-ups of up to 17.6x. Our approach is up to 3x faster than a pattern-based parallel approach that uses the RE2 library. (Presented at the 18th IEEE International Conference on Computational Science and Engineering (CSE 2015), Porto, Portugal, 20-23 October 2015.)
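    As a rough sketch of the finite-automaton idea (not the authors' implementation), a DNA pattern can be compiled once into a DFA and each chunk of the sequence scanned independently, which is what makes the work easy to spread across cores; the pattern, example text, and function names below are assumptions for illustration.

```python
# Hypothetical sketch: compile a DNA pattern into a KMP-style DFA and scan one
# chunk; independent chunks could then be handed to separate cores.
ALPHABET = "ACGT"

def build_dfa(pattern):
    """dfa[state][symbol] -> next state; reaching state len(pattern) is a match."""
    m = len(pattern)
    dfa = [{c: 0 for c in ALPHABET} for _ in range(m + 1)]
    dfa[0][pattern[0]] = 1
    x = 0  # restart state for mismatches
    for j in range(1, m):
        for c in ALPHABET:
            dfa[j][c] = dfa[x][c]      # copy transitions for mismatch cases
        dfa[j][pattern[j]] = j + 1     # advance on a match
        x = dfa[x][pattern[j]]         # update restart state
    for c in ALPHABET:
        dfa[m][c] = dfa[x][c]          # keep scanning after a full match
    return dfa

def scan(chunk, dfa, m):
    """Return start positions of every match of the pattern inside one chunk."""
    state, hits = 0, []
    for i, c in enumerate(chunk):
        state = dfa[state].get(c, 0)   # unknown symbols (e.g. 'N') reset to 0
        if state == m:
            hits.append(i - m + 1)
    return hits

pattern = "GATTACA"
dfa = build_dfa(pattern)
print(scan("CCGATTACAGGGATTACA", dfa, len(pattern)))  # -> [2, 11]
```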

    High Performance Implementation of Planted Motif Problem using Suffix trees

    In this paper we present a high-performance implementation of a suffix-tree-based solution to the planted motif problem on two different parallel architectures: NVIDIA GPUs and Intel multicore machines. The (l,d) planted motif problem (PMP) is defined as follows: given a set of n DNA sequences, each of length L, find M, the set of sequences (or motifs) of length l that have at least one d-neighbor in each of the n sequences. Here, a d-neighbor of a sequence is a sequence of the same length that differs in at most d positions. PMP is a well-studied problem in computational biology. It is useful in developing methods for finding transcription factor binding sites, for sequence classification, and for building phylogenetic trees. The problem is computationally challenging to solve; for example, a (19,7) PMP instance takes 9.9 hours on a sequential machine. Many approaches to the planted motif problem can be found in the literature. One approach is based on the suffix tree data structure. Though suffix-tree-based methods are the most efficient for solving large planted motif problems on sequential machines, they are quite difficult to parallelize. We present suffix-tree-based parallel solutions for PMP on NVIDIA GPU and Intel multicore architectures that are efficient and scalable. The solutions are based on a previously presented suffix tree algorithm but are extensively adapted to each architecture to ensure that the implementations work efficiently and scale well.
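    To make the (l,d) definition concrete, the brute-force checker below enumerates all 4^l candidate motifs and tests the d-neighbor condition directly; it is purely illustrative (the paper's solution uses suffix trees, not enumeration) and is only feasible for tiny l.

```python
# Illustrative brute-force (l,d) planted motif search; exponential in l,
# so usable only on toy inputs. The example sequences below are made up.
from itertools import product

def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def has_d_neighbor(motif, sequence, d):
    """True if some length-l window of `sequence` is a d-neighbor of `motif`."""
    l = len(motif)
    return any(hamming(motif, sequence[i:i + l]) <= d
               for i in range(len(sequence) - l + 1))

def planted_motifs(sequences, l, d):
    """Return every length-l string that has a d-neighbor in *every* sequence."""
    return ["".join(m) for m in product("ACGT", repeat=l)
            if all(has_d_neighbor("".join(m), s, d) for s in sequences)]

seqs = ["ACGTTGCA", "TTACGTGG", "GGACGTAT"]
print(planted_motifs(seqs, l=4, d=1))
```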

    High Performance Frequent Subgraph Mining on Transactional Datasets

    Graph data mining has become a crucial and indispensable area of research. Large amounts of graph data are produced in many domains, such as bioinformatics, cheminformatics, social networks, and the Web. Scalable graph mining methods are becoming increasingly necessary as graph complexity grows. Frequent subgraph mining is one such area, where the task is to find frequently recurring patterns (subgraphs). To tackle this problem, many main-memory-based methods were proposed, but they proved inefficient as data sizes grew. In the past few years, several research groups have approached the frequent subgraph mining (FSM) problem in multiple ways. Many authors have tried to achieve better performance using Graphics Processing Units (GPUs), which offer multi-fold improvements over in-memory methods when dealing with large datasets. Later, Google's MapReduce model with the Hadoop framework proved to be a major breakthrough in high-performance large-scale batch processing. Although MapReduce brought many benefits, its disk I/O and non-iterative model are a poor fit for FSM, since subgraph mining is an iterative process. In recent years, Spark has emerged as the de facto industry standard with its distributed in-memory computing capability, which also makes it a good fit for iterative algorithms. In this work, we cover how high-performance computing dramatically improves performance for FSM on transactional datasets of both directed and undirected graphs, and we compare various FSM techniques based on experimental results.
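    As a minimal sketch of why an in-memory engine suits this workload, the PySpark snippet below caches a tiny transactional graph dataset and counts the support of labeled edges, the size-one candidates that an FSM algorithm would grow level by level; the dataset, support threshold, and local-mode setup are assumptions for illustration, not the paper's code.

```python
# Assumed setup: a local PySpark install; the transactions and threshold are toy
# values. A full FSM job would repeat this count for ever-larger candidate
# subgraphs, which is why keeping the dataset cached in memory pays off.
from operator import add
from pyspark import SparkContext

MIN_SUPPORT = 2  # illustrative minimum support

# Each transaction is a small labeled graph given as a set of labeled edges.
transactions = [
    {("C", "single", "O"), ("C", "double", "O")},
    {("C", "single", "O"), ("C", "single", "N")},
    {("C", "double", "O"), ("C", "single", "N")},
]

sc = SparkContext(master="local[*]", appName="fsm-edge-sketch")
graphs = sc.parallelize(transactions).cache()  # reused by every mining iteration

frequent_edges = (
    graphs.flatMap(lambda g: [(edge, 1) for edge in g])  # one vote per graph
          .reduceByKey(add)                              # support counting
          .filter(lambda kv: kv[1] >= MIN_SUPPORT)
          .collect()
)
print(frequent_edges)
sc.stop()
```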