17,412 research outputs found
Optimizing Splicing Junction Detection in Next Generation Sequencing Data on a Virtual-GRID Infrastructure
The new protocol for sequencing the messenger RNA in a cell, named RNA-seq produce millions of short sequence fragments. Next Generation Sequencing technology allows more accurate analysis but increase needs in term of computational resources. This paper describes the optimization of a RNA-seq analysis pipeline devoted to splicing variants detection, aimed at reducing computation time and providing a multi-user/multisample environment. This work brings two main contributions. First, we optimized a well-known algorithm called TopHat by parallelizing some sequential mapping steps. Second, we designed and implemented a hybrid virtual GRID infrastructure allowing to efficiently execute multiple instances of TopHat running on different samples or on behalf of different users, thus optimizing the overall execution time and enabling a flexible multi-user environmen
Parallel Construction of Wavelet Trees on Multicore Architectures
The wavelet tree has become a very useful data structure to efficiently
represent and query large volumes of data in many different domains, from
bioinformatics to geographic information systems. One problem with wavelet
trees is their construction time. In this paper, we introduce two algorithms
that reduce the time complexity of a wavelet tree's construction by taking
advantage of nowadays ubiquitous multicore machines.
Our first algorithm constructs all the levels of the wavelet in parallel in
time and bits of working space, where
is the size of the input sequence and is the size of the alphabet. Our
second algorithm constructs the wavelet tree in a domain-decomposition fashion,
using our first algorithm in each segment, reaching time and
bits of extra space, where is the
number of available cores. Both algorithms are practical and report good
speedup for large real datasets.Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Sk{\l}odowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
Parallel String Matching with Multi Core Processors-A Comparative Study for Gene Sequences
The increase in huge amount of data is seen clearly in present days because of requirement for storing more information. To extract certain data from this large database is a very difficult task, including text processing, information retrieval, text mining, pattern recognition and DNA sequencing. So we need concurrent events and high performance computing models for extracting the data. This will create a challenge to the researchers. One of the solutions is parallel algorithms for string matching on computing models. In this we implemented parallel string matching with JAVA Multi threading with multi core processing, and performed a comparative study on Knuth Morris Pratt, Boyer Moore and Brute force string matching algorithms. For testing our system we take a gene sequence which consists of lacks of records. From the test results it is shown that the multicore processing is better compared to lower versions. Finally this proposed parallel string matching with multicore processing is better compared to other sequential approaches
High-Throughput SNP Genotyping by SBE/SBH
Despite much progress over the past decade, current Single Nucleotide
Polymorphism (SNP) genotyping technologies still offer an insufficient degree
of multiplexing when required to handle user-selected sets of SNPs. In this
paper we propose a new genotyping assay architecture combining multiplexed
solution-phase single-base extension (SBE) reactions with sequencing by
hybridization (SBH) using universal DNA arrays such as all -mer arrays. In
addition to PCR amplification of genomic DNA, SNP genotyping using SBE/SBH
assays involves the following steps: (1) Synthesizing primers complementing the
genomic sequence immediately preceding SNPs of interest; (2) Hybridizing these
primers with the genomic DNA; (3) Extending each primer by a single base using
polymerase enzyme and dideoxynucleotides labeled with 4 different fluorescent
dyes; and finally (4) Hybridizing extended primers to a universal DNA array and
determining the identity of the bases that extend each primer by hybridization
pattern analysis. Our contributions include a study of multiplexing algorithms
for SBE/SBH genotyping assays and preliminary experimental results showing the
achievable tradeoffs between the number of array probes and primer length on
one hand and the number of SNPs that can be assayed simultaneously on the
other. Simulation results on datasets both randomly generated and extracted
from the NCBI dbSNP database suggest that the SBE/SBH architecture provides a
flexible and cost-effective alternative to genotyping assays currently used in
the industry, enabling genotyping of up to hundreds of thousands of
user-specified SNPs per assay.Comment: 19 page
Analyzing large-scale DNA Sequences on Multi-core Architectures
Rapid analysis of DNA sequences is important in preventing the evolution of
different viruses and bacteria during an early phase, early diagnosis of
genetic predispositions to certain diseases (cancer, cardiovascular diseases),
and in DNA forensics. However, real-world DNA sequences may comprise several
Gigabytes and the process of DNA analysis demands adequate computational
resources to be completed within a reasonable time. In this paper we present a
scalable approach for parallel DNA analysis that is based on Finite Automata,
and which is suitable for analyzing very large DNA segments. We evaluate our
approach for real-world DNA segments of mouse (2.7GB), cat (2.4GB), dog
(2.4GB), chicken (1GB), human (3.2GB) and turkey (0.2GB). Experimental results
on a dual-socket shared-memory system with 24 physical cores show speed-ups of
up to 17.6x. Our approach is up to 3x faster than a pattern-based parallel
approach that uses the RE2 library.Comment: The 18th IEEE International Conference on Computational Science and
Engineering (CSE 2015), Porto, Portugal, 20 - 23 October 201
BioWorkbench: A High-Performance Framework for Managing and Analyzing Bioinformatics Experiments
Advances in sequencing techniques have led to exponential growth in
biological data, demanding the development of large-scale bioinformatics
experiments. Because these experiments are computation- and data-intensive,
they require high-performance computing (HPC) techniques and can benefit from
specialized technologies such as Scientific Workflow Management Systems (SWfMS)
and databases. In this work, we present BioWorkbench, a framework for managing
and analyzing bioinformatics experiments. This framework automatically collects
provenance data, including both performance data from workflow execution and
data from the scientific domain of the workflow application. Provenance data
can be analyzed through a web application that abstracts a set of queries to
the provenance database, simplifying access to provenance information. We
evaluate BioWorkbench using three case studies: SwiftPhylo, a phylogenetic tree
assembly workflow; SwiftGECKO, a comparative genomics workflow; and RASflow, a
RASopathy analysis workflow. We analyze each workflow from both computational
and scientific domain perspectives, by using queries to a provenance and
annotation database. Some of these queries are available as a pre-built feature
of the BioWorkbench web application. Through the provenance data, we show that
the framework is scalable and achieves high-performance, reducing up to 98% of
the case studies execution time. We also show how the application of machine
learning techniques can enrich the analysis process
- …