High Performance Implementation of Planted Motif Problem using Suffix trees

In this paper we present a high-performance implementation of a suffix tree based solution to the planted motif problem on two different parallel architectures: NVIDIA GPU and Intel multicore machines. An (l,d) planted motif problem (PMP) is defined as follows: given a set of n DNA sequences, each of length L, find M, the set of sequences (or motifs) of length l that have at least one d-neighbor in each of the n sequences. Here, a d-neighbor of a sequence is a sequence of the same length that differs in at most d positions. PMP is a well-studied problem in computational biology. It is useful in developing methods for finding transcription factor binding sites, for sequence classification, and for building phylogenetic trees. The problem is computationally challenging to solve; for example, a (19,7) PMP takes 9.9 hours on a sequential machine. Many approaches to solving the planted motif problem can be found in the literature. One approach is based on the suffix tree data structure. Though suffix tree based methods are the most efficient ones for solving large planted motif problems on sequential machines, they are quite difficult to parallelize. We present suffix tree based parallel solutions for PMP on NVIDIA GPU and Intel multicore architectures that are efficient and scalable. The solutions are based on a previously presented suffix tree algorithm but use extensive adaptation to the individual architectures to ensure that the implementations work efficiently and scale well.
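The (l,d) definition in the abstract above can be made concrete with a small brute-force sketch. This is not the suffix tree method the paper describes; it simply enumerates every candidate of length l over an assumed DNA alphabet and keeps those with a d-neighbor in every sequence. The function names (`hamming`, `has_d_neighbor`, `planted_motifs`) are hypothetical, and the approach is only practical for very small l.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def has_d_neighbor(candidate, sequence, d):
    """True if some window of the sequence differs from the candidate in at most d positions."""
    l = len(candidate)
    return any(
        hamming(candidate, sequence[i:i + l]) <= d
        for i in range(len(sequence) - l + 1)
    )


def planted_motifs(sequences, l, d, alphabet="ACGT"):
    """Brute-force (l,d) motif set: exponential in l, for illustration only."""
    motifs = set()
    for letters in product(alphabet, repeat=l):
        candidate = "".join(letters)
        if all(has_d_neighbor(candidate, s, d) for s in sequences):
            motifs.add(candidate)
    return motifs


if __name__ == "__main__":
    # Made-up toy input, not data from the paper.
    toy = ["ACGTAC", "TTACGA", "GACGTT"]
    print(sorted(planted_motifs(toy, l=3, d=0)))
```

The candidate space has 4^l entries, which is why exact solvers for instances such as (19,7) rely on pruning structures like suffix trees rather than plain enumeration.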

CiteSeerX

Parallel random projection using R high performance computing for planted motif search

Author: Dhiba Tyas Farrah
Fahsi Mahmoud
Hidayat Topik
Riza Lala Septem
Setiawan Wawan
Publication venue: 'Universitas Ahmad Dahlan'
Publication date: 01/06/2019

Motif discovery in DNA sequences is one of the most important issues in bioinformatics, so algorithms that solve the problem accurately and quickly have long been a goal of bioinformatics research. This study modifies the random projection algorithm so that it can be implemented with R high performance computing (i.e., the R package pbdMPI). Several steps are needed to achieve this objective: preprocessing the data, splitting the data according to the number of batches, modifying and implementing random projection in the pbdMPI package, and then aggregating the results. To validate the proposed approach, experiments were conducted on several benchmark data sets, with a sensitivity analysis on the number of cores and batches. Experimental results show that computational cost can be reduced: with 6 cores, computation is around 34 times faster than in standalone mode. Thus, the proposed approach can be used for motif discovery effectively and efficiently.

Journal of Education and Learning (EduLearn)

TELKOMNIKA (Telecommunication Computing Electronics and Control)

UAD Journal Management System

WordSeeker: concurrent bioinformatics software for discovering genome-wide patterns and word-based genomic signatures

Author: Al-ouran Rami
Bitterman Thomas
Drews Frank
Ecker Klaus
Elnitski Laura
Jacox Edwin
Kurz Kyle
Lee Stephen Sauchi
Liang Xiaoyu
Lichtenberg Jens
Nau Lee J
Neiman Lev
Welch Joshua D
Welch Lonnie R
Publication venue: BioMed Central
Publication date: 01/01/2011

Springer - Publisher Connector

PubMed Central

qPMS Sigma -- An Efficient and Exact Parallel Algorithm for the Planted $(l, d)$ Motif Search Problem

Author: Dhar Saurav
Goswami Dhiman
Mia Md. Abul Kashem
Saha Amlan
Publication date: 01/03/2024

Motif finding is an important step for the detection of rare events occurring in a set of DNA or protein sequences. Extraction of information about these rare events can lead to new biological discoveries. Motifs are some important patterns that have numerous applications including the identification of transcription factors and their binding sites, composite regulatory patterns, similarity between families of proteins, etc. Although several flavors of motif searching algorithms have been studied in the literature, we study the version known as $(l, d)$-motif search or Planted Motif Search (PMS). In PMS, given two integers $l$, $d$ and $n$ input sequences we try to find all the patterns of length $l$ that appear in each of the $n$ input sequences with at most $d$ mismatches. We also discuss the quorum version of PMS in our work that finds motifs that are not planted in all the input sequences but at least in $q$ of the sequences. Our algorithm is mainly based on the algorithms qPMSPrune, qPMS7, TraverStringRef and PMS8. We introduce some techniques to compress the input strings and make faster comparison between strings with bitwise operations. Our algorithm performs a little better than the existing exact algorithms to solve the qPMS problem in DNA sequence. We have also proposed an idea for parallel implementation of our algorithm.
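The abstract mentions compressing input strings and comparing them with bitwise operations but does not give the encoding. As a minimal sketch of one common way to do this, assuming a 2-bit code per DNA base (the `CODE` table and both function names are hypothetical, not taken from the paper), mismatches can be counted with an XOR followed by a population count:

```python
# Assumed 2-bit encoding; the paper's actual compression scheme may differ.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def pack(seq):
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def mismatches(x, y, length):
    """Count differing bases between two packed strings of `length` bases (length >= 1)."""
    diff = x ^ y
    # Mask selecting the low bit of every 2-bit pair: 0b0101...01.
    low_mask = int("01" * length, 2)
    # A base differs if either of its two bits differs, so fold the
    # high bit of each pair onto the low bit before counting.
    folded = (diff | (diff >> 1)) & low_mask
    return bin(folded).count("1")


if __name__ == "__main__":
    print(mismatches(pack("ACGT"), pack("AGGT"), 4))
```

Packing lets one integer comparison cover many bases at once, which is the general reason bitwise comparison is faster than a character-by-character loop.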

arXiv.org e-Print Archive

Scalable Scientific Computing Algorithms Using MapReduce

Author: Xiang Jingen
Publication venue: 'University of Waterloo'
Publication date: 01/01/2013

Cloud computing systems, like MapReduce and Pregel, provide a scalable and fault tolerant environment for running computations at massive scale. However, these systems are designed primarily for data intensive computational tasks, while a large class of problems in scientific computing and business analytics are computationally intensive (i.e., they require a lot of CPU in addition to I/O). In this thesis, we investigate the use of cloud computing systems, in particular MapReduce, for computationally intensive problems, focusing on two classic problems that arise in scientific computing and also in analytics: maximum clique and matrix inversion. The key contribution that enables us to effectively use MapReduce to solve the maximum clique problem on dense graphs is a recursive partitioning method that partitions the graph into several subgraphs of similar size and running time complexity. After partitioning, the maximum cliques of the different partitions can be computed independently, and the computation is sped up using a branch and bound method. Our experiments show that our approach leads to good scalability, which is unachievable by other partitioning methods since they result in partitions of different sizes and hence lead to load imbalance. Our method is more scalable than an MPI algorithm, and is simpler and more fault tolerant. For the matrix inversion problem, we show that a recursive block LU decomposition allows us to effectively compute in parallel both the lower triangular (L) and upper triangular (U) matrices using MapReduce. After computing the L and U matrices, their inverses are computed using MapReduce. The inverse of the original matrix, which is the product of the inverses of the L and U matrices, is also obtained using MapReduce. Our technique is the first matrix inversion technique that uses MapReduce. We show experimentally that our technique has good scalability, and it is simpler and more fault tolerant than MPI implementations such as ScaLAPACK.
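The matrix inversion step in the abstract above rests on the identity that if A = LU, then the inverse of A is the inverse of U times the inverse of L. The sketch below illustrates only that identity on a single machine. It is a plain, non-recursive, non-blocked Doolittle factorization without pivoting, so it assumes every pivot is nonzero and is not the MapReduce technique of the thesis. All function names are hypothetical.

```python
def lu_decompose(a):
    """Doolittle LU without pivoting: a = lower @ upper, lower has a unit diagonal."""
    n = len(a)
    lower = [[float(i == j) for j in range(n)] for i in range(n)]
    upper = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            upper[i][j] = a[i][j] - sum(lower[i][k] * upper[k][j] for k in range(i))
        for j in range(i + 1, n):
            partial = sum(lower[j][k] * upper[k][i] for k in range(i))
            lower[j][i] = (a[j][i] - partial) / upper[i][i]  # assumes nonzero pivot
    return lower, upper


def invert_lower(low):
    """Invert a lower triangular matrix by forward substitution, column by column."""
    n = len(low)
    inv = [[0.0] * n for _ in range(n)]
    for col in range(n):
        for row in range(col, n):
            rhs = 1.0 if row == col else 0.0
            acc = rhs - sum(low[row][k] * inv[k][col] for k in range(col, row))
            inv[row][col] = acc / low[row][row]
    return inv


def transpose(m):
    return [list(r) for r in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def invert(a):
    """Inverse of a via LU: inv(a) = inv(upper) @ inv(lower)."""
    lower, upper = lu_decompose(a)
    inv_l = invert_lower(lower)
    # The transpose of an upper triangular matrix is lower triangular.
    inv_u = transpose(invert_lower(transpose(upper)))
    return matmul(inv_u, inv_l)


if __name__ == "__main__":
    # Made-up 2x2 example; the product should be close to the identity.
    a = [[4.0, 3.0], [6.0, 3.0]]
    print(matmul(a, invert(a)))
```

Because the two triangular inverses are independent of each other, they are natural candidates for parallel computation, which is consistent with the abstract's description of computing them as separate MapReduce steps.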

University of Waterloo's Institutional Repository

Understanding host-microbe interactions in maize kernel and sweetpotato leaf metagenomic profiles.

Author: Adams Alison K
Publication venue: TRACE: Tennessee Research and Creative Exchange
Publication date: 01/05/2023

Functional and quantitative metagenomic profiling remains challenging and limits our understanding of host-microbe interactions. This body of work aims to mitigate these challenges by using a novel quantitative reduced representation sequencing strategy (OmeSeq-qRRS), developing fully automated software for quantitative metagenomic/microbiome profiling (Qmatey: quantitative metagenomic alignment and taxonomic identification using exact-matching), and implementing these tools for understanding plant-microbe-pathogen interactions in maize and sweetpotato. The next generation sequencing-based OmeSeq-qRRS leverages the strengths of shotgun whole genome sequencing and costs less than the more affordable amplicon sequencing method. The novel FASTQ data compression/indexing and enhanced multithreading of MegaBLAST in Qmatey allow for computational speeds several thousand-fold faster than typical runs. Regardless of sample number, the analytical pipeline can be completed within days for genome-wide sequence data and provides broad-spectrum taxonomic profiling (virus to eukaryotes). As a proof of concept, these protocols and novel analytical pipelines were implemented to characterize the viruses within the leaf microbiome of a sweetpotato population that represents the global genetic diversity and the kernel microbiomes of genetically modified (GMO) and non-GMO maize hybrids. The metagenome profiles and high-density SNP data were integrated to identify host genetic factors (disease resistance and intracellular transport candidate genes) that underpin sweetpotato-virus interactions. Additionally, microbial community dynamics were observed in the presence of pathogens, leading to the identification of multipartite interactions that modulate disease severity through co-infection and species competition.
This study highlights a low-cost, quantitative and strain/species-level metagenomic profiling approach, new tools that complement the assay’s novel features and provide fast computation, and the potential for advancing functional metagenomic studies.

University of Tennessee, Knoxville: Trace