Search CORE

1,150 research outputs found

Efficient motif finding algorithms for large-alphabet inputs

Author: Kuksa Pavel P
Pavlovic Vladimir
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract Background We consider the problem of identifying motifs, recurring or conserved patterns, in the biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings. Results The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of alphabet. On a synthetic planted DNA motif finding problem our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA) we observed reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin, Zinc metallopeptidase, and supersecondary structure motifs for Cadherin and Immunoglobin families. Conclusions Our algorithm reduces computational complexity of the current motif finding algorithms and demonstrate strong running time improvements over existing exact algorithms, especially in important and difficult cases of large-alphabet sequences.</p

CiteSeerX

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Efficient Algorithms for the Closest Pair Problem and Applications

Author: Pathak Sudipta
Rajasekaran Sanguthevar
Publication venue
Publication date: 21/07/2014
Field of study

The closest pair problem (CPP) is one of the well studied and fundamental problems in computing. Given a set of points in a metric space, the problem is to identify the pair of closest points. Another closely related problem is the fixed radius nearest neighbors problem (FRNNP). Given a set of points and a radius

R

, the problem is, for every input point

p

, to identify all the other input points that are within a distance of

R

from

p

. A naive deterministic algorithm can solve these problems in quadratic time. CPP as well as FRNNP play a vital role in computational biology, computational finance, share market analysis, weather prediction, entomology, electro cardiograph, N-body simulations, molecular simulations, etc. As a result, any improvements made in solving CPP and FRNNP will have immediate implications for the solution of numerous problems in these domains. We live in an era of big data and processing these data take large amounts of time. Speeding up data processing algorithms is thus much more essential now than ever before. In this paper we present algorithms for CPP and FRNNP that improve (in theory and/or practice) the best-known algorithms reported in the literature for CPP and FRNNP. These algorithms also improve the best-known algorithms for related applications including time series motif mining and the two locus problem in Genome Wide Association Studies (GWAS)

arXiv.org e-Print Archive

CiteSeerX

Mining Heterogeneous Multivariate Time-Series for Learning Meaningful Patterns: Application to Home Health Telecare

Author: Duchene Florence
Garbay Catherine
Rialle Vincent
Publication venue
Publication date: 25/11/2004
Field of study

For the last years, time-series mining has become a challenging issue for researchers. An important application lies in most monitoring purposes, which require analyzing large sets of time-series for learning usual patterns. Any deviation from this learned profile is then considered as an unexpected situation. Moreover, complex applications may involve the temporal study of several heterogeneous parameters. In that paper, we propose a method for mining heterogeneous multivariate time-series for learning meaningful patterns. The proposed approach allows for mixed time-series -- containing both pattern and non-pattern data -- such as for imprecise matches, outliers, stretching and global translating of patterns instances in time. We present the early results of our approach in the context of monitoring the health status of a person at home. The purpose is to build a behavioral profile of a person by analyzing the time variations of several quantitative or qualitative parameters recorded through a provision of sensors installed in the home

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

On the String Consensus Problem and the Manhattan Sequence Consensus Problem

Author: A. Amir
A. Amir
A. Frank
B. Ma
C. Boucher
G.D. Cohen
H.W. Lenstra Jr.
J. Gramm
J.J. Sylvester
K. Fischer
M. Frances
R. Kannan
R.L. Graham
Publication venue
Publication date: 01/01/2014
Field of study

In the Manhattan Sequence Consensus problem (MSC problem) we are given

k

integer sequences, each of length

l

, and we are to find an integer sequence

x

of length

l

(called a consensus sequence), such that the maximum Manhattan distance of

x

from each of the input sequences is minimized. For binary sequences Manhattan distance coincides with Hamming distance, hence in this case the string consensus problem (also called string center problem or closest string problem) is a special case of MSC. Our main result is a practically efficient

O(l)

-time algorithm solving MSC for

k\le 5

sequences. Practicality of our algorithms has been verified experimentally. It improves upon the quadratic algorithm by Amir et al.\ (SPIRE 2012) for string consensus problem for

k=5

binary strings. Similarly as in Amir's algorithm we use a column-based framework. We replace the implied general integer linear programming by its easy special cases, due to combinatorial properties of the MSC for

k\le 5

. We also show that for a general parameter

k

any instance can be reduced in linear time to a kernel of size

k!

, so the problem is fixed-parameter tractable. Nevertheless, for

k\ge 4

this is still too large for any naive solution to be feasible in practice.Comment: accepted to SPIRE 201

arXiv.org e-Print Archive

Crossref

RefSelect: a reference sequence selection algorithm for planted (l, d) motif search

Author: Feng Dazheng
Huan Jun
Huo Hongwei
Vitter Jeffrey Scott
Yu Qiang
Zhao Ruixing
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 19/07/2016
Field of study

Background The planted (l, d) motif search (PMS) is an important yet challenging problem in computational biology. Pattern-driven PMS algorithms usually use k out of t input sequences as reference sequences to generate candidate motifs, and they can find all the (l, d) motifs in the input sequences. However, most of them simply take the first k sequences in the input as reference sequences without elaborate selection processes, and thus they may exhibit sharp fluctuations in running time, especially for large alphabets. Results In this paper, we build the reference sequence selection problem and propose a method named RefSelect to quickly solve it by evaluating the number of candidate motifs for the reference sequences. RefSelect can bring a practical time improvement of the state-of-the-art pattern-driven PMS algorithms. Experimental results show that RefSelect (1) makes the tested algorithms solve the PMS problem steadily in an efficient way, (2) particularly, makes them achieve a speedup of up to about 100× on the protein data, and (3) is also suitable for large data sets which contain hundreds or more sequences. Conclusions The proposed algorithm RefSelect can be used to solve the problem that many pattern-driven PMS algorithms present execution time instability. RefSelect requires a small amount of storage space and is capable of selecting reference sequences efficiently and effectively. Also, the parallel version of RefSelect is provided for handling large data sets

KU ScholarWorks

PubMed Central

Recommended from our members

A multiprocessor parallel approach to bit-parallel approximate string matching

Author: Chibli Elias Anwar
Publication venue: CSUSB ScholarWorks
Publication date: 01/01/2008
Field of study

The purpose of this project is to present with empirical results that a parallel design with the use of multiple processors can be successfully applied along with bit-parallel approximate string matching algorithms to solve practical bioinformatics problems. It will demonstrate that nearly optimal speedup can be achieved with a cluster of between two and eight workstations using MPI (Message Passing Interface), directly decreasing the total latency required to perform a string matching problem

CSUSB ScholarWorks

Protein Repeats from First Principles

Author: Becher Veronica Andrea
Espada Rocío
Ferreiro Diego
Parra Rodrigo Gonzalo
Turjanski Pablo Guillermo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/04/2016
Field of study

Some natural proteins display recurrent structural patterns. Despite being highly similar at the tertiary structure level, repeating patterns within a single repeat protein can be extremely variable at the sequence level. We use a mathematical definition of a repetition and investigate the occurrences of these in sequences of different protein families. We found that long stretches of perfect repetitions are infrequent in individual natural proteins, even for those which are known to fold into structures of recurrent structural motifs. We found that natural repeat proteins are indeed repetitive in their families, exhibiting abundant stretches of 6 amino acids or longer that are perfect repetitions in the reference family. We provide a systematic quantification for this repetitiveness. We show that this form of repetitiveness is not exclusive of repeat proteins, but also occurs in globular domains. A by-product of this work is a fast quantification of the likelihood of a protein to belong to a family.Fil: Turjanski, Pablo Guillermo. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; ArgentinaFil: Parra, Rodrigo Gonzalo. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales; ArgentinaFil: Espada, Rocío. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales; ArgentinaFil: Becher, Veronica Andrea. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; ArgentinaFil: Ferreiro, Diego. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Química Biológica de la Facultad de Ciencias Exactas y Naturales; Argentin

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

CONICET Digital

qPMS Sigma -- An Efficient and Exact Parallel Algorithm for the Planted $(l, d)$ Motif Search Problem

Author: Dhar Saurav
Goswami Dhiman
Mia Md. Abul Kashem
Saha Amlan
Publication venue
Publication date: 01/03/2024
Field of study

Motif finding is an important step for the detection of rare events occurring in a set of DNA or protein sequences. Extraction of information about these rare events can lead to new biological discoveries. Motifs are some important patterns that have numerous applications including the identification of transcription factors and their binding sites, composite regulatory patterns, similarity between families of proteins, etc. Although several flavors of motif searching algorithms have been studied in the literature, we study the version known as

(l, d)

-motif search or Planted Motif Search (PMS). In PMS, given two integers

l

d

and

n

input sequences we try to find all the patterns of length

l

that appear in each of the

n

input sequences with at most

d

mismatches. We also discuss the quorum version of PMS in our work that finds motifs that are not planted in all the input sequences but at least in

q

of the sequences. Our algorithm is mainly based on the algorithms qPMSPrune, qPMS7, TraverStringRef and PMS8. We introduce some techniques to compress the input strings and make faster comparison between strings with bitwise operations. Our algorithm performs a little better than the existing exact algorithms to solve the qPMS problem in DNA sequence. We have also proposed an idea for parallel implementation of our algorithm

arXiv.org e-Print Archive