Separating sets of strings by finding matching patterns is almost always hard

Author: Lancia G
Mathieson L
Moscato P
Publication venue: 'Elsevier BV'
Publication date: 18/12/2016

© 2017 Elsevier B.V. We study the complexity of the problem of searching for a set of patterns that separate two given sets of strings. This problem has applications in a wide variety of areas, most notably in data mining, computational biology, and in understanding the complexity of genetic algorithms. We show that the basic problem of finding a small set of patterns that match one set of strings but do not match any string in a second set is difficult (NP-complete, W[2]-hard when parameterized by the size of the pattern set, and APX-hard). We then perform a detailed parameterized analysis of the problem, separating tractable and intractable variants. In particular, we show that parameterizing by the size of the pattern set and the number of strings, or by the size of the alphabet and the number of strings, gives FPT results, amongst others.
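To make the separation condition in this abstract concrete, here is a minimal sketch of a verifier, not the authors' formulation. It assumes patterns are strings of the same length as the inputs, that a hypothetical wildcard character `*` matches any symbol, and that "match one set" means every string in the first set is matched by at least one pattern. The names `matches` and `separates` are illustrative only. Checking a candidate pattern set is easy; the hardness results concern finding a small one.

```python
def matches(pattern: str, s: str, wildcard: str = "*") -> bool:
    """True if pattern and s have equal length and agree at every non-wildcard position."""
    if len(pattern) != len(s):
        return False
    return all(p == wildcard or p == c for p, c in zip(pattern, s))


def separates(patterns: list[str], good: list[str], bad: list[str]) -> bool:
    """Assumed separation condition: every 'good' string is matched by some pattern,
    and no 'bad' string is matched by any pattern."""
    covers_good = all(any(matches(p, g) for p in patterns) for g in good)
    avoids_bad = not any(matches(p, b) for p in patterns for b in bad)
    return covers_good and avoids_bad


if __name__ == "__main__":
    good = ["0101", "0111"]
    bad = ["1101", "0001"]
    print(separates(["01*1"], good, bad))  # one pattern is enough for this toy instance
    print(separates(["***1"], good, bad))  # too general: it also matches the bad strings
```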

arXiv.org e-Print Archive

University of Newcastle's Digital Repository

Archivio istituzionale della ricerca - Università degli Studi di Udine

OPUS - University of Technology Sydney

Closest string with outliers

Author: A Ben-Dor
B Ma
Bin Ma
Christina Boucher
D Marx
G Pavesi
J Dopazo
J Gramm
J Lanctot
K Lucas
L Wang
M Fellows
M Frances
M Garey
M Li
M Tompa
P Pevzner
R Downey
R Zhao
V Proutski
W Lenstra
X Deng
ZZ Chen
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Given n strings s1, …, sn each of length ℓ and a nonnegative integer d, the CLOSEST STRING problem asks to find a center string s such that none of the input strings has Hamming distance greater than d from s. Finding a common pattern in many – but not necessarily all – input strings is an important task that plays a role in many applications in bioinformatics. Results: Although the closest string model is robust to the oversampling of strings in the input, it is severely affected by the existence of outliers. We propose a refined model, the CLOSEST STRING WITH OUTLIERS (CSWO) problem, to overcome this limitation. This new model asks for a center string s that is within Hamming distance d of at least n – k of the n input strings, where k is a parameter describing the maximum number of outliers. A CSWO solution not only provides the center string as a representative for the set of strings but also reveals the outliers of the set. We provide fixed parameter algorithms for CSWO when d and k are parameters, for both bounded and unbounded alphabets. We also show that when the alphabet is unbounded the problem is W[1]-hard with respect to n – k, ℓ, and d. Conclusions: Our refined model abstractly models finding common patterns in several but not all input strings.
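The CSWO definition in this abstract can be illustrated with a short sketch. It checks whether a candidate center is within Hamming distance d of at least n – k input strings, and includes an exhaustive search that is exponential in ℓ and suitable only for toy inputs. This is not the fixed-parameter algorithm the paper describes; the function names and the DNA alphabet are assumptions made for illustration.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


def outliers(center: str, strings: list[str], d: int) -> list[str]:
    """Input strings farther than d from the candidate center."""
    return [s for s in strings if hamming(center, s) > d]


def is_cswo_solution(center: str, strings: list[str], d: int, k: int) -> bool:
    """True if at most k strings are outliers, i.e. at least n - k are within d."""
    return len(outliers(center, strings, d)) <= k


def brute_force_cswo(strings: list[str], d: int, k: int, alphabet: str = "ACGT"):
    """Try every string over the (assumed) alphabet; return a valid center or None."""
    length = len(strings[0])
    for candidate in product(alphabet, repeat=length):
        center = "".join(candidate)
        if is_cswo_solution(center, strings, d, k):
            return center
    return None


if __name__ == "__main__":
    data = ["ACGT", "ACGA", "ACCT", "TTTT"]
    print(outliers("ACGT", data, d=1))        # the strings the model would flag
    print(brute_force_cswo(data, d=1, k=1))   # some center tolerating one outlier
    print(brute_force_cswo(data, d=1, k=0))   # no center exists without outliers
```

With k = 0 the check reduces to plain CLOSEST STRING, which is why the toy instance has no solution until one outlier is allowed.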

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Efficient motif finding algorithms for large-alphabet inputs

Author: Kuksa Pavel P
Pavlovic Vladimir
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: We consider the problem of identifying motifs, recurring or conserved patterns, in biological sequence data sets. To solve this task, we present a new deterministic algorithm for finding patterns that are embedded as exact or inexact instances in all or most of the input strings. Results: The proposed algorithm (1) improves search efficiency compared to existing algorithms, and (2) scales well with the size of the alphabet. On a synthetic planted DNA motif finding problem our algorithm is over 10× more efficient than MITRA, PMSPrune, and RISOTTO for long motifs. Improvements are orders of magnitude higher in the same setting with large alphabets. On benchmark TF-binding site problems (FNP, CRP, LexA) we observed a reduction in running time of over 12×, with high detection accuracy. The algorithm was also successful in rapidly identifying protein motifs in Lipocalin, Zinc metallopeptidase, and supersecondary structure motifs for Cadherin and Immunoglobulin families. Conclusions: Our algorithm reduces the computational complexity of current motif finding algorithms and demonstrates strong running time improvements over existing exact algorithms, especially in important and difficult cases of large-alphabet sequences.
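For readers unfamiliar with the planted motif setting mentioned in this abstract, the sketch below enumerates every candidate motif of a given length and keeps those that occur, with at most d mismatches, in all (or at least `min_support`) input sequences. This is a naive exhaustive baseline written for illustration, not the authors' algorithm, and its cost grows as |alphabet|^length. All function and parameter names are hypothetical.

```python
from itertools import product


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs_within(motif: str, seq: str, d: int) -> bool:
    """True if some window of seq differs from motif in at most d positions."""
    m = len(motif)
    return any(hamming(motif, seq[i:i + m]) <= d for i in range(len(seq) - m + 1))


def planted_motifs(seqs, length, d, alphabet="ACGT", min_support=None):
    """Exhaustively list motifs found (up to d mismatches) in enough sequences.

    min_support defaults to len(seqs), i.e. the motif must occur in every sequence.
    """
    need = len(seqs) if min_support is None else min_support
    found = []
    for candidate in product(alphabet, repeat=length):
        motif = "".join(candidate)
        support = sum(occurs_within(motif, s, d) for s in seqs)
        if support >= need:
            found.append(motif)
    return found


if __name__ == "__main__":
    seqs = ["TTACGTT", "GGACGAG", "CACCTCC"]
    print(planted_motifs(seqs, length=4, d=1))  # motifs allowing one mismatch
    print(planted_motifs(seqs, length=4, d=0))  # exact common 4-mers only
```

The exponential dependence on alphabet size in this baseline is the kind of cost that motivates algorithms designed to scale with large alphabets.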

CiteSeerX

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central