Finding patterns common to a set of strings

Author: Dana Angluin
Publication venue: Elsevier Inc.
Publication date: 31/08/1980

Abstract: Assume a finite alphabet of constant symbols and a disjoint infinite alphabet of variable symbols. A pattern is a non-null finite string of constant and variable symbols. The language of a pattern is the set of all strings obtainable by substituting non-null strings of constant symbols for the variables of the pattern. A sample is a finite nonempty set of non-null strings of constant symbols. Given a sample S, a pattern p is descriptive of S provided the language of p contains S and does not properly contain the language of any other pattern that contains S. The computational problem of finding a pattern descriptive of a given sample is studied. The main result is a polynomial-time algorithm for the special case of patterns containing only one variable symbol (possibly occurring several times in the pattern). Several other results are proved concerning the class of languages generated by patterns and the problem of finding a descriptive pattern.
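
The definition of a pattern language in this abstract can be illustrated with a small membership test. The sketch below is not the paper's algorithm; it is a plain backtracking check written under an assumed encoding (uppercase characters are variables, every other character is a constant), and the names `matches` and `covers` are hypothetical. It may take exponential time in the worst case.

```python
def matches(pattern, text, binding=None):
    """Return True if `text` is in the language of `pattern`.

    Assumed encoding (not from the paper): uppercase characters are
    variable symbols, all other characters are constant symbols.
    Each variable must be replaced by a non-null string, and every
    occurrence of the same variable gets the same replacement.
    """
    binding = {} if binding is None else binding
    if not pattern:
        return not text  # empty pattern matches only empty remainder
    head, rest = pattern[0], pattern[1:]
    if not head.isupper():
        # Constant symbol: must match the next character literally.
        return text.startswith(head) and matches(rest, text[1:], binding)
    if head in binding:
        # Variable already bound: its earlier value must reappear here.
        value = binding[head]
        return text.startswith(value) and matches(rest, text[len(value):], binding)
    # Unbound variable: try every non-null prefix, leaving at least one
    # character for each remaining pattern symbol.
    for end in range(1, len(text) - len(rest) + 1):
        if matches(rest, text[end:], {**binding, head: text[:end]}):
            return True
    return False


def covers(pattern, sample):
    """Return True if the language of `pattern` contains every string in `sample`."""
    return all(matches(pattern, s) for s in sample)


print(matches("xAyA", "xabyab"))          # A = "ab"
print(covers("AA", {"abab", "cc"}))       # A = "ab" and A = "c"
print(matches("aAb", "ab"))               # A may not be empty
```

A pattern that covers a sample is only a candidate: being descriptive also requires that no other covering pattern has a properly smaller language, which this sketch does not check.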

Elsevier - Publisher Connector

Closest string with outliers

Author: A Ben-Dor
B Ma
Bin Ma
Christina Boucher
D Marx
G Pavesi
J Dopazo
J Gramm
J Lanctot
K Lucas
L Wang
M Fellows
M Frances
M Garey
M Li
M Tompa
P Pevzner
R Downey
R Zhao
V Proutski
W Lenstra
X Deng
ZZ Chen
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Given n strings s1, …, sn each of length ℓ and a nonnegative integer d, the CLOSEST STRING problem asks to find a center string s such that none of the input strings has Hamming distance greater than d from s. Finding a common pattern in many – but not necessarily all – input strings is an important task that plays a role in many applications in bioinformatics. Results: Although the closest string model is robust to the oversampling of strings in the input, it is severely affected by the existence of outliers. We propose a refined model, the CLOSEST STRING WITH OUTLIERS (CSWO) problem, to overcome this limitation. This new model asks for a center string s that is within Hamming distance d of at least n – k of the n input strings, where k is a parameter describing the maximum number of outliers. A CSWO solution not only provides the center string as a representative for the set of strings but also reveals the outliers of the set. We provide fixed-parameter algorithms for CSWO when d and k are parameters, for both bounded and unbounded alphabets. We also show that when the alphabet is unbounded the problem is W[1]-hard with respect to n – k, ℓ, and d. Conclusions: Our refined model abstractly models finding common patterns in several but not all input strings.
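
The CSWO definition above can be made concrete with an exhaustive search. This is a sketch only, not the fixed-parameter algorithms the abstract refers to: it tries every string of length ℓ over the alphabet, so it is practical only for tiny inputs. The function names and the default choice of alphabet (the symbols occurring in the input) are assumptions made for illustration.

```python
from itertools import product


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def outliers(center, strings, d):
    """Input strings farther than Hamming distance d from `center`."""
    return [s for s in strings if hamming(center, s) > d]


def brute_force_cswo(strings, d, k, alphabet=None):
    """Return (center, outliers) with at most k outliers, or None.

    Assumption: unless given, the alphabet is the set of symbols that
    occur in the input. Candidates are tried in lexicographic order.
    """
    length = len(strings[0])
    if alphabet is None:
        alphabet = sorted(set("".join(strings)))
    for candidate in product(alphabet, repeat=length):
        center = "".join(candidate)
        out = outliers(center, strings, d)
        if len(out) <= k:
            return center, out
    return None


sample = ["aaaa", "aaab", "abaa", "bbbb"]
print(brute_force_cswo(sample, d=1, k=1))  # one outlier allowed
print(brute_force_cswo(sample, d=1, k=0))  # plain CLOSEST STRING (k = 0)
```

Setting k = 0 recovers the CLOSEST STRING problem, which shows how a single distant input can make an otherwise easy instance infeasible.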

CiteSeerX

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Developments from enquiries into the learnability of the pattern languages from positive data

Author: Ng Yen Kaow
Shinohara Takeshi
Publication venue: Elsevier Ltd.
Publication date: 20/05/2008

Abstract: The pattern languages are languages that are generated from patterns, and were first proposed by Angluin as a non-trivial class that is inferable from positive data [D. Angluin, Finding patterns common to a set of strings, Journal of Computer and System Sciences 21 (1980) 46–62; D. Angluin, Inductive inference of formal languages from positive data, Information and Control 45 (1980) 117–135]. In this paper we chronologize some results that developed from the investigations on the inferability of the pattern languages from positive data.

Elsevier - Publisher Connector

Finding patterns in strings using suffix arrays

Author: Stehouwer H.
Van Zaanen M.
Publication date: 01/01/2010

Finding regularities in large data sets requires implementations of systems that are efficient in both time and space requirements. Here, we describe a newly developed system that exploits the internal structure of the enhanced suffix array to find significant patterns in a large collection of sequences. The system searches exhaustively for all significantly compressing patterns where patterns may consist of symbols and skips or wildcards. We demonstrate a possible application of the system by detecting interesting patterns in a Dutch and an English corpus.
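
To show why a suffix array is a natural index for repeated patterns, here is a deliberately naive sketch. It is not the system described in the abstract: it builds a plain (not enhanced) suffix array by sorting suffixes, handles only contiguous repeats without skips or wildcards, and applies no significance or compression test. All function names are hypothetical.

```python
def suffix_array(text):
    """Start positions of all suffixes of `text`, in lexicographic order.

    Naive construction by sorting suffix slices; fine for a demo, far
    too slow and memory-hungry for a large corpus.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_length(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def repeated_substrings(text, min_len=2):
    """Substrings of length >= min_len shared by adjacent sorted suffixes.

    Suffixes that share a prefix sit next to each other in the suffix
    array, so comparing neighbours is enough to detect repeats.
    """
    sa = suffix_array(text)
    found = set()
    for i, j in zip(sa, sa[1:]):
        n = common_prefix_length(text[i:], text[j:])
        if n >= min_len:
            found.add(text[i:i + n])
    return sorted(found)


print(suffix_array("banana"))
print(repeated_substrings("banana"))
```

The neighbour comparison here recomputes common prefixes from scratch; an enhanced suffix array would instead carry precomputed tables alongside the sorted positions.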

MPG.PuRe