103,190 research outputs found

    Finding patterns common to a set of strings

    Get PDF
    AbstractAssume a finite alphabet of constant symbols and a disjoint infinite alphabet of variable symbols. A pattern is a non-null finite string of constant and variable symbols. The language of a pattern is all strings obtainable by substituting non-null strings of constant symbols for the variables of the pattern. A sample is a finite nonempty set of non-null strings of constant symbols. Given a sample S, a pattern p is descriptive of S provided the language of p contains S and does not properly contain the language of any other pattern that contains S. The computational problem of finding a pattern descriptive of a given sample is studied. The main result is a polynomial-time algorithm for the special case of patterns containing only one variable symbol (possibly occurring several times in the pattern). Several other results are proved concerning the class of languages generated by patterns and the problem of finding a descriptive pattern

    Closest string with outliers

    Get PDF
    Background: Given n strings s1, …, sn each of length ℓ and a nonnegative integer d, the CLOSEST STRING problem asks to find a center string s such that none of the input strings has Hamming distance greater than d from s. Finding a common pattern in many – but not necessarily all – input strings is an important task that plays a role in many applications in bioinformatics. Results: Although the closest string model is robust to the oversampling of strings in the input, it is severely affected by the existence of outliers. We propose a refined model, the CLOSEST STRING WITH OUTLIERS (CSWO) problem, to overcome this limitation. This new model asks for a center string s that is within Hamming distance d to at least n – k of the n input strings, where k is a parameter describing the maximum number of outliers. A CSWO solution not only provides the center string as a representative for the set of strings but also reveals the outliers of the set. We provide fixed parameter algorithms for CSWO when d and k are parameters, for both bounded and unbounded alphabets. We also show that when the alphabet is unbounded the problem is W[1]-hard with respect to n – k, ℓ, and d. Conclusions: Our refined model abstractly models finding common patterns in several but not all input strings

    Developments from enquiries into the learnability of the pattern languages from positive data

    Get PDF
    AbstractThe pattern languages are languages that are generated from patterns, and were first proposed by Angluin as a non-trivial class that is inferable from positive data [D. Angluin, Finding patterns common to a set of strings, Journal of Computer and System Sciences 21 (1980) 46–62; D. Angluin, Inductive inference of formal languages from positive data, Information and Control 45 (1980) 117–135]. In this paper we chronologize some results that developed from the investigations on the inferability of the pattern languages from positive data

    Finding patterns in strings using suffix arrays

    Get PDF
    Finding regularities in large data sets requires implementations of systems that are efficient in both time and space requirements. Here, we describe a newly developed system that exploits the internal structure of the enhanced suffixarray to find significant patterns in a large collection of sequences. The system searches exhaustively for all significantly compressing patterns where patterns may consist of symbols and skips or wildcards. We demonstrate a possible application of the system by detecting interesting patterns in a Dutch and an English corpus
    corecore