
    Bounds on the Number of Longest Common Subsequences

    This paper performs the analysis necessary to bound the running time of known, efficient algorithms for generating all longest common subsequences. That is, we bound the running time as a function of input size for algorithms with time essentially proportional to the output size. This paper considers both the case of computing all distinct LCSs and the case of computing all LCS embeddings. Also included is an analysis of how much better the efficient algorithms are than the standard method of generating LCS embeddings. A full analysis is carried out with running times measured as a function of the total number of input characters, and much of the analysis is also provided for cases in which the two input sequences are of the same specified length or of two independently specified lengths. Comment: 13 pages. Corrected typos, corrected operation of hyperlinks, improved presentation.
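The enumeration problems the paper analyses can be made concrete with a small sketch. The following is an illustrative implementation (not the paper's algorithm) that enumerates all distinct LCSs of two short strings from the standard dynamic-programming table; its running time is roughly proportional to the output size, which is the quantity the paper bounds, and that output can be exponential in the input length in the worst case.

```python
from functools import lru_cache

def all_distinct_lcs(x: str, y: str) -> set[str]:
    """Return the set of all distinct longest common subsequences of x and y.

    Illustrative only: fine for short inputs, exponential output in the
    worst case (which is exactly why output-size bounds matter).
    """
    m, n = len(x), len(y)
    # Standard LCS-length table on suffixes: L[i][j] = |LCS(x[i:], y[j:])|.
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if x[i] == y[j]:
                L[i][j] = 1 + L[i + 1][j + 1]
            else:
                L[i][j] = max(L[i + 1][j], L[i][j + 1])

    @lru_cache(maxsize=None)
    def collect(i: int, j: int) -> frozenset:
        # All distinct LCSs of the suffix pair (x[i:], y[j:]).
        if L[i][j] == 0:
            return frozenset({""})
        out = set()
        if i < m and j < n and x[i] == y[j]:
            out |= {x[i] + s for s in collect(i + 1, j + 1)}
        if i < m and L[i + 1][j] == L[i][j]:
            out |= collect(i + 1, j)
        if j < n and L[i][j + 1] == L[i][j]:
            out |= collect(i, j + 1)
        return frozenset(out)

    return set(collect(0, 0))
```

Using a set deduplicates automatically, so the routine reports each distinct LCS once even when it has many embeddings.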

    Exemplar Longest Common Subsequence (extended abstract)

    In this paper we investigate the computational and approximation complexity of the Exemplar Longest Common Subsequence of a set of sequences (the ELCS problem), a generalization of the Longest Common Subsequence problem in which the input sequences are over the union of two disjoint sets of symbols, a set of mandatory symbols and a set of optional symbols. We show that different versions of the problem are APX-hard even for instances with two sequences. Moreover, we show that the related problem of deciding whether a feasible solution of the Exemplar Longest Common Subsequence of two sequences exists is NP-hard. On the positive side, we give efficient algorithms for the ELCS problem on instances of two sequences in which each mandatory symbol occurs at most three times in total, or in which the number of mandatory symbols is bounded by a constant.

    Measuring the accuracy of page-reading systems

    Given a bitmapped image of a page from any document, a page-reading system identifies the characters on the page and stores them in a text file. This OCR-generated text is represented by a string and compared with the correct string to determine the accuracy of this process. The string editing problem is applied to find an optimal correspondence of these strings using an appropriate cost function. The ISRI annual test of page-reading systems utilizes the following performance measures, which are defined in terms of this correspondence and the string edit distance: character accuracy, throughput, accuracy by character class, marked character efficiency, word accuracy, non-stopword accuracy, and phrase accuracy. It is shown that the universe of cost functions is divided into equivalence classes, and the cost functions related to the longest common subsequence (LCS) are identified. The computation of an LCS can be made faster by a linear-time preprocessing step.
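As an illustration of the character-accuracy measure described above, here is a minimal sketch. The function names are assumptions, and the formula follows the commonly cited ISRI definition: character accuracy is (n - #errors)/n, where n is the length of the correct text and #errors is the unit-cost string edit distance between the correct text and the OCR output.

```python
def edit_distance(ref: str, ocr: str) -> int:
    """Levenshtein distance with unit costs for insertion, deletion,
    and substitution, computed row by row in O(min-space)."""
    m, n = len(ref), len(ocr)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == ocr[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # delete from ref
                         cur[j - 1] + 1,     # insert into ref
                         prev[j - 1] + cost) # match or substitute
        prev = cur
    return prev[n]

def character_accuracy(ref: str, ocr: str) -> float:
    """Character accuracy as (n - #errors) / n, where n = len(ref)
    and #errors is the minimum number of edit operations (assumed
    formula, matching the ISRI-style definition)."""
    n = len(ref)
    return (n - edit_distance(ref, ocr)) / n
```

Note that the accuracy can be negative when the OCR output is much longer than the reference, which is why the measure is reported relative to the correct text's length.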

    Algorithms for peptide and PTM identification using Tandem mass spectrometry

    Ph.D. thesis (Doctor of Philosophy).

    Subsequences and Supersequences of Strings

    Stringology - the study of strings - is a branch of algorithmics which has been the subject of mounting interest in recent years. Very recently, two books [M. Crochemore and W. Rytter, Text Algorithms, Oxford University Press, 1995] and [G. Stephen, String Searching Algorithms, World Scientific, 1994] have been published on the subject, and at least two others are known to be in preparation. Problems on strings arise in information retrieval, version control, automatic spelling correction, and many other domains. However, the greatest motivation for recent work in stringology has come from the field of molecular biology. String problems occur, for example, in genetic sequence construction, genetic sequence comparison, and phylogenetic tree construction. In this thesis we study a variety of string problems from a theoretical perspective. In particular, we focus on problems involving subsequences and supersequences of strings.

    Expected length of longest common subsequences

    A longest common subsequence of two sequences is a sequence that is a subsequence of both the given sequences and has largest possible length. It is known that the expected length of a longest common subsequence is proportional to the length of the given sequences. The proportion, denoted by γk, depends on the alphabet size k, and the exact value of this proportion is not known even for a binary alphabet. To obtain lower bounds for the constants γk, finite state machines computing a common subsequence of the inputs are built. Analysing the behaviour of the machines for random inputs, we get lower bounds for the constants γk. The analysis of the machines is based on the theory of Markov chains. An algorithm for automated production of lower bounds is described. To obtain upper bounds for the constants γk, collations - pairs of sequences with a marked common subsequence - are defined. Upper bounds for the number of collations of ‘small size’ can be easily transformed to upper bounds for the constants γk. Combinatorial analysis is used to bound the number of collations. The methods used for producing bounds on the expected length of a common subsequence of two sequences are also applied to other problems, namely a longest common subsequence of several sequences, a shortest common supersequence, and maximal adaptability.
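To make the constant γk concrete, here is a small Monte Carlo sketch that estimates the ratio E[LCS length]/n for two random length-n sequences over a k-letter alphabet. This is only an empirical approximation, not a rigorous bound of the kind the work derives with finite state machines and collations; for finite n the ratio also underestimates γk slightly, since E[LCS]/n approaches γk from below.

```python
import random

def lcs_length(x, y):
    """Classic O(mn) dynamic program for the LCS length, kept to two rows."""
    n = len(y)
    prev = [0] * (n + 1)
    for a in x:
        cur = [0] * (n + 1)
        for j, b in enumerate(y, 1):
            cur[j] = prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1])
        prev = cur
    return prev[n]

def estimate_gamma(k: int, n: int = 500, trials: int = 20, seed: int = 0) -> float:
    """Monte Carlo estimate of E[LCS length]/n over random inputs --
    an empirical stand-in for gamma_k, not a proven bound."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        x = [rng.randrange(k) for _ in range(n)]
        y = [rng.randrange(k) for _ in range(n)]
        total += lcs_length(x, y)
    return total / (trials * n)
```

For a binary alphabet the true constant γ2 is known to lie roughly between 0.78 and 0.83, and even modest simulations land in that neighbourhood.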

    The longest common subsequence problem for two strings and its solution (Kahden merkkijonon pisimmän yhteisen alijonon ongelma ja sen ratkaiseminen)

    The topic of this thesis belongs to string algorithmics. A string S is a common subsequence of strings X[1..m] and Y[1..n] if it can be obtained by deleting 0..m symbols from X and 0..n symbols from Y at arbitrary positions. If no common subsequence of X and Y is longer than S, then S is a longest common subsequence (abbr. LCS) of X and Y. This work concentrates on solving the LCS problem for two strings, but the problem generalizes to several strings as well. The LCS problem has applications not only in computer science but also in bioinformatics. Among the best-known applications are text and image compression, file version management, pattern recognition, and comparative analysis of DNA and protein structures. Solving the problem is difficult because the algorithms depend on several parameters of the input strings: among others, the lengths of the inputs, the size of the input alphabet, the character distribution of the inputs, the proportion of the LCS relative to the length of the shorter input, and the number of matching symbol pairs. It is therefore hard to develop an algorithm that runs efficiently on all problem instances.
    The thesis serves on the one hand as a handbook which, after describing the basic concepts, surveys previously developed exact LCS algorithms. They are grouped by the processing model of the algorithm: one row, one contour, or one diagonal at a time, or multidirectionally. In addition to the exact methods, heuristic methods are presented that compute an upper or a lower bound on the LCS length; their results can be used either as such or to steer the execution of an exact algorithm. This part is based on articles published by our research group, which introduce, for the first time, exact methods enhanced with heuristics.
    On the other hand, the work contains a fairly comprehensive empirical study that aims at improving the running time and memory usage of existing exact algorithms. This goal is pursued through programming techniques: by introducing data structures that support the processing model of the algorithms well, and by curbing fruitless computation through an improved ability to detect intermediate results obtained during execution and to reuse them. As conclusions of the thesis, heuristic preprocessing of exact LCS algorithms almost systematically reduces their running time and, in particular, their memory requirements. Moreover, the data structure used by an algorithm has a decisive effect on the efficiency of the computation: the more local the search and update operations are, the more efficient the computation performed by the algorithm.
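The interplay between exact computation and cheap bounds described above can be illustrated with a small sketch: the classic O(mn) dynamic program for the exact LCS length, alongside a simple counting heuristic that yields an upper bound, the kind of precomputed bound that could, for example, steer or terminate an exact algorithm early. This is an illustration under assumed names, not one of the thesis's algorithms.

```python
from collections import Counter

def lcs_length(x: str, y: str) -> int:
    """Exact LCS length via the classic O(mn) dynamic program,
    using only two rows of the table."""
    prev = [0] * (len(y) + 1)
    for a in x:
        cur = [0]
        for j, b in enumerate(y, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def count_upper_bound(x: str, y: str) -> int:
    """Cheap O(m + n) upper bound: an LCS cannot use a symbol more often
    than it occurs in either input, so sum min(occurrences) per symbol."""
    cx, cy = Counter(x), Counter(y)
    return sum(min(cnt, cy[s]) for s, cnt in cx.items())
```

If the bound already equals the length of the shorter input, or matches a lower bound produced by another heuristic, an exact run can stop immediately; otherwise the bound merely caps the search.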