1,776 research outputs found

    DC: a highly efficient and flexible exact pattern-matching algorithm

    Get PDF
    ware of the need for faster and flexible searching algorithms in fields such as web searching or bioinformatics, we propose DC - a high-performance algorithm for exact pattern matching. Emphasizing the analysis of pattern peculiarities in the pre-processing phase, the algorithm en- compasses a novel search logic based on the examination of multiple alignments within a larger window, selectively tested after a powerful heuristic called compatibility rule is verified. The new algorithm’s performance is, on average, above its best-rated competitors when testing different data types and using a complete suite of pattern extensions and compositions. The flexibility is remarkable and the efficiency is more relevant in quaternary or greater alphabets. Keywords: exact pattern-match, searching algorithms

    Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis

    Get PDF
    BACKGROUND: Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2(-L )distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. RESULTS: The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. CONCLUSIONS: The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems

    Models, Algorithms, and Architectures for Scalable Packet Classification

    Get PDF
    The growth and diversification of the Internet imposes increasing demands on the performance and functionality of network infrastructure. Routers, the devices responsible for the switch-ing and directing of traffic in the Internet, are being called upon to not only handle increased volumes of traffic at higher speeds, but also impose tighter security policies and provide support for a richer set of network services. This dissertation addresses the searching tasks performed by Internet routers in order to forward packets and apply network services to packets belonging to defined traffic flows. As these searching tasks must be performed for each packet traversing the router, the speed and scalability of the solutions to the route lookup and packet classification problems largely determine the realizable performance of the router, and hence the Internet as a whole. Despite the energetic attention of the academic and corporate research communities, there remains a need for search engines that scale to support faster communication links, larger route tables and filter sets and increasingly complex filters. The major contributions of this work include the design and analysis of a scalable hardware implementation of a Longest Prefix Matching (LPM) search engine for route lookup, a survey and taxonomy of packet classification techniques, a thorough analysis of packet classification filter sets, the design and analysis of a suite of performance evaluation tools for packet classification algorithms and devices, and a new packet classification algorithm that scales to support high-speed links and large filter sets classifying on additional packet fields

    Pattern matching through Chaos Game Representation: bridging numerical and discrete data structures for biological sequence analysis

    Get PDF
    This work was partially supported by FCT through the PIDDAC Program funds (INESC-ID multiannual funding) and under grant PEst-OE/EEI/LA0008/2011 (IT multiannual funding). In addition, it was also partially funded by projects HIVCONTROL (PTDC/EEA-CRO/100128/2008, S. Vinga, PI), TAGS (PTDC/EIA-EIA/112283/2009) and NEUROCLINOMICS (PTDC/EIA-EIA/111239/2009) from FCT (Portugal).Background: Chaos Game Representation (CGR) is an iterated function that bijectively maps discrete sequences into a continuous domain. As a result, discrete sequences can be object of statistical and topological analyses otherwise reserved to numerical systems. Characteristically, CGR coordinates of substrings sharing an L-long suffix will be located within 2(-L) distance of each other. In the two decades since its original proposal, CGR has been generalized beyond its original focus on genomic sequences and has been successfully applied to a wide range of problems in bioinformatics. This report explores the possibility that it can be further extended to approach algorithms that rely on discrete, graph-based representations. Results: The exploratory analysis described here consisted of selecting foundational string problems and refactoring them using CGR-based algorithms. We found that CGR can take the role of suffix trees and emulate sophisticated string algorithms, efficiently solving exact and approximate string matching problems such as finding all palindromes and tandem repeats, and matching with mismatches. The common feature of these problems is that they use longest common extension (LCE) queries as subtasks of their procedures, which we show to have a constant time solution with CGR. Additionally, we show that CGR can be used as a rolling hash function within the Rabin-Karp algorithm. Conclusions: The analysis of biological sequences relies on algorithmic foundations facing mounting challenges, both logistic (performance) and analytical (lack of unifying mathematical framework). CGR is found to provide the latter and to promise the former: graph-based data structures for sequence analysis operations are entailed by numerical-based data structures produced by CGR maps, providing a unifying analytical framework for a diversity of pattern matching problems.publishersversionpublishe

    Compressed Full-Text Indexes for Highly Repetitive Collections

    Get PDF
    This thesis studies problems related to compressed full-text indexes. A full-text index is a data structure for indexing textual (sequence) data, so that the occurrences of any query string in the data can be found efficiently. While most full-text indexes require much more space than the sequences they index, recent compressed indexes have overcome this limitation. These compressed indexes combine a compressed representation of the index with some extra information that allows decompressing any part of the data efficiently. This way, they provide similar functionality as the uncompressed indexes, while using only slightly more space than the compressed data. The efficiency of data compression is usually measured in terms of entropy. While entropy-based estimates predict the compressed size of most texts accurately, they fail with highly repetitive collections of texts. Examples of such collections include different versions of a document and the genomes of a number of individuals from the same population. While the entropy of a highly repetitive collection is usually similar to that of a text of the same kind, the collection can often be compressed much better than the entropy-based estimate. Most compressed full-text indexes are based on the Burrows-Wheeler transform (BWT). Originally intended for data compression, the BWT has deep connections with full-text indexes such as the suffix tree and the suffix array. With some additional information, these indexes can be simulated with the Burrows-Wheeler transform. The first contribution of this thesis is the first BWT-based index that can compress highly repetitive collections efficiently. Compressed indexes allow us to handle much larger data sets than the corresponding uncompressed indexes. To take full advantage of this, we need algorithms for constructing the compressed index directly, instead of first constructing an uncompressed index and then compressing it. The second contribution of this thesis is an algorithm for merging the BWT-based indexes of two text collections. By using this algorithm, we can derive better space-efficient construction algorithms for BWT-based indexes. The basic BWT-based indexes provide similar functionality as the suffix array. With some additional structures, the functionality can be extended to that of the suffix tree. One of the structures is an array storing the lengths of the longest common prefixes of lexicographically adjacent suffixes of the text. The third contribution of this thesis is a space-efficient algorithm for constructing this array, and a new compressed representation of the array. In the case of individual genomes, the highly repetitive collection can be considered a sample from a larger collection. This collection consists of a reference sequence and a set of possible differences from the reference, so that each sequence contains a subset of the differences. The fourth contribution of this thesis is a BWT-based index that extrapolates the larger collection from the sample and indexes it.Tässä väitöskirjassa käsitellään tiivistettyjä kokotekstihakemistoja tekstimuotoisille aineistoille. Kokotekstihakemistot ovat tietorakenteita, jotka mahdollistavat mielivaltaisten hahmojen esiintymien löytämisen tekstistä tehokkaasti. Perinteiset kokotekstihakemistot, kuten loppuosapuut ja -taulukot, vievät moninkertaisesti tilaa itse aineistoon nähden. Viime aikoina on kuitenkin kehitetty tiivistettyjä hakemistorakenteita, jotka tarjoavat vastaavan toiminnallisuuden alkuperäistä tekstiä pienemmässä tilassa. Tämä on mahdollistanut aikaisempaa suurempien aineistojen käsittelyn. Tekstin tiivistyvyyttä mitataan yleensä suhteessa sen entropiaan. Vaikka entropiaan perustuvat arviot ovat useimmilla aineistoilla varsin tarkkoja, aliarvioivat ne vahvasti toisteisien aineistojen tiivistyvyyttä. Esimerkkejä tällaisista aineistoista ovat kokoelmat saman populaation yksilöiden genomeita tai saman dokumentin eri versioita. Siinä missä tällaisen kokoelman entropia suhteessa aineiston kokoon on vastaava kuin yksittäisellä samaa tyyppiä olevalla tekstillä, tiivistyy kokoelma yleensä huomattavasti paremmin kuin entropian perusteella voisi odottaa. Useimmat tiivistetyt kokotekstihakemistot perustuvat Burrows-Wheeler-muunnokseen (BWT), joka kehitettiin alun perin tekstimuotoisten aineistojen tiivistämiseen. Pian kuitenkin havaittiin, että koska BWT muistuttaa rakenteeltaan loppuosapuuta ja -taulukkoa, voidaan sitä käyttää niissä tehtävien hakujen simulointiin. Tässä väitöskirjassa esitetään ensimmäinen BWT-pohjainen kokotekstihakemisto, joka pystyy tiivistämään vahvasti toisteiset aineistot tehokkaasti. Tiivistettyjen tietorakenteiden käyttö mahdollistaa suurempien aineistoiden käsittelemisen kuin tavallisia tietorakenteita käytettäessä. Tämä etu kuitenkin menetetään, jos tiivistetty tietorakenne muodostetaan luomalla ensin vastaava tavallinen tietorakenne ja tiivistämällä se. Tässä väitöskirjassa esitetään aikaisempaa vähemmän muistia käyttäviä algoritmeja BWT-pohjaisten kokotekstihakemistojen muodostamiseen. Kokoelma yksilöiden genomeita voidaan käsittää otokseksi suuremmasta kokoelmasta, joka koostuu populaation kaikkien yksilöiden sekä niiden hypoteettisten jälkeläisten genomeista. Tällainen kokoelma voidaan esittää äärellisenä automaattina, joka muodostuu referenssigenomista ja yksilöiden genomeissa esiintyvistä poikkeamista referenssistä. Tässä väitöskirjassa esitetään BWT-pohjaisten kokotekstihakemistojen yleistys, joka mahdollistaa tällaisten automaattien indeksoinnin

    Towards Terabit Carrier Ethernet and Energy Efficient Optical Transport Networks

    Get PDF
    corecore