
    Perushakurakenteiden tehokas muodostus suurille tekstimassoille (Efficient construction of fundamental search structures for large text collections)

    This thesis studies efficient algorithms for constructing the most fundamental data structures used as building blocks in (compressed) full-text indexes. Full-text indexes are data structures that allow efficiently searching for occurrences of a query string in a (much larger) text. We are mostly interested in large-scale indexing, that is, dealing with input instances that cannot be processed entirely in internal memory and thus require the use of much slower external memory. Specifically, we focus on three data structures: the suffix array, the LCP array and the Lempel-Ziv (LZ77) parsing. These are routinely found as components or used as auxiliary data structures in the construction of many modern full-text indexes. The suffix array is a list of all suffixes of a text in lexicographical order. Despite its simplicity, the suffix array is a powerful tool used extensively not only in indexing but also in data compression, string combinatorics and computational biology. The first contribution of this thesis is an improved algorithm for external-memory suffix array construction based on constructing suffix arrays for blocks of text and merging them into the full suffix array. In many applications, the suffix array needs to be augmented with information about the longest common prefix of each pair of lexicographically adjacent suffixes. The array containing this information is called the longest-common-prefix (LCP) array. The second contribution of this thesis is the first algorithm for computing the LCP array in external memory that is not an extension of a suffix-sorting algorithm. When the input text is highly repetitive, general-purpose text indexes are usually outperformed (particularly in space usage) by specialized indexes. One of the most popular families of such indexes is based on the Lempel-Ziv (LZ77) parsing. The LZ77 parsing is an encoding of the text that replaces long repeating substrings with references to other occurrences. In addition to indexing, LZ77 is a heavily used tool in data compression. The third contribution of this thesis is a series of new algorithms to compute the LZ77 parsing, both in RAM and in external memory. The algorithms introduced in this thesis significantly improve upon the prior art. For example: (i) our new approach for constructing the LCP array in external memory is faster than the previously best algorithm by a factor of 2-4 and simultaneously reduces the disk space usage by a factor of four; (ii) a parallel version of our improved suffix array construction algorithm is able to handle inputs much larger than considered in the literature so far. In our experiments, computing the suffix array of a 1 TiB file with the new algorithm took a little over a week and required only 7.2 TiB of disk space (including input and output), whereas on the same machine the previously best algorithm would require 3.5 times as much disk space and take about four times longer.
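
    The thesis itself targets external memory, but the three structures are easy to state in a few lines of code. The sketch below (our illustration, not taken from the thesis) builds the suffix array, the LCP array and a greedy LZ77 parsing of a toy string with naive in-memory algorithms; the function names and the quadratic-time construction are purely illustrative.

```python
# Minimal in-memory sketch of the three structures discussed above: the
# suffix array, the LCP array and a greedy LZ77 parsing. Purely
# illustrative; the thesis is about external-memory construction, which
# looks nothing like these naive algorithms.

def suffix_array(text: str) -> list[int]:
    """Indices of all suffixes of `text`, sorted lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text: str, sa: list[int]) -> list[int]:
    """LCP[i] = length of the longest common prefix of the suffixes
    starting at sa[i-1] and sa[i] (LCP[0] is 0 by convention)."""
    def lcp(i: int, j: int) -> int:
        k = 0
        while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
            k += 1
        return k
    return [0] + [lcp(sa[i - 1], sa[i]) for i in range(1, len(sa))]

def lz77_parse(text: str) -> list:
    """Greedy LZ77 parsing: each phrase is either a literal character or a
    (position, length) reference to an earlier occurrence."""
    phrases, i = [], 0
    while i < len(text):
        best_pos, best_len = 0, 0
        for j in range(i):                        # candidate earlier positions
            k = 0
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_pos, best_len = j, k
        if best_len == 0:
            phrases.append(text[i]); i += 1       # literal character
        else:
            phrases.append((best_pos, best_len)); i += best_len
    return phrases

if __name__ == "__main__":
    t = "banana"
    sa = suffix_array(t)
    print(sa)                  # [5, 3, 1, 0, 4, 2]
    print(lcp_array(t, sa))    # [0, 1, 3, 0, 0, 2]
    print(lz77_parse(t))       # ['b', 'a', 'n', (1, 3)]
```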

    Transform Based And Search Aware Text Compression Schemes And Compressed Domain Text Retrieval

    In recent times, we have witnessed an unprecedented growth of textual information via the Internet, digital libraries and archival text in many applications. While a good fraction of this information is of transient interest, useful information of archival value will continue to accumulate. We need ways to manage, organize and transport this data from one point to another over data communication links with limited bandwidth. We must also have means to speedily find the information we need within this huge mass of data. Sometimes a single site may also contain large collections of data, such as a library database, thereby requiring an efficient search mechanism even within the local data. To facilitate information retrieval, an emerging ad hoc standard for uncompressed text is XML, which preprocesses the text by adding user-defined metadata such as a DTD or hyperlinks to enable more efficient and effective searching. This increases the file size considerably, underscoring the importance of applying text compression. On account of efficiency (in terms of both space and time), there is a need to keep the data in compressed form for as long as possible. Text compression is concerned with techniques for representing digital text data in alternative representations that take less space. Not only does it help conserve storage space for archival and online data, it also improves system performance by requiring fewer secondary storage (disk or CD-ROM) accesses and improves network bandwidth utilization by reducing transmission time. Unlike for static images or video, there is no international standard for text compression, although compressed formats such as .zip, .gz and .Z are widely used. In general, data compression methods are classified as lossless or lossy. Lossless compression allows the original data to be recovered exactly. Although used primarily for text data, lossless compression algorithms are also useful for special classes of images such as medical imaging, fingerprint data, astronomical images, and databases containing mostly vital numerical data, tables and text. Many lossy algorithms use lossless methods in the final stage of encoding, underscoring the importance of lossless methods for both lossy and lossless compression applications. In order to effectively utilize the full potential of compression techniques in future retrieval systems, we need efficient information retrieval in the compressed domain. This means that techniques must be developed to search the compressed text without decompression, or with only partial decompression, independently of whether the search is done on the text itself or on an inversion table corresponding to a set of keywords for the text. In this dissertation, we make the following contributions: (1) Star family compression algorithms: We propose an approach to develop a reversible transformation that can be applied to a source text and improves the ability of existing algorithms to compress it. We use a static dictionary to convert English words into predefined symbol sequences. These transformed sequences create additional context information that is superior to the original text. Thus we achieve some compression already at the preprocessing stage. We present a series of transforms which improve performance. The star transform requires a static dictionary of a certain size. To avoid the considerable complexity of conversion, we employ a ternary tree data structure that converts the words in the text to the words in the star dictionary in linear time. (2) Exact and approximate pattern matching in Burrows-Wheeler transformed (BWT) files: We propose a method to extract the useful context information in linear time from the BWT-transformed text. The auxiliary arrays obtained from the BWT inverse transform bring logarithmic search time. Approximate pattern matching can be performed based on the results of exact pattern matching to extract candidate positions, and a fast verification algorithm can then be applied to those candidates, which may be just small parts of the original text. We present algorithms for both k-mismatch and k-approximate pattern matching in BWT-compressed text. A typical compression system based on BWT has Move-to-Front and Huffman coding stages after the transformation. We propose a novel approach to replace the Move-to-Front stage in order to extend compressed-domain search capability all the way to the entropy coding stage. A modification to Move-to-Front makes it possible to randomly access any part of the compressed text without referring to the part before the access point. (3) A modified LZW algorithm that allows random access and partial decoding for compressed text retrieval: Although many compression algorithms provide good compression ratios and/or time complexity, LZW was the first studied for compressed pattern matching because of its simplicity and efficiency. Our modifications to the LZW algorithm provide fast random access and partial decoding, which is especially useful for text retrieval systems. Based on this algorithm, we can provide a dynamic hierarchical semantic structure for the text, so that text search can be performed at the desired level of granularity. For example, the user can choose to retrieve a single line, a paragraph or a file that contains the keywords. More importantly, we show that parallel encoding and decoding are trivial with the modified LZW: both can easily be performed with multiple processors, and the encoding and decoding processes are independent of the number of processors.
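
    As a rough illustration of the compressed-domain search theme in contribution (2), the sketch below (ours, not the dissertation's method) computes a Burrows-Wheeler transform by sorting rotations and answers exact pattern queries by binary search over the lexicographically sorted suffixes; the real algorithms operate on the compressed representation itself and also handle approximate matching.

```python
# Toy illustration of the BWT and of logarithmic exact search over sorted
# suffixes, on a plain in-memory string. Not the dissertation's algorithms.

import bisect

def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations ('$' is a sentinel)."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

def find_occurrences(text: str, pattern: str) -> list[int]:
    """All starting positions of `pattern` in `text`, found by binary search
    over the lexicographically sorted suffixes."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Truncating each suffix to the pattern length preserves the sort order.
    prefixes = [text[i:i + len(pattern)] for i in sa]
    lo = bisect.bisect_left(prefixes, pattern)
    hi = bisect.bisect_right(prefixes, pattern)
    return sorted(sa[lo:hi])

if __name__ == "__main__":
    t = "mississippi"
    print(bwt(t))                        # ipssm$pissii
    print(find_occurrences(t, "ssi"))    # [2, 5]
```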

    28th Annual Symposium on Combinatorial Pattern Matching: CPM 2017, July 4-6, 2017, Warsaw, Poland

    Peer reviewed

    Sublinear Computation Paradigm

    This open access book gives an overview of cutting-edge work on a new paradigm called the “sublinear computation paradigm,” which was proposed in the large multiyear academic research project “Foundations of Innovative Algorithms for Big Data,” which ran in Japan from October 2014 to March 2020. To handle the unprecedented explosion of big data sets in research, industry, and other areas of society, there is an urgent need to develop novel methods and approaches for big data analysis. To meet this need, innovative changes in algorithm theory for big data are being pursued. For example, polynomial-time algorithms have thus far been regarded as “fast,” but if a quadratic-time algorithm is applied to a petabyte-scale or larger big data set, problems are encountered in terms of computational resources or running time. To deal with this critical computational and algorithmic bottleneck, linear, sublinear, and constant time algorithms are required. The sublinear computation paradigm is proposed here in order to support innovation in the big data era. A foundation of innovative algorithms has been created by developing computational procedures, data structures, and modelling techniques for big data. The project is organized into three teams that focus on sublinear algorithms, sublinear data structures, and sublinear modelling. The work has provided high-level academic research results of strong computational and algorithmic interest, which are presented in this book. The book consists of five parts: Part I, which consists of a single chapter on the concept of the sublinear computation paradigm; Parts II, III, and IV, which review results on sublinear algorithms, sublinear data structures, and sublinear modelling, respectively; and Part V, which presents application results. The information presented here will inspire researchers working in the field of modern algorithms.
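
    As a toy illustration of the paradigm (not an example taken from the book), the sketch below estimates a global statistic of a large data set by random sampling, so the number of elements inspected is independent of the input size; the function name and parameters are our own.

```python
# Classic sublinear technique: estimate a fraction by sampling a constant
# number of positions instead of scanning the whole input.

import random

def estimate_fraction(data, predicate, samples: int = 10_000) -> float:
    """Estimate the fraction of items satisfying `predicate` by inspecting
    only `samples` uniformly random positions -- O(samples) work, no matter
    how large `data` is."""
    hits = sum(predicate(data[random.randrange(len(data))]) for _ in range(samples))
    return hits / samples

if __name__ == "__main__":
    data = [random.randint(0, 99) for _ in range(10_000_000)]
    # The true fraction is about 0.30; with 10,000 samples the estimate is
    # typically within roughly +/-0.01 (about two standard errors).
    print(estimate_fraction(data, lambda x: x < 30))
```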

    35th Symposium on Theoretical Aspects of Computer Science: STACS 2018, February 28-March 3, 2018, Caen, France


    Efficient Storage of Genomic Sequences in High Performance Computing Systems

    In this dissertation, we address the challenges of genomic data storage in high performance computing systems. In particular, we focus on developing a referential compression approach for Next Generation Sequencing data stored in FASTQ format files. The amount of genomic data available for researchers to process has increased exponentially, bringing enormous challenges for its efficient storage and transmission. General-purpose compressors can only offer limited performance for genomic data, hence the need for specialized compression solutions. Two trends have emerged as alternatives to harness the particular properties of genomic data: non-referential and referential compression. Non-referential compressors offer higher compression ratios than general-purpose compressors, but still below what a referential compressor could theoretically achieve. However, the effectiveness of referential compression depends on selecting a good reference and on having enough computing resources available. This thesis presents one of the first referential compressors for FASTQ files. We first present a comprehensive analytical and experimental evaluation of the most relevant tools for genomic raw data compression, which led us to identify the main needs and opportunities in this field. As a consequence, we propose a novel compression workflow that aims at improving the usability of referential compressors. Subsequently, we discuss the implementation and performance evaluation of the core of the proposed workflow: a referential compressor for reads in FASTQ format that combines local read-to-reference alignments with a specialized binary-encoding strategy. The compression algorithm, named UdeACompress, achieved very competitive compression ratios when compared to the best compressors in the current state of the art, while showing reasonable execution times and memory use. In particular, UdeACompress outperformed all competitors when compressing long reads, typical of the newest sequencing technologies. Finally, we study the main aspects of data-level parallelism in the Intel AVX-512 architecture in order to develop a parallel version of the UdeACompress algorithms and reduce the runtime. Through the use of SIMD programming, we managed to significantly accelerate the main bottleneck found in UdeACompress, the suffix array construction.
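
    The sketch below is a heavily simplified illustration of the referential idea only, not of UdeACompress: a read that aligns to the reference with a few substitutions is stored as a position plus its mismatches, and everything else is kept verbatim. The function names and the naive scan-based alignment are our own assumptions; real tools use indexed alignment and handle indels and quality scores.

```python
# Toy referential encoding of reads against a reference sequence.

def encode_read(read: str, reference: str, max_mismatches: int = 2):
    """Return ('ref', pos, [(offset, base), ...]) if the read matches the
    reference at some position with at most `max_mismatches` substitutions,
    otherwise ('raw', read). Alignment here is a naive linear scan."""
    n, m = len(reference), len(read)
    for pos in range(n - m + 1):
        mismatches = [(i, read[i]) for i in range(m) if reference[pos + i] != read[i]]
        if len(mismatches) <= max_mismatches:
            return ("ref", pos, mismatches)
    return ("raw", read)

def decode_read(record, reference: str, read_length: int) -> str:
    """Rebuild the original read from its referential record."""
    if record[0] == "raw":
        return record[1]
    _, pos, mismatches = record
    bases = list(reference[pos:pos + read_length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)

if __name__ == "__main__":
    ref = "ACGTACGTTTGACCA"
    read = "CGTTAGAC"          # matches ref[5:13] = "CGTTTGAC" with one substitution
    rec = encode_read(read, ref)
    print(rec)                  # ('ref', 5, [(4, 'A')])
    assert decode_read(rec, ref, len(read)) == read
```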

    Using semantic knowledge to improve compression on log files

    With the move towards global and multi-national companies, information technology infrastructure requirements are increasing. As the size of these computer networks increases, it becomes more and more difficult to monitor, control, and secure them. Networks consist of a number of diverse devices, sensors, and gateways which are often spread over large geographical areas. Each of these devices produces log files which need to be analysed and monitored to provide network security and satisfy regulations. Data compression programs such as gzip and bzip2 are commonly used to reduce the quantity of data for archival purposes after the log files have been rotated. However, many other compression programs exist, each with its own advantages and disadvantages. These programs each use a different amount of memory and take different compression and decompression times to achieve different compression ratios. System log files also contain redundancy which is not necessarily exploited by standard compression programs. Log messages usually follow a similar format with a defined syntax. In the log files, not all ASCII characters are used, and the messages contain certain "phrases" which are often repeated. This thesis investigates the use of compression as a means of data reduction and how the use of semantic knowledge can improve data compression (also applying the results to different scenarios that can occur in a distributed computing environment). It presents the results of a series of tests performed on different log files. It also examines the semantic knowledge which exists in maillog files and how it can be exploited to improve the compression results. The results from a series of text preprocessors which exploit this knowledge are presented and evaluated. These preprocessors include one which replaces the timestamps and IP addresses with their binary equivalents and one which replaces words from a dictionary with unused ASCII characters. In this thesis, data compression is shown to be an effective method of data reduction, producing up to 98 percent reduction in file size on a corpus of log files. The use of preprocessors which exploit semantic knowledge results in up to 56 percent improvement in overall compression time and up to 32 percent reduction in compressed size.
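
    A minimal sketch of the kind of semantic preprocessing described above (our simplification, not the thesis's exact preprocessors): syslog-style timestamps and IPv4 addresses are replaced by compact binary encodings before the data is handed to a general-purpose compressor. The reversibility bookkeeping (escaping and marker bytes) that a real preprocessor needs is omitted.

```python
# Replace timestamps and IPv4 addresses in a log with binary equivalents,
# then compare compressed sizes with and without the preprocessing step.

import re
import socket
import struct
import zlib

IP_RE = re.compile(rb"\b(?:\d{1,3}\.){3}\d{1,3}\b")
TS_RE = re.compile(rb"\b[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}\b")
MONTHS = {m: i for i, m in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}

def pack_ip(match):
    # "192.168.10.77" (13 ASCII bytes) -> 4 raw bytes
    return socket.inet_aton(match.group().decode())

def pack_timestamp(match):
    # "Oct 11 22:14:15" (15 ASCII bytes) -> 5 raw bytes
    month, day, clock = match.group().decode().split()
    hour, minute, second = (int(x) for x in clock.split(":"))
    return struct.pack("5B", MONTHS[month], int(day), hour, minute, second)

def preprocess(log: bytes) -> bytes:
    log = TS_RE.sub(pack_timestamp, log)
    return IP_RE.sub(pack_ip, log)

if __name__ == "__main__":
    lines = [
        b"Oct 11 22:14:%02d gw sshd[311]: Failed password from 192.168.10.%d port %d\n"
        % (i % 60, i % 254 + 1, 4000 + i)
        for i in range(2000)
    ]
    log = b"".join(lines)
    print("compress only        :", len(zlib.compress(log, 9)))
    print("preprocess + compress:", len(zlib.compress(preprocess(log), 9)))
```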

    Algorithm engineering: string processing

    The string matching problem has attracted a lot of interest throughout the history of computer science, and is crucial to the computing industry. The theoretical community in computer science has developed a rich literature on the design and analysis of string matching algorithms. To date, most of this work has been based on the asymptotic analysis of the algorithms. This analysis rarely tells us how an algorithm will perform in practice, and considerable experimentation and fine-tuning is typically required to get the most out of a theoretical idea. In this thesis, promising string matching algorithms discovered by the theoretical community are implemented, tested and refined to the point where they can be usefully applied in practice. In the course of this work we present the following new algorithms. We prove that the average-case time complexity of the new algorithms is linear, and we compare the new algorithms with existing algorithms experimentally. We implemented the existing one-dimensional string matching algorithms for English texts and, from the experimental results, identified the two best algorithms; we combined these two algorithms and introduced a new algorithm. We developed a new two-dimensional string matching algorithm which uses the structure of the pattern to reduce the number of comparisons required to search for the pattern. We described a method for efficiently storing text; although this reduces the storage space, it is not a compression method in the sense used in the literature, as our aim is to improve both the space and the time taken by a string matching algorithm, and our new algorithm searches for patterns in the efficiently stored text without decompressing it. We illustrated that by pre-processing the text we can improve the speed of the string matching algorithm when searching for a large number of patterns in a given text. We proposed a hardware solution for searching in an efficiently stored DNA text.
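
    As a toy illustration of searching in efficiently stored text (our sketch, not the scheme developed in the thesis), the code below packs a DNA text at 2 bits per base and runs a naive search directly on the packed array, so the full text is never reconstructed.

```python
# Pack DNA at 2 bits per base and search the packed representation directly.

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack_dna(text: str) -> bytearray:
    """Pack 4 bases per byte (2 bits each, high bits first)."""
    packed = bytearray((len(text) + 3) // 4)
    for i, base in enumerate(text):
        packed[i // 4] |= CODE[base] << (6 - 2 * (i % 4))
    return packed

def base_at(packed: bytearray, i: int) -> int:
    """2-bit code of the i-th base, read directly from the packed array."""
    return (packed[i // 4] >> (6 - 2 * (i % 4))) & 0b11

def search_packed(packed: bytearray, n: int, pattern: str) -> list[int]:
    """All occurrences of `pattern` in a packed text of n bases."""
    pat = [CODE[b] for b in pattern]
    return [i for i in range(n - len(pat) + 1)
            if all(base_at(packed, i + j) == c for j, c in enumerate(pat))]

if __name__ == "__main__":
    text = "ACGTACGGTACG"
    packed = pack_dna(text)
    print(len(packed), "bytes for", len(text), "bases")    # 3 bytes for 12 bases
    print(search_packed(packed, len(text), "ACG"))          # [0, 4, 9]
```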