354 research outputs found

    Scalable String and Suffix Sorting: Algorithms, Techniques, and Tools

    Get PDF
    This dissertation focuses on two fundamental sorting problems: string sorting and suffix sorting. The first part considers parallel string sorting on shared-memory multi-core machines, the second part external memory suffix sorting using the induced sorting principle, and the third part distributed external memory suffix sorting with a new distributed algorithmic big data framework named Thrill.Comment: 396 pages, dissertation, Karlsruher Instituts f\"ur Technologie (2018). arXiv admin note: text overlap with arXiv:1101.3448 by other author

    Parallel text index construction

    Get PDF
    In dieser Dissertation betrachten wir die parallele Konstruktion von Text-Indizes. Text-Indizes stellen Zusatzinformationen über Texte bereit, die Anfragen hinsichtlich dieser Texte beschleunigen können. Ein Beispiel hierfür sind Volltext-Indizes, welche für eine effiziente Phrasensuche genutzt werden, also etwa für die Frage, ob eine Phrase in einem Text vorkommt oder nicht. Diese Dissertation befasst sich hauptsächlich, aber nicht ausschließlich mit der parallelen Konstruktion von Text-Indizes im geteilten und verteilten Speicher. Im ersten Teil der Dissertation betrachten wir Wavelet-Trees. Dabei handelt es sich um kompakte Indizes, welche Rank- und Select-Anfragen von binären Alphabeten auf Alphabete beliebiger Größe verallgemeinern. Im zweiten Teil der Dissertation betrachten wir das Suffix-Array, den am besten erforschten Text-Index überhaupt. Das Suffix-Array enthält die Startpositionen aller lexikografisch sortierten Suffixe eines Textes, d.h., wir möchten alle Suffixe eines Textes sortieren. Oft wird das Suffix-Array um das Longest-Common-Prefix-Array (LCP-Array) erweitert. Das LCP-Array enthält die Länge der längsten gemeinsamen Präfixe zweier lexikografisch konsekutiven Suffixe. Abschließend nutzen wir verteilte Suffix- und LCP-Arrays, um den Distributed-Patricia-Trie zu konstruieren. Dieser erlaubt es uns, verschiedene Phrase-Anfragen effizienter zu beantworten, als wenn wir nur das Suffix-Array nutzen.The focus of this dissertation is the parallel construction of text indices. Text indices provide additional information about a text that allow to answer queries faster. Full-text indices for example are used to efficiently answer phrase queries, i.e., if and where a phrase occurs in a text. The research in this dissertation is focused on but not limited to parallel construction algorithms for text indices in both shared and distributed memory. In the first part, we look at wavelet trees: a compact index that generalizes rank and select queries from binary alphabets to alphabets of arbitrary size. In the second part of this dissertation, we consider the suffix array---one of the most researched text indices.The suffix array of a text contains the starting positions of the text's lexicographically sorted suffixes, i.e., we want to sort all its suffixes. Finally, we use the distributed suffix arrays (and LCP arrays) to compute distributed Patricia tries. This allows us to answer different phrase queries more efficiently than using only the suffix array

    Space efficient algorithms for string processing

    Get PDF
    The suffix array (SA), which is an array containing the suffixes of a string sorted into lexicographical order, was introduced in the late eighties as a space efficient alternative to the suffix tree. It has since emerged as a useful data structure in string processing problems such as pattern matching, pattern discovery, and data compression. The SA is often coupled with the longest-common-prefix (LCP) array that contains the length of the longest common prefixes between consecutive suffixes in the SA. When enhanced with the LCP array, the SA can provide efficient solutions to the above applications including a problem called pattern mining. To date, all the mining algorithms lie at either extreme of the efficiency spectrum: they are either fast and use enormous amounts of space, or they are compact and orders of magnitude slower. We present a mining algorithm that achieves the best of both these extremes, having runtime comparable to the fastest published algorithms while using less space than the most space efficient. In all these applications, the construction of the SA --- also known as suffix sorting --- is one of the main computational bottlenecks. Most papers describing the SA assume the SA fits in RAM memory, limiting their applications. The fastest algorithms in this large memory suffix sorting category use powerful pointer copying heuristics to expedite suffix sorting. Several space efficient algorithms have emerged in the last five years, where the trend is to use as little RAM as possible. They do so by finding a clever way to trade runtime, or by using slow compressed data structures, or by using external memory (disk), or some combination of these techniques. In this thesis, we focus on improving the runtime of a space efficient algorithm due to Kärkkäinen by adapting the heuristics from large memory suffix sorting to a semi-external setting. Also, pointer copying has been heavily used to speed up the construction of the SA, but not the LCP array. We also discuss our attempts of combining the pointer copying heuristics to an efficient LCP construction algorithm due to Kärkkäinen, Manzini and Puglisi. The Burrows-Wheeler transform (BWT) was discovered independently of the SA, but it is now known that the two data structures are deeply linked. The BWT is central to practical compression tools such as szip and bzip2. Many papers have been published on constructing the BWT either in RAM or in external memory but few on inverting the BWT to obtain the original string --- in fact none in external memory. For larger datasets, the existing traditional approaches cannot be used to invert the BWT. In such cases, we have to use disk. We close the gap between theory and practice by examining the problem of inverting the BWT efficiently on disk. We provide a practical implementation of the only theoretical proposal for the problem by Ferragina, Gagie and Manzini. We also provide new, faster solutions to the problem based on simple scanning and compression techniques

    Inducing Suffix and LCP Arrays in External Memory

    Get PDF
    We consider full text index construction in external memory (EM). Our first contribution is an inducing algorithm for suffix arrays in external memory, which utilizes an efficient EM priority queue and runs in sorting complexity. Practical tests show that this algorithm outperforms the previous best EM suffix sorter [Dementiev et al., JEA 2008] by a factor of about two in time and I/O-volume. Our second contribution is to augment the first algorithm to also construct the array of longest common prefixes (LCPs). This yields the first EM construction algorithm for LCP arrays. The overhead in time and I/O volume for this extended algorithm over plain suffix array construction is roughly two. Our algorithms scale far beyond problem sizes previously considered in the literature (text size of 80 GiB using only 4 GiB of RAM in our experiments).

    Faster External Memory LCP Array Construction

    Get PDF
    The suffix array, perhaps the most important data structure in modern string processing, needs to be augmented with the longest-common-prefix (LCP) array in many applications. Their construction is often a major bottleneck especially when the data is too big for internal memory. We describe two new algorithms for computing the LCP array from the suffix array in external memory. Experiments demonstrate that the new algorithms are about a factor of two faster than the fastest previous algorithm
    corecore