6 research outputs found

    Parallel Construction of Wavelet Trees on Multicore Architectures

    Get PDF
    The wavelet tree has become a very useful data structure to efficiently represent and query large volumes of data in many different domains, from bioinformatics to geographic information systems. One problem with wavelet trees is their construction time. In this paper, we introduce two algorithms that reduce the time complexity of a wavelet tree's construction by taking advantage of nowadays ubiquitous multicore machines. Our first algorithm constructs all the levels of the wavelet in parallel in O(n)O(n) time and O(nlgâĄÏƒ+σlg⁥n)O(n\lg\sigma + \sigma\lg n) bits of working space, where nn is the size of the input sequence and σ\sigma is the size of the alphabet. Our second algorithm constructs the wavelet tree in a domain-decomposition fashion, using our first algorithm in each segment, reaching O(lg⁥n)O(\lg n) time and O(nlgâĄÏƒ+pσlg⁥n/lgâĄÏƒ)O(n\lg\sigma + p\sigma\lg n/\lg\sigma) bits of extra space, where pp is the number of available cores. Both algorithms are practical and report good speedup for large real datasets.Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk{\l}odowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

    About BIRDS project (Bioinformatics and Information Retrieval Data Structures Analysis and Design)

    Full text link
    BIRDS stands for "Bioinformatics and Information Retrieval Data Structures analysis and design" and is a 4-year project (2016--2019) that has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie grant agreement No 690941. The overall goal of BIRDS is to establish a long term international network involving leading researchers in the development of efficient data structures in the fields of Bioinformatics and Information Retrieval, to strengthen the partnership through the exchange of knowledge and expertise, and to develop integrated approaches to improve current approaches in both fields. The research will address challenges in storing, processing, indexing, searching and navigating genome-scale data by designing new algorithms and data structures for sequence analysis, networks representation or compressing and indexing repetitive data. BIRDS project is carried out by 7 research institutions from Australia (University of Melbourne), Chile (University of Chile and University of Concepci\'on), Finland (University of Helsinki), Japan (Kyushu University), Portugal (Instituto de Engenharia de Sistemas e Computadores, Investiga\c{c}\~ao e Desenvolvimento em Lisboa, INESC-ID), and Spain (University of A Coru\~na), and a Spanish SME (Enxenio S.L.). It is coordinated by the University of A Coru\~na (Spain).Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. CERI 201

    Parallel External Memory Wavelet Tree and Wavelet Matrix Construction

    Get PDF

    Parallel text index construction

    Get PDF
    In dieser Dissertation betrachten wir die parallele Konstruktion von Text-Indizes. Text-Indizes stellen Zusatzinformationen ĂŒber Texte bereit, die Anfragen hinsichtlich dieser Texte beschleunigen können. Ein Beispiel hierfĂŒr sind Volltext-Indizes, welche fĂŒr eine effiziente Phrasensuche genutzt werden, also etwa fĂŒr die Frage, ob eine Phrase in einem Text vorkommt oder nicht. Diese Dissertation befasst sich hauptsĂ€chlich, aber nicht ausschließlich mit der parallelen Konstruktion von Text-Indizes im geteilten und verteilten Speicher. Im ersten Teil der Dissertation betrachten wir Wavelet-Trees. Dabei handelt es sich um kompakte Indizes, welche Rank- und Select-Anfragen von binĂ€ren Alphabeten auf Alphabete beliebiger GrĂ¶ĂŸe verallgemeinern. Im zweiten Teil der Dissertation betrachten wir das Suffix-Array, den am besten erforschten Text-Index ĂŒberhaupt. Das Suffix-Array enthĂ€lt die Startpositionen aller lexikografisch sortierten Suffixe eines Textes, d.h., wir möchten alle Suffixe eines Textes sortieren. Oft wird das Suffix-Array um das Longest-Common-Prefix-Array (LCP-Array) erweitert. Das LCP-Array enthĂ€lt die LĂ€nge der lĂ€ngsten gemeinsamen PrĂ€fixe zweier lexikografisch konsekutiven Suffixe. Abschließend nutzen wir verteilte Suffix- und LCP-Arrays, um den Distributed-Patricia-Trie zu konstruieren. Dieser erlaubt es uns, verschiedene Phrase-Anfragen effizienter zu beantworten, als wenn wir nur das Suffix-Array nutzen.The focus of this dissertation is the parallel construction of text indices. Text indices provide additional information about a text that allow to answer queries faster. Full-text indices for example are used to efficiently answer phrase queries, i.e., if and where a phrase occurs in a text. The research in this dissertation is focused on but not limited to parallel construction algorithms for text indices in both shared and distributed memory. In the first part, we look at wavelet trees: a compact index that generalizes rank and select queries from binary alphabets to alphabets of arbitrary size. In the second part of this dissertation, we consider the suffix array---one of the most researched text indices.The suffix array of a text contains the starting positions of the text's lexicographically sorted suffixes, i.e., we want to sort all its suffixes. Finally, we use the distributed suffix arrays (and LCP arrays) to compute distributed Patricia tries. This allows us to answer different phrase queries more efficiently than using only the suffix array
    corecore