Parallel Construction of Wavelet Trees on Multicore Architectures
The wavelet tree has become a very useful data structure to efficiently
represent and query large volumes of data in many different domains, from
bioinformatics to geographic information systems. One problem with wavelet
trees is their construction time. In this paper, we introduce two algorithms
that reduce the time complexity of a wavelet tree's construction by taking
advantage of now-ubiquitous multicore machines.
Our first algorithm constructs all the levels of the wavelet tree in parallel,
with time and working-space bounds expressed in terms of the size of the input
sequence and the size of the alphabet. Our second algorithm constructs the
wavelet tree in a domain-decomposition fashion, using our first algorithm in
each segment; its time and extra-space bounds additionally depend on the
number of available cores. Both algorithms are practical and report good
speedup for large real datasets.
Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Skłodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094
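As a rough illustration of why the levels of a wavelet tree can be built independently of one another, here is a minimal Python sketch. It is not the paper's algorithm: it assumes a levelwise (pointerless) layout with symbols encoded as integers in `[0, sigma)`, and the names `build_level` and `build_levels` are hypothetical. Each level's bitmap is derived from the input sequence alone, so the levels can be handed to separate workers. A thread pool is used only to show the structure; in CPython it will not yield a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def tree_height(sigma):
    """Number of levels needed for an alphabet of size sigma (at least 1)."""
    return max(1, (sigma - 1).bit_length())


def build_level(seq, height, level):
    """Bitmap of one level, computed from the input sequence alone."""
    shift = height - level
    # A stable sort by the top `level` bits groups symbols by tree node,
    # left to right, while preserving their relative order inside each node.
    ordered = sorted(seq, key=lambda c: c >> shift)
    # The bit stored at this level is the next most significant bit.
    return [(c >> (shift - 1)) & 1 for c in ordered]


def build_levels(seq, sigma, workers=None):
    """Build every level independently; returns one bitmap per level."""
    height = tree_height(sigma)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda lvl: build_level(seq, height, lvl),
                             range(height)))


if __name__ == "__main__":
    for bitmap in build_levels([0, 3, 1, 2, 3, 0], sigma=4):
        print(bitmap)
```

Sorting the whole sequence at every level keeps the sketch short at the cost of extra work per level; it is meant to show the independence of the levels, not to match any stated bound.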
About BIRDS project (Bioinformatics and Information Retrieval Data Structures Analysis and Design)
BIRDS stands for "Bioinformatics and Information Retrieval Data Structures
Analysis and Design" and is a 4-year project (2016-2019) that has received
funding from the European Union's Horizon 2020 research and innovation
programme under the Marie Skłodowska-Curie grant agreement No 690941.
The overall goal of BIRDS is to establish a long-term international network
of leading researchers in the development of efficient data structures
in the fields of Bioinformatics and Information Retrieval, to strengthen the
partnership through the exchange of knowledge and expertise, and to develop
integrated approaches that improve on current methods in both fields. The
research will address challenges in storing, processing, indexing, searching
and navigating genome-scale data by designing new algorithms and data
structures for sequence analysis, network representation, and the compression
and indexing of repetitive data.
The BIRDS project is carried out by 7 research institutions from Australia
(University of Melbourne), Chile (University of Chile and University of
Concepción), Finland (University of Helsinki), Japan (Kyushu University),
Portugal (Instituto de Engenharia de Sistemas e Computadores,
Investigação e Desenvolvimento em Lisboa, INESC-ID), and Spain
(University of A Coruña), and a Spanish SME (Enxenio S.L.). It is coordinated
by the University of A Coruña (Spain).
Comment: This research has received funding from the European Union's Horizon
2020 research and innovation programme under the Marie Skłodowska-Curie
Actions H2020-MSCA-RISE-2015 BIRDS GA No. 690941. CERI 201
Parallel text index construction
The focus of this dissertation is the parallel construction of text indices. Text indices provide additional information about a text that allows queries on it to be answered faster. Full-text indices, for example, are used to efficiently answer phrase queries, i.e., whether and where a phrase occurs in a text. The research in this dissertation is focused on, but not limited to, parallel construction algorithms for text indices in both shared and distributed memory.
In the first part, we look at wavelet trees: a compact index that generalizes rank and select queries from binary alphabets to alphabets of arbitrary size. In the second part, we consider the suffix array, one of the most researched text indices. The suffix array of a text contains the starting positions of the text's suffixes in lexicographic order, so constructing it amounts to sorting all suffixes of the text. The suffix array is often extended by the longest common prefix array (LCP array), which contains the length of the longest common prefix of each pair of lexicographically consecutive suffixes. Finally, we use distributed suffix arrays and LCP arrays to construct distributed Patricia tries. This allows us to answer different phrase queries more efficiently than using only the suffix array.
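The definitions of the suffix array and LCP array in the abstract above can be made concrete with a small sequential sketch in Python. This is a naive illustration under stated assumptions, not one of the dissertation's parallel or distributed algorithms: suffixes are sorted by direct string comparison, and the function names `suffix_array`, `lcp_array`, and `find_occurrences` are hypothetical.

```python
def suffix_array(text):
    """Starting positions of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] is the longest common prefix length of the suffixes at
    sa[k - 1] and sa[k]; lcp[0] is 0 by convention."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while (a + n < len(text) and b + n < len(text)
               and text[a + n] == text[b + n]):
            n += 1
        lcp[k] = n
    return lcp


def find_occurrences(text, sa, pattern):
    """Phrase query: all start positions of pattern, via two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    print(sa)                                 # [5, 3, 1, 0, 4, 2]
    print(lcp_array(text, sa))                # [0, 1, 3, 0, 0, 2]
    print(find_occurrences(text, sa, "ana"))  # [1, 3]
```

Because all occurrences of a phrase are prefixes of suffixes that sit next to each other in the suffix array, the query reduces to locating one contiguous range, which is what the two binary searches do.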