6 research outputs found

    Parallel Construction of Wavelet Trees on Multicore Architectures

    Get PDF
    The wavelet tree has become a very useful data structure to efficiently represent and query large volumes of data in many different domains, from bioinformatics to geographic information systems. One problem with wavelet trees is their construction time. In this paper, we introduce two algorithms that reduce the time complexity of a wavelet tree's construction by taking advantage of nowadays ubiquitous multicore machines. Our first algorithm constructs all the levels of the wavelet in parallel in O(n)O(n) time and O(nlgσ+σlgn)O(n\lg\sigma + \sigma\lg n) bits of working space, where nn is the size of the input sequence and σ\sigma is the size of the alphabet. Our second algorithm constructs the wavelet tree in a domain-decomposition fashion, using our first algorithm in each segment, reaching O(lgn)O(\lg n) time and O(nlgσ+pσlgn/lgσ)O(n\lg\sigma + p\sigma\lg n/\lg\sigma) bits of extra space, where pp is the number of available cores. Both algorithms are practical and report good speedup for large real datasets.Comment: This research has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Sk{\l}odowska-Curie Actions H2020-MSCA-RISE-2015 BIRDS GA No. 69094

    Optimization of Linked List Prefix Computations on Multithreaded GPUs Using CUDA

    Get PDF
    We present a number of optimization techniques to compute prefix sums on linked lists and implement them on multithreaded GPUs using CUDA. Prefix computations on linked structures involve in general highly irregular fine grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high bandwidth memory accesses, they may appear at first to be primarily geared toward streamed, highly data parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations. We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. Our experimental results show scalability for list sizes ranging from 1M nodes to 256M nodes, and significantly improve on the recently published parallel implementations of list ranking, including implementations on the Cell Processor, the MTA-8, and the NVIDIA GeForce 200 series. They also compare favorably to the performance of the best known CUDA algorithm for the scan operation on the Tesla C1060

    High-Performance Computing Algorithms for Constructing Inverted Files on Emerging Multicore Processors

    Get PDF
    Current trends in processor architectures increasingly include more cores on a single chip and more complex memory hierarchies, and such a trend is likely to continue in the foreseeable future. These processors offer unprecedented opportunities for speeding up demanding computations if the available resources can be effectively utilized. Simultaneously, parallel programming languages such as OpenMP and MPI have been commonly used on clusters of multicore CPUs while newer programming languages such as OpenCL and CUDA have been widely adopted on recent heterogeneous systems and GPUs respectively. The main goal of this dissertation is to develop techniques and methodologies for exploiting these emerging parallel architectures and parallel programming languages to solve large scale irregular applications such as the construction of inverted files. The extraction of inverted files from large collections of documents forms a critical component of all information retrieval systems including web search engines. In this problem, the disk I/O throughput is the major performance bottleneck especially when intermediate results are written onto disks. In addition to the I/O bottleneck, a number of synchronization and consistency issues must be resolved in order to build the dictionary and postings lists efficiently. To address these issues, we introduce a dictionary data structure using a hybrid of trie and B-trees and a high-throughput pipeline strategy that completely avoids the use of disks as temporary storage for intermediate results, while ensuring the consumption of the input data at a high rate. The high-throughput pipelined strategy produces parallel parsed streams that are consumed at the same rate by parallel indexers. The pipelined strategy is implemented on a single multicore CPU as well as on a cluster of such nodes. We were able to achieve a throughput of more than 262MB/s on the ClueWeb09 dataset on a single node. On a cluster of 32 nodes, our experimental results show scalable performance using different metrics, significantly improving on prior published results. On the other hand, we develop a new approach for handling time-evolving documents using additional small temporal indexing structures. The lifetime of the collection is partitioned into multiple time windows, which guarantees a very fast temporal query response time at a small space overhead relative to the non-temporal case. Extensive experimental results indicate that the overhead in both indexing and querying is small in this more complicated case, and the query performance can indeed be improved using finer temporal partitioning of the collection. Finally, we employ GPUs to accelerate the indexing process for building inverted files and to develop a very fast algorithm for the highly irregular list ranking problem. For the indexing problem, the workload is split between CPUs and GPUs in such a way that the strengths of both architectures are exploited. For the list ranking problem involved in the decompression of inverted files, an optimized GPU algorithm is introduced by reducing the problem to a large number of fine grain computations in such a way that the processing cost per element is shown to be close to the best possible

    Prefix computations on symmetric multiprocessors

    No full text
    We introduce a new prefix computation algorithm on linked lists which builds upon the sparse ruling set approach of Reid-Miller and Blelloch. Besides being somewhat simpler and requiring nearly half the number of memory accesses, we can bound our complexity with high probability instead of merely on average. Moreover, whereas Reid-Miller and Blelloch targeted their algorithm for implementation on a vector multiprocessor architecture, we develop our algorithm for implementation on the symmetric multiprocessor architecture (SMP). These symmetric multiprocessors dominate the high-end server market and are currently the primary candidate for constructing large scale multiprocessor systems. Our prefix computation algorithm was implemented in C using POSIX threads and run on four symmetric multiprocessors: the HP-Convex Exemplar (S-Class), the IBM SP-2 (High Node), the SGI Power Challenge, and the DEC AlphaServer. We ran our code using a variety of benchmarks which we identified to examine the dependence of our algorithm on memory access patterns. For some problems
    corecore