    Scalable Sequence Similarity Search and Join in Main Memory on Multi-Cores

    Abstract. Similarity-based queries play an important role in many large-scale applications. In bioinformatics, DNA sequencing produces huge collections of strings that need to be compared and merged. One strategy to speed up similarity-based queries is parallelization on clusters using MapReduce. However, distributing data over a cluster also incurs high cost. At the same time, modern hardware offers parallelization through multi-cores and can be equipped with large main memories at low cost. We present PeARL, a data structure and algorithms for similarity-based queries on many-core servers. PeARL indexes large string collections in compressed tries which are held entirely in main memory. Searches and joins are parallelized using MapReduce as the underlying execution paradigm. We show that our data structure supports many real-world sequence comparison applications in main memory. Our evaluation reveals that PeARL achieves a significant performance gain compared to single-threaded solutions. However, the evaluation also shows that scalability should be further improved, e.g., by reducing sequential parts of the algorithms.
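
    The general idea of a trie-based similarity search can be sketched as follows. This is not PeARL's implementation: the abstract gives no algorithmic details, so the sketch assumes edit distance as the similarity measure, uses a plain (uncompressed) trie, and stands in a thread pool's map step for the MapReduce paradigm. All names (TrieNode, build_trie, search, parallel_search) are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class TrieNode:
    """One node of a plain, uncompressed trie (an assumption of this sketch)."""

    def __init__(self):
        self.children = {}
        self.terminal = False  # True if an indexed string ends here


def build_trie(strings):
    """Index all strings in a trie and return its root."""
    root = TrieNode()
    for s in strings:
        node = root
        for ch in s:
            node = node.children.setdefault(ch, TrieNode())
        node.terminal = True
    return root


def search(root, query, k):
    """Return, sorted, all indexed strings within edit distance k of query."""
    results = []

    def walk(node, prefix, prev_row):
        # prev_row[j] = edit distance between prefix and query[:j]
        if node.terminal and prev_row[-1] <= k:
            results.append(prefix)
        for ch, child in node.children.items():
            row = [prev_row[0] + 1]
            for j in range(1, len(query) + 1):
                cost = 0 if query[j - 1] == ch else 1
                row.append(min(row[j - 1] + 1,          # insertion
                               prev_row[j] + 1,         # deletion
                               prev_row[j - 1] + cost)) # match or substitution
            # Prune: no extension of this prefix can get back under k.
            if min(row) <= k:
                walk(child, prefix + ch, row)

    walk(root, "", list(range(len(query) + 1)))
    return sorted(results)


def parallel_search(root, queries, k, workers=4):
    """Map step: one search per query; reduce step: collect into a dict.

    Threads only illustrate the structure here; in CPython they do not
    give real multi-core speedup for this CPU-bound work.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        hits = pool.map(lambda q: search(root, q, k), queries)
        return dict(zip(queries, hits))


index = build_trie(["ACGT", "ACGA", "TTTT", "ACG"])
print(parallel_search(index, ["ACGT", "TTTA"], k=1))
```

    Because strings with a common prefix share one path in the trie, the dynamic-programming row for that prefix is computed once and whole subtrees are skipped as soon as every entry in the row exceeds the threshold.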