    Parallelizing Set Similarity Joins

    One of today's major challenges in data science is to compare and relate data of similar nature. The join operation known from relational databases can help solve this problem: given a collection of records, the join finds all pairs of records that fulfill a user-chosen predicate. Real-world problems may require complex predicates, such as similarity. A common way to measure similarity is with set similarity functions; in order to use them as join predicates, we assume records to be represented by sets of tokens. In this thesis, we focus on the set similarity join (SSJ) operation. The amount of data to be processed today is typically large and grows continually, while the SSJ is a compute-intensive operation. To cope with the increasing size of input data, additional means are needed to develop scalable SSJ implementations. In this thesis, we focus on parallelization and make the following three major contributions to the SSJ. First, we elaborate on the state of the art in parallelizing the SSJ: we compare ten MapReduce-based approaches from the literature, both analytically and experimentally. Surprisingly, their main limitation is low scalability due to excessive and/or skewed data replication; none of the approaches could compute the join on large datasets. Second, we leverage the abundant CPU parallelism of modern commodity hardware, which had not previously been considered for scaling the SSJ, and propose a novel data-parallel multi-threaded SSJ. Our approach provides significant speedups compared to single-threaded execution. Third, we propose a novel, highly scalable distributed SSJ approach. With a cost-based heuristic and a data-independent scaling mechanism, we avoid data replication and recomputation; the heuristic assigns similar shares of the compute cost to each node. Our approach significantly scales up the join execution and processes much larger datasets than all parallel approaches designed and implemented so far.
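    The joined operation itself is easy to state: find all record pairs whose token sets are similar above a threshold. As a minimal illustration of the SSJ (not of the thesis's multi-threaded or distributed algorithms), the following Python sketch joins token sets under a Jaccard threshold, using the classic prefix filter to prune candidate pairs; the records and the threshold value are made-up examples.

```python
import math
from collections import defaultdict

def jaccard(r, s):
    """Jaccard similarity of two token sets."""
    inter = len(r & s)
    return inter / (len(r) + len(s) - inter)

def ssj(records, threshold):
    """Return all pairs (i, j, sim) with jaccard >= threshold and i < j."""
    index = defaultdict(set)  # prefix token -> ids of records indexed so far
    results = []
    for i, rec in enumerate(records):
        tokens = sorted(rec)  # any fixed global token order works
        # Prefix filter: two sets with Jaccard >= t must share a token
        # within their first |r| - ceil(t * |r|) + 1 tokens.
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens)) + 1
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates |= index[tok]
            index[tok].add(i)
        for j in candidates:  # verify each surviving candidate exactly
            sim = jaccard(rec, records[j])
            if sim >= threshold:
                results.append((j, i, sim))
    return results

print(ssj([{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}], threshold=0.5))
# -> [(0, 1, 0.5)]
```

    The prefix filter shown here is a standard sequential building block; the thesis's contribution lies in parallelizing and distributing such joins, which this sketch deliberately leaves out.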

    KOIOS: Top-k Semantic Overlap Set Search

    We study the top-k set similarity search problem using semantic overlap. While vanilla overlap requires exact matches between set elements, semantic overlap allows elements that are syntactically different but semantically related to increase the overlap. The semantic overlap is the maximum matching score of a bipartite graph, where the edge weight between two set elements is defined by a user-defined similarity function, e.g., cosine similarity between embeddings. Common techniques like token indexes fail for semantic search, since similar elements may be unrelated at the character level. Further, verifying candidates is expensive (cubic versus linear for syntactic overlap), calling for highly selective filters. We propose KOIOS, the first exact and efficient algorithm for semantic overlap search. KOIOS leverages sophisticated filters to minimize the number of required graph-matching calculations. Our experiments show that for medium to large sets, fewer than 5% of the candidate sets need verification, and more than half of those sets are further pruned without requiring the expensive graph matching. We show the efficiency of our algorithm on four real datasets and demonstrate the improved result quality of semantic over vanilla set similarity search.
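    To make the cost argument concrete, the sketch below computes the semantic overlap of two sets as a maximum-weight bipartite matching, using SciPy's Hungarian solver; this is the cubic-cost verification step the abstract refers to, shown without KOIOS's filters, which exist precisely to avoid most such calls. The embedding table and the toy sets are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def semantic_overlap(set_r, set_s, embed):
    """Maximum matching score of the bipartite graph between two sets."""
    r, s = list(set_r), list(set_s)
    # Edge weights from a user-defined similarity function, here cosine
    # similarity between (hypothetical) element embeddings.
    w = np.array([[cosine(embed[a], embed[b]) for b in s] for a in r])
    rows, cols = linear_sum_assignment(w, maximize=True)  # O(n^3) step
    return float(w[rows, cols].sum())

# "car" and "auto" never match at the character level, yet their embedding
# similarity lets them contribute to the overlap.
embed = {"car": np.array([1.0, 0.1]), "auto": np.array([0.9, 0.2]),
         "tree": np.array([0.0, 1.0])}
print(semantic_overlap({"car", "tree"}, {"auto"}, embed))  # ~0.99
```

    In a top-k search, this matching would run once per unpruned candidate set, which is why filters that skip it pay off so strongly.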

    Hierarchical and Adaptive Filter and Refinement Algorithms for Geometric Intersection Computations on GPU

    Geometric intersection algorithms are fundamental to spatial analysis in Geographic Information Systems (GIS). This dissertation explores high-performance computing solutions for geometric intersection over huge amounts of spatial data using the Graphics Processing Unit (GPU). We have developed a hierarchical filter and refinement system for parallel geometric intersection operations involving large polygons and polylines, extending the classical filter-and-refine algorithm with efficient filters that leverage GPU computing. The inputs are two layers of large polygonal datasets, and the computations are spatial intersections on pairs of cross-layer polygons. These intersections are the compute-intensive spatial data analytic kernels in spatial join and map overlay operations in spatial databases and GIS. Efficient filters, such as PolySketch, PolySketch++, and point-in-polygon filters, have been developed to reduce the refinement workload on GPUs. We also show the application of such filters in speeding up line segment intersections and point-in-polygon tests. Programming models like CUDA and OpenACC have been used to implement the different versions of the Hierarchical Filter and Refine (HiFiRe) system. Experimental results show good performance of our filter and refinement algorithms. Compared to the standard R-tree filter, our filter technique can still discard, on average, 76% of the polygon pairs that have no segment intersection points. The PolySketch filter reduces the workload of finding line segment intersections by 99.77% on average. Compared to the existing Common Minimum Bounding Rectangle (CMBR) filter, which is applied to each cross-layer candidate pair, the workload after using the PolySketch-based CMBR filter is on average 98% smaller. The execution time of our HiFiRe system on two shapefiles, USA Water Bodies (464K polygons) and USA Block Group Boundaries (220K polygons), is about 3.38 seconds on an NVIDIA Titan V GPU.
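    The following sketch shows the generic filter-and-refine pattern that HiFiRe builds on: a cheap MBR (minimum bounding rectangle) overlap test discards cross-layer polygon pairs that cannot intersect, and only surviving pairs reach the expensive exact segment-intersection test. It is a sequential CPU illustration of the pattern with made-up polygons, not the paper's GPU PolySketch or CMBR filters.

```python
from itertools import product

def mbr(poly):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a polygon."""
    xs, ys = zip(*poly)
    return min(xs), min(ys), max(xs), max(ys)

def mbrs_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def segs_cross(p, q, r, s):
    # Orientation test for a proper crossing of segments pq and rs
    # (collinear and touching cases are omitted for brevity).
    def orient(a, b, c):
        v = (b[0]-a[0])*(c[1]-a[1]) - (b[1]-a[1])*(c[0]-a[0])
        return (v > 0) - (v < 0)
    return (orient(p, q, r) != orient(p, q, s)
            and orient(r, s, p) != orient(r, s, q))

def edges(poly):
    return [(poly[i], poly[(i + 1) % len(poly)]) for i in range(len(poly))]

def refine(poly_a, poly_b):
    # Exact, expensive step: test every cross pair of boundary edges.
    return any(segs_cross(p, q, r, s)
               for (p, q), (r, s) in product(edges(poly_a), edges(poly_b)))

def boundary_join(layer1, layer2):
    """Yield cross-layer pairs whose boundaries intersect (containment of
    one polygon fully inside another is not detected by this sketch)."""
    boxes1, boxes2 = [mbr(p) for p in layer1], [mbr(p) for p in layer2]
    for i, j in product(range(len(layer1)), range(len(layer2))):
        if mbrs_overlap(boxes1[i], boxes2[j]):   # cheap filter
            if refine(layer1[i], layer2[j]):     # expensive refinement
                yield i, j

square   = [(0, 0), (4, 0), (4, 4), (0, 4)]
triangle = [(2, 2), (6, 2), (6, 6)]
print(list(boundary_join([square], [triangle])))  # -> [(0, 0)]
```

    On a GPU, both phases would be mapped onto thousands of threads over candidate pairs; HiFiRe's contribution is a hierarchy of much tighter filters (PolySketch, PolySketch++, point-in-polygon) that shrink the refinement workload before that step.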