1,175 research outputs found
A Study on Efficient and Secure Set Similarity Joins
University of Tsukuba
Parallelizing Set Similarity Joins
One of today's major challenges in data science is to compare and relate data of similar nature. The join operation known from relational databases can help solve this problem: given a collection of records, it finds all pairs of records that fulfill a user-chosen predicate. Real-world problems may require complex predicates, such as similarity. A common way to measure similarity is with set similarity functions. In order to use set similarity functions as predicates, we assume records to be represented by sets of tokens. In this thesis, we focus on the set similarity join (SSJ) operation.
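A minimal sketch of this problem setting (records as token sets, Jaccard as the user-chosen similarity predicate; the naive quadratic loop is for illustration only, not one of the approaches studied in the thesis):

    # Naive set similarity join (SSJ): records are sets of tokens and a pair
    # joins when a set similarity function (here Jaccard) meets a threshold.
    from itertools import combinations

    def jaccard(r, s):
        # |r & s| / |r | s| for two token sets
        return len(r & s) / len(r | s)

    def set_similarity_join(records, threshold):
        # Quadratic all-pairs scan; exactly the cost that motivates the
        # parallel and distributed approaches described here.
        return [(i, j)
                for (i, r), (j, s) in combinations(enumerate(records), 2)
                if jaccard(r, s) >= threshold]

    records = [{"data", "science", "join"},
               {"data", "join", "similarity"},
               {"gpu", "kernel"}]
    print(set_similarity_join(records, 0.5))  # -> [(0, 1)]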
The amount of data to be processed today is typically large and grows continually. At the same time, the SSJ is a compute-intensive operation. To cope with the increasing size of input data, additional means are needed to develop scalable implementations of the SSJ. In this thesis, we focus on parallelization. We make the following three major contributions to the SSJ.
First, we survey the state of the art in parallelizing the SSJ, comparing ten MapReduce-based approaches from the literature both analytically and experimentally. Surprisingly, their main limitation is low scalability, caused by excessive and/or skewed data replication. None of the approaches could compute the join on large datasets.
Second, we leverage the abundant CPU parallelism of modern commodity hardware, which had not previously been used to scale the SSJ. We propose a novel data-parallel, multi-threaded SSJ that achieves significant speedups over single-threaded execution.
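The thesis' own algorithm is not reproduced in this abstract; a generic, hedged sketch of the data-parallel idea is to have each thread verify an independent slice of the candidate space (in CPython, real speedups additionally require a verification kernel that releases the GIL, or processes instead of threads):

    # Generic data-parallel SSJ sketch (illustrative, not the thesis' method):
    # split the record indices across threads; each thread verifies its slice.
    from concurrent.futures import ThreadPoolExecutor

    def jaccard(r, s):
        return len(r & s) / len(r | s)

    def join_slice(ids, records, threshold):
        return [(i, j) for i in ids for j in range(i + 1, len(records))
                if jaccard(records[i], records[j]) >= threshold]

    def parallel_ssj(records, threshold, n_threads=8):
        slices = [range(k, len(records), n_threads) for k in range(n_threads)]
        with ThreadPoolExecutor(max_workers=n_threads) as pool:
            parts = pool.map(lambda ids: join_slice(ids, records, threshold),
                             slices)
            return [pair for part in parts for pair in part]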
Third, we propose a novel, highly scalable, distributed SSJ approach. With a cost-based heuristic and a data-independent scaling mechanism, it avoids data replication and recomputation; the heuristic assigns similar shares of the compute cost to each node. Our approach significantly scales up join execution and processes much larger datasets than all parallel approaches designed and implemented so far.
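The cost-based heuristic itself is not spelled out in the abstract; a hedged stand-in for the stated goal (similar shares of compute cost per node) is the classic greedy longest-processing-time assignment:

    # Greedy cost balancing (illustrative stand-in, not the thesis' heuristic):
    # repeatedly give the most expensive remaining task to the least-loaded node.
    import heapq

    def balance_costs(task_costs, n_nodes):
        nodes = [(0.0, k, []) for k in range(n_nodes)]  # (load, node id, tasks)
        heapq.heapify(nodes)
        for cost in sorted(task_costs, reverse=True):
            load, k, tasks = heapq.heappop(nodes)
            tasks.append(cost)
            heapq.heappush(nodes, (load + cost, k, tasks))
        return sorted(nodes)

    print(balance_costs([9, 7, 6, 5, 4, 2], n_nodes=3))
    # -> every node ends up with load 11: similar shares of compute cost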
Breaking Computational Barriers to Perform Time Series Pattern Mining at Scale and at the Edge
Uncovering repeated behavior in time series is an important problem in many domains, such as medicine, geophysics, and meteorology. With the continuing surge of smart/embedded devices generating time series data, there is an ever-growing need to analyze datasets of increasing size. Additionally, there is an increasing need for analysis on low-power edge devices, due to latency limits inherent to the speed of light and the sheer amount of data being recorded. The matrix profile has proven to be a tool highly suitable for pattern mining in time series; however, a naive approach to computing it makes it impossible to use effectively both in the cloud and at the edge. This dissertation shows how, through the use of GPUs and machine learning, the matrix profile can be computed feasibly, both at cloud scale and at sensor scale. In addition, it illustrates why both kinds of computation matter and what new insights they can provide to practitioners working with time series data.
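A hedged sketch of the bottleneck mentioned above (the naive matrix profile: for every subsequence, the z-normalized Euclidean distance to its nearest non-trivial neighbor; the dissertation's GPU and learned approaches exist precisely because this quadratic computation does not scale):

    # Naive matrix profile (illustration only): O(n^2) distance evaluations.
    import numpy as np

    def matrix_profile(ts, m):
        n = len(ts) - m + 1
        windows = np.array([ts[i:i + m] for i in range(n)])
        # z-normalize each length-m subsequence
        z = (windows - windows.mean(axis=1, keepdims=True)) \
            / windows.std(axis=1, keepdims=True)
        mp = np.full(n, np.inf)
        for i in range(n):
            d = np.linalg.norm(z - z[i], axis=1)
            d[max(0, i - m // 2): i + m // 2 + 1] = np.inf  # exclusion zone
            mp[i] = d.min()
        return mp

    mp = matrix_profile(np.sin(np.linspace(0, 20, 400)), m=50)
    print(mp.argmin())  # window whose pattern repeats most closely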
KOIOS: Top-k Semantic Overlap Set Search
We study the top-k set similarity search problem using semantic overlap. While vanilla overlap requires exact matches between set elements, semantic overlap allows elements that are syntactically different but semantically related to increase the overlap. The semantic overlap is the maximum matching score of a bipartite graph, where an edge weight between two set elements is defined by a user-defined similarity function, e.g., cosine similarity between embeddings. Common techniques like token indexes fail for semantic search since similar elements may be unrelated at the character level. Further, verifying candidates is expensive (cubic versus linear for syntactic overlap), calling for highly selective filters. We propose KOIOS, the first exact and efficient algorithm for semantic overlap search. KOIOS leverages sophisticated filters to minimize the number of required graph-matching calculations. Our experiments show that for medium to large sets, less than 5% of the candidate sets need verification, and more than half of those sets are further pruned without requiring the expensive graph matching. We show the efficiency of our algorithm on four real datasets and demonstrate the improved result quality of semantic over vanilla set similarity search.
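Since the semantic overlap is defined above as a maximum matching score, it can be computed directly (at cubic cost) with the Hungarian algorithm. The sketch below assumes each set element comes with an embedding vector; this is the expensive verification step that KOIOS's filters are designed to avoid:

    # Semantic overlap of two sets via maximum-weight bipartite matching,
    # with cosine similarity between element embeddings as edge weights.
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def semantic_overlap(emb_r, emb_s):
        # emb_r: |R| x d, emb_s: |S| x d embedding matrices
        a = emb_r / np.linalg.norm(emb_r, axis=1, keepdims=True)
        b = emb_s / np.linalg.norm(emb_s, axis=1, keepdims=True)
        weights = a @ b.T                        # pairwise cosine similarities
        rows, cols = linear_sum_assignment(weights, maximize=True)
        return weights[rows, cols].sum()         # maximum matching score

    rng = np.random.default_rng(0)
    print(semantic_overlap(rng.random((4, 16)), rng.random((6, 16))))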
Hierarchical and Adaptive Filter and Refinement Algorithms for Geometric Intersection Computations on GPU
Geometric intersection algorithms are fundamental to spatial analysis in Geographic Information Systems (GIS). This dissertation explores a high-performance computing solution for geometric intersection over huge amounts of spatial data using the Graphics Processing Unit (GPU). We have developed a hierarchical filter and refinement system for parallel geometric intersection operations involving large polygons and polylines, extending the classical filter-and-refine algorithm with efficient filters that leverage GPU computing. The inputs are two layers of large polygonal datasets, and the computations are spatial intersections on pairs of cross-layer polygons. These intersections are the compute-intensive spatial data analytic kernels in spatial join and map overlay operations in spatial databases and GIS. Efficient filters, such as PolySketch, PolySketch++, and point-in-polygon filters, have been developed to reduce the refinement workload on GPUs. We also show the application of such filters in speeding up line segment intersections and point-in-polygon tests. Programming models like CUDA and OpenACC have been used to implement the different versions of the Hierarchical Filter and Refine (HiFiRe) system. Experimental results show good performance of our filter and refinement algorithms. Compared to the standard R-tree filter, our filter technique can, on average, still discard 76% of polygon pairs that do not have segment intersection points. The PolySketch filter reduces, on average, 99.77% of the workload of finding line segment intersections. Compared to the existing Common Minimum Bounding Rectangle (CMBR) filter, which is applied to each cross-layer candidate pair, the workload after using the PolySketch-based CMBR filter is on average 98% smaller. The execution time of our HiFiRe system on two shapefiles, USA Water Bodies (464K polygons) and USA Block Group Boundaries (220K polygons), is about 3.38 seconds on an NVIDIA Titan V GPU.
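Filter-and-refine in miniature (a hedged sketch; PolySketch and the CMBR filter are more elaborate than this): a cheap minimum-bounding-rectangle overlap test discards cross-layer pairs before any exact, expensive intersection geometry runs:

    # MBR filter step of filter-and-refine (illustration only).
    def mbr(polygon):
        xs, ys = zip(*polygon)                  # polygon: list of (x, y)
        return min(xs), min(ys), max(xs), max(ys)

    def mbrs_overlap(a, b):
        ax0, ay0, ax1, ay1 = a
        bx0, by0, bx1, by1 = b
        return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

    def candidate_pairs(layer1, layer2):
        # Only surviving pairs proceed to exact polygon intersection (refine).
        boxes2 = [mbr(p) for p in layer2]
        return [(i, j) for i, p in enumerate(layer1)
                for j, b in enumerate(boxes2) if mbrs_overlap(mbr(p), b)]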
- …