207 research outputs found

    Selectivity estimation on set containment search

    © Springer Nature Switzerland AG 2019. In this paper, we study the problem of selectivity estimation on set containment search. Given a query record Q and a record dataset S, we aim to accurately and efficiently estimate the selectivity of the set containment search of Q over S. The problem has many important applications in commercial fields and scientific studies. To the best of our knowledge, this is the first work to study this problem. We first extend existing distinct-value estimation techniques to solve it and develop IL-GKMV, an approach based on inverted lists and the G-KMV sketch. Our analysis shows that the performance of IL-GKMV degrades as the vocabulary size increases. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we develop effective and efficient sampling approaches and propose OT-Sampling, a sampling approach based on an ordered trie structure. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than both simple random sampling and IL-GKMV. To further enhance performance, we present DC-Sampling, a divide-and-conquer sampling approach that uses an inclusion/exclusion prefix to exploit pruning opportunities. We theoretically analyse the proposed techniques with respect to various accuracy estimators. Comprehensive experiments on 6 real datasets verify the effectiveness and efficiency of the proposed techniques.
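
The simple random sampling baseline that the abstract above compares against can be sketched roughly as follows. This is only an illustration and not the paper's OT-Sampling or DC-Sampling. It assumes that a record matches when it contains every element of Q (the abstract does not state the direction of containment; for the opposite direction, swap the subset test). The function names are hypothetical.

```python
import random

def exact_selectivity(query, records):
    """Count records that contain every element of the query (assumed direction)."""
    q = frozenset(query)
    return sum(1 for r in records if q <= r)

def sampled_selectivity(query, records, sample_size, seed=0):
    """Estimate the match count from a uniform sample drawn without replacement."""
    if not records:
        return 0.0
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    q = frozenset(query)
    hits = sum(1 for r in sample if q <= r)
    # Scale the hit ratio observed in the sample up to the full dataset size.
    return hits * len(records) / len(sample)

records = [frozenset({1, 2, 3}), frozenset({2, 3}), frozenset({1, 4}), frozenset({2, 3, 5})]
print(exact_selectivity({2, 3}, records))
print(sampled_selectivity({2, 3}, records, sample_size=2))
```

When the sample covers the whole dataset the estimate equals the exact count. With smaller samples the variance grows, which is the weakness the more structured sampling schemes aim to reduce.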

    Parallelizing Set Similarity Joins

    One of today's major challenges in data science is to compare and relate data of a similar nature. The join operation known from relational databases can help solve this problem. Given a collection of records, the join operation finds all pairs of records that fulfill a user-chosen predicate. Real-world problems may require complex predicates, such as similarity. Set similarity functions are a common way to measure similarity. In order to use set similarity functions as predicates, we assume records to be represented by sets of tokens. In this thesis, we focus on the set similarity join (SSJ) operation. The amount of data to be processed today is typically large and grows continually. The SSJ, on the other hand, is a compute-intensive operation. To cope with the increasing size of input data, additional means are needed to develop scalable SSJ implementations. In this thesis, we focus on parallelization. We make the following three major contributions to SSJ. First, we elaborate on the state of the art in parallelizing SSJ. We compare ten MapReduce-based approaches from the literature analytically and experimentally. Surprisingly, their main limitation is low scalability caused by excessive and/or skewed data replication. None of the approaches could compute the join on large datasets. Second, we leverage the abundant CPU parallelism of modern commodity hardware, which has not yet been considered for scaling SSJ. We propose a novel data-parallel multi-threaded SSJ. Our approach provides significant speedups compared to single-threaded execution. Third, we propose a novel, highly scalable distributed SSJ approach. With a cost-based heuristic and a data-independent scaling mechanism, we avoid data replication and recomputation. A heuristic assigns similar shares of compute costs to each node.
Our approach significantly scales up the join execution and processes much larger datasets than all parallel approaches designed and implemented so far
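
To make the join operation described in this abstract concrete, here is a minimal single-threaded sketch of a self-join with a set similarity predicate. It assumes Jaccard similarity as the set similarity function (the abstract does not name one). It enumerates all record pairs naively and is not any of the thesis's parallel approaches. The names `jaccard` and `naive_ssj` are hypothetical.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two token sets; two empty sets count as identical."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def naive_ssj(records, threshold):
    """Return index pairs (i, j), i < j, whose token sets meet the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if jaccard(a, b) >= threshold
    ]

records = [{"a", "b", "c"}, {"a", "b", "d"}, {"x", "y"}]
print(naive_ssj(records, 0.5))  # [(0, 1)]
```

The number of candidate pairs grows quadratically with the number of records, which is why the operation is compute-intensive and why partitioning this pair space across threads or nodes is attractive.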

    Top-k spatial-keyword publish/subscribe over sliding window

    © 2017, Springer-Verlag Berlin Heidelberg. With the prevalence of social media and GPS-enabled devices, a massive amount of geo-textual data has been generated in a streaming fashion, leading to a variety of applications such as location-based recommendation and information dissemination. In this paper, we investigate a novel real-time top-k monitoring problem over a sliding window of streaming data; that is, we continuously maintain the top-k most relevant geo-textual messages (e.g., geo-tagged tweets) for a large number of spatial-keyword subscriptions (e.g., registered users interested in local events) simultaneously. To provide the most recent information under a controllable memory cost, the sliding window model is employed on the streaming geo-textual data. To the best of our knowledge, this is the first work to study top-k spatial-keyword publish/subscribe over a sliding window. A novel centralized system, called Skype (Top-k Spatial-keyword Publish/Subscribe), is proposed in this paper. In Skype, to continuously maintain top-k results for massive numbers of subscriptions, we devise a novel indexing structure over subscriptions such that each incoming message can be delivered immediately on arrival. To reduce the expensive top-k re-evaluation cost triggered by message expiration, we develop a novel cost-based k-skyband technique that reduces the number of re-evaluations in a cost-effective way. Extensive experiments verify the efficiency and effectiveness of our proposed techniques. Furthermore, to support better scalability and higher throughput, we propose a distributed version of Skype, namely DSkype, on top of Storm, a popular distributed stream processing system. With the help of fine-tuned subscription/message distribution mechanisms, DSkype can achieve orders-of-magnitude speed-up over its centralized version.

    Efficient processing of large-scale spatio-temporal data

    Millions of location-aware devices, such as mobile phones, cars, and environmental sensors, constantly report their positions, often in combination with a timestamp, to a server for different kinds of analyses. While the location information of the devices and reported events is represented as points and polygons, raster data is another type of spatial data, which is produced, for example, by cameras and sensors. This big spatio-temporal data needs to be processed on scalable platforms such as Hadoop and Apache Spark, which, however, are unaware of properties such as spatial neighborhood, which makes them practically impossible to use for this kind of data. The repeated executions of the programs during development and by different users result in long execution times and potentially high costs in rented clusters, which can be reduced by reusing commonly computed intermediate results. Within this thesis, we tackle the two challenges described above. First, we present the STARK framework for processing spatio-temporal vector and raster data on the Apache Spark stack. For operators, we identify several possible algorithms and study how they can benefit from the underlying platform's properties. We further investigate how indexes can be realized in the distributed and parallel architecture of Big Data processing engines and compare methods for data partitioning, which cope differently with data skew and data set size. Furthermore, an approach to reduce the amount of data to process at operator level is presented.
In order to reduce the execution times, we introduce an approach to transparently recycle intermediate results of dataflow programs, based on operator costs. To compute the costs, we instrument the programs with profiling code to gather the execution time and result size of the operators. In the evaluation, we first compare the various implementation and configuration possibilities in STARK and identify scenarios when and how partitioning and indexing should be applied. We further compare STARK to related systems and show that we can achieve significantly better execution times, not only when exploiting existing partitioning information. In the second part of the evaluation, we show that with the transparent cost-based materialization and recycling of intermediate results, the execution times of programs can be reduced significantly

    Efficient Approximate String Matching with Synonyms and Taxonomies

    Strings are ubiquitous. When collected from various sources, strings are often inconsistent, meaning that the same or a similar meaning can be expressed in different forms, for example with typographical mistakes. Finding similar strings in such inconsistent datasets has been researched extensively in past years under an umbrella problem called approximate string matching. This thesis aims to enhance the quality of approximate string matching by detecting similar strings using their meanings in addition to typographical errors. Specifically, this thesis focuses on utilising synonyms and taxonomies, since both are commonly available knowledge sources. The research uses each type of knowledge to address either a selection task or a join task, where the former finds strings similar to a given string and the latter finds pairs of similar strings. The desired output is either all strings that are similar to at least a given degree (i.e., all-match) or the top-k most similar strings. The first contribution of this thesis is to address the top-k selection problem considering synonyms. Here, we propose algorithms with different optimisation goals: to minimise the space cost, to maximise the selection speed, or to maximise the selection speed under a space constraint. We model the last goal as a variant of a 0/1 knapsack problem and propose an efficient solution based on the branch and bound paradigm. Next, this thesis solves the top-k join problem considering taxonomy relations. Three algorithms, two based on sorted lists and one based on tries, are proposed, in which we use pre-computations to accelerate list scans or use predictions to eliminate unnecessary trie accesses. Experiments show that the trie-based algorithm has a very fast response time on a vast dataset. The third contribution of this thesis is to deal with the all-match join problem considering taxonomy relations.
To this end, we identify the shortcoming of a standard prefix filtering principle and propose an adaptive filtering algorithm that can be tuned to minimise join time. We also design a sampling-based estimation procedure to suggest the best parameter in a short time with high accuracy. Lastly, this thesis researches the all-match join task by integrating typographical errors, synonyms, and taxonomies simultaneously. Key contributions here include a new unified similarity measure that employs multiple measures, as well as a non-trivial approximation algorithm with a tight theoretical guarantee. We furthermore propose two prefix filtering principles, a fast heuristic and an accurate dynamic-programming method, both aimed at minimising join time.
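
For orientation, the standard prefix filtering principle mentioned in this abstract can be sketched as follows for plain token sets under a Jaccard threshold. This is an assumed textbook setting, not the thesis's adaptive or taxonomy-aware variants. It relies on the known property that two sets with Jaccard similarity at least t must share a token within their first |x| - ceil(t * |x|) + 1 tokens under a fixed global token order. All function names are hypothetical, and the threshold is assumed to satisfy 0 < t <= 1.

```python
import math
from collections import Counter, defaultdict

def jaccard(a, b):
    """Jaccard similarity of two token sets; two empty sets count as identical."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def prefix_filter_join(records, threshold):
    """All-match self-join: index pairs (i, j), i < j, with Jaccard >= threshold."""
    # Global order: rare tokens first, so prefixes tend to hit short index lists.
    freq = Counter(tok for rec in records for tok in rec)
    index = defaultdict(list)  # token -> ids of earlier records with it in their prefix
    results = []
    for rid, rec in enumerate(records):
        tokens = sorted(rec, key=lambda tok: (freq[tok], tok))
        # The small epsilon guards against float round-up shortening the prefix.
        prefix_len = len(tokens) - math.ceil(threshold * len(tokens) - 1e-9) + 1
        candidates = set()
        for tok in tokens[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)
        # Verification step: only candidates sharing a prefix token are compared.
        for cid in sorted(candidates):
            if jaccard(records[cid], rec) >= threshold:
                results.append((cid, rid))
    return results

records = [frozenset("abcd"), frozenset("abce"), frozenset("xyz"), frozenset("abcd")]
print(sorted(prefix_filter_join(records, 0.5)))
```

The filter only prunes pairs that cannot qualify, so the output matches a brute-force comparison of all pairs. How much work it saves depends on the prefix length.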
