207 research outputs found

    Selectivity estimation on set containment search

    © Springer Nature Switzerland AG 2019. In this paper, we study the problem of selectivity estimation for set containment search. Given a query record Q and a record dataset S, we aim to accurately and efficiently estimate the selectivity of the set containment search of Q over S. The problem has many important applications in commercial and scientific settings, and to the best of our knowledge this is the first work to study it. We first extend existing distinct-value estimation techniques and develop an inverted-list and G-KMV sketch based approach, IL-GKMV. Our analysis shows that the performance of IL-GKMV degrades as the vocabulary size grows. Motivated by the limitations of existing techniques and the inherent challenges of the problem, we develop effective and efficient sampling approaches, starting with an ordered-trie-based sampling approach named OT-Sampling. OT-Sampling partitions records based on element frequency and occurrence patterns and is significantly more accurate than simple random sampling and IL-GKMV. To further enhance performance, a divide-and-conquer sampling approach, DC-Sampling, is presented with an inclusion/exclusion prefix to exploit pruning opportunities. We theoretically analyse the accuracy of the proposed estimators. Comprehensive experiments on six real datasets verify the effectiveness and efficiency of the proposed techniques.
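    The abstract contrasts its OT-Sampling and DC-Sampling techniques with a simple random sampling baseline. A minimal sketch of that baseline (function name and toy data are illustrative, not from the paper) estimates the containment selectivity as the fraction of sampled records that contain the query:

```python
import random

def estimate_containment_selectivity(query, records, sample_size, seed=42):
    """Estimate the fraction of records R in `records` with query ⊆ R
    by uniform random sampling (the baseline, not OT-/DC-Sampling)."""
    rng = random.Random(seed)
    sample = rng.sample(records, min(sample_size, len(records)))
    hits = sum(1 for r in sample if query.issubset(r))
    return hits / len(sample)

# Toy data: two of the four records contain {1, 2}.
records = [{1, 2, 3}, {2, 4}, {1, 2}, {5}]
print(estimate_containment_selectivity({1, 2}, records, sample_size=4))  # 0.5
```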

    Parallelizing Set Similarity Joins

    One of today's major challenges in data science is to compare and relate data of similar nature. The join operation known from relational databases can help to solve this problem: given a collection of records, it finds all pairs of records that fulfil a user-chosen predicate. Real-world problems may require complex predicates such as similarity, and a common way to measure similarity is through set similarity functions; to use them as join predicates, we assume records are represented by sets of tokens. In this thesis, we focus on the set similarity join (SSJ) operation. The amount of data to be processed today is typically large and grows continually, while the SSJ is a compute-intensive operation. To cope with the increasing size of input data, additional means are needed to develop scalable SSJ implementations; this thesis focuses on parallelization. We make three major contributions to the SSJ. First, we elaborate on the state of the art in parallelizing the SSJ by comparing ten MapReduce-based approaches from the literature, both analytically and experimentally. Surprisingly, their main limitation is low scalability due to excessive and/or skewed data replication; none of the approaches could compute the join on large datasets. Second, we leverage the abundant CPU parallelism of modern commodity hardware, which had not yet been used to scale the SSJ, and propose a novel data-parallel multi-threaded SSJ that provides significant speedups over single-threaded execution. Third, we propose a novel, highly scalable distributed SSJ approach. With a cost-based heuristic and a data-independent scaling mechanism, it avoids data replication and recomputation; the heuristic assigns similar shares of the compute costs to each node. Our approach significantly speeds up the join execution and processes much larger datasets than all parallel approaches designed and implemented so far.
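    As background for the join predicate the thesis targets: a set similarity join returns all record pairs whose token sets are similar under a function such as Jaccard. The following naive, single-threaded sketch (names and data are illustrative; it is not the thesis's algorithm) shows the operation that the surveyed MapReduce approaches, the multi-threaded SSJ, and the distributed SSJ all parallelize:

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

def naive_ssj(records, threshold):
    """All-pairs set similarity join: id pairs with Jaccard >= threshold.
    Deliberately quadratic and single-threaded; it only illustrates the
    join predicate that the parallel approaches compute at scale."""
    return [(i, j)
            for (i, r), (j, s) in combinations(records, 2)
            if jaccard(r, s) >= threshold]

records = [("r1", {"a", "b", "c"}), ("r2", {"a", "b"}), ("r3", {"x", "y"})]
print(naive_ssj(records, threshold=0.6))  # [('r1', 'r2')]
```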

    Top-k spatial-keyword publish/subscribe over sliding window

    © 2017, Springer-Verlag Berlin Heidelberg. With the prevalence of social media and GPS-enabled devices, a massive amount of geo-textual data has been generated in a stream fashion, leading to a variety of applications such as location-based recommendation and information dissemination. In this paper, we investigate a novel real-time top-k monitoring problem over a sliding window of streaming data; that is, we continuously maintain the top-k most relevant geo-textual messages (e.g., geo-tagged tweets) for a large number of spatial-keyword subscriptions (e.g., registered users interested in local events) simultaneously. To provide the most recent information under controllable memory cost, a sliding window model is employed on the streaming geo-textual data. To the best of our knowledge, this is the first work to study top-k spatial-keyword publish/subscribe over a sliding window. We propose a novel centralized system called Skype (Top-k Spatial-keyword Publish/Subscribe). To continuously maintain top-k results for massive numbers of subscriptions, Skype uses a novel indexing structure over the subscriptions such that each incoming message can be delivered immediately on arrival. To cope with the expensive top-k re-evaluations triggered by message expiration, we develop a novel cost-based k-skyband technique that reduces the number of re-evaluations in a cost-effective way. Extensive experiments verify the efficiency and effectiveness of the proposed techniques. Furthermore, to support better scalability and higher throughput, we propose a distributed version of Skype, named DSkype, on top of Storm, a popular distributed stream processing system. With the help of fine-tuned subscription/message distribution mechanisms, DSkype achieves orders-of-magnitude speed-ups over its centralized version.
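    To make the monitoring problem concrete, the sketch below (scoring weights, class and field names are assumptions, not Skype's actual design) scores messages by a mix of spatial proximity and keyword overlap and maintains the top-k of a count-based sliding window per subscription by full re-evaluation; Skype's subscription index and cost-based k-skyband exist precisely to avoid this naive re-evaluation cost:

```python
import heapq
from collections import deque
from math import hypot

def score(sub, msg, alpha=0.5):
    """Toy relevance: weighted mix of spatial proximity and keyword overlap."""
    spatial = 1.0 / (1.0 + hypot(sub["x"] - msg["x"], sub["y"] - msg["y"]))
    textual = len(sub["keywords"] & msg["keywords"]) / max(len(sub["keywords"]), 1)
    return alpha * spatial + (1.0 - alpha) * textual

class NaiveTopKMonitor:
    """Top-k over a count-based sliding window for one subscription,
    recomputed from scratch on every arrival/expiration."""
    def __init__(self, sub, k, window):
        self.sub, self.k = sub, k
        self.window = deque(maxlen=window)   # oldest message expires automatically

    def feed(self, msg):
        self.window.append(msg)
        return heapq.nlargest(self.k, self.window, key=lambda m: score(self.sub, m))

sub = {"x": 0.0, "y": 0.0, "keywords": {"coffee", "jazz"}}
monitor = NaiveTopKMonitor(sub, k=2, window=3)
for msg in [{"x": 1.0, "y": 0.0, "keywords": {"coffee"}},
            {"x": 5.0, "y": 5.0, "keywords": {"jazz", "coffee"}},
            {"x": 0.0, "y": 1.0, "keywords": {"tea"}}]:
    print(monitor.feed(msg))
```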

    Efficient processing of large-scale spatio-temporal data

    Millions of location-aware devices, such as mobile phones, cars, and environmental sensors, constantly report their positions, often together with a timestamp and further payload data, to a server for various kinds of analyses. While the location information of the devices and the reported events is represented as points and polygons, raster data, produced for example by cameras and sensors, is another type of spatial data. These large spatio-temporal data volumes can only be processed on scalable platforms such as Hadoop and Apache Spark, which, however, are unaware of properties such as spatial neighborhood, making the execution of certain queries practically impossible. In addition, the repeated execution of analysis programs during development and by different users results in long execution times and potentially high costs for rented cluster resources, which can be reduced by reusing commonly computed intermediate results. This thesis tackles both challenges. First, we present the STARK framework for processing spatio-temporal vector and raster data on the Apache Spark stack. For its operators, we identify several possible algorithms and study how they can benefit from the properties of the underlying platform. We further investigate how indexes can be realized in the distributed and parallel architecture of Big Data processing engines, and we compare data partitioning methods, which cope differently well with data skew and data set size. Furthermore, we present an approach to reduce the amount of data to be processed already at the operator level. To shorten execution times, we introduce an approach to transparently recycle intermediate results of dataflow programs, based on a decision model that uses the actual operator costs; to obtain these costs, the programs are instrumented with profiling code that records the execution time and result size of each operator. In the evaluation, we first compare the various implementation and configuration options in STARK and identify scenarios in which partitioning and indexing should be applied. We further compare STARK to related systems and show that it achieves significantly better execution times, not only when exploiting existing partitioning information. In the second part of the evaluation, we show that the transparent cost-based materialization and recycling of intermediate results can reduce program execution times significantly.
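    As an illustration of the partitioning trade-offs discussed above, the sketch below implements a fixed uniform grid partitioner in plain Python (not STARK's Spark-based implementation; names and data are illustrative). A fixed grid is cheap to compute but sensitive to data skew, which is why cost- and sample-based partitioners are compared as well:

```python
def grid_partition(points, xmin, ymin, xmax, ymax, nx, ny):
    """Assign 2D points to cells of a fixed nx-by-ny grid covering
    the bounding box [xmin, xmax] x [ymin, ymax]."""
    cell_w = (xmax - xmin) / nx
    cell_h = (ymax - ymin) / ny
    parts = {}
    for p in points:
        cx = min(int((p[0] - xmin) / cell_w), nx - 1)
        cy = min(int((p[1] - ymin) / cell_h), ny - 1)
        parts.setdefault((cx, cy), []).append(p)
    return parts

pts = [(0.1, 0.2), (0.9, 0.9), (0.15, 0.25)]
print(grid_partition(pts, 0, 0, 1, 1, nx=2, ny=2))
# {(0, 0): [(0.1, 0.2), (0.15, 0.25)], (1, 1): [(0.9, 0.9)]}
```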

    Efficient Approximate String Matching with Synonyms and Taxonomies

    Strings are ubiquitous. When collected from various sources, strings are often inconsistent: they can express the same or a similar meaning in different forms, for example due to typographical mistakes. Finding similar strings in such inconsistent datasets has been researched extensively over the past years under the umbrella problem of approximate string matching. This thesis aims to enhance the quality of approximate string matching by detecting similar strings based on their meaning, in addition to handling typographical errors. Specifically, it focuses on utilising synonyms and taxonomies, since both are commonly available knowledge sources. We use each type of knowledge to address either a selection task or a join task: the first aims to find strings similar to a given string, and the second finds pairs of strings that are similar. The desired output is either all strings similar to a given extent (all-match) or the top-k most similar strings. The first contribution of this thesis addresses the top-k selection problem considering synonyms. We propose algorithms with different optimisation goals: minimising the space cost, maximising the selection speed, or maximising the selection speed under a space constraint. We model the last goal as a variant of the 0/1 knapsack problem and propose an efficient solution based on the branch-and-bound paradigm. Next, the thesis solves the top-k join problem considering taxonomy relations. Three algorithms are proposed, two based on sorted lists and one based on tries, in which pre-computations accelerate list scans and predictions eliminate unnecessary trie accesses. Experiments show that the trie-based algorithm has a very fast response time on vast datasets. The third contribution deals with the all-match join problem considering taxonomy relations. To this end, we identify a shortcoming of the standard prefix filtering principle and propose an adaptive filtering algorithm that is tunable towards minimising the join time. We also design a sampling-based estimation procedure that suggests the best parameter quickly and with high accuracy. Lastly, the thesis studies the all-match join task while handling typographical errors, synonyms, and taxonomies simultaneously. Key contributions here include a new unified similarity measure that combines multiple measures, as well as a non-trivial approximation algorithm with a tight theoretical guarantee. We furthermore propose two prefix filtering principles, a fast heuristic and an accurate dynamic-programming approach, to strive for the minimal join time.
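    The standard prefix filtering principle mentioned above can be sketched for a plain Jaccard threshold as follows (the token ordering, names, and data are illustrative; the thesis's adaptive and synonym/taxonomy-aware filters go beyond this): sort each record's tokens by a global order, keep the first |r| - ⌈t·|r|⌉ + 1 tokens as its prefix, and generate a candidate pair only if two prefixes share a token:

```python
from math import ceil
from itertools import combinations

def prefix(tokens, threshold, order):
    """Prefix under the standard principle for a Jaccard threshold t:
    sort tokens by a global order and keep the first |r| - ceil(t*|r|) + 1.
    Two records can only reach similarity t if their prefixes intersect."""
    ranked = sorted(tokens, key=order.__getitem__)
    keep = len(ranked) - ceil(threshold * len(ranked)) + 1
    return set(ranked[:keep])

def candidate_pairs(records, threshold):
    # Global ordering: rarer tokens first, ties broken lexicographically.
    freq = {}
    for _, toks in records:
        for t in toks:
            freq[t] = freq.get(t, 0) + 1
    order = {t: (f, t) for t, f in freq.items()}
    pre = {rid: prefix(toks, threshold, order) for rid, toks in records}
    return [(a, b) for (a, _), (b, _) in combinations(records, 2)
            if pre[a] & pre[b]]

records = [("r1", {"a", "b", "c", "d"}),
           ("r2", {"a", "b", "c", "e"}),
           ("r3", {"x", "y"})]
print(candidate_pairs(records, threshold=0.5))  # [('r1', 'r2')]
```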