5 research outputs found

    Dynamic set kNN self-join

    In many applications, data objects can be represented as sets. For example, in video-on-demand and social network services, the user data consists of a set of movies that have been watched and a set of users (friends), respectively, and these sets can be used for recommendation and information extraction. The problem of set similarity self-join has therefore been studied extensively. Existing studies assume that sets are static, but in the above applications sets are dynamically updated, which requires continuously updating the join result. In this paper, we study a novel problem, the dynamic set kNN self-join: for each set, we continuously compute its k nearest neighbor sets. This problem poses a challenge for computational efficiency, because a single element insertion into (or deletion from) a set may affect the kNN results of many sets. To address this challenge, we first investigate the properties of the dynamic set kNN self-join problem to characterize the search space derived from a set update. Based on this observation, we then propose an efficient algorithm. This algorithm employs an indexing technique that enables incremental similarity computation and prunes unnecessary similarity computations. Our empirical studies using real datasets show the efficiency and scalability of our algorithm.
    Amagata D., Hara T., Xiao C. Dynamic set kNN self-join. Proceedings - International Conference on Data Engineering, 2019-April, 818 (2019); https://doi.org/10.1109/ICDE.2019.00078
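    As a rough illustration of the problem setting only (not the paper's algorithm), the sketch below maintains Jaccard-based kNN lists over a collection of sets and uses an inverted index so that an element insertion triggers recomputation only for sets sharing an element with the updated set. All class and function names are illustrative assumptions; the paper's incremental similarity computation and pruning techniques are not reproduced here.

        # Naive dynamic set kNN baseline with Jaccard similarity (illustrative only).
        from collections import defaultdict

        def jaccard(a, b):
            # Jaccard similarity: |intersection| / |union|.
            if not a and not b:
                return 0.0
            return len(a & b) / len(a | b)

        class NaiveDynamicSetKnn:
            def __init__(self, k):
                self.k = k
                self.sets = {}                    # set id -> set of elements
                self.inverted = defaultdict(set)  # element -> ids of sets containing it

            def knn(self, sid):
                # Recompute the k most similar other sets for set `sid` from scratch.
                scored = ((jaccard(self.sets[sid], self.sets[o]), o)
                          for o in self.sets if o != sid)
                return sorted(scored, reverse=True)[:self.k]

            def insert_element(self, sid, elem):
                # Only sets sharing at least one element with `sid` can have their
                # similarity to `sid` change, so restrict recomputation to them.
                self.sets.setdefault(sid, set()).add(elem)
                self.inverted[elem].add(sid)
                affected = {o for e in self.sets[sid] for o in self.inverted[e]}
                return {o: self.knn(o) for o in affected}

    Even this baseline shows why the problem is hard: inserting a popular element makes the affected group large, which is exactly the case the pruning described in the abstract targets.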

    Efficient processing of similarity queries with applications

    Today, a myriad of data sources, from the Internet to business operations to scientific instruments, produce large and diverse types of data. Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, call for identifying and processing similarities in big data. As a result, it is imperative to develop new similarity query processing approaches and systems that scale from low-dimensional to high-dimensional data, from a single machine to clusters of hundreds of machines, and from disk-based to memory-based processing. This dissertation introduces and studies several similarity-aware query operators, and analyzes and optimizes their performance. The first contribution of this dissertation is an SQL-based Similarity Group-by operator (SGB, for short) that extends the semantics of the standard SQL Group-by operator to group data with similar but not necessarily equal values. We realize these SGB operators by extending the standard SQL Group-by and introduce two new SGB operators for multi-dimensional data. We implement and test the new SGB operators and their algorithms inside an open-source centralized database server (PostgreSQL). In the second contribution of this dissertation, we study how to efficiently process Hamming-distance-based similarity queries (Hamming-distance select and Hamming-distance join) that are crucial to many applications. We introduce a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming-distance range queries (namely, the selects and joins). In the third and last contribution of this dissertation, we develop a system for similarity query processing and optimization in an in-memory and distributed setup for big spatial data. We propose a query scheduler and a distributed query optimizer that use a new cost model to optimize the cost of similarity query processing in this in-memory distributed setup. The scheduler and query optimizer generate query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on Bloom filters to forward queries to the appropriate local sites. The proposed query processing and optimization techniques are prototyped inside Spark, a distributed main-memory computation system.
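    The abstract does not give the HA-Index details; as a point of reference only, the following sketch shows the brute-force Hamming-distance range select that such an index is meant to accelerate, with binary codes stored as Python integers. The function names and example codes are assumptions.

        # Brute-force Hamming-distance range select over binary codes (illustrative only).
        def hamming(a, b):
            # Count of differing bit positions; int.bit_count() needs Python 3.10+,
            # otherwise use bin(a ^ b).count("1").
            return (a ^ b).bit_count()

        def hamming_select(codes, query, radius):
            # Return indices of all codes within `radius` of `query`.
            return [i for i, c in enumerate(codes) if hamming(c, query) <= radius]

        codes = [0b10110010, 0b10110011, 0b01001100]
        print(hamming_select(codes, query=0b10110010, radius=1))  # -> [0, 1]

    A Hamming-distance join applies the same test to every pair of codes drawn from two collections, which is the comparison work the HA-Index described above is designed to speed up.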

    Elastic Dataflow Processing on the Cloud

    Clouds have become an attractive platform for the large-scale processing of modern applications on Big Data, especially due to the concept of elasticity, which characterizes them: resources can be leased on demand and used for as much time as needed, offering the ability to create virtual infrastructures that change dynamically over time. Such applications often require processing of complex queries that are expressed in a high-level language and are typically transformed into data processing flows (dataflows). A natural question that arises is whether, and in what way, elasticity affects dataflow execution. It seems reasonable that execution is faster when more resources are used, but the monetary cost is then higher. This gives rise to the concept of eco-elasticity, an additional kind of elasticity that comes from economics and captures the trade-off between the response time of the system and the amount of money paid for it, as influenced by the use of different amounts of resources. In this thesis, we approach the elasticity of clouds in a unified way that combines both the traditional notion and eco-elasticity. This unified elasticity concept is essential for the development of auto-tuned systems in cloud environments. First, we demonstrate that eco-elasticity exists in several common tasks that appear in practice and that it can be discovered using a simple, yet highly scalable and efficient, algorithm. Next, we present two cases of auto-tuned algorithms that use the unified model of elasticity in order to adapt to the query workload: 1) processing analytical queries in the form of tree execution plans in order to maximize profit, and 2) automated index management that takes into account the monetary cost of compute and storage resources. Finally, we describe EXAREME, a system for elastic data processing on the cloud that has been used and extended in this work. The system offers declarative languages that are based on SQL with user-defined functions (UDFs), extended with parallelism primitives. EXAREME exploits both elasticities of clouds by dynamically allocating and deallocating compute resources in order to adapt to the query workload.
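    To make the time/money trade-off concrete, the following sketch enumerates runtimes and monetary costs for a dataflow run on a growing number of leased machines. The Amdahl-style speedup model and the prices are assumptions for illustration and are not EXAREME's cost model.

        # Illustrative eco-elasticity trade-off: runtime falls with more machines,
        # while total monetary cost rises (assumed speedup model and prices).
        def tradeoff(serial_fraction, base_time_h, price_per_vm_hour, max_vms=8):
            points = []
            for n in range(1, max_vms + 1):
                runtime = base_time_h * (serial_fraction + (1 - serial_fraction) / n)
                cost = runtime * n * price_per_vm_hour
                points.append((n, round(runtime, 2), round(cost, 2)))
            return points

        # A 10-hour dataflow that is 10% serial, on machines priced at $0.20/hour.
        for n, runtime, cost in tradeoff(0.1, 10.0, 0.20):
            print(f"{n:2d} machines: {runtime:5.2f} h, ${cost:.2f}")

    Choosing a point on such a curve (the fastest plan within a budget, or the cheapest plan within a deadline) is the kind of decision the unified elasticity model is intended to automate.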

    Advanced distributed data integration infrastructure and research data management portal

    The amount of data available due to the rapid spread of advanced information technology is exploding. At the same time, continued research on data integration systems aims to provide users with uniform data access and efficient data sharing. The ability to share data is particularly important for interdisciplinary research, where a comprehensive picture of the subject requires large amounts of data from disparate sources across a variety of disciplines. While there are numerous datasets available from various groups worldwide, the existing data sources are principally oriented toward regional comparative efforts rather than global applications, and they vary widely in both content and format. Such data sources cannot easily be integrated and maintained by small groups of developers. I propose an advanced infrastructure for large-scale data integration based on crowdsourcing. In particular, I propose a novel architecture and algorithms to efficiently store dynamically incoming heterogeneous datasets, enabling both data integration and data autonomy. My proposed infrastructure combines machine learning algorithms and human expertise to perform efficient schema alignment and maintain relationships between the datasets. It provides efficient data exploration functionality without requiring users to write complex queries, and performs approximate information fusion when an exact match does not exist. Finally, I introduce the Col*Fusion system, which implements the proposed advanced data integration infrastructure.
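    As a minimal illustration of the schema-alignment step only, the sketch below proposes attribute pairings between two heterogeneous schemas using string similarity, leaving the final decision to a human reviewer. The use of difflib, the threshold, and the example schemas are assumptions; Col*Fusion's actual matching algorithms are not reproduced here.

        # Minimal attribute-name matcher for crowd-reviewed schema alignment (illustrative only).
        from difflib import SequenceMatcher

        def propose_alignments(schema_a, schema_b, threshold=0.6):
            # Suggest (column_a, column_b, score) pairs whose names look similar.
            proposals = []
            for a in schema_a:
                for b in schema_b:
                    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
                    if score >= threshold:
                        proposals.append((a, b, round(score, 2)))
            return sorted(proposals, key=lambda p: -p[2])

        print(propose_alignments(["site_id", "latitude", "obs_date"],
                                 ["SiteID", "lat", "observation_date"]))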