109 research outputs found

    Data Management for Data Science - Towards Embedded Analytics

    Get PDF
    The rise of Data Science has caused an influx of new usersin need of data management solutions. However, insteadof utilizing existing RDBMS solutions they are opting touse a stack of independent solutions for data storage andprocessing glued together by scripting languages. This is notbecause they do not need the functionality that an integratedRDBMS provides, but rather because existing RDBMS im-plementations do not cater to their use case. To solve theseissues, we propose a new class of data management systems:embedded analytical systems. These systems are tightlyintegrated with analytical tools, and provide fast and effi-cient access to the data stored within them. In this work,we describe the unique challenges and opportunities w.r.tworkloads, resilience and cooperation that are faced by thisnew class of systems and the steps we have taken towardsaddressing them in the DuckDB system

    Neighborhood Matching Network for Entity Alignment

    Full text link
    Structural heterogeneity between knowledge graphs is an outstanding challenge for entity alignment. This paper presents Neighborhood Matching Network (NMN), a novel entity alignment framework for tackling the structural heterogeneity challenge. NMN estimates the similarities between entities to capture both the topological structure and the neighborhood difference. It provides two innovative components for better learning representations for entity alignment. It first uses a novel graph sampling method to distill a discriminative neighborhood for each entity. It then adopts a cross-graph neighborhood matching module to jointly encode the neighborhood difference for a given entity pair. Such strategies allow NMN to effectively construct matching-oriented entity representations while ignoring noisy neighbors that have a negative impact on the alignment task. Extensive experiments performed on three entity alignment datasets show that NMN can well estimate the neighborhood similarity in more tough cases and significantly outperforms 12 previous state-of-the-art methods.Comment: 11 pages, accepted by ACL 202

    Weiterentwicklung analytischer Datenbanksysteme

    Get PDF
    This thesis contributes to the state of the art in analytical database systems. First, we identify and explore extensions to better support analytics on event streams. Second, we propose a novel polygon index to enable efficient geospatial data processing in main memory. Third, we contribute a new deep learning approach to cardinality estimation, which is the core problem in cost-based query optimization.Diese Arbeit trägt zum aktuellen Forschungsstand von analytischen Datenbanksystemen bei. Wir identifizieren und explorieren Erweiterungen um Analysen auf Eventströmen besser zu unterstützen. Wir stellen eine neue Indexstruktur für Polygone vor, die eine effiziente Verarbeitung von Geodaten im Hauptspeicher ermöglicht. Zudem präsentieren wir einen neuen Ansatz für Kardinalitätsschätzungen mittels maschinellen Lernens

    Single-Board-Computer Clusters for Cloudlet Computing in Internet of Things

    Get PDF
    The number of connected sensors and devices is expected to increase to billions in the near future. However, centralised cloud-computing data centres present various challenges to meet the requirements inherent to Internet of Things (IoT) workloads, such as low latency, high throughput and bandwidth constraints. Edge computing is becoming the standard computing paradigm for latency-sensitive real-time IoT workloads, since it addresses the aforementioned limitations related to centralised cloud-computing models. Such a paradigm relies on bringing computation close to the source of data, which presents serious operational challenges for large-scale cloud-computing providers. In this work, we present an architecture composed of low-cost Single-Board-Computer clusters near to data sources, and centralised cloud-computing data centres. The proposed cost-efficient model may be employed as an alternative to fog computing to meet real-time IoT workload requirements while keeping scalability. We include an extensive empirical analysis to assess the suitability of single-board-computer clusters as cost-effective edge-computing micro data centres. Additionally, we compare the proposed architecture with traditional cloudlet and cloud architectures, and evaluate them through extensive simulation. We finally show that acquisition costs can be drastically reduced while keeping performance levels in data-intensive IoT use cases.Ministerio de Economía y Competitividad TIN2017-82113-C2-1-RMinisterio de Economía y Competitividad RTI2018-098062-A-I00European Union’s Horizon 2020 No. 754489Science Foundation Ireland grant 13/RC/209

    Flow-Loss: Learning Cardinality Estimates That Matter

    Full text link
    Previous approaches to learned cardinality estimation have focused on improving average estimation error, but not all estimates matter equally. Since learned models inevitably make mistakes, the goal should be to improve the estimates that make the biggest difference to an optimizer. We introduce a new loss function, Flow-Loss, that explicitly optimizes for better query plans by approximating the optimizer's cost model and dynamic programming search algorithm with analytical functions. At the heart of Flow-Loss is a reduction of query optimization to a flow routing problem on a certain plan graph in which paths correspond to different query plans. To evaluate our approach, we introduce the Cardinality Estimation Benchmark, which contains the ground truth cardinalities for sub-plans of over 16K queries from 21 templates with up to 15 joins. We show that across different architectures and databases, a model trained with Flow-Loss improves the cost of plans (using the PostgreSQL cost model) and query runtimes despite having worse estimation accuracy than a model trained with Q-Error. When the test set queries closely match the training queries, both models improve performance significantly over PostgreSQL and are close to the optimal performance (using true cardinalities). However, the Q-Error trained model degrades significantly when evaluated on queries that are slightly different (e.g., similar but not identical query templates), while the Flow-Loss trained model generalizes better to such situations. For example, the Flow-Loss model achieves up to 1.5x better runtimes on unseen templates compared to the Q-Error model, despite leveraging the same model architecture and training data

    BitE : Accelerating Learned Query Optimization in a Mixed-Workload Environment

    Full text link
    Although the many efforts to apply deep reinforcement learning to query optimization in recent years, there remains room for improvement as query optimizers are complex entities that require hand-designed tuning of workloads and datasets. Recent research present learned query optimizations results mostly in bulks of single workloads which focus on picking up the unique traits of the specific workload. This proves to be problematic in scenarios where the different characteristics of multiple workloads and datasets are to be mixed and learned together. Henceforth, in this paper, we propose BitE, a novel ensemble learning model using database statistics and metadata to tune a learned query optimizer for enhancing performance. On the way, we introduce multiple revisions to solve several challenges: we extend the search space for the optimal Abstract SQL Plan(represented as a JSON object called ASP) by expanding hintsets, we steer the model away from the default plans that may be biased by configuring the experience with all unique plans of queries, and we deviate from the traditional loss functions and choose an alternative method to cope with underestimation and overestimation of reward. Our model achieves 19.6% more improved queries and 15.8% less regressed queries compared to the existing traditional methods whilst using a comparable level of resources.Comment: This work was done when the first three author were interns in SAP Labs Korea and they have equal contributio
    corecore