17 research outputs found

    Hercules Against Data Series Similarity Search

    We propose Hercules, a parallel tree-based technique for exact similarity search on massive disk-based data series collections. We present novel index construction and query answering algorithms that leverage different summarization techniques, carefully schedule costly operations, optimize memory and disk accesses, and exploit the multi-threading and SIMD capabilities of modern hardware to perform CPU-intensive calculations. We demonstrate the superiority and robustness of Hercules with an extensive experimental evaluation against state-of-the-art techniques, using many synthetic and real datasets, and query workloads of varying difficulty. The results show that Hercules performs up to one order of magnitude faster than the best competitor (which is not always the same). Moreover, Hercules is the only index that outperforms the optimized scan in all scenarios, including the hard query workloads on disk-based datasets. This paper was published in the Proceedings of the VLDB Endowment, Volume 15, Number 10, June 2022.

    Indexing for Very Large Data Series Collections

    Data series are a prevalent data type that has attracted lots of interest in recent years. Specifically, there has been an explosive interest towards the analysis of large volumes of data series in many different domains. This holds both in business (e.g., in mobile applications) and in the sciences (e.g., in biology). In several time-critical scenarios, analysts need to be able to explore these data as soon as they become available, which is not currently possible for very large data series collections. In this thesis, we present the first adaptive indexing mechanism, specifically tailored to solve the problem of indexing and querying very large data series collections. The main idea is that instead of building the complete index over the complete data set up-front and querying only later, we interactively and adaptively build parts of the index, only for the parts of the data on which the users pose queries. The contents and the resolution of the index are purely driven by query patterns; the more queries that arrive, the more data series are indexed and at a higher resolution. Adaptive indexing significantly outperforms previous solutions, gracefully handling large data series collections and reducing the data-to-query delay: by the time state-of-the-art indexing techniques finish indexing 1 billion data series (and before answering even a single query), our method has already answered 3 * 10^5 queries. At the same time, we present novel algorithms for both full indexing of data series collections and efficient exact query answering. Our algorithms perform efficient skip-sequential scans of the data, avoiding the need for costly random accesses on the disk. Moreover, up to this point very little attention has been paid to properly evaluating data series index structures, with most previous work relying solely on randomly selected data series to use as queries (with/without adding noise). In this thesis, we show that random workloads are inherently not suitable for the task at hand and we argue that there is a need for carefully generating a query workload. We define measures that capture the characteristics of queries, and we propose a method for generating workloads with the desired properties, that is, workloads that effectively evaluate and compare data series summarizations and indexes. In our experimental evaluation, with carefully controlled query workloads, we shed light on key factors affecting the performance of nearest neighbor search in large data series collections. Finally, apart from ad hoc data exploration, we also investigate methods for the systematic analysis of very large data series collections, supporting business intelligence applications. We present techniques, which borrow ideas from Strategic Management, for a goal-oriented analysis of large collections of performance indicator data series. Such algorithms can additionally be sped up through the use of the index structures presented in this work.

    Enhancing the Collective Knowledge for the Engineering of Ontologies in Open and Socially Constructed Learning Spaces

    The aim of this paper is to present a novel technological approach for enhancing the collective knowledge of communities of learners on the engineering of ontologies within a collaborative, open and socially constructed environment. The proposed technology aims at shaping information spaces into ontologies in a collaborative, communicative and learner-centered way during the ontology development life-cycle. The paper conjectures that such a collaborative environment can yield educational benefits, thus there is a need to follow principles that apply in the Computer Supported Collaborative Learning (CSCL) paradigm. This work is mainly based on a collaborative and human-centered ontology engineering methodology and on a meta-ontology framework for developing ontologies, namely HCOME and HCOME-3O respectively. The integration of key technologies such as Semantic Wiki and Argumentation models with Ontology Engineering methodologies and tools serves as an enabler of learning spaces construction for different domain-specific information spaces in open settings. Inside these learning spaces innovative conceptualizations (both domain and development) are conceived, described by intertwined ontological meta-models following the HCOME-3O specifications for future reference and tutoring support. Such learning spaces support two types of ontology engineering courses: a) courses related to the know-how of shaping information spaces into ontologies (namely, the development knowledge) and b) courses related to the analysis of the domain itself (namely, the domain knowledge). The paper reports on the evaluation of the approach within a CSCL setting in Ontology Engineering, using the integrated set of tools and the framework that have been developed for the collaborative engineering of ontologies.

    Data Series Management (Dagstuhl Seminar 19282)

    We now witness a very strong interest by users across different domains in data series (a.k.a. time series) management. It is not unusual for industrial applications that produce data series to involve numbers of sequences (or subsequences) in the order of billions (i.e., multiple TBs). As a result, analysts are unable to handle the vast amounts of data series that they have to manage and process. The goal of this seminar is to enable researchers and practitioners to exchange ideas and foster collaborations in the topic of data series management and identify the corresponding open research directions. The main questions answered are the following: i) What are the data series management needs across various domains and what are the shortcomings of current systems? ii) How can we use machine learning to optimize our current data systems, and how can these systems help in machine learning pipelines? iii) How can visual analytics assist the process of analyzing big data series collections? The seminar focuses on the following key topics related to data series management: 1) Data series storage and access patterns, 2) Query optimization, 3) Machine learning and data mining for data series, 4) Visualization for data series exploration, 5) Applications in multiple domains.