
    Efficient processing of large-scale spatio-temporal data

    Millions of location-aware devices, such as mobile phones, cars, and environmental sensors, constantly report their positions, often together with a timestamp and further payload data, to a server for various kinds of analyses. While the location information of the devices and reported events is represented as points and polygons, raster data, produced for example by cameras and sensors, is another type of spatial data. Such big spatio-temporal data needs to be processed on scalable platforms such as Hadoop and Apache Spark, which, however, are unaware of properties like spatial neighborhood, making the execution of certain queries practically impossible. The repeated executions of the programs during development and by different users result in long execution times and potentially high costs for rented cluster resources, which can be reduced by reusing commonly computed intermediate results. Within this thesis, we tackle these two challenges. First, we present the STARK framework for processing spatio-temporal vector and raster data on the Apache Spark stack.
    For operators, we identify several possible algorithms and study how they can benefit from the underlying platform's properties. We further investigate how indexes can be realized in the distributed and parallel architecture of Big Data processing engines, and compare data partitioning methods that cope differently well with data skew and data set size. Furthermore, we present an approach to reduce the amount of data to process at the operator level early on. To shorten execution times, we introduce an approach to transparently recycle intermediate results of dataflow programs based on actual operator costs. To compute these costs, we instrument the programs with profiling code that gathers the execution time and result size of the operators. In the evaluation, we first compare the various implementation and configuration options in STARK and identify scenarios for when and how partitioning and indexing should be applied. We further compare STARK to related systems and show that we achieve significantly better execution times, not only when exploiting existing partitioning information. In the second part of the evaluation, we show that the transparent, cost-based materialization and recycling of intermediate results can reduce program execution times significantly.
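
    To make the role of spatial partitioning concrete, the following minimal Python sketch shows the general idea behind grid-based co-partitioning for a distance join: only points in the same or neighboring cells need to be compared. All names are illustrative assumptions; this is not STARK's API, which is implemented on Spark RDDs.

```python
# Minimal sketch of grid-based spatial co-partitioning for a distance
# join. Illustrative only; STARK's partitioners and indexes differ.
from collections import defaultdict

def grid_cell(x, y, cell_size):
    """Map a point to the integer coordinates of its grid cell."""
    return (int(x // cell_size), int(y // cell_size))

def partition_points(points, cell_size):
    """Group (id, x, y) points by grid cell so that a join only has
    to compare points from nearby cells."""
    partitions = defaultdict(list)
    for pid, x, y in points:
        partitions[grid_cell(x, y, cell_size)].append((pid, x, y))
    return partitions

def distance_join(left, right, cell_size, eps):
    """Join points within distance eps; assumes cell_size >= eps."""
    lparts = partition_points(left, cell_size)
    rparts = partition_points(right, cell_size)
    results = []
    for (cx, cy), lpoints in lparts.items():
        # A match may sit in a neighboring cell, so probe the 3x3
        # neighborhood of the left point's cell.
        for nx in (cx - 1, cx, cx + 1):
            for ny in (cy - 1, cy, cy + 1):
                for lid, lx, ly in lpoints:
                    for rid, rx, ry in rparts.get((nx, ny), []):
                        if (lx - rx) ** 2 + (ly - ry) ** 2 <= eps ** 2:
                            results.append((lid, rid))
    return results
```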

    Multi-Step Processing of Spatial Joins

    Spatial joins are one of the most important operations for combining spatial objects of several relations. In this paper, spatial join processing is studied in detail for extended spatial objects in two-dimensional data space. We present an approach for spatial join processing that is based on three steps. First, a spatial join is performed on the minimum bounding rectangles of the objects, returning a set of candidates. Various approaches for accelerating this step of join processing were examined at last year's conference [BKS 93a]. In this paper, we focus on the problem of how to compute the answers from the set of candidates, which is handled by the following two steps. In the second step, sophisticated approximations are used to identify answers as well as to filter out false hits from the set of candidates. For this purpose, we investigate various types of conservative and progressive approximations. In the last step, the exact geometry of the remaining candidates is tested against the join predicate. The time required for computing spatial join predicates can be reduced substantially when objects are adequately organized in main memory. In our approach, objects are first decomposed into simple components which are exclusively organized by a main-memory-resident spatial data structure. Overall, we present a complete approach to spatial join processing on complex spatial objects. The performance of the individual steps of our approach is evaluated with data sets from real cartographic applications. The results show that our approach reduces the total execution time of the spatial join by significant factors.
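
    As an illustration of the three-step pipeline, here is a minimal Python sketch of a filter-and-refine intersection join. The predicate functions progressive_hit, conservative_hit, and exact_hit are hypothetical placeholders for the paper's approximation and exact-geometry tests, not its actual implementation.

```python
# Sketch of the three-step spatial join: MBR filter, approximation
# test, exact geometry test. Predicate arguments are placeholders.

def mbr_intersects(a, b):
    """Step 1 filter on minimum bounding rectangles (x1, y1, x2, y2)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def spatial_join(rs, ss, progressive_hit, conservative_hit, exact_hit):
    answers, candidates = [], []
    for r in rs:
        for s in ss:
            if not mbr_intersects(r.mbr, s.mbr):
                continue                      # step 1: not a candidate
            if progressive_hit(r, s):         # step 2a: progressive
                answers.append((r, s))        #   approximations intersect,
            elif not conservative_hit(r, s):  #   so this is a definite hit
                continue                      # step 2b: conservative
            else:                             #   approximations disjoint,
                candidates.append((r, s))     #   so this is a false hit
    answers += [(r, s) for r, s in candidates
                if exact_hit(r, s)]           # step 3: exact geometry
    return answers
```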

    A 3d geoscience information system framework

    Two-dimensional geographical information systems are extensively used in the geosciences to create and analyse maps. However, these systems are unable to represent the Earth's subsurface in three spatial dimensions. The objective of this thesis is to overcome this deficiency, to provide a general framework for a 3d geoscience information system (GIS), and to contribute to the public discussion about the development of an infrastructure for geological observation data, geomodels, and geoservices. Following this objective, the requirements for a 3d GIS are analysed. According to the requirements, new geologically sensible query functionality for geometrical, topological, and geological properties has been developed, and the integration of 3d geological modeling and data management system components into a generic framework has been accomplished. The 3d geoscience information system framework presented here is characterized by the following features: - Storage of geological observation data and geomodels in an XML database server. According to a new data model, geological observation data can be referenced by a set of geomodels. - Functionality for querying observation data and 3d geomodels based on their 3d geometrical, topological, material, and geological properties, developed and implemented as a plug-in for a 3d geomodeling user application. - For database queries, the standard XML query language has been extended with 3d spatial operators. The spatial database query operations are computed by an XML application server which has been developed for this specific purpose. This technology allows sophisticated 3d spatial and geological database queries. Using the developed methods, queries can be answered such as: "Select all sandstone horizons which are intersected by the set of faults F". This request contains a topological and a geological material parameter. The combination of queries with other GIS methods, like visual and statistical analysis, enables geoscience investigations in a novel 3d GIS environment. More generally, a 3d GIS enables geologists to read and understand a 3d digital geomodel in the same way as they read a conventional 2d geological map.
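
    The combined query quoted above can be pictured with a small sketch. The real system expresses such requests in an extended XML query language with 3d spatial operators; this hypothetical in-memory version only shows how a geological material predicate and a topological predicate combine.

```python
# Hypothetical illustration of "select all sandstone horizons which
# are intersected by the set of faults F". The intersects() argument
# stands in for an assumed 3d topological test.

def query_horizons(horizons, faults, intersects):
    """Combine a material filter with a topological predicate."""
    return [h for h in horizons
            if h.material == "sandstone"
            and any(intersects(h.geometry, f.geometry) for f in faults)]
```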

    Optimizing Analytical Queries over Semantic Web Sources


    Efficient processing of similarity queries with applications

    Today, a myriad of data sources, from the Internet to business operations to scientific instruments, produce large amounts of data of many different types. Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, call for identifying and processing similarities in big data. As a result, it is imperative to develop new similarity query processing approaches and systems that scale from low-dimensional to high-dimensional data, from a single machine to clusters of hundreds of machines, and from disk-based to memory-based processing. This dissertation introduces and studies several similarity-aware query operators and analyzes and optimizes their performance. The first contribution of this dissertation is an SQL-based Similarity Group-by operator (SGB, for short) that extends the semantics of the standard SQL Group-by operator to group data with similar but not necessarily equal values. We realize these SGB operators by extending the standard SQL Group-by and introduce two new SGB operators for multi-dimensional data. We implement and test the new SGB operators and their algorithms inside an open-source centralized database server (PostgreSQL). In the second contribution of this dissertation, we study how to efficiently process Hamming-distance-based similarity queries (Hamming-distance select and Hamming-distance join) that are crucial to many applications. We introduce a new index, termed the HA-Index, that speeds up distance comparisons and eliminates redundancies when performing the two flavors of Hamming-distance range queries (namely, selects and joins). In the third and last contribution of this dissertation, we develop a system for similarity query processing and optimization in an in-memory and distributed setup for big spatial data. We propose a query scheduler and a distributed query optimizer that use a new cost model to optimize the cost of similarity query processing in this in-memory distributed setup. The scheduler and query optimizer generate query execution plans that minimize the effect of query skew. The query scheduler employs new spatial indexing techniques based on Bloom filters to forward queries to the appropriate local sites. The proposed query processing and optimization techniques are prototyped inside Spark, a distributed main-memory computation system.
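
    The Hamming-distance query semantics of the second contribution can be illustrated in a few lines. The sketch below shows only the naive formulation that an index like the HA-Index is designed to accelerate; the index structure itself is not reproduced here.

```python
# Hamming-distance select and join over integer bit codes, written as
# naive scans. An index avoids exactly these exhaustive comparisons.

def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")

def hamming_select(codes, query, k):
    """Range select: all codes within Hamming distance k of the query."""
    return [c for c in codes if hamming(c, query) <= k]

def hamming_join(left, right, k):
    """Range join: all pairs within Hamming distance k (quadratic here)."""
    return [(a, b) for a in left for b in right if hamming(a, b) <= k]
```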

    Querying and managing complex networks

    Advisor: André Santanchè. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação. Understanding and quantifying the emergent properties of natural and man-made networks such as food webs, social interactions, and transportation infrastructures is a challenging task. The complex networks field was developed to encompass measurements, algorithms, and techniques to tackle such topics. Although complex networks research has been successfully applied to several areas of human activity, there is still a lack of common infrastructures for routine tasks, especially those related to data management. On the other hand, the databases field has focused on mastering data management issues since its beginnings, several decades ago. Database systems, however, offer limited network analysis capabilities. To enable better support for complex network analysis tasks, a database system must offer adequate querying and data management capabilities. This thesis advocates a tighter integration between the areas and presents our efforts towards this goal. Here we describe the Complex Data Management System (CDMS), which enables explorative querying of complex networks through a declarative query language. Query results are ranked based on network measurements assessed at query time. To support query processing, we introduce the Beta-algebra, which offers an operator capable of representing diverse measurements typical of complex network analysis. The algebra offers opportunities for transparent query optimization through query rewritings, proposed and discussed here. We also introduce the mapper mechanism for relationship management, which is integrated into the query language. The flexible query language and data management mechanisms are useful in scenarios beyond complex network analysis. We demonstrate the use of the CDMS in applications such as institutional data integration, information retrieval, classification, and recommendation. All aspects of the proposal have been implemented and tested with real and synthetic data.
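
    As an illustration of query-time ranking by network measurements, the following sketch filters nodes declaratively and ranks the matches by degree centrality, a simple stand-in for the richer measurements expressible in the Beta-algebra. The code is an assumed simplification, not the CDMS implementation.

```python
# Rank query matches by a network measurement computed at query time.
# Degree centrality is used here purely as an illustrative measurement.
from collections import defaultdict

def degree_centrality(edges):
    """Count incident edges per node from an (u, v) edge list."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg

def ranked_query(nodes, edges, predicate):
    """Filter nodes with a declarative predicate, rank by centrality."""
    deg = degree_centrality(edges)
    return sorted((n for n in nodes if predicate(n)),
                  key=lambda n: deg[n], reverse=True)
```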

    Dimensional enrichment of statistical linked open data

    On-Line Analytical Processing (OLAP) is a data analysis technique typically used for local and well-prepared data. However, initiatives like Open Data and Open Government bring new and publicly available data to the web that are to be analyzed in the same way. The use of semantic web technologies for this context is especially encouraged by the Linked Data initiative. There is already a considerable amount of statistical linked open data sets published using the RDF Data Cube Vocabulary (QB), which is designed for these purposes. However, QB lacks some essential schema constructs (e.g., dimension levels) needed to support OLAP. Thus, the QB4OLAP vocabulary has been proposed to extend QB with the necessary constructs and be fully compliant with OLAP. In this paper, we focus on the enrichment of an existing QB data set with QB4OLAP semantics. We first thoroughly compare the two vocabularies and outline the benefits of QB4OLAP. Then, we propose a series of steps to automate the enrichment of QB data sets with specific QB4OLAP semantics; the most important are the definition of aggregate functions and the detection of new concepts in the dimension hierarchy construction. The proposed steps form a semi-automatic enrichment method, which is implemented in a tool that enables the enrichment in an interactive and iterative fashion. The user can enrich the QB data set with QB4OLAP concepts (e.g., full-fledged dimension hierarchies) by choosing among the candidate concepts automatically discovered by the proposed steps. Finally, we conduct experiments with 25 users and use three real-world QB data sets to evaluate our approach. The evaluation demonstrates the feasibility of our approach and shows that, in practice, our tool facilitates, speeds up, and guarantees correct results of the enrichment process.
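
    One of the enrichment steps, discovering candidate dimension hierarchies, can be pictured as testing a functional dependency over the instance data: a child level can roll up to a parent level only if each child member maps to exactly one parent member. The sketch below is an assumed simplification of that idea, not the tool's actual algorithm.

```python
# Detect a candidate roll-up (child level -> parent level) by testing
# a functional dependency over observed (child, parent) member pairs.

def candidate_rollup(pairs):
    """Return the child->parent mapping if every child member maps to
    exactly one parent member, else None (dependency violated)."""
    mapping = {}
    for child, parent in pairs:
        if mapping.setdefault(child, parent) != parent:
            return None
    return mapping

# candidate_rollup([("Berlin", "DE"), ("Paris", "FR"), ("Berlin", "DE")])
# -> {"Berlin": "DE", "Paris": "FR"}, suggesting a city -> country level.
```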

    Acquisition and Declarative Analytical Processing of Spatio-Temporal Observation Data

    A generic framework for spatio-temporal observation data acquisition and declarative analytical processing has been designed and implemented in this thesis. The main contributions of this thesis may be summarized as follows: 1) generalization of a data acquisition and dissemination server, with broad applicability in many scientific and industrial domains, providing flexibility in the incorporation of different technologies for data acquisition, data persistence, and data dissemination; 2) definition of a new hybrid logical-functional paradigm to formalize a novel data model for the integrated management of entity and sampled data; 3) definition of a novel spatio-temporal declarative data analysis language for the previous data model; 4) definition of a data warehouse data model supporting observation data semantics, including the application of the above language to the declarative definition of observation processes executed during observation data load; and 5) a column-oriented parallel and distributed implementation of the spatial analysis declarative language. The huge amount of data to be processed forces the exploitation of current multi-core hardware architectures and multi-node cluster infrastructures.
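
    Contribution 5, the column-oriented implementation, rests on holding observations as aligned columns so that spatio-temporal filters become contiguous array scans. The following sketch illustrates this evaluation style, with numpy standing in for the actual parallel and distributed engine; all names are illustrative assumptions.

```python
# Columnar evaluation of a spatio-temporal filter-and-aggregate:
# observations are aligned numpy arrays, so the window predicate is
# a vectorized scan rather than per-record object traversal.
import numpy as np

def mean_in_window(xs, ys, ts, values, bbox, t_range):
    """Average the observed values inside a space-time window."""
    x1, y1, x2, y2 = bbox
    t1, t2 = t_range
    mask = ((xs >= x1) & (xs <= x2) &
            (ys >= y1) & (ys <= y2) &
            (ts >= t1) & (ts <= t2))
    return values[mask].mean() if mask.any() else None
```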

    Mastering the Spatio-Temporal Knowledge Discovery Process

    The thesis addresses a topic of great importance: a framework for mining the positioning data collected by personal mobile devices. The main contribution of this thesis is a theoretical and practical framework for managing the complex knowledge discovery process on mobility data. Creating such a framework requires integrating very different aspects of the process, each with its own assumptions and requirements. The result is a homogeneous system which makes it possible to exploit the power of all the components with the same flexibility as a database, including a new way to use ontologies for automatic reasoning on trajectory data. Furthermore, two extensions are designed, developed, and then integrated into the system to confirm its extensibility: an innovative way to reconstruct trajectories that takes into account the uncertainty of the path followed, and a location prediction algorithm called WhereNext. Another important contribution of the thesis is the experimentation on a real case study of mobility data analysis, which demonstrates the usefulness of the system for a mobility manager provided with such a knowledge discovery framework.
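
    The prediction task addressed by WhereNext can be sketched with a deliberately simplified model. WhereNext itself mines movement patterns from trajectories; the stand-in below uses only first-order transition counts and is an illustrative assumption, not the published algorithm.

```python
# Simplified next-location prediction: count observed transitions
# between locations, then predict the most frequent successor.
from collections import defaultdict

def train(trajectories):
    """trajectories: lists of location ids in visit order."""
    counts = defaultdict(lambda: defaultdict(int))
    for traj in trajectories:
        for here, nxt in zip(traj, traj[1:]):
            counts[here][nxt] += 1
    return counts

def where_next(counts, current):
    """Return the most frequently observed successor, or None."""
    successors = counts.get(current)
    if not successors:
        return None
    return max(successors, key=successors.get)
```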