8 research outputs found

    Manipulating Interpolated Data is Easier than You Thought

    Data defined by interpolation is frequently found in new applications involving geographical entities, moving objects, or spatiotemporal data. Such data lead to potentially infinite collections of items (e.g., the elevation of any point in a map), whose definitions associate a collection of samples with an interpolation function. Naively manipulating the data through direct access to both the samples and the interpolation functions leads to cumbersome or inaccurate queries. It is desirable to hide the samples and the interpolation functions from the logical level and to manipulate them automatically. We propose to model such data using infinite relations (e.g., the map with elevation yields an infinite ternary relation) that can be manipulated through standard relational query languages (e.g., SQL), with no mention of the interpolated definition. The clear separation between logical and physical levels ensures the accu..

    Efficient Processing of Raster and Vector Data

    [Abstract] In this work, we propose a framework to store and manage spatial data, which includes new efficient algorithms to perform operations accepting as input a raster dataset and a vector dataset. More concretely, we present algorithms for solving a spatial join between a raster and a vector dataset imposing a restriction on the values of the cells of the raster; and an algorithm for retrieving K objects of a vector dataset that overlap cells of a raster dataset, such that the K objects are those overlapping the highest (or lowest) cell values among all objects. The raster data is stored using a compact data structure, which can directly manipulate compressed data without the need for prior decompression. This leads to better running times and lower memory consumption. In our experimental evaluation comparing our solution to other baselines, we obtain the best space/time trade-offs.
    Funding: This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 690941; from the Ministerio de Ciencia, Innovación y Universidades (PGE and ERDF) grant numbers TIN2016-78011-C4-1-R, TIN2016-77158 C4-3-R, and RTC-2017-5908-7; from Xunta de Galicia (co-funded with ERDF) grant numbers ED431C 2017/58, ED431G/01, and IN852A 2018/14; and from University of Bío-Bío grant numbers 192119 2/R and 195119 GI/V.

    Efficient processing of raster and vector data

    [Abstract] In this work, we propose a framework to store and manage spatial data, which includes new efficient algorithms to perform operations accepting as input a raster dataset and a vector dataset. More concretely, we present algorithms for solving a spatial join between a raster and a vector dataset imposing a restriction on the values of the cells of the raster; and an algorithm for retrieving K objects of a vector dataset that overlap cells of a raster dataset, such that the K objects are those overlapping the highest (or lowest) cell values among all objects. The raster data is stored using a compact data structure, which can directly manipulate compressed data without the need for prior decompression. This leads to better running times and lower memory consumption. In our experimental evaluation comparing our solution to other baselines, we obtain the best space/time trade-offs.
    Funding: Ministerio de Ciencia, Innovación y Universidades (TIN2016-78011-C4-1-R, TIN2016-77158 C4-3-R, RTC-2017-5908-7); Xunta de Galicia (ED431C 2017/58, ED431G/01, IN852A 2018/14); University of Bío-Bío (192119 2/R, 195119 GI/V).

    Representing and querying regression models in a relational database management system

    Thesis (S.M.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2007, by Arvind Thiagarajan. Includes bibliographical references (p. 77-79).
    Curve fitting is a widely employed, useful modeling tool in several financial, scientific, engineering and data mining applications, and in applications like sensor networks that need to tolerate missing or noisy data. These applications need to both fit functions to their data using regression, and pose relational-style queries over regression models. Unfortunately, existing DBMSs are ill-suited for this task because they do not include support for creating, representing and querying functional data, short of brute-force discretization of functions into a collection of tuples. This thesis describes FunctionDB, a novel DBMS that extends the state of the art. FunctionDB treats functions output by regression as first-class citizens that can be queried declaratively and manipulated like traditional database relations. The key contributions of FunctionDB are a compact, algebraic representation for regression models as piecewise functions, and an algebraic query processor that executes declarative queries directly on this representation as combinations of algebraic operations like function inversion, zero finding and symbolic integration. FunctionDB is evaluated on two real-world data sets: measurements from a temperature sensor network, and traffic traces from cars driving on Boston roads. The results show that operating in the functional domain has substantial accuracy advantages (over 15% for some queries) and order-of-magnitude (10x-100x) performance gains over existing approaches that represent models as discrete collections of points. The thesis also describes an algorithm to maintain regression models online, as new raw data is inserted into the system. The algorithm supports a sustained insertion rate of the order of a million records per second, while generating models no less compact than a clairvoyant (offline) strategy.
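
As a rough illustration of what querying a piecewise model algebraically can look like, the sketch below answers "where is the model at or above a threshold?" by inverting each linear piece instead of sampling points. It is not FunctionDB code: the `Piece` class, the `intervals_above` function, and the assumption that pieces are linear, sorted, and non-overlapping are all hypothetical choices made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Piece:
    """One linear segment y = slope * x + intercept, valid on [lo, hi] (hypothetical type)."""
    lo: float
    hi: float
    slope: float
    intercept: float


def intervals_above(pieces, threshold):
    """Return merged x-intervals where the piecewise-linear model is >= threshold.

    Assumes `pieces` is sorted by `lo` and the pieces do not overlap.
    """
    result = []
    for p in pieces:
        y_lo = p.slope * p.lo + p.intercept
        y_hi = p.slope * p.hi + p.intercept
        if max(y_lo, y_hi) < threshold:
            continue  # the whole piece lies below the threshold
        if min(y_lo, y_hi) >= threshold:
            start, end = p.lo, p.hi  # the whole piece qualifies
        else:
            # The piece crosses the threshold, so its slope is nonzero
            # and the linear function can be inverted exactly.
            x_cross = (threshold - p.intercept) / p.slope
            if p.slope > 0:
                start, end = x_cross, p.hi
            else:
                start, end = p.lo, x_cross
        # Merge with the previous interval when the two touch.
        if result and result[-1][1] >= start:
            result[-1] = (result[-1][0], end)
        else:
            result.append((start, end))
    return result
```

With a model made of `Piece(0.0, 10.0, 1.0, 15.0)` followed by `Piece(10.0, 20.0, -0.5, 30.0)`, a threshold of 24.0 is crossed at x = 9 on the first piece and x = 12 on the second, so the query returns a single merged interval. The answer comes from the function parameters alone, with no discretization into tuples.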

    Forecasting in Database Systems

    Time series forecasting is a fundamental prerequisite for decision-making processes and crucial in a number of domains such as production planning and energy load balancing. In the past, forecasting was often performed by statistical experts in dedicated software environments outside of current database systems. However, forecasts are increasingly required by non-expert users or have to be computed fully automatically without any human intervention. Furthermore, we can observe an ever increasing data volume and the need for accurate and timely forecasts over large multi-dimensional data sets. As most data subject to analysis is stored in database management systems, a rising trend addresses the integration of forecasting inside a DBMS. Yet, many existing approaches follow a black-box style and try to keep changes to the database system as minimal as possible. While such approaches are more general and easier to realize, they miss significant opportunities for improved performance and usability. In this thesis, we introduce a novel approach that seamlessly integrates time series forecasting into a traditional database management system. In contrast to flash-back queries that allow a view on the data in the past, we have developed a Flash-Forward Database System (F2DB) that provides a view on the data in the future. It supports a new query type - a forecast query - that enables forecasting of time series data and is automatically and transparently processed by the core engine of an existing DBMS. We discuss necessary extensions to the parser, optimizer, and executor of a traditional DBMS. We furthermore introduce various optimization techniques for three different types of forecast queries: ad-hoc queries, recurring queries, and continuous queries. First, we ease the expensive model creation step of ad-hoc forecast queries by reducing the amount of processed data with traditional sampling techniques. 
Second, we decrease the runtime of recurring forecast queries by materializing models in a specialized index structure. However, a large number of time series as well as high model creation and maintenance costs require a careful selection of such models. Therefore, we propose a model configuration advisor that determines a set of forecast models for a given query workload and multi-dimensional data set. Finally, we extend forecast queries with continuous aspects, allowing an application to register a query once with our system. As new time series values arrive, we send notifications to the application based on predefined time and accuracy constraints. All of our optimization approaches aim to increase the efficiency of forecast queries while ensuring high forecast accuracy.

    Efficient Evaluation of Data-intensive Batch-queries in Open Simulation Laboratories

    Better instruments, faster and bigger supercomputers and easier collaboration and sharing of data in the sciences have introduced the need to manage increasingly large datasets. Advances in high-performance computing (HPC) have empowered many science disciplines' computational branches. However, many scientists lack access to HPC facilities or the necessary sophistication to develop and run HPC codes. The benefits of testing new theories and experimenting with large numerical simulations have thus been restricted to a few top users. In this dissertation, I describe the "remote immersive analysis" approach to computational science and present new techniques and methods for the efficient evaluation of scientific analysis tasks in analysis cluster environments. I will discuss several techniques developed for the efficient evaluation of data-intensive batch-queries in large numerical simulation databases. An I/O streaming method for the evaluation of decomposable kernel computations utilizes partial sums to evaluate a batch query by performing a single sequential pass over the data. Spatial filtering computations, which use a box filter, share not only data but also computation, and can be evaluated over an intermediate summed volumes dataset derived from the original data. This is more efficient for certain workloads even when the intermediate dataset is computed dynamically. Threshold queries have immense data requirements and potentially operate over entire time-steps of the simulation. An efficient and scalable data-parallel approach evaluates threshold queries of fields derived from the raw simulation data and stores their results in an application-aware semantic cache for fast subsequent retrieval. Finally, synchronization at a mediator, task-parallel and data-parallel approaches for the evaluation of particle tracking queries are compared and examined. 
These techniques are developed, deployed and evaluated in the Johns Hopkins Turbulence Databases (JHTDB), an open simulation laboratory for turbulence research. The JHTDB stores the output of world-class numerical simulations of turbulence and provides public access to and means to explore their complete space-time history. The techniques discussed implement core scientific analysis routines and significantly increase the utility of the service. Additionally, they improve the performance of these routines by up to an order of magnitude or more when compared with direct implementations or implementations adapted from the simulation code.
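
The idea of evaluating box filters over a summed volumes dataset can be sketched as follows. This is a generic 3D prefix-sum (summed-volume table) illustration under assumed conventions, not the JHTDB implementation: the function names `summed_volume` and `box_sum`, the nested-list field layout, and the half-open box bounds are all hypothetical.

```python
def summed_volume(field):
    """Build a zero-padded 3D prefix-sum table for a nested-list field[x][y][z]."""
    nx, ny, nz = len(field), len(field[0]), len(field[0][0])
    sv = [[[0] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                # Inclusion-exclusion over the seven already-computed neighbours.
                sv[x + 1][y + 1][z + 1] = (
                    field[x][y][z]
                    + sv[x][y + 1][z + 1] + sv[x + 1][y][z + 1] + sv[x + 1][y + 1][z]
                    - sv[x][y][z + 1] - sv[x][y + 1][z] - sv[x + 1][y][z]
                    + sv[x][y][z]
                )
    return sv


def box_sum(sv, lo, hi):
    """Sum of the field over the half-open box [lo, hi) using eight table lookups."""
    x0, y0, z0 = lo
    x1, y1, z1 = hi
    return (
        sv[x1][y1][z1]
        - sv[x0][y1][z1] - sv[x1][y0][z1] - sv[x1][y1][z0]
        + sv[x0][y0][z1] + sv[x0][y1][z0] + sv[x1][y0][z0]
        - sv[x0][y0][z0]
    )
```

Once the table is built, every box-filter query costs eight lookups regardless of the box size, and dividing `box_sum` by the box volume gives the filtered mean. Overlapping boxes in a batch therefore share the work done in `summed_volume`, which is the sense in which such queries share computation as well as data.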

    Formal extension of the relational model for the management of spatial and spatio-temporal data

    [Abstract] In recent years, considerable research effort has gone into the manipulation of spatial data and Geographic Information Systems (GIS). A clear limitation of the early approaches is the lack of integration between geographic and alphanumeric data. The area of Spatial Databases emerged to address this. The problems that arise in this field are many and complex. A first example is the peculiarities of spatial operations, such as computing the spatial intersection of two surfaces. Another example is choosing the appropriate data structures (relations, layers, etc.) and an adequate set of operations. Combining the field with Temporal Databases gives rise to Spatio-temporal Databases, in which the inclusion of the temporal dimension further complicates the preceding problems. Despite the large number of approaches proposed, no satisfactory solution has yet been reached. This thesis proposes a new solution that solves all the spatial and spatio-temporal data modeling problems highlighted above. Part of the work was completed during the project "CHOROCHRONOS: A Research Network for Spatiotemporal Database Systems", funded by the European Union. The model proposed in the thesis defines three data types (point, line, and surface) that fit human perception perfectly. The definition of these data types is based on the prior definition of spatial quanta. The data structures used are the non-nested relations of the pure relational model. The set of relational operations achieves almost all of the functionality proposed in other models. All operations have been defined on top of a small kernel of primitive operations. All data types, whether spatial, spatio-temporal, or conventional, are manipulated uniformly with this set of operations.