511,766 research outputs found

    Capturing the Laws of (Data) Nature

    Model fitting is at the core of many scientific and industrial applications. These models encode a wealth of domain knowledge, something a database decidedly lacks. Except for simple cases, databases could not hope to achieve a deeper understanding of the hidden relationships in the data yet. We propose to harvest the statistical models that users fit to the stored data as part of their analysis and use them to advance physical data storage and approximate query answering to unprecedented levels of performance. We motivate our approach with an astronomical use case and discuss its pote

    Learning Models over Relational Data using Sparse Tensors and Functional Dependencies

    Integrated solutions for analytics over relational databases are of great practical importance as they avoid the costly repeated loop data scientists have to deal with on a daily basis: select features from data residing in relational databases using feature extraction queries involving joins, projections, and aggregations; export the training dataset defined by such queries; convert this dataset into the format of an external learning tool; and train the desired model using this tool. These integrated solutions are also a fertile ground of theoretically fundamental and challenging problems at the intersection of relational and statistical data models. This article introduces a unified framework for training and evaluating a class of statistical learning models over relational databases. This class includes ridge linear regression, polynomial regression, factorization machines, and principal component analysis. We show that, by synergizing key tools from database theory such as schema information, query structure, functional dependencies, recent advances in query evaluation algorithms, and from linear algebra such as tensor and matrix operations, one can formulate relational analytics problems and design efficient (query and data) structure-aware algorithms to solve them. This theoretical development informed the design and implementation of the AC/DC system for structure-aware learning. We benchmark the performance of AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting and advertisement planning applications, AC/DC can learn polynomial regression models and factorization machines with at least the same accuracy as its competitors and up to three orders of magnitude faster than its competitors whenever they do not run out of memory, exceed the 24-hour timeout, or encounter internal design limitations. Comment: 61 pages, 9 figures, 2 table

    Data infrastructures and spatial models for biodiversity assessment and analysis: applications to vertebrate communities.

    In conservation biology, the computation of biodiversity maps based on statistical models is a central concern. These maps, produced with objective and repeatable methods, are an essential tool for conservation and monitoring programs as well as for land-use planning. Since the computation of biodiversity maps requires complex and time-consuming procedures for data processing and analysis, it is necessary to design methods for homogeneous, scalable and repeatable data management and analysis. Moreover, the huge volume of data used in ecological modelling requires suitable software architectures to store, analyze, retrieve and distribute information in order to support research and management actions in due time. First of all, we developed an analysis system (SOS - Species Open Spreader) providing statistical and mathematical models to predict species distribution in relation to a set of predictive environmental and geographical variables. The system is composed of a module for data input/output to and from the GIS and of a package of scripts for the application of different modelling techniques. At present, three statistical techniques are integrated in SOS: Logistic Regression Analysis (LRA), Environmental Niche Factor Analysis (ENFA) and flexible Discriminant Analysis with method BRUTO. Furthermore, two empirical spatial methods of analysis are available within SOS: Habitat Suitability Index (HSI) and Spatial Overlay. The system is designed to work with the GIS (Geographical Information System) software GRASS and the statistical environment R, coupled together through the SPGRASS6 library. Three different outputs are expected: text and graphical outputs with statistical results and suitability maps. Second, we tested the use of spatial Database Management Systems (Spatial DBMS) to handle wildlife and socio-economic data and we developed a web database application to provide facilities for database access. 
The information system was built for the Meru district (Tanzania) in the context of an Italian cooperation project of land use planning in Maasai rural areas. We tested two different solutions: SpatiaLite and PostgreSQL-PostGIS; they both offer advanced technical facilities and spatial extensions to analyze spatial data. SpatiaLite is a new solution whose main advantages are that it consists of a single file and presents a user-friendly interface, which make it the best solution for many applications. In spite of this, we used PostgreSQL-PostGIS since it represents a well-established information system supported by libraries for web applications development. We applied SOS to three case studies at different spatial scales: Brescia plain (small scale), Mount Meru region - Tanzania (medium scale) and Lombardy region (big scale) in order to produce maps of species potential distribution and biodiversity maps for planning and management. We applied logistic regression analyses to compute models and ROC analysis for classification performance evaluation. The automation of processes through SOS gave us the possibility to build models for a large number of vertebrate species. The analysis produced very reliable results at middle and big scale, while regression methods did not converge at small scale. This is probably due to habitat homogeneity and to the use of environmental variables with an insufficient level of detail. The potential distribution and biodiversity maps produced also had a practical application in all cases: we used mammal species models computed for the Mt. Meru region to produce a map of biodiversity within the area. This map represents an informative base for land use planning at village level within a cooperation project for Maasai economic development and environmental redemption. Amphibian and reptile models, computed for Lombardy, represent a good informative base for planning management actions in the region

    GEOMAGIA50.v3: 1. general structure and modifications to the archeological and volcanic database

    Background: GEOMAGIA50.v3 is a comprehensive online database providing access to published paleomagnetic, rock magnetic, and chronological data from a variety of materials that record Earth’s magnetic field over the past 50 ka. Findings: Since its original release in 2006, the structure and function of the database have been updated and a significant number of data have been added. Notable modifications are the following: (1) the inclusion of additional intensity, directional and metadata from archeological and volcanic materials and an improved documentation of radiocarbon dates; (2) a new data model to accommodate paleomagnetic, rock magnetic, and chronological data from lake and marine sediments; (3) a refinement of the geographic constraints in the archeomagnetic/volcanic query allowing selection of particular locations; (4) more flexible methodological and statistical constraints in the archeomagnetic/volcanic query; (5) the calculation of predictions of the Holocene geomagnetic field from a series of time varying global field models; (6) searchable reference lists; and (7) an updated web interface. This paper describes general modifications to the database and specific aspects of the archeomagnetic and volcanic database. The reader is referred to a companion publication for a description of the sediment database. Conclusions: The archeomagnetic and volcanic part of GEOMAGIA50.v3 currently contains 14,645 data (declination, inclination, and paleointensity) from 461 studies published between 1959 and 2014. We review the paleomagnetic methods used to obtain these data and discuss applications of the data within the database. The database continues to expand as legacy data are added and new studies published. The web-based interface can be found at http://geomagia.gfz-potsdam.de

    Speculative Approximations for Terascale Analytics

    Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurations simultaneously and the lack of support to quickly identify sub-optimal configurations are the principal causes. In this paper, we develop two database-inspired techniques for efficient model calibration. Speculative parameter testing applies advanced parallel multi-query processing methods to evaluate several configurations concurrently. The number of configurations is determined adaptively at runtime, while the configurations themselves are extracted from a distribution that is continuously learned following a Bayesian process. Online aggregation is applied to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to accurately and timely stop the execution. We apply the proposed techniques to distributed gradient descent optimization -- batch and incremental -- for support vector machines and logistic regression models. We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big Data analytics system -- and evaluate their performance over terascale-size synthetic and real datasets. The results confirm that as many as 32 configurations can be evaluated concurrently almost as fast as one, while sub-optimal configurations are detected accurately in as little as a 1/20th fraction of the time

    Wide-Area Measurement-Based Applications for Power System Monitoring and Dynamic Modeling

    Due to the increasingly complex behavior exhibited by large-scale power systems with more uncertain renewables introduced to the grid, the wide-area measurement system (WAMS) has been utilized to complement the traditional supervisory control and data acquisition (SCADA) system to improve operators’ situational awareness. By providing wide-area GPS-time-synchronized measurements of grid status at high time-resolution, it is able to reveal power system dynamics that could not be captured before and has become an essential tool to deal with current and future power grid challenges. According to the time requirements of different power system applications, the applications can be roughly divided into online applications (e.g., data visualization, fast disturbance and oscillation detection, and system response prediction and reduction) and offline applications (e.g., measurement-driven dynamic modeling and validation, post-event analysis, and statistical analysis of historical data). In this dissertation, various wide-area measurement-based applications are presented. Firstly, a pioneering WAMS deployed at the distribution level, the frequency monitoring network (FNET/GridEye), is introduced. For conventional large-scale power grid dynamic simulation, two major challenges are 1) accuracy of detailed dynamic models, and 2) computation burden for online dynamic assessment. To overcome the restrictions of the traditional approach, a measurement-based system response prediction tool using a Multivariate AutoRegressive (MAR) model is developed. It is followed by a measurement-based power system dynamic reduction tool using an autoregressive model to represent the external system. In addition, phasor measurement unit (PMU) data are employed to perform the generator dynamic model validation study. It utilizes both simulation data and measurement data to explore the potentials and limitations of the proposed approach. 
As an innovative application of wide-area power system measurement, digital recordings can be authenticated by comparing the frequency and phase angle extracted from the recordings with a power system measurement database. This work includes four research studies, i.e., oscillator error removal, ENF phenomenology, tampering detection, and frequency localization. Finally, several preliminary data analytics studies, including inertia estimation and analysis, fault-induced delayed voltage recovery (FIDVR) detection, and statistical analysis of an oscillation database, are presented

    Non-parametric Ensemble Kalman methods for the inpainting of noisy dynamic textures

    In this work, we propose a novel non-parametric method for the temporally consistent inpainting of dynamic texture sequences. The inpainting of texture image sequences is stated as a stochastic assimilation issue, for which a novel model-free and data-driven Ensemble Kalman method is introduced. Our model is inspired by the Analog Ensemble Kalman Filter (AnEnKF) recently proposed for the assimilation of geophysical space-time dynamics, where the physical model is replaced by the use of statistical analogs or nearest neighbours. Such a non-parametric framework is of key interest for image processing applications, as prior models are seldom available in general. We present experimental evidence on real dynamic textures that, using only a catalog database of historical data and without any assumption on the model, the proposed method provides relevant dynamically-consistent interpolation and outperforms the classical parametric (autoregressive) dynamical prior