Capturing the Laws of (Data) Nature
Model fitting is at the core of many scientific and industrial
applications. These models encode a wealth of domain
knowledge, something a database decidedly lacks. Except in the
simplest cases, databases cannot yet hope to achieve a deeper
understanding of the hidden relationships in the data.
We propose to harvest the statistical models that users fit
to the stored data as part of their analysis and use them to
advance physical data storage and approximate query answering
to unprecedented levels of performance. We motivate
our approach with an astronomical use case and discuss its
potential.
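The core idea can be sketched in a few lines. In this hypothetical example (all names and the linear model are illustrative, not the paper's system), a model that an analyst fitted to a stored column is harvested and reused to answer an aggregate query approximately, without scanning the tuples:

```python
# Hypothetical sketch of model-assisted approximate query answering:
# a linear model fitted by an analyst is harvested and reused to
# answer an aggregate query without scanning the stored tuples.
import random

random.seed(0)
# "Stored" relation: (x, y) pairs following a roughly linear law.
table = [(x, 2.0 * x + 1.0 + random.gauss(0, 0.1)) for x in range(1000)]

# Harvested model: least-squares fit y ~ a*x + b.
n = len(table)
sx = sum(x for x, _ in table); sy = sum(y for _, y in table)
sxx = sum(x * x for x, _ in table); sxy = sum(x * y for x, y in table)
a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
b = (sy - a * sx) / n

def avg_y_exact(lo, hi):
    # Baseline: scan the stored tuples.
    vals = [y for x, y in table if lo <= x <= hi]
    return sum(vals) / len(vals)

def avg_y_approx(lo, hi):
    # Model-based answer: mean of a*x + b over the query range.
    return a * (lo + hi) / 2.0 + b

assert abs(avg_y_exact(100, 200) - avg_y_approx(100, 200)) < 1.0
```

The approximate answer costs O(1) instead of a scan; the trade-off is that its error depends on how faithfully the harvested model captures the data's "law".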
Learning Models over Relational Data using Sparse Tensors and Functional Dependencies
Integrated solutions for analytics over relational databases are of great
practical importance as they avoid the costly repeated loop data scientists
have to deal with on a daily basis: select features from data residing in
relational databases using feature extraction queries involving joins,
projections, and aggregations; export the training dataset defined by such
queries; convert this dataset into the format of an external learning tool; and
train the desired model using this tool. These integrated solutions are also a
fertile ground of theoretically fundamental and challenging problems at the
intersection of relational and statistical data models.
This article introduces a unified framework for training and evaluating a
class of statistical learning models over relational databases. This class
includes ridge linear regression, polynomial regression, factorization
machines, and principal component analysis. We show that, by synergizing key
tools from database theory such as schema information, query structure,
functional dependencies, recent advances in query evaluation algorithms, and
from linear algebra such as tensor and matrix operations, one can formulate
relational analytics problems and design efficient (query and data)
structure-aware algorithms to solve them.
This theoretical development informed the design and implementation of the
AC/DC system for structure-aware learning. We benchmark the performance of
AC/DC against R, MADlib, libFM, and TensorFlow. For typical retail forecasting
and advertisement planning applications, AC/DC can learn polynomial regression
models and factorization machines with at least the same accuracy as its
competitors and up to three orders of magnitude faster, whenever the
competitors do not run out of memory, exceed a 24-hour timeout, or encounter
internal design limitations.
Comment: 61 pages, 9 figures, 2 tables
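A minimal sketch of the structure-aware idea, under illustrative assumptions (toy relations, a single feature per table; this is not AC/DC's actual algorithm): the sufficient statistics that ridge regression needs over a join can be assembled from per-key aggregates of the base relations, without ever materializing the join.

```python
# Sketch: entries of the gram matrix X^T X over R(k, x1) JOIN S(k, x2)
# computed from per-key aggregates, never materializing the join.
from collections import defaultdict

R = [(1, 2.0), (1, 3.0), (2, 1.0)]     # (key, x1)
S = [(1, 10.0), (2, 20.0), (2, 30.0)]  # (key, x2)

def aggregates(rel):
    # Per-key count, sum(x), sum(x^2).
    agg = defaultdict(lambda: [0, 0.0, 0.0])
    for k, x in rel:
        a = agg[k]; a[0] += 1; a[1] += x; a[2] += x * x
    return agg

aR, aS = aggregates(R), aggregates(S)
keys = aR.keys() & aS.keys()

# Each entry factors through the join key, e.g.
# sum over join of x1*x2 = sum_k sum(x1 | k) * sum(x2 | k).
sum_x1x2 = sum(aR[k][1] * aS[k][1] for k in keys)
sum_x1sq = sum(aR[k][2] * aS[k][0] for k in keys)
sum_x2sq = sum(aR[k][0] * aS[k][2] for k in keys)

# Check against the materialized join.
join = [(x1, x2) for k1, x1 in R for k2, x2 in S if k1 == k2]
assert sum_x1x2 == sum(x1 * x2 for x1, x2 in join)
assert sum_x1sq == sum(x1 * x1 for x1, _ in join)
assert sum_x2sq == sum(x2 * x2 for _, x2 in join)
```

The aggregates are linear in the base-table sizes, while the materialized join can be much larger; this asymmetry is the source of the reported speedups.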
Data infrastructures and spatial models for biodiversity assessment and analysis: applications to vertebrate communities.
In conservation biology, the computation of biodiversity maps based on statistical models is a central concern. These maps, produced with objective and repeatable methods, are an essential tool for conservation and monitoring programs as well as for land-use planning.
Since the computation of biodiversity maps requires complex and time-consuming procedures for data processing and analysis, it is necessary to design methods for homogeneous, scalable and repeatable data management and analysis.
Moreover, the huge volume of data used in ecological modelling requires suitable software architectures to store, analyze, retrieve and distribute information in order to support research and management actions in due time.
First of all, we developed an analysis system (SOS - Species Open Spreader) providing statistical and mathematical models to predict species distribution in relation to a set of predictive environmental and geographical variables. The system is composed of a module for data input/output toward and from the GIS and of a package of scripts for the application of different modelling techniques. At present, three statistical techniques are integrated in SOS: Logistic Regression Analysis (LRA), Environmental Niche Factor Analysis (ENFA) and Flexible Discriminant Analysis with the BRUTO method. Furthermore, two empirical spatial methods of analysis are available within SOS: Habitat Suitability Index (HSI) and Spatial Overlay.
The system is designed to work with the GIS (Geographical Information System) software GRASS and the statistical environment R, coupled together through the SPGRASS6 library. Three types of output are produced: text and graphical outputs with statistical results, and suitability maps.
Second, we tested the use of spatial Database Management Systems (Spatial DBMS) to handle wildlife and socio-economic data and we developed a web database application to provide facilities for database access. The information system was built for the Meru district (Tanzania) in the context of an Italian cooperation project of land use planning in Maasai rural areas.
We tested two different solutions: SpatiaLite and PostgreSQL-PostGIS; both offer advanced technical facilities and spatial extensions to analyze spatial data. SpatiaLite is a newer solution and offers the main advantages of consisting of a single file and presenting a user-friendly interface, which makes it the best solution for many applications. In spite of this, we used PostgreSQL-PostGIS, since it represents a well-established information system supported by libraries for web application development.
We applied SOS to three case studies at different spatial scales: the Brescia plain (small scale), the Mount Meru region, Tanzania (medium scale) and the Lombardy region (large scale), in order to produce maps of species potential distribution and biodiversity maps for planning and management.
We applied logistic regression analyses to compute the models and ROC analysis to evaluate classification performance. The automation of processes through SOS made it possible to build models for a large number of vertebrate species. The analysis produced very reliable results at the medium and large scales, while the regression methods did not converge at the small scale. This is probably due to habitat homogeneity and to the use of environmental variables with an insufficient level of detail.
The potential distribution and biodiversity maps produced also had a practical use in all cases. We used the mammal species models computed for the Mt. Meru region to produce a biodiversity map of the area: this map represents an informative base for land-use planning at the village level within a cooperation project for Maasai economic development and environmental restoration.
The amphibian and reptile models computed for Lombardy represent a good informative base for planning management actions in the region.
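The modelling and evaluation step described above can be illustrated with a minimal, hypothetical sketch (synthetic data, a single environmental predictor; the actual SOS scripts run in R over GRASS layers): a logistic regression of presence/absence on habitat quality, evaluated with the area under the ROC curve.

```python
# Minimal sketch of the SOS modelling step: logistic regression of
# species presence/absence on one environmental predictor, with ROC
# performance measured as AUC. Data and names are illustrative.
import math, random

random.seed(1)
# Synthetic presence/absence data: presence more likely at higher
# values of a single "habitat quality" predictor in [0, 1].
xs = [random.random() for _ in range(500)]
data = [(x, 1 if random.random() < 1 / (1 + math.exp(-(2 * x - 1))) else 0)
        for x in xs]

# Fit P(presence | x) = sigmoid(w*x + b) by batch gradient descent.
w = b = 0.0
lr = 0.5
for _ in range(3000):
    gw = gb = 0.0
    for x, y in data:
        p = 1 / (1 + math.exp(-(w * x + b)))
        gw += (p - y) * x
        gb += (p - y)
    w -= lr * gw / len(data)
    b -= lr * gb / len(data)

# ROC performance: AUC via the rank (Mann-Whitney) statistic.
scores = [(1 / (1 + math.exp(-(w * x + b))), y) for x, y in data]
pos = [s for s, y in scores if y == 1]
neg = [s for s, y in scores if y == 0]
auc = (sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
       / (len(pos) * len(neg)))
assert w > 0 and auc > 0.55
```

An AUC well above 0.5 indicates the fitted suitability score ranks presence sites above absence sites, which is the criterion the thesis uses to validate its species distribution models.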
GEOMAGIA50.v3: 1. general structure and modifications to the archeological and volcanic database
Background: GEOMAGIA50.v3 is a comprehensive online database providing access to published paleomagnetic, rock magnetic, and chronological data from a variety of materials that record Earth's magnetic field over the past 50 ka.
Findings: Since its original release in 2006, the structure and function of the database have been updated and a significant number of data have been added. Notable modifications are the following: (1) the inclusion of additional intensity, directional and metadata from archeological and volcanic materials and improved documentation of radiocarbon dates; (2) a new data model to accommodate paleomagnetic, rock magnetic, and chronological data from lake and marine sediments; (3) a refinement of the geographic constraints in the archeomagnetic/volcanic query, allowing selection of particular locations; (4) more flexible methodological and statistical constraints in the archeomagnetic/volcanic query; (5) the calculation of predictions of the Holocene geomagnetic field from a series of time-varying global field models; (6) searchable reference lists; and (7) an updated web interface. This paper describes general modifications to the database and specific aspects of the archeomagnetic and volcanic database. The reader is referred to a companion publication for a description of the sediment database.
Conclusions: The archeomagnetic and volcanic part of GEOMAGIA50.v3 currently contains 14,645 data (declination, inclination, and paleointensity) from 461 studies published between 1959 and 2014. We review the paleomagnetic methods used to obtain these data and discuss applications of the data within the database. The database continues to expand as legacy data are added and new studies published. The web-based interface can be found at http://geomagia.gfz-potsdam.de
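The kind of geographic and chronological filtering the archeomagnetic/volcanic query supports can be sketched as follows. The record layout here is purely illustrative, not the actual GEOMAGIA schema:

```python
# Hypothetical sketch of a geographic + age query over paleomagnetic
# records: select records inside a latitude/longitude box and younger
# than a maximum age. Field names are illustrative.
records = [
    {"lat": 48.1, "lon": 11.6, "age_ka": 2.5, "decl": -3.2},
    {"lat": 35.7, "lon": 139.7, "age_ka": 1.1, "decl": 5.4},
    {"lat": 52.5, "lon": 13.4, "age_ka": 40.0, "decl": 12.0},
]

def query(recs, lat_min, lat_max, lon_min, lon_max, age_max_ka):
    # Keep records inside the bounding box and the age window.
    return [r for r in recs
            if lat_min <= r["lat"] <= lat_max
            and lon_min <= r["lon"] <= lon_max
            and r["age_ka"] <= age_max_ka]

# Central-European records younger than 10 ka: one match.
hits = query(records, 45, 55, 5, 15, 10)
assert len(hits) == 1 and hits[0]["decl"] == -3.2
```

In the real database these constraints are applied server-side, combined with the methodological and statistical filters listed above.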
Speculative Approximations for Terascale Analytics
Model calibration is a major challenge faced by the plethora of statistical
analytics packages that are increasingly used in Big Data applications.
Identifying the optimal model parameters is a time-consuming process that has
to be executed from scratch for every dataset/model combination even by
experienced data scientists. We argue that the incapacity to evaluate multiple
parameter configurations simultaneously and the lack of support to quickly
identify sub-optimal configurations are the principal causes. In this paper, we
develop two database-inspired techniques for efficient model calibration.
Speculative parameter testing applies advanced parallel multi-query processing
methods to evaluate several configurations concurrently. The number of
configurations is determined adaptively at runtime, while the configurations
themselves are extracted from a distribution that is continuously learned
following a Bayesian process. Online aggregation is applied to identify
sub-optimal configurations early in the processing by incrementally sampling
the training dataset and estimating the objective function corresponding to
each configuration. We design concurrent online aggregation estimators and
define halting conditions that stop the execution accurately and in a timely
manner. We apply
the proposed techniques to distributed gradient descent optimization -- batch
and incremental -- for support vector machines and logistic regression models.
We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big
Data analytics system -- and evaluate their performance over terascale-size
synthetic and real datasets. The results confirm that as many as 32
configurations can be evaluated concurrently almost as fast as one, while
sub-optimal configurations are detected accurately in as little as a
fraction of the time.
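The two techniques can be sketched together in a toy setting (synthetic least-squares data, learning rates as the speculative configurations, arbitrary halting thresholds; none of this is GLADE PF-OLA's actual implementation): several configurations share one pass over the data, and configurations whose sampled loss estimate is far above the current best are halted early.

```python
# Sketch of speculative parameter testing + online aggregation:
# three learning rates for SGD on (w*x - y)^2 are trained in one
# pass; at periodic checkpoints, each config's loss is estimated on
# a sample and clearly sub-optimal configs are dropped.
import math, random

random.seed(2)
data = [(x, 3.0 * x + random.gauss(0, 0.1))
        for x in [random.uniform(-1, 1) for _ in range(2000)]]

configs = {0.001: 0.0, 0.05: 0.0, 50.0: 0.0}  # learning rate -> weight w
active = set(configs)

for i, (x, y) in enumerate(data):
    for lr in active:
        w = configs[lr]
        configs[lr] = w - lr * (w * x - y) * x  # SGD step
    if i and i % 500 == 0:
        # Online aggregation: estimate each active config's loss on an
        # incremental sample and halt configs far worse than the best.
        sample = random.sample(data, 200)
        losses = {}
        for lr in active:
            ds = [configs[lr] * sx - sy for sx, sy in sample]
            losses[lr] = sum(d * d for d in ds) / len(sample)
        finite = {lr: l for lr, l in losses.items() if math.isfinite(l)}
        best = min(finite.values())
        active = {lr for lr, l in finite.items() if l < 10 * best + 1.0}

# The divergent rate is halted early; the good rate converges to w ~ 3.
assert 0.05 in active and 50.0 not in active
```

The divergent configuration (lr = 50.0) is paid for only until the first checkpoint, which is the essence of the claimed "fraction of the time" detection of sub-optimal configurations.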
Wide-Area Measurement-Based Applications for Power System Monitoring and Dynamic Modeling
Due to the increasingly complex behavior exhibited by large-scale power systems with more uncertain renewables introduced to the grid, the wide-area measurement system (WAMS) has been utilized to complement the traditional supervisory control and data acquisition (SCADA) system to improve operators' situational awareness. By providing wide-area, GPS-time-synchronized measurements of grid status at high time resolution, WAMS can reveal power system dynamics that could not be captured before and has become an essential tool for dealing with current and future power grid challenges. According to their time requirements, power system applications can be roughly divided into online applications (e.g., data visualization, fast disturbance and oscillation detection, and system response prediction and reduction) and offline applications (e.g., measurement-driven dynamic modeling and validation, post-event analysis, and statistical analysis of historical data).
In this dissertation, various wide-area measurement-based applications are presented. Firstly, a pioneering WAMS deployed at the distribution level, the frequency monitoring network (FNET/GridEye), is introduced. For conventional large-scale power grid dynamic simulation, two major challenges are 1) the accuracy of detailed dynamic models, and 2) the computation burden of online dynamic assessment. To overcome the restrictions of the traditional approach, a measurement-based system response prediction tool using a Multivariate AutoRegressive (MAR) model is developed. It is followed by a measurement-based power system dynamic reduction tool using an autoregressive model to represent the external system. In addition, phasor measurement unit (PMU) data are employed to perform a generator dynamic model validation study, which utilizes both simulation data and measurement data to explore the potentials and limitations of the proposed approach. As an innovative application of wide-area power system measurements, digital recordings can be authenticated by comparing the frequency and phase angle extracted from the recordings with a power system measurement database; this work includes four research studies: oscillator error removal, ENF phenomenology, tampering detection, and frequency localization. Finally, several preliminary data analytics studies, including inertia estimation and analysis, fault-induced delayed voltage recovery (FIDVR) detection, and statistical analysis of an oscillation database, are presented.
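The measurement-based idea behind the MAR response predictor can be sketched in a scalar, noise-free toy case (the dissertation's model is multivariate and fitted to real PMU data; signal, order, and coefficients here are illustrative): fit an autoregressive model to recorded samples by least squares and predict the next value, with no physical grid model in the loop.

```python
# Sketch: fit a scalar AR(2) model to a "measured" damped oscillation
# by least squares (normal equations) and predict one step ahead.
import math

# Illustrative measured signal, e.g. a post-disturbance frequency
# deviation: a damped oscillation.
sig = [math.exp(-0.01 * t) * math.cos(0.3 * t) for t in range(200)]

# AR(2): s[t] ~ a1*s[t-1] + a2*s[t-2]; solve the 2x2 normal equations.
y  = sig[2:]
x1 = sig[1:-1]
x2 = sig[:-2]
A = [[sum(v * v for v in x1), sum(u * v for u, v in zip(x1, x2))],
     [sum(u * v for u, v in zip(x1, x2)), sum(v * v for v in x2)]]
b = [sum(u * v for u, v in zip(x1, y)),
     sum(u * v for u, v in zip(x2, y))]
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
a1 = (b[0] * A[1][1] - b[1] * A[0][1]) / det   # Cramer's rule
a2 = (A[0][0] * b[1] - A[1][0] * b[0]) / det

# One-step-ahead prediction of the next (unseen) sample.
pred = a1 * sig[-1] + a2 * sig[-2]
true = math.exp(-0.01 * 200) * math.cos(0.3 * 200)
assert abs(pred - true) < 1e-3
```

A damped sinusoid satisfies an exact AR(2) recurrence, so the fit recovers the dynamics from measurements alone; the multivariate version does the same across many PMU channels at once.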
Non-parametric Ensemble Kalman methods for the inpainting of noisy dynamic textures
In this work, we propose a novel non-parametric method for the temporally consistent inpainting of dynamic texture sequences. The inpainting of texture image sequences is stated as a stochastic assimilation issue, for which a novel model-free and data-driven Ensemble Kalman method is introduced. Our model is inspired by the Analog Ensemble Kalman Filter (AnEnKF) recently proposed for the assimilation of geophysical space-time dynamics, where the physical model is replaced by the use of statistical analogs or nearest neighbours. Such a non-parametric framework is of key interest for image processing applications, as prior models are seldom available in general. We present experimental evidence, on real dynamic textures, that using only a catalog database of historical data and without making any assumption on the model, the proposed method provides relevant, dynamically consistent interpolation and outperforms the classical parametric (autoregressive) dynamical prior.
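The analog forecasting step at the heart of the AnEnKF-style approach can be sketched minimally (scalar state, a known toy dynamic as the catalog source; the paper works on high-dimensional texture states and couples this with an ensemble Kalman update): the physical model is replaced by a catalog lookup in which the successors of the k nearest historical states serve as the forecast ensemble.

```python
# Sketch of analog (nearest-neighbour) forecasting: predict the next
# state of an unseen point from the successors of its k nearest
# analogs in a catalog of historical (state, successor) pairs.
import math

# Illustrative catalog built from a known dynamic: x -> sin(x).
catalog = [(x / 100.0, math.sin(x / 100.0)) for x in range(300)]

def analog_forecast(state, k=3):
    # k nearest analogs by state distance; forecast = mean successor.
    nearest = sorted(catalog, key=lambda p: abs(p[0] - state))[:k]
    return sum(s for _, s in nearest) / k

# Forecast for a state not in the catalog, checked against the truth.
assert abs(analog_forecast(1.234) - math.sin(1.234)) < 0.01
```

Because the forecast uses only the catalog, no parametric prior on the dynamics is ever written down, which is exactly the "model-free" property the abstract emphasizes.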