Correcting sampling biases via importance reweighting for spatial modeling
In machine learning models, estimating errors is often complicated by
distribution bias, particularly in spatial data such as those found in
environmental studies. We introduce an approach based on importance sampling
to obtain an unbiased estimate of the target error. By taking into account the
difference between the desired error and the available data, our method
reweights the error at each sample point and neutralizes the shift. Importance
sampling and kernel density estimation were used for reweighting. We validate
the effectiveness of our approach using artificial data that resemble
real-world spatial datasets. Our findings demonstrate the advantages of the
proposed approach for estimating the target error, offering a solution to the
distribution shift problem. The overall prediction error dropped from 7% to
just 2%, and it decreases further for larger samples.
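The reweighting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional locations, the toy error function, the bandwidth, and the names `gaussian_kde` and `reweighted_error` are all assumptions made for illustration. Each per-point error is weighted by the ratio of a kernel density estimate of the target locations to a kernel density estimate of the sampled locations, and the weights are self-normalized.

```python
import math
import random


def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `points`."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


def reweighted_error(errors, sample_locs, target_locs, bandwidth=0.1):
    """Self-normalized importance-weighted mean of per-point errors."""
    p_sample = gaussian_kde(sample_locs, bandwidth)
    p_target = gaussian_kde(target_locs, bandwidth)
    # Weight = estimated target density / estimated sampling density.
    weights = [p_target(x) / p_sample(x) for x in sample_locs]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)


rng = random.Random(0)
# Hypothetical data: sampling is concentrated near x = 0.2,
# while the region of interest is uniform on [0, 1].
sample_locs = [rng.gauss(0.2, 0.15) for _ in range(500)]
target_locs = [rng.random() for _ in range(500)]
errors = [x ** 2 for x in sample_locs]  # toy error that grows with x

naive = sum(errors) / len(errors)
corrected = reweighted_error(errors, sample_locs, target_locs)
print(f"naive mean error: {naive:.3f}, reweighted: {corrected:.3f}")
```

In this toy setup the naive average understates the error over the region of interest because few samples fall where the error is large; the reweighted estimate moves toward the target value, although it cannot recover regions that contain no samples at all.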
The Infinite Mixture of Infinite Gaussian Mixtures
The Dirichlet process mixture of Gaussians (DPMG) has been used in the literature for clustering and density estimation problems. However, many real-world data sets exhibit cluster distributions that cannot be captured by a single Gaussian. Modeling such data sets by DPMG creates several extraneous clusters even when clusters are relatively well-defined. Herein, we present the infinite mixture of infinite Gaussian mixtures (I2GMM) for more flexible modeling of data sets with skewed and multi-modal cluster distributions. Instead of using a single Gaussian for each cluster as in the standard DPMG model, the generative model of I2GMM uses a single DPMG for each cluster. The individual DPMGs are linked together through centering of their base distributions at the atoms of a higher level DP prior. Inference is performed by a collapsed Gibbs sampler that also enables partial parallelization. Experimental results on several artificial and real-world data sets suggest the proposed I2GMM model can predict clusters more accurately than existing variational Bayes and Gibbs sampler versions of DPMG.
Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach
Numerical data imputation algorithms replace missing values by estimates to
leverage incomplete data sets. Current imputation methods seek to minimize the
error between the unobserved ground truth and the imputed values. But this
strategy can create artifacts leading to poor imputation in the presence of
multimodal or complex distributions. To tackle this problem, we introduce the
NNKDE algorithm: a data imputation method combining nearest neighbor
estimation (NN) and density estimation with Gaussian kernels (KDE). We
compare our method with previous data imputation methods using artificial and
real-world data with different data missing scenarios and various data missing
rates, and show that our method can cope with complex original data structure,
yields lower data imputation errors, and provides probabilistic estimates with
higher likelihood than current methods. We release the code as open source for
the community: https://github.com/DeltaFloflo/knnxkde
Comment: 30 pages, 8 figures, accepted in TMLR (Reproducibility certification)
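The combination of nearest-neighbor search and Gaussian kernels described in this abstract can be sketched as follows. This is a simplified illustration and not the released code: the function name `impute_samples`, the Euclidean distance, the fixed bandwidth, and the toy bimodal data are all assumptions. For a row with a missing entry, the sketch finds the k nearest complete rows on the observed columns and draws candidate values from a Gaussian mixture centered on those neighbors' values.

```python
import math
import random


def impute_samples(rows, query, missing_col, k=6, bandwidth=0.2,
                   n_draws=200, seed=0):
    """Draw candidate values for query[missing_col] from Gaussian kernels
    centred on the values of its k nearest neighbours."""
    rng = random.Random(seed)
    observed = [j for j, v in enumerate(query)
                if j != missing_col and v is not None]
    # Keep only rows that have every column we need.
    complete = [r for r in rows
                if r[missing_col] is not None
                and all(r[j] is not None for j in observed)]

    def dist(r):
        return math.sqrt(sum((r[j] - query[j]) ** 2 for j in observed))

    neighbours = sorted(complete, key=dist)[:k]
    centres = [r[missing_col] for r in neighbours]
    # Sampling from a KDE: pick a centre, then add Gaussian noise.
    return [rng.gauss(rng.choice(centres), bandwidth) for _ in range(n_draws)]


# Hypothetical bimodal data: for every x, y is either +2 or -2.
rows = [(i / 49, 2.0 if i % 2 == 0 else -2.0) for i in range(50)]
draws = impute_samples(rows, query=(0.5, None), missing_col=1)
mean_draw = sum(draws) / len(draws)
print(f"mean of draws: {mean_draw:.2f}")
print(f"draws near +2: {sum(d > 0 for d in draws)}, "
      f"near -2: {sum(d < 0 for d in draws)}")
```

On this toy data a single point estimate such as the mean lands between the two modes, where no data exist, while the sampled candidates stay close to +2 or -2. That is the kind of artifact the abstract attributes to error-minimizing imputation on multimodal distributions.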
AdaCat: Adaptive Categorical Discretization for Autoregressive Models
Autoregressive generative models can estimate complex continuous data
distributions, like trajectory rollouts in an RL environment, image
intensities, and audio. Most state-of-the-art models discretize continuous data
into several bins and use categorical distributions over the bins to
approximate the continuous data distribution. The advantage is that the
categorical distribution can easily express multiple modes and is
straightforward to optimize. However, such an approximation cannot express sharp
changes in density without using significantly more bins, making it parameter
inefficient. We propose an efficient, expressive, multimodal parameterization
called Adaptive Categorical Discretization (AdaCat). AdaCat discretizes each
dimension of an autoregressive model adaptively, which allows the model to
allocate density to fine intervals of interest, improving parameter efficiency.
AdaCat generalizes both categoricals and quantile-based regression. AdaCat is a
simple add-on to any discretization-based distribution estimator. In
experiments, AdaCat improves density estimation for real-world tabular data,
images, audio, and trajectories, and improves planning in model-based offline
RL.
Comment: Uncertainty in Artificial Intelligence (UAI) 2022, 13 pages, 4 figures
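One way to read "adaptive discretization" is as a piecewise-uniform density on [0, 1) whose bin widths are learned along with the bin probabilities. The sketch below follows that reading; it is an illustrative assumption and not the paper's code, and the names `softmax`, `adaptive_log_density`, `width_logits`, and `mass_logits` are hypothetical. Within bin i the density is the bin's probability mass divided by its width, so a narrow bin with large mass represents a sharp peak without adding more bins.

```python
import math


def softmax(logits):
    """Map real-valued logits to positive numbers that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_log_density(x, width_logits, mass_logits):
    """Log-density at x in [0, 1) for a piecewise-uniform density whose
    bin widths and bin masses are both free parameters."""
    widths = softmax(width_logits)  # bin widths, sum to 1
    masses = softmax(mass_logits)   # bin probabilities, sum to 1
    left = 0.0
    for w, p in zip(widths, masses):
        if x < left + w:
            return math.log(p) - math.log(w)
        left += w
    # Guard against floating-point rounding at the right edge.
    return math.log(masses[-1]) - math.log(widths[-1])


# Equal widths and equal masses reduce to the uniform density on [0, 1).
print(adaptive_log_density(0.3, [0, 0, 0, 0], [0, 0, 0, 0]))

# A narrow first bin carrying most of the mass gives a sharp peak near 0.
width_logits = [-2.0, 0.0, 1.0]
mass_logits = [1.0, 0.0, -1.0]
print(math.exp(adaptive_log_density(0.01, width_logits, mass_logits)))
```

With fixed, equal widths this reduces to an ordinary categorical discretization, which is consistent with the abstract's statement that the method generalizes categoricals.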
Conditional Density Models Integrating Fuzzy and Probabilistic Representations of Uncertainty
__Abstract__
Conditional density estimation is an important problem in a variety of areas such as system identification, machine learning, artificial intelligence, empirical economics, macroeconomic analysis, quantitative finance and risk management.
This work considers the general problem of conditional density estimation, i.e., estimating and predicting the density of a response variable as a function of covariates. The semi-parametric models proposed and developed in this work combine fuzzy and probabilistic representations of uncertainty, while making very few assumptions regarding the functional form of the response variable's density or changes of the functional form across the space of covariates. These models possess sufficient generalization power to approximate a non-standard density and the ability to describe the underlying process using simple linguistic descriptors despite the complexity and possible non-linearity of this process.
These novel models are applied to real-world quantitative finance and risk management problems by analyzing financial time-series data with non-trivial statistical properties, such as fat tails, asymmetric distributions and changing variation over time.
Estimation of density log and sonic log using artificial intelligence: an example from the Perth Basin, Australia
It is well understood that a large amount of data allows an excellent interpretation of subsurface conditions and can significantly improve our understanding of them. However, abundant subsurface geological and petrophysical data are not always available, mainly because of budget constraints. This can cause problems during hydrocarbon exploration and/or development activities.
In this paper, the authors apply artificial intelligence (AI) techniques to estimate the values of particular wireline logs from the available petrophysical data. Two techniques were selected: artificial neural network (ANN) and multiple linear regression (MLR). This research aims to advance our understanding of AI and its application in geology. The study has three objectives: (1) to estimate sonic log (DT) and density log (RhoB) using different types of AI (ANN and MLR); (2) to identify the best AI technique for estimating certain wireline log data; and (3) to compare the estimated wireline log values with the real, recorded values from the subsurface.
Findings from this study show that ANN consistently provided better accuracy than MLR when estimating the density log (RhoB). For the sonic log (DT), the accuracy varied with the data set and technique used. Moreover, crossplot validation shows that the ANN results produced higher trendline reliability (R²) and correlation coefficient (R) than the MLR results. Comparison of the estimated RhoB and DT log data with the original recorded data shows only minor mismatch. This is evidence that AI techniques can be a reliable way to estimate particular wireline log data when the original recorded subsurface petrophysical data are of limited availability. It is expected that these findings will provide new insights into the application of AI in geology and encourage readers to explore and expand its many possible applications.