Correcting sampling biases via importance reweighting for spatial modeling

Authors: Koldasbayeva Diana; Prokhorov Boris; Zaytsev Alexey
Publication date: 14/09/2023

In machine learning, estimating model error is often complicated by distribution bias, particularly for spatial data such as those found in environmental studies. We introduce an approach based on importance sampling to obtain an unbiased estimate of the target error. By taking into account the difference between the distribution on which the error is desired and the distribution of the available data, our method reweights the error at each sample point and neutralizes the shift. Importance sampling and kernel density estimation are used for the reweighting. We validate the effectiveness of our approach using artificial data that resemble real-world spatial datasets. Our findings demonstrate the advantages of the proposed approach for estimating the target error, offering a solution to the distribution shift problem. The overall error of predictions dropped from 7% to 2%, and it decreases further for larger samples.
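The reweighting idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the one-dimensional setting, the fixed bandwidth, the self-normalised form of the estimator, and the names `gaussian_kde` and `reweighted_error` are all assumptions made for illustration, and the numbers are synthetic.

```python
import math


def gaussian_kde(points, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `points`."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(
            math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points
        )

    return density


def reweighted_error(errors, sample_x, target_x, bandwidth=0.5):
    """Self-normalised importance-weighted mean of per-point errors.

    Each error is weighted by p_target(x) / p_sample(x), where both
    densities are kernel density estimates (an illustrative assumption).
    """
    p_sample = gaussian_kde(sample_x, bandwidth)
    p_target = gaussian_kde(target_x, bandwidth)
    weights = [p_target(x) / p_sample(x) for x in sample_x]
    return sum(w * e for w, e in zip(weights, errors)) / sum(weights)


# Synthetic illustration: sampling is concentrated near x = 0, where the
# error is small, while the target locations are spread over [0, 2].
sample_x = [0.0, 0.1, 0.2, 0.3, 2.0]
errors = [0.1, 0.1, 0.1, 0.1, 0.9]
target_x = [0.0, 0.5, 1.0, 1.5, 2.0]

plain = sum(errors) / len(errors)
reweighted = reweighted_error(errors, sample_x, target_x)
print(plain, reweighted)
```

In this toy setting the plain mean understates the error over the target locations, because the poorly sampled region near x = 2 carries the largest error; the reweighted estimate gives that point more weight.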

arXiv.org e-Print Archive

The Infinite Mixture of Infinite Gaussian Mixtures

Authors: Dundar Murat; Rajwa Bartek; Yerebakan Halid Z.
Publication date: 01/01/2015

Dirichlet process mixture of Gaussians (DPMG) has been used in the literature for clustering and density estimation problems. However, many real-world data sets exhibit cluster distributions that cannot be captured by a single Gaussian. Modeling such data sets by DPMG creates several extraneous clusters even when clusters are relatively well-defined. Herein, we present the infinite mixture of infinite Gaussian mixtures (I2GMM) for more flexible modeling of data sets with skewed and multi-modal cluster distributions. Instead of using a single Gaussian for each cluster as in the standard DPMG model, the generative model of I2GMM uses a single DPMG for each cluster. The individual DPMGs are linked together through centering of their base distributions at the atoms of a higher level DP prior. Inference is performed by a collapsed Gibbs sampler that also enables partial parallelization. Experimental results on several artificial and real-world data sets suggest the proposed I2GMM model can predict clusters more accurately than existing variational Bayes and Gibbs sampler versions of DPMG.

IUPUIScholarWorks

Numerical Data Imputation for Multimodal Data Sets: A Probabilistic Nearest-Neighbor Kernel Density Approach

Authors: Doya Kenji; Lalande Floria
Publication date: 29/06/2023

Numerical data imputation algorithms replace missing values by estimates to leverage incomplete data sets. Current imputation methods seek to minimize the error between the unobserved ground truth and the imputed values. But this strategy can create artifacts leading to poor imputation in the presence of multimodal or complex distributions. To tackle this problem, we introduce the $k\times$KDE algorithm: a data imputation method combining nearest neighbor estimation ($k$NN) and density estimation with Gaussian kernels (KDE). We compare our method with previous data imputation methods using artificial and real-world data with different data missing scenarios and various data missing rates, and show that our method can cope with complex original data structure, yields lower data imputation errors, and provides probabilistic estimates with higher likelihood than current methods. We release the code in open-source for the community: https://github.com/DeltaFloflo/knnxkde

Comment: 30 pages, 8 figures, accepted in TMLR (Reproducibility certification)
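To make the combination of nearest neighbors and Gaussian kernels concrete, here is a rough sketch of one way such an imputer could work. It is not the code from the repository linked in the abstract: the function name `knn_kde_impute`, its parameters, the Euclidean distance on observed columns, and the toy data are all hypothetical. The sketch draws imputations from a Gaussian kernel density placed on the neighbors' values instead of returning their mean.

```python
import math
import random


def knn_kde_impute(rows, query, missing_col, k=3, bandwidth=0.1,
                   n_draws=5, seed=0):
    """Draw imputations for query[missing_col] (hypothetical helper).

    1. Find the k complete rows closest to `query` on its observed columns.
    2. Treat the neighbours' values in `missing_col` as kernel centres.
    3. Sample from the resulting Gaussian KDE: pick a centre uniformly at
       random, then add Gaussian noise with standard deviation `bandwidth`.
    """
    rng = random.Random(seed)
    observed = [j for j, v in enumerate(query)
                if j != missing_col and v is not None]

    def distance(row):
        return math.sqrt(sum((row[j] - query[j]) ** 2 for j in observed))

    neighbours = sorted(rows, key=distance)[:k]
    centres = [row[missing_col] for row in neighbours]
    return [rng.gauss(rng.choice(centres), bandwidth) for _ in range(n_draws)]


# Toy bimodal data: for x close to 0 the second column is either -1 or +1.
rows = [(0.0, -1.0), (0.1, 1.0), (0.05, -1.0), (0.02, 1.0), (5.0, 10.0)]
draws = knn_kde_impute(rows, (0.03, None), missing_col=1, k=4, bandwidth=0.05)
print(draws)
```

On this toy data the draws stay close to one of the two modes, whereas averaging the neighbors would return a value near 0 that never occurs in the data, which is the kind of artifact the abstract attributes to error-minimizing imputation.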

arXiv.org e-Print Archive

AdaCat: Adaptive Categorical Discretization for Autoregressive Models

Authors: Abbeel Pieter; Jain Ajay; Li Qiyang
Publication date: 03/08/2022

Autoregressive generative models can estimate complex continuous data distributions, like trajectory rollouts in an RL environment, image intensities, and audio. Most state-of-the-art models discretize continuous data into several bins and use categorical distributions over the bins to approximate the continuous data distribution. The advantage is that categorical distributions can easily express multiple modes and are straightforward to optimize. However, such an approximation cannot express sharp changes in density without using significantly more bins, making it parameter inefficient. We propose an efficient, expressive, multimodal parameterization called Adaptive Categorical Discretization (AdaCat). AdaCat discretizes each dimension of an autoregressive model adaptively, which allows the model to allocate density to fine intervals of interest, improving parameter efficiency. AdaCat generalizes both categoricals and quantile-based regression. AdaCat is a simple add-on to any discretization-based distribution estimator. In experiments, AdaCat improves density estimation for real-world tabular data, images, audio, and trajectories, and improves planning in model-based offline RL.

Comment: Uncertainty in Artificial Intelligence (UAI) 2022, 13 pages, 4 figures
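The contrast between fixed and adaptive bins can be sketched with a piecewise-uniform density on [0, 1) whose bin widths and bin masses are both free. This is an illustrative reading of the abstract, not the AdaCat implementation; the function `adaptive_density` and the example numbers are assumptions.

```python
from bisect import bisect_right
from itertools import accumulate


def adaptive_density(x, widths, masses):
    """Density at x in [0, 1) of a piecewise-uniform distribution.

    `widths` are the bin widths (summing to 1) and `masses` the bin
    probabilities (summing to 1). Inside bin i the density is constant
    and equal to masses[i] / widths[i].
    """
    right_edges = list(accumulate(widths))
    i = min(bisect_right(right_edges, x), len(widths) - 1)
    return masses[i] / widths[i]


# Equal widths: an ordinary categorical distribution over uniform bins.
uniform_bins = adaptive_density(0.1, [0.25] * 4, [0.1, 0.2, 0.3, 0.4])

# Adaptive widths: a narrow middle bin captures a sharp peak using only
# three bins in total.
sharp_peak = adaptive_density(0.5, [0.45, 0.1, 0.45], [0.1, 0.8, 0.1])
print(uniform_bins, sharp_peak)
```

With equal widths the sketch reduces to a categorical distribution over uniform bins, and with equal masses the bin edges act like quantiles, which gives one way to read the abstract's statement that the method generalizes both.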

arXiv.org e-Print Archive

Conditional Density Models Integrating Fuzzy and Probabilistic Representations of Uncertainty

Author: Almeida e Santos Nogueira R.J. (Rui Jorge)
Publication date: 26/06/2014

Conditional density estimation is an important problem in a variety of areas such as system identification, machine learning, artificial intelligence, empirical economics, macroeconomic analysis, quantitative finance and risk management. This work considers the general problem of conditional density estimation, i.e., estimating and predicting the density of a response variable as a function of covariates. The semi-parametric models proposed and developed in this work combine fuzzy and probabilistic representations of uncertainty, while making very few assumptions regarding the functional form of the response variable's density or changes of the functional form across the space of covariates. These models possess sufficient generalization power to approximate a non-standard density and the ability to describe the underlying process using simple linguistic descriptors despite the complexity and possible non-linearity of this process. These novel models are applied to real world quantitative finance and risk management problems by analyzing financial time-series data containing non-trivial statistical properties, such as fat tails, asymmetric distributions and changing variation over time.

EUR Research Repository

Erasmus University Digital Repository

Estimation of density log and sonic log using artificial intelligence: an example from the Perth Basin, Australia

Authors: Adhari Muhammad Ridha; Kardawi Muhammad Yusuf
Publication venue: UIR Press
Publication date: 15/12/2022

It is well understood that with a large amount of data, an excellent interpretation of the subsurface condition can be produced, and our understanding of subsurface conditions can be improved significantly. However, abundant subsurface geological and petrophysical data are sometimes unavailable, mainly due to budget constraints. This situation can cause problems during hydrocarbon exploration and/or development activities. In this paper, the authors apply artificial intelligence (AI) techniques to estimate the values of particular wireline log data from the available petrophysical data. Two techniques were selected: artificial neural network (ANN) and multiple linear regression (MLR). This research aims to advance our understanding of AI and its application in geology. The study has three objectives: (1) to estimate sonic log (DT) and density log (RhoB) using different types of AI (ANN and MLR); (2) to assess the best AI technique for estimating certain wireline log data; and (3) to compare the estimated wireline log values with the real, recorded values from the subsurface. Findings from this study show that ANN consistently provided better accuracy than MLR when estimating density log (RhoB). Estimation of sonic log (DT) produced different accuracy levels depending on the data set and technique used. Moreover, crossplot validation shows that the ANN results produced higher trendline reliability (R²) and correlation coefficient (R) than the MLR results. Comparison of the estimated RhoB and DT log data with the original recorded data shows only minor mismatch. This is evidence that AI techniques can be a reliable solution for estimating particular wireline log data when the original recorded subsurface petrophysical data are limited.
It is expected that these findings will provide new insights into the application of AI in geology and encourage readers to explore and expand its many possible applications in geology.
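The crossplot validation mentioned in this abstract relies on the coefficient of determination (R²). As a hedged illustration only, the sketch below fits a least-squares line with a single predictor and computes R² for it. The helper names and the numbers are made up for the example and are not taken from the study, which used multiple predictors and real well-log data.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Made-up values standing in for a predictor log and a recorded log.
predictor = [1.0, 2.0, 3.0, 4.0, 5.0]
recorded = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(predictor, recorded)
estimated = [slope * v + intercept for v in predictor]
print(r_squared(recorded, estimated))
```

An R² close to 1 means the estimated values track the recorded ones closely along the fitted trendline, which is the sense in which the abstract compares the two techniques.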

Journal of Geoscience, Engineering, Environment, and Technology

e-Journal UIR (Journal Universitas Islam Riau)