Conjugate Bayes for probit regression via unified skew-normal distributions
Regression models for dichotomous data are ubiquitous in statistics. Besides
being useful for inference on binary responses, these methods also serve as
building blocks in more complex formulations, such as density regression,
nonparametric classification and graphical models. Within the Bayesian
framework, inference proceeds by updating the priors for the coefficients,
typically set to be Gaussians, with the likelihood induced by probit or logit
regressions for the responses. In this updating, the apparent absence of a
tractable posterior has motivated a variety of computational methods, including
Markov chain Monte Carlo routines and algorithms that approximate the
posterior. Despite being routinely implemented, Markov chain Monte Carlo
strategies face mixing or time-inefficiency issues in large p and small n
studies, whereas approximate routines fail to capture the skewness typically
observed in the posterior. This article proves that the posterior distribution
for the probit coefficients has a unified skew-normal kernel, under Gaussian
priors. This novel result enables efficient Bayesian inference for a wide
class of applications, especially in large p and small-to-moderate n studies
where state-of-the-art computational methods face notable issues. These
advances are outlined in a genetic study, and further motivate the development
of a wider class of conjugate priors for probit models along with methods to
obtain independent and identically distributed samples from the unified
skew-normal posterior.
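For context, the standard MCMC baseline that this abstract contrasts with is the Albert-Chib data-augmentation Gibbs sampler for probit regression. A minimal sketch follows, assuming a N(0, tau2 * I) Gaussian prior and illustrative variable names; this is the classical routine, not the paper's unified skew-normal sampler:

```python
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, tau2=10.0, n_iter=500, seed=0):
    """Albert-Chib Gibbs sampler for Bayesian probit regression
    with a N(0, tau2 * I) prior on the coefficients."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Posterior covariance of beta given the latent utilities z.
    V = np.linalg.inv(X.T @ X + np.eye(p) / tau2)
    L = np.linalg.cholesky(V)
    beta = np.zeros(p)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        mu = X @ beta
        # z_i | beta, y_i is N(mu_i, 1) truncated to be positive when
        # y_i = 1 and negative otherwise (bounds in standardized units).
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # beta | z is multivariate Gaussian with mean V X'z, covariance V.
        m = V @ (X.T @ z)
        beta = m + L @ rng.standard_normal(p)
        draws[t] = beta
    return draws
```

As the abstract notes, samplers of this kind can mix poorly when p is large relative to n, which is precisely the regime the unified skew-normal result targets.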
Discriminative Nonparametric Latent Feature Relational Models with Data Augmentation
We present a discriminative nonparametric latent feature relational model
(LFRM) for link prediction to automatically infer the dimensionality of latent
features. Under the generic RegBayes (regularized Bayesian inference)
framework, we handily incorporate the prediction loss with probabilistic
inference of a Bayesian model; set distinct regularization parameters for
different types of links to handle the imbalance issue in real networks; and
unify the analysis of both the smooth logistic log-loss and the piecewise
linear hinge loss. For the nonconjugate posterior inference, we present a
simple Gibbs sampler via data augmentation, without the restrictive
assumptions made in variational methods. We further develop an approximate
sampler using stochastic gradient Langevin dynamics to handle large networks
with hundreds of thousands of entities and millions of links, orders of
magnitude larger than what existing LFRM models can process. Extensive studies
on various real networks show promising performance.
Comment: Accepted by AAAI 201
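The large-network sampler mentioned above builds on stochastic gradient Langevin dynamics. The generic SGLD update (Welling and Teh, 2011) can be sketched on a toy Gaussian target; this is a minimal illustration of the update rule, not the paper's LFRM sampler:

```python
import numpy as np

def sgld_step(theta, grad_log_post, eps, rng):
    """One stochastic gradient Langevin dynamics update:
    theta <- theta + (eps/2) * grad log posterior + N(0, eps) noise."""
    noise = rng.normal(scale=np.sqrt(eps), size=theta.shape)
    return theta + 0.5 * eps * grad_log_post(theta) + noise

# Toy target: posterior N(2, 1), so grad log p(theta) = -(theta - 2).
rng = np.random.default_rng(0)
theta = np.array(0.0)
samples = []
for t in range(20000):
    theta = sgld_step(theta, lambda th: -(th - 2.0), eps=0.05, rng=rng)
    samples.append(float(theta))
# Discard burn-in; the sample mean should approach the target mean of 2.
est = sum(samples[5000:]) / len(samples[5000:])
```

In practice the gradient is estimated from a minibatch of links and rescaled by the dataset size, which is what makes the approach viable for networks with millions of links.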
Ultimate Pólya Gamma Samplers - Efficient MCMC for Possibly Imbalanced Binary and Categorical Data
Modeling binary and categorical data is one of the most commonly encountered tasks of applied statisticians and econometricians. While Bayesian methods in this context have been available for decades now, they often require a high level of familiarity with Bayesian statistics or suffer from issues such as low sampling efficiency. To contribute to the accessibility of Bayesian models for binary and categorical data, we introduce novel latent variable representations based on Pólya-Gamma random variables for a range of commonly encountered logistic regression models. From these latent variable representations, new Gibbs sampling algorithms for binary, binomial, and multinomial logit models are derived. All models allow for a conditionally Gaussian likelihood representation, rendering extensions to more complex modeling frameworks such as state space models straightforward. However, sampling efficiency may still be an issue in these data augmentation based estimation frameworks. To counteract this, novel marginal data augmentation strategies are developed and discussed in detail. The merits of our approach are illustrated through extensive simulations and real data applications. Supplementary materials for this article are available online.
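The Pólya-Gamma representations referenced here build on the integral identity of Polson, Scott and Windle (2013), stated below as background, with $\psi$ the linear predictor:

```latex
\frac{(e^{\psi})^{a}}{(1+e^{\psi})^{b}}
  = 2^{-b}\, e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega)\, d\omega,
\qquad \kappa = a - \tfrac{b}{2},
```

where $p(\omega)$ is the $\mathrm{PG}(b, 0)$ density. Conditional on $\omega$, the logistic likelihood is Gaussian in $\psi = x^{\top}\beta$, which is what yields the conditionally Gaussian representation and conjugate Gibbs updates described in the abstract.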
Reroute Prediction Service
The cost of delays in the US National Airspace System was estimated at 33
billion US dollars in 2019 alone, a peak value continuing a growth trend over
past years. Aiming to address this huge inefficiency, we designed and developed a
novel Data Analytics and Machine Learning system, which aims at reducing delays
by proactively supporting re-routing decisions.
Given a time interval up to a few days in the future, the system predicts if
a reroute advisory for a certain Air Route Traffic Control Center or for a
certain advisory identifier will be issued, which may impact the pertinent
routes. To deliver such predictions, the system uses historical reroute data,
collected from the System Wide Information Management (SWIM) data services
provided by the FAA, and weather data, provided by the US National Centers for
Environmental Prediction (NCEP). The data is huge in volume, with many items
streamed at high velocity, uncorrelated and noisy. The system continuously
processes the incoming raw data and makes it available for the next step where
an interim data store is created and adaptively maintained for efficient query
processing. The resulting data is fed into an array of ML algorithms, which
compete for higher accuracy. The best performing algorithm is used in the final
prediction, generating the final results. Mean accuracy values higher than 90%
were obtained in our experiments with this system.
Our algorithm divides the area of interest into units of aggregation and uses
time series of the aggregate measures of weather forecast parameters in
each geographical unit to detect correlations with reroutes and where
they will most likely occur. Aiming at practical application, the system is
formed by a number of microservices, which are deployed in the cloud, making
the system distributed, scalable and highly available.Comment: Submitted to the 2023 IEEE/AIAA Digital Aviation Systems Conference
(DASC
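The "array of ML algorithms" competition described above amounts to validation-based model selection: score each candidate on held-out data and let the winner make the final prediction. A minimal, hypothetical sketch (the candidate models and feature are illustrative, not the system's actual algorithms):

```python
# Validation-based model selection: each candidate model is scored on
# held-out data and the best performer makes the final prediction.
def accuracy(model, X, y):
    """Fraction of held-out examples the model labels correctly."""
    return sum(model(x) == label for x, label in zip(X, y)) / len(y)

def select_best(models, X_val, y_val):
    """Return the candidate with the highest validation accuracy."""
    return max(models, key=lambda m: accuracy(m, X_val, y_val))

# Toy candidates over a single (illustrative) weather feature.
always_no = lambda x: 0                     # never predicts a reroute advisory
threshold = lambda x: 1 if x > 0.5 else 0   # predicts one above a threshold

X_val = [0.1, 0.7, 0.9, 0.2]
y_val = [0, 1, 1, 0]
best = select_best([always_no, threshold], X_val, y_val)
```

In the described system the candidates would be full ML pipelines trained on the interim data store, and the competition would be rerun as new SWIM and NCEP data arrive.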
Data Optimization in Deep Learning: A Survey
Large-scale, high-quality data are considered an essential factor for the
successful application of many deep learning techniques. Meanwhile, numerous
real-world deep learning tasks still have to contend with the lack of
sufficient amounts of high-quality data. Additionally, issues such as model
robustness, fairness, and trustworthiness are also closely related to training
data. Consequently, a huge number of studies in the existing literature have
focused on the data aspect in deep learning tasks. Some typical data
optimization techniques include data augmentation, logit perturbation, sample
weighting, and data condensation. These techniques usually come from different
deep learning divisions and their theoretical inspirations or heuristic
motivations may seem unrelated to each other. This study organizes a wide
range of existing data optimization methodologies for deep learning from the
previous literature and constructs a comprehensive taxonomy for them. The
constructed taxonomy considers the diversity of split
dimensions, and deep sub-taxonomies are constructed for each dimension. On the
basis of the taxonomy, connections among the extensive data optimization
methods for deep learning are built in terms of four aspects. We also discuss
several promising future directions. The constructed
taxonomy and the revealed connections will foster a better understanding
of existing methods and the design of novel data optimization techniques.
Furthermore, our aspiration for this survey is to promote data optimization as
an independent subdivision of deep learning. A curated, up-to-date list of
resources related to data optimization in deep learning is available at
\url{https://github.com/YaoRujing/Data-Optimization}
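Of the techniques named in this abstract, sample weighting is the simplest to make concrete: each training example's loss is scaled by a per-sample weight before averaging. A minimal sketch of a sample-weighted cross-entropy, not taken from the survey itself:

```python
import math

def weighted_cross_entropy(probs, labels, weights):
    """Sample-weighted cross-entropy: each example's loss
    -log p(true class) is scaled by its weight, and the total
    is normalized by the sum of the weights."""
    total = sum(w * -math.log(p[y])
                for p, y, w in zip(probs, labels, weights))
    return total / sum(weights)

# Two examples: up-weighting the poorly predicted second example
# increases its influence on the average loss.
probs = [[0.9, 0.1], [0.4, 0.6]]   # predicted class probabilities
labels = [0, 0]                    # true class indices
uniform = weighted_cross_entropy(probs, labels, [1.0, 1.0])
upweighted = weighted_cross_entropy(probs, labels, [1.0, 3.0])
```

Schemes such as focal loss or class-balanced weighting differ only in how the weights are chosen, which is one of the split dimensions such a taxonomy can organize.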
Implicit Counterfactual Data Augmentation for Deep Neural Networks
Machine-learning models are prone to capturing the spurious correlations
between non-causal attributes and classes, with counterfactual data
augmentation being a promising direction for breaking these spurious
associations. However, explicitly generating counterfactual data is
challenging and degrades training efficiency. Therefore, this study
proposes an implicit counterfactual data augmentation (ICDA) method to remove
spurious correlations and make stable predictions. Specifically, first, a novel
sample-wise augmentation strategy is developed that generates semantically and
counterfactually meaningful deep features with distinct augmentation strength
for each sample. Second, we derive an easy-to-compute surrogate loss on the
augmented feature set when the number of augmented samples becomes infinite.
Third, two concrete schemes are proposed, including direct quantification and
meta-learning, to derive the key parameters for the robust loss. In addition,
ICDA is explained from a regularization aspect, with extensive experiments
indicating that our method consistently improves the generalization performance
of popular deep networks in multiple typical learning scenarios that require
out-of-distribution generalization.
Comment: 17 pages, 16 figure
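The closed-form surrogate for an infinite number of augmentations typically rests on the Gaussian moment generating function. A sketch of the key step, assuming feature perturbations $\delta \sim \mathcal{N}(0, \lambda \Sigma)$ (the paper's exact loss may differ):

```latex
\mathbb{E}_{\delta \sim \mathcal{N}(0,\lambda\Sigma)}
  \left[ e^{w^{\top}(a + \delta)} \right]
  = e^{\,w^{\top} a + \frac{\lambda}{2}\, w^{\top} \Sigma\, w},
```

so applying Jensen's inequality to the expected cross-entropy over the augmented feature distribution yields a log-sum-exp upper bound that can be evaluated directly, without sampling any augmented features.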
Variational Imbalanced Regression: Fair Uncertainty Quantification via Probabilistic Smoothing
Existing regression models tend to fall short in both accuracy and
uncertainty estimation when the label distribution is imbalanced. In this
paper, we propose a probabilistic deep learning model, dubbed variational
imbalanced regression (VIR), which not only performs well in imbalanced
regression but naturally produces reasonable uncertainty estimation as a
byproduct. Different from typical variational autoencoders assuming i.i.d.
representations (a data point's representation is not directly affected by
other data points), our VIR borrows data with similar regression labels to
compute the latent representation's variational distribution; furthermore,
different from deterministic regression models producing point estimates, VIR
predicts entire normal-inverse-gamma distributions and modulates the
associated conjugate distributions to impose probabilistic reweighting on the
imbalanced data, thereby providing better uncertainty estimation. Experiments
on several real-world datasets show that our VIR can outperform
state-of-the-art imbalanced regression models in terms of both accuracy and
uncertainty estimation. Code will soon be available at
https://github.com/Wang-ML-Lab/variational-imbalanced-regression.
Comment: Accepted at NeurIPS 202
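For readers unfamiliar with the normal-inverse-gamma parameterization, the usual uncertainty decomposition from the evidential-regression literature (VIR's exact modulation may differ) can be sketched as:

```python
def nig_uncertainty(gamma, nu, alpha, beta):
    """Predictive summaries of a normal-inverse-gamma posterior
    NIG(gamma, nu, alpha, beta), as used in evidential regression:
    point prediction E[mu] = gamma,
    aleatoric uncertainty E[sigma^2] = beta / (alpha - 1),
    epistemic uncertainty Var[mu] = beta / (nu * (alpha - 1)).
    The second moments require alpha > 1."""
    assert alpha > 1, "moments undefined for alpha <= 1"
    aleatoric = beta / (alpha - 1)
    epistemic = beta / (nu * (alpha - 1))
    return gamma, aleatoric, epistemic

pred, alea, epis = nig_uncertainty(gamma=1.5, nu=2.0, alpha=3.0, beta=4.0)
```

Reweighting the imbalanced data then amounts to modulating these conjugate parameters per label region, so rare labels receive wider, better-calibrated predictive distributions.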