
    Conjugate Bayes for probit regression via unified skew-normal distributions

    Regression models for dichotomous data are ubiquitous in statistics. Besides being useful for inference on binary responses, these methods also serve as building blocks in more complex formulations, such as density regression, nonparametric classification and graphical models. Within the Bayesian framework, inference proceeds by updating the priors for the coefficients, typically set to be Gaussians, with the likelihood induced by probit or logit regressions for the responses. In this updating, the apparent absence of a tractable posterior has motivated a variety of computational methods, including Markov chain Monte Carlo routines and algorithms which approximate the posterior. Despite being routinely implemented, Markov chain Monte Carlo strategies face mixing or time-inefficiency issues in large-p, small-n studies, whereas approximate routines fail to capture the skewness typically observed in the posterior. This article proves that, under Gaussian priors, the posterior distribution for the probit coefficients has a unified skew-normal kernel. This novel result allows efficient Bayesian inference for a wide class of applications, especially in large-p and small-to-moderate-n studies where state-of-the-art computational methods face notable issues. These advances are outlined in a genetic study, and further motivate the development of a wider class of conjugate priors for probit models, along with methods to obtain independent and identically distributed samples from the unified skew-normal posterior.
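
    For context, the following is a minimal sketch of the classical Albert–Chib data-augmentation Gibbs sampler for Bayesian probit regression, i.e., the Markov chain Monte Carlo baseline the abstract contrasts with its closed-form unified skew-normal result. The Gaussian prior and interface below are illustrative assumptions, not the paper's code.

```python
# Minimal sketch: Albert-Chib Gibbs sampler for Bayesian probit regression
# with an assumed N(0, tau2 * I) prior on beta.
import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(X, y, n_iter=2000, tau2=10.0, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    prior_prec = np.eye(p) / tau2
    V = np.linalg.inv(prior_prec + X.T @ X)   # posterior covariance of beta | z
    L = np.linalg.cholesky(V)
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # 1. Sample latent utilities z_i ~ N(x_i' beta, 1), truncated by y_i:
        #    z_i > 0 when y_i = 1, z_i < 0 when y_i = 0.
        mu = X @ beta
        lo = np.where(y == 1, -mu, -np.inf)
        hi = np.where(y == 1, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, random_state=rng)
        # 2. Sample beta | z from its Gaussian full conditional.
        beta = V @ (X.T @ z) + L @ rng.standard_normal(p)
        draws[t] = beta
    return draws
```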

    Discriminative Nonparametric Latent Feature Relational Models with Data Augmentation

    We present a discriminative nonparametric latent feature relational model (LFRM) for link prediction that automatically infers the dimensionality of the latent features. Under the generic RegBayes (regularized Bayesian inference) framework, we handily incorporate the prediction loss into the probabilistic inference of a Bayesian model; set distinct regularization parameters for different types of links to handle the imbalance issue in real networks; and unify the analysis of both the smooth logistic log-loss and the piecewise-linear hinge loss. For the nonconjugate posterior inference, we present a simple Gibbs sampler via data augmentation, without making the restrictive assumptions of variational methods. We further develop an approximate sampler using stochastic gradient Langevin dynamics to handle large networks with hundreds of thousands of entities and millions of links, orders of magnitude larger than what existing LFRM models can process. Extensive studies on various real networks show promising performance. Comment: Accepted by AAAI 201
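
    The approximate sampler mentioned above is stochastic gradient Langevin dynamics; below is a generic, hedged sketch of SGLD (Welling and Teh, 2011), where grad_log_post is a hypothetical user-supplied unbiased minibatch estimate of the log-posterior gradient, not the paper's implementation.

```python
# Generic SGLD sketch: a gradient step on the log posterior plus injected
# Gaussian noise, with a polynomially decaying step size.
import numpy as np

def sgld(theta0, grad_log_post, data, n_steps=10_000, eps0=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    samples = []
    for t in range(1, n_steps + 1):
        eps = eps0 / t**0.55                      # decaying step size
        idx = rng.choice(len(data), size=min(256, len(data)), replace=False)
        g = grad_log_post(theta, data[idx])       # unbiased minibatch gradient
        theta = theta + 0.5 * eps * g + np.sqrt(eps) * rng.standard_normal(theta.shape)
        samples.append(theta.copy())
    return np.array(samples)
```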

    Ultimate Pólya Gamma Samplers – Efficient MCMC for Possibly Imbalanced Binary and Categorical Data

    Modeling binary and categorical data is one of the most commonly encountered tasks of applied statisticians and econometricians. While Bayesian methods in this context have been available for decades, they often require a high level of familiarity with Bayesian statistics or suffer from issues such as low sampling efficiency. To contribute to the accessibility of Bayesian models for binary and categorical data, we introduce novel latent variable representations based on Pólya-Gamma random variables for a range of commonly encountered logistic regression models. From these latent variable representations, new Gibbs sampling algorithms for binary, binomial, and multinomial logit models are derived. All models allow for a conditionally Gaussian likelihood representation, rendering extensions to more complex modeling frameworks such as state space models straightforward. However, sampling efficiency may still be an issue in these data-augmentation-based estimation frameworks. To counteract this, novel marginal data augmentation strategies are developed and discussed in detail. The merits of our approach are illustrated through extensive simulations and real data applications. Supplementary materials for this article are available online.
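
    To illustrate the kind of latent variable representation the abstract builds on, here is a minimal sketch of the standard Pólya-Gamma data-augmentation Gibbs sampler for a binary logit model (Polson, Scott and Windle, 2013). It assumes a Gaussian prior and the third-party polyagamma package, and omits the paper's marginal data augmentation and efficiency refinements.

```python
# Pólya-Gamma Gibbs sampler for binary logistic regression with an assumed
# N(0, tau2 * I) prior on beta. Requires: pip install polyagamma
import numpy as np
from polyagamma import random_polyagamma

def logit_gibbs(X, y, n_iter=2000, tau2=10.0, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    beta = np.zeros(p)
    B_inv = np.eye(p) / tau2
    kappa = y - 0.5                       # kappa_i = y_i - 1/2
    draws = np.empty((n_iter, p))
    for t in range(n_iter):
        # 1. omega_i | beta ~ PG(1, x_i' beta)
        omega = random_polyagamma(1, X @ beta, random_state=rng)
        # 2. beta | omega is Gaussian: V = (X' Omega X + B^-1)^-1, m = V X' kappa
        V = np.linalg.inv(X.T @ (X * omega[:, None]) + B_inv)
        m = V @ (X.T @ kappa)
        beta = rng.multivariate_normal(m, V)
        draws[t] = beta
    return draws
```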

    Reroute Prediction Service

    The cost of delays in the US National Airspace System was estimated at 33 billion US dollars in 2019 alone, a peak value following a growth trend in previous years. Aiming to address this huge inefficiency, we designed and developed a novel data analytics and machine learning system that aims to reduce delays by proactively supporting rerouting decisions. Given a time interval up to a few days in the future, the system predicts whether a reroute advisory for a certain Air Route Traffic Control Center or for a certain advisory identifier will be issued, which may impact the pertinent routes. To deliver such predictions, the system uses historical reroute data, collected from the System Wide Information Management (SWIM) data services provided by the FAA, and weather data, provided by the US National Centers for Environmental Prediction (NCEP). The data is huge in volume and streamed at high velocity, with many uncorrelated and noisy items. The system continuously processes the incoming raw data and makes it available for the next step, in which an interim data store is created and adaptively maintained for efficient query processing. The resulting data is fed into an array of ML algorithms, which compete for higher accuracy; the best-performing algorithm is used to generate the final predictions. Mean accuracy values higher than 90% were obtained in our experiments with this system. Our algorithm divides the area of interest into units of aggregation and uses time series of aggregate weather-forecast measures in each geographical unit to detect correlations with reroutes and where they will most likely occur. Aiming at practical application, the system is formed by a number of microservices deployed in the cloud, making the system distributed, scalable and highly available. Comment: Submitted to the 2023 IEEE/AIAA Digital Avionics Systems Conference (DASC).
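
    As a rough illustration of the spatio-temporal aggregation step described above, the hypothetical sketch below bins weather records into geographic grid cells and hourly windows and pivots them into per-cell time series. All column names and granularities are assumptions, not the system's actual schema.

```python
# Hypothetical aggregation of weather records into per-grid-cell hourly
# time series, usable as model features.
import pandas as pd

def aggregate_weather(df, lat_step=1.0, lon_step=1.0):
    # df columns assumed: 'time' (datetime), 'lat', 'lon', 'wind_speed'
    df = df.copy()
    df["cell"] = (
        (df["lat"] // lat_step).astype(int).astype(str)
        + "_"
        + (df["lon"] // lon_step).astype(int).astype(str)
    )
    df["hour"] = df["time"].dt.floor("h")
    # Mean wind speed per (hour, cell); one column per geographic unit.
    return df.pivot_table(index="hour", columns="cell",
                          values="wind_speed", aggfunc="mean")
```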

    Data Optimization in Deep Learning: A Survey

    Large-scale, high-quality data are considered an essential factor for the successful application of many deep learning techniques. Meanwhile, numerous real-world deep learning tasks still have to contend with the lack of sufficient amounts of high-quality data. Additionally, issues such as model robustness, fairness, and trustworthiness are also closely related to the training data. Consequently, a huge number of studies in the existing literature have focused on the data aspect of deep learning tasks. Typical data optimization techniques include data augmentation, logit perturbation, sample weighting, and data condensation. These techniques usually come from different divisions of deep learning, and their theoretical inspirations or heuristic motivations may seem unrelated to each other. This study organizes a wide range of existing data optimization methodologies for deep learning from the previous literature and constructs a comprehensive taxonomy for them. The taxonomy considers the diversity of split dimensions, and deep sub-taxonomies are constructed for each dimension. On the basis of the taxonomy, connections among the extensive data optimization methods for deep learning are built in terms of four aspects. We also discuss several promising and interesting future directions. The constructed taxonomy and the revealed connections will aid the understanding of existing methods and the design of novel data optimization techniques. Furthermore, our aspiration for this survey is to promote data optimization as an independent subdivision of deep learning. A curated, up-to-date list of resources related to data optimization in deep learning is available at \url{https://github.com/YaoRujing/Data-Optimization}.
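
    As a concrete taste of one technique family the survey lists, here is a tiny, generic sketch of sample weighting: a cross-entropy loss reweighted by inverse class frequency. It is illustrative only and taken from no particular surveyed paper.

```python
# Inverse-class-frequency sample weighting for a cross-entropy loss.
import numpy as np

def weighted_cross_entropy(probs, labels):
    # probs: (n, k) predicted class probabilities; labels: (n,) int class ids
    counts = np.bincount(labels, minlength=probs.shape[1]).astype(float)
    w = counts.sum() / np.maximum(counts, 1.0)   # inverse-frequency weights
    w = w / w.mean()                             # normalize around 1
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(w[labels] * nll))
```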

    Implicit Counterfactual Data Augmentation for Deep Neural Networks

    Machine-learning models are prone to capturing spurious correlations between non-causal attributes and classes, and counterfactual data augmentation is a promising direction for breaking these spurious associations. However, explicitly generating counterfactual data is challenging and degrades training efficiency. Therefore, this study proposes an implicit counterfactual data augmentation (ICDA) method to remove spurious correlations and make stable predictions. Specifically, first, a novel sample-wise augmentation strategy is developed that generates semantically and counterfactually meaningful deep features, with distinct augmentation strength for each sample. Second, we derive an easy-to-compute surrogate loss on the augmented feature set as the number of augmented samples becomes infinite. Third, two concrete schemes are proposed, direct quantification and meta-learning, to derive the key parameters of the robust loss. In addition, ICDA is explained from a regularization perspective, with extensive experiments indicating that our method consistently improves the generalization performance of popular deep networks on multiple typical learning scenarios that require out-of-distribution generalization. Comment: 17 pages, 16 figures
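
    The infinite-augmentation surrogate loss described above follows the spirit of implicit semantic data augmentation. The hedged sketch below shows an ISDA-style closed-form upper bound on the expected cross-entropy under Gaussian feature perturbations; ICDA's exact loss may differ.

```python
# ISDA-style surrogate cross-entropy: instead of sampling augmented features
# a' ~ N(a, lam * Sigma), adjust the logits with a second-order correction
# and take the resulting log-sum-exp bound.
import numpy as np

def surrogate_ce(a, y, W, b, Sigma, lam):
    # a: (d,) deep feature; y: true class id; W: (k, d) classifier weights;
    # b: (k,) biases; Sigma: (d, d) feature covariance; lam >= 0 strength.
    logits = W @ a + b
    diff = W - W[y]                              # w_j - w_y for every class j
    # Correction from integrating out the Gaussian augmentation.
    corr = 0.5 * lam * np.einsum("kd,de,ke->k", diff, Sigma, diff)
    z = (logits + corr) - (logits[y] + corr[y])  # corr[y] == 0; kept for clarity
    m = z.max()
    return float(m + np.log(np.exp(z - m).sum()))  # -log softmax_y(adjusted logits)
```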

    Variational Imbalanced Regression: Fair Uncertainty Quantification via Probabilistic Smoothing

    Existing regression models tend to fall short in both accuracy and uncertainty estimation when the label distribution is imbalanced. In this paper, we propose a probabilistic deep learning model, dubbed variational imbalanced regression (VIR), which not only performs well on imbalanced regression but also naturally produces reasonable uncertainty estimation as a byproduct. Different from typical variational autoencoders, which assume i.i.d. representations (a data point's representation is not directly affected by other data points), our VIR borrows data with similar regression labels to compute the latent representation's variational distribution; furthermore, different from deterministic regression models producing point estimates, VIR predicts entire normal-inverse-gamma distributions and modulates the associated conjugate distributions to impose probabilistic reweighting on the imbalanced data, thereby providing better uncertainty estimation. Experiments on several real-world datasets show that our VIR can outperform state-of-the-art imbalanced regression models in terms of both accuracy and uncertainty estimation. Code will soon be available at https://github.com/Wang-ML-Lab/variational-imbalanced-regression. Comment: Accepted at NeurIPS 202
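
    For illustration, the sketch below computes the marginal negative log-likelihood of a normal-inverse-gamma predictive head, as in deep evidential regression (Amini et al., 2020); VIR's probabilistic reweighting of the conjugate distributions is not reproduced here.

```python
# Marginal (Student-t) negative log-likelihood of a normal-inverse-gamma
# output: a network predicts (gamma, nu, alpha, beta) per target.
import numpy as np
from scipy.special import gammaln

def nig_nll(y, gamma, nu, alpha, beta):
    # NIG parameters: gamma (mean), nu > 0, alpha > 1, beta > 0.
    omega = 2.0 * beta * (1.0 + nu)
    return (0.5 * np.log(np.pi / nu)
            - alpha * np.log(omega)
            + (alpha + 0.5) * np.log(nu * (y - gamma) ** 2 + omega)
            + gammaln(alpha) - gammaln(alpha + 0.5))
```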