
    A predictive deviance criterion for selecting a generative model in semi-supervised classification

    Semi-supervised classification can improve generative classifiers by exploiting the information carried by unlabeled data points, especially when unlabeled data far outnumber labeled data. This paper is concerned with selecting a generative classification model from both unlabeled and labeled data. We propose a predictive deviance criterion, AIC_cond, aiming to select a parsimonious and relevant generative classifier in the semi-supervised context. Contrary to standard information criteria such as AIC and BIC, AIC_cond focuses on the classification task, since it aims to measure the predictive power of a generative model by approximating its predictive deviance. Moreover, it avoids the computational burden of cross-validation criteria, which require repeated runs of the EM algorithm. AIC_cond is proved to have consistency properties ensuring its parsimony compared to the Bayesian Entropy Criterion (BEC), which has a similar focus. In addition, numerical experiments on both simulated and real data sets highlight an encouraging behavior of AIC_cond for variable and model selection in comparison to the other criteria mentioned.
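
    Since the abstract does not spell out the penalty term of AIC_cond, the base-R sketch below only illustrates the quantity the criterion is built around: the conditional deviance -2 * sum(log p(y_i | x_i)) of a simple two-class Gaussian generative classifier, evaluated on held-out labeled data. The helper names (fit, cond_deviance) are hypothetical.

        ## hypothetical sketch: conditional deviance of a generative classifier
        set.seed(1)
        n <- 200
        y <- rbinom(n, 1, 0.5)
        x <- rnorm(n, mean = ifelse(y == 1, 2, 0))         # 1-D Gaussian classes
        fit <- function(x, y) {                             # per-class estimates
          list(p1 = mean(y), mu = tapply(x, y, mean), sd = tapply(x, y, sd))
        }
        cond_deviance <- function(m, x, y) {
          f0 <- (1 - m$p1) * dnorm(x, m$mu["0"], m$sd["0"]) # joint density, class 0
          f1 <- m$p1       * dnorm(x, m$mu["1"], m$sd["1"]) # joint density, class 1
          py <- ifelse(y == 1, f1, f0) / (f0 + f1)          # p(y | x) by Bayes rule
          -2 * sum(log(py))
        }
        m <- fit(x[1:100], y[1:100])                        # train on labeled half
        cond_deviance(m, x[101:200], y[101:200])            # score on held-out half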

    The Importance of Being Clustered: Uncluttering the Trends of Statistics from 1970 to 2015

    In this paper we retrace the recent history of statistics by analyzing all the papers published since 1970 in five prestigious statistical journals: Annals of Statistics, Biometrika, Journal of the American Statistical Association, Journal of the Royal Statistical Society, Series B, and Statistical Science. The aim is to construct a kind of "taxonomy" of statistical papers by organizing and clustering them into main themes. In this sense, being identified in a cluster means being important enough to be uncluttered in the vast and interconnected world of statistical research. Since the main statistical research topics naturally arise, evolve, or die over time, we also develop a dynamic clustering strategy, where a group in one time period is allowed to migrate or to merge into different groups in the following one. Results show that statistics is a very dynamic and evolving science, stimulated by the rise of new research questions and types of data.
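
    The abstract does not describe the dynamic clustering strategy in detail; the toy R sketch below (our illustration, not the authors' method) shows one way such a step could work: cluster documents within each time period, then link each cluster to the most similar cluster of the next period, so themes can migrate or merge.

        set.seed(5)
        docs <- matrix(runif(40 * 10), 40, 10)              # toy document-term rows
        period <- rep(1:2, each = 20)
        k1 <- kmeans(docs[period == 1, ], centers = 3)      # themes in period 1
        k2 <- kmeans(docs[period == 2, ], centers = 3)      # themes in period 2
        ## map each period-1 theme to its nearest period-2 theme; merges are
        ## allowed because several themes may point to the same successor
        link <- apply(k1$centers, 1, function(c1)
          which.min(colSums((t(k2$centers) - c1)^2)))
        link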

    Domain-Adversarial Training of Neural Networks

    We introduce a new representation learning approach for domain adaptation, in which data at training and test time come from similar but different distributions. Our approach is directly inspired by the theory of domain adaptation, which suggests that, for effective domain transfer to be achieved, predictions must be made based on features that cannot discriminate between the training (source) and test (target) domains. The approach implements this idea in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain (no labeled target-domain data is necessary). As training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. We show that this adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent, and can thus be implemented with little effort using any of the deep learning packages. We demonstrate the success of our approach for two distinct classification problems (document sentiment analysis and image classification), where state-of-the-art domain adaptation performance on standard benchmarks is achieved. We also validate the approach for a descriptor learning task in the context of a person re-identification application. Published in JMLR: http://jmlr.org/papers/v17/15-239.htm
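
    The gradient reversal layer is the heart of the approach: in the forward pass it is the identity, and in the backward pass it multiplies the domain classifier's gradient by -lambda before it reaches the feature extractor. The base-R sketch below (a minimal hand-written illustration, not the authors' implementation) makes that sign flip explicit with a linear feature extractor W and two logistic heads.

        sigmoid <- function(z) 1 / (1 + exp(-z))
        set.seed(2)
        d <- 5; n <- 100; lr <- 0.1; lambda <- 1
        Xs <- matrix(rnorm(n * d), n, d)                    # labeled source inputs
        ys <- rbinom(n, 1, sigmoid(Xs %*% rep(1, d)))       # source labels
        Xt <- matrix(rnorm(n * d, mean = 0.5), n, d)        # shifted, unlabeled target
        W <- diag(d); v <- rnorm(d); u <- rnorm(d)          # extractor + two heads
        for (step in 1:200) {
          Hs <- Xs %*% W; Ht <- Xt %*% W                    # extracted features
          py <- sigmoid(Hs %*% v)                           # label head (source only)
          pd <- sigmoid(rbind(Hs, Ht) %*% u)                # domain head (both)
          es <- py - ys                                     # label-loss errors
          ds <- pd[1:n, , drop = FALSE] - 0                 # domain errors, source = 0
          dt <- pd[(n + 1):(2 * n), , drop = FALSE] - 1     # domain errors, target = 1
          gv  <- t(Hs) %*% es / n                           # label-loss gradient in v
          gu  <- (t(Hs) %*% ds + t(Ht) %*% dt) / (2 * n)    # domain-loss gradient in u
          gWy <- t(Xs) %*% (es %*% t(v)) / n                # label gradient reaching W
          gWd <- (t(Xs) %*% (ds %*% t(u)) +
                  t(Xt) %*% (dt %*% t(u))) / (2 * n)        # domain gradient reaching W
          v <- v - lr * gv                                  # each head minimizes
          u <- u - lr * gu                                  #   its own loss
          W <- W - lr * (gWy - lambda * gWd)                # reversal: the minus sign
        }                                                   #   flips the domain gradient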

    Rmixmod: The R Package of the Model-Based Unsupervised, Supervised and Semi-Supervised Classification Mixmod Library

    Mixmod is a well-established software package for fitting mixture models of multivariate Gaussian or multinomial probability distribution functions to a given data set, for clustering, density estimation, or discriminant analysis purposes. The Rmixmod S4 package provides a bridge between the C++ core library of Mixmod (mixmodLib) and the R statistical computing environment. In this article, we give an overview of model-based clustering and classification methods, and we show how the R package Rmixmod can be used for clustering and discriminant analysis.
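
    A minimal usage sketch along the lines described above; mixmodCluster, mixmodLearn, and mixmodPredict are the package's documented entry points, but exact argument names and label encodings should be checked against the package help pages.

        library(Rmixmod)
        x <- iris[, 1:4]
        ## clustering: fit Gaussian mixtures with 2 to 5 components, keep the best
        out <- mixmodCluster(x, nbCluster = 2:5)
        summary(out)
        ## discriminant analysis: learn a rule on labeled data, then predict
        learned <- mixmodLearn(x, knownLabels = iris$Species)
        pred    <- mixmodPredict(x, classificationRule = learned["bestResult"])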

    A Parsimonious Tour of Bayesian Model Uncertainty

    Modern statistical software and machine learning libraries are enabling semi-automated statistical inference. Within this context, it appears easier and easier to try and fit many models to the data at hand, thereby reversing the Fisherian way of conducting science, in which data are collected after the scientific hypothesis (and hence the model) has been determined. The renewed goal of the statistician becomes helping the practitioner choose within such large and heterogeneous families of models, a task known as model selection. The Bayesian paradigm offers a systematized way of addressing this problem. This approach, launched by Harold Jeffreys in his 1939 book Theory of Probability, has witnessed a remarkable evolution in recent decades that has brought about several new theoretical and methodological advances. Some of these recent developments are the focus of this survey, which tries to present a unifying perspective on work carried out by different communities. In particular, we focus on non-asymptotic out-of-sample performance of Bayesian model selection and averaging techniques, and draw connections with penalized maximum likelihood. We also describe recent extensions to wider classes of probabilistic frameworks, including high-dimensional, unidentifiable, or likelihood-free models.
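
    As a toy illustration of the connection with penalized maximum likelihood (our example, not taken from the survey): BIC can stand in for -2 times the log marginal likelihood, which yields approximate posterior model probabilities under a uniform prior over models.

        set.seed(3)
        n <- 100
        x <- rnorm(n); y <- 1 + 2 * x + rnorm(n)
        m0 <- lm(y ~ 1); m1 <- lm(y ~ x)            # two nested candidate models
        ## BIC approximates -2 * log p(data | model); lower is better
        bic <- c(M0 = BIC(m0), M1 = BIC(m1))
        ## approximate posterior model probabilities, uniform model prior
        post <- exp(-bic / 2) / sum(exp(-bic / 2))
        post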

    Improving neural networks for geospatial applications with geographic context embeddings

    Geospatial data sits at the core of many data-driven application domains, from urban analytics to spatial epidemiology and climate science. Over recent years, ever-growing streams of data have allowed us to quantify more and more aspects of our lives and to deploy machine learning techniques to improve public and private services. But while modern neural network methods offer a flexible and scalable toolkit for high-dimensional data analysis, they can struggle with the complexities and dependencies of real-world geographic data. These particular challenges are the subject of the geographic information sciences (GIS), a discipline that has compiled a myriad of metrics and measures to quantify spatial effects and to improve modeling in the presence of spatial dependencies. In this dissertation, we deploy metrics of spatial interactions as embeddings to enrich neural network methods for geographic data. We utilize both functional embeddings (such as measures of spatial autocorrelation) and parametric neural-network embeddings (such as semantic vector embeddings). The embeddings are then integrated into neural network methods using four different approaches: (1) model selection, (2) auxiliary task learning, (3) feature learning, and (4) embedding loss functions. Throughout the dissertation, we use experiments with various real-world datasets to highlight the performance improvements of our geographically-explicit neural network methods over naive baselines, focusing specifically on generative and predictive modeling tasks. The dissertation highlights how geographic domain expertise, together with powerful neural network backbones, can provide tailored, scalable modeling solutions for the era of real-time Earth observation and urban analytics.
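
    As one concrete example of a functional embedding of the kind mentioned above (a hypothetical base-R sketch, not taken from the dissertation), local Moran's I spatial autocorrelation values can be computed per observation and appended to a network's input features.

        set.seed(4)
        n <- 50
        coords <- matrix(runif(n * 2), n, 2)                # point locations
        x <- rnorm(n)                                       # attribute of interest
        d <- as.matrix(dist(coords))                        # pairwise distances
        w <- 1 / (d + diag(n)); diag(w) <- 0                # inverse-distance weights
        w <- w / rowSums(w)                                 # row-standardize
        z <- x - mean(x)
        local_moran <- z * (w %*% z) / (sum(z^2) / n)       # local Moran's I per site
        head(local_moran)                                   # candidate input features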

    ContaminatedMixt: An R Package for Fitting Parsimonious Mixtures of Multivariate Contaminated Normal Distributions

    We introduce the R package ContaminatedMixt, conceived to disseminate the use of mixtures of multivariate contaminated normal distributions as a tool for robust clustering and classification under the common assumption of elliptically contoured groups. Thirteen variants of the model are also implemented to introduce parsimony. The expectation-conditional maximization (ECM) algorithm is adopted to obtain maximum likelihood parameter estimates, and likelihood-based model selection criteria are used to select the model and the number of groups. Parallel computation can be used on multicore PCs and computer clusters when several models have to be fitted. Differently from the more popular mixtures of multivariate normal and t distributions, this approach also allows for automatic detection of mild outliers via the maximum a posteriori probabilities procedure. To exemplify the use of the package, applications to artificial and real data are presented.
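
    A minimal usage sketch; CNmixt is the package's main fitting function, though argument names beyond the data matrix and the number of groups G should be checked against the package documentation.

        library(ContaminatedMixt)
        ## fit parsimonious mixtures of contaminated normals with 2 or 3 groups;
        ## likelihood-based criteria pick among the fitted candidates
        fit <- CNmixt(iris[, 1:4], G = 2:3)
        summary(fit)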