    NeuMiss networks: differentiable programming for supervised learning with missing values

    The presence of missing values makes supervised learning much more challenging. Indeed, previous work has shown that even when the response is a linear function of the complete data, the optimal predictor is a complex function of the observed entries and the missingness indicator. As a result, the computational or sample complexities of consistent approaches depend on the number of missing patterns, which can be exponential in the number of dimensions. In this work, we derive the analytical form of the optimal predictor under a linearity assumption and various missing data mechanisms including Missing at Random (MAR) and self-masking (Missing Not At Random). Based on a Neumann-series approximation of the optimal predictor, we propose a new principled architecture, named NeuMiss networks. Their originality and strength come from the use of a new type of non-linearity: the multiplication by the missingness indicator. We provide an upper bound on the Bayes risk of NeuMiss networks, and show that they have good predictive accuracy with both a number of parameters and a computational complexity independent of the number of missing data patterns. As a result they scale well to problems with many features, and remain statistically efficient for medium-sized samples. Moreover, we show that, contrary to procedures using EM or imputation, they are robust to the missing data mechanism, including difficult MNAR settings such as self-masking.
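
    To make the architecture concrete, here is a minimal NumPy sketch of the mechanism described above: a truncated Neumann-series iteration whose only non-linearity is multiplication by the missingness indicator, followed by a linear prediction head. The depth, the untied weight matrices, the zero-filling of missing entries and the final linear head are illustrative assumptions, not the authors' exact parameterization.

        import numpy as np

        rng = np.random.default_rng(0)

        def neumiss_features(x, mask, weights, mu):
            """Truncated Neumann-style iteration; the only non-linearity is
            multiplication by the missingness indicator (1 = observed)."""
            h = (x - mu) * mask          # centred input, missing entries zeroed
            out = h.copy()
            for W in weights:            # one (illustrative) matrix per iteration
                h = (W @ h) * mask       # the mask-multiplication non-linearity
                out = out + h            # accumulate the truncated series
            return out

        # Toy usage with made-up shapes and parameters.
        d = 5
        x_full = rng.normal(size=d)
        mask = (rng.random(d) > 0.3).astype(float)   # 1 where observed
        x = np.where(mask == 1.0, x_full, 0.0)       # zero-fill missing entries

        weights = [rng.normal(scale=0.1, size=(d, d)) for _ in range(3)]
        mu = np.zeros(d)
        beta = rng.normal(size=d)
        y_hat = beta @ neumiss_features(x, mask, weights, mu)  # linear head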

    A primordial origin for the atmospheric methane of Saturn's moon Titan

    The origin of Titan's atmospheric methane is a key issue for understanding the origin of the Saturnian satellite system. It has been proposed that serpentinization reactions in Titan's interior could lead to the formation of the observed methane. Meanwhile, alternative scenarios suggest that methane was incorporated in Titan's planetesimals before its formation. Here, we point out that serpentinization reactions in Titan's interior are not able to reproduce the deuterium over hydrogen (D/H) ratio observed at present in methane in its atmosphere, and would require a maximum D/H ratio in Titan's water ice 30% lower than the value likely acquired by the satellite during its formation, based on Cassini observations at Enceladus. Alternatively, production of methane in Titan's interior via radiolytic reactions with water can be envisaged but the associated production rates remain uncertain. On the other hand, a mechanism that easily explains the presence of large amounts of methane trapped in Titan in a way consistent with its measured atmospheric D/H ratio is its direct capture in the satellite's planetesimals at the time of their formation in the solar nebula. In this case, the mass of methane trapped in Titan's interior can be up to 1,300 times the current mass of atmospheric methane. (Accepted for publication in Icarus.)

    Value of 18F-FDG PET/CT in cholangiocarcinoma

    Cholangiocarcinoma is a rare tumour of the bile ducts. The survival rate is low and the only curative treatment is surgery, which nevertheless carries non-negligible morbidity and mortality. Accurate pre-therapeutic staging is therefore essential to best assess loco-regional and distant extension. We propose to study the value of 18F-FDG PET/CT compared with CT and/or MRI in this indication. Twenty-one patients followed at the CHU d'AMIENS with histologically proven cholangiocarcinoma were included. All underwent CT and/or MRI as well as pre-therapeutic PET/CT. The primary tumour, nodal extension and metastatic spread were assessed for each imaging modality against the final diagnosis based on pathology and/or follow-up. PET/CT, CT and MRI detected 81%, 81% and 92% of primary tumours respectively. The sensitivity and specificity of PET/CT for nodal extension were 38% and 90%, versus 38% and 70% for CT. For metastatic disease, they were 70% and 100% for PET and 50% and 90% for CT. These results are consistent with the literature. Detection of the primary tumour is equivalent with PET/CT and CT. PET/CT has good specificity for the assessment of nodal involvement. The sensitivity of PET/CT for detecting metastases is higher than that of CT, also with good specificity. PET/CT appears complementary to the other imaging modalities for staging cholangiocarcinoma and is indicated in patients judged operable on conventional imaging.
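
    As a reading aid for the sensitivity/specificity figures reported above, the short Python sketch below shows how these metrics are computed from a 2x2 confusion table; the counts used are made-up illustrative numbers, not data from the study.

        def sensitivity_specificity(tp, fn, tn, fp):
            """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
            return tp / (tp + fn), tn / (tn + fp)

        # Hypothetical counts for one imaging modality (not the study's data):
        sens, spec = sensitivity_specificity(tp=7, fn=3, tn=9, fp=1)
        print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")  # 70%, 90%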

    What's a good imputation to predict with missing values?

    35th Conference on Neural Information Processing Systems (NeurIPS 2021). How to learn a good predictor on data with missing values? Most efforts focus on first imputing as well as possible and second learning on the completed data to predict the outcome. Yet, this widespread practice has no theoretical grounding. Here we show that for almost all imputation functions, an impute-then-regress procedure with a powerful learner is Bayes optimal. This result holds for all missing-values mechanisms, in contrast with the classic statistical results that require missing-at-random settings to use imputation in probabilistic modeling. Moreover, it implies that perfect conditional imputation is not needed for good prediction asymptotically. In fact, we show that on perfectly imputed data the best regression function will generally be discontinuous, which makes it hard to learn. Crafting instead the imputation so as to leave the regression function unchanged simply shifts the problem to learning discontinuous imputations. Rather, we suggest that it is easier to learn imputation and regression jointly. We propose such a procedure, adapting NeuMiss, a neural network capturing the conditional links across observed and unobserved variables whatever the missing-value pattern. Experiments with a finite number of samples confirm that joint imputation and regression through NeuMiss outperforms various two-step procedures.
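
    For contrast with the joint approach advocated above, here is a hedged scikit-learn sketch of the impute-then-regress baseline (constant imputation followed by a flexible learner, optionally with the missingness indicator appended as extra features). The synthetic data and the specific imputer and learner choices are illustrative assumptions, not the paper's experimental setup.

        import numpy as np
        from sklearn.impute import SimpleImputer
        from sklearn.pipeline import make_pipeline
        from sklearn.ensemble import HistGradientBoostingRegressor
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n, d = 2000, 10
        X_full = rng.normal(size=(n, d))
        y = X_full @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
        X = X_full.copy()
        X[rng.random((n, d)) < 0.3] = np.nan        # inject ~30% missing values

        # Impute-then-regress: constant imputation, then a powerful learner.
        impute_then_regress = make_pipeline(
            SimpleImputer(strategy="constant", fill_value=0.0),
            HistGradientBoostingRegressor(),
        )
        # Variant that also appends the missingness indicator as features.
        mask_aware = make_pipeline(
            SimpleImputer(strategy="constant", fill_value=0.0, add_indicator=True),
            HistGradientBoostingRegressor(),
        )

        for name, model in [("impute-then-regress", impute_then_regress),
                            ("impute + mask features", mask_aware)]:
            score = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
            print(f"{name}: R^2 = {score:.3f}")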

    Foraminifera as potential bio-indicators of the “Erika” oil spill

    Benthic foraminifera are used as potential bio-indicators of pollution due to the “Erika” oil spill. The foraminiferal assemblages from a site situated on the tidal mudflat of the southern Bay of Bourgneuf (Vendée, France) were sampled 19 times on a monthly/bimonthly scale. The field study reveals unusually low densities and impoverished faunas during the first 21 months of the survey. In order to understand the effect of the “Erika” fuel, foraminiferal cultures with 0 to 72.0 mg per 100 ml of “Erika” oil were maintained under controlled laboratory conditions. An experiment with 5.5 mg per 100 ml of oil shows morphological abnormalities, cellular modifications and a low rate of reproduction. These first results confirm the potential toxicity of the No. 2 fuel oil from the “Erika” and the sensitivity of foraminifera to this pollutant.

    Linear predictor on linearly-generated data with missing values: non consistency and solutions

    We consider building predictors when the data have missing values. We study the seemingly-simple case where the target to predict is a linear function of the fully-observed data and we show that, in the presence of missing values, the optimal predictor may not be linear. In the particular Gaussian case, it can be written as a linear function of multiway interactions between the observed data and the various missing-value indicators. Due to its intrinsic complexity, we study a simple approximation and prove generalization bounds with finite samples, highlighting regimes for which each method performs best. We then show that multilayer perceptrons with ReLU activation functions can be consistent, and can explore good trade-offs between the true model and approximations. Our study highlights which families of models are beneficial to fit with missing values, depending on the amount of data available.
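
    To make the "multiway interactions" concrete, the NumPy sketch below builds a first-order expansion that interacts zero-filled values with missingness indicators and fits an ordinary linear predictor on it; the feature layout and the synthetic data are illustrative assumptions, and the Gaussian analysis above involves higher-order interaction terms as well.

        import numpy as np

        def mask_interaction_features(X):
            """Expand data with NaNs into [zero-filled X, missingness mask,
            x_j * m_k interaction terms] (first-order expansion only)."""
            mask = np.isnan(X).astype(float)               # 1 where missing
            X0 = np.nan_to_num(X, nan=0.0)                 # observed values, 0 elsewhere
            n, d = X.shape
            inter = (X0[:, :, None] * mask[:, None, :]).reshape(n, d * d)
            return np.hstack([X0, mask, inter])

        # Toy usage: the target is linear in the complete data, then values go missing.
        rng = np.random.default_rng(0)
        n, d = 500, 4
        X = rng.normal(size=(n, d))
        y = X @ rng.normal(size=d)
        X[rng.random((n, d)) < 0.2] = np.nan               # mask some entries

        Z = mask_interaction_features(X)
        coef, *_ = np.linalg.lstsq(Z, y, rcond=None)       # linear predictor on Z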
