    Robust Lasso-Zero for sparse corruption and model selection with missing covariates

    We propose Robust Lasso-Zero, an extension of the Lasso-Zero methodology [Descloux and Sardy, 2018], initially introduced for sparse linear models, to the sparse corruption problem. We give theoretical guarantees on the sign recovery of the parameters for a slightly simplified version of the estimator, called Thresholded Justice Pursuit. The use of Robust Lasso-Zero is showcased for variable selection with missing values in the covariates. In addition to not requiring the specification of a model for the covariates, nor estimation of their covariance matrix or of the noise variance, the method has the great advantage of handling missing-not-at-random values without specifying a parametric model. Numerical experiments and a medical application underline the relevance of Robust Lasso-Zero in such a context, where few competitors are available. The method is easy to use and implemented in the R package lass0.
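    As a rough illustration of the simplified estimator, a minimal Python sketch of Thresholded Justice Pursuit follows. It assumes sparse corruptions are handled by augmenting the design with a scaled identity block, uses a small-penalty Lasso from scikit-learn as a stand-in for an exact basis-pursuit solve, and treats the threshold `tau` as an illustrative tuning parameter, not the paper's choice.

```python
# Sketch of Thresholded Justice Pursuit (simplified Robust Lasso-Zero).
# Assumption: sample-wise corruptions enter as extra coefficients,
# y = X beta + sqrt(n) * omega + noise, fitted on [X | sqrt(n) I].
import numpy as np
from sklearn.linear_model import Lasso

def thresholded_justice_pursuit(X, y, tau=0.1, alpha=1e-4):
    n, p = X.shape
    # Augmented design: one extra column per observation for corruptions.
    X_aug = np.hstack([X, np.sqrt(n) * np.eye(n)])
    # Small-penalty Lasso approximates a basis-pursuit-type solve.
    fit = Lasso(alpha=alpha, fit_intercept=False, max_iter=50_000).fit(X_aug, y)
    beta = fit.coef_[:p].copy()
    # Hard-threshold small coefficients to recover the support and signs.
    beta[np.abs(beta) < tau] = 0.0
    return beta
```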

    Uncertainty in a chemistry-transport model due to physical parameterizations and numerical approximations: An ensemble approach applied to ozone modeling

    This paper estimates the uncertainty in the outputs of a chemistry-transport model due to physical parameterizations and numerical approximations. An ensemble of 20 simulations is generated from a reference simulation by changing one key parameterization (chemical mechanism, dry deposition parameterization, turbulent closure, etc.) or one numerical approximation (grid size, splitting method, etc.) at a time. Intercomparisons of the simulations and comparisons with observations allow us to assess the impact of each parameterization and numerical approximation and the robustness of the model. An ensemble of 16 simulations is also generated with multiple changes to the reference simulation in order to estimate the overall uncertainty. The case study is a four-month simulation of ozone concentrations over Europe in 2001, performed with the modeling system Polyphemus. It is shown that the uncertainty due to the physical parameterizations (notably the turbulence closure and the chemical mechanism) is high. The low robustness suggests that ensemble approaches are necessary in most applications.
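    A minimal sketch of the kind of diagnostics such an ensemble supports: member-to-member spread as an uncertainty proxy, and per-member RMSE against observations. Array names and shapes are illustrative assumptions, not the study's data layout.

```python
# Sketch: ensemble spread and per-member skill against observations.
# `members` is assumed shaped (n_members, n_stations, n_hours) and
# `observations` shaped (n_stations, n_hours).
import numpy as np

def ensemble_diagnostics(members, observations):
    spread = members.std(axis=0)  # member-to-member spread per station/hour
    rmse = np.sqrt(((members - observations) ** 2).mean(axis=(1, 2)))
    return spread.mean(), rmse    # mean spread, one RMSE per member
```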

    3-D chemistry-transport model Polair: numerical issues, validation and automatic-differentiation strategy

    This short paper briefly presents some issues related to the development and validation of the three-dimensional chemistry-transport model Polair. Numerical studies have been performed to make Polair an efficient and robust solver; this paper summarizes and comments on the choices made in this respect. Simulations of relevant photochemical episodes were carried out to assess the validity of the model. The results can be considered a validation, which allows subsequent studies to focus on fine modeling issues. A major feature of Polair is the availability of a tangent linear mode and an adjoint mode entirely generated by automatic differentiation. The tangent linear and adjoint modes make it possible to perform detailed sensitivity analyses and data assimilation. This paper shows how inverse modeling is achieved with Polair.
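    A standard way to check tangent linear and adjoint codes such as those generated for Polair is the dot-product test, <M dx, dy> = <dx, Mᵀ dy>. The sketch below runs this test on a toy 1-D upwind advection step standing in for the model; it is not Polair's code.

```python
# Sketch: dot-product test for a tangent-linear/adjoint pair, on a toy
# periodic 1-D upwind advection step standing in for the real model.
import numpy as np

def tangent(dx, c=0.4):
    # The step is linear, so the tangent-linear equals the step itself.
    return dx - c * (dx - np.roll(dx, 1))

def adjoint(dy, c=0.4):
    # Transpose of the tangent-linear operator.
    return dy - c * (dy - np.roll(dy, -1))

rng = np.random.default_rng(0)
dx, dy = rng.normal(size=100), rng.normal(size=100)
assert np.isclose(np.dot(tangent(dx), dy), np.dot(dx, adjoint(dy)))
```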

    Ensemble-based air quality forecasts: A multimodel approach applied to ozone

    The potential of ensemble techniques to improve ozone forecasts is investigated. Ensembles with up to 48 members (models) are generated with the modeling system Polyphemus. Members differ in their physical parameterizations, their numerical approximations, and their input data. Each model is evaluated over four months (summer 2001) over Europe against hundreds of stations from three ozone-monitoring networks. We find that several linear combinations of models have the potential to drastically improve the performance of model-to-data comparisons. The optimal weights associated with each model are not robust in time or space, so forecasting these weights requires suitable methods, such as the selection of adequate learning data sets or specific learning algorithms. The resulting forecasted combinations achieve significant performance improvements: a decrease of about 10% in the root-mean-square error is obtained on ozone daily peaks, and ozone hourly concentrations show even stronger improvements.
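    A minimal sketch of the combination step: weights fitted by least squares on a learning window and then applied out of sample. An unconstrained least-squares fit is one plausible reading of "linear combinations"; names and shapes are illustrative.

```python
# Sketch: learn model-combination weights on past data, apply later.
# `forecasts` is assumed shaped (n_models, n_times), `obs` (n_times,).
import numpy as np

def learn_weights(forecasts, obs):
    # Solve min_w || forecasts.T @ w - obs ||^2 over the learning period.
    w, *_ = np.linalg.lstsq(forecasts.T, obs, rcond=None)
    return w

def combine(forecasts, w):
    # Weighted combination of member forecasts at each time.
    return forecasts.T @ w
```

    Note that the weights are neither constrained to be positive nor to sum to one, which is consistent with arbitrary linear combinations being allowed.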

    MICS Asia Phase II - Sensitivity to the aerosol module

    In the framework of the Model Inter-Comparison Study Asia Phase II (MICS2), in which eight models are compared over East Asia, this paper studies the influence of different parameterizations used in the aerosol module on the sulfate and nitrate concentrations in PM10. An intracomparison of aerosol concentrations is carried out for March 2001 using different configurations of the aerosol module of one of the intercomparison models. Single modifications of a reference setup are performed and compared to the reference case. These modifications concern the size distribution (i.e. the number of sections) and the physical processes (coagulation, condensation/evaporation, cloud chemistry, heterogeneous reactions and sea-salt emissions). Comparing monthly averaged concentrations at different stations, the importance of each parameterization is first assessed. Sulfate concentrations show little sensitivity to sea-salt emissions and to whether condensation is computed dynamically or by assuming thermodynamic equilibrium. Nitrate concentrations show little sensitivity to cloud chemistry, but a very high sensitivity to heterogeneous reactions. The variability of the aerosol concentrations due to the use of different chemistry-transport models (CTMs) is then compared to the variability due to the use of different parameterizations in the aerosol module. For sulfate, the variability due to the parameterizations in the aerosol module is lower than the variability due to the choice of CTM; for nitrate, for monthly concentrations averaged over four stations, the two variabilities are of the same order of magnitude.
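    A minimal sketch of how the two variabilities can be compared, assuming arrays of monthly mean concentrations with one row per run; names and shapes are illustrative assumptions.

```python
# Sketch: spread of monthly mean concentrations across runs, used to
# compare configuration variability with inter-CTM variability.
# Inputs are assumed shaped (n_runs, n_stations).
import numpy as np

def variability(monthly_means):
    # Standard deviation over runs, averaged over stations.
    return monthly_means.std(axis=0).mean()

# Per the study: variability(config_runs) < variability(ctm_runs) for
# sulfate, while the two are comparable for nitrate.
```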

    A comparison study of data assimilation algorithms for ozone forecasts

    The Institute of Radiation Protection and Nuclear Safety (France) is planning the set-up of an automatic nuclear aerosol monitoring network over the French territory. Each station will be able to automatically sample the aerosol content of the air and provide activity concentration measurements for several radionuclides. This should help monitor the nuclear power plants of France and neighbouring countries, and would help evaluate the impact of a radiological incident occurring at one of these nuclear facilities. This paper is devoted to the spatial design of such a network. Any potential network is judged on its ability to extrapolate the activity concentrations measured at its stations over the whole domain. The performance of a network is quantitatively assessed through a cost function that measures the discrepancy between the extrapolation and the true concentration fields. These true fields are obtained by computing a database of dispersion accidents originating from 20 French nuclear sites over one year of meteorology. A close-to-optimal network is then sought by simulated annealing. The results emphasise the importance of the cost function in the design of a network aimed at monitoring accidental dispersion: several choices of the norm used in the cost function are studied and lead to different designs. The influence of the number of stations is discussed, and a comparison is made with a purely geometric approach that does not involve simulations with a chemistry-transport model.
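    A minimal sketch of the design loop follows, assuming nearest-neighbour extrapolation as the (illustrative) interpolation scheme and a simple linear cooling schedule; none of this is the paper's exact setup.

```python
# Sketch: simulated annealing over station subsets, scoring each network
# by how well values extrapolated from its stations match the "true"
# fields from the accident database. All choices here are illustrative.
import numpy as np

def cost(network, coords, fields):
    # Nearest-neighbour extrapolation from the network's stations,
    # then RMSE against the true fields (one row per accident).
    d = np.linalg.norm(coords[:, None, :] - coords[None, network, :], axis=-1)
    nearest = np.asarray(network)[d.argmin(axis=1)]
    return np.sqrt(((fields - fields[:, nearest]) ** 2).mean())

def anneal(coords, fields, n_stations, n_iter=10_000, t0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    n = len(coords)
    network = list(rng.choice(n, n_stations, replace=False))
    current = cost(network, coords, fields)
    best = current
    for k in range(n_iter):
        t = t0 * (1 - k / n_iter)          # linear cooling schedule
        candidate = network.copy()
        candidate[rng.integers(n_stations)] = rng.integers(n)  # swap one site
        if len(set(candidate)) < n_stations:
            continue                       # skip duplicated stations
        c = cost(candidate, coords, fields)
        # Accept improvements always, worse moves with Metropolis probability.
        if c < current or rng.random() < np.exp((current - c) / max(t, 1e-9)):
            network, current = candidate, c
            best = min(best, current)
    return network, best
```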

    Polyphemus: a multimodel platform for air pollution and risk assessment

    This article presents the Polyphemus air quality modeling system, its main features and a few applications. Polyphemus is dedicated to modeling the atmospheric dispersion of passive tracers or reactive species at local, regional and continental scales. It is developed at CEREA, a joint laboratory of EDF R&D and École des Ponts, within a joint project with the Institut national de recherche en informatique et en automatique (INRIA) and with the support of IRSN and INERIS. Polyphemus is a system of a new kind, which departs from the classical "all-in-one model" approach through its modular design, notably based on libraries and on drivers that manipulate the dispersion models. Since it hosts several models, Polyphemus is a platform rather than a single model. One of its notable features is its ability to carry out multimodel simulations, which makes it possible to assess uncertainties. Several data assimilation methods are also part of the system, so that data provided by monitoring networks can be integrated.
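    The driver-and-model decomposition can be pictured as follows; the interface and names are hypothetical, not Polyphemus's actual API.

```python
# Sketch of the platform idea: drivers manipulate interchangeable
# dispersion models through a shared interface. Everything here is a
# hypothetical illustration of the architecture, not Polyphemus code.
class DispersionModel:
    """Common interface that every hosted model would implement."""
    def initialize(self): ...
    def step(self): ...
    def concentrations(self): ...

def ensemble_driver(models, n_steps=24):
    # A driver running several model variants for a multimodel ensemble.
    outputs = []
    for model in models:
        model.initialize()
        for _ in range(n_steps):
            model.step()
        outputs.append(model.concentrations())
    return outputs
```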

    Dominant aerosol processes during high-pollution episodes over Greater Tokyo

    This paper studies two high-pollution episodes over Greater Tokyo: 9-10 December 1999, and 31 July-1 August 2001. Results obtained with the chemistry-transport model (CTM) Polair3D are compared to measurements of inorganic PM2.5. To understand to what extent the aerosol processes modeled in Polair3D impact simulated inorganic PM2.5, Polair3D is run with different options in the aerosol module, e.g. with and without heterogeneous reactions. To quantify the impact of processes outside the aerosol module, simulations are also carried out with another CTM (CMAQ). In the winter episode, sulfate is mostly impacted by condensation, coagulation and long-range transport, and to a lesser extent by deposition. In the summer episode, the effect of long-range transport largely dominates. The impact of condensation/evaporation is dominant for ammonium, nitrate and chloride in both episodes, although the impact of the thermodynamic equilibrium assumption is limited. The impact of heterogeneous reactions is large for nitrate and ammonium, and taking heterogeneous reactions into account appears to be crucial in predicting the peaks of nitrate and ammonium. The impact of deposition is the same for all inorganic PM2.5: small compared to the impact of other processes, although not negligible. The impact of nucleation is negligible in the summer episode and small in the winter episode. The impact of coagulation is larger in the winter episode than in the summer episode, because the number of small particles is higher in winter as a consequence of nucleation. (Journal of Geophysical Research D: Atmospheres, in press, 15 May 2007.)
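    One simple way to quantify such process impacts is to re-run the model with a process switched off and measure the normalized departure from the reference run; `run_model` and its `processes` argument below are illustrative placeholders, not Polair3D's interface.

```python
# Sketch: a process's impact measured as the normalized difference
# between a run with that process switched off and the reference run.
import numpy as np

def process_impact(run_model, reference, process):
    # `run_model` is a placeholder callable returning concentration fields.
    perturbed = run_model(processes={process: False})
    return np.abs(perturbed - reference).mean() / np.abs(reference).mean()
```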

    Are labels informative in semi-supervised learning? -- Estimating and leveraging the missing-data mechanism

    Semi-supervised learning (SSL) is a powerful technique for leveraging unlabeled data to improve machine learning models, but it can be affected by the presence of "informative" labels, which occur when some classes are more likely to be labeled than others. In the missing-data literature, such labels are called missing not at random. In this paper, we propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm, including those using data augmentation. We also propose a likelihood ratio test to assess whether or not labels are indeed informative. Finally, we demonstrate the performance of the proposed methods on different datasets, in particular on two medical datasets for which we design pseudo-realistic missing-data scenarios.
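    A minimal sketch of the inverse-propensity idea, assuming the labeling probability can be modeled from observed features with a logistic regression; the paper's actual estimation of the missing-data mechanism may differ.

```python
# Sketch: estimate the labeling propensity, then weight each labeled
# example by its inverse propensity inside any SSL loss. Illustrative
# only; not the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

def inverse_propensity_weights(X, labeled_mask, clip=1e-3):
    # P(labeled | x) estimated on all examples, labeled and unlabeled.
    propensity_model = LogisticRegression().fit(X, labeled_mask.astype(int))
    propensity = propensity_model.predict_proba(X)[:, 1].clip(clip, 1.0)
    return 1.0 / propensity[labeled_mask]

# The weights then multiply the supervised term of the SSL objective,
# so over-represented classes no longer dominate the labeled loss.
```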

    R-miss-tastic: a unified platform for missing values methods and workflows

    Missing values are unavoidable when working with data, and their occurrence is exacerbated as more data from different sources become available. However, most statistical models and visualization methods require complete data, and improper handling of missing data results in information loss or biased analyses. Since the seminal work of Rubin (1976), there has been a burgeoning literature on missing values, with heterogeneous aims and motivations, which has resulted in a variety of methods, formalizations and tools (including a large number of R packages and Python modules). For practitioners, however, it remains challenging to decide which method is best suited to their problem, partly because the handling of missing data is still not systematically covered in statistics or data science curricula. To help address this challenge, we have launched a unified platform, "R-miss-tastic", which aims to provide an overview of standard missing-values problems and methods, of how to handle them in analyses, and of relevant implementations. In the same spirit, we have also developed several pipelines in R and Python that give a hands-on illustration of how to handle missing values in statistical tasks such as estimation and prediction, while ensuring reproducibility of the analyses. These should also provide some guidance on choosing a method for a specific problem and data set. The objective of this work is not only to organize materials comprehensively, but also to create standardized analysis workflows and to provide common ground for discussion in the community. The platform is thus suited to beginners, students, more advanced analysts and researchers.
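    A minimal sketch of one such prediction workflow in Python, using scikit-learn's imputer inside a pipeline so that imputation is fitted on training folds only; the data and model choices are illustrative, not the platform's pipelines.

```python
# Sketch: impute-then-predict inside a single scikit-learn pipeline,
# cross-validated so the imputer never sees held-out data.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Toy data with missing entries (illustrative).
X = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]] * 10)
y = np.arange(len(X), dtype=float)

model = make_pipeline(SimpleImputer(strategy="mean"),
                      RandomForestRegressor(random_state=0))
scores = cross_val_score(model, X, y, cv=5)
```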