Robust Lasso-Zero for sparse corruption and model selection with missing covariates
We propose Robust Lasso-Zero, an extension of the Lasso-Zero methodology
[Descloux and Sardy, 2018], initially introduced for sparse linear models, to
the sparse corruptions problem. We give theoretical guarantees on the sign
recovery of the parameters for a slightly simplified version of the estimator,
called Thresholded Justice Pursuit. The use of Robust Lasso-Zero is showcased
for variable selection with missing values in the covariates. In addition to
not requiring the specification of a model for the covariates, nor estimating
their covariance matrix or the noise variance, the method has the great
advantage of handling missing-not-at-random values without specifying a
parametric model. Numerical experiments and a medical application underline the
relevance of Robust Lasso-Zero in such a context with few available
competitors. The method is easy to use and implemented in the R library lass0.
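The augmented-design idea behind justice pursuit, fitting sparse corruptions as extra coefficients on an identity block, can be illustrated with an ordinary lasso solver followed by hard thresholding. This is a minimal sketch of the principle, not the lass0 implementation; the data, penalty level, and threshold are all made up:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, k = 50, 100, 3

# Sparse ground truth, and two grossly corrupted responses.
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 5.0
y = X @ beta
y[:2] += 20.0  # sparse corruption

# Justice-pursuit-style augmentation: append a scaled identity block so
# that corruptions are fitted as extra sparse coefficients.
X_aug = np.hstack([X, np.sqrt(n) * np.eye(n)])

fit = Lasso(alpha=1e-3, fit_intercept=False, max_iter=100_000).fit(X_aug, y)
beta_hat = fit.coef_[:p]

# Hard-threshold small coefficients, as in thresholded estimators.
tau = 0.5
support = np.flatnonzero(np.abs(beta_hat) > tau)
print(support)
```

With a small penalty the lasso approximates basis pursuit on the augmented design, so the corrupted responses are absorbed by the identity columns instead of distorting the coefficient estimates.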
Uncertainty in a chemistry-transport model due to physical parameterizations and numerical approximations: An ensemble approach applied to ozone modeling
This paper estimates the uncertainty in the outputs of a chemistry-transport model due to physical parameterizations and numerical approximations. An ensemble of 20 simulations is generated from a reference simulation in which one key parameterization (chemical mechanism, dry deposition parameterization, turbulent closure, etc.) or one numerical approximation (grid size, splitting method, etc.) is changed at a time. Intercomparisons of the simulations and comparisons with observations allow us to assess the impact of each parameterization and numerical approximation and the robustness of the model. An ensemble of 16 simulations is also generated with multiple changes to the reference simulation in order to estimate the overall uncertainty. The case study is a four-month simulation of ozone concentrations over Europe in 2001, performed using the modeling system Polyphemus. It is shown that there is a high uncertainty due to the physical parameterizations (notably the turbulent closure and the chemical mechanism). The low robustness suggests that ensemble approaches are necessary in most applications.
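The member-to-member spread underlying such an uncertainty estimate reduces, in the simplest view, to per-grid-point statistics over the ensemble. A minimal numpy sketch (array shapes and values are hypothetical, not Polyphemus output):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical ensemble: 20 members, each a 2-D ozone field (lat x lon).
members = 40.0 + 5.0 * rng.standard_normal((20, 30, 40))

ens_mean = members.mean(axis=0)            # best estimate at each grid point
ens_spread = members.std(axis=0, ddof=1)   # uncertainty at each grid point

# A single scalar summary of the overall model uncertainty:
print(f"domain-averaged spread: {ens_spread.mean():.2f}")
```

Comparing such spread maps between the one-change-at-a-time ensemble and the multiple-change ensemble is what separates the contribution of each parameterization from the overall uncertainty.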
3-D chemistry-transport model Polair: numerical issues, validation and automatic-differentiation strategy
We briefly present in this short paper some issues related to the development and validation of the three-dimensional chemistry-transport model Polair. Numerical studies have been performed to make Polair an efficient and robust solver. This paper summarizes and comments on the choices that were made in this respect. Simulations of relevant photochemical episodes were carried out to assess the validity of the model. The results can be considered a validation, which allows subsequent studies to focus on fine modeling issues. A major feature of Polair is the availability of a tangent linear mode and an adjoint mode entirely generated by automatic differentiation. The tangent linear and adjoint modes make it possible to perform detailed sensitivity analyses and data assimilation. This paper shows how inverse modeling is achieved with Polair.
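A standard correctness check for tangent linear and adjoint codes, whether hand-written or generated by automatic differentiation, is the dot-product test <J dx, dy> = <dx, J^T dy>. A toy numpy illustration with a made-up three-variable model (not Polair):

```python
import numpy as np

def jacobian(x):
    # Jacobian of the toy "model" f(x) = (x0*x1, sin(x2), x0 + x2**2),
    # standing in for one linearized CTM time step.
    return np.array([[x[1], x[0], 0.0],
                     [0.0, 0.0, np.cos(x[2])],
                     [1.0, 0.0, 2.0 * x[2]]])

def tangent(x, dx):
    # Tangent linear mode: Jacobian-vector product J(x) @ dx.
    return jacobian(x) @ dx

def adjoint(x, dy):
    # Adjoint mode: transpose-Jacobian-vector product J(x).T @ dy.
    return jacobian(x).T @ dy

x = np.array([1.0, 2.0, 0.5])
dx = np.array([0.1, -0.2, 0.3])
dy = np.array([0.4, 0.1, -0.5])

# Dot-product test: <J dx, dy> must equal <dx, J^T dy> to machine precision.
lhs = tangent(x, dx) @ dy
rhs = dx @ adjoint(x, dy)
print(abs(lhs - rhs))
```

The tangent mode drives sensitivity analyses (how outputs respond to perturbed inputs), while the adjoint mode propagates a model-to-data misfit back to the inputs, which is the building block of the inverse modeling mentioned above.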
Ensemble-based air quality forecasts: A multimodel approach applied to ozone
The potential of ensemble techniques to improve ozone forecasts is investigated. Ensembles with up to 48 members (models) are generated using the modeling system Polyphemus. Members differ in their physical parameterizations, their numerical approximations, and their input data. Each model is evaluated over 4 months (summer 2001) across Europe against hundreds of stations from three ozone-monitoring networks. We find that several linear combinations of models have the potential to drastically improve the performance of model-to-data comparisons. The optimal weights associated with each model are not robust in time or space. Forecasting these weights therefore requires suitable methods, such as the selection of adequate learning data sets or specific learning algorithms. Significant performance improvements are achieved by the resulting forecasted combinations: the root-mean-square error on daily ozone peaks decreases by about 10%, and ozone hourly concentrations show even stronger improvements.
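The learned linear combination can be sketched as a least-squares regression of observations on member forecasts over a training window, then applied out of sample. Everything below is synthetic and only illustrates the principle, not the Polyphemus learning algorithms:

```python
import numpy as np

rng = np.random.default_rng(2)
n_obs, n_models = 200, 5

# Hypothetical truth, member forecasts, and station observations.
truth = 60.0 + 10.0 * rng.standard_normal(n_obs)
forecasts = truth[:, None] + 8.0 * rng.standard_normal((n_obs, n_models))
obs = truth + 2.0 * rng.standard_normal(n_obs)

# Learn combination weights on a training window ...
train = slice(0, 150)
w, *_ = np.linalg.lstsq(forecasts[train], obs[train], rcond=None)

# ... and evaluate the combined forecast out of sample.
test = slice(150, None)
combined = forecasts[test] @ w
rmse_best = min(np.sqrt(((forecasts[test, j] - obs[test]) ** 2).mean())
                for j in range(n_models))
rmse_comb = np.sqrt(((combined - obs[test]) ** 2).mean())
print(f"best single model RMSE {rmse_best:.2f}, combination RMSE {rmse_comb:.2f}")
```

When member errors are partly independent, the combination averages them out and beats every individual member; the hard part, as the abstract notes, is that the weights themselves drift in time and space and must be forecast.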
MICS Asia Phase II - Sensitivity to the aerosol module
In the framework of the Model Inter-Comparison Study - Asia Phase II (MICS2), in which eight models are compared over East Asia, this paper studies the influence of different parameterizations used in the aerosol module on the aerosol concentrations of sulfate and nitrate in PM10. An intracomparison of aerosol concentrations is carried out for March 2001 using different configurations of the aerosol module of one of the models used in the intercomparison. Single modifications of a reference setup are performed and compared to the reference case. These modifications concern the size distribution (the number of sections) and the physical processes (coagulation, condensation/evaporation, cloud chemistry, heterogeneous reactions, and sea-salt emissions). Comparing monthly averaged concentrations at different stations, the importance of each parameterization is first assessed. Sulfate concentrations are found to be only weakly sensitive to sea-salt emissions and to whether condensation is computed dynamically or under a thermodynamic-equilibrium assumption. Nitrate concentrations are only weakly sensitive to cloud chemistry but highly sensitive to heterogeneous reactions. The variability of the aerosol concentrations across different chemistry-transport models (CTMs) is then compared with the variability across different parameterizations of the aerosol module. For sulfate, the variability across parameterizations in the aerosol module is lower than the variability across CTMs. For nitrate, however, for monthly concentrations averaged over four stations, the two variabilities are of the same order of magnitude.
A comparison study of data assimilation algorithms for ozone forecasts
The Institute of Radiation Protection and Nuclear Safety (France) is planning the set-up of an automatic nuclear aerosol monitoring network over the French territory. Each station will automatically sample the aerosol content of the air and provide activity concentration measurements for several radionuclides. This should help monitor the French and neighbouring countries' nuclear power plants, and would help evaluate the impact of a radiological incident occurring at one of these nuclear facilities. This paper is devoted to the spatial design of such a network. Any potential network is judged on its ability to extrapolate the activity concentrations measured at its stations over the whole domain. The performance of a network is quantitatively assessed through a cost function that measures the discrepancy between the extrapolation and the true concentration fields. These true fields are obtained by computing a database of dispersion accidents over one year of meteorology, originating from 20 French nuclear sites. A close-to-optimal network is then sought using simulated annealing. The results emphasise the importance of the cost function in the design of a network aimed at monitoring an accidental dispersion. Several choices of the norm used in the cost function are studied and lead to different designs. The influence of the number of stations is discussed, and a comparison is made with a purely geometric approach that does not involve simulations with a chemistry-transport model.
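The simulated-annealing search over station sets can be sketched as follows; the cost function here is a stand-in (nearest-station extrapolation error on a synthetic field), not the paper's accident database:

```python
import numpy as np

rng = np.random.default_rng(3)
n_candidates, n_stations = 60, 8

# Candidate station locations and a synthetic smooth "true" field.
pts = rng.uniform(0.0, 1.0, size=(n_candidates, 2))
field = np.sin(3.0 * pts[:, 0]) + np.cos(2.0 * pts[:, 1])

def cost(network):
    # Stand-in cost: squared error of nearest-station extrapolation,
    # evaluated at every candidate point.
    d = np.linalg.norm(pts[:, None, :] - pts[network][None, :, :], axis=2)
    nearest = np.asarray(network)[d.argmin(axis=1)]
    return np.mean((field - field[nearest]) ** 2)

# Simulated annealing: swap one station at a time, accept uphill moves
# with probability exp(-delta / T), and slowly cool the temperature.
network = list(rng.choice(n_candidates, n_stations, replace=False))
c = c0 = cost(network)
best, best_c = network.copy(), c
T = 1.0
for _ in range(2000):
    new_station = int(rng.integers(n_candidates))
    if new_station in network:
        continue
    trial = network.copy()
    trial[int(rng.integers(n_stations))] = new_station
    c_trial = cost(trial)
    if c_trial < c or rng.random() < np.exp((c - c_trial) / T):
        network, c = trial, c_trial
        if c < best_c:
            best, best_c = network.copy(), c
    T *= 0.995
print(f"initial cost {c0:.4f} -> best cost {best_c:.4f}")
```

Swapping the cost function (e.g. a different norm on the extrapolation error) changes which designs the annealing converges to, which is exactly the sensitivity the abstract highlights.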
Polyphemus: a multimodel platform for air pollution and risk assessment
This article presents the air quality modeling system Polyphemus, its main features and some applications. Polyphemus is dedicated to modeling the atmospheric dispersion of passive tracers or reactive species at local, regional and continental scales. Polyphemus is developed at CEREA, a joint laboratory between EDF R&D and École des Ponts, within a joint project with the Institut national de recherche en informatique et automatique (INRIA), and with the support of IRSN and INERIS. Polyphemus is a system of a new type that departs from the classical "all-in-one model" approach through its modular construction, notably built on libraries and on drivers that manipulate the dispersion models. Hosting several models, Polyphemus is a platform rather than a single model. One of its notable features is its ability to perform multimodel simulations, which makes it possible to assess uncertainties. Several data assimilation methods are also part of the system, so that data provided by monitoring networks can be integrated.
Dominant aerosol processes during high-pollution episodes over Greater Tokyo
This paper studies two high-pollution episodes over Greater Tokyo: 9 and 10
December 1999, and 31 July and 1 August 2001. Results obtained with the
chemistry-transport model (CTM) Polair3D are compared to measurements of
inorganic PM2.5. To understand to what extent the aerosol processes modeled in
Polair3D impact simulated inorganic PM2.5, Polair3D is run with different
options in the aerosol module, e.g. with/without heterogeneous reactions. To
quantify the impact of processes outside the aerosol module, simulations are
also done with another CTM (CMAQ). In the winter episode, sulfate is mostly impacted by condensation, coagulation, long-range transport, and, to a lesser extent, deposition. In the summer episode, the effect of long-range transport
largely dominates. The impact of condensation/evaporation is dominant for
ammonium, nitrate and chloride in both episodes. However, the impact of the
thermodynamic equilibrium assumption is limited. The impact of heterogeneous
reactions is large for nitrate and ammonium, and taking heterogeneous reactions
into account appears to be crucial in predicting the peaks of nitrate and
ammonium. The impact of deposition is the same for all inorganic PM2.5. It is
small compared to the impact of other processes although it is not negligible.
The impact of nucleation is negligible in the summer episode, and small in the
winter episode. The impact of coagulation is larger in the winter episode than
in the summer episode, because the number of small particles is higher in the
winter episode as a consequence of nucleation. (Comment: Journal of Geophysical Research D: Atmospheres, 15/05/2007, in press.)
Are labels informative in semi-supervised learning? -- Estimating and leveraging the missing-data mechanism
Semi-supervised learning is a powerful technique for leveraging unlabeled
data to improve machine learning models, but it can be affected by the presence
of "informative" labels, which occur when some classes are more likely to be
labeled than others. In the missing data literature, such labels are called
missing not at random. In this paper, we propose a novel approach to address
this issue by estimating the missing-data mechanism and using inverse
propensity weighting to debias any SSL algorithm, including those using data
augmentation. We also propose a likelihood ratio test to assess whether or not
labels are indeed informative. Finally, we demonstrate the performance of the
proposed methods on different datasets, in particular on two medical datasets
for which we design pseudo-realistic missing-data scenarios.
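The effect of inverse propensity weighting is easiest to see when the labeling probabilities are known: weighting each labeled example by the inverse of its propensity removes the bias caused by informative labels. A schematic sketch with synthetic class-dependent propensities (the paper estimates these probabilities rather than assuming them):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 20_000

# Two classes; class 1 is twice as prevalent, but class 0 is labeled far
# more often, so the labels are "informative" (missing not at random).
y = rng.binomial(1, 2 / 3, size=n)
propensity = np.where(y == 0, 0.9, 0.1)   # P(labeled | class)
labeled = rng.random(n) < propensity

# A naive estimate of P(y = 1) from the labeled data is badly biased ...
p_naive = y[labeled].mean()

# ... while reweighting by 1 / propensity recovers the true rate.
w = 1.0 / propensity[labeled]
p_ipw = np.sum(w * y[labeled]) / np.sum(w)

print(f"naive {p_naive:.2f} vs IPW {p_ipw:.2f} (true 0.67)")
```

The same weights can be plugged into any weighted SSL loss, including losses over augmented views of the data, which is what makes the approach compatible with augmentation-based algorithms.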
R-miss-tastic: a unified platform for missing values methods and workflows
Missing values are unavoidable when working with data. Their occurrence is
exacerbated as more data from different sources become available. However, most
statistical models and visualization methods require complete data, and
improper handling of missing data results in information loss, or biased
analyses. Since the seminal work of Rubin (1976), there has been a burgeoning
literature on missing values with heterogeneous aims and motivations. This has
resulted in the development of various methods, formalizations, and tools
(including a large number of R packages and Python modules). However, for
practitioners, it remains challenging to decide which method is most suited for
their problem, partially because handling missing data is still not a topic
systematically covered in statistics or data science curricula.
To help address this challenge, we have launched a unified platform:
"R-miss-tastic", which aims to provide an overview of standard missing values
problems, methods, how to handle them in analyses, and relevant implementations
of methodologies. In the same perspective, we have also developed several
pipelines in R and Python to allow for a hands-on illustration of how to handle
missing values in various statistical tasks such as estimation and prediction,
while ensuring reproducibility of the analyses. This will hopefully also
provide some guidance on deciding which method to choose for a specific problem
and data. The objective of this work is not only to comprehensively organize
materials, but also to create standardized analysis workflows, and to provide a
common ground for discussions among the community. This platform is thus suited
for beginners, students, more advanced analysts, and researchers. (Comment: 38 pages, 9 figures.)
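A typical workflow of the kind such a platform documents, imputing and then fitting while keeping the evaluation honest, can be sketched with scikit-learn; this is a generic example, not one of the platform's own pipelines:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(5)
n, p = 500, 4

# Synthetic regression data with about 20% of covariate entries missing.
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.standard_normal(n)
X_miss = X.copy()
X_miss[rng.random((n, p)) < 0.2] = np.nan

# Putting the imputer inside the pipeline keeps cross-validation honest:
# the imputer is refit on each training fold only, never on held-out data.
model = make_pipeline(SimpleImputer(strategy="mean"), LinearRegression())
scores = cross_val_score(model, X_miss, y, cv=5, scoring="r2")
print(f"mean R^2 across folds: {scores.mean():.2f}")
```

Mean imputation is only the simplest baseline; the point of the pipeline structure is that a more principled imputer can be swapped in without changing the evaluation protocol.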