8 research outputs found

    Delineating Parameter Unidentifiabilities in Complex Models

    Scientists use mathematical modelling to understand and predict the properties of complex physical systems. In highly parameterised models there often exist relationships between parameters over which model predictions are identical, or nearly so. These are known as structural or practical unidentifiabilities, respectively. They are hard to diagnose and make reliable parameter estimation from data impossible. They furthermore imply the existence of an underlying model simplification. We describe a scalable method for detecting unidentifiabilities, and the functional relations defining them, for generic models. This allows for model simplification, and appreciation of which parameters (or functions thereof) cannot be estimated from data. Our algorithm can identify features such as redundant mechanisms and fast timescale subsystems, as well as the regimes in which such approximations are valid. We base our algorithm on a novel quantification of regional parametric sensitivity: multiscale sloppiness. Traditionally, the link between parametric sensitivity and the conditioning of the parameter estimation problem is made locally, through the Fisher Information Matrix. This is valid in the regime of infinitesimal measurement uncertainty. We demonstrate the duality between multiscale sloppiness and the geometry of confidence regions surrounding parameter estimates made where measurement uncertainty is non-negligible. Further theoretical relationships are provided linking multiscale sloppiness to the likelihood-ratio test. From this, we show that a local sensitivity analysis (as typically done) is insufficient for determining the reliability of parameter estimation, even with simple (non)linear systems. Our algorithm provides a tractable alternative. We finally apply our methods to a large-scale, benchmark Systems Biology model of NF-κB, uncovering previously unknown unidentifiabilities.

    Fair learning: an approach based on optimal transport

    The aim of this thesis is two-fold. On the one hand, optimal transportation methods are studied for statistical inference purposes. On the other hand, the recent problem of fair learning is addressed through the prism of optimal transport theory. The widespread use of applications based on machine learning models in everyday life and the professional world has been accompanied by concerns about the ethical issues that may arise from the adoption of these technologies. 
In the first part of the thesis, we motivate the fairness problem by presenting some comprehensive results from the study of the statistical parity criterion through the analysis of the disparate impact index on the real and well-known Adult Income dataset. Importantly, we show that trying to make fair machine learning models may be a particularly challenging task, especially when the training observations contain bias. Then a review of mathematics for fairness in machine learning is given in a general setting, with some novel contributions in the analysis of the price for fairness in regression and classification. In the latter, we finish this first part by recasting the links between fairness and predictability in terms of probability metrics. We analyze repair methods based on mapping conditional distributions to the Wasserstein barycenter. Finally, we propose a random repair which yields a tradeoff between minimal information loss and a certain amount of fairness. The second part is devoted to the asymptotic theory of the empirical transportation cost. We provide a Central Limit Theorem for the Monge-Kantorovich distance between two empirical distributions with different sizes n and m, Wp(Pn,Qm), p ≥ 1, for observations on R. In the case p > 1 our assumptions are sharp in terms of moments and smoothness. We prove results dealing with the choice of centering constants. We provide a consistent estimate of the asymptotic variance which enables the construction of two-sample tests and confidence intervals to certify the similarity between two distributions. These are then used to assess a new criterion of data set fairness in classification. Additionally, we provide a moderate deviation principle for the empirical transportation cost in general dimension. 
Finally, Wasserstein barycenters and variance-like criteria using the Wasserstein distance are used in many problems to analyze the homogeneity of collections of distributions and structural relationships between the observations. We propose the estimation of the quantiles of the empirical process of the Wasserstein variation using a bootstrap procedure. Then we use these results for statistical inference on a distribution registration model for general deformation functions. The tests are based on the variance of the distributions with respect to their Wasserstein barycenters, for which we prove central limit theorems, including bootstrap versions.

    Digital twin development for improved operation of batch process systems


    Generation of (synthetic) influent data for performing wastewater treatment modelling studies

    The success of many modelling studies strongly depends on the availability of sufficiently long influent time series - the main disturbance of a typical wastewater treatment plant (WWTP) - representing the inherent natural variability at the plant inlet as accurately as possible. This is an important point since most modelling projects suffer from a lack of realistic data representing the influent wastewater dynamics. The objective of this paper is to show the advantages of creating synthetic data when performing modelling studies for WWTPs. This study reviews the different principles that influent generators can be based on, in order to create realistic influent time series. In addition, the paper summarizes the variables that those models can describe: influent flow rate, temperature and traditional/emerging pollution compounds, weather conditions (dry/wet) as well as their temporal resolution (from minutes to years). The importance of calibration/validation is addressed and the authors critically analyse the pros and cons of manual versus automatic and frequentist versus Bayesian methods. The presentation will focus on potential engineering applications of influent generators, illustrating the different model concepts with case studies. The authors have significant experience using these types of tools and have worked on interesting case studies that they will share with the audience. Discussion with experts at the WWTmod seminar shall facilitate identifying critical knowledge gaps in current WWTP influent disturbance models. Finally, the outcome of these discussions will be used to define specific tasks that should be tackled in the near future to achieve more general acceptance and use of WWTP influent generators.

    Environmental degradation and intra-household welfare: the case of the Tanzanian rural South Pare Highlands

    Key words: Environmental degradation, intrahousehold labour allocation, intrahousehold welfare. The rural south Pare highlands in Tanzania are experiencing a deteriorating environmental situation. Of particular importance is the disappearance of forests and woodlands. The consequences are declining amounts and reliability of rainfall, declining water levels and loss of biodiversity. Deterioration of environmental resources increases the costs of collecting environmental products, which in many respects have no feasible close substitutes. One of the major components of the increased costs is the labour time allocated by household members to collecting environmental products and/or grazing activities. This study presents an empirical investigation of the impact of this reallocation of intra-household labour resources on the livelihood of different members of a household. We used cross-sectional data. To analyse how variations in environmental degradation affect intra-household labour allocation, three types of areas are distinguished: severely-degraded, medium-degraded, and non-degraded environments. Our findings show that (1) environmental products collection and/or grazing activities are gender biased, with husbands specializing in grazing while wives and children work mainly on fetching water and fuel wood, and that labour time allocation is significantly influenced by environmental condition; (2) environmental degradation is limiting the production and consumption potentials in the area, and a limited adoption of agricultural modernization further aggravates this problem; (3) factors like school crowdedness, illness, bad weather, poor school quality, and school absenteeism due to street vending contribute strongly and negatively to the probability of primary school attainment for a child, apart from the environmental degradation situation; and (4) subjective welfare and well-being of the household members are affected by the quality of the environment. 
This study contributes to understanding the situation and to setting proper measures towards solving the problems of sustainable development, poverty alleviation, environmental policy, and human capital formation in south Pare.

    Use of statistical modelling and analyses of malaria rapid diagnostic test outcome in Ethiopia.

    Thesis (Ph.D.)-University of KwaZulu-Natal, Pietermaritzburg, 2013. The transmission of malaria is among the leading public health problems in Ethiopia. More than 75% of the total area of Ethiopia is malarious. Identifying the infectiousness of malaria by socio-economic, demographic and geographic risk factors based on the malaria rapid diagnosis test (RDT) survey results has several advantages for planning, monitoring and control, and for the eventual malaria eradication effort. Such a study requires a thorough understanding of the disease process and associated factors. However, such studies are limited. Therefore, the aim of this study was to use different statistical tools suitable for identifying socioeconomic, demographic and geographic risk factors of malaria based on the malaria RDT survey results in Ethiopia. A total of 224 clusters of about 25 households were selected from the Amhara, Oromiya and Southern Nation Nationalities and People (SNNP) regions of Ethiopia. Multiple correspondence analysis was carried out to identify the association among socioeconomic, demographic and geographic factors. Moreover, a number of binary response models such as survey logistic, GLMM, GLMM with spatial correlation, joint models and semi-parametric models were applied. To test and investigate how well the observed malaria RDT result, use of mosquito nets and use of indoor residual spray data fit the expectations of the model, a Rasch model was used. The fitted models have their own strengths and weaknesses. Application of these models was carried out by analysing data on the malaria RDT result. The data used in this study come from the baseline malaria indicator survey conducted from December 2006 to January 2007 by The Carter Center in the Amhara, Oromiya and SNNP regions of Ethiopia. 
The correspondence analysis and survey logistic regression model were used to identify predictors which affect malaria RDT results. The effects of the identified socioeconomic, demographic and geographic factors were subsequently explored by fitting a generalized linear mixed model (GLMM), i.e., to assess the covariance structures of the random components (to assess the association structure of the data). To examine whether the data displayed any spatial autocorrelation, i.e., whether surveys that are near in space have malaria prevalence or incidence that is more similar than that of surveys that are far apart, spatial statistics analysis was performed. This was done by introducing a spatial autocorrelation structure in the GLMM. Moreover, the customary two-variable joint modelling approach was extended to three variables by exploring the joint effect of malaria RDT result, use of mosquito nets and indoor residual spray in the last twelve months. Assessing the association between these outcomes was also of interest. Furthermore, the relationships between the response and some confounding covariates may have unknown functional form. This led to proposing the use of semiparametric additive models, which are less restrictive in their specification. Therefore, generalized additive mixed models were used to model nonparametrically the effect of age, family size, number of rooms per person, number of nets per person, altitude and number of months the room was sprayed. The result from the study suggests that the correct use of mosquito nets, indoor residual spraying and other preventative measures, coupled with factors such as the number of rooms in a house, is associated with a decrease in the incidence of malaria as determined by the RDT. However, the study also suggests that the poor are less likely to use these preventative measures to effectively counteract the spread of malaria. 
In order to determine whether or not the limited number of respondents had undue influence on the malaria RDT result, a Rasch model was used. The result shows that none of the responses had such influence. Therefore, application of the Rasch model has supported the viability of the sixteen (socio-economic, demographic and geographic) items for measuring malaria RDT result, use of indoor residual spray and use of mosquito nets. From the analysis it can be seen that the scale shows high reliability. Hence, the result from the Rasch model supports the analysis carried out in previous models.