
    On various ways of tackling incomplete information in statistics

    This short paper discusses the contributions made to the featured section on Low Quality Data. We further refine the distinction between the ontic and epistemic views of imprecise data in statistics. We also question the extent to which likelihood functions can be viewed as belief functions. Finally, we comment on the data disambiguation effect of learning methods, relating it to data reconciliation problems.

    Statistical modelling of categorical data under ontic and epistemic imprecision


    Contributions to reasoning on imprecise data

    This thesis contains four contributions which advocate cautious statistical modelling and inference. They achieve this by taking sets of models into account, either directly or indirectly by looking at compatible data situations. Special care is taken to avoid assumptions which are technically convenient but reduce the uncertainty involved in an unjustified manner. The thesis provides methods for cautious statistical modelling and inference that are able to exhaust the potential of precise and vague data, motivated by different fields of application ranging from political science to official statistics.

    First, the inherently imprecise Nonparametric Predictive Inference model is used for the cautious selection of splitting variables in the construction of imprecise classification trees, which are able to describe a structure and allow for a reasonably high predictive power. Depending on the interpretation of vagueness, different strategies for vague data are then discussed in terms of finite random closed sets. On the one hand, the data to be analysed are regarded as set-valued answers to an item in a questionnaire, where each possible answer, corresponding to a subset of the sample space, is interpreted as a separate entity. In this way the finite random set is reduced to an (ordinary) random variable on a transformed sample space. The context of application is the analysis of voting intentions, where it is shown that the presented approach can characterise the undecided in a more detailed way than common approaches. Although the presented analysis, regarded as a first step, is carried out on set-valued data that were suitably self-constructed with respect to the scientific research question, it clearly demonstrates that the full potential of this quite general framework is not yet exhausted: it is capable of dealing with more complex applications.

    On the other hand, vague data are produced by set-valued single imputation (imprecise imputation), where the finite random sets are interpreted as the result of some (unspecified) coarsening. The approach is presented within the context of statistical matching, which is used to gain joint knowledge on features that were not jointly collected in the initial data production. This is especially relevant in data production, e.g. in official statistics, as it allows the information of already accessible data sets to be fused into a new one without requiring actual data collection in the field.

    Finally, in order to share data, they need to be suitably anonymised. For the specific class of anonymisation techniques of microaggregation, its suitability for inference in generalised linear regression models is evaluated. To this end, the microaggregated data are regarded as a set of compatible, unobserved underlying data situations, and two strategies are proposed. First, a maximax-like optimisation strategy is pursued, in which the underlying unobserved data are incorporated into the regression model as nuisance parameters, providing a concise yet over-optimistic estimation of the regression coefficients. Second, an approach in terms of partial identification, which is inherently more cautious than the previous one, is applied to estimate the set of all regression coefficients that are obtained by performing the estimation on each compatible data situation.
    Vague data are deemed preferable to precise data as they additionally encompass the uncertainty of the individual observation and therefore have a higher informational value. However, to the present day there are few (credible) statistical models that are able to deal with vague or set-valued data. For this reason, the collection of such data is neglected in data production, preventing such models from exhausting their full potential. This in turn prevents a thorough evaluation, negatively affecting the (further) development of such models, a variant of the chicken-or-egg dilemma. The ambition of this thesis is to break this cycle by providing actual methods for dealing with vague data in practically relevant situations, in order to stimulate the required data production.
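    As a rough illustration of the partial-identification strategy described above, one can sample underlying data situations that reproduce the published microaggregated group means, fit the regression on each, and collect the resulting coefficients. The sketch below uses invented group means, group size, noise level, and a simple mean-preserving perturbation scheme; it is an assumption-laden toy example, not the thesis's actual procedure.

```python
# Toy sketch (not the thesis's implementation): approximate the set of OLS
# coefficients compatible with microaggregated data by sampling underlying
# data situations that reproduce the published group means.
# All names, group sizes, and the perturbation scheme are assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Published (microaggregated) data: each group of k records shares its mean value.
group_means = np.array([1.0, 2.5, 4.0])   # aggregated regressor values
k = 3                                      # records per group
y = rng.normal(2.0 + 1.5 * np.repeat(group_means, k), 0.5)  # observed responses

coefs = []
for _ in range(2000):
    # Draw an underlying regressor vector compatible with the aggregation:
    # within each group, the perturbations must sum to zero.
    eps = rng.uniform(-0.5, 0.5, size=(len(group_means), k))
    eps -= eps.mean(axis=1, keepdims=True)          # preserve group means
    x = (group_means[:, None] + eps).ravel()
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # OLS on this situation
    coefs.append(beta)

coefs = np.array(coefs)
print("intercept range:", coefs[:, 0].min(), coefs[:, 0].max())
print("slope range:    ", coefs[:, 1].min(), coefs[:, 1].max())
```

    The printed ranges are only a sampled inner approximation of the identified set; the maximax-like strategy would instead optimise over the same compatible situations.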

    Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods

    The notion of uncertainty is of major importance in machine learning and constitutes a key element of machine learning methodology. In line with the statistical tradition, uncertainty has long been perceived as almost synonymous with standard probability and probabilistic predictions. Yet, due to the steadily increasing relevance of machine learning for practical applications and related issues such as safety requirements, new problems and challenges have recently been identified by machine learning scholars, and these problems may call for new methodological developments. In particular, this includes the importance of distinguishing between (at least) two different types of uncertainty, often referred to as aleatoric and epistemic. In this paper, we provide an introduction to the topic of uncertainty in machine learning as well as an overview of attempts so far at handling uncertainty in general and formalizing this distinction in particular.
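    One formalisation that is common in this literature is the entropy decomposition over an ensemble of predictive distributions: total uncertainty is the entropy of the averaged prediction, aleatoric uncertainty the average entropy of the individual members, and the epistemic part their difference (the mutual information). The sketch below uses invented ensemble probabilities and is offered only as an illustration of that decomposition, not as the paper's own definition.

```python
# Illustrative sketch of the entropy-based decomposition of predictive
# uncertainty over an ensemble; the ensemble probabilities are invented.
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a probability vector; 0*log(0) treated as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Each row: one ensemble member's predictive distribution over 3 classes.
ensemble = np.array([
    [0.90, 0.05, 0.05],
    [0.10, 0.80, 0.10],
    [0.30, 0.40, 0.30],
])

mean_pred = ensemble.mean(axis=0)
total     = entropy(mean_pred)                       # total uncertainty
aleatoric = np.mean([entropy(p) for p in ensemble])  # expected entropy of members
epistemic = total - aleatoric                        # mutual information, >= 0

print(f"total={total:.3f}  aleatoric={aleatoric:.3f}  epistemic={epistemic:.3f}")
```

    The epistemic term vanishes when all ensemble members agree and grows with their disagreement, which is what makes it a proxy for lack of knowledge rather than inherent randomness.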

    Contributions to modeling with set-valued data: benefitting from undecided respondents

    This dissertation develops a methodological framework and approaches to benefit from undecided survey participants, particularly undecided voters in pre-election polls. As choices can be seen as processes that, in stages, exclude alternatives until arriving at one final element, we argue that in pre-election polls undecided participants can most suitably be represented by the set of their viable options. This consideration set sampling, in contrast to the conventional neglect of the undecided, could reduce nonresponse and collect new and valuable information. We embed the resulting set-valued data in the framework of random sets, which allows for two different interpretations, and develop modeling methods for either one.

    The first interpretation is called ontic and views the set of options as an entity of its own that most accurately represents the position at the time of the poll, thus as a precise representation of something naturally imprecise. With this, new ways of structural analysis emerge, as individuals pondering between particular parties can now be examined. We show how the underlying categorical data structure can be preserved in this formalization process for specific models and how popular methods for categorical data analysis can be broadly transferred.

    As the set contains the eventual choice, under the second interpretation the set is seen as a coarse version of an underlying truth, which is called the epistemic view. This imprecise information about something actually precise can then be used to improve predictions or election forecasting. We develop several approaches and a framework of a factorized likelihood to utilize the set-valued information for forecasting, among them methods addressing the complex uncertainty induced by the undecided, weighing the justifiability of assumptions against the conciseness of the results. To evaluate and apply our approaches, we conducted a pre-election poll for the German federal election of 2021 in cooperation with the polling institute Civey, treating undecided voters in a set-valued manner for the first time. This provides us with the unique opportunity to demonstrate the advantages of the new approaches based on a state-of-the-art survey.
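    As a schematic illustration of the epistemic reading described above (the respondent sets and party labels are invented; this is not the Civey data or the dissertation's forecasting models), set-valued responses yield interval-valued party shares by counting certain versus merely compatible votes.

```python
# Schematic sketch: interval bounds on party shares from set-valued
# (consideration-set) responses under the epistemic view.
# Respondent data and party names are invented for illustration.

responses = [
    {"A"},            # decided voters: singleton sets
    {"B"},
    {"A", "B"},       # undecided between A and B
    {"B", "C"},
    {"A", "B", "C"},  # fully undecided among three parties
]

parties = sorted(set().union(*responses))
n = len(responses)

for party in parties:
    lower = sum(1 for s in responses if s == {party}) / n   # only certain votes
    upper = sum(1 for s in responses if party in s) / n     # all compatible votes
    print(f"{party}: share in [{lower:.2f}, {upper:.2f}]")
```

    Under the ontic view, the same sets would instead be treated as categories in their own right on the transformed sample space, so {A, B} would be tabulated as an outcome of its own rather than bounded.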

    Probability, fuzziness and borderline cases


    Uncertainty Management of Intelligent Feature Selection in Wireless Sensor Networks

    Wireless sensor networks (WSN) are envisioned to revolutionize the paradigm of monitoring complex real-world systems at a very high resolution. However, the deployment of a large number of unattended sensor nodes in hostile environments, frequent changes of environment dynamics, and severe resource constraints pose uncertainties and limit the potential use of WSN in complex real-world applications. Although uncertainty management in Artificial Intelligence (AI) is well developed and well investigated, its implications in wireless sensor environments are inadequately addressed. This dissertation addresses uncertainty management issues of spatio-temporal patterns generated from sensor data. It provides a framework for characterizing spatio-temporal patterns in WSN. Using rough set theory and temporal reasoning, a novel formalism has been developed to characterize and quantify the uncertainties in predicting spatio-temporal patterns from sensor data. This research also uncovers the trade-off among the uncertainty measures, which can be used to develop a multi-objective optimization model for real-time decision making in sensor data aggregation and sampling.
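    A minimal sketch of the rough-set idea invoked above (the equivalence classes and the target pattern are invented, not the dissertation's sensor data): a pattern is bracketed by its lower and upper approximations, and the gap between them quantifies the uncertainty of the characterisation.

```python
# Toy rough-set approximation of a target set of sensor readings.
# The indiscernibility classes and target set are invented for illustration.

# Readings grouped into equivalence classes by indiscernible attribute values
# (e.g. same discretized temperature/humidity in the same time slot).
classes = {
    "c1": {1, 2},
    "c2": {3, 4, 5},
    "c3": {6},
}
target = {1, 2, 3, 6}   # readings belonging to the pattern of interest

lower = set().union(*(c for c in classes.values() if c <= target))  # certainly in
upper = set().union(*(c for c in classes.values() if c & target))   # possibly in
accuracy = len(lower) / len(upper)   # rough-set accuracy: 1 means no uncertainty

print("lower approximation:", lower)
print("upper approximation:", upper)
print("accuracy:", accuracy)
```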

    Scenarios, probability and possible futures

    This paper provides an introduction to the mathematical theory of possibility and examines how this tool can contribute to the analysis of far-distant futures. The degree of mathematical possibility of a future is a number between 0 and 1. It quantifies the extent to which a future event is implausible or surprising, without implying that it has to happen somehow. Intuitively, a degree of possibility can be seen as the upper bound of a range of admissible probability levels which goes all the way down to zero. Thus, the proposition "The possibility of X is Π(X)" can be read as "The probability of X is not greater than Π(X)".

    Possibility levels offer a measure to quantify the degree of unlikelihood of far-distant futures. This offers an alternative to both forecasts and scenarios, each of which is problematic. Long-range planning using forecasts with precise probabilities is problematic because it tends to suggest a false degree of precision. Using scenarios without any quantified uncertainty levels is problematic because it may lead to unjustified attention to the extreme scenarios.

    The paper further deals with the question of extreme cases. It examines how experts should build a set of two to four well-contrasted and precisely described futures that summarises their knowledge in a simple way. Like scenario makers, these experts face multiple objectives: they have to anchor their analysis in credible expertise and depict thought-provoking possible futures, yet not ones so provocative as to be dismissed out of hand. The first objective can be achieved by describing a future of possibility level 1. The second and third objectives, however, balance each other. We find that a satisfying balance can be achieved by selecting extreme cases that do not rule out equiprobability. For example, if there are three cases, the possibility level of the extremes should be about 1/3.

    Keywords: futures, futuribles, scenarios, possibility, imprecise probabilities, uncertainty, fuzzy logic
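    A small worked check of the upper-bound reading stated above, with invented futures and possibility levels: an assignment of probabilities to the named futures is admissible in this sense when P(X) ≤ Π(X) for each of them, and with possibility 1/3 on the extremes the equiprobable assignment is not ruled out, which is the rationale behind the 1/3 recommendation.

```python
# Worked sketch of reading possibility degrees as upper probability bounds on
# single futures, as described in the abstract; the three futures and their
# possibility levels are invented for illustration.

futures = ["low", "central", "high"]
possibility = {"low": 1/3, "central": 1.0, "high": 1/3}   # extremes at ~1/3

def admissible(prob):
    """True if prob(x) <= possibility(x) for every single named future x."""
    return all(prob[x] <= possibility[x] + 1e-12 for x in futures)

equiprobable = {x: 1/3 for x in futures}
print("equiprobability admissible:", admissible(equiprobable))          # True

extreme_heavy = {"low": 0.5, "central": 0.3, "high": 0.2}
print("extreme-heavy assignment admissible:", admissible(extreme_heavy))  # False
```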