Distributional inference: the limits of reason

Abstract

Science advances by combining rational arguments and empirical information. In fields like philosophy and pure mathematics, the emphasis lies on rational arguments, whilst in the applied sciences the collection and interpretation of data are the focus of interest. Mathematical statistics tries to combine these aspects. The primary goal is to make statistical inferences about something unknown. Such inferences can be of help in further discussion, e.g. in selecting a decision. The methods should not depend on 'the intentions that might be furthered by utilizing the knowledge inferred'. When the available data are too limited, different procedures may yield different inferences. The statistician should refrain from providing a specific inference when the differences are 'too large'. When such an inference can be given, it should be accompanied by a statement about its uncertainty. This can be done by providing a distributional inference, or by providing the results of different approaches.

An example is the following. Ornithologist G.Th. de Roos is observing a population of Ruddy Turnstones (Arenaria interpres) on the Frisian island of Vlieland. Some of these birds are ringed; however, the ring number is not always observable, e.g. because another bird is blocking the view. After how many days of observing is it safe to assume that all ringed birds in the population have been observed at least once? This question can be answered by constructing a distributional inference about the number of ringed birds that are present but as yet unseen, including a probability statement about the hypothesis that all ringed birds have been seen. Of course, the results depend in some way on the probabilistic assumptions one makes, and on the statistical principles one follows.

The first part of this thesis consists of 'finger exercises' illustrating that information about the unknown can only be of value if the mechanism generating the information is (sufficiently well) known. In probability theory, information is incorporated by conditioning on it. This generates difficulties in statistical practice, because unknown aspects are involved in the joint distribution of the random variables X and Y that are behind the observation x and the unknown y. Firstly, this is extensively exemplified by a die-rolling game: from the information 'the number of pips is even' one cannot automatically conclude that the probability that a six has been thrown equals one third. The way in which the source of information operates should be incorporated in the statistical model. Secondly, a similar example, the two-envelopes problem, is considered. Again, the difficulties involved in the numerical specification of conditional probabilities come to the fore.
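To make the die-rolling point concrete, here is a minimal simulation (a sketch in Python; both reporting mechanisms are invented for illustration). The same announcement 'the number of pips is even' licenses different conditional probabilities of a six, depending on how the reporter operates:

```python
import random

def roll_and_report(mechanism, n=100_000, seed=1):
    """Monte Carlo estimate of P(six | report == 'even') under a given
    (invented) reporting mechanism."""
    rng = random.Random(seed)
    sixes = evens = 0
    for _ in range(n):
        x = rng.randint(1, 6)
        if mechanism(x) == "even":
            evens += 1
            sixes += (x == 6)
    return sixes / evens

# Mechanism 1: the parity of the roll is always reported truthfully.
honest = lambda x: "even" if x % 2 == 0 else "odd"

# Mechanism 2: on a six the reporter says 'high' instead, so the
# announcement 'even' silently excludes sixes.
evasive = lambda x: "high" if x == 6 else ("even" if x % 2 == 0 else "odd")

print(roll_and_report(honest))   # close to 1/3
print(roll_and_report(evasive))  # close to 0
```

Under the honest reporter the estimate is close to one third; under the evasive one it is zero, although the announcement heard is identical in both cases.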
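In the same spirit, the Turnstone question admits a back-of-the-envelope treatment. Assuming, purely for illustration, that each of R ringed birds is identified independently with the same probability p on each day, the number of ringed birds still unseen after d days is Binomial(R, (1 - p)^d); all numbers below are invented:

```python
from math import comb

def unseen_distribution(R, p, d):
    """P(k ringed birds still unseen after d days), assuming each of the
    R ringed birds is identified independently with probability p per day.
    Illustrative inputs, not data from the thesis."""
    q = (1 - p) ** d  # P(a given bird is missed on every one of d days)
    return [comb(R, k) * q**k * (1 - q)**(R - k) for k in range(R + 1)]

dist = unseen_distribution(R=30, p=0.2, d=25)
print(f"P(all ringed birds seen at least once) = {dist[0]:.3f}")
```

The list `dist` is a distributional inference about the unseen count; its first entry is the probability statement asked for in the question above.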
The second and most important part deals with the situation where one has a random sample x1, ..., xn from a distribution with density f. The goal is to use the sample to form an estimate of f or, almost equivalently, to generate a distributional inference about y (= xn+1). A new method is discussed to estimate the density f, in which 'initial knowledge' of f is incorporated in the model. This is done by specifying a probability density ψ as the 'initial guess' for f. The degree of confidence in this ψ is also quantified and incorporated in the method. By means of a multi-modal approach, incorporating aspects of both Classical and Bayesian statistics, and on the basis of the sample x, the 'initial guess' ψ, and the degree of confidence in ψ, an estimate f̂ of f is generated. When the initial guess ψ is not unreasonable, this density estimate performs better, in general, than the commonly used kernel methods. This is no surprise, since the kernel method makes no use of ψ. It is at this point unclear how the comparison would turn out if ψ were incorporated in the kernel method.

To study the applicability of the developed method, an extensive data set on the pollution of Dutch waters is considered. Previous investigations showed that the concentrations of the various pollutants can be described reasonably well by lognormal distributions. A complication is that concentrations can only be measured when they are above a certain detection threshold. The density estimation theory of this thesis, adapted to this complication, is used to 'fine-tune' the 'initial guess' of lognormality to the data. The resulting density estimates are better than those obtained previously by fitting lognormal densities.

The density estimation theory of this thesis can also be applied usefully in the goodness-of-fit context, where a statement is required about the truth or falsity of the hypothesis H0: f = ψ. The resulting goodness-of-fit tests have interesting relations with the well-known χ²-test, Kolmogorov's test, and Neyman's 'smooth tests'. To emphasize the usefulness of distributional inference, an example from the interface of multivariate analysis and time-series analysis is discussed.
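Returning to the density-estimation idea: the thesis's actual construction is not reproduced in this summary, but the general flavour of pulling a data-driven estimate toward an 'initial guess' can be sketched by blending a kernel estimate with ψ, with a weight standing in for the degree of confidence in ψ (a hypothetical illustration, not the method of the thesis):

```python
import numpy as np

def shrunk_density(x_grid, sample, psi, weight, bandwidth):
    """Blend an 'initial guess' density psi with a Gaussian kernel
    estimate. `weight` plays the role of the confidence in psi:
    weight = 1 returns psi itself, weight = 0 a plain kernel estimate.
    A hypothetical sketch, not the estimator developed in the thesis."""
    diffs = (x_grid[:, None] - sample[None, :]) / bandwidth
    kde = (np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)).mean(axis=1) / bandwidth
    return weight * psi(x_grid) + (1 - weight) * kde

rng = np.random.default_rng(0)
sample = rng.normal(0.3, 1.1, size=50)                    # invented data
grid = np.linspace(-4.0, 4.0, 201)
psi = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)  # guess: N(0, 1)
f_hat = shrunk_density(grid, sample, psi, weight=0.5, bandwidth=0.4)
```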
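The detection-threshold complication mentioned above is, in standard terminology, left censoring. A minimal maximum-likelihood sketch for fitting a lognormal to such data (invented numbers; not the thesis's adapted procedure) might read:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def censored_lognormal_fit(detects, n_nondetect, limit):
    """Maximum-likelihood fit of a lognormal when n_nondetect observations
    fell below the detection threshold `limit` and only the `detects` were
    measured. A textbook left-censoring sketch, not the thesis's method."""
    logs = np.log(detects)

    def neg_loglik(theta):
        mu, log_sigma = theta
        sigma = np.exp(log_sigma)                 # keep sigma positive
        # Detected values: lognormal density = normal density of log x, / x.
        ll = norm.logpdf(logs, mu, sigma).sum() - logs.sum()
        # Non-detects contribute P(X < limit) each.
        ll += n_nondetect * norm.logcdf(np.log(limit), mu, sigma)
        return -ll

    res = minimize(neg_loglik, x0=[logs.mean(), np.log(logs.std() + 1e-6)])
    return res.x[0], np.exp(res.x[1])             # (mu, sigma) on log scale

detects = np.array([1.8, 2.4, 3.1, 5.0, 7.7])     # invented concentrations
print(censored_lognormal_fit(detects, n_nondetect=12, limit=1.5))
```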
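Finally, as a point of reference for the goodness-of-fit discussion: Kolmogorov's classical test of H0: f = ψ is readily available in standard software. A small example, taking ψ to be the standard normal purely for illustration:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.default_rng(1)
sample = rng.normal(size=100)                  # stand-in data
res = kstest(sample, "norm", args=(0.0, 1.0))  # psi := N(0, 1), illustrative
print(f"D = {res.statistic:.3f}, p-value = {res.pvalue:.3f}")
```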
