Summary
Distributional inference: the limits of reason
Science advances by combining rational argument with empirical information. In
fields like philosophy and pure mathematics the emphasis lies on rational argument,
whereas in the applied sciences the collection and interpretation of data are the
centre of interest. Mathematical statistics tries to combine these aspects.
The primary goal is to make statistical inferences about something unknown. Such
inferences can be of help in further discussion, e.g. in making a decision. The methods
should not depend on ‘the intentions that might be furthered by utilizing the
knowledge inferred’¹. When the available data are too limited, different procedures
may yield different inferences. The statistician should refrain from providing a
specific inference if the differences are ‘too large’. When an inference can
be given, it should be accompanied by a statement about its uncertainty.
This can be done by providing a distributional inference, or by
providing the results of different approaches.
An example is as follows. Ornithologist G.Th. de Roos observes a population of
Ruddy Turnstones (Arenaria interpres) on the Frisian island Vlieland. Some of these
birds are ringed; the ring number, however, is not always observable, e.g. because
another bird is blocking the view. After how many days of observation is it safe to
assume that all ringed birds in the population have been observed at least once?
This question can be answered by constructing a distributional inference about the
number of ringed birds that are present yet unseen, including a probability statement
about the hypothesis that all ringed birds have been seen. Of course, the results depend
in some way on the probabilistic assumptions one makes, and on the statistical principles
one follows.
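To give a feel for the kind of question involved, here is a minimal sketch under a strong simplifying assumption that the thesis does not make: suppose each of N ringed birds is sighted independently on any given day with a known probability p. The probability that all N have been seen after d days is then a simple coupon-collector-style calculation. All numbers are hypothetical.

```python
import numpy as np

def prob_all_seen(n_ringed: int, p_daily: float, days: int) -> float:
    """Probability that every one of n_ringed birds has been sighted at
    least once after `days` days, assuming each bird is sighted
    independently on each day with probability p_daily."""
    p_seen_once = 1.0 - (1.0 - p_daily) ** days   # one bird seen at least once
    return p_seen_once ** n_ringed                # independence across birds

# Hypothetical numbers: 40 ringed birds, 30% daily sighting chance.
for d in (10, 20, 30):
    print(d, prob_all_seen(40, 0.30, d))
```

The distributional inference developed in the thesis addresses precisely the harder case where quantities like p and the number of unseen ringed birds are themselves unknown.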
The first part of this thesis consists of ‘finger exercises’ illustrating that information
about the unknown can only be of value if the mechanism generating the information
is (sufficiently well) known. In probability theory, information is incorporated by
conditioning on it. This creates difficulties in statistical practice, because unknown
aspects are involved in the joint distribution of the random variables X and Y behind
the observation x and the unknown y. Firstly, this is extensively exemplified
by a die-rolling game: from the information ‘the outcome is even’ one cannot
automatically conclude that ‘the probability that a six has been thrown equals one third’.
The way in which the source of information operates should be incorporated in
the statistical model, as the toy calculation below illustrates. Secondly, a similar
example, the two-envelopes problem, is considered. Again, the difficulties involved
in the numerical specification of conditional probabilities are at the forefront.
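To make the point concrete, consider two hypothetical reporting protocols for a fair die (illustrative only, not the game analysed in the thesis). Both produce the true statement ‘the outcome is even’, yet they support different conditional probabilities of a six.

```latex
% Protocol A: the informant always reports the parity of the outcome.
% Conditioning on the event "even" is then justified:
P(\text{six} \mid \text{report ``even''})
  = \frac{P(6)}{P(2) + P(4) + P(6)} = \frac{1/6}{3/6} = \frac{1}{3}.

% Protocol B: the informant reports "even" after a 2 or a 4, but
% announces "a six!" after a 6. The report "even" is equally true,
% yet now
P(\text{six} \mid \text{report ``even''}) = 0.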
The second and most important part deals with the situation where one has a random
sample x_1, …, x_n from a distribution with density f. The goal is to use the sample
to form an estimate of f or, almost equivalently, to generate a distributional inference
about y (= x_{n+1}). A new method is discussed for estimating the density f, in which
‘initial knowledge’ about f is incorporated in the model. This is done by specifying a
probability density ψ as the ‘initial guess’ for f. The degree of confidence in this ψ
is also quantified and incorporated in the method. By means of a multi-modal approach,
incorporating aspects from both Classical and Bayesian statistics, an estimate f̂ of f
is generated on the basis of the sample x, the ‘initial guess’ ψ, and the degree of
confidence in ψ. When the initial guess ψ is not unreasonable, this density estimate
generally performs better than the widely used kernel methods. This is no
surprise, since the kernel method makes no use of ψ. It is as yet unclear how
the comparison would turn out if ψ were incorporated in the kernel method.
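As a rough illustration of how an initial guess and a degree of confidence could enter a density estimate, the sketch below blends ψ with a kernel estimate as a convex mixture. This is a toy stand-in, not the construction of the thesis; the density, sample, and confidence weight are all hypothetical.

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

def blended_density(x, sample, psi_pdf, confidence):
    """Convex mixture of an 'initial guess' density psi_pdf and a
    Gaussian kernel estimate from the sample; `confidence` in [0, 1]
    weights the guess."""
    kde = gaussian_kde(sample)
    return confidence * psi_pdf(x) + (1.0 - confidence) * kde(x)

rng = np.random.default_rng(0)
sample = rng.normal(0.3, 1.2, size=50)    # data, slightly off the guess
psi = norm(loc=0.0, scale=1.0).pdf        # initial guess: standard normal
grid = np.linspace(-4, 4, 9)
print(blended_density(grid, sample, psi, confidence=0.5))
```

With confidence 1 the estimate is ψ itself; with confidence 0 it reduces to the plain kernel estimate, which uses no prior knowledge at all.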
To study the applicability of the developed method, an extensive data set on the
pollution of Dutch waters is considered. Previous investigations showed that the
concentrations of the various pollutants can be described reasonably well by lognormal
distributions. A complication is that concentrations can only be measured when
they are above a certain detection threshold. The density estimation theory of this
thesis, adapted to this complication, is used to ‘fine-tune’ the ‘initial guess’ of
lognormality to the data. The resulting density estimates are better than those
obtained previously by fitting lognormal densities.
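The detection-threshold complication can be handled by standard censored maximum likelihood, sketched below for a lognormal fit: detected values contribute their density, undetected ones the probability mass below the threshold. This is a generic censored-MLE sketch with simulated data, not the fine-tuning procedure of the thesis.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def censored_lognormal_nll(params, detected, n_below, threshold):
    """Negative log-likelihood of a lognormal sample observed only
    above a detection threshold."""
    mu, log_sigma = params
    sigma = np.exp(log_sigma)                 # keeps sigma positive
    z = (np.log(detected) - mu) / sigma
    ll_detected = np.sum(norm.logpdf(z) - np.log(sigma * detected))
    z_thr = (np.log(threshold) - mu) / sigma
    ll_censored = n_below * norm.logcdf(z_thr)  # mass below threshold
    return -(ll_detected + ll_censored)

# Hypothetical concentrations with detection threshold 0.1:
rng = np.random.default_rng(1)
conc = rng.lognormal(mean=-1.0, sigma=0.8, size=200)
threshold = 0.1
detected = conc[conc > threshold]
fit = minimize(censored_lognormal_nll, x0=[0.0, 0.0],
               args=(detected, np.sum(conc <= threshold), threshold))
print(fit.x[0], np.exp(fit.x[1]))             # estimated mu, sigma
```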
The density estimation theory of this thesis can usefully be applied in the goodness-of-fit
context, where a statement is required about the truth or falsity of the hypothesis
H_0: f = ψ. The resulting goodness-of-fit tests have interesting relations with the
well-known χ²-test, Kolmogorov's test, and Neyman's ‘smooth tests’.
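For comparison, here is the classical χ²-test of H_0: f = ψ with equiprobable cells, with ψ taken as the standard normal purely for illustration; the tests developed in the thesis relate to, but differ from, this construction.

```python
import numpy as np
from scipy.stats import norm, chisquare

# Classical chi-square goodness-of-fit test of H0: f = psi,
# with psi the standard normal (illustrative choice).
rng = np.random.default_rng(2)
sample = rng.normal(size=200)
edges = norm.ppf(np.linspace(0, 1, 11))     # 10 equiprobable cells under H0
observed, _ = np.histogram(sample, bins=edges)
expected = np.full(10, len(sample) / 10)    # equal expected counts under H0
stat, pval = chisquare(observed, expected)
print(stat, pval)
```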
To emphasize the usefulness of distributional inference, an example from the interface
of multivariate analysis and time-series analysis is discussed.