Retrieval of Experiments with Sequential Dirichlet Process Mixtures in Model Space
We address the problem of retrieving relevant experiments given a query
experiment, motivated by the public databases of datasets in molecular biology
and other experimental sciences, and the need for scientists to relate their
work to earlier studies at the level of the actual measurement data. Since experiments are
inherently noisy and databases ever accumulating, we argue that a retrieval
engine should possess two particular characteristics. First, it should compare
models learnt from the experiments rather than the raw measurements themselves:
this allows incorporating experiment-specific prior knowledge to suppress noise
effects and focus on what is important. Second, it should be updated
sequentially from newly published experiments, without explicitly storing
either the measurements or the models, which is critical for saving storage
space and protecting data privacy: this promotes lifelong learning. We
formulate retrieval as a "supermodelling" problem: sequentially
learning a model of the set of posterior distributions, represented as sets of
MCMC samples, and suggest a Particle-Learning-based sequential
Dirichlet process mixture (DPM) for this purpose. The relevance measure for
retrieval is derived from the supermodel through the mixture representation. We
demonstrate the performance of the proposed retrieval method on simulated data
and molecular biological experiments.
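A minimal sketch of the mixture-based relevance idea: pool posterior samples from all experiments into one mixture "supermodel", summarize each experiment by its average component responsibilities, and score relevance by comparing those occupancy vectors. Here a fixed-size Gaussian mixture fit by EM stands in for the paper's sequential Particle-Learning DPM, and cosine similarity stands in for its derived relevance measure; both substitutions, and all data, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "posterior samples" for three experiments: 1-D MCMC draws of one model
# parameter. Experiments A and B are similar; C differs. (Synthetic data.)
samples = {
    "A": rng.normal(0.0, 0.3, 200),
    "B": rng.normal(0.2, 0.3, 200),
    "C": rng.normal(3.0, 0.3, 200),
}

def fit_mixture(x, k=4, iters=100):
    """Fixed-size Gaussian mixture fit by EM -- a stand-in for the
    sequential Particle-Learning DPM of the paper."""
    mu = np.quantile(x, np.linspace(0.1, 0.9, k))
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each sample
        logp = (np.log(w) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * (x[:, None] - mu) ** 2 / var)
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: update weights, means, variances
        nk = r.sum(axis=0) + 1e-9
        w = nk / nk.sum()
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return w, mu, var

def responsibilities(x, w, mu, var):
    logp = (np.log(w) - 0.5 * np.log(2 * np.pi * var)
            - 0.5 * (x[:, None] - mu) ** 2 / var)
    logp -= logp.max(axis=1, keepdims=True)
    r = np.exp(logp)
    return r / r.sum(axis=1, keepdims=True)

# "Supermodel": one mixture over the pooled posterior samples.
pooled = np.concatenate(list(samples.values()))
w, mu, var = fit_mixture(pooled)

# Experiment-level occupancy: average responsibility per mixture component.
occ = {name: responsibilities(x, w, mu, var).mean(axis=0)
       for name, x in samples.items()}

def relevance(a, b):
    """Cosine similarity of occupancy vectors: experiments whose posteriors
    land in the same mixture components score as mutually relevant."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(relevance(occ["A"], occ["B"]))  # similar posteriors: high score
print(relevance(occ["A"], occ["C"]))  # distinct posteriors: low score
```

Note that only the mixture parameters and occupancy vectors need to be kept, not the raw samples, which is the storage argument made in the abstract.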
Retrieval of Experiments by Efficient Estimation of Marginal Likelihood
We study the task of retrieving relevant experiments given a query
experiment. By experiment, we mean a collection of measurements from a set of
'covariates' and the associated 'outcomes'. While similar experiments can be
retrieved by comparing available 'annotations', this approach ignores the
valuable information available in the measurements themselves. To incorporate
this information in the retrieval task, we suggest employing a retrieval metric
that utilizes probabilistic models learned from the measurements. We argue that
such a metric is a sensible measure of similarity between two experiments since
it permits inclusion of experiment-specific prior knowledge. However, accurate
models are often not analytical, and one must resort to storing posterior
samples which demands considerable resources. Therefore, we study strategies to
select informative posterior samples to reduce the computational load while
maintaining the retrieval performance. We demonstrate the efficacy of our
approach on simulated data with simple linear regression as the models, and
real-world datasets.
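The retrieval metric can be sketched for the linear-regression case mentioned above: each stored experiment is represented by a thinned set of posterior weight samples, and a query experiment is scored by a Monte Carlo estimate of its marginal likelihood under each stored posterior. The conjugate Gaussian posterior, the known noise level, and random thinning (in place of the informative sample-selection strategies the paper actually studies) are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def posterior_samples(X, y, n_samples=500, alpha=1.0, noise=0.5):
    """Conjugate Bayesian linear regression: draw weight samples from the
    exact Gaussian posterior (prior w ~ N(0, alpha^-1 I), known noise sd)."""
    d = X.shape[1]
    prec = alpha * np.eye(d) + X.T @ X / noise**2
    cov = np.linalg.inv(prec)
    mean = cov @ X.T @ y / noise**2
    return rng.multivariate_normal(mean, cov, size=n_samples)

def log_marginal_estimate(Xq, yq, w_samples, noise=0.5):
    """Monte Carlo estimate of log p(y_q | X_q, stored experiment):
    average the query likelihood over the stored posterior samples,
    computed with log-sum-exp for numerical stability."""
    resid = yq[None, :] - w_samples @ Xq.T              # (S, n_query)
    ll = (-0.5 * (resid / noise) ** 2
          - 0.5 * np.log(2 * np.pi * noise**2)).sum(axis=1)
    return np.logaddexp.reduce(ll) - np.log(len(w_samples))

# Two stored experiments with different true weights, plus a query generated
# from the same model as experiment 1. (Synthetic, illustrative only.)
def make_data(w_true, n=100, noise=0.5):
    X = rng.normal(size=(n, 2))
    return X, X @ w_true + noise * rng.normal(size=n)

X1, y1 = make_data(np.array([1.0, -2.0]))
X2, y2 = make_data(np.array([-3.0, 0.5]))
Xq, yq = make_data(np.array([1.0, -2.0]), n=30)

# Reduce storage: keep only a small subset of each posterior sample set.
# Random thinning is a placeholder for the informative selection strategies.
def thin(samples, keep=25):
    idx = rng.choice(len(samples), size=keep, replace=False)
    return samples[idx]

stored = {name: thin(posterior_samples(X, y))
          for name, (X, y) in {"exp1": (X1, y1), "exp2": (X2, y2)}.items()}

scores = {name: log_marginal_estimate(Xq, yq, w) for name, w in stored.items()}
print(scores)  # exp1 should score far higher than exp2 for this query
```

Ranking stored experiments by this score retrieves those whose learned models explain the query measurements, which is the behaviour the metric is designed to capture.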