The goal of this paper is to investigate the connection between the
performance gain that can be obtained by self-training and the similarity
between the corpora used in this approach. Self-training is a semi-supervised
technique designed to increase the performance of machine learning algorithms
by automatically classifying instances of a task and adding these as additional
training material to the same classifier. In the context of language processing
tasks, this training material is usually an (annotated) corpus. Unfortunately,
self-training does not always lead to a performance increase, and whether it
will is largely unpredictable. We show that the similarity between corpora can
be used to identify those setups for which self-training can be beneficial. We
consider this research a step in the process of developing a classifier that
is able to adapt itself to each new test corpus that it is presented with.
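
The self-training loop described above can be illustrated with a minimal sketch. It assumes a toy nearest-centroid classifier over numeric feature vectors, which is a stand-in chosen for brevity and not the classifier used in this work; the names `train_centroids`, `predict`, and `self_train` are hypothetical. The sketch trains on labeled data, labels the unlabeled instances automatically, and retrains on the union.

```python
from collections import defaultdict


def train_centroids(examples):
    """Return {label: mean feature vector} for a list of (vector, label) pairs."""
    sums, counts = {}, defaultdict(int)
    for vec, label in examples:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids, vec):
    """Return the label whose centroid is closest to vec (squared Euclidean)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, vec))
    return min(centroids, key=lambda label: dist(centroids[label]))


def self_train(labeled, unlabeled, rounds=1):
    """Label `unlabeled` with the current model, add it to the training
    material, and retrain the same classifier (toy assumption: every
    automatically labeled instance is added, with no confidence filter)."""
    model = train_centroids(labeled)
    for _ in range(rounds):
        auto_labeled = [(vec, predict(model, vec)) for vec in unlabeled]
        model = train_centroids(labeled + auto_labeled)
    return model
```

Whether the retrained model improves depends on how well the automatically assigned labels match the true ones, which is where the similarity between the labeled and unlabeled corpora plausibly matters.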