Statistical NLP systems are frequently evaluated and compared on the basis of their performance on a single split of training and test data. Results obtained using a single split are, however, subject to sampling noise. In this paper we argue in favour of reporting a distribution of performance figures, obtained by resampling the training data, rather than a single number. Th