
    Studying the effect of input size for Bayesian word segmentation on the Providence Corpus

    Studies of computational models of language acquisition depend to a large extent on the input available for experiments. In this paper, we study the effect that input size has on the performance of word segmentation models embodying different kinds of linguistic assumptions. Because currently available corpora for word segmentation are not suited to addressing this question, we perform our study on a novel corpus based on the Providence Corpus (Demuth et al., 2006). We find that input size can have dramatic effects on segmentation performance and that, somewhat surprisingly, models performing well on smaller amounts of data can show a marked decrease in performance when exposed to larger amounts of data. We also present the dataset on which we perform our experiments, comprising longitudinal data for six children. This corpus makes it possible to ask more specific questions about computational models of word segmentation, in particular about intra-language variability and about how the performance of different models can change over time.
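    As a minimal sketch of how "segmentation performance" is typically scored in this literature (assuming the standard token F-score metric; this is illustrative and not the paper's own code), predicted and gold segmentations of the same unsegmented utterance can be compared by word-token spans:

    def token_spans(utterance):
        """Return (start, end) character offsets of each word token."""
        spans, pos = set(), 0
        for word in utterance.split():
            spans.add((pos, pos + len(word)))
            pos += len(word)
        return spans

    def token_f_score(predicted, gold):
        """Token F-score over parallel lists of segmented utterances."""
        tp = fp = fn = 0
        for pred, ref in zip(predicted, gold):
            p, g = token_spans(pred), token_spans(ref)
            tp += len(p & g)   # correctly recovered word tokens
            fp += len(p - g)   # spurious tokens in the prediction
            fn += len(g - p)   # gold tokens the model missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)

    # Example: gold "you want the doggie" vs. an over-segmented hypothesis.
    print(token_f_score(["you want the dog gie"], ["you want the doggie"]))  # ~0.667

    Scoring by token spans rather than by boundaries alone is what makes the metric sensitive to both over- and under-segmentation, which is the kind of behaviour the paper tracks as input size grows.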