Unsupervised Extraction of Recurring Words from Infant-Directed Speech

By Fergus R. McInnes and Sharon J. Goldwater

Abstract

To date, most computational models of infant word segmentation have worked from phonemic or phonetic input, or have used toy datasets. In this paper, we present an algorithm for word extraction that works directly from naturalistic acoustic input: infant-directed speech from the CHILDES corpus. The algorithm identifies recurring acoustic patterns that are candidates for identification as words or phrases, and then clusters together the most similar patterns. The recurring patterns are found in a single pass through the corpus using an incremental method, where only a small number of utterances are considered at once. Despite this limitation, we show that the algorithm is able to extract a number of recurring words, including some that infants learn earliest, such as Mommy and the child’s name. We also introduce a novel information-theoretic evaluation measure.

Topics: language acquisition, word segmentation
Year: 2012
OAI identifier: oai:CiteSeerX.psu:10.1.1.221.1727
Provided by: CiteSeerX