
A comparison of grapheme and phoneme-based units for Spanish spoken term detection

By Javier Tejedor, Dong Wang, Joe Frankel, Simon King and José Colás


The ever-increasing volume of audio data available online through the World Wide Web means that automatic methods for indexing and search are becoming essential. Hidden Markov model (HMM) keyword spotting and lattice search techniques are the two most common approaches used by such systems. In keyword spotting, models or templates are defined for each search term before the speech is accessed, and are then used to find matches. Lattice search (referred to as spoken term detection) uses a pre-indexing of the speech data in terms of word or sub-word units, which can then be searched quickly for arbitrary terms without referring to the original audio. In both cases, the search term can be modelled in terms of sub-word units, typically phonemes. For in-vocabulary words (i.e. words that appear in the pronunciation dictionary), letter-to-sound conversion systems are generally accepted to work well. However, for out-of-vocabulary (OOV) search terms, letter-to-sound conversion must be used to generate a pronunciation for the search term. This is usually a hard decision (i.e. not probabilistic and with no possibility of backtracking), and errors introduced at this step are difficult to recover from. We therefore propose the direct use of graphemes (i.e. letter-based sub-word units) for acoustic modelling. This is expected to work particularly well in languages such as Spanish, where, although the letter-to-sound mapping is very regular, the correspondence is not one-to-one, so there are benefits to avoiding hard decisions at early stages of processing. In this article, we compare three approaches for Spanish keyword spotting or spoken term detection, and within each of these we compare acoustic modelling based on phone and grapheme units. Experiments were performed using the Spanish geographical-domain Albayzin corpus. Results achieved with the two approaches proposed for spoken term detection show that trigrapheme units for acoustic modelling match or exceed the performance of phone-based acoustic models. In the method proposed for keyword spotting, the results achieved with each acoustic model are very similar.
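
As an illustration of the grapheme-based idea described in the abstract, the following is a minimal sketch of how a search term might be turned into grapheme units without any letter-to-sound step. It is not the authors' implementation. It assumes that trigraphemes are context-dependent grapheme units analogous to triphones; the function names, the `sil` boundary symbol, the `left-centre+right` label format, and the treatment of every letter (including accented ones) as its own unit are all hypothetical choices made for this example.

```python
def graphemes(term):
    """Split a search term into lower-case letter units (hypothetical inventory).

    Non-letter characters such as spaces and hyphens are dropped.
    """
    return [ch for ch in term.lower() if ch.isalpha()]


def trigraphemes(units, boundary="sil"):
    """Label each grapheme with its left and right neighbours.

    The 'left-centre+right' format and the 'sil' boundary symbol are
    illustrative conventions, not taken from the article.
    """
    padded = [boundary] + list(units) + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]


if __name__ == "__main__":
    units = graphemes("mar")
    print(units)                # ['m', 'a', 'r']
    print(trigraphemes(units))  # ['sil-m+a', 'm-a+r', 'a-r+sil']
```

Because the units come straight from the spelling, an out-of-vocabulary term needs no pronunciation lookup or hard letter-to-sound decision before it can be matched against grapheme-based acoustic models.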

Publisher: Elsevier
Year: 2010
DOI identifier: 10.1016/j.specom.2008.03.005
OAI identifier: oai:www.era.lib.ed.ac.uk:1842/3834

