
An Empirical Study of Smoothing Techniques for Language Modeling

By Stanley F. Chen and Joshua Goodman

Abstract

We present a tutorial introduction to n-gram models for language modeling and survey the most widely-used smoothing algorithms for such models. We then present an extensive empirical comparison of several of these smoothing techniques, including those described by Jelinek and Mercer (1980), Katz (1987), Bell, Cleary, and Witten (1990), Ney, Essen, and Kneser (1994), and Kneser and Ney (1995). We investigate how factors such as training data size, training corpus (e.g., Brown versus Wall Street Journal), count cutoffs, and n-gram order (bigram versus trigram) affect the relative performance of these methods, which is measured through the cross-entropy of test data. Our results show that previous comparisons have not been complete enough to fully characterize smoothing algorithm performance. We introduce methodologies for analyzing smoothing algorithm efficacy in detail, and using these techniques we motivate a novel variation of Kneser-Ney smoothing that consistently outperforms all other algorithms evaluated.
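Since the comparison in the paper is driven by the cross-entropy a smoothed model assigns to test data, the sketch below illustrates the general shape of such an evaluation: an interpolated Kneser-Ney bigram model scored by per-word cross-entropy in bits. This is a minimal illustration, not the authors' implementation; the function names, the single fixed discount of 0.75, and the closed-vocabulary assumption (every test word and test bigram history appears in training) are simplifications made here for brevity.

```python
import math
from collections import Counter, defaultdict

def train_kn_bigram(tokens, discount=0.75):
    """Gather the counts interpolated Kneser-Ney needs (bigram case)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    history_counts = Counter(tokens[:-1])      # c(w_{i-1})
    left_contexts = defaultdict(set)           # distinct histories preceding w
    right_contexts = defaultdict(set)          # distinct words following h
    for (h, w) in bigram_counts:
        left_contexts[w].add(h)
        right_contexts[h].add(w)
    n_bigram_types = len(bigram_counts)        # normalizer for P_continuation
    return (bigram_counts, history_counts, left_contexts,
            right_contexts, n_bigram_types, discount)

def kn_prob(model, h, w):
    """P_KN(w | h): discount the bigram count, then interpolate with the
    continuation-count unigram distribution (Kneser-Ney's distinctive step)."""
    bigrams, histories, left, right, n_types, D = model
    p_cont = len(left[w]) / n_types            # P_continuation(w)
    c_h = histories[h]
    if c_h == 0:                               # unseen history: back off entirely
        return p_cont
    lam = D * len(right[h]) / c_h              # probability mass freed by discounting
    return max(bigrams[(h, w)] - D, 0.0) / c_h + lam * p_cont

def cross_entropy_bits(model, tokens):
    """Per-word cross-entropy of test data in bits, the paper's yardstick."""
    log_prob = sum(math.log2(kn_prob(model, h, w))
                   for h, w in zip(tokens, tokens[1:]))
    return -log_prob / (len(tokens) - 1)

train = "the cat sat on the mat and the cat ran".split()
test = "the cat sat on the mat".split()
model = train_kn_bigram(train)
print(f"{cross_entropy_bits(model, test):.3f} bits/word")
```

Note the design choice that the paper's analysis singles out: the lower-order distribution counts how many distinct histories precede a word rather than its raw frequency. The paper's "modified" Kneser-Ney variant goes further, replacing the single discount D with separate discounts for counts of one, two, and three or more.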

Year: 1998
OAI identifier: oai:CiteSeerX.psu:10.1.1.41.4604
Provided by: CiteSeerX
Download PDF:
Sorry, we are unable to provide the full text, but you may find it at the following location(s):
  • http://citeseerx.ist.psu.edu/v...
  • http://www.cs.cmu.edu/afs/cs.c...

