Unsupervised mining of lexical variants from noisy text

Gouws, Stephan; Hovy, Dirk; Metzler, Donald

Authors: Stephan Gouws, Dirk Hovy, Donald Metzler
Publication date: 1 January 2011
Publisher: Association for Computational Linguistics

Abstract

The amount of data produced in user-generated content continues to grow at a staggering rate. However, the text found in these media can deviate wildly from the standard rules of orthography, syntax and even semantics and present significant problems to downstream applications which make use of all this noisy data. In this paper we present a novel unsupervised method for extracting domain-specific lexical variants given a large volume of text. We demonstrate the utility of this method by applying it to normalize text messages found in the online social media service, Twitter, into their most likely standard English versions. Our method yields a 20% reduction in word error rate over an existing state-of-the-art approach.

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.708.2...

Last time updated on 29/10/2017

Archivio istituzionale della Ricerca - Bocconi

oai:iris.unibocconi.it:11565/4...

Last time updated on 03/09/2019