Towards standardized descriptions of linguistic features: ISOcat and procedures for using common data categories

Windhouwer, M.

Authors: M. Windhouwer
Publication date: 1 January 2012

Abstract

Automatic language identification of written texts is a well-established area of research in computational linguistics. State-of-the-art algorithms often rely on character n-gram models to identify the language of a text, and they achieve good results for European languages. In this paper we propose the use of a character n-gram model and a word n-gram language model for the automatic classification of two written varieties of Portuguese: European and Brazilian. Accuracy reached 0.998 using character 4-grams.

Available Versions

MPG.PuRe

oai:pure.mpg.de:item_1559137

Last time updated on 15/06/2019

MPG.PuRe

oai:escidoc.org:escidoc:155913...

Last time updated on 23/08/2016