A Universal Part-of-Speech Tagset

Das, Dipanjan; McDonald, Ryan; Petrov, Slav

Authors: Dipanjan Das, Ryan McDonald, Slav Petrov
Publication date: 1 January 2011

Abstract

To facilitate future research in unsupervised induction of syntactic structure and to standardize best practices, we propose a tagset that consists of twelve universal part-of-speech categories. In addition to the tagset, we develop a mapping from 25 different treebank tagsets to this universal set. As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts-of-speech for 22 different languages. We highlight the use of this resource via two experiments, including one that reports competitive accuracies for unsupervised grammar induction without gold standard part-of-speech tags.
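The mapping idea in the abstract can be illustrated with a minimal Python sketch. The abstract does not list the twelve categories, so the names in `UNIVERSAL_TAGS` are an assumption based on how the tagset is commonly cited. `PTB_TO_UNIVERSAL` is a small, hypothetical excerpt for Penn Treebank tags, not the authors' released mapping file, and `to_universal` is an illustrative helper name.

```python
# Assumed names of the twelve universal categories (not stated in the abstract).
UNIVERSAL_TAGS = frozenset({
    "NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
    "ADP", "NUM", "CONJ", "PRT", ".", "X",
})

# Hypothetical excerpt of a fine-grained -> universal mapping for
# Penn Treebank tags; a real mapping covers the full treebank tagset.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "RB": "ADV", "PRP": "PRON",
    "DT": "DET", "IN": "ADP", "CD": "NUM",
    "CC": "CONJ", "RP": "PRT", ".": ".", ",": ".",
}


def to_universal(tagged_tokens, mapping=PTB_TO_UNIVERSAL, fallback="X"):
    """Replace each treebank-specific tag with its universal category.

    Tags missing from the mapping fall back to the catch-all category.
    """
    return [(word, mapping.get(tag, fallback)) for word, tag in tagged_tokens]


sentence = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")]
print(to_universal(sentence))
```

Supplying one such dictionary per treebank is what lets corpora annotated with different tagsets be compared on a common set of categories.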

Available Versions

CiteSeerX

oai:CiteSeerX.psu:10.1.1.768.6...

Last time updated on 30/10/2017