Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven, context-sensitive approach to lemmatizing word forms in running text. We treat lemmatization as a classification task for Machine Learning, and automatically induce class labels. We achieve this by computing a Shortest Edit Script (SES) between reversed input and output strings. A SES describes the transformations that have to be applied to the input string (word form) in order to convert it to the output string (lemma). Our approach shows competitive performance on a range of typologically different languages.
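The idea of using an edit script as a class label can be illustrated with a minimal sketch. It is not the implementation from the paper: the operation encoding `(op, index, char)`, the insert/delete-only edit model, and the function names `shortest_edit_script`, `apply_script`, `lemma_class` and `lemmatize` are all assumptions made for this example, and the classifier that would predict a label from context is left out. The script is computed on reversed strings, as the abstract states; a plausible reason is that suffix changes then receive the same indices regardless of word length.

```python
def shortest_edit_script(src, dst):
    """Return a shortest insert/delete script that turns src into dst.

    Illustrative encoding (an assumption, not the paper's format):
      ("D", i, ch)  delete character ch found at index i of src
      ("I", i, ch)  insert character ch before index i of src
    """
    n, m = len(src), len(dst)
    # lcs[i][j] = length of the longest common subsequence of src[i:] and dst[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if src[i] == dst[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
    script = []
    i = j = 0
    while i < n or j < m:
        if i < n and j < m and src[i] == dst[j]:
            i += 1  # characters match: keep, no edit needed
            j += 1
        elif j < m and (i == n or lcs[i][j + 1] >= lcs[i + 1][j]):
            script.append(("I", i, dst[j]))
            j += 1
        else:
            script.append(("D", i, src[i]))
            i += 1
    return tuple(script)  # hashable, so it can serve as a class label


def apply_script(src, script):
    """Apply a script produced by shortest_edit_script to src."""
    out, pos = [], 0
    for op, i, ch in script:
        out.append(src[pos:i])  # copy the untouched stretch
        pos = i
        if op == "I":
            out.append(ch)
        else:  # "D": skip the deleted character
            pos = i + 1
    out.append(src[pos:])
    return "".join(out)


def lemma_class(form, lemma):
    """Class label: edit script between the reversed form and reversed lemma."""
    return shortest_edit_script(form[::-1], lemma[::-1])


def lemmatize(form, label):
    """Apply a class label to a word form and return the resulting lemma."""
    return apply_script(form[::-1], label)[::-1]


label = lemma_class("walked", "walk")
print(label)
print(label == lemma_class("asked", "ask"))
print(lemmatize("opened", label))
```

In this sketch, "walked" and "asked" receive the same label even though they differ in length, whereas scripts computed on the unreversed strings would differ in their indices. A label learned from one word can then be applied to an unseen form such as "opened".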

Chrupała, Grzegorz

English

We gratefully acknowledge support from Science Foundation Ireland grant 04/IN/I527 for the research reported in this paper.

RUA Repository, University of Alicante

Simple data-driven context-sensitive lemmatization

Irish Universities

DCU Online Research Access Service

