A standard tag set expounding traditional morphological features for Arabic language part-of-speech tagging

Al-Ghalayyni; Al-Sulaiti Latifa; Atwell Eric; Brill Eric; Cachia Pierre; Dahdah Antonie; Dukes Kais; Elliott John; Habash Nizar; Hamada Salwa; Harmain Harmain M.; Johansson Stig; Khoja Shereen; Sawalha Majdi; Talmon Rafi; Teahan Bill; Voutilainen Atro; Wright W.


Publication date: 1 April 2013
Publisher: Edinburgh University Press

Abstract

The SALMA Morphological Features Tag Set (SALMA: Sawalha Atwell Leeds Morphological Analysis tag set for Arabic) captures long-established traditional morphological features of Arabic grammar in a compact yet transparent notation. First, we introduce Part-of-Speech tagging and tag set standards for English and other European languages; we then survey Arabic Part-of-Speech taggers and corpora, and long-established Arabic traditions in the analysis of morphology. A range of existing Arabic Part-of-Speech tag sets is illustrated and compared, and we review generic design criteria for corpus tag sets. For a morphologically rich language like Arabic, the Part-of-Speech tag set should be defined in terms of morphological features characterizing word structure. We describe the SALMA Tag Set in detail, explaining and illustrating each feature and its possible values. In our analysis, a tag consists of 22 characters: each position represents a morphological feature, the letter at that position represents a value or attribute of that feature, and the dash ‘-’ marks a feature that is not relevant to a given word. The first character shows the main Part of Speech: noun, verb, particle, punctuation, or Other (residual); the last two extend the traditional three classes to handle modern texts. ‘Noun’ in Arabic subsumes what are traditionally referred to in English as ‘noun’ and ‘adjective’. Characters 2, 3, and 4 represent subcategories: traditional Arabic grammar recognizes 34 subclasses of noun (character 2), 3 subclasses of verb (character 3), and 21 subclasses of particle (character 4). Others (residuals) and punctuation marks are represented in characters 5 and 6 respectively.
The next characters represent traditional morphological features: gender (7), number (8), person (9), inflectional morphology (10), case or mood (11), case and mood marks (12), definiteness (13), voice (14), emphasized and non-emphasized (15), transitivity (16), rational (17), and declension and conjugation (18). Finally, four characters represent morphological information that is useful in Arabic text analysis, although not all linguists would count these as traditional features: unaugmented and augmented (19), number of root letters (20), verb root (21), and types of nouns according to their final letters (22). The SALMA Tag Set is not tied to a specific tagging algorithm or theory, and other tag sets could be mapped onto this standard to simplify and promote comparison between, and reuse of, Arabic taggers and tagged corpora.
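
The positional layout described in the abstract can be illustrated with a minimal Python sketch. It only encodes what the abstract states: 22 positions, one feature per position, and ‘-’ for a feature that does not apply. The identifier names (SALMA_FEATURES, decode_tag) and the letters in the example tag are hypothetical placeholders chosen for illustration; they are not the value codes defined by the SALMA Tag Set itself.

```python
# Feature names by position (1..22), following the order given in the abstract.
# The Python identifiers are illustrative, not part of the SALMA standard.
SALMA_FEATURES = [
    "main_part_of_speech",          # 1
    "noun_subclass",                # 2
    "verb_subclass",                # 3
    "particle_subclass",            # 4
    "other_residual",               # 5
    "punctuation",                  # 6
    "gender",                       # 7
    "number",                       # 8
    "person",                       # 9
    "inflectional_morphology",      # 10
    "case_or_mood",                 # 11
    "case_and_mood_marks",          # 12
    "definiteness",                 # 13
    "voice",                        # 14
    "emphasis",                     # 15
    "transitivity",                 # 16
    "rational",                     # 17
    "declension_and_conjugation",   # 18
    "unaugmented_or_augmented",     # 19
    "number_of_root_letters",       # 20
    "verb_root",                    # 21
    "noun_final_letter_type",       # 22
]

NOT_APPLICABLE = "-"


def decode_tag(tag: str) -> dict:
    """Map each applicable position of a 22-character tag to its feature name."""
    if len(tag) != len(SALMA_FEATURES):
        raise ValueError(
            f"expected {len(SALMA_FEATURES)} characters, got {len(tag)}"
        )
    # Skip positions marked with a dash: that feature is not relevant to the word.
    return {
        name: value
        for name, value in zip(SALMA_FEATURES, tag)
        if value != NOT_APPLICABLE
    }


# Hypothetical tag: the letters x, y, a, b, c are placeholders, not real SALMA codes.
example_tag = "xy----abc" + "-" * 13
decoded = decode_tag(example_tag)
```

Because every feature has a fixed position, a tag can be decoded without any knowledge of the tagger that produced it, which is what makes mapping other tag sets onto such a scheme feasible.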


Available Versions

Crossref

info:doi/10.3366%2Fword.2013.0...

Last time updated on 05/06/2019

White Rose Research Online

oai:eprints.whiterose.ac.uk:90...

Last time updated on 22/10/2015