An Arabic Sign Language Corpus for Instructional Language in School

Abstract

Machine translation (MT) technology has made significant progress over the last decade and now offers Arabic sign language (ArSL) signers the potential to access text published in Arabic. The dominant model of MT is now corpus-based. In this model, translation accuracy correlates directly with the size and coverage of the corpus. A corpus is a collection of translation examples, usually constructed from existing documents such as books and newspapers; however, no writing system for sign language (SL) comparable to those used for spoken languages has yet been developed. Hence, no SL documents exist, which complicates the construction of an SL corpus. In countries such as Ireland and Germany, a number of corpora have already been developed from scratch and used for MT. No ArSL corpus for MT exists, so a new ArSL corpus for instructional language had to be created. The goal of building this corpus is to develop an automatic translation system from Arabic text to ArSL. This paper presents the ArSL corpus for instructional language, constructed for use in schools, and the methodology used to create it. The corpus was collected at the College of Computer and Information Sciences at Imam Muhammad bin Saud University in Riyadh, Saudi Arabia. A group of interpreters and native signers with backgrounds in education took part in this work. The corpus was constructed by collecting instructional sentences used daily in schools for the deaf. The syntax and morphology of each sentence were then manually analysed. Each sentence was individually translated, recorded on video, and stored in MPEG format. The corpus contains video data from three native signers. The videos were then annotated using the ELAN annotation tool. The annotated video data contain isolated signs accompanied by detailed information, such as manual and non-manual features. The final step in constructing the corpus was to create a bilingual dictionary from the annotated videos.
The corpus comprises two main parts. The first part is the annotated video data, consisting of isolated signs with detailed information, including manual and non-manual features. It also contains the Arabic translation script, including syntax and morphology details. The second part is the bilingual dictionary, delivered with the annotated videos.
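
As a rough illustration of how the two parts relate, the sketch below models one annotated sign segment and derives a bilingual dictionary by grouping segments under the Arabic word they translate. The abstract does not specify a schema, so every field name, file name, feature value, and sample entry here is a hypothetical placeholder rather than the corpus's actual format.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class SignAnnotation:
    """One isolated sign segment in an annotated video (hypothetical schema)."""
    gloss: str        # label given to the sign by the annotator
    arabic_word: str  # Arabic word the sign translates
    video_file: str   # recorded video containing the sign
    start_ms: int     # segment start time in the video
    end_ms: int       # segment end time in the video
    signer_id: str    # which signer produced the sign
    manual: dict = field(default_factory=dict)      # e.g. handshape, location
    non_manual: dict = field(default_factory=dict)  # e.g. facial expression


def build_dictionary(annotations):
    """Group annotated sign segments by the Arabic word they translate."""
    dictionary = defaultdict(list)
    for ann in annotations:
        dictionary[ann.arabic_word].append(ann)
    return dict(dictionary)


# Made-up sample entries, for illustration only.
annotations = [
    SignAnnotation("OPEN", "افتح", "signer1_s001.mpg", 400, 1150, "signer1"),
    SignAnnotation("BOOK", "كتاب", "signer1_s001.mpg", 1200, 1850, "signer1",
                   manual={"handshape": "flat"},
                   non_manual={"eyebrows": "neutral"}),
    SignAnnotation("BOOK", "كتاب", "signer2_s001.mpg", 1300, 1900, "signer2"),
]

dictionary = build_dictionary(annotations)
for word, entries in dictionary.items():
    print(word, [(e.signer_id, e.video_file) for e in entries])
```

Keeping every signer's recording under the same Arabic headword means a translation system could choose among variants of a sign, though whether the actual dictionary stores multiple variants per word is not stated in the abstract.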
