101 research outputs found

Why Microsoft Arabic Spell checker is ineffective

Author: Neme Alexis Amid
Publication venue: http://www.al-erfan.com/
Publication date: 01/01/2014

Since 1997, the MS Arabic spell checker has been integrated by Coltec-Egypt into the MS Office suite, and to this day many Arabic users find it worthless. In this study, we show why the MS spell checker fails to attract Arabic users. After spell-checking a document (10 pages, 3300 words in Arabic), the assessment procedure spots 78 false positive errors. They reveal the flaws of the lexical resource: an unsystematic lexical coverage of the feminine and the broken plural of nouns and adjectives, and an arbitrary coverage of verbs and nouns with prefixed or suffixed particles. This unsystematic and arbitrary lexical coverage pinpoints the absence of a clear definition of a lexical entry and an inadequate design of the related agglutination rules. Finally, this assessment reveals, more generally, the failure of scientific and technological policies regarding Arabic in big companies and in research institutions.

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

An Arabic CCG approach for determining constituent types from Arabic Treebank

Author: Abo Bakr Hitahm M.
El-taher Ahmed I.
Shaalan Khaled
Zidan Ibrahim
Publication venue: King Saud University. Production and hosting by Elsevier B.V.
Publication date: 01/12/2014

Converting a treebank into a CCGbank opens the respective language to the sophisticated tools developed for Combinatory Categorial Grammar (CCG) and enriches cross-linguistic development. The conversion is primarily a three-step process: determining constituents’ types, binarization, and category conversion. Usually, this process involves a preprocessing step on the treebank of choice to correct brackets and normalize tags for any changes that were introduced during the manual annotation, as well as to extract the morpho-syntactic information that is necessary for determining constituents’ types. In this article, we describe the required preprocessing step on the Arabic Treebank, as well as how to determine Arabic constituents’ types. We conducted an experiment on parts 1 and 2 of the Penn Arabic Treebank (PATB) aimed at converting the PATB into an Arabic CCGbank. The performance of our algorithm when applied to ATB1v2.0 & ATB2v2.0 was 99% identification of head nodes and 100% coverage over the Treebank data.

Elsevier - Publisher Connector

Directory of Open Access Journals

An Arabic Dependency Treebank in the Travel Domain

Author: Taji Dima
Gizuli Jamila El
Habash Nizar
Publication date: 01/01/2019

In this paper we present a dependency treebank of travel domain sentences in Modern Standard Arabic. The text comes from a translation of the English equivalent sentences in the Basic Traveling Expressions Corpus. The treebank dependency representation is in the style of the Columbia Arabic Treebank. The paper motivates the effort and discusses the construction process and guidelines. We also present parsing results and discuss the effect of domain and genre difference on parsing

arXiv.org e-Print Archive

Victoria University of Wellington

ResearchArchive at Victoria University of Wellington

Saudi Accented Arabic Voice Bank

Author: Alenazi Ammar
Alghamdi Mansour
Alhargan Fayez
Alkanhal Mohammed
Alkhairy Ashraf
Eldesouki Munir
Publication venue: King Saud University. Production and hosting by Elsevier B.V.
Publication date: 31/12/2008

The aim of this paper is to present an Arabic speech database that represents native Arabic speakers from all the cities of Saudi Arabia. The database is called the Saudi Accented Arabic Voice Bank (SAAVB). Preparing the prompt sheets, selecting the right speakers and transcribing their speech are some of the challenges that faced the project team. The procedures used to meet these challenges are highlighted. SAAVB consists of 1033 speakers speaking Modern Standard Arabic with a Saudi accent. The SAAVB content is analyzed and the results are illustrated. The content was verified internally and externally by IBM Cairo and can be used to train speech engines such as automatic speech recognition and speaker verification systems.

Elsevier - Publisher Connector

Introducing the Arabic WordNet project

Author: Alkhalifa M.
Black W.
Elkateb S.
Fellbaum C.
Pease A.
Rodriguez H.
Vossen P.
Publication date: 01/01/2006

VU Research Portal

The Effects of Factorizing Root and Pattern Mapping in Bidirectional Tunisian - Standard Arabic Machine Translation

Author: Boujelbane Rahma
Habash Nizar
Hamdi Ahmed
Nasr Alexis
Publication venue: HAL CCSD
Publication date: 02/09/2013

The development of natural language processing tools for dialects faces a severe lack of resources. In cases of diglossia, as in Arabic, one variant, Modern Standard Arabic (MSA), has many resources that can be used to build natural language processing tools, whereas the other variants, the Arabic dialects, are resource-poor. Taking advantage of the closeness of MSA and its dialects, one way to solve the problem of limited resources is to translate the dialect into MSA in order to use the tools developed for MSA. We describe in this paper an architecture for such a translation and evaluate it on Tunisian Arabic verbs. Our approach relies on modeling the translation process over the deep morphological representations of roots and patterns commonly used to model Semitic morphology. We compare different techniques for performing the cross-lingual mapping. Our evaluation demonstrates that the use of a decent-coverage root+pattern lexicon of Tunisian and MSA, with a backoff that assumes independence of mapping roots and patterns, is optimal in reducing overall ambiguity and increasing recall.
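
The backoff idea mentioned in this abstract can be pictured with a minimal sketch. It assumes a joint lexicon keyed on (root, pattern) pairs and, when no joint entry exists, falls back to mapping the root and the pattern separately and combining the results. The lexicon contents, the names `JOINT`, `ROOT_MAP`, `PATTERN_MAP` and the function `translate` are all hypothetical placeholders for illustration; they are not the authors' data or implementation.

```python
from itertools import product

# Hypothetical toy lexicons; entries are abstract placeholders, not real data.
JOINT = {("r1", "p1"): {("R1", "P1")}}          # (dialect root, pattern) -> MSA pairs
ROOT_MAP = {"r1": {"R1"}, "r2": {"R2", "R3"}}   # dialect root -> MSA roots
PATTERN_MAP = {"p1": {"P1"}, "p2": {"P2"}}      # dialect pattern -> MSA patterns


def translate(root, pattern):
    """Return candidate MSA (root, pattern) pairs for a dialect pair."""
    joint_hit = JOINT.get((root, pattern))
    if joint_hit:
        # The joint lexicon covers this pair: use it directly.
        return set(joint_hit)
    # Backoff: assume root and pattern map independently and
    # combine every mapped root with every mapped pattern.
    roots = ROOT_MAP.get(root, set())
    patterns = PATTERN_MAP.get(pattern, set())
    return set(product(roots, patterns))


candidates = translate("r2", "p2")
```

In this sketch the backoff widens the candidate set only when the joint lexicon has no entry, which is one plausible way to trade ambiguity against recall.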

HAL AMU

HAL Descartes

Hal-Diderot

Summarizing videos into a target language: Methodology, architectures and evaluation

Author: Fohr Dominique
Garcia-Zapirain Begona
González-Gallardo Carlos-Emiliano
Grega Michał
Janowski Lucjan
Jouvet Denis
Koźbiał Arian
Langlois David
Leszczuk Mikołaj
Mella Odile
Menacer Mohamed-Amine
Mendez Amaia
Pontes Elvys Linhares
Sanjuan Eric
Smaïli Kamel
Torres-Moreno Juan-Manuel
Publication venue: 'IOS Press'
Publication date: 01/01/2019

The aim of this work is to report the results of the Chist-Era project AMIS (Access Multilingual Information opinionS). The purpose of AMIS is to answer the following question: how can information in a foreign language be made accessible to everyone? This issue is not limited to translating a source video into a target-language video, since the objective is to provide only the main idea of an Arabic video in English. This objective requires research in several areas that have not all reached maturity: video summarization, speech recognition, machine translation, audio summarization and speech segmentation. In this article, we present several possible architectures to achieve our objective, but we focus on only one of them. The scientific challenges are presented, and we explain how to deal with them. One of the big challenges of this work is to devise a way to objectively evaluate a system composed of several components, knowing that each of them has its limits and can propagate errors through the first component. Also, a subjective evaluation procedure is proposed in which several annotators were mobilized to test the quality of the resulting summaries.

INRIA a CCSD electronic archive server

PolyPublie

A Modern Standard Arabic Closed-Class Word List

Author: Habash Nizar Y.
Salloum Wael Sameer
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2012

This document describes a list of Modern Standard Arabic closed-class words, which can be used as a stop list for a variety of natural language processing applications. The list contains 740 inflected words and clitics in the Arabic Treebank (ATB) tokenization scheme (Maamouri et al., 2004; Habash, 2010). The inflected words are based on 309 lemmas from the Standard Arabic Morphological Analyzer, SAMA (Graff et al., 2009). To get a copy of the full list, please contact the authors
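
As a rough illustration of how a closed-class word list might serve as a stop list, the sketch below filters tokens against a small set. The three prepositions in `STOP_WORDS` and the function name `remove_stop_words` are placeholders chosen for this example; they are not taken from the authors' 740-entry list, and plain whitespace splitting stands in for the ATB tokenization scheme.

```python
# Hypothetical stop list: a tiny placeholder, NOT the authors' list.
STOP_WORDS = frozenset({"في", "من", "على"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not in the stop list, preserving order."""
    return [tok for tok in tokens if tok not in stop_words]


# Whitespace splitting is an assumption made only for this example.
tokens = "ذهب الولد في الصباح".split()
filtered = remove_stop_words(tokens)
```

A `frozenset` keeps membership checks constant-time and makes the default argument safe to share between calls.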

Columbia University Academic Commons