Hybrid Technique for Arabic Text Compression

Authors: Arafat Awajan; Enas Abu Jrai
Publication venue: Global Journals Inc. (US)
Publication date: 21/02/2015

Arabic content on the Internet and other digital media is increasing exponentially, and the number of Arab users of these media has multiplied by more than 20 over the past five years. There is a real need to reduce the storage space this content requires and to allow more efficient use, search, and retrieval of it. Techniques borrowed from other languages, or general-purpose data compression techniques that ignore the specific features of Arabic, have had limited success in terms of compression ratio. In this paper, we present a hybrid technique that uses the linguistic features of the Arabic language to improve the compression ratio of Arabic texts. This technique works in phases. In the first phase, the text file is split into four different files using a multilayer model-based approach. In the second phase, each of these four files is compressed using the Burrows-Wheeler compression algorithm.
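The second phase named in this abstract relies on the Burrows-Wheeler transform. As a rough illustration only, the following Python sketch shows a naive version of the transform and its inverse. It is not the authors' implementation: the function names `bwt` and `inverse_bwt`, the NUL sentinel, and the quadratic rotation-sorting approach are assumptions chosen for readability, and the four-way split of the first phase is not reproduced because the abstract does not specify it.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler transform (illustrative, not optimized).

    Assumes `sentinel` does not occur in `text`; it marks the end of the
    string so that the transform can be inverted.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input text")
    s = text + sentinel
    # Build every rotation of the string and sort them lexicographically.
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(row[-1] for row in rotations)


def inverse_bwt(last_column: str, sentinel: str = "\0") -> str:
    """Invert the naive transform by rebuilding the sorted rotation table."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        # Prepend the last column to every row, then re-sort.
        table = sorted(c + row for c, row in zip(last_column, table))
    # The original string is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]


if __name__ == "__main__":
    sample = "اللغة العربية"
    transformed = bwt(sample)
    # Round-tripping recovers the input exactly.
    print(inverse_bwt(transformed) == sample)
```

The transform tends to group identical characters together, which is why practical compressors follow it with further coding stages; those stages are omitted here.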

Global Journal of Computer Science and Technology (GJCST)

Multi-space Variational Encoder-Decoders for Semi-supervised Labeled Sequence Transduction

Authors: Neubig Graham; Zhou Chunting
Publication date: 01/01/2017

Labeled sequence transduction is the task of transforming one sequence into another sequence that satisfies desiderata specified by a set of labels. In this paper we propose multi-space variational encoder-decoders, a new model for labeled sequence transduction with semi-supervised learning. The generative model can use neural networks to handle both discrete and continuous latent variables to exploit various features of data. Experiments show that our model not only provides a powerful supervised framework but can also effectively take advantage of the unlabeled data. On the SIGMORPHON morphological inflection benchmark, our model outperforms single-model state-of-the-art results by a large margin for the majority of languages. Comment: Accepted by ACL 201

arXiv.org e-Print Archive

Crossref

A new hybrid metric for verifying parallel corpora of Arabic-English

Authors: Alkahtani S.; Alkahtani S.M.; Liu W.; Teahan W.J.
Publication date: 24/01/2015

Bangor University Research Portal

A new hybrid metric for verifying parallel corpora of Arabic-English

Authors: Alkahtani Saad; Liu Wei; Teahan William J.
Publication date: 12/02/2015

This paper discusses a new metric that has been applied to verify the quality of translation between sentence pairs in parallel Arabic-English corpora. This metric combines two techniques, one based on sentence length and the other based on compression code length. Experiments on sample test parallel Arabic-English corpora indicate that combining these two techniques improves the accuracy of identifying satisfactory and unsatisfactory sentence pairs compared to sentence length or compression code length alone. The new method proposed in this research is effective at filtering noise and reducing mistranslations, resulting in greatly improved quality. Comment: in CCSEA-201

arXiv.org e-Print Archive

CiteSeerX

Crossref

On the Difficulty to Design Arabic E-learning System in Statistics

Authors: Julius Mungo; Taleb Ahmad; Wolfgang Härdle

In this paper, we present a case study that describes the development of the statistics e-learning course in the Arabic language, "Arabic MM*STAT". The basic framework for this e-book, the MM*STAT system, was developed at the School for Business and Economics of Humboldt-Universität zu Berlin. Arabic MM*STAT uses an HTML-based filing card structure. We discuss the difficulties of implementing such a system in the standard WWW formats and present the solutions needed for Arab educational institutions and the Arabic user. Those solutions are consistent with the Arabic language and reflect the modern trend in the e-learning environment. Keywords: electronic books, Arabtex, MM*STAT, statistical software

Research Papers in Economics

Compression-based Parts-of-Speech Tagger for the Arabic Language

Author: Alkhazi Ibrahim
Publication date: 18/12/2019

Bangor University Research Portal

A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

Authors: Abdelali Ahmed; Doumi Noureddine; Lehireche Ahmed; Maurel Denis
Publication venue: IJIT
Publication date: 01/02/2016

This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low cost and time efficient, in addition to being scalable and extensible. The latter is reflected in the method's ability to be incremental in both respects: processing resources and generating lexicons. Given a corpus, tokens are first drawn from the corpus and lemmatized. Secondly, finite state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One of the algorithm's strengths is its ability to generate transducers having 184 transitions, which would be very cumbersome to design manually. The second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experimentation uses a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The coverage of the resulting open lexical resources is high: our resources cover more than 70% of Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully, partially, and non-vocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.

Directory of Open Access Journals

HAL Descartes

HAL Université de Tours

Hal-Diderot

Frequency vs. Association for Constraint Selection in Usage-Based Construction Grammar

Author: Dunn Jonathan
Publication date: 01/01/2019

A usage-based Construction Grammar (CxG) posits that slot-constraints generalize from common exemplar constructions. But what is the best model of constraint generalization? This paper evaluates competing frequency-based and association-based models across eight languages using a metric derived from the Minimum Description Length paradigm. The experiments show that association-based models produce better generalizations across all languages by a significant margin.

arXiv.org e-Print Archive

Crossref

UC Research Repository

Categorisation of Arabic Twitter Text

Author: Altamimi Mohammed Hamed R
Publication date: 26/02/2020

Bangor University Research Portal