2,237 research outputs found
Hybrid Technique for Arabic Text Compression
Arabic content on the Internet and other digital media is increasing exponentially, and the number of Arab users of these media has multiplied by more than 20 over the past five years. There is a real need to save allocated space for this content as well as allowing more efficient usage, searching, and retrieving information operations on this content. Using techniques borrowed from other languages or general data compression techniques, ignoring the proper features of Arabic has limited success in terms of compression ratio. In this paper, we present a hybrid technique that uses the linguistic features of Arabic language to improve the compression ratio of Arabic texts. This technique works in phases. In the first phase, the text file is split into four different files using a multilayer model-based approach. In the second phase, each one of these four files is compressed using the Burrows-Wheeler compression algorithm
Multi-space Variational Encoder-Decoders for Semi-supervised Labeled Sequence Transduction
Labeled sequence transduction is a task of transforming one sequence into
another sequence that satisfies desiderata specified by a set of labels. In
this paper we propose multi-space variational encoder-decoders, a new model for
labeled sequence transduction with semi-supervised learning. The generative
model can use neural networks to handle both discrete and continuous latent
variables to exploit various features of data. Experiments show that our model
provides not only a powerful supervised framework but also can effectively take
advantage of the unlabeled data. On the SIGMORPHON morphological inflection
benchmark, our model outperforms single-model state-of-art results by a large
margin for the majority of languages.Comment: Accepted by ACL 201
A new hybrid metric for verifying parallel corpora of Arabic-English
This paper discusses a new metric that has been applied to verify the quality
in translation between sentence pairs in parallel corpora of Arabic-English.
This metric combines two techniques, one based on sentence length and the other
based on compression code length. Experiments on sample test parallel
Arabic-English corpora indicate the combination of these two techniques
improves accuracy of the identification of satisfactory and unsatisfactory
sentence pairs compared to sentence length and compression code length alone.
The new method proposed in this research is effective at filtering noise and
reducing mis-translations resulting in greatly improved quality.Comment: in CCSEA-201
On the Difficulty to Design Arabic E-learning System in Statistics
In this paper, we present a case study, which describe the development of the Statistic e-learning-course in Arabic language â``Arabic MM*STATÂŽÂŽ. The basic frame forthis E-book, the system MM*STAT was developed at the School for Business and Economics of Humboldt-UniversitĂ€t zu Berlin. Arabic MM*STAT uses a HTML - based filing card structure. We discuss the difficulties of the implementation of such a system in to the standard WWW formats and present the solutions needed for Arab educational institutions and the Arabic user. Those solutions are consistent with the Arabic language, and include the modern trend in the e-learning environment.electronic books, Arabtex, MM*STAT, Statistical software
A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs
International audienceThis work presents a method that enables Arabic NLP community to build scalable lexical resources. The proposed method is low cost and efficient in time in addition to its scalability and extendibility. The latter is reflected in the ability for the method to be incremental in both aspects, processing resources and generating lexicons. Using a corpus; firstly, tokens are drawn from the corpus and lemmatized. Secondly, finite state transducers (FSTs) are generated semi-automatically. Finally, FSTsare used to produce all possible inflected verb forms with their full morphological features. Among the algorithmâs strength is its ability to generate transducers having 184 transitions, which is very cumbersome, if manually designed. The second strength is a new inflection scheme of Arabic verbs; this increases the efficiency of FST generation algorithm. The experimentation uses a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The resulting open lexical resources coverage is high. Our resources cover more than 70% Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully, partially and not vocalized verbal inflected forms. All these resources are being made public and currently used as an open package in the Unitex framework available under the LGPL license
Frequency vs. Association for Constraint Selection in Usage-Based Construction Grammar
A usage-based Construction Grammar (CxG) posits that slot-constraints
generalize from common exemplar constructions. But what is the best model of
constraint generalization? This paper evaluates competing frequency-based and
association-based models across eight languages using a metric derived from the
Minimum Description Length paradigm. The experiments show that
association-based models produce better generalizations across all languages by
a significant margin
- âŠ