STEMMING THE JAVANESE LANGUAGE USING DAMERAU LEVENSHTEIN DISTANCE (DLD)
Stemming is one of the essential stages of text mining. It removes affixes such as prefixes and suffixes to recover the root word. This study uses a string matching algorithm, Damerau Levenshtein Distance (DLD), to find the base forms of Javanese words. The test data consist of 300 words that carry a prefix, an infix, a suffix, a combination of prefix and suffix, or word repetition. The results indicate that the DLD algorithm can be used to stem Javanese text, with an accuracy of 49.6%.
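As a rough illustration of distance-based stemming, the sketch below computes an edit distance that also counts adjacent transpositions and maps a word to the closest entry in a root-word list. It makes several assumptions not stated in the abstract: it uses the optimal string alignment variant of Damerau Levenshtein distance, the names `dld` and `nearest_root` are hypothetical, and the three-word root list is a toy example rather than the study's dictionary.

```python
def dld(a: str, b: str) -> int:
    """Edit distance with insert, delete, substitute, and adjacent transposition.

    Assumption: this is the optimal string alignment variant, which may
    differ from the exact variant used in the study.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ca" -> "ac"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def nearest_root(word: str, roots: list[str]) -> str:
    """Return the root with the smallest distance to the word (first one wins ties)."""
    return min(roots, key=lambda root: dld(word, root))


# Hypothetical toy root list, for illustration only
roots = ["tuku", "laku", "turu"]
print(nearest_root("dituku", roots))
```

A nearest-match lookup like this always returns some root, even for words whose true root is missing from the list, so a practical stemmer would likely need a distance threshold as well.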
A new hybrid metric for verifying parallel corpora of Arabic-English
This paper discusses a new metric that has been applied to verify translation
quality between sentence pairs in parallel corpora of Arabic-English.
This metric combines two techniques, one based on sentence length and the other
based on compression code length. Experiments on sample test parallel
Arabic-English corpora indicate the combination of these two techniques
improves accuracy of the identification of satisfactory and unsatisfactory
sentence pairs compared to sentence length and compression code length alone.
The new method proposed in this research is effective at filtering noise and
reducing mistranslations, resulting in greatly improved quality.
Comment: in CCSEA-201
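The idea of combining a sentence-length signal with a compression code-length signal can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say how the two techniques are combined, so the weighted average, the `weight` parameter, and the function names are hypothetical, and `zlib` stands in for the paper's compression scheme because it ships with Python.

```python
import zlib


def code_length(text: str) -> int:
    """Compressed size in bytes. Assumption: zlib is a stand-in compressor."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ratio(x: float, y: float) -> float:
    """Similarity of two magnitudes in [0, 1]; 1.0 means equal."""
    largest = max(x, y)
    return min(x, y) / largest if largest else 1.0


def hybrid_score(source: str, target: str, weight: float = 0.5) -> float:
    """Weighted mix of length ratio and code-length ratio (hypothetical formula)."""
    length_ratio = ratio(len(source), len(target))
    code_ratio = ratio(code_length(source), code_length(target))
    return weight * length_ratio + (1 - weight) * code_ratio


pair_ok = ("The weather is nice today.", "Today the weather is nice.")
pair_bad = ("Yes.", "The committee postponed the vote until the following week.")
print(hybrid_score(*pair_ok), hybrid_score(*pair_bad))
```

Pairs scoring below some cutoff would be flagged as unsatisfactory. Raw UTF-8 byte counts differ between Arabic and English scripts, so a real implementation would probably need language-specific normalization before comparing code lengths.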
Building and verifying parallel corpora between Arabic and English
Arabic and English are acknowledged as two major natural languages used by
many countries and regions. Reviews of previous literature conclude that the
quality of machine translation (MT) between these languages is disappointing
and unsatisfactory.
This research aims to improve the translation quality of MT between Arabic and
English by developing higher quality parallel corpora. The thesis developed a
higher quality parallel test corpus, based on corpora from Al Hayat articles and
the OPUS open-source online corpora database.
A new Prediction by Partial Matching (PPM)-based metric for sentence alignment
has been applied to verify translation quality between the sentence pairs
in the test corpus. This metric combines two techniques: the traditional approach,
based on sentence length, and another based on compression code length.
A higher quality parallel corpus has been constructed from the existing resources.
Drawing sentences and words from two online sources, Al Hayat and OPUS, the
new corpus offers 27,775,663 words in Arabic and 30,808,480 in English. Experimental
results on sample data indicate that combining the PPM-based and sentence-length
techniques for sentence alignment on this corpus improves alignment accuracy
compared to sentence length alone.