5 research outputs found

    Using verb-noun collocation for disambiguating verb polysemy in English-Arabic statistical machine translation / Hussein Khaled Hussein Soori

    This thesis attempts to resolve the problem of verb-noun collocation in English-Arabic machine translation engines. The problem shows in the semantically ill-formed output that current machine translation systems produce when the wrong verb synonym is chosen for the Arabic translation. It arises when an engine tries to select, from a set of polysemous verbs in English, the equivalent meaning of the verb in Arabic. This selection mostly depends on the syntactic environment and on verb semantic features that serve as selectional restrictions. These selectional restrictions can be very effective at resolving verb polysemy, but they lead to a dead end when the task is to find the verb that collocates most strongly with the noun in the output Arabic translation. To resolve this problem, this work uses a statistical method inspired by Church et al. (1991) in a prototype designed to retrieve verb-noun collocates in Arabic. The testing data sets for this prototype were chosen from various topics. Two multi-domain corpora in Modern Standard Arabic were chosen for this work: the Contemporary Corpus of Arabic and the Arabic Corpus by Mourad Abbas. Together the chosen corpora contain 14 million words. The testing data sets were translated by Google, Bing and the prototype designed for this thesis. To evaluate these three engines, a simple metric was proposed that includes a gold standard value for the noun-verb collocation in the Arabic translation. According to the evaluation metric, Bing scored a verb-noun collocation value of 0.72, Google scored 0.75 and the prototype scored 0.89. The final results showed that the average performance rate is between 0.65 and 0.67 for Bing, between 0.63 and 0.85 for Google, and between 0.82 and 0.88 for the prototype.
    This thesis shows that retrieving the verb that collocates most strongly with the noun in Arabic corpora is a sophisticated task, because Arabic is highly inflectional and agglutinative: particles, personal pronouns (for both subject and object) and possessive pronouns are attached to the verb in Arabic texts. The task involves two aspects: choosing the search query and choosing the distance between the noun and the verb. The choice of query for the noun and the verb is largely governed by verb conjugation and noun declension. This requires modifying the search query (stem or lemma) according to verb features such as tense, number, mood and aspect, and noun features such as number, gender, definiteness, case and possessive clitic. Furthermore, decreasing the search distance may cause the search to miss some genuine collocations, while increasing the distance can introduce noise into the results.
    Keywords: English-Arabic machine translation; verb-noun collocation in Arabic; statistical machine translation; collocation retrieval; polysemy and collocation; Arabic corpor
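    The trade-off between search distance and noise described in this abstract can be illustrated with a small sketch. The abstract does not say which statistic the prototype uses, so pointwise mutual information (one association measure from the Church et al. line of work) is an assumption here, as are the function name `verb_noun_pmi`, the `window` parameter and the toy English tokens. Real Arabic text would first need the stem or lemma normalization the abstract mentions.

```python
import math
from collections import Counter

def verb_noun_pmi(tokens, verbs, nouns, window=3):
    """Score verb-noun pairs by pointwise mutual information (PMI).

    Hypothetical helper: a pair is counted when the noun occurs within
    `window` tokens on either side of the verb. PMI is an assumed
    association measure, not necessarily the one used in the thesis.
    """
    n = len(tokens)
    unigram = Counter(tokens)
    pair = Counter()
    for i, tok in enumerate(tokens):
        if tok not in verbs:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i and tokens[j] in nouns:
                pair[(tok, tokens[j])] += 1
    scores = {}
    for (verb, noun), count in pair.items():
        p_pair = count / n
        p_verb = unigram[verb] / n
        p_noun = unigram[noun] / n
        scores[(verb, noun)] = math.log2(p_pair / (p_verb * p_noun))
    return scores

# Toy data: "x" stands for any intervening word.
tokens = "take decision x make tea x take decision".split()
verbs, nouns = {"take", "make"}, {"decision", "tea"}
narrow = verb_noun_pmi(tokens, verbs, nouns, window=1)
wide = verb_noun_pmi(tokens, verbs, nouns, window=2)
print(sorted(narrow))
print(sorted(wide))
```

    With `window=1` only adjacent pairs are counted; with `window=2` the pairs ("make", "decision") and ("take", "tea") also appear, which is the kind of noise the abstract attributes to a larger search distance.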

    Data compression method for plagiarism detection (Metoda komprese dat pro detekci plagiátorství)

    In our digital era, the need for plagiarism detection tools is growing with the tremendous number of documents produced on a daily basis, inside and outside academia, in all fields of science. These include reports, students' assignments, and undergraduate and graduate theses and dissertations. While some students use cut-and-paste methods, others resort to different forms of plagiarism, including changing the sentence structure, paraphrasing and replacing words with their synonyms. This thesis focuses on creating textual plagiarism detection tools for Arabic and Czech texts by implementing the initial parts of a compression algorithm, with modifications, so that text similarity can be measured by compression-based similarity metrics. It then expands on this work by integrating the technique with a Czech synonym thesaurus and a Czech stemmer to detect semantic plagiarism, including paraphrasing and restructuring of Czech texts.
    Separately, stemming and syllabification are very important in information retrieval, data mining and language processing, so creating good stemming and syllabification rules is crucial. The demand is even higher for languages spoken by larger populations, such as Arabic. This thesis presents a novel method for syllabification of Arabic text based on Arabic vowel letters. It also presents a light stemming method for Arabic. To fine-tune the results of this method, an online parser is used before stemming to better categorize the different parts of speech, and the output words are later matched against an electronic dictionary.
    460 - Department of Computer Science (Katedra informatiky)
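    As a rough illustration of a compression-based similarity metric, the sketch below computes the normalized compression distance (NCD) between two texts. It is only a stand-in under stated assumptions: the abstract does not name the metric or the compressor, the thesis implements parts of its own compression algorithm, and here the standard-library `zlib` compressor is swapped in. The function names and sample sentences are hypothetical.

```python
import zlib

def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: lower values mean more shared content."""
    x, y = a.encode("utf-8"), b.encode("utf-8")
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

original = (
    "Plagiarism detection tools compare a submitted document against a "
    "collection of known sources and report passages that overlap, so that "
    "a reviewer can decide whether the overlap was properly attributed."
)
near_copy = original.replace("submitted", "student")
unrelated = (
    "Endolysosomal trafficking moves membrane proteins between cellular "
    "compartments, and changes in these pathways may alter how tumour cells "
    "respond to chemotherapy over the course of treatment."
)

print(ncd(original, near_copy))
print(ncd(original, unrelated))
```

    A near copy compresses well when concatenated with its source, so its distance is lower than that of an unrelated text. Synonym replacement and paraphrase weaken this signal, which is consistent with the abstract's step of adding a thesaurus and a stemmer before comparison.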

    Academic Plagiarism Detection

    The role of endolysosomal trafficking in anticancer drug resistance
