18 research outputs found

    Introducing the Arabic WordNet project

    Kombinasi Metode Rule-Based dan N-Gram Stemming untuk Mengenali Stemmer Bahasa Bali (A Combination of Rule-Based and N-Gram Stemming Methods for a Balinese Stemmer)

    Stemming is the process of extracting a stem from an affixed word; it aims to increase recall by reducing the variants of an affixed word to its stem form. Previous research on stemming Balinese used a rule-based method, but it removed only prefixes and suffixes and left other affix types untouched, such as infixes, confixes, simulfixes, and combinations of affixes. Rule-based stemming has been applied to a variety of languages. The rule-based method is easy to verify and validate in a simple domain, but it is weak in highly complex domains: if the system cannot match a word to any rule, no result is produced. To overcome this weakness, we use n-gram stemming, in which the affixed word and a candidate stem are converted to n-grams and the similarity between their n-grams is measured with the Dice coefficient; when the similarity meets a defined threshold, the candidate stem is returned. In this study, we developed a stemmer that removes all affix variations in Balinese by combining the rule-based approach with n-gram stemming. In experiments on ten queries, the proposed method achieved an average stemming accuracy of 96.67%, compared with 75% for the previous method, and on five queries the n-gram stemming method recognized some affixed words that fall outside the rules. In future work, we will consider the semantics of each word and a validation stage using a text mining application.
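The n-gram matching step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of character bigrams as sets, the threshold of 0.6, and the English toy lexicon are all assumptions made for the example.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word (bigrams by default)."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient between the n-gram sets of two words: 2|X & Y| / (|X| + |Y|)."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    if not x and not y:
        return 0.0  # both words are shorter than n, so there is nothing to compare
    return 2 * len(x & y) / (len(x) + len(y))


def ngram_stem(word, stems, threshold=0.6, n=2):
    """Return the most similar known stem, or None if none meets the threshold.

    `stems` is a hypothetical lexicon of known stem words; `threshold` is an
    assumed cut-off, not a value taken from the paper.
    """
    best = max(stems, key=lambda s: dice(word, s, n), default=None)
    if best is not None and dice(word, best, n) >= threshold:
        return best
    return None


# Toy English lexicon, used only to show the mechanics.
lexicon = ["walk", "talk"]
print(ngram_stem("walking", lexicon))  # "walk": shares 3 of its bigrams, Dice = 6/9
print(ngram_stem("xyz", lexicon))      # None: no bigram overlap with any stem
```

In a combined stemmer of the kind the abstract describes, a lookup like this would presumably serve as a fallback for words that no affix rule matches.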

    Arabic stemmers and their effectiveness on the information retrieval system

    Arabic is a Semitic language with a complex morphology, so using a stemming algorithm in an information retrieval system is almost always beneficial. An Arabic stemmer has been implemented and included in the information retrieval system developed at the Information Science Research Institute at the University of Nevada, Las Vegas. The stemmer is written in the Ruby language; it removes affixes and then matches the remaining word against patterns of the same length. The retrieval experiment uses the TREC collection, which consists of over a million documents. We will test the effectiveness of the Arabic stemmer using recall/precision measurements and compare the results to those of other stemmers.

    بناء أداة تفاعلية متعددة اللغات لاسترجاع المعلومات (Building an Interactive Multilingual Tool for Information Retrieval)

    Growing demand on the Internet has led users to access information expressed in languages other than their own, which gave rise to cross-lingual information retrieval (CLIR). CLIR is established as a major topic in information retrieval (IR). One approach to CLIR uses different translation methods to translate queries, documents, and indexes into other languages. Queries submitted to search engines suffer from untranslatable query keys (i.e., words missing from the dictionary) and from translation ambiguity, meaning difficulty in choosing between alternative translations. Our approach in this thesis is to build and develop a software tool (MORTAJA-IR-TOOL), a new information retrieval tool written in Java with JDK 1.6. The tool has several features: it provides a systematic multilingual system to be used as a basis for translation in CLIR, and it stems the words entered in the query as a stage preceding translation. The proposed systematic query translation was evaluated in a user-centered experiment against a baseline translation that uses a machine-readable dictionary; the improvement was 8.96%. The impact of stemming the query words on the quality of retrieval of matching data in the other language was also evaluated; the improvement was 4.14%. Finally, combining the proposed stemming with the proposed systematic translation (MORTAJA-IR-TOOL) improved the rate of retrieved data by 15.86%. Keywords: Cross-lingual information retrieval, CLIR, Information Retrieval, IR, Translation, Stemming.

    A Morphologically Sensitive Clustering Algorithm for Identifying Arabic Roots

    We present a clustering algorithm for Arabic words that share the same root. Root-based clusters can substitute for dictionaries in indexing for IR. Modifying Adamson and Boreham (1974), our two-stage algorithm applies light stemming before calculating word-pair similarity coefficients using techniques sensitive to Arabic morphology.

    Probabilistic Modelling of Morphologically Rich Languages

    This thesis investigates how the sub-structure of words can be accounted for in probabilistic models of language. Such models play an important role in natural language processing tasks such as translation or speech recognition, but often rely on the simplistic assumption that words are opaque symbols. This assumption does not fit morphologically complex languages well, where words can have rich internal structure and sub-word elements are shared across distinct word forms. Our approach is to encode basic notions of morphology into the assumptions of three different types of language models, with the intention that leveraging shared sub-word structure can improve model performance and help overcome the data sparsity that arises from morphological processes. In the context of n-gram language modelling, we formulate a new Bayesian model that relies on the decomposition of compound words to attain better smoothing, and we develop a new distributed language model that learns vector representations of morphemes and leverages them to link together morphologically related words. In both cases, we show that accounting for word sub-structure improves the models' intrinsic performance and provides benefits when applied to other tasks, including machine translation. We then shift the focus beyond the modelling of word sequences and consider models that automatically learn what the sub-word elements of a given language are, given an unannotated list of words. We formulate a novel model that can learn discontiguous morphemes in addition to the more conventional contiguous morphemes that most previous models are limited to. This approach is demonstrated on Semitic languages, and we find that modelling discontiguous sub-word structures leads to improvements in the task of segmenting words into their contiguous morphemes. Comment: DPhil thesis, University of Oxford, submitted and accepted 2014. http://ora.ox.ac.uk/objects/uuid:8df7324f-d3b8-47a1-8b0b-3a6feb5f45c