11 research outputs found

    Turkish lexicon expansion by using finite state automata

    © 2019 The Authors. Published by The Scientific and Technological Research Council of Turkey. This is an open access article available under a Creative Commons licence. The published version can be accessed at the following link on the publisher’s website: https://journals.tubitak.gov.tr/elektrik/issues/elk-19-27-2/elk-27-2-25-1804-10.pdf
    Turkish is an agglutinative language with rich morphology. A Turkish verb can have thousands of different word forms. Therefore, sparsity becomes an issue in many Turkish natural language processing (NLP) applications. This article presents a model for Turkish lexicon expansion. We aimed to expand the lexicon by using a morphological segmentation system and reversing the segmentation task into a generation task. Our model uses finite-state automata (FSA) to incorporate orthographic features and morphotactic rules. We extracted orthographic features by capturing phonological operations that are applied to words whenever a suffix is added. Each FSA state corresponds to either a stem or a suffix category. Stems are clustered based on their parts of speech (i.e. noun, verb, or adjective) and suffixes are clustered based on their allomorphic features. We generated approximately 1 million word forms by using only a few thousand Turkish stems with an accuracy of 82.36%, which will help to reduce the out-of-vocabulary size in other NLP applications. Although our experiments are performed on the Turkish language, the same model is also applicable to other agglutinative languages such as Hungarian and Finnish.
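
The generation idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' model: the state names, the two suffix categories (plural and locative), and the helper functions are assumptions chosen for illustration, and only two well-known phonological operations are covered (vowel harmony, and d/t alternation after voiceless consonants).

```python
# Toy finite-state generator for Turkish noun forms (illustrative only).
BACK_VOWELS = set("aıou")
FRONT_VOWELS = set("eiöü")
VOICELESS = set("fstkçşhp")


def last_vowel(word):
    """Return the last vowel of a lowercase word."""
    for ch in reversed(word):
        if ch in BACK_VOWELS or ch in FRONT_VOWELS:
            return ch
    raise ValueError(f"no vowel found in {word!r}")


def plural(word):
    """Attach the plural suffix, choosing -lar/-ler by vowel harmony."""
    return word + ("lar" if last_vowel(word) in BACK_VOWELS else "ler")


def locative(word):
    """Attach the locative suffix, choosing among -da/-de/-ta/-te."""
    consonant = "t" if word[-1] in VOICELESS else "d"
    vowel = "a" if last_vowel(word) in BACK_VOWELS else "e"
    return word + consonant + vowel


# Hypothetical morphotactics: each state lists (suffix function, next state).
TRANSITIONS = {
    "NOUN": [(plural, "PLURAL"), (locative, "CASE")],
    "PLURAL": [(locative, "CASE")],
    "CASE": [],
}


def generate(stem, state="NOUN"):
    """Enumerate every word form reachable from the given state."""
    forms = [stem]
    for apply_suffix, next_state in TRANSITIONS[state]:
        forms.extend(generate(apply_suffix(stem), next_state))
    return forms


print(generate("ev"))     # ['ev', 'evler', 'evlerde', 'evde']
print(generate("kitap"))  # ['kitap', 'kitaplar', 'kitaplarda', 'kitapta']
```

Each state stands for a stem or suffix category, and the allomorph is chosen when the suffix is attached, which mirrors the abstract's description at a very small scale.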

    Undergraduates’ interest towards learning genetics concepts through integrated STEM problem based learning approach

    A scientific and innovative society can be produced by giving priority to Science, Technology, Engineering, and Mathematics (STEM), as emphasized by the Malaysian Higher Education Blueprint (2015-2025). STEM needs to be implemented in higher education because universities need to produce competent graduates to support economic growth and sustainable development. Learning STEM through Problem Based Learning might allow undergraduates to become more enthusiastic, since problem-based instruction combined with STEM uses teamwork and problem-solving techniques to engage first-year undergraduates fully with the learning. This study was conducted to investigate whether an Integrated STEM Problem Based Learning module could enhance and retain interest in genetics concepts among first-year undergraduates. Topics in genetics are considered difficult not only to teach but also to learn. In this research, to overcome difficulties in learning genetics concepts, genetics-related topics were chosen to introduce STEM through a problem-based learning approach, which might help first-year undergraduates to acquire deep genetics content knowledge. This is vital for first-year undergraduates, as the knowledge gained in their first semester will be applied in upcoming courses throughout their undergraduate programs of study. A pre-experimental research design with a one-group posttest design was applied. A total of 50 participants, first-year undergraduates from the Faculty of Biology at one of the public universities in Malaysia, were involved. The Genetics Interest Questionnaire was used to study whether the STEM Problem Based Learning module could enhance and retain interest in genetics concepts. The research found that Integrated STEM through a problem-based learning approach could enhance and retain interest in learning genetics concepts among first-year undergraduates

    A Method to Convert Sana’ani Accent to Modern Standard Arabic

    This paper presents an efficient mechanism to convert the Sana’ani dialect to Modern Standard Arabic (MSA). The mechanism is based on morphological rules related to the Sana’ani dialect as well as Modern Standard Arabic. Such rules facilitate the conversion of the dialect to its corresponding MSA. The mechanism tokenizes the input dialect text and divides each token into a stem and its affixes; such affixes can be categorized into two categories: dialect affixes and/or MSA affixes. At the same time, the stem could be a dialect stem or an MSA stem. Therefore, our mechanism, implemented by using a simple MSA stemmer, must pay attention to such situations. Then our dialect stemmer is applied to strip the resulting token and extract dialect affixes. At this point, the rules are applied to decide when to carry out the extraction of an affix. The experiment shows that the Sana’ani dialect has three classes of distortions, which are prefix, suffix, and stem distortions. The algorithm normalizes such distortions based on the morphological rules. For each morphological rule, the mechanism checks the possibility of applying the rule. That means that if the rule conditions are met, the dialect affix will be replaced by its corresponding MSA affix. If there is no restriction on applying the rule related to the distorted stem, then the rule can be considered as a parallel corpus of the dialect and MSA. Finally, the experiment computes the distortion ratio of MSA in the Sana’ani dialect. For a Sana’ani dialect sample of 9386 words, 16.29% have distorted suffixes, 0.70% have distorted prefixes and 2.17% contain distorted stems. These percentages are related only to the processed words

    Building An Efficient Indexing For Crawling The Website With An Efficient Spider

    With the present effort, we propose to investigate the results of applying the Right-Truncated Index-Based Web Search Engine in order to determine its usefulness for storing and retrieving Arabic documents. The Right-Truncated Index-Based Web Search Engine, a program for reading any set of Arabic documents, accepts a query and then processes both the documents and the query. It then selects (predicts) the documents most relevant to the query that has been entered. The program encompasses both a morphological component and a mathematical one. The morphological component allows the researcher to run either a stemming algorithm or a right-truncated algorithm. The chief advantage of the stemming algorithm is that it uses the least possible amount of storage for indexing by mapping the inflected and derived terms into a single, indexed-stem word. On the other hand, the right-truncated algorithm reduces the amount of storage to a lesser degree, but increases the probability of retrieving relevant (user-favorable) documents, compared to the stemming algorithm. One of the purposes of our investigation is to compare the efficiency of these two indexing mechanisms. The mathematical component of the algorithm accepts the output of the right-truncation algorithm, and then employs both term frequency and inverse document frequency (TF-IDF) in order to establish the relative importance of each document with respect to the terms of the query. This paper also describes building a simple search engine based on a crawler or a spider. The crawler, which indexes different types of documents, is an algorithm to crawl the file system from a specified folder. A basic design and object model was developed to support single-word as well as multiple-word search results. It is capable of finding data to index by following (tracing) web links rather than searching directory listings in the file system.
In this process, files are downloaded through HTTP and HTML pages are parsed in order to obtain more links without getting into a recursive loop. This paper also discusses how to improve the efficiency of the indexing mechanism using a right-truncated stemmer for processing Arabic documents
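
The two components described in the abstract above can be sketched roughly as follows. This is an assumption-laden illustration, not the paper's implementation: right truncation is taken here to mean keeping a fixed number of leading characters of each token (the `keep` parameter is hypothetical), TF-IDF uses the plain `tf * log(N / df)` form, and the sample documents are English placeholders.

```python
import math
from collections import Counter


def right_truncate(token, keep=4):
    """Keep only the first `keep` characters (assumed truncation rule)."""
    return token[:keep]


def index_terms(text, keep=4):
    """Lowercase, split on whitespace, and right-truncate each token."""
    return [right_truncate(tok, keep) for tok in text.lower().split()]


def tf_idf_scores(documents, query, keep=4):
    """Score each document against the query with summed TF-IDF weights."""
    doc_terms = [Counter(index_terms(doc, keep)) for doc in documents]
    n_docs = len(documents)
    scores = [0.0] * n_docs
    for term in set(index_terms(query, keep)):
        doc_freq = sum(1 for counts in doc_terms if term in counts)
        if doc_freq == 0:
            continue  # term never occurs, so it contributes nothing
        idf = math.log(n_docs / doc_freq)
        for i, counts in enumerate(doc_terms):
            total = sum(counts.values())
            if total:
                scores[i] += (counts[term] / total) * idf
    return scores


docs = [
    "indexing indexed index",
    "crawler follows links",
    "spider crawls pages",
]
scores = tf_idf_scores(docs, "indexes")
best = max(range(len(docs)), key=scores.__getitem__)
print(best, scores)
```

Because "indexes", "indexing", and "indexed" all truncate to the same four-character key, the query matches the first document even though the exact word never appears in it, which is the retrieval benefit the abstract attributes to right truncation.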

    Arabic ELLS’ Attitude toward Phrasal Verbs

    All too often, English second language learners come across phrasal verbs and find themselves missing the point. They find themselves needing to look up these phrases in order to understand the intended meaning. Learners usually recognize the meaning of the verb; however, the action suggested by the verb does not go along with the associated object or the surrounding context. Simply, what they read does not make sense. A particle that looks like a preposition is attached to the verb and affects the meaning of the whole sentence. This change in meaning leads to misinterpretation and causes communication failure. Phrasal verbs (PVs) are too many to master and sometimes one PV has multiple meanings (e.g., make up). Some studies described PVs as “a recurring nightmare” to English language learners (ELLs) (Littlemore & Low, 2006), and other studies mentioned that PVs “do not enjoy a good reputation” (Rudzka-Ostyn, 2003). The natural reaction toward difficult language constructions is avoidance. This study concerns itself with the avoidance attitude of Arabic ELLs toward English phrasal verbs (EPVs). Earlier empirical studies attributed the avoidance of using EPVs only to the syntactic differences between L1 and L2 (Dagut & Laufer, 1985; Laufer & Eliasson, 1993). Other studies ascribed the avoidance behavior to the semantic difficulty of EPVs (Hulstijn & Marchena, 1989). However, recent studies speculate that there are more factors behind the behavior than the L1-L2 differences and the polysemous nature of English PVs (Liao & Fukuya, 2004). This study validated the avoidance behavior among Arabic learners. It also looked into three salient factors that have direct effects on the avoidance behavior of English phrasal verbs: the proficiency level of the learners, the length of stay in an L2 environment, and the type of phrasal verbs.
A total of 18 Arabic informants, equally divided into two groups (intermediate and advanced), participated in an experimental test to investigate the Arabic ELLs’ avoidance attitude and the reasons behind it. It was hypothesized that the performances of the two groups were different, which was tested by measuring the means and proportions of the two groups. The results supported the alternative hypothesis (H1) and rejected the null hypothesis (H0). That is, the means of the two groups were not equal. The intermediate group avoided more PVs than the advanced group. The results also showed effects of the variables on the avoidance behavior. 1) The advanced group selected and used more PVs in the experimental test than the intermediate group. 2) The longer the period a learner stays in an English-speaking environment, the more PVs are learned. 3) PVs that bear idiomatic meaning are avoided more than the ones that carry literal meaning. The study also gave an overview of the concept of the phrasal verb in Arabic and English and reviewed the stance of grammarians on PVs in the two languages. Three approaches to teaching EPVs were presented as an attempt to find ways that allow ELLs to perceive and produce phrasal verbs naturally, the way native speakers do

    Building an interactive multilingual tool for information retrieval

    Growing demands on the Internet have led users to access information expressed in languages other than their own, which gave rise to cross-lingual information retrieval (CLIR). CLIR is established as a major topic in information retrieval (IR). One approach to CLIR uses different translation methods to translate queries, documents, and indexes into other languages. Queries submitted to search engines suffer from untranslatable query keys (i.e., words that are missing from the dictionary) and from translation ambiguity, meaning the difficulty of choosing between translation alternatives. Our approach in this thesis is to build and develop a software tool (MORTAJA-IR-TOOL), a new information retrieval tool written in the Java programming language with JDK 1.6. The tool has several features: it provides a systematic multilingual system to be used as a basis for translation in CLIR, and it stems the words entered in the query as a stage preceding the translation process. Evaluating the proposed query translation methodology against a baseline translation that uses a machine-readable dictionary, in a user-centered experiment, showed an improvement of 8.96%. Evaluating the impact of stemming the query words on the quality of the retrieval of matching data in the other language showed an improvement of 4.14%. Finally, evaluating the combination of the proposed stemming and translation methodology (MORTAJA-IR-TOOL) showed an improvement in the rate of retrieved data of 15.86%. Keywords: Cross lingual information retrieval, CLIR, Information Retrieval, IR, Translation, stemming.

    Early phonological acquisition by Kuwaiti Arabic children

    PhD Thesis. This is the first exploration of typical phonological development in the speech of children acquiring Kuwaiti Arabic (KA) before the age of 4;0. In many of the world’s languages, salient aspects of the ambient language have been shown to influence the child’s initial progress in language acquisition (Vihman, 1996, 2014); however, studies of the phonological development of Arabic lack adequate information on the extent to which factors such as the frequency of occurrence of certain features and their phonological salience influence the early stages of speech acquisition. A cross-sectional study design was adopted in this thesis to explore the speech of 70 typically developing children. The children were sampled from the Arabic-speaking Kuwaiti population; they were aged between 1;4 and 3;7 and were gender-balanced. Spontaneous speech samples were obtained from audio and video recordings of the children while interacting with their parent for 30 minutes. The production accuracy of KA consonants was examined to explore the influence of type and token frequencies on the order of consonant acquisition and the development of error patterns. The sonority index was also used to predict the order of consonant acquisition cross-linguistically. The findings were then compared with those of other dialects of Arabic to identify within-language variability and with studies on English to address cross-linguistic differences between Arabic and English early phonological development. The results are partially consistent with accounts that argue for a significant role of input frequency in determining the rate and order of consonant acquisition within a language. The development of KA error patterns also shows relative sensitivity to consonant frequency. The sonority index does not always help in predicting the acquisition of Arabic consonants, and the developmental error patterns and early word structures in Arabic and English are significantly distinct.
The outcomes of this study provide essential knowledge about typical Arabic phonological development and the first step towards building a standardised phonological test for Arabic speaking children

    Sentiment analysis and resources for informal Arabic text on social media

    Online content posted by Arab users on social networks does not generally abide by grammatical and spelling rules. These posts, or comments, are valuable because they contain users’ opinions towards different objects such as products, policies, institutions, and people. These opinions constitute important material for commercial and governmental institutions. Commercial institutions can use these opinions to steer marketing campaigns, optimize their products and know the weaknesses and/or strengths of their products. Governmental institutions can benefit from social network posts to detect public opinion before or after legislating a new policy or law and to learn about the main issues that concern citizens. However, the huge size of online data and its noisy nature can hinder manual extraction and classification of opinions present in online comments. Given the irregularity of dialectal Arabic (or informal Arabic), tools developed for formally correct Arabic are of limited use. This is specifically the case when they are employed in sentiment analysis (SA) where the target of the analysis is social media content. This research implemented a system that addresses this challenge. This work can be roughly divided into three blocks: building a corpus for SA and manually tagging it to check the performance of the constructed lexicon-based (LB) classifier; building a sentiment lexicon that consists of three different sets of patterns (negative, positive, and spam); and finally implementing a classifier that employs the lexicon to classify Facebook comments. In addition to providing resources for dialectal Arabic SA and classifying Facebook comments, this work categorises reasons behind incorrect classification, provides preliminary solutions for some of them with a focus on negation, and uses regular expressions to detect the presence of lexemes. This work also illustrates how the constructed classifier works along with its different levels of reporting.
Moreover, it compares the performance of the LB classifier against Naïve Bayes classifier and addresses how NLP tools such as POS tagging and Named Entity Recognition can be employed in SA. In addition, the work studies the performance of the implemented LB classifier and the developed sentiment lexicon when used to classify other corpora used in the literature, and the performance of lexicons used in the literature to classify the corpora constructed in this research. With minor changes, the classifier can be used in domain classification of documents (sports, science, news, etc.). The work ends with a discussion of research questions arising from the research reported
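
A lexicon-based classifier of the kind the abstract above describes can be sketched in a few lines. Everything concrete here is a placeholder assumption rather than the thesis's resources: the three pattern sets use English toy entries instead of dialectal Arabic, the negation handling only looks at the word immediately before a match, and the spam-first rule is an illustrative design choice.

```python
import re

# Hypothetical toy lexicon with the three pattern sets named in the abstract.
LEXICON = {
    "positive": [r"\bgood\b", r"\bgreat\b", r"\blove\w*"],
    "negative": [r"\bbad\b", r"\bterrible\b", r"\bhate\w*"],
    "spam": [r"https?://\S+", r"\bfree offer\b"],
}

# A negation word directly before a match flips its polarity (assumed rule).
NEGATION_BEFORE = re.compile(r"\b(?:not|never|no)\s+$")


def classify(comment):
    """Label a comment as spam, positive, negative, or neutral."""
    text = comment.lower()
    if any(re.search(pattern, text) for pattern in LEXICON["spam"]):
        return "spam"
    score = 0
    for label, sign in (("positive", 1), ("negative", -1)):
        for pattern in LEXICON[label]:
            for match in re.finditer(pattern, text):
                negated = NEGATION_BEFORE.search(text[: match.start()])
                score += -sign if negated else sign
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


for sample in ["I love this product", "not good", "visit http://example.com"]:
    print(sample, "->", classify(sample))
```

Regular expressions let one lexicon entry cover several surface forms (for example `love`, `loved`, `loves`), which is one plausible reason to detect lexemes this way in noisy social media text.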