77 research outputs found

    Using Graph Mining Method in Analyzing Turkish Loanwords Derived from Arabic Language

    Loanwords are words transferred from one language to another that become an essential part of the borrowing language. They pass from the source language to the recipient language for many reasons, including invasions, professions, and trade. Detecting loanwords is a complicated task because there are no standard rules for how words are transferred between languages, which leads to low detection accuracy. This work aims to improve the accuracy of detecting loanwords, using Turkish words borrowed from Arabic as a case study. The proposed system finds all possible loanwords using any set of characters, whether alphabetically or randomly arranged. It then handles distortion in pronunciation and addresses the letters that exist in Arabic but are missing from Turkish. A graph mining technique, used here for the first time for this purpose, is introduced to identify Turkish loanwords from Arabic. Differences between the letters of the two languages are resolved by using a reference language (English) to unify the writing style. The proposed system was tested on 1256 manually annotated words, and the results show an F-measure of 0.99, which is a high value for such a system. These contributions reduce the time and effort needed to identify loanwords efficiently and accurately. Moreover, researchers do not need knowledge of the recipient or source language. In addition, the method can be generalized to any two languages by following the same steps used to obtain Turkish loanwords from Arabic.

    Augmenting a colour lexicon

    Languages differ markedly in the number of colour terms in their lexicons. The Himba, for example, a remote culture in Namibia, were reported in 2005 to have a language with only five colour terms. We re-examined their colour naming using a novel computer-based method drawing colours from across the gamut rather than only from the saturated shell of colour space that is the norm in cross-cultural colour research. Measuring confidence in communication, the Himba now have seven terms, or more properly categories, that are independent of other colour terms. Thus, we report the first augmentation of major terms, namely green and brown, to a colour lexicon in any language. A critical examination of supervised and unsupervised machine-learning approaches across the two datasets collected at different periods shows that perceptual mechanisms can explain colour category formation only to some extent, and that cultural factors, such as linguistic similarity, are the critical driving force for augmenting colour terms and effective colour communication.

    Character-level Chinese Backpack Language Models

    The Backpack is a Transformer alternative shown to improve interpretability in English language modeling by decomposing predictions into a weighted sum of token sense components. However, Backpacks' reliance on token-defined meaning raises questions as to their potential for languages other than English, a language for which subword tokenization provides a reasonable approximation for lexical items. In this work, we train, evaluate, interpret, and control Backpack language models in character-tokenized Chinese, in which words are often composed of many characters. We find that our (134M parameter) Chinese Backpack language model performs comparably to a (104M parameter) Transformer, and learns rich character-level meanings that log-additively compose to form word meanings. In SimLex-style lexical semantic evaluations, simple averages of Backpack character senses outperform input embeddings from a Transformer. We find that complex multi-character meanings are often formed by using the same per-character sense weights consistently across context. Exploring interpretability-through-control, we show that we can localize a source of gender bias in our Backpacks to specific character senses and intervene to reduce the bias. Comment: BlackboxNLP 2023 Camera-Read
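
As a rough illustration of the "weighted sum of sense components" idea in the abstract above, the toy sketch below combines per-character sense vectors with non-negative weights and then scores the result against output embeddings. It assumes that the scores are a linear function of the summed vector, so each sense's contribution to the scores adds up. All numbers, dimensions, and the names `weighted_sense_sum` and `logits` are hypothetical and chosen only for illustration; this is not the paper's code or model.

```python
# Toy illustration only: every vector, weight, and name here is made up.

def weighted_sense_sum(senses, weights):
    """Sum sense vectors, each scaled by its weight.

    senses[j][l]  is the l-th sense vector of character j.
    weights[j][l] is the non-negative weight given to that sense.
    """
    dim = len(senses[0][0])
    out = [0.0] * dim
    for char_senses, char_weights in zip(senses, weights):
        for vec, w in zip(char_senses, char_weights):
            for d in range(dim):
                out[d] += w * vec[d]
    return out


def logits(vec, output_embeddings):
    """Linear scores: one dot product per (hypothetical) vocabulary item."""
    return [sum(e * v for e, v in zip(row, vec)) for row in output_embeddings]


# Two characters, two sense vectors each, in a 3-dimensional space.
senses = [
    [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]],
    [[0.5, 0.5, 0.0], [2.0, 0.0, 1.0]],
]
weights = [[0.5, 0.25], [0.0, 0.25]]
output_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]  # two vocabulary items

combined = weighted_sense_sum(senses, weights)
combined_logits = logits(combined, output_embeddings)

# Because the scoring is linear, the same scores can be rebuilt from
# per-sense contributions, which is what makes each sense inspectable.
per_sense_total = [0.0] * len(output_embeddings)
for char_senses, char_weights in zip(senses, weights):
    for vec, w in zip(char_senses, char_weights):
        for i, score in enumerate(logits(vec, output_embeddings)):
            per_sense_total[i] += w * score

print(combined, combined_logits, per_sense_total)
```

In this toy setting, setting one weight to zero removes exactly that sense's contribution to the scores, which gives an intuition for the kind of sense-level intervention the abstract describes.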

    Compilation of Malay criminological terms from online news

    A Malay language corpus has been established by the Institute of Language and Literature (Dewan Bahasa dan Pustaka, DBP) in Malaysia. Most past research on the Malay language corpus has focused on the description, lexicography and translation of the Malay language. However, the existing literature contains no list of Malay words that categorizes crime terminology. This study aims to fill that linguistic gap. First, we aggregated the most frequently used crime terminology from Malaysian online news sources. Five hundred crime-related words were compiled. No automated tools were used in the initial process, but they were subsequently used to verify the data. Four human coders validated the data and ensured the originality of the semantic understanding of the Malay text. Finally, major crime terminologies were outlined from a set of keywords to serve as taggers in our solution. The ultimate goal of this study is to provide a corpus for forensic linguistics, police investigations, and general crime research. This study has established the first corpus of criminological text in the Malay language.

    Information-theoretic causal inference of lexical flow

    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible for the framework to be enhanced, e.g. by confidence values for each directionality decision.

    Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing

    Linguistic typology aims to capture structural and semantic variation across the world's languages. A large-scale typology could provide excellent guidance for multilingual Natural Language Processing (NLP), particularly for languages that suffer from a lack of human-labeled resources. We present an extensive literature survey on the use of typological information in the development of NLP techniques. Our survey demonstrates that, to date, the use of information in existing typological databases has resulted in consistent but modest improvements in system performance. We show that this is due to both intrinsic limitations of databases (in terms of coverage and feature granularity) and under-employment of the typological features included in them. We advocate for a new approach that adapts the broad and discrete nature of typological categories to the contextual and continuous nature of machine learning algorithms used in contemporary NLP. In particular, we suggest that such an approach could be facilitated by recent developments in data-driven induction of typological knowledge.