23 research outputs found

    Using Graph Mining Method in Analyzing Turkish Loanwords Derived from Arabic Language

    Loanwords are words transferred from one language to another that become an essential part of the borrowing language. They pass from the source language to the recipient language for many reasons, including invasions, occupations, and trade. Detecting loanwords is a complicated task because there are no standard rules for how words are transferred between languages, which leads to low accuracy. This work aims to improve the accuracy of loanword detection, taking Turkish loanwords from Arabic as a case study. The proposed system finds all possible loanwords using any set of characters, whether alphabetically ordered or not. It handles distortions in pronunciation and addresses the problem of Arabic letters that are missing from Turkish. A graph mining technique, used here for the first time for this purpose, is introduced to identify Turkish loanwords from Arabic. The differences between the letters of the two languages are resolved by using a reference language (English) to unify the writing style. The proposed system was tested on 1256 manually annotated words, and the results show an F-measure of 0.99, a high value for such a system. These contributions reduce the time and effort needed to identify loanwords efficiently and accurately. Moreover, researchers do not need knowledge of the recipient or the source language. In addition, the method can be generalized to any two languages by following the same steps used to obtain Turkish loanwords from Arabic
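
The reference-language idea in this abstract can be illustrated with a minimal sketch. It is not the paper's graph mining method, which the abstract does not specify: it only shows both scripts being mapped onto a shared Latin alphabet before comparison. The mapping tables, the function names, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical, deliberately tiny Arabic-to-Latin table (an assumption for
# illustration; the paper's actual mapping is not given in the abstract).
ARABIC_TO_LATIN = {
    "\u0643": "k",  # kaf
    "\u062a": "t",  # ta
    "\u0627": "a",  # alif
    "\u0628": "b",  # ba
    "\u0642": "k",  # qaf: no Turkish counterpart, folded to "k" here
    "\u0644": "l",  # lam
    "\u0645": "m",  # mim
}

# Fold Turkish-specific letters onto plain Latin letters (also an assumption).
TURKISH_TO_LATIN = str.maketrans(
    {"ç": "c", "ş": "s", "ğ": "g", "ı": "i", "ö": "o", "ü": "u"}
)


def romanize_arabic(word: str) -> str:
    """Map each Arabic letter to Latin; letters missing from the table are dropped."""
    return "".join(ARABIC_TO_LATIN.get(ch, "") for ch in word)


def romanize_turkish(word: str) -> str:
    """Lowercase the word and fold Turkish-specific letters to plain Latin."""
    return word.lower().translate(TURKISH_TO_LATIN)


def similarity(arabic_word: str, turkish_word: str) -> float:
    """Similarity in [0, 1] between the two romanized forms."""
    a = romanize_arabic(arabic_word)
    t = romanize_turkish(turkish_word)
    return SequenceMatcher(None, a, t).ratio()


if __name__ == "__main__":
    kitab = "\u0643\u062a\u0627\u0628"  # Arabic word for "book"
    print(romanize_arabic(kitab), similarity(kitab, "kitap"))
    print(similarity(kitab, "deniz"))
```

Because written Arabic usually omits short vowels, the romanized forms differ in length, so a real system would need a more tolerant comparison than this plain string ratio.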

    Detection and Morphological Analysis of Novel Russian Loanwords

    This paper investigates recent English loanwords in Russian and explores ways in which computational methods can help further theoretical research. The goal of the study is two-fold: to find new, previously unattested loanwords borrowed over the last decade and to examine the rate of adaptation of the new borrowings, attested by the degree to which they conform to the constraints of the Russian language. First, we train a finite-state pipeline that combines character n-gram language models, which encode phonotactic and lexical properties of loanwords, with a binary classifier to detect loanwords. The model achieves state-of-the-art performance results during evaluation, surpassing previously established benchmarks. Secondly, we introduce a new and extended corpus of recent Russian loanwords that have been detected in Web texts by our model. The corpus includes loanwords together with their morphological features, part-of-speech tags, and sentences in which they occur. We conduct an analysis of inflectional morphology of the identified loanwords, investigating the rate of indeclinability of recent loanwords and stem-final consonant alternations in verbs
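
As a rough illustration of combining character n-gram language models with a binary decision, the sketch below trains two add-one-smoothed character bigram models and labels a word a loanword when the loanword model assigns it the higher log-probability. This is an assumption-laden toy, not the paper's finite-state pipeline: the class and function names, the likelihood comparison standing in for the classifier, and the two five-word training lists are all hypothetical.

```python
import math
from collections import Counter


class CharBigramModel:
    """Character bigram language model with add-one smoothing."""

    def __init__(self, words):
        self.bigrams = Counter()
        self.contexts = Counter()
        self.chars = set()
        for word in words:
            padded = "^" + word + "$"  # word-boundary markers
            self.chars.update(padded)
            for a, b in zip(padded, padded[1:]):
                self.bigrams[(a, b)] += 1
                self.contexts[a] += 1

    def log_prob(self, word, vocab_size):
        """Smoothed log-probability of the word under this model."""
        padded = "^" + word + "$"
        total = 0.0
        for a, b in zip(padded, padded[1:]):
            count = self.bigrams[(a, b)] + 1
            total += math.log(count / (self.contexts[a] + vocab_size))
        return total


# Toy training data, chosen only for illustration (not the paper's corpus).
NATIVE_WORDS = ["молоко", "город", "вода", "берег", "голова"]
LOAN_WORDS = ["менеджер", "маркетинг", "брифинг", "шопинг", "блогер"]

native_model = CharBigramModel(NATIVE_WORDS)
loan_model = CharBigramModel(LOAN_WORDS)

# Shared vocabulary size, with one extra slot reserved for unseen characters.
VOCAB_SIZE = len(native_model.chars | loan_model.chars) + 1


def is_loanword(word):
    """Binary decision: does the loanword model score the word higher?"""
    loan_score = loan_model.log_prob(word, VOCAB_SIZE)
    native_score = native_model.log_prob(word, VOCAB_SIZE)
    return loan_score > native_score


if __name__ == "__main__":
    for candidate in ["шопинг", "молоко", "тренинг"]:
        print(candidate, is_loanword(candidate))
```

In this toy setup, the unseen word "тренинг" is flagged because its final "-инг" sequence is frequent in the loanword list. A realistic system would use higher-order n-grams, far more data, and a trained classifier over the model scores.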

    Automatic Extraction of Lithuanian Cybersecurity Terms Using Deep Learning Approaches

    The paper presents the results of research on deep learning methods, aiming to determine the most effective one for automatic extraction of Lithuanian terms from a specialized domain (cybersecurity) with very restricted resources. A semi-supervised approach to deep learning was chosen because Lithuanian is a less-resourced language and the large amounts of data necessary for unsupervised methods are not available in the selected domain. The findings show that a Bi-LSTM network with Bidirectional Encoder Representations from Transformers (BERT) can achieve close to state-of-the-art results

    Turkic C- type reduplications

    The present book can be viewed as a patchwork of topics relating more or less directly to Turkic reduplications. Many are interconnected and interdependent, which renders it impossible to organize the presentation in a linear way. The thematic division adopted here is only one of the possible groupings, and not necessarily optimal for all tasks. To alleviate this inconvenience, the current chapter first summarizes the whole following a different thematic division (4.1), and then very briefly recapitulates what I consider to be the most important conclusions (4.2). Some thoughts are expressed more clearly here than in the previous chapters, where they were lost between auxiliary observations

    Altaic and Chagatay lectures : studies in honour of Éva Kincses-Nagy

    Information-theoretic causal inference of lexical flow

    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible for the framework to be enhanced e.g. by confidence values for each directionality decision

    A typology of questions in Northeast Asia and beyond: An ecological perspective

    This study investigates the distribution of linguistic and specifically structural diversity in Northeast Asia (NEA), defined as the region north of the Yellow River and east of the Yenisei. In particular, it analyzes what is called the grammar of questions (GQ), i.e., those aspects of any given language that are specialized for asking questions or regularly combine with these. The bulk of the study is a bottom-up description and comparison of GQs in the languages of NEA. The addition of the phrase and beyond to the title of this study serves two purposes. First, languages such as Turkish and Chuvash are included, despite the fact that they are spoken outside of NEA, since they have ties to (or even originated in) the region. Second, despite its focus on one area, the typology is intended to be applicable to other languages as well. Therefore, it makes extensive use of data from languages outside of NEA. The restriction to one category is necessary for reasons of space and clarity, and the process of zooming in on one region allows a higher resolution and historical accuracy than is usually the case in linguistic typology. The discussion mentions over 450 languages and dialects from NEA and beyond and gives about 900 glossed examples. The aim is to achieve both a cross-linguistically plausible typology and a maximal resolution of the linguistic diversity of Northeast Asia
