Using Graph Mining Method in Analyzing Turkish Loanwords Derived from Arabic Language
Loanwords are words transferred from one language to another that become an essential part of the borrowing language. Loanwords pass from a source language to a recipient language for many reasons, such as invasions, occupations, or trade. Detecting these loanwords is a complicated task because there are no standard specifications for transferring words between languages, and hence accuracy is low. This work tries to enhance the accuracy of detecting loanwords between Turkish and Arabic as a case study.
In this paper, the proposed system contributes to finding all possible loanwords using any set of characters, whether alphabetically or randomly arranged. It then processes distortion in pronunciation and solves the problem of letters missing from Turkish relative to Arabic. A graph mining technique was introduced for identifying Turkish loanwords from Arabic, used for the first time for this purpose. The problem of letter differences between the two languages is solved by using a reference language (English) to unify the style of writing. The proposed system was tested on 1256 manually annotated words. The obtained results showed an f-measure of 0.99, which is a high value for such a system. All these contributions reduce the time and effort needed to identify loanwords in an efficient and accurate way. Moreover, researchers do not need knowledge of the recipient and source languages. In addition, this method can be generalized to any two languages using the same steps followed in obtaining Turkish loanwords from Arabic.
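As a hedged sketch of the general idea, not the paper's actual pipeline: map both scripts onto a shared Latin reference alphabet, normalize a pronunciation shift, and compare words as small character-bigram graphs. The tiny transliteration tables and the p-for-b rule below are illustrative assumptions.

```python
# Illustrative only: hypothetical transliteration tables map both scripts
# onto a shared Latin reference alphabet (the paper uses English for this).
TURKISH_TO_REF = {"ş": "sh", "ç": "ch", "ı": "i", "ğ": "g", "p": "b"}
ARABIC_TO_REF = {"ك": "k", "ت": "t", "ا": "a", "ب": "b"}
VOWELS = set("aeiou")

def to_reference(word, table):
    """Rewrite a word in the shared reference alphabet."""
    return "".join(table.get(ch, ch) for ch in word)

def skeleton(word):
    """Drop vowels: Arabic script omits short vowels, so compare consonants."""
    return "".join(ch for ch in word if ch not in VOWELS)

def char_graph(word):
    """A word as a set of character-bigram edges, i.e. a tiny directed graph."""
    return set(zip(word, word[1:]))

def loanword_score(turkish, arabic):
    """Jaccard overlap of the two words' bigram-edge sets."""
    g1 = char_graph(skeleton(to_reference(turkish, TURKISH_TO_REF)))
    g2 = char_graph(skeleton(to_reference(arabic, ARABIC_TO_REF)))
    return len(g1 & g2) / len(g1 | g2) if g1 | g2 else 0.0

# Turkish "kitap" (book) is borrowed from Arabic "كتاب" (kitab).
print(loanword_score("kitap", "كتاب"))  # 1.0: identical consonant graphs
```

A threshold on this score would then decide which candidate pairs count as loanwords; the real system's graph mining is considerably more elaborate.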
Applying corpus and computational methods to loanword research : new approaches to Anglicisms in Spanish
Understanding both the linguistic and social roles of loanwords is becoming more relevant as globalization has brought loanwords into new settings, often previously viewed as monolingual. Their occurrence has the potential to impact speech communities, in that they have the capacity to alter the semantic relationships and social values ascribed to individual elements within the existing lexicon. In order to identify broad patterns, we must turn towards large and varied sources of data, specifically corpora. This dissertation aims to tackle some of the practical issues involved in the use of corpora, while addressing two conceptual issues in the field of loanword research: the social distribution and semantic nature of loanwords. I propose two methods, adapted from advances in computational linguistics, which contribute to two different stages of loanword research: processing corpora to find tokens of interest and semantically analyzing those tokens. These methods are employed in two case studies. The first explores the social stratification of loanwords in Argentine Spanish. The second measures the semantic specificity of loanwords relative to their native equivalents.
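As a toy illustration of the first stage, processing corpora to find tokens of interest, one might flag Spanish tokens whose spelling hints at English origin. The patterns below are crude, hypothetical heuristics, not the dissertation's method.

```python
import re

# Flag tokens whose spelling patterns are rare in native Spanish orthography
# (e.g. 'w', 'k', '-ing'); purely illustrative, with obvious false positives.
ANGLICISM_HINTS = re.compile(r"[wk]|sh|ing\b", re.IGNORECASE)

def candidate_anglicisms(text):
    """Return the tokens that match one of the spelling hints."""
    tokens = re.findall(r"\b\w+\b", text)
    return [t for t in tokens if ANGLICISM_HINTS.search(t)]

print(candidate_anglicisms("El marketing y el show fueron un éxito"))
# ['marketing', 'show']
```

A real pipeline would follow this candidate-generation step with manual or model-based filtering, since spelling heuristics alone over- and under-generate.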
Augmenting a colour lexicon
Languages differ markedly in the number of colour terms in their lexicons. The Himba, for example, a remote culture in Namibia, were reported in 2005 to have a language with only five colour terms. We re-examined their colour naming using a novel computer-based method, drawing colours from across the gamut rather than only from the saturated shell of colour space that is the norm in cross-cultural colour research. Measuring confidence in communication, the Himba now have seven terms, or more properly categories, that are independent of other colour terms. Thus, we report the first augmentation of major terms, namely green and brown, to a colour lexicon in any language. A critical examination of supervised and unsupervised machine-learning approaches across the two datasets, collected at different periods, shows that perceptual mechanisms can at most only partly explain colour category formation, and that cultural factors, such as linguistic similarity, are the critical driving force for augmenting colour terms and effective colour communication.
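A minimal sketch of the unsupervised side of such an analysis, assuming k-means over RGB samples as a stand-in for the actual clustering methods; the colour data below is made up.

```python
import random

# Cluster RGB colour samples with k-means and treat each cluster as a
# candidate colour category. The study's real analysis used naming data
# across the full colour gamut; these six samples are invented.

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means over tuples; returns final centers and clusters."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Greens vs. browns: the two categories the Himba data newly distinguished.
samples = [(40, 160, 60), (50, 170, 70), (60, 150, 55),   # greenish
           (120, 80, 40), (130, 70, 35), (110, 85, 50)]   # brownish
centers, clusters = kmeans(samples, k=2)
print(sorted(len(c) for c in clusters))
```

Whether such perceptually driven clusters line up with named categories is exactly the question the abstract answers in the negative: clustering explains category formation only partly.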
Character-level Chinese Backpack Language Models
The Backpack is a Transformer alternative shown to improve interpretability
in English language modeling by decomposing predictions into a weighted sum of
token sense components. However, Backpacks' reliance on token-defined meaning
raises questions as to their potential for languages other than English, a
language for which subword tokenization provides a reasonable approximation for
lexical items. In this work, we train, evaluate, interpret, and control
Backpack language models in character-tokenized Chinese, in which words are
often composed of many characters. We find that our (134M parameter) Chinese
Backpack language model performs comparably to a (104M parameter) Transformer,
and learns rich character-level meanings that log-additively compose to form
word meanings. In SimLex-style lexical semantic evaluations, simple averages of
Backpack character senses outperform input embeddings from a Transformer. We
find that complex multi-character meanings are often formed by using the same
per-character sense weights consistently across context. Exploring
interpretability-through control, we show that we can localize a source of
gender bias in our Backpacks to specific character senses and intervene to
reduce the bias.
Comment: BlackboxNLP 2023 Camera-Ready
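The weighted-sense-sum idea at the heart of the Backpack can be sketched in a few lines of NumPy; the shapes, characters, and uniform weights below are toy values, not the paper's 134M-parameter model.

```python
import numpy as np

# Each character token carries several "sense" vectors; the model's output is
# a weighted sum of the senses of the tokens in context.
vocab = ["猫", "头", "鹰"]          # character tokens, as in the Chinese model
n_senses, dim = 4, 8
rng = np.random.default_rng(0)

# One bank of sense vectors per character: (vocab, senses, dim).
sense_vectors = rng.normal(size=(len(vocab), n_senses, dim))

def backpack_output(token_ids, weights):
    """weights: (positions, senses) non-negative weights from a context model."""
    senses = sense_vectors[token_ids]          # (positions, senses, dim)
    # Weighted sum over every (position, sense) pair -> one output vector.
    return np.einsum("ps,psd->d", weights, senses)

ids = np.array([0, 1, 2])                         # 猫头鹰 ("owl"), char by char
w = np.full((3, n_senses), 1.0 / (3 * n_senses))  # uniform weights for the demo
out = backpack_output(ids, w)
print(out.shape)  # (8,)
```

Because the output is linear in the sense vectors, individual character senses can be inspected or edited directly, which is what makes the bias-localization intervention in the abstract possible.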
Compilation of Malay criminological terms from online news
A Malay language corpus has been established by the Institute of Language and Literature (Dewan Bahasa dan Pustaka, DBP) in Malaysia. Most past research on the Malay language corpus has focused on the description, lexicography, and translation of the Malay language. However, the existing literature contains no list of Malay words that categorizes crime terminologies. This study aims to fill that linguistic gap. First, we aggregated the most frequently used crime terminology from Malaysian online news sources. Five hundred crime-related words were compiled. No automatic machines were used in the initial process, but they were subsequently used to verify the data. Four human coders validated the data and ensured the originality of the semantic understanding of the Malay text. Finally, major crime terminologies were outlined from a set of keywords to serve as taggers in our solution. The ultimate goal of this study is to provide a corpus for forensic linguistics, police investigations, and general crime research. This study has established the first corpus of criminological text in the Malay language.
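The keyword-tagging stage described above can be sketched with a toy dictionary; the three Malay terms and their categories here are illustrative stand-ins for the study's 500 validated words.

```python
# Hypothetical mini-lexicon of Malay crime terms -> English category labels.
CRIME_TERMS = {"curi": "theft", "rompak": "robbery", "dadah": "drugs"}

def tag_crime_terms(text):
    """Return (word, category) pairs for known crime terms in the text."""
    return [(w, CRIME_TERMS[w]) for w in text.lower().split() if w in CRIME_TERMS]

print(tag_crime_terms("Suspek curi kereta dan menjual dadah"))
# [('curi', 'theft'), ('dadah', 'drugs')]
```

A production tagger would also need stemming and multi-word term handling, which simple whitespace matching ignores.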
Information-theoretic causal inference of lexical flow
This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. 
This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible to enhance the framework, e.g. with confidence values for each directionality decision.
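The cognate-overlap intuition behind the volume's information-theoretic variables can be sketched as a Jaccard ratio over cognate-class assignments; the concept list and class IDs below are invented for illustration.

```python
def cognate_overlap(lang_a, lang_b):
    """Jaccard overlap between two languages' (concept, cognate-class) sets."""
    a, b = set(lang_a), set(lang_b)
    return len(a & b) / len(a | b)

# Cognate-class IDs for a small concept list in three hypothetical varieties:
# sharing a class ID for a concept means the words are cognate.
variety1 = {"water": 1, "fire": 7, "dog": 3, "stone": 9}
variety2 = {"water": 1, "fire": 7, "dog": 4, "stone": 9}
variety3 = {"water": 2, "fire": 8, "dog": 4, "stone": 5}

print(cognate_overlap(variety1.items(), variety2.items()))  # 0.6
```

Causal inference then operates over such overlap variables for all pairs of varieties to propose the existence and direction of lexical flow.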
Modeling Language Variation and Universals: A Survey on Typological Linguistics for Natural Language Processing
Linguistic typology aims to capture structural and semantic variation across
the world's languages. A large-scale typology could provide excellent guidance
for multilingual Natural Language Processing (NLP), particularly for languages
that suffer from the lack of human labeled resources. We present an extensive
literature survey on the use of typological information in the development of
NLP techniques. Our survey demonstrates that to date, the use of information in
existing typological databases has resulted in consistent but modest
improvements in system performance. We show that this is due to both intrinsic
limitations of databases (in terms of coverage and feature granularity) and
under-employment of the typological features included in them. We advocate for
a new approach that adapts the broad and discrete nature of typological
categories to the contextual and continuous nature of machine learning
algorithms used in contemporary NLP. In particular, we suggest that such an
approach could be facilitated by recent developments in data-driven induction
of typological knowledge.
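The mismatch the survey highlights, discrete database categories versus the continuous representations NLP models consume, can be illustrated with a minimal one-hot bridge; the feature table below loosely follows WALS-style values and is purely illustrative.

```python
# Hypothetical two-feature typological table; real databases such as WALS
# cover hundreds of features with much sparser coverage.
FEATURES = {
    "word_order": ["SOV", "SVO", "VSO"],
    "adjective_order": ["Adj-Noun", "Noun-Adj"],
}

def encode_language(values):
    """Turn a language's discrete feature values into one flat one-hot vector."""
    vec = []
    for feat, options in FEATURES.items():
        vec.extend(1.0 if values.get(feat) == opt else 0.0 for opt in options)
    return vec

print(encode_language({"word_order": "SOV", "adjective_order": "Adj-Noun"}))
# [1.0, 0.0, 0.0, 1.0, 0.0]
```

The survey's proposal goes further: replacing such hard one-hot values with soft, data-induced distributions so the features match the continuous nature of modern models.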