
391 research outputs found

Multilingual unsupervised word alignment models and their application

Author: Mansouri Bigvand Anahita
Publication date: 05/03/2021

Word alignment is an essential task in natural language processing because of its critical role in training statistical machine translation (SMT) models, error analysis for neural machine translation (NMT), building bilingual lexicons, and annotation transfer. In this thesis, we explore models for word alignment, how they can be extended to incorporate linguistically-motivated alignment types, and how they can be neuralized in an end-to-end fashion. In addition to these methodological developments, we apply our word alignment models to cross-lingual part-of-speech projection. First, we present a new probabilistic model for word alignment where word alignments are associated with linguistically-motivated alignment types. We propose a novel task of joint prediction of word alignment and alignment types, together with novel semi-supervised learning algorithms for this task. We also solve a sub-task of predicting the alignment type given an aligned word pair. The proposed joint generative models (alignment-type-enhanced models) significantly outperform the models without alignment types in terms of word alignment and translation quality. Next, we present an unsupervised neural Hidden Markov Model for word alignment, where emission and transition probabilities are modeled using neural networks. The model is simpler in structure, allows for seamless integration of additional context, and can be used in an end-to-end neural network. Finally, we tackle the part-of-speech tagging task for the zero-resource scenario where no part-of-speech (POS) annotated training data is available. We present a cross-lingual projection approach where neural HMM aligners are used to obtain high quality word alignments between resource-poor and resource-rich languages. Moreover, high quality neural POS taggers are used to provide annotations for the resource-rich language side of the parallel data, as well as to train a tagger on the projected data.
Our experimental results on truly low-resource languages show that our methods outperform their corresponding baselines.
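
As an illustration of the cross-lingual projection idea described in this abstract, the following is a minimal sketch in Python, not the thesis's implementation. It assumes the word alignment is already available as (source index, target index) pairs and that each target token takes the majority tag of the source tokens aligned to it; the function name `project_pos_tags` and the fallback tag "X" are hypothetical choices made for this example.

```python
from collections import Counter


def project_pos_tags(source_tags, alignment, target_length, default_tag="X"):
    """Project POS tags from source tokens onto target tokens.

    source_tags: list of tags, one per source token.
    alignment: iterable of (source_index, target_index) pairs (0-based).
    target_length: number of tokens in the target sentence.
    default_tag: hypothetical fallback for target tokens with no alignment.
    """
    # Collect one vote per alignment link for each target position.
    votes = [Counter() for _ in range(target_length)]
    for src_idx, tgt_idx in alignment:
        votes[tgt_idx][source_tags[src_idx]] += 1

    projected = []
    for counter in votes:
        if counter:
            # Majority vote over the tags of all aligned source tokens.
            projected.append(counter.most_common(1)[0][0])
        else:
            projected.append(default_tag)
    return projected


# Toy example: three tagged source tokens, four target tokens,
# with target token 2 left unaligned.
source_tags = ["DET", "NOUN", "VERB"]
alignment = [(0, 0), (1, 1), (2, 3)]
projected = project_pos_tags(source_tags, alignment, target_length=4)
```

A tagger for the resource-poor language would then be trained on sentences labeled this way; how unaligned or ambiguous tokens are handled is a design choice that this sketch deliberately simplifies.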

Simon Fraser University Institutional Repository

Deep Learning for Natural Language Parsing

Author: Calder Calum
Jaf Sardar
Publication venue: IEEE Access

Natural language processing problems (such as speech recognition, text-based data mining, and text or speech generation) are becoming increasingly important. Before many of these problems can be approached effectively, it is necessary to process the syntactic structures of the sentences. Syntactic parsing is the task of constructing a syntactic parse tree over a sentence, which describes the structure of the sentence. Parse trees are used as part of many language processing applications. In this paper, we present a multi-lingual dependency parser. Using advanced deep learning techniques, our parser architecture tackles common issues with parsing such as long-distance head attachment, while using ‘architecture engineering’ to adapt to each target language in order to reduce the feature engineering often required for parsing tasks. We implement a parser based on this architecture that uses transfer learning techniques to address important issues related to limited-resource languages. We exceed the accuracy of state-of-the-art parsers on languages with limited training resources by a considerable margin. We present promising results for solving core problems in natural language parsing, while also performing at state-of-the-art accuracy on general parsing tasks.
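
To make the notions of a dependency parse tree and long-distance head attachment concrete, here is a small Python sketch. It is not the parser from the paper; it only assumes the common convention of storing a tree as a list of head indices (1-based, with 0 marking the root). The function names and the example analysis of the sentence are illustrative assumptions.

```python
def is_tree(heads):
    """Check that a head list encodes a single-rooted tree without cycles.

    heads[i] is the 1-based index of the head of token i + 1; 0 marks the root.
    """
    if heads.count(0) != 1:
        return False
    for start in range(1, len(heads) + 1):
        seen = set()
        node = start
        # Walk up towards the root; revisiting a node means there is a cycle.
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def attachment_distances(heads):
    """Return the head-dependent distance (in tokens) for every non-root token."""
    return [abs(head - (i + 1)) for i, head in enumerate(heads) if head != 0]


# Illustrative analysis of "The dog that I saw barked":
# The->dog, dog->barked, that->saw, I->saw, saw->dog, barked->ROOT
heads = [2, 6, 5, 5, 2, 0]
distances = attachment_distances(heads)
longest = max(distances)
```

In this toy analysis the longest attachment links "dog" to "barked" across the relative clause, which is the kind of long-distance dependency the abstract refers to.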

Sunderland University Institutional Repository

On the development of an information system for monitoring user opinion and its role for the public

Author: Karyukin Vladislav
Mamykova Zhanl
Mutanov Galimkair
Nassimova Gulnar
Negri Matteo
Sundetova Zhanerke
Torekul Saule
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2022

Social media services and analytics platforms are growing rapidly. Many events happen almost every day, and the role of social media monitoring tools is increasing accordingly. Social networks are widely used for managing and promoting brands and services. As a result, the most popular social analytics platforms target business purposes, while the monitoring of social, economic, and political problems remains underrepresented and has not been covered by thorough research. Moreover, most of these platforms focus on resource-rich languages such as English, whereas social media texts and comments in low-resource languages, such as Russian and Kazakh, are not represented well enough. This work is therefore devoted to developing and applying an information system called the OMSystem for analyzing users' opinions on news portals, blogs, and social networks in Kazakhstan. The system uses sentiment dictionaries of the Russian and Kazakh languages and machine learning algorithms to determine the sentiment of social media texts. The overall structure and functionality of the system are also presented. The experimental part is devoted to building machine learning models for sentiment analysis on the Russian and Kazakh datasets. The performance of the models is then evaluated with accuracy, precision, recall, and F1-score metrics. The models with the highest scores are selected for implementation in the OMSystem. The OMSystem's social analytics module is then used to thoroughly analyze the healthcare, political and social aspects of the most relevant topics connected with vaccination against the coronavirus disease. The analysis allowed us to discover the public social mood in the cities of Almaty and Nur-Sultan and other large regional cities of Kazakhstan. The study covered two extensive periods: 10-01-2021 to 30-05-2021 and 01-07-2021 to 12-08-2021.
In the obtained results, people's moods and attitudes to the Government's policies and actions were studied using social network indicators such as the level of topic discussion activity in society, the level of interest in the topic in society, and the mood level of society. These indicators, calculated by the OMSystem, allowed careful identification of alarming factors among the public (negative attitude to government regulations, vaccination policies, trust in vaccination, etc.) and assessment of the social mood.
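
The evaluation metrics named in this abstract (accuracy, precision, recall, and F1-score) can be sketched in a few lines of Python. This is a generic illustration for a two-class sentiment task, not the OMSystem's code; the labels "pos" and "neg" and the toy predictions are made up for the example.

```python
def binary_metrics(gold, predicted, positive="pos"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy gold labels and model predictions (illustrative only).
gold = ["pos", "neg", "pos", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "pos", "pos", "neg"]
scores = binary_metrics(gold, predicted)
```

With more than two sentiment classes, the same quantities are usually computed per class and then averaged; which averaging the authors used is not stated in the abstract.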

Archivio della ricerca - Fondazione Bruno Kessler

PubMed Central

Directional adposition use in English, Swedish and Finnish

Author: van der Zee Emile
Walker Crystal
Publication venue: International Cognitive Linguistics Association
Publication date: 21/06/2010

Directional adpositions such as to the left of describe where a Figure is in relation to a Ground. English and Swedish directional adpositions refer to the location of a Figure in relation to a Ground, whether both are static or in motion. In contrast, the Finnish directional adpositions edellä (in front of) and jäljessä (behind) solely describe the location of a moving Figure in relation to a moving Ground (Nikanne, 2003). When directional adpositions are used, a frame of reference must be assumed in order to interpret their meaning. For example, the meaning of to the left of in English can be based on a relative (speaker or listener based) reference frame or an intrinsic (object based) reference frame (Levinson, 1996). When a Figure and a Ground are both in motion, it is possible for a Figure to be described as being behind or in front of the Ground, even if neither has intrinsic features. As shown by Walker (in preparation), there are good reasons to assume that in the latter case a motion based reference frame is involved. This means that if Finnish speakers used edellä (in front of) and jäljessä (behind) more frequently in situations where both the Figure and Ground are in motion, a difference in reference frame use between Finnish on the one hand and English and Swedish on the other could be expected. We asked native English, Swedish and Finnish speakers to select adpositions from a language-specific list to describe the location of a Figure relative to a Ground when both were shown to be moving on a computer screen. We were interested in any differences between Finnish, English and Swedish speakers. All languages showed a predominant use of directional spatial adpositions referring to the lexical concepts TO THE LEFT OF, TO THE RIGHT OF, ABOVE and BELOW. There were no differences between the languages in directional adposition use or reference frame use, including reference frame use based on motion.
We conclude that despite differences in the grammars of the languages involved, and potential differences in reference frame system use, the three languages investigated encode Figure location in relation to Ground location in a similar way when both are in motion. Levinson, S. C. (1996). Frames of reference and Molyneux’s question: Crosslinguistic evidence. In P. Bloom, M.A. Peterson, L. Nadel & M.F. Garrett (Eds.), Language and Space (pp. 109-170). Massachusetts: MIT Press. Nikanne, U. (2003). How Finnish postpositions see the axis system. In E. van der Zee & J. Slack (Eds.), Representing direction in language and space. Oxford, UK: Oxford University Press. Walker, C. (in preparation). Motion encoding in language, the use of spatial locatives in a motion context. Unpublished doctoral dissertation, University of Lincoln, Lincoln, United Kingdom.

University of Lincoln Institutional Repository


Unsupervised Morphological Segmentation and Part-of-Speech Tagging for Low-Resource Scenarios

Author: Eskander Ramy
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2021

With the high cost of manually labeling data and the increasing interest in low-resource languages, for which human annotators might not be even available, unsupervised approaches have become essential for processing a typologically diverse set of languages, whether high-resource or low-resource. In this work, we propose new fully unsupervised approaches for two tasks in morphology: unsupervised morphological segmentation and unsupervised cross-lingual part-of-speech (POS) tagging, which have been two essential subtasks for several downstream NLP applications, such as machine translation, speech recognition, information extraction and question answering. We propose a new unsupervised morphological-segmentation approach that utilizes Adaptor Grammars (AGs), nonparametric Bayesian models that generalize probabilistic context-free grammars (PCFGs), where a PCFG models word structure in the task of morphological segmentation. We implement the approach as a publicly available morphological-segmentation framework, MorphAGram, that enables unsupervised morphological segmentation through the use of several proposed language-independent grammars. In addition, the framework allows for the use of scholar knowledge, when available, in the form of affixes that can be seeded into the grammars. The framework handles the cases when the scholar-seeded knowledge is either generated from language resources, possibly by someone who does not know the language, as weak linguistic priors, or generated by an expert in the underlying language as strong linguistic priors. Another form of linguistic priors is the design of a grammar that models language-dependent specifications. We also propose a fully unsupervised learning setting that approximates the effect of scholar-seeded knowledge through self-training. 
Moreover, since there is no single grammar that works best across all languages, we propose an approach that picks a nearly optimal configuration (a learning setting and a grammar) for an unseen language, that is, a language that is not part of the development. Finally, we examine multilingual learning for unsupervised morphological segmentation in low-resource setups. For unsupervised POS tagging, two cross-lingual approaches have been widely adopted: 1) annotation projection, where POS annotations are projected across an aligned parallel text from a source language for which a POS tagger is accessible to the target one prior to training a POS model; and 2) zero-shot model transfer, where a model of a source language is directly applied to texts in the target language. We propose an end-to-end architecture for unsupervised cross-lingual POS tagging via annotation projection in truly low-resource scenarios that do not assume access to parallel corpora that are large in size or represent a specific domain. We integrate and expand the best practices in alignment and projection and design a rich neural architecture that exploits non-contextualized and transformer-based contextualized word embeddings, affix embeddings and word-cluster embeddings. Additionally, since parallel data might be available between the target language and multiple source ones, as in the case of the Bible, we propose different approaches for learning from multiple sources. Finally, we combine our work on unsupervised morphological segmentation and unsupervised cross-lingual POS tagging by conducting unsupervised stem-based cross-lingual POS tagging via annotation projection. This approach relies on the stem as the core unit of abstraction for alignment and projection, which is beneficial to low-resource morphologically complex languages.
We also examine morpheme-based alignment and projection, the use of linguistic priors towards better POS models and the use of segmentation information as learning features in the neural architecture. We conduct comprehensive evaluation and analysis to assess the performance of our approaches to unsupervised morphological segmentation and unsupervised POS tagging and show that they achieve state-of-the-art performance for the two morphology tasks when evaluated on a large set of languages of different typologies: analytic, fusional, agglutinative and synthetic/polysynthetic.

Columbia University Academic Commons

Assessing 6th and 8th Grades Students’ Reading Skills and Literacy in Kazakh, Russian, and English Languages in Kazakhstan

Author: Akhmetova Aigul
Publication date: 16/05/2022

This research study aimed to assess and explore the issue of poor reading literacy skills among young learners in middle school in Kazakhstan. In particular, our broader goal is to develop a modified framework of recommendations and suggestions for teaching and learning reading skills in Kazakhstan in the native and second languages (i.e. Kazakh or Russian), and in English as a foreign language. Consequently, we first focused on assessing reading skills and literacy development in the Kazakh and Russian languages as a native and/or second language (L1/L2), and in English as a foreign language (EFL). This could help us to define the core issue in teaching reading skills to young Kazakhstani learners. Secondly, we identified several factors through questionnaires regarding students’ socio-economic status, reading attitude, classroom climate, engagement, and metacognitive awareness during the reading process. This evidence may help explain the poor results in reading literacy among Kazakhstani 15-year-old students, which were below average in international surveys such as PISA and PIRLS. Thirdly, we assessed 6th and 8th grade students’ reading skills in English, Kazakh, and Russian. In total, N = 4,274 participants took part in the computer-based assessment. Finally, the obtained results may allow us to provide suggestions and recommendations for reading literacy, and for further modification of the assessment process in middle secondary education in Kazakhstan. Even though young adolescents had a positive reading attitude and gender differences were insignificant (49.9% - boys, 51.2% - girls), the analysis showed that middle school learners in Kazakhstan had poor reading skills in the target languages. In addition, we found that latent factors (i.e., classroom climate, engagement, reading attitude, and reading strategies) did not affect reading comprehension test results in the English, Kazakh, and Russian languages.
The weak relationship of classroom climate and engagement with reading achievement might indicate an insufficient learning environment, low teacher-student interaction, and scarce support from peers or parents for reading skills in the target languages. Moreover, the analysis highlighted several drawbacks in teaching and learning reading skills and developing literacy in these languages among young learners. Accordingly, policy makers, teachers, parents, and other stakeholders should pay serious attention to the content of the core curricular programme for teaching reading skills in L1, L2, and EFL. The findings also suggested that reading for pleasure in and out of school might be considered a challenging activity for children in middle school. Interestingly, though, bilingual learners seemed to use more reading strategies in reading comprehension tests than monolinguals in the respective languages. Therefore, to boost the importance of reading literacy for young learners in middle school, appropriate programmes are required to prepare young learners to think critically and to improve quality. Furthermore, a qualitative assessment of teachers and school staff is necessary to explore the quality of teaching and assessing reading literacy skills in the respective languages.

SZTE Doktori Értekezések Repozitórium (SZTE Repository of Dissertations)

An analysis of Kazakhstan and its energy sector using SAM and CGE modeling

Author: Naumov Alexander
Publication venue: Management and Languages
Publication date: 01/07/2009

The primary focus of this thesis is the contribution of the oil and gas industry to Kazakhstan’s recent economic development. This industry is analyzed in a broader context with the help of economy-wide modeling tools such as the Computable General Equilibrium (CGE) model, Social Accounting Matrices (SAM) and Input-Output models. Such an approach makes it possible to take into account all possible linkages the oil and gas industry has with the rest of the economy. The first chapter presents a literature review of CGE studies with an emphasis on applications to energy and transition economies. The thesis proceeds with a description of building a CGE model for Kazakhstan and constructing the SAM. Subsequently, using the above-mentioned tools, Chapter Four analyses the spillover impact of the oil and gas sector on the rest of the economy. The study establishes that the sector accounted directly and indirectly for about forty percent of economic growth between 2001 and 2005. The final chapter develops an analytical framework to correct the representation of the oil and gas sector in the national accounts, which is distorted by transfer pricing. When adjusted for transfer pricing, the GDP share of the oil and gas sector in 2001 increases to 16.1 percent compared to the officially reported 8.6 percent.

ROS: The Research Output Service. Heriot-Watt University Edinburgh

A conditional theory of the ‘political resource curse:’ oil, autocrats, and strategic contexts

Author: Ahmadov Anar
Publication date: 01/09/2011

A burgeoning literature argues that the abundance of oil in developing countries strengthens autocratic rule and erodes democracy. However, extant studies either show the average cross-national correlation between oil and political regime or develop particularistic accounts that do not easily lend themselves to theorizing. Consequently, we know little of the causal mechanisms that potentially link oil wealth to undemocratic outcomes and the conditions that would help explain the ultimate, not average, effect of oil on political regime. This study develops a conditional theory of the “political resource curse.” It does so by undertaking a statistical reassessment of the relationship between oil wealth and political regime and a nuanced qualitative examination of a set of carefully selected cases in order to contribute to developing an adequate account of causal mechanisms that transmit and conditions that shape the relationship between oil abundance and autocracy. It draws on qualitative and quantitative evidence collected over eighteen months of fieldwork in oil-rich former Soviet countries of Azerbaijan, Kazakhstan, and Turkmenistan, and the ‘counterfactual’ oil-poor Kyrgyzstan. Employing a theoretical framework that draws on insights from the rentier state theory, historical institutionalism, and rational choice institutionalism, I trace, compare, and contrast the processes that potentially link oil wealth to regime outcomes in these countries between 1989 and 2010. The findings strongly suggest that political regime differences can be better explained by the interaction of oil wealth with several structural and institutional variables rather than by oil abundance or another single factor alone. A thorough qualitative analysis of the post-Soviet cases shows that the causal mechanisms hypothesized in the ‘resource curse’ literature were neither necessarily present, nor uniform across these cases and throughout the post-Soviet period. 
This was because a particular interaction of exogenous variables and oil wealth affected the causal mechanisms differently, ultimately entailing different regime outcomes. The spread of alternative political elites, relative size of the ethnic minority with ties to a powerful kin state, and oil production geography were key exogenous factors that consistently interacted with oil in affecting the political regimes.

LSE Theses Online

Combination of Rule-Based and N-Gram Stemming Methods for Balinese Stemming (Kombinasi Metode Rule-Based dan N-Gram Stemming untuk Mengenali Stemmer Bahasa Bali)

Author: Fatichah Chastine
Subali Made Agus Putra
Publication venue: Fakultas Ilmu Komputer Universitas Brawijaya
Publication date: 25/02/2019

The process of extracting a stem word from an affixed word is known as stemming, which aims to increase recall by reducing the variations of an affixed word to its stem form. Previous research on stemming the Balinese language used a rule-based method, but the only affixes removed were prefixes and suffixes, while other variations of affixes, such as infixes, confixes, simulfixes, and combinations of affixes, were not removed. Stemming with the rule-based approach has been applied to a variety of languages. The rule-based method has the advantage of being easy to verify and validate when applied to a simple domain, but it has a weakness when applied to domains with a high level of complexity: if the system cannot recognize the rules, no result is obtained. To overcome this weakness of rule-based stemming, we use the n-gram stemming method, in which the affixed word and the stem word are converted to n-gram form and the level of similarity between the n-grams of the affixed word and the n-grams of the stem word is measured using the Dice coefficient. When the level of similarity meets the defined threshold value, the stem word compared with the affixed word is displayed. In this study, we developed a stemmer that removes all variations of affixes in the Balinese language by combining the rule-based approach and the n-gram stemming method. In the experiments on ten queries, the proposed method achieved an average stemming accuracy of 96.67%, compared with 75% for the previous method, while on five queries the n-gram stemming method recognized some affixed words outside the rules.
In future research, we will pay attention to the semantics of each word and to a validation stage using a text mining application.
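
The n-gram stemming step described in this abstract can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: it assumes character bigrams, treats each word as a set of unique n-grams, and uses a hypothetical threshold of 0.6. The English example words and the names `char_ngrams`, `dice_coefficient`, and `ngram_stem` are assumptions made for this example, not taken from the paper.

```python
def char_ngrams(word, n=2):
    """Return the set of unique character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_coefficient(word_a, word_b, n=2):
    """Dice coefficient between the n-gram sets of two words."""
    grams_a = char_ngrams(word_a, n)
    grams_b = char_ngrams(word_b, n)
    if not grams_a and not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def ngram_stem(word, stem_lexicon, threshold=0.6, n=2):
    """Return the most similar stem if it meets the threshold, else None.

    stem_lexicon: list of known stem words (a hypothetical dictionary).
    threshold: assumed cut-off; the paper's actual value is not given here.
    """
    best_stem, best_score = None, 0.0
    for stem in stem_lexicon:
        score = dice_coefficient(word, stem, n)
        if score > best_score:
            best_stem, best_score = stem, score
    return best_stem if best_score >= threshold else None


# Toy lexicon with English words standing in for Balinese stems.
lexicon = ["walk", "talk", "work"]
similarity = dice_coefficient("walking", "walk")
stem = ngram_stem("walking", lexicon)
unknown = ngram_stem("xyz", lexicon)
```

In a combined stemmer of the kind the abstract describes, a lookup like this would serve as a fallback for words that the affix rules fail to handle; the order in which the two components are applied is not specified in the abstract.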

Jurnal Teknologi Informasi dan Ilmu Komputer