Deep Learning for Opinion Mining and Topic Classification of Course Reviews
Student opinions for a course are important to educators and administrators,
regardless of the type of course or institution. Reading and manually
analyzing open-ended feedback becomes infeasible for the massive volumes of
comments at the institution level or on online forums. In this paper, we
collected and pre-processed a large number of course reviews publicly available
online. We applied machine learning techniques with the goal of gaining insight
into student sentiments and topics. Specifically, we utilized current Natural
Language Processing (NLP) techniques, such as word embeddings and deep neural
networks, as well as the state-of-the-art transformer models BERT
(Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly
Optimized BERT Approach), and XLNet (Generalized Autoregressive Pretraining).
We performed extensive experimentation to compare these techniques with
traditional approaches. This
comparative study demonstrates how to apply modern machine learning approaches
for sentiment polarity extraction and topic-based classification utilizing
course feedback. For sentiment polarity, the top model was RoBERTa with 95.5%
accuracy and 84.7% F1-macro, while for topic classification, an SVM (Support
Vector Machine) was the top classifier with 79.8% accuracy and 80.6% F1-macro.
We also provided an in-depth exploration of the effect of certain
hyperparameters on model performance and discussed our observations. These
findings can be used by institutions and course providers as a guide for
analyzing their own course feedback using NLP models towards self-evaluation
and improvement.
Comment: Accepted and published in Education and Information Technologies (accepted March 2023).
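The abstract reports both accuracy and F1-macro; on imbalanced review data the two can diverge sharply, since F1-macro weights every sentiment class equally. As a minimal illustration of how these two reported metrics are computed (a stdlib sketch, not the paper's evaluation code), the function and label names below are our own:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: per-class F1 scores averaged with equal weight,
    so a minority sentiment class counts as much as the majority class."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)
```

This equal per-class weighting is why the paper's RoBERTa model can reach 95.5% accuracy while F1-macro sits at 84.7%: rarer sentiment classes pull the macro average down.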
ViCGCN: Graph Convolutional Network with Contextualized Language Models for Social Media Mining in Vietnamese
Social media processing is a fundamental task in natural language processing
with numerous applications. As Vietnamese social media and information science
have grown rapidly, the necessity of information-based mining on Vietnamese
social media has become crucial. However, state-of-the-art research faces
several significant drawbacks, chief among them imbalanced and noisy data on
social media platforms. Graph Convolutional Networks can address both problems
in text classification on social media by taking advantage of the graph
structure of the data. This study presents a novel approach based on a
contextualized language model (PhoBERT) and a graph-based method (Graph
Convolutional Networks). In particular, the proposed approach, ViCGCN, jointly
trains contextualized embeddings with Graph Convolutional Networks (GCN) to
capture more syntactic and semantic dependencies and address those drawbacks.
Extensive experiments on
various Vietnamese benchmark datasets were conducted to verify our approach.
Our observations show that applying GCN to BERTology models as the final layer
significantly improves performance. Moreover, the experiments demonstrate that
ViCGCN outperforms 13 powerful baseline models, including BERTology models,
fusion BERTology and GCN models, other baselines, and SOTA on three benchmark
social media datasets. Our proposed ViCGCN approach demonstrates a significant
improvement of up to 6.21%, 4.61%, and 2.63% over the best contextualized
language models, including multilingual and monolingual ones, on three benchmark
datasets: UIT-VSMEC, UIT-ViCTSD, and UIT-VSFC, respectively. Additionally, our
integrated model ViCGCN achieves the best performance compared to other
BERTology models integrated with GCN.
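The core operation the abstract relies on, a graph convolution applied on top of contextual embeddings, can be sketched in plain Python. This is a generic single GCN layer (H' = ReLU(Â·H·W) with symmetrically normalized adjacency), not ViCGCN's actual implementation; treating the node features as PhoBERT sentence embeddings is our assumption about how it would be wired up:

```python
import math

def gcn_layer(adj, feats, weights):
    """One graph-convolution layer H' = ReLU(A_hat @ H @ W), where A_hat is
    the adjacency with self-loops, symmetrically normalized by node degree.
    adj: n x n 0/1 adjacency; feats: n x d node features (e.g. contextual
    sentence embeddings); weights: d x k projection. Plain nested lists."""
    n = len(adj)
    # Add self-loops, then normalize: A_hat = D^{-1/2} (A + I) D^{-1/2}.
    a = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    a_hat = [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
             for i in range(n)]

    def matmul(x, y):
        return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
                 for j in range(len(y[0]))] for i in range(len(x))]

    h = matmul(matmul(a_hat, feats), weights)
    # ReLU non-linearity.
    return [[max(0.0, v) for v in row] for row in h]
```

Each node's new representation is a degree-weighted average of its neighbours' features, which is how the graph structure lets noisy or under-represented examples borrow signal from related texts.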
English/Russian lexical cognates detection using NLP Machine Learning with Python
Language learning is a remarkable endeavor that expands our horizons and allows us to connect with diverse cultures and people around the world. Traditionally, language education has relied on conventional methods such as textbooks, vocabulary drills, and language exchanges. However, with the advent of machine learning, a new era has dawned upon language instruction, offering innovative and efficient ways to accelerate language acquisition. One intriguing application of machine learning in language learning is the utilization of cognates, words that share similar meanings and spellings across different languages. To address this subject, this research paper proposes to facilitate the process of acquiring a second language with the help of artificial intelligence, in particular neural networks, which can identify and use words that are similar or identical in both the learner's first language and the target language. These words, known as lexical cognates, can facilitate language learning by providing a familiar point of reference for the learner and enabling them to associate new vocabulary with words they already know. By leveraging the power of neural networks to detect and utilize these cognates, learners will be able to accelerate their progress in acquiring a second language.
Although the study of semantic similarity across different languages is not a new topic, our objective is to adopt a different approach for identifying Russian-English lexical cognates and to present the obtained results as a language learning tool, using a sample of lexical and semantic similarity data across languages to build a lexical cognate detection and word association model. Subsequently, depending on our analysis and results, we will present a word association application that can be used by end users. Given that Russian and English are among the most widely spoken languages globally and that Russia is a popular destination for international students from around the world, this served as a significant motivation to develop an AI tool to assist English speakers learning Russian and Russian speakers learning English.
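A simple baseline for the lexical-cognate detection the abstract describes is orthographic similarity after transliteration. The sketch below is our own illustration, not the paper's model: the transliteration table is a deliberately simplified, hypothetical subset, and a real system would combine this with the semantic-similarity features the authors mention.

```python
def levenshtein(a, b):
    """Classic edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Simplified Cyrillic-to-Latin transliteration (illustrative subset only).
TRANSLIT = {"а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e",
            "и": "i", "к": "k", "л": "l", "м": "m", "н": "n", "о": "o",
            "п": "p", "р": "r", "с": "s", "т": "t", "у": "u", "ф": "f"}

def cognate_score(russian, english):
    """Orthographic similarity in [0, 1] after transliteration; values
    near 1 flag a cognate candidate for a learner-facing word list."""
    latin = "".join(TRANSLIT.get(ch, ch) for ch in russian.lower())
    dist = levenshtein(latin, english.lower())
    return 1 - dist / max(len(latin), len(english))
```

For example, "проблема" transliterates to "problema", one edit away from "problem", so it scores high; unrelated pairs score near zero. False friends (similar spelling, different meaning) are exactly why a spelling-only score must be checked against semantic similarity.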
Cross-generational linguistic variation in the Canberra Vietnamese heritage language community: A corpus-centred investigation
This dissertation investigates cross-generational linguistic differences in the Canberra Vietnamese bilingual community, with a particular focus on Vietnamese as the heritage language. Specifically, it documents the vernacular and considers key aspects of this data from different theoretical perspectives. Its main contribution is an insight into a rarely studied heritage language variety in a contact community that has never been examined.
The dissertation consists of five core chapters, organised into two parts. In the first part (Chapters 2–3), I describe how I documented the vernacular and created the Canberra Vietnamese English Corpus (CanVEC), an original corpus compiled specifically for this study that is also the first to be freely available for research purposes. The corpus consists of over ten hours of spontaneous speech produced by 45 Vietnamese-English bilingual speakers across two generations living in Canberra. In the second part of the study (Chapters 4–6), I put the corpus to use and investigate aspects of the cross-generational differences in Vietnamese as the heritage language in this community.
In particular, I first probe the Vietnamese heritage language via its participation in the code-switching discourse (Chapter 4). In doing so, I focus on the applicability of the Matrix Language Framework (MLF) (Myers-Scotton, 1993, 2002) and its associated Matrix Language (ML) Turnover Hypothesis (Myers-Scotton, 1998) to the code-switching data in CanVEC. Since support for this prominent model has mainly come from language pairs that have different clausal word order or vastly different inventories of inflectional morphology, Vietnamese-English as a pair in which both languages are SVO and essentially isolating offers a tantalising testing ground for its application. Results show that the universal claims of this model do not hold so straightforwardly. CanVEC data challenges several assumptions of the MLF, with the model ultimately only being able to account for around half of the CanVEC code-switching data. I further demonstrate that even when the ML is putatively identifiable and a cross-generational ML ‘turnover’ is quantitatively observed, the predictions do not reflect the direction of structural influence that we see in CanVEC. The MLF approach therefore sheds only limited light on cross-generational language shift and variation in this community.
Given that null elements emerge as a distinct area of difficulty in Chapter 4, I take this aspect as the focal point for the next part of the investigation (Chapter 5), where I use the variationist approach (Labov, 1972 et seq.) to explore three cases where null and overt realisations alternate in Vietnamese: subjects, objects, and copulas. In doing so, I move away from the bilingual portion of CanVEC to examine the monolingual heritage Vietnamese subset directly. Results show that Vietnamese null subjects vary significantly across generations, while null objects and copulas remain stable in terms of use. As speakers also overwhelmingly prefer overt forms over null forms (∼70:30) across all three variables of interest, I appeal to the generative interface-oriented approach (Sorace & Filiaci, 2006 et seq.) to next examine the distribution of overt subjects, objects, and copulas (Chapter 6). These results converge with what was found for null forms: cross-generational effects were observed for pronominal subjects, but not pronominal objects and copulas. This finding also supports the importance of a distinction drawn in previous works between internal (syntax-semantics) and external (syntax-discourse/pragmatics) interface phenomena, with the latter being seemingly more susceptible to change.
Ultimately, this dissertation highlights the empirical and theoretical value of studying rarely considered contact varieties, while deploying an integrated approach that acknowledges the multi-faceted complexity of the contact communities where these varieties are spoken.
Cambridge Trust International Scholarship
ACQUIRING SYNTACTIC VARIATION: REGULARIZATION IN WH-QUESTION PRODUCTION
Children are often exposed to language-internal variation. Studying the acquisition of variation allows us to understand more about children’s ability to acquire probabilistic input, their preferences at choice points, and factors contributing to such preference. Using wh-variation as a case study, this dissertation explores the acquisition of syntactic variation through corpus analyses, behavioral experiments, and computational simulation.
In English and some other languages (e.g., French, Brazilian Portuguese, etc.), information-seeking wh-questions allow for at least two variants: a wh-in-situ variant and a fronted-wh variant. How do English-speaking children acquire wh-variation, and what factors condition their course of acquisition? Experimental results show that 3- to 5-year-old children regularize to fronted wh-questions in their production even in contexts that allow for both variants to be used interchangeably. Based on the characteristics of the variants, two factors are identified as potentially contributing to the preference for fronted wh-questions: frequency and discourse restrictions. Two artificial language learning (ALL) experiments are then conducted so that the effect of discourse can be studied separately from frequency. The results show that learners prefer the variant with fewer or no discourse restrictions (i.e., the fronted-wh variant) when frequency is controlled. Thus, regularization in language acquisition is conditioned by both domain-general factors, such as frequency, and language-specific factors, such as discourse markedness.
The dissertation also looks into the motivation for regularization. One prominent hypothesis is that regularization serves as a means to reduce the cognitive burden associated with learning multiple variants at once. Instead of mastering all the variants, learners can simplify the learning process and minimize their chance of violating a constraint by producing the dominant variant. This work provides additional evidence for the hypothesis in three ways. First, we replicate the findings that tasks that are more cognitively taxing induce more regularization. Second, we present new evidence that participants with a lower composite working memory score tend to have a higher regularization rate. Third, we provide a computational simulation showing that regularization behavior only happens when an intake limit (reflecting limited working memory capacity) and a parsimony bias to reduce the cognitive burden are incorporated in the model.
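The mechanism the simulation argument turns on can be illustrated with a toy model. This sketch is our own construction, not the dissertation's simulation: the intake limit caps how much input the learner processes per trial, and the parsimony bias makes the learner sometimes produce the majority variant in its intake instead of probability-matching. All parameter values here are illustrative assumptions.

```python
import random

def simulate(p_dominant=0.7, intake_limit=5, parsimony=0.5,
             trials=10000, seed=0):
    """Toy regularization model: returns the learner's production rate of
    the dominant variant, given its rate `p_dominant` in the input.
    Each trial, the learner's intake is `intake_limit` input tokens; with
    probability `parsimony` it produces the majority variant of that intake
    (regularizing), otherwise it probability-matches the intake."""
    rng = random.Random(seed)
    dominant = 0
    for _ in range(trials):
        intake = [rng.random() < p_dominant for _ in range(intake_limit)]
        share = sum(intake) / intake_limit
        if rng.random() < parsimony:
            produced = share >= 0.5          # parsimony: pick the majority
        else:
            produced = rng.random() < share  # probability matching
        dominant += produced
    return dominant / trials
```

With both ingredients active, the produced rate of the dominant variant rises above its 70% input rate; setting `parsimony=0` recovers probability matching at roughly the input rate, which mirrors the dissertation's claim that regularization only emerges when the intake limit and the parsimony bias are combined.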