24 research outputs found

    Native language identification of fluent and advanced non-native writers

    This is an accepted manuscript of an article published by ACM in ACM Transactions on Asian and Low-Resource Language Information Processing in April 2020, available online: https://doi.org/10.1145/3383202 The accepted version of the publication may differ from the final published version. Native Language Identification (NLI) aims at identifying the native languages of authors by analyzing text samples they have written in a non-native language. Most existing studies investigate this task for educational applications such as second language acquisition and require learner corpora. This article performs NLI in the challenging context of user-generated content (UGC), where authors are fluent and advanced non-native speakers of a second language. Existing NLI studies with UGC (i) rely on content-specific/social-network features and may not be generalizable to other domains and datasets, (ii) are unable to capture the variations of language-usage patterns within a text sample, and (iii) are not associated with any outlier handling mechanism. Moreover, since a sizable number of people have acquired non-English second languages due to economic and immigration policies, there is a need to gauge the applicability of NLI with UGC to other languages. Unlike existing solutions, we define a topic-independent feature space, which makes our solution generalizable to other domains and datasets. Based on our feature space, we present a solution that mitigates the effect of outliers in the data and helps capture the variations of language-usage patterns within a text sample. Specifically, we represent each text sample as a point set and identify the top-k stylistically similar text samples (SSTs) from the corpus. We then apply the probabilistic k-nearest-neighbors classifier on the identified top-k SSTs to predict the native languages of the authors. To conduct experiments, we create three new corpora, each written in a different language, namely English, French, and German. Our experimental studies show that our solution outperforms competitive methods and reports more than 80% accuracy across languages. Research funded by Higher Education Commission, and Grants for Development of New Faculty Staff at Chulalongkorn University | Digital Economy Promotion Agency (# MP-62-0003) | Thailand Research Funds (MRG6180266 and MRG6280175). Published version

    The impact of phrases on Italian lexical simplification

    Automated lexical simplification has so far focused only on the replacement of single tokens with single tokens, and this choice has affected both the development of systems and the creation of benchmarks. In this paper, we argue that lexical simplification in real settings should deal with both single and multi-token terms, and present a benchmark created for the task. In addition, we describe how a freely available system can be tuned to also cover the simplification of phrases, and we perform an evaluation comparing different experimental settings.

    Insights to Problems, Research Trend and Progress in Techniques of Sentiment Analysis

    Research-based implementations of sentiment analysis are about a decade old and have introduced many significant algorithms, techniques, and frameworks for enhancing its performance. The applicability of sentiment analysis to business and political surveys is immense. However, we strongly feel that existing research progress in sentiment analysis is not on par with the demands of massively increasing dynamic data in pervasive environments. The problems associated with opinion mining over such data have been less addressed, which leaves major scope for research. This paper briefly reviews existing research trends and some important recent research implementations, and explores some major open issues in sentiment analysis. We believe that this manuscript gives a progress report, with a snapshot of the effectiveness of research techniques for sentiment analysis, to help upcoming researchers identify research gaps and direct their work accordingly.

    Towards Personalised Simplification based on L2 Learners' Native Language

    We present an approach to improve the selection of complex words for automatic text simplification, addressing the need to take L2 learners' native language into account during simplification. In particular, we develop a methodology that automatically identifies ‘difficult’ terms (i.e. false friends) for L2 learners in order to simplify them. We evaluate not only the quality of the detected false friends but also the impact of this methodology on text simplification compared with a standard frequency-based approach.

    Multilingual Knowledge Base Completion by Cross-lingual Semantic Relation Inference

    In the present paper, we propose a simple endogenous method for enhancing a multilingual knowledge base through cross-lingual semantic relation inference. It can be run on multilingual resources prior to semantic representation learning. Multilingual knowledge bases may integrate preexisting structured resources available for resource-rich languages. We aim at performing cross-lingual inference on them to improve the low-resource language by creating semantic relationships.

    Graph clustering for natural language processing

    Graph-based representations have proven to be an effective approach for a variety of Natural Language Processing (NLP) tasks. Graph clustering makes it possible to extract useful knowledge by exploiting the implicit structure of the data. In this tutorial, we will present several efficient graph clustering algorithms and show their strengths and weaknesses as well as their implementations and applications. Then, the evaluation methodology in unsupervised NLP tasks will be discussed.

    The Sentiment Problem: A Critical Survey towards Deconstructing Sentiment Analysis

    We conduct an inquiry into the sociotechnical aspects of sentiment analysis (SA) by critically examining 189 peer-reviewed papers on their applications, models, and datasets. Our investigation stems from the recognition that SA has become an integral component of diverse sociotechnical systems, exerting influence on both social and technical users. By delving into sociological and technological literature on sentiment, we unveil distinct conceptualizations of this term in domains such as finance, government, and medicine. Our study exposes a lack of explicit definitions and frameworks for characterizing sentiment, resulting in potential challenges and biases. To tackle this issue, we propose an ethics sheet encompassing critical inquiries to guide practitioners in ensuring equitable utilization of SA. Our findings underscore the significance of adopting an interdisciplinary approach to defining sentiment in SA and offer a pragmatic solution for its implementation. Comment: This paper has been accepted and will appear at the EMNLP 2023 Main Conference

    Rule-based Syntactic Simplifier for Texts in Estonian

    This Bachelor's thesis introduces a web-based application, implemented as its practical part, that syntactically simplifies text in Estonian. The app is meant for language learners, and its main purpose is to turn composite sentences into grammatically correct simple sentences according to given rules. The application relies on tools for analysing Estonian text, chiefly a clause segmenter, but a syntactic analyser, a morphological analyser, and a morphological synthesizer are used as well.

    A Comprehensive Survey on Word Representation Models: From Classical to State-Of-The-Art Word Representation Language Models

    Word representation has always been an important research area in the history of natural language processing (NLP). Understanding such complex text data is imperative, given that it is rich in information and can be used widely across various applications. In this survey, we explore different word representation models and their power of expression, from the classical to modern-day state-of-the-art word representation language models (LMs). We describe the variety of text representation methods and model designs that have blossomed in the context of NLP, including SOTA LMs. These models can transform large volumes of text into effective vector representations that capture the same semantic information. Further, such representations can be utilized by various machine learning (ML) algorithms for a variety of NLP-related tasks. Finally, this survey briefly discusses the commonly used ML- and DL-based classifiers, evaluation metrics, and the applications of these word embeddings in different NLP tasks.