
    Scaling Native Language Identification with Transformer Adapters

    Native language identification (NLI) is the task of automatically identifying the native language (L1) of an individual based on their language production in a learned language. It is useful for a variety of purposes, including marketing, security, and educational applications. NLI is usually framed as a multi-class classification task, where numerous hand-designed features are combined to achieve state-of-the-art results. Recently, a deep generative approach based on transformer decoders (GPT-2) outperformed its counterparts and achieved the best results on the NLI benchmark datasets. We investigate this approach to determine its practical implications compared to traditional state-of-the-art NLI systems. We introduce transformer adapters to address memory limitations and to improve training/inference speed, scaling NLI applications for production.
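    The memory saving that motivates adapters can be illustrated with a back-of-the-envelope parameter count. The sketch below uses generic BERT-base-style dimensions and a typical bottleneck size chosen for illustration; these are assumptions, not figures from the paper. Instead of storing a fully fine-tuned copy of the model per deployment, only the small adapter weights need to be stored and trained.

    ```python
    # Rough parameter counts for a transformer encoder vs. bottleneck adapters.
    # All dimensions are illustrative assumptions (BERT-base-like), not taken
    # from the paper above.

    def transformer_params(layers=12, d_model=768, d_ff=3072):
        # Per layer: four attention projection matrices (Q, K, V, output)
        # plus the two feed-forward matrices; biases omitted for simplicity.
        per_layer = 4 * d_model * d_model + 2 * d_model * d_ff
        return layers * per_layer

    def adapter_params(layers=12, d_model=768, bottleneck=64):
        # One adapter = down-projection + up-projection (weights and biases);
        # a common placement inserts two adapters per layer.
        one_adapter = (d_model * bottleneck + bottleneck
                       + bottleneck * d_model + d_model)
        return layers * 2 * one_adapter

    full = transformer_params()      # ~85M weights to fine-tune and store
    added = adapter_params()         # ~2.4M adapter weights
    ratio = added / full             # adapters are only a few percent extra
    ```

    Under these assumed dimensions the adapters amount to roughly 3% of the full model's weights, which is the kind of saving that makes per-application NLI models practical to serve at scale.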

    How Good are Humans at Native Language Identification? A Case Study on Italian L2 writings

    In this paper we present a pilot study on human performance for the Native Language Identification task. We performed two tests aimed at exploring the human baseline for the task, in which test takers had to identify the writers’ L1 relying only on scripts written in Italian by English, French, German, and Spanish native speakers. We then conducted an error analysis considering the language background of both test takers and text writers.

    On the Development of Large Scale Corpus for Native Language Identification


    Improving Data Quality in Customer Relationship Management Systems: a method for cleaning personal information

    In recent years, data literacy has become critical in areas such as marketing, sales, and more generally in businesses. These companies rely on software such as Customer Relationship Management (CRM) systems to derive useful information from the vast amount of data collected. However, lack of data quality undermines the effectiveness of this approach, as it directly impacts overall business performance. This thesis investigates the various issues and challenges related to data quality in CRM systems, focusing particularly on datasets with attributes such as first name, last name, and e-mail address. In addition, an algorithm for cleaning such datasets is proposed.
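    Cleaning records with name and e-mail attributes can be sketched with a few normalization rules. The rules below (collapsing whitespace, title-casing names, lower-casing e-mails, a simple pattern check, and deduplication by normalized address) are illustrative assumptions, not the algorithm proposed in the thesis.

    ```python
    import re

    # Loose illustrative e-mail pattern; real-world validation is more involved.
    EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")

    def clean_record(record):
        """Normalize one CRM record with first_name / last_name / email keys."""
        first = " ".join(record.get("first_name", "").split()).title()
        last = " ".join(record.get("last_name", "").split()).title()
        email = record.get("email", "").strip().lower()
        return {
            "first_name": first,
            "last_name": last,
            "email": email,
            "email_valid": bool(EMAIL_RE.match(email)),
        }

    def deduplicate(records):
        """Keep the first record seen for each normalized e-mail address."""
        seen, out = set(), []
        for r in map(clean_record, records):
            if r["email"] not in seen:
                seen.add(r["email"])
                out.append(r)
        return out
    ```

    Normalizing before deduplicating matters: two entries like "Maria.Rossi@Example.COM" and "maria.rossi@example.com" refer to the same contact but would survive a naive exact-match comparison.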

    One Model to Rule them all: Multitask and Multilingual Modelling for Lexical Analysis

    When learning a new skill, you take advantage of your preexisting skills and knowledge. For instance, if you are a skilled violinist, you will likely have an easier time learning to play cello. Similarly, when learning a new language you take advantage of the languages you already speak. For instance, if your native language is Norwegian and you decide to learn Dutch, the lexical overlap between these two languages will likely benefit your rate of language acquisition. This thesis deals with the intersection of learning multiple tasks and learning multiple languages in the context of Natural Language Processing (NLP), which can be defined as the study of computational processing of human language. Although these two types of learning may seem different on the surface, we will see that they share many similarities. The traditional approach in NLP is to consider a single task for a single language at a time. However, recent advances allow for broadening this approach, by considering data for multiple tasks and languages simultaneously. This is an important approach to explore further, as the key to improving the reliability of NLP, especially for low-resource languages, is to take advantage of all relevant data whenever possible. In doing so, the hope is that in the long term, low-resource languages can benefit from the advances made in NLP which are currently to a large extent reserved for high-resource languages. This, in turn, may then have positive consequences for, e.g., language preservation, as speakers of minority languages will have less pressure to use high-resource languages. In the short term, answering the specific research questions posed should be of use to NLP researchers working towards the same goal. (PhD thesis, University of Groningen.)

    Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020

    On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC) which, after six years of activity, has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.