
    Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign

    We present the results and the findings of the Second VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects. The campaign was organized as part of the fifth edition of the VarDial workshop, collocated with COLING 2018. This year, the campaign included five shared tasks: two task re-runs, Arabic Dialect Identification (ADI) and German Dialect Identification (GDI), and three new tasks, Morphosyntactic Tagging of Tweets (MTT), Discriminating between Dutch and Flemish in Subtitles (DFS), and Indo-Aryan Language Identification (ILI). A total of 24 teams submitted runs across the five shared tasks and contributed 22 system description papers, which were included in the VarDial workshop proceedings and are referred to in this report.
    Non peer reviewed

    Experiments in Language Variety Geolocation and Dialect Identification

    Peer reviewed

    Language identification in texts

    This work investigates the task of identifying the language of digitally encoded text. Automatic methods for language identification have been developed since the 1960s. Over the years, the significance of language identification as an important preprocessing step has grown as other natural language processing systems have become mainstream in day-to-day applications. The methods used for language identification are mostly shared with other text classification tasks, as almost any modern machine learning method can be trained to distinguish between different languages. We begin the work by taking a detailed look at the research conducted in the field so far. As part of this work, we provide the largest survey on language identification available to date. Comparing the performance of different language identification methods presented in the literature has been difficult in the past. Before the introduction of a series of language identification shared tasks at the VarDial workshops, there were no widely accepted standard datasets that could be used to compare different methods. The shared tasks mostly concentrated on the issue of distinguishing between similar languages, but other open issues relating to language identification were addressed as well. In this work, we present the methods for language identification that we developed while participating in the shared tasks from 2015 to 2017. Most of the research for this work was accomplished within the Finno-Ugric Languages and the Internet project. In that project, our goal was to find and collect texts written in rare Uralic languages on the Internet. In addition to the open issues addressed at the shared tasks, we dealt with issues concerning domain compatibility and the number of languages. We created an evaluation set-up for addressing short out-of-domain texts in a large number of languages. Using the set-up, we evaluated our own method as well as other promising methods from the literature. The last issue we address in this work is the handling of multilingual documents. We developed a method for language set identification and used a previously published dataset to evaluate its performance.
    This dissertation investigates the automatic identification of the language of digitally encoded text. Automatic methods for language identification have been developed since the 1960s. Over the past decades, the importance of language identification as a component of larger information systems has gradually grown. The language of a text needs to be identified so that suitable language technology methods can be applied in its further processing. Language identification is the task of determining the language or languages of a text whose language is unknown. For the most part, the methods used for language identification are, or can be, used to classify texts by other properties as well, such as topic. In the survey article included in this article-based dissertation, we give a broad account of the research on language identification to date and comprehensively review the methods that have been used for language identification so far. The next three articles of the dissertation present the language identification methods we used in the international language identification shared tasks organized in connection with the VarDial workshops from 2015 to 2017. Most of the research in this dissertation was carried out as part of the Finno-Ugric Languages and the Internet project, funded by the Kone Foundation. The goal of the project was to find texts on the Internet written in the rarer Uralic languages, and the fifth article of the dissertation focuses on describing the early stages of the project. The sixth article describes how the language identifier attached to the project's web crawler was evaluated in a demanding test environment containing texts written in 285 different languages. The seventh and final article deals with determining the set of languages of multilingual texts.
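    As a minimal illustration of the kind of approach surveyed in this thesis (not the author's actual system), the sketch below trains a character n-gram language identifier with scikit-learn. The tiny training texts and label names are placeholders; a real setup would use a corpus such as the VarDial shared-task data.

```python
# A toy character n-gram language identifier; data and labels are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "this is a short english sentence",
    "the weather is nice today",
    "tämä on lyhyt suomenkielinen lause",
    "sää on tänään mukava",
]
train_labels = ["eng", "eng", "fin", "fin"]

# Character n-grams (1-4) are a standard representation for language identification:
# they capture orthographic cues without any language-specific preprocessing.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 4)),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

print(model.predict(["sää oli eilen huono"]))  # likely ['fin'] on this toy data
```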

    "Is There Choice in Non-Native Voice?" Linguistic Feature Engineering and a Variationist Perspective in Automatic Native Language Identification

    Is it possible to infer the native language of an author from a non-native text? Can we perform this task fully automatically? The interest in answers to these questions led to the emergence of a research field called Native Language Identification (NLI) in the first decade of this century. The requirement to automatically identify a particular property based on some language data situates the task at the intersection of computer science and linguistics, that is, in the context of computational linguistics, which combines both disciplines. This thesis targets several relevant research questions in the context of NLI. In particular, what is the role of surface features and more abstract linguistic cues? How can different sets of features be combined, and how can the resulting large models be optimized? Do the findings generalize across different data sets? Can we benefit from considering the task in the light of language variation theory? In order to approach these questions, we conduct a range of quantitative and qualitative explorations, employing different machine learning techniques. We show how linguistic insight can advance technology, and how technology can advance linguistic insight, constituting a fruitful and promising interplay.
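    The question of combining different feature sets can be made concrete with a small sketch. The example below is not the feature inventory used in the thesis: character and plain word n-grams stand in for the surface and more abstract linguistic features it discusses, and the texts and L1 labels are hypothetical.

```python
# A minimal feature-union setup for NLI-style classification; features and data are
# simplified stand-ins, not the thesis's actual feature engineering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

features = FeatureUnion([
    # Surface cues: character n-grams capture spelling- and morphology-level transfer.
    ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))),
    # More abstract cues would normally be POS or dependency n-grams; plain word
    # n-grams stand in for them here to keep the sketch self-contained.
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
])

texts = ["I am agree with this opinion", "He suggested me to apply early"]
native_langs = ["L1_A", "L1_B"]  # placeholder native-language labels

clf = make_pipeline(features, LinearSVC())
clf.fit(texts, native_langs)
print(clf.predict(["She is agree with the proposal"]))
```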

    On the Principles of Evaluation for Natural Language Generation

    Natural language processing is concerned with the ability of computers to understand natural language texts, which is, arguably, one of the major bottlenecks on the way to the holy grail of general Artificial Intelligence. Given the unprecedented success of deep learning technology, the natural language processing community has been almost entirely focused on practical applications, with state-of-the-art systems emerging and competing for human-parity performance at an ever-increasing pace. For that reason, fair and adequate evaluation and comparison, responsible for ensuring trustworthy, reproducible and unbiased results, have long occupied the scientific community, not only in natural language processing but also in other fields. A popular example is the ISO-9126 evaluation standard for software products, which outlines a wide range of evaluation concerns, such as cost, reliability, scalability, security, and so forth. The European project EAGLES-1996, an acclaimed extension of ISO-9126, laid out the fundamental principles specifically for evaluating natural language technologies, and these principles underpin succeeding methodologies for the evaluation of natural language systems. Natural language processing encompasses an enormous range of applications, each with its own evaluation concerns, criteria and measures. This thesis cannot hope to be comprehensive but particularly addresses evaluation in natural language generation (NLG), arguably one of the most human-like natural language applications. In this context, research on quantifying day-to-day progress with evaluation metrics lays the foundation of the fast-growing NLG community. However, previous work has failed to address high-quality metrics in several scenarios, such as evaluating long texts and evaluating when human references are not available; more prominently, these studies are limited in scope, lacking a holistic view of principled NLG evaluation. In this thesis, we aim for a holistic view of NLG evaluation from three complementary perspectives, driven by the evaluation principles in EAGLES-1996: (i) high-quality evaluation metrics, (ii) rigorous comparison of NLG systems for properly tracking progress, and (iii) understanding evaluation metrics. To this end, we identify the current challenges arising from the inherent characteristics of these perspectives, and then present novel metrics, rigorous comparison approaches, and explainability techniques for metrics to address the identified issues. We hope that our work on evaluation metrics, system comparison and explainability for metrics inspires more research towards principled NLG evaluation, and contributes to fair and adequate evaluation and comparison in natural language processing.
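    To make the notion of a reference-based NLG metric concrete, the sketch below computes a simple unigram-overlap F1 between a system output and a human reference. It only illustrates the metric family discussed above; it is not one of the metrics proposed in the thesis, which also targets harder cases such as long texts and evaluation without references.

```python
# A toy reference-based NLG metric: clipped unigram-overlap F1.
from collections import Counter

def unigram_f1(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    overlap = sum((Counter(hyp) & Counter(ref)).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(unigram_f1("the cat sat on the mat", "a cat sat on the mat"))  # ~0.83
```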

    Geographic information extraction from texts

    A large volume of unstructured texts containing valuable geographic information is available online. This information, provided implicitly or explicitly, is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although considerable progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. This workshop therefore provides a timely opportunity to discuss recent advances, new ideas, and concepts, and to identify research gaps in geographic information extraction.
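    One basic step of geographic information extraction, toponym recognition, can be sketched with an off-the-shelf NER model. The example below is only an illustration of the task, not a system from the workshop; it assumes spaCy and its small English model are installed (python -m spacy download en_core_web_sm).

```python
# Spotting place names (toponyms) with a generic NER model; geocoding them to
# coordinates would be a separate, further step.
import spacy

nlp = spacy.load("en_core_web_sm")
text = "Flooding was reported along the Danube near Vienna and in parts of Slovakia."
doc = nlp(text)

# GPE (geopolitical entities) and LOC (other locations) are the entity labels most
# relevant to toponym recognition.
toponyms = [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]
print(toponyms)  # e.g. ['Danube', 'Vienna', 'Slovakia'], depending on the model
```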