    EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020

    Welcome to EVALITA 2020! EVALITA is the evaluation campaign of Natural Language Processing and Speech Tools for Italian. EVALITA is an initiative of the Italian Association for Computational Linguistics (AILC, http://www.ai-lc.it) and is endorsed by the Italian Association for Artificial Intelligence (AIxIA, http://www.aixia.it) and the Italian Association for Speech Sciences (AISV, http://www.aisv.it).

    Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020

    On behalf of the Program Committee, a very warm welcome to the Seventh Italian Conference on Computational Linguistics (CLiC-it 2020). This edition of the conference is held in Bologna and organised by the University of Bologna. The CLiC-it conference series is an initiative of the Italian Association for Computational Linguistics (AILC). After six years of activity, it has clearly established itself as the premier national forum for research and development in the fields of Computational Linguistics and Natural Language Processing, where leading researchers and practitioners from academia and industry meet to share their research results, experiences, and challenges.

    Bridging the data gap in neural machine translation

    Neural machine translation (NMT) has revolutionized the field, leading to many breakthroughs and significantly improving translation quality. Despite these advances, a common limitation of existing NMT architectures is that they rely heavily on large amounts of high-quality parallel corpora. Only a few high-resource languages meet this requirement; for most of the world's languages, parallel data is scarce. This thesis proposes solutions to this challenge by exploiting two alternative data sources: monolingual data and parallel data from other (related) languages.

    The first half of the thesis explores how monolingual data can compensate for the lack of parallel data in two distinct ways. We first explore how to effectively exploit the knowledge of language models (LMs) trained on target-side monolingual data. We propose a method that uses an LM as a prior, which simultaneously mitigates overfitting and distills the knowledge of the LM into the NMT model. This is achieved by adding a regularization term that pushes the output distributions of the NMT model to be probable under the LM prior. This improves low-resource translation and outperforms related LM-fusion methods. Next, inspired by advances in transfer learning, we study how to effectively use monolingual data by pretraining the entire NMT model. We focus on the role of different denoising autoencoding (DAE) objectives and explore noising methods that create samples resembling real sentences. Our analysis reveals that different objectives produce models that encode and use information differently, and our experiments show strong variation in unsupervised NMT, unlike semi-supervised and supervised NMT.

    The next part of the thesis focuses on exploiting related parallel data via multilingual machine translation (MMT). Initially, we investigate how to efficiently balance the trade-off between transfer and interference in MMT. Instead of increasing model capacity, which incurs a large computational cost, or using separate language-specific parameters, which prevent cross-lingual transfer, we achieve the best of both by incorporating language-specific layers generated from a language-aware hyper-network. Then, we unify all our previous efforts and study how to optimally combine monolingual and related parallel data in MMT. Motivated by promising but conflicting results in the literature, we systematically analyze jointly training MMT with DAE or back-translation (BT). Using a comprehensive evaluation across monolingual splits and multilingual test sets, we discover that all methods are surprisingly brittle to domain mismatches. We also analyze the role of model scale (from 90M to 1.6B parameters) and find it critical for effectively using monolingual data, capable of completely changing the ranking across models, with surprisingly strong effects on DAE.

    The goal of this thesis is to contribute both new methods and new insights. One half presents novel methods for exploiting data sources beyond the parallel corpora of a given language pair, addressing the limitations of existing methods. The other half presents systematic analyses of how state-of-the-art methods work, using comprehensive evaluation with controlled experiments, and aims to advance our understanding of these methods and drive future research.
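
    The LM-prior regularization described in this abstract can be sketched in a few lines. The following is a minimal illustrative sketch in PyTorch, not the thesis's implementation: the function name, the `kl_weight` and `tau` (temperature) hyperparameters, and the assumption that the NMT model and the LM share one vocabulary are all ours.

```python
import torch
import torch.nn.functional as F

def lm_prior_loss(nmt_logits, lm_logits, targets, pad_id,
                  kl_weight=0.5, tau=2.0):
    """Cross-entropy on the references plus a KL term that pushes the
    NMT output distribution to stay probable under a frozen LM prior.

    nmt_logits, lm_logits: (batch, seq_len, vocab), same vocabulary assumed.
    targets: (batch, seq_len) reference token ids.
    """
    # Standard translation loss on the references.
    ce = F.cross_entropy(nmt_logits.transpose(1, 2), targets,
                         ignore_index=pad_id)
    # Distillation-style KL between the temperature-softened NMT
    # distribution (student) and the detached LM prior (teacher).
    kl = F.kl_div(
        F.log_softmax(nmt_logits / tau, dim=-1),
        F.softmax(lm_logits.detach() / tau, dim=-1),
        reduction="none",
    ).sum(-1)
    # Average the per-token KL over non-padding positions only.
    mask = targets.ne(pad_id).float()
    kl = (kl * mask).sum() / mask.sum()
    return ce + kl_weight * (tau ** 2) * kl
```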
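
    The language-aware hyper-network for MMT can likewise be sketched as a small generator that maps a language embedding to the weights of a per-language residual adapter: the generator is shared across languages (enabling transfer) while the generated weights are language-specific (limiting interference). The class name, dimensions, and adapter form below are illustrative assumptions, not the thesis's architecture.

```python
import torch
import torch.nn as nn

class LanguageAwareHyperAdapter(nn.Module):
    """Generates per-language adapter weights from a language embedding."""

    def __init__(self, num_langs, d_model=512, d_bottleneck=64, d_lang=32):
        super().__init__()
        self.lang_emb = nn.Embedding(num_langs, d_lang)
        # Total size of a down-projection and an up-projection, with biases.
        n_params = 2 * d_model * d_bottleneck + d_bottleneck + d_model
        self.generator = nn.Linear(d_lang, n_params)
        self.d_model, self.d_bottleneck = d_model, d_bottleneck

    def forward(self, x, lang_id):
        # x: (batch, seq_len, d_model); lang_id: scalar language index tensor.
        w = self.generator(self.lang_emb(lang_id))
        d, b = self.d_model, self.d_bottleneck
        # Unpack the flat generated vector into adapter weights and biases.
        w_down, w = w[:d * b].view(b, d), w[d * b:]
        w_up, w = w[:d * b].view(d, b), w[d * b:]
        b_down, b_up = w[:b], w[b:]
        h = torch.relu(x @ w_down.T + b_down)
        return x + h @ w_up.T + b_up  # residual connection
```

    One shared `generator` receives gradients from every language, so low-resource languages can benefit from high-resource ones without the cost of fully separate per-language layers.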

    Geographic information extraction from texts

    A large volume of unstructured texts containing valuable geographic information is available online. This information – provided implicitly or explicitly – is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although substantial progress has been made in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. This workshop will therefore provide a timely opportunity to discuss recent advances, new ideas, and concepts, but also to identify research gaps in geographic information extraction.