107 research outputs found

    Character-level and syntax-level models for low-resource and multilingual natural language processing

    There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages generally present different characteristics, “cross-lingual bridges” can be exploited, such as transliteration signals and word alignment links. Such information, together with the availability of multiparallel corpora and the urge to overcome language barriers, motivates us to build models that represent more of the world’s languages. This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax level. Specifically, we propose to (i) use orthographic similarities and transliteration between Named Entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora for projecting labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter. In the first publication, we describe our approach for improving the translation of rare words and named entities for the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs since they are simple to extract, effective for bootstrapping the mapping of BWEs, and overcome the failure of unsupervised methods. The fourth paper shows our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity. 
We exploit parallel corpus statistics and transliteration models and obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part-of-speech tagging task. Part-of-speech models trained on our labeled sets outperform prior work for low-resource languages like Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries).
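The abstract above frames annotation projection as graph-based label propagation. As a rough illustration of that general idea only (not the thesis's actual model), the sketch below spreads part-of-speech tags from labeled seed nodes to unlabeled neighbors over a weighted graph. The function name `propagate_labels`, the toy graph, and the edge weights are all hypothetical.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, num_iters=20):
    """Toy label propagation (illustrative assumption, not the thesis's method).

    edges: iterable of (node_u, node_v, weight) for an undirected graph.
    seeds: dict mapping a node to its known tag; seed tags stay fixed.
    Returns a dict mapping every node to its highest-scoring tag (or None).
    """
    neighbors = defaultdict(list)
    for u, v, w in edges:
        neighbors[u].append((v, w))
        neighbors[v].append((u, w))

    tags = sorted(set(seeds.values()))
    # Every node holds a score per tag; seeds start as one-hot vectors.
    scores = {n: {t: 0.0 for t in tags} for n in neighbors}
    for node, tag in seeds.items():
        scores.setdefault(node, {t: 0.0 for t in tags})
        scores[node][tag] = 1.0

    for _ in range(num_iters):
        updated = {}
        for node, dist in scores.items():
            if node in seeds:
                updated[node] = dist  # clamp seed nodes to their known tag
                continue
            total = sum(w for _, w in neighbors[node])
            if total == 0:
                updated[node] = dist
                continue
            # New score = weighted average of the neighbors' current scores.
            updated[node] = {
                t: sum(w * scores[m][t] for m, w in neighbors[node]) / total
                for t in tags
            }
        scores = updated

    return {
        n: (max(d, key=d.get) if any(d.values()) else None)
        for n, d in scores.items()
    }


# Hypothetical chain graph: a - b - c - d, with tags known only at the ends.
toy_edges = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 1.0)]
toy_seeds = {"a": "NOUN", "d": "VERB"}
print(propagate_labels(toy_edges, toy_seeds))
```

In this toy chain, each unlabeled node ends up with the tag of the seed it is closer to; a real projection graph would connect aligned tokens across parallel sentences rather than arbitrary letters.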

    Measuring associational thinking through word embeddings

    The development of a model to quantify semantic similarity and relatedness between words has been the major focus of many studies in various fields, e.g. psychology, linguistics, and natural language processing. Unlike the measures proposed by most previous research, this article aims to estimate automatically the strength of association between words that may or may not be semantically related. We demonstrate that the performance of the model depends not only on the combination of independently constructed word embeddings (namely, corpus- and network-based embeddings) but also on the way these word vectors interact. The research concludes that the weighted average of the cosine-similarity coefficients derived from independent word embeddings in a double vector space tends to yield high correlations with human judgements. Moreover, we demonstrate that evaluating word associations through a measure that relies on not only the rank ordering of word pairs but also the strength of associations can reveal some findings that go unnoticed by traditional measures such as Spearman's and Pearson's correlation coefficients. Financial support for this research has been provided by the Spanish Ministry of Science, Innovation and Universities [grant number RTC 2017-6389-5], the Spanish "Agencia Estatal de Investigación" [grant number PID2020-112827GB-I00 / AEI / 10.13039/501100011033], and the European Union's Horizon 2020 research and innovation program [grant number 101017861: project SMARTLAGOON]. Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. Periñán-Pascual, C. (2022). Measuring associational thinking through word embeddings. Artificial Intelligence Review. 55(3):2065-2102. https://doi.org/10.1007/s10462-021-10056-6
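The measure described in the abstract above, a weighted average of cosine similarities taken from two independently built embedding spaces, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the toy vectors, and the default weight of 0.5 are hypothetical and are not taken from the article.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_association(corpus_pair, network_pair, weight=0.5):
    """Combine two cosine similarities with a convex weight.

    corpus_pair:  (vector of word 1, vector of word 2) in a corpus-based space.
    network_pair: the same two words in a network-based space.
    weight:       share given to the corpus-based similarity (assumed value).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    sim_corpus = cosine(*corpus_pair)
    sim_network = cosine(*network_pair)
    return weight * sim_corpus + (1.0 - weight) * sim_network


# Hypothetical toy vectors for one word pair in each space.
corpus_vectors = ([1.0, 0.0], [1.0, 0.0])   # identical direction, cosine 1
network_vectors = ([1.0, 0.0], [0.0, 1.0])  # orthogonal, cosine 0
print(weighted_association(corpus_vectors, network_vectors, weight=0.7))
```

Because the two spaces may have different dimensionality, the similarities are computed separately in each space and only the resulting scalar scores are combined.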

    Robust input representations for low-resource information extraction

    Recent advances in the field of natural language processing were achieved with deep learning models. This led to a wide range of new research questions concerning the stability of such large-scale systems and their applicability beyond well-studied tasks and datasets, such as information extraction in non-standard domains and languages, in particular, in low-resource environments. In this work, we address these challenges and make important contributions across fields such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including a lack of training resources, domain mismatches and language barriers. In particular, we propose solutions to close the domain gap between representation models by, e.g., domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representation of multiple embedding methods. Our broad set of experiments demonstrates state-of-the-art performance of our methods for various sequence tagging and classification tasks and highlights their robustness in challenging low-resource settings across languages and domains.
Recent advances in the field of natural language processing were achieved with deep learning models. This led to a wide range of new research questions concerning the stability of such large systems and their applicability beyond well-studied tasks and datasets, such as information extraction for non-standard languages, but also for text domains and tasks for which little training data is available even in English.
In this work, we address these challenges and make important contributions in areas such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including missing training resources, unseen domains and language barriers. In particular, we propose solutions to close the domain gap between representation models, e.g. through domain-adaptive pre-training or our novel meta-embedding architecture for creating a joint representation of multiple embedding methods. Our comprehensive evaluation demonstrates the performance of our methods for various classification tasks at the word and sentence level and underlines their robustness in challenging, low-resource environments across different languages and domains.

    Harnessing sense-level information for semantically augmented knowledge extraction

    Nowadays, building accurate computational models for the semantics of language lies at the very core of Natural Language Processing and Artificial Intelligence. A first and foremost step in this respect consists in moving from word-based to sense-based approaches, in which operating explicitly at the level of word senses enables a model to produce more accurate and unambiguous results. At the same time, word senses create a bridge towards structured lexico-semantic resources, where the vast amount of available machine-readable information can help overcome the shortage of annotated data in many languages and domains of knowledge. This latter phenomenon, known as the knowledge acquisition bottleneck, is a crucial problem that hampers the development of large-scale, data-driven approaches for many Natural Language Processing tasks, especially when lexical semantics is directly involved. One of these tasks is Information Extraction, where an effective model has to cope with data sparsity, as well as with lexical ambiguity that can arise at the level of both arguments and relational phrases. Even in more recent Information Extraction approaches where semantics is implicitly modeled, these issues have not yet been addressed in their entirety. On the other hand, however, having access to explicit sense-level information is a very demanding task on its own, which can rarely be performed with high accuracy on a large scale. With this in mind, in this thesis we will tackle a two-fold objective: our first focus will be on studying fully automatic approaches to obtain high-quality sense-level information from textual corpora; then, we will investigate in depth where and how such sense-level information has the potential to enhance the extraction of knowledge from open text.
In the first part of this work, we will explore three different disambiguation scenarios (semi-structured text, parallel text, and definitional text) and devise automatic disambiguation strategies that are not only capable of scaling to different corpus sizes and different languages, but that actually take advantage of a multilingual and/or heterogeneous setting to improve and refine their performance. As a result, we will obtain three sense-annotated resources that, when tested experimentally with a baseline system in a series of downstream semantic tasks (i.e. Word Sense Disambiguation, Entity Linking, Semantic Similarity), show very competitive performance on standard benchmarks against both manual and semi-automatic competitors. In the second part we will instead focus on Information Extraction, with an emphasis on Open Information Extraction (OIE), where issues like sparsity and lexical ambiguity are especially critical, and study how best to exploit sense-level information within the extraction process. We will start by showing that enforcing a deeper semantic analysis in a definitional setting enables a full-fledged extraction pipeline to compete with state-of-the-art approaches based on much larger (but noisier) data. We will then demonstrate how working at the sense level at the end of an extraction pipeline is also beneficial: indeed, by leveraging sense-based techniques, very heterogeneous OIE-derived data can be aligned semantically, and unified with respect to a common sense inventory. Finally, we will briefly shift the focus to the more constrained setting of hypernym discovery, and study a sense-aware supervised framework for the task that is robust and effective, even when trained on heterogeneous OIE-derived hypernymic knowledge.

    No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling

    Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. The initialization can lead to variability depending on the machine learning algorithm. Furthermore, the distortions can be misleading with regard to cluster geometry. Amongst the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology, since similar procedures go by different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, the factorization, and the clustering algorithms that are directly or indirectly related to the reviewed works.

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions. Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missing from the published version due to the publication policies. Please contact Prof. Erik Cambria for details.