Acquiring and Harnessing Verb Knowledge for Multilingual Natural Language Processing
Advances in representation learning have enabled natural language processing models to derive non-negligible linguistic information directly from text corpora in an unsupervised fashion. However, this signal is underused in downstream tasks, where models tend to fall back on superficial cues and heuristics to solve the problem at hand. Further progress relies on identifying and filling the gaps in the linguistic knowledge captured in their parameters. The objective of this thesis is to address these challenges, focusing on the issues of resource scarcity, interpretability, and lexical knowledge injection, with an emphasis on the category of verbs.
To this end, I propose a novel paradigm for efficient acquisition of lexical knowledge leveraging native speakers’ intuitions about verb meaning to support development and downstream performance of NLP models across languages. First, I investigate the potential of acquiring semantic verb classes from non-experts through manual clustering. This subsequently informs the development of a two-phase semantic dataset creation methodology, which combines semantic clustering with fine-grained semantic similarity judgments collected through spatial arrangements of lexical stimuli. The method is tested on English and then applied to a typologically diverse sample of languages to produce the first large-scale multilingual verb dataset of this kind. I demonstrate its utility as a diagnostic tool by carrying out a comprehensive evaluation of state-of-the-art NLP models, probing representation quality across languages and domains of verb meaning, and shedding light on their deficiencies. Subsequently, I directly address these shortcomings by injecting lexical knowledge into large pretrained language models. I demonstrate that external manually curated information about verbs’ lexical properties can support data-driven models in tasks where accurate verb processing is key. Moreover, I examine the potential of extending these benefits from resource-rich to resource-poor languages through translation-based transfer. The results emphasise the usefulness of human-generated lexical knowledge in supporting NLP models and suggest that time-efficient construction of lexicons similar to those developed in this work, especially in under-resourced languages, can play an important role in boosting their linguistic capacity.
ESRC Doctoral Fellowship [ES/J500033/1], ERC Consolidator Grant LEXICAL [648909
Integrating Distributional, Compositional, and Relational Approaches to Neural Word Representations
When the field of natural language processing (NLP) entered the era of deep neural networks, the task of representing basic units of language, an inherently sparse and symbolic medium, using low-dimensional dense real-valued vectors, or embeddings, became crucial.
The dominant technique to perform this task has for years been to segment input text sequences into space-delimited words, for which embeddings are trained over a large corpus by means of leveraging distributional information: a word is reducible to the set of contexts it appears in.
This approach is powerful but imperfect; words not seen during the embedding learning phase, known as out-of-vocabulary words (OOVs), emerge in any plausible application where embeddings are used.
One approach applied in order to combat this and other shortcomings is the incorporation of compositional information obtained from the surface form of words, enabling the representation of morphological regularities and increasing robustness to typographical errors.
Another approach leverages word-sense information and relations curated in large semantic graph resources, offering a supervised signal for embedding space structure and improving representations for domain-specific rare words.
In this dissertation, I offer several analyses and remedies for the OOV problem based on the utilization of character-level compositional information in multiple languages and the structure of semantic knowledge in English.
In addition, I provide two novel datasets for the continued exploration of vocabulary expansion in English: one with a taxonomic emphasis on novel word formation, and the other generated by a real-world data-driven use case in the entity graph domain.
Finally, recognizing the recent shift in NLP towards contextualized representations of subword tokens, I describe the form in which the OOV problem still appears in these methods, and apply an integrative compositional model to address it.
Ph.D
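The idea of building a vector for an out-of-vocabulary word from its surface form can be sketched as follows. This is only an illustration of character n-gram composition in general, not the dissertation's model: the function names, the n-gram range, the vector size `DIM`, and the hash-derived stand-in for a trained n-gram embedding table are all assumptions made for the example.

```python
import hashlib

DIM = 8  # illustrative vector size (assumption, not from the abstract)


def char_ngrams(word, n_min=3, n_max=4):
    """Return the character n-grams of a word padded with boundary markers."""
    padded = f"<{word}>"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


def ngram_vector(ngram):
    """Deterministic pseudo-embedding; a stand-in for a trained lookup table."""
    digest = hashlib.sha256(ngram.encode("utf-8")).digest()
    return [(byte / 255.0) - 0.5 for byte in digest[:DIM]]


def compose(word):
    """Average the n-gram vectors, so any string gets a vector, seen or unseen."""
    vectors = [ngram_vector(g) for g in char_ngrams(word)]
    if not vectors:  # word too short to yield any n-gram
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Morphologically related forms share n-grams, hence part of their vectors.
shared = set(char_ngrams("walking")) & set(char_ngrams("walked"))
oov_vector = compose("rewalkable")
```

Because the vector is assembled from pieces of the string, a word never seen in training still receives a representation, and related forms such as "walking" and "walked" overlap in the n-grams they contribute.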
Final FLaReNet deliverable: Language Resources for the Future - The Future of Language Resources
Language Technologies (LT), together with their backbone, Language Resources (LR), provide essential support to the challenge of Multilingualism and ICT of the future. The main task of language technologies is to bridge language barriers and to help create a new environment where information flows smoothly across frontiers and languages, no matter the country, and the language, of origin. To achieve this goal, all players involved need to act as a community able to join forces on a set of shared priorities. However, the field of Language Resources and Technology has long suffered from an excess of individuality and fragmentation, with a lack of coherence concerning the priorities for the field and the direction in which to move, not to mention a common timeframe. The context encountered by the FLaReNet project was thus represented by an active field needing a coherence that can only be given by sharing common priorities and endeavours. FLaReNet has contributed to the creation of this coherence by gathering a wide community of experts and making them participate in the definition of an exhaustive set of recommendations
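Mining and Use of Linguistic Patterns for Word Disambiguation and Discourse Analysis (original title: Mineração e uso de padrões linguísticos para desambiguação de palavras e análise do discurso)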
Thesis (doctorate) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2020.
Information extraction from web texts has the potential to boost a number of applications, but many of them require the automatic capture of the exact semantics of relevant textual elements. Twitter, for example, generates hundreds of millions of short texts (tweets) daily, many of which contain rich information about users, facts, products, services, desires, opinions, etc. The semantic annotation of relevant words in tweets is a major challenge because tweets impose additional difficulties (e.g., little context information, ungrammaticality) on automatic methods for quality disambiguation, which leads to results with low accuracy and coverage. In addition, language is a polysemous symbolic system without ready-made semantics, which shows markedly in colloquial language and particularly in social media. Current annotation solutions generally fail to find the correct sense of words in constructions involving implicit semantics, which is sometimes introduced intentionally, for example for humor, irony, wordplay, or puns.
This work proposes an approach to mine lexical-semantic patterns in order to capture the semantics of text for use in language processing tasks. These patterns are called MSC+ patterns because they are defined by sequences of Morpho-semantic Components (MSC). An unsupervised algorithm was developed to mine such patterns, which support the identification of a new kind of semantic feature in documents, as well as methods for disambiguating the meaning of words.
Experimental results on the Word Sense Disambiguation (WSD) task in social media text show that instances of some MSC+ patterns appear in many tweets, but sometimes with different words conveying the sense. Tests on the WSD results show that exploiting MSC+ patterns enables effective word sense disambiguation mechanisms, leading to improvements over the state of the art in accuracy, coverage, and F-measure. The MSC+ patterns were also explored in Discourse Analysis (DA) experiments on the content of different works by the Brazilian writer Machado de Assis. These experiments reveal morpho-semantic patterns that characterize literary works and can help classify the discourse of the works analyzed, such as the preponderance of specific verbs in short stories, feminine nouns in novels, and adjectives in poems
Empirical studies on word representations
One of the most fundamental tasks in natural language processing is representing words with mathematical objects (such as vectors). The word representations, which are most often estimated from data, allow capturing the meaning of words. They enable comparing words according to their semantic similarity, and have been shown to work extremely well when included in complex real-world applications. A large part of our work deals with ways of estimating word representations directly from large quantities of text. Our methods exploit the idea that words which occur in similar contexts have a similar meaning. How we define the context is an important focus of our thesis. The context can consist of a number of words to the left and to the right of the word in question, but, as we show, obtaining context words via syntactic links (such as the link between the verb and its subject) often works better. We furthermore investigate word representations that accurately capture multiple meanings of a single word. We show that translation of a word in context contains information that can be used to disambiguate the meaning of that word
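The window-based notion of context described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the thesis's method: the toy corpus, the function names, the window size of 2, and the use of raw co-occurrence counts with cosine similarity are all choices made for the example.

```python
from collections import Counter, defaultdict
from math import sqrt


def window_contexts(sentences, window=2):
    """For each word, count the words occurring within `window` positions of it."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy corpus (hypothetical): "cat" and "dog" occur in the same contexts.
corpus = [
    "the cat chases the mouse".split(),
    "the dog chases the ball".split(),
    "the cat eats the fish".split(),
    "the dog eats the bone".split(),
]
vectors = window_contexts(corpus, window=2)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_fish = cosine(vectors["cat"], vectors["fish"])
```

In this toy corpus "cat" and "dog" share all of their window contexts, so they come out more similar to each other than "cat" is to "fish". A syntactic variant would replace the positional window with words reached through dependency links, which requires a parser and is not shown here.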