    Adjusting Sense Representations for Word Sense Disambiguation and Automatic Pun Interpretation

    Word sense disambiguation (WSD)—the task of determining which meaning a word carries in a particular context—is a core research problem in computational linguistics. Though it has long been recognized that supervised (machine learning–based) approaches to WSD can yield impressive results, they require an amount of manually annotated training data that is often too expensive or impractical to obtain. This is a particular problem for under-resourced languages and domains, and is also a hurdle in well-resourced languages when processing the sort of lexical-semantic anomalies employed for deliberate effect in humour and wordplay. In contrast to supervised systems are knowledge-based techniques, which rely only on pre-existing lexical-semantic resources (LSRs). These techniques are of more general applicability but tend to suffer from lower performance due to the informational gap between the target word's context and the sense descriptions provided by the LSR. This dissertation is concerned with extending the efficacy and applicability of knowledge-based word sense disambiguation. First, we investigate two approaches for bridging the information gap and thereby improving the performance of knowledge-based WSD. In the first approach we supplement the word's context and the LSR's sense descriptions with entries from a distributional thesaurus. The second approach enriches an LSR's sense information by aligning it to other, complementary LSRs. Our next main contribution is to adapt techniques from word sense disambiguation to a novel task: the interpretation of puns. Traditional NLP applications, including WSD, usually treat the source text as carrying a single meaning, and therefore cannot cope with the intentionally ambiguous constructions found in humour and wordplay. We describe how algorithms and evaluation methodologies from traditional word sense disambiguation can be adapted for the "disambiguation" of puns, or rather for the identification of their double meanings. Finally, we cover the design and construction of technological and linguistic resources aimed at supporting the research and application of word sense disambiguation. Development and comparison of WSD systems has long been hampered by a lack of standardized data formats, language resources, software components, and workflows. To address this issue, we designed and implemented a modular, extensible framework for WSD. It implements, encapsulates, and aggregates reusable, interoperable components using UIMA, an industry-standard information processing architecture. We have also produced two large sense-annotated data sets for under-resourced languages or domains: one of these targets German-language text, and the other English-language puns.
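    The information-gap idea lends itself to a compact illustration. Below is a minimal sketch of a Lesk-style knowledge-based disambiguator, with both the context and the sense glosses expanded through a thesaurus lookup. It uses NLTK's WordNet as the LSR; the tiny THESAURUS dictionary is an invented stand-in for a real distributional thesaurus, and the sketch is not the dissertation's actual system.

```python
# Lesk-style knowledge-based WSD, with context and glosses expanded
# by thesaurus entries to narrow the information gap.
# Requires: pip install nltk; then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

# Placeholder distributional thesaurus: word -> distributionally similar
# words (a real one would be induced from corpus co-occurrence statistics).
THESAURUS = {
    "bank": ["lender", "shore", "deposit"],
    "money": ["cash", "funds", "currency"],
}

def expand(words):
    """Return the word set plus its thesaurus neighbours."""
    expanded = set(words)
    for w in words:
        expanded.update(THESAURUS.get(w, []))
    return expanded

def disambiguate(target, context):
    """Pick the WordNet sense of `target` whose gloss overlaps most
    with the (thesaurus-expanded) context."""
    context_set = expand(context)
    best, best_score = None, -1
    for sense in wn.synsets(target):
        gloss = set(sense.definition().lower().split())
        score = len(expand(gloss) & context_set)
        if score > best_score:
            best, best_score = sense, score
    return best

print(disambiguate("bank", ["deposit", "money", "account"]))
```

    The same overlap score could be computed against several aligned LSRs and summed, which is the rough intuition behind the second (sense-alignment) approach.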

    Metaphor and the 'Emergent Property' Problem: A Relevance-Theoretic Approach

    The interpretation of metaphorical utterances often results in the attribution of emergent properties; these are properties which are neither standardly associated with the individual constituents of the utterance in isolation nor derivable by standard rules of semantic composition. For example, an utterance of ‘Robert is a bulldozer’ may be understood as attributing to Robert such properties as single-mindedness, insistence on having things done in his way, and insensitivity to the opinions/feelings of others, although none of these is included in the encyclopaedic information associated with bulldozers (earth-clearing machines). An adequate pragmatic account of metaphor interpretation must provide an explanation of the processes through which emergent properties are derived. In this paper, we attempt to develop an explicit account of the derivation process couched within the framework of relevance theory. The key features of our account are: (a) metaphorical language use is taken to lie on a continuum with other cases of loose use, including hyperbole; (b) metaphor interpretation is a wholly inferential process, which does not require associative mappings from one domain (e.g. machines) to another (e.g. human beings); (c) the derivation of emergent properties involves no special interpretive mechanisms not required for the interpretation of ordinary, literal utterances.

    Designing Statistical Language Learners: Experiments on Noun Compounds

    The goal of this thesis is to advance the exploration of the statistical language learning design space. In pursuit of that goal, the thesis makes two main theoretical contributions: (i) it identifies a new class of designs by specifying an architecture for natural language analysis in which probabilities are given to semantic forms rather than to more superficial linguistic elements; and (ii) it explores the development of a mathematical theory to predict the expected accuracy of statistical language learning systems in terms of the volume of data used to train them. The theoretical work is illustrated by applying statistical language learning designs to the analysis of noun compounds. Both syntactic and semantic analysis of noun compounds are attempted using the proposed architecture. Empirical comparisons demonstrate that the proposed syntactic model is significantly better than those previously suggested, approaching the performance of human judges on the same task, and that the proposed semantic model, the first statistical approach to this problem, exhibits significantly better accuracy than the baseline strategy. These results suggest that the new class of designs identified is a promising one. The experiments also serve to highlight the need for a widely applicable theory of data requirements. Comment: PhD thesis (Macquarie University, Sydney; December 1995), LaTeX source, xii+214 pages.
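    For three-word compounds, the syntactic task reduces to a binary left/right bracketing decision. The sketch below shows the shape of the two competing decision rules discussed in this line of work (adjacency vs. dependency); the association scores are invented for illustration and stand in for the thesis's corpus-derived probability estimates.

```python
# Bracketing three-word noun compounds (w1 w2 w3):
#   left-branching  [[w1 w2] w3]   e.g. [[computer science] student]
#   right-branching [w1 [w2 w3]]   e.g. [toy [coffee grinder]]
# ASSOC holds made-up association scores for illustration only.
ASSOC = {
    ("computer", "science"): 9.1, ("computer", "student"): 0.4,
    ("science", "student"): 2.0,
    ("toy", "coffee"): 0.2, ("toy", "grinder"): 1.1,
    ("coffee", "grinder"): 7.5,
}

def assoc(a, b):
    return ASSOC.get((a, b), 0.1)  # smoothing floor for unseen pairs

def bracket(w1, w2, w3, model="dependency"):
    """Return 'left' or 'right' branching for the compound w1 w2 w3."""
    if model == "dependency":
        # Dependency model: does w1 modify w2 or w3?
        left_score, right_score = assoc(w1, w2), assoc(w1, w3)
    else:
        # Adjacency model: which adjacent pair binds more tightly?
        left_score, right_score = assoc(w1, w2), assoc(w2, w3)
    return "left" if left_score >= right_score else "right"

print(bracket("computer", "science", "student"))  # -> left
print(bracket("toy", "coffee", "grinder"))        # -> right
```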

    Subtitling Humour from the Perspective of Relevance Theory: The Office in Traditional Chinese

    Subtitling scenes containing humorous utterances in cinematic-televisual productions poses a myriad of challenges: the subtitler must contend with the technical constraints that characterise the professional subtitling environment as well as the cultural barriers involved in reproducing humorous utterances for viewers inhabiting another culture. Past studies have tended to explore narrower humour-related areas, so a more comprehensive picture of this specialised field is missing. The current research investigates the subtitling of humour, drawing on the framework of relevance theory and the British sitcom The Office, translated from English dialogue into Traditional Chinese subtitles. This research enquires into whether, and to what extent, relevance theory can explain the subtitling strategies activated to deal with various humorous utterances in the sitcom. The English-Chinese Corpus of The Office (ECCO), which contains sample texts, media files and annotations, has been constructed to perform an empirical study. To enrich the corpus with valuable annotations, a typology of humour has been developed based on the concept of frame, and a taxonomy of subtitling strategies has also been proposed. The quantitative analysis demonstrates that the principle of relevance is the main benchmark for the choice of a subtitling micro-strategy within any given macro-strategy. A chi-square test further shows a statistically significant association between humour types/frames and subtitling strategies at the global level. The qualitative analysis shows that the principle of relevance can operate in a subtle way, in which the subtitler invests more cognitive effort to enhance the acceptability of subtitles. It also develops three levels of mutual dependency between the two variables (strong, weak, and null) to classify different examples. Overall, this study improves our understanding of humour translation and can facilitate a change in the curricula of translator training.
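    The global-level association claim corresponds to a standard chi-square test of independence on a contingency table of humour types against subtitling strategies. A minimal sketch follows, with invented counts rather than figures from ECCO:

```python
# Chi-square test of independence: humour type vs. subtitling strategy.
# The counts below are invented for illustration; the study's real
# figures come from the annotated ECCO corpus.
from scipy.stats import chi2_contingency

# Rows: hypothetical humour types/frames; columns: subtitling strategies.
observed = [
    [30, 12,  5],   # e.g. cultural humour
    [ 8, 25, 10],   # e.g. linguistic humour (wordplay)
    [14,  9, 22],   # e.g. universal humour
]

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p:.4f}")
if p < 0.05:
    print("Reject independence: humour type and strategy are associated.")
```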

    Mining and Using Linguistic Patterns for Word Sense Disambiguation and Discourse Analysis

    Doctoral thesis, Universidade Federal de Santa Catarina, Centro Tecnológico, Graduate Program in Computer Science, Florianópolis, 2020. Information extraction from texts on the web has the potential to boost a number of applications, but many of them require the automatic capture of the exact semantics of relevant textual elements. Twitter, for example, generates hundreds of millions of short texts (tweets) daily, many of which contain rich information about users, facts, products, services, desires, opinions, etc. The semantic annotation of relevant words in tweets is a major challenge, because social media texts impose additional difficulties (e.g., little context information, poor grammatical conformity) on automatic methods, leading to disambiguation results with low precision and coverage. Moreover, language is a polysemous symbolic system without ready-made semantics for every construction; words sometimes carry implicit meanings placed intentionally, for example to create humour, irony, or wordplay, which is common in colloquial language and particularly in social media, and current annotation solutions generally fail to find the correct sense in such cases. This work proposes an approach to mine lexico-semantic patterns in order to capture the semantics of text for use in language processing tasks. These patterns are called MSC+ patterns, as they are defined by sequences of Morpho-semantic Components (MSCs). An unsupervised algorithm was developed to mine such patterns, which support the identification of a new kind of semantic feature in documents, as well as methods for disambiguating word senses. Experimental results on the Word Sense Disambiguation (WSD) task over social-media text show that instances of some MSC+ patterns appear in many tweets, sometimes using different words to convey the same sense. Exploiting MSC+ patterns enables effective word sense disambiguation mechanisms, improving on the state of the art in terms of precision, coverage, and F-measure. The MSC+ patterns were also explored in Discourse Analysis (DA) experiments on the content of different works by the Brazilian writer Machado de Assis. These experiments revealed the incidence of morpho-semantic patterns that characterise different types of literary works and can support discourse classification, such as the preponderance of specific verbs in the short stories, feminine nouns in the novels, and adjectives in the poems.
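    As a rough illustration of the pattern-mining idea (much simpler than the actual MSC+ algorithm), one can count contiguous sequences of morpho-semantic tags across tweets and keep those above a support threshold. In the sketch below the tag inventory and the tagged tweets are invented:

```python
# Toy frequent-sequence miner over morpho-semantic component (MSC)
# tag sequences, in the spirit of (but far simpler than) the MSC+
# approach; the tag inventory and data are invented for illustration.
from collections import Counter

# Each tweet reduced to a sequence of hypothetical MSC tags.
tagged_tweets = [
    ["PRON", "VERB.motion", "PREP", "NOUN.place"],
    ["PRON", "VERB.motion", "PREP", "NOUN.event"],
    ["NOUN.person", "VERB.emotion", "NOUN.artifact"],
    ["PRON", "VERB.motion", "PREP", "NOUN.place"],
]

def mine_patterns(sequences, min_len=2, max_len=4, min_support=2):
    """Return contiguous tag n-grams occurring in >= min_support tweets."""
    support = Counter()
    for seq in sequences:
        seen = set()  # count each pattern at most once per tweet
        for n in range(min_len, max_len + 1):
            for i in range(len(seq) - n + 1):
                seen.add(tuple(seq[i:i + n]))
        support.update(seen)
    return {p: c for p, c in support.items() if c >= min_support}

for pattern, count in sorted(mine_patterns(tagged_tweets).items(),
                             key=lambda kv: -kv[1]):
    print(count, " ".join(pattern))
```

    A pattern such as PRON VERB.motion PREP, shared by several tweets with different surface words, is the kind of instance the abstract describes as conveying the same sense through different vocabulary.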
