    Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision

    Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text that has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently available only for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties. The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings. Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared both to unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a wide range of target languages, in the setting where no annotated training data is available in the target language.
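
    A minimal sketch may help make the notion of ambiguous supervision concrete. Below, each token carries a set of allowed part-of-speech tags (for instance, obtained by cross-lingual projection) rather than a single gold tag, and the choice within the set is treated as a latent variable resolved by the model itself, in the spirit of the latent-variable learning described above. This is an illustrative toy, not the dissertation's actual models; the feature set and the perceptron-style update are assumptions.

    from collections import defaultdict

    def features(words, i, tag):
        # Two toy features: the word itself and its 2-character suffix.
        return [f"w={words[i]}|t={tag}", f"suf={words[i][-2:]}|t={tag}"]

    def score(weights, feats):
        return sum(weights[f] for f in feats)

    def train(corpus, tagset, epochs=5):
        # corpus: list of (words, allowed) pairs, where allowed[i] is the set
        # of tags licensed for token i by the incomplete supervision.
        weights = defaultdict(float)
        for _ in range(epochs):
            for words, allowed in corpus:
                for i in range(len(words)):
                    # Latent "gold": the best-scoring tag inside the ambiguous set.
                    gold = max(allowed[i],
                               key=lambda t: score(weights, features(words, i, t)))
                    # Model prediction over the full tagset.
                    pred = max(tagset,
                               key=lambda t: score(weights, features(words, i, t)))
                    if pred not in allowed[i]:
                        for f in features(words, i, gold):
                            weights[f] += 1.0
                        for f in features(words, i, pred):
                            weights[f] -= 1.0
        return weights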

    Automatic summarization of news in Spanish (Sumarización automática de noticias en español)

    With the vast amount of information generated every day, to which we have access thanks to the Internet, text summarization has proven to be an extremely useful tool, not only as a means of reading more efficiently but also of stripping texts of any irrelevant or secondary information they may contain. While some texts come with a summary or abstract of some kind, as this work does, newspaper articles do not; for that reason, this work studies different text-analytics techniques for summarizing news in Spanish. The work is developed in Python and aims at producing extractive summaries and generating headlines for news items drawn from national publications. After reviewing the current state of the art, we decided to implement two extractive summarization techniques in order to compare their results, and to build a sequence-to-sequence model for generating news headlines. The results demonstrate that the task can be carried out, while also revealing some problems, such as the selection of sentences that refer back to earlier information that was not selected, and the difficulty of summarizing interviews.
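
    The abstract does not name the two extractive techniques that were implemented; as one representative example of the extractive family, the sketch below scores each sentence by the normalized frequency of its content words (a classic Luhn-style heuristic) and keeps the top-k sentences in document order. The function name and the stopword handling are assumptions for illustration.

    import re
    from collections import Counter

    def summarize(text, k=3, stopwords=frozenset()):
        # Naive sentence splitting on end-of-sentence punctuation.
        sentences = re.split(r"(?<=[.!?])\s+", text.strip())
        # Document-wide word frequencies over non-stopwords.
        freq = Counter(w for w in re.findall(r"\w+", text.lower())
                       if w not in stopwords)

        def sentence_score(s):
            tokens = [w for w in re.findall(r"\w+", s.lower())
                      if w not in stopwords]
            return sum(freq[w] for w in tokens) / (len(tokens) or 1)

        # Rank sentences by score, then restore document order for readability.
        ranked = sorted(range(len(sentences)),
                        key=lambda i: sentence_score(sentences[i]), reverse=True)
        return " ".join(sentences[i] for i in sorted(ranked[:k]))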

    Computer-Aided Biomimetics: Semi-Open Relation Extraction from scientific biological texts

    Engineering inspired by biology, recently termed biom*, has led to various groundbreaking technological developments. Example areas of application include aerospace engineering and robotics. However, biom* is not always successful and is only sporadically applied in industry. The reason is that a systematic approach to biom* remains elusive, despite the existence of a plethora of methods and design tools. In recent years computational tools have been proposed as well, which can potentially support a systematic integration of relevant biological knowledge during biom*. However, these so-called Computer-Aided Biom* (CAB) tools have not been able to fill all the gaps in the biom* process. This thesis investigates why existing CAB tools fail, proposes a novel approach based on Information Extraction, and develops a proof-of-concept for a CAB tool that does enable a systematic approach to biom*. Key contributions include: 1) a disquisition of existing tools that guides the selection of a strategy for systematic CAB, 2) a dataset of 1,500 manually annotated sentences, and 3) a novel Information Extraction approach that combines the outputs of a supervised Relation Extraction system and an existing Open Information Extraction system. The implemented exploratory approach indicates that it is possible to extract a focused selection of relations from scientific texts with reasonable accuracy, without imposing limitations on the types of information extracted. Furthermore, the tool developed in this thesis is shown to i) speed up trade-off analysis by domain experts, and ii) improve access to biology information for non-experts.
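
    The combination of a supervised Relation Extraction system with an Open Information Extraction system, described above as "semi-open" relation extraction, can be pictured with a small sketch. The interfaces below are assumed stand-ins, not the thesis's actual systems: open_ie yields (subject, relation phrase, object) triples, and classify_relation maps an argument pair in context to a label from a fixed relation inventory, or None when no focused relation applies.

    from typing import Iterator, Optional, Tuple

    Triple = Tuple[str, str, str]

    def open_ie(sentence: str) -> Iterator[Triple]:
        # Stand-in for an existing Open IE extractor (hypothetical interface).
        raise NotImplementedError

    def classify_relation(sentence: str, subj: str, obj: str) -> Optional[str]:
        # Stand-in for the supervised Relation Extraction model (hypothetical).
        raise NotImplementedError

    def semi_open_extract(sentence: str) -> Iterator[Triple]:
        for subj, rel_phrase, obj in open_ie(sentence):
            label = classify_relation(sentence, subj, obj)
            if label is not None:
                # The supervised label anchors the triple to the focused
                # relation inventory.
                yield (subj, label, obj)
            else:
                # Otherwise keep the unrestricted Open IE relation phrase, so no
                # limitation is imposed on the types of information extracted.
                yield (subj, rel_phrase, obj)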

    Translation-based Word Sense Disambiguation

    This thesis investigates the use of the translation-based Mirrors method (Dyvik, 2005, inter alia) for Word Sense Disambiguation (WSD) for Norwegian. Word Sense Disambiguation is the process of automatically determining the relevant sense of an ambiguous word in context. Automated WSD is relevant for Natural Language Processing systems such as machine translation (MT), information retrieval, information extraction and content analysis. The most successful WSD approaches to date are so-called supervised machine learning (ML) techniques, in which the system 'learns' the contextual characteristics of each sense from a training corpus that contains concrete examples of contexts in which a word sense typically occurs. This approach suffers from a knowledge acquisition problem, since word senses are not overtly available in corpus text. First, we therefore need a sense inventory which is computationally tractable. Subjectively defined sense distinctions have been the norm in WSD research (especially the Princeton WordNet, Fellbaum, 1998), but WSD studies increasingly show that the WordNet senses are too fine-grained for efficient WSD, which has made WordNet less attractive for machine-learned WSD. Ide and Wilks (2006) recommend instead approximating word senses by way of cross-lingual sense definitions. Second, we need a method for sense-tagging context examples with the relevant sense given the context. Preparing such sense-tagged training corpora manually is costly and time-consuming, in particular because statistical methods require large amounts of training examples, and automated methods are therefore desirable. This thesis introduces an experimental lexical knowledge source which derives word senses and relations between word senses on the basis of translational correspondences in a parallel corpus, resulting in a structured semantic network (Dyvik, 2009). The Mirrors method is applicable to any language pair for which a parallel corpus and word alignment are available. The appeal of the Mirrors method and its translational basis for lexical semantics is that it offers an objective, consistent, and hence testable criterion, as opposed to the traditional subjective judgements in lexicon classification (cf. the Princeton WordNet). But due to the lack of intersubjective "gold standards" for lexical semantics, it is not an easy task to evaluate the Mirrors method. The main research question of this thesis may thus be formulated as follows: are the translation-based senses and semantic relations in the Mirrors method linguistically motivated from a monolingual point of view? To this end, this thesis proposes to use the monolingual task of WSD as a practical framework for evaluating the usefulness of the Mirrors method as a lexical knowledge source. This is motivated by the idea that a well-defined end-user application may provide a stable framework within which the benefits and drawbacks of a resource or a system can be demonstrated (e.g. Ng & Lee, 1996; Stevenson & Wilks, 2001; Yarowsky & Florian, 2002; Specia et al., 2009). The innovative aspect of applying the Mirrors method to WSD is two-fold: first, the Mirrors method is used to obtain sense-tagged data automatically (using cross-lingual data), providing a SemCor-like corpus which allows us to exploit semantically analysed context features in a subsequent WSD classifier.
Second, we will test whether training on semantically analysed context features, based on information from the Mirrors method, means that the system resolves different instances from those resolved by a 'traditional' classifier trained on words. In the absence of existing WSD data sets for Norwegian, an automatically sense-tagged parallel corpus and a manually verified lexical sample of fifteen target words were developed for Norwegian as part of this thesis. The proposed automatic sense-tagging method is based on the Mirrors sense inventory and on the translational correspondents of each word occurrence. The sense-tagger provides a partially semantically analysed context (partially, because the translation-based sense-tagger can only sense-tag tokens that were successfully word-aligned). The sense-tagged English-Norwegian Parallel Corpus (the ENPC) is comparable in size to the existing SemCor. The sense-tagged material formed the basis for a series of controlled experiments, in which the knowledge source is varied while the experimental framework (classification algorithm, data sets, lexical sample and sense inventory) is held constant. First, a WSD classifier is trained on the actually co-occurring context WORDS. This knowledge source functions as a point of reference indicating how well a traditional word-based classifier can be expected to perform, given our specific data sample and using the Mirrors sense inventory. Second, two Mirrors-derived knowledge sources were tentatively implemented, both of which attempt to generalise from the actually occurring context words as a means of alleviating the sparse data problem in WSD. For instance, if the noun phone is found to co-occur with the ambiguous noun bill in the 'invoice' sense, and if the classifier can generalise from this to words that are semantically close to phone, such as telephone, then the presence of only one of them during learning can make both of them 'known' to the classifier at classification time. In other words, it might be desirable to study not only word co-occurrences, as unanalysed and isolated units, but also how words enter into relations with other words (classes of words) in the structured network that constitutes the vocabulary of a language. In ML terms, it might be interesting to build a WSD model which learns, not how a word sense correlates with isolated words, but rather how a word sense correlates with certain classes of semantically related words. Such a tool for generalisation is clearly desirable in the face of sparse data and in view of the fact that most content words have a relatively low frequency even in larger text corpora. The first of the two Mirrors-based knowledge sources rests on so-called SEMANTIC-FEATURES that are shared between word senses in the Mirrors network. Since SEMANTIC-FEATURES may include a very high number of related words, a second knowledge source was also developed, RELATED-WORDS, which attempts to select a stricter class of near-related word senses in the wordnet-like Mirrors network. The results indicated that the gain in abstracting from context words to classes of semantically related word senses was only marginal, in that the two Mirrors-based knowledge sources knew only marginally more of the context words at classification time than a traditional word-based classifier.
Regarding classification accuracy, the Mirrors-based SEMANTIC-FEATURES seemed to suffer from including overly broad semantic information and performed significantly worse than the other two knowledge sources. The Mirrors-based RELATED-WORDS, on the other hand, was as good as, and sometimes better than, the traditional word model, but the differences were not found to be statistically significant. Although unfortunate for the purpose of enriching a traditional WSD model with Mirrors-derived information, the lack of a difference between the traditional word model and RELATED-WORDS nevertheless provides promising indications with regard to the plausibility of the Mirrors method.
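
    A small sketch may clarify the controlled comparison just described: the same classifier is trained either on raw context WORDS or on classes of semantically related words, with everything else held constant. The count-based Naive Bayes below and the mapping related_class (a stand-in for the Mirrors-derived RELATED-WORDS source) are illustrative assumptions, not the thesis's actual implementation.

    import math
    from collections import Counter, defaultdict

    def train_nb(instances, feature_fn):
        # instances: list of (context_words, sense); feature_fn maps a context
        # word to a feature (the word itself, or its semantic class).
        sense_counts, feat_counts = Counter(), defaultdict(Counter)
        for context, sense in instances:
            sense_counts[sense] += 1
            for w in context:
                feat_counts[sense][feature_fn(w)] += 1
        return sense_counts, feat_counts

    def classify(model, context, feature_fn, vocab_size):
        sense_counts, feat_counts = model
        total = sum(sense_counts.values())

        def log_prob(sense):
            lp = math.log(sense_counts[sense] / total)
            denom = sum(feat_counts[sense].values()) + vocab_size
            for w in context:
                # Add-one smoothing over the feature vocabulary.
                lp += math.log((feat_counts[sense][feature_fn(w)] + 1) / denom)
            return lp

        return max(sense_counts, key=log_prob)

    # WORD-based knowledge source:  feature_fn = lambda w: w
    # RELATED-WORDS-style source:   feature_fn = lambda w: related_class.get(w, w)
    # (related_class is a hypothetical dict from words to Mirrors-derived classes.)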