2,303 research outputs found

    OpenDMAP: An open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression

    Background: Information extraction (IE) efforts are widely acknowledged to be important in harnessing the rapid advance of biomedical knowledge, particularly in areas where important factual information is published in a diverse literature. Here we report on the design, implementation and several evaluations of OpenDMAP, an ontology-driven, integrated concept analysis system. It significantly advances the state of the art in information extraction by leveraging knowledge in ontological resources, integrating diverse text processing applications, and using an expanded pattern language that allows the mixing of syntactic and semantic elements and variable ordering.
    Results: OpenDMAP information extraction systems were produced for extracting protein transport assertions (transport), protein-protein interaction assertions (interaction) and assertions that a gene is expressed in a cell type (expression). Evaluations were performed on each system, resulting in F-scores ranging from 0.26 to 0.72 (precision 0.39–0.85, recall 0.16–0.85). Additionally, each of these systems was run over all abstracts in MEDLINE, producing a total of 72,460 transport instances, 265,795 interaction instances and 176,153 expression instances.
    Conclusion: OpenDMAP advances the performance standards for extracting protein-protein interaction predications from the full texts of biomedical research articles. Furthermore, this level of performance appears to generalize to other information extraction tasks, including extracting information about predicates of more than two arguments. The output of the information extraction system is always constructed from elements of an ontology, ensuring that the knowledge representation is grounded with respect to a carefully constructed model of reality. The results of these efforts can be used to increase the efficiency of manual curation efforts and to provide additional features in systems that integrate multiple sources for information extraction. The open source OpenDMAP code library is freely available at http://bionlp.sourceforge.net/
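    For orientation, the F-scores quoted above are the standard balanced F-measure, i.e., the harmonic mean of precision and recall. A minimal sketch of that calculation (the precision/recall pair passed in below is hypothetical, not a specific OpenDMAP result):

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical precision/recall pair, just to show the calculation.
print(round(f_score(0.85, 0.62), 2))  # 0.72
```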

    Research, development and evaluation of a practical model for sentiment analysis

    Sentiment Analysis is the task of extracting subjective information from input sources coming from a speaker or writer. Usually it refers to identifying whether a text holds a positive or negative polarity. The main approaches to Sentiment Analysis are lexicon- or dictionary-based methods and machine learning schemes. Lexicon-based models make use of a predefined set of words, where each word in the set has an associated polarity. Document polarity then depends on the feature selection method and on how the individual scores are combined. Machine-learning approaches usually rely on supervised classifiers. Although classifiers offer adaptability to specific contexts, they need to be trained with large amounts of labelled data, which may not be available, especially for emerging topics. This project, unlike most research in this field, aims to go further than polarity detection and focuses on identifying the actual emotions expressed in a document rather than only its positive or negative connotation. The set of sentiments used for this approach has been extracted from Plutchik's wheel of emotions, which defines eight basic bipolar sentiments and another eight advanced emotions, each composed of two basic ones. Moreover, in this project we have created a new scheme for Sentiment Analysis that combines a lexicon-based model for obtaining term emotions with a statistical approach that identifies the most relevant topics in the document, which are the targets of the sentiments. By taking this approach we have tried to overcome the disadvantages of simple bag-of-words models, which make no distinction between parts of speech (POS) and weight all words with the tf-idf scheme, thereby overweighting the most frequently used words. Furthermore, to improve the system's knowledge, this project presents a heuristic learning method that refines the initial knowledge so that it converges towards human-like sensitivity. To test the performance of the proposed scheme, an Android application for mobile devices has been developed. The app lets users take photos and enter descriptions, which are processed and classified with emotions; the classification may be corrected by the user so that system performance statistics can be extracted.
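    As a rough illustration of the lexicon-based half of such a scheme, the sketch below counts mentions of Plutchik's eight basic emotions using a tiny hand-written lexicon; the word list is a hypothetical stand-in, not the lexicon actually used in the project:

```python
from collections import Counter

# Hypothetical miniature lexicon keyed on Plutchik's eight basic emotions;
# a real lexicon would cover thousands of terms, possibly with graded weights.
EMOTION_LEXICON = {
    "delighted": "joy",
    "terrified": "fear",
    "furious": "anger",
    "astonished": "surprise",
    "heartbroken": "sadness",
    "revolted": "disgust",
    "hopeful": "anticipation",
    "devoted": "trust",
}

def emotion_profile(text: str) -> Counter:
    """Count how often each basic emotion is evoked by the words of a text."""
    counts = Counter()
    for token in text.lower().split():
        token = token.strip(".,!?;:")
        if token in EMOTION_LEXICON:
            counts[EMOTION_LEXICON[token]] += 1
    return counts

print(emotion_profile("I was terrified at first, but now I am delighted and hopeful."))
# Counter({'fear': 1, 'joy': 1, 'anticipation': 1})
```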

    Structurally informed methods for improved sentiment analysis

    Sentiment analysis deals with methods to automatically analyze opinions in natural language texts, e.g., product reviews. Such reviews contain a large number of fine-grained opinions, but to automatically extract detailed information it is necessary to handle a wide variety of verbalizations of opinions. The goal of this thesis is to develop robust structurally informed models for sentiment analysis which address challenges that arise from structurally complex verbalizations of opinions. In this thesis, we look at two examples of such verbalizations that benefit from including structural information in the analysis: negation and comparisons. Negation directly influences the polarity of sentiment expressions, e.g., while "good" is positive, "not good" expresses a negative opinion. We propose a machine learning approach that uses information from dependency parse trees to determine whether a sentiment word is in the scope of a negation expression. Comparisons like "X is better than Y" are the main topic of this thesis. We present a machine learning system for the task of detecting the individual components of comparisons: the anchor or predicate of the comparison, the entities that are compared, which aspect they are compared in, and which entity is preferred. Again, we use structural context from a dependency parse tree to improve the performance of our system. We discuss two ways of addressing the issue of limited availability of training data for our system. First, we create a manually annotated corpus of comparisons in product reviews, the largest such resource available to date. Second, we use the semi-supervised method of structural alignment to expand a small seed set of labeled sentences with similar sentences from a large set of unlabeled sentences. Finally, we work on the task of producing a ranked list of products that complements the isolated prediction of ratings and supports the user in a process of decision making. We demonstrate how we can use the information from comparisons to rank products and evaluate the result against two conceptually different external gold standard rankings.
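    To make the negation-scope idea concrete, here is a minimal rule-of-thumb sketch using a dependency parser (spaCy); the thesis itself trains a classifier over dependency-tree features rather than applying a hand-written rule like this one:

```python
import spacy

# Assumes the small English model is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def negated(sentence: str, sentiment_word: str) -> bool:
    """Crude check: does the sentiment word, or its syntactic head, govern a 'neg' dependent?"""
    doc = nlp(sentence)
    for tok in doc:
        if tok.text.lower() == sentiment_word.lower():
            # Look for a negation marker attached to the word itself or to its head.
            context = list(tok.children) + list(tok.head.children)
            return any(child.dep_ == "neg" for child in context)
    return False

print(negated("The zoom is good.", "good"))      # False
print(negated("The zoom is not good.", "good"))  # True
```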

    Automatic summarising: factors and directions

    This position paper suggests that progress with automatic summarising demands a better research methodology and a carefully focussed research strategy. In order to develop effective procedures it is necessary to identify and respond to the context factors, i.e. input, purpose, and output factors, that bear on summarising and its evaluation. The paper analyses and illustrates these factors and their implications for evaluation. It then argues that this analysis, together with the state of the art and the intrinsic difficulty of summarising, imply a nearer-term strategy concentrating on shallow, but not surface, text analysis and on indicative summarising. This is illustrated with current work, from which a potentially productive research programme can be developed

    Extracting Scales of Measurement Automatically from Biomedical Text with Special Emphasis on Comparative and Superlative Scales

    In this thesis, the focus is on the topic of “Extracting Scales of Measurement Automatically from Biomedical Text with Special Emphasis on Comparative and Superlative Scales.” Comparison sentences, when considered as a critical part of scales of measurement, play a highly significant role in the process of gathering information from a large number of biomedical research papers. A comparison sentence is defined as any sentence that contains two or more entities that are being compared. This thesis discusses several different types of comparison sentences, such as gradable and non-gradable comparisons. The main goal is to extract comparison sentences automatically from the full text of biomedical articles. To that end, the thesis presents a Java program that analyzes biomedical text and identifies comparison sentences by matching the sentences in the text against 37 syntactic and semantic features. These features can be used to extract comparative sentences from any biomedical text. Two machine learning techniques are applied together with the 37 features to assess the curated dataset, and the results of this study are compared with earlier studies.
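    As a toy illustration of what surface-level cues for comparative and superlative constructions can look like (the handful of regular expressions below are hypothetical examples, not the 37 features used in the thesis):

```python
import re

# A few hypothetical surface cues for comparisons; the thesis combines
# 37 syntactic and semantic features with machine learning on top of them.
COMPARISON_CUES = [
    re.compile(r"\b\w+er\s+than\b", re.IGNORECASE),        # "higher than", "greater than"
    re.compile(r"\bmore\s+\w+\s+than\b", re.IGNORECASE),   # "more effective than"
    re.compile(r"\bless\s+\w+\s+than\b", re.IGNORECASE),   # "less toxic than"
    re.compile(r"\bthe\s+most\s+\w+\b", re.IGNORECASE),    # superlative: "the most potent"
    re.compile(r"\bcompared\s+(?:to|with)\b", re.IGNORECASE),
]

def looks_comparative(sentence: str) -> bool:
    """Flag a sentence if it matches at least one comparison cue."""
    return any(pattern.search(sentence) for pattern in COMPARISON_CUES)

print(looks_comparative("Drug A was more effective than drug B in reducing tumor size."))  # True
print(looks_comparative("The samples were incubated overnight at 37 degrees."))            # False
```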

    A Computational Framework for Formalizing Rules and Managing Changes in Normative Systems

    Legal texts are typically written in natural language. However, a legal text that is written in a formal language has the advantage of being amenable to automation, at least partially. Such a translation is not easy, and the matter is further complicated by the fact that the law changes over time: once a legal text originally written in natural language has been formalized, the changes need to be tracked. This thesis proposes original developments on these subjects. In order to formalize a legal document, we provide a pipeline for translating a legal text from natural to formal language, and we apply it to the case of natural resources contracts. In general, adjectives play an important role in a text and help characterize it; for this reason we developed a logical system aimed at reasoning with gradable adjectives. Regarding norm change, we provide an ontology to represent change in a normative system, some basic mechanisms by which an agent may acquire new norms, and a study of the problem of revising a defeasible theory by changing only its facts. Another contribution of this thesis is a general framework for revision that includes the previous points as specific cases.

    Large Scale Application of Neural Network Based Semantic Role Labeling for Automated Relation Extraction from Biomedical Texts

    To reduce the increasing amount of time spent on literature search in the life sciences, several methods for automated knowledge extraction have been developed. Co-occurrence-based approaches can deal with large text corpora like MEDLINE in an acceptable time but are not able to extract any specific type of semantic relation. Semantic relation extraction methods based on syntax trees, on the other hand, are computationally expensive and the interpretation of the generated trees is difficult. Several natural language processing (NLP) approaches for the biomedical domain exist that focus specifically on the detection of a limited set of relation types. For systems biology, generic approaches are needed that detect a multitude of relation types and can also process large text corpora, but the number of systems meeting both requirements is very limited. We introduce the use of SENNA (“Semantic Extraction using a Neural Network Architecture”), a fast and accurate neural-network-based Semantic Role Labeling (SRL) program, for the large-scale extraction of semantic relations from the biomedical literature. A comparison of the processing times of SENNA and other SRL systems or syntactic parsers used in the biomedical domain revealed that SENNA is the fastest Proposition Bank (PropBank)-conforming SRL program currently available. 89 million biomedical sentences were tagged with SENNA on a 100-node cluster within three days. The accuracy of the presented relation extraction approach was evaluated on two test sets of annotated sentences, resulting in precision/recall values of 0.71/0.43. We show that both the accuracy and the processing speed of the proposed semantic relation extraction approach are sufficient for its large-scale application to biomedical text. The proposed approach is highly generalizable with regard to the supported relation types and appears to be especially suited for general-purpose, broad-scale text mining systems. The presented approach bridges the gap between fast, co-occurrence-based approaches that lack semantic relations and highly specialized, computationally demanding NLP approaches.
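    For readers unfamiliar with PropBank-style output, the sketch below shows how one labeled frame (a predicate plus numbered arguments) can be collapsed into a simple relation triple; the frame is hand-written for illustration and is not actual SENNA output:

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

@dataclass
class Frame:
    """One PropBank-style predicate-argument frame (hand-written example, not SENNA output)."""
    predicate: str
    args: Dict[str, str]  # role label -> text span

def frame_to_relation(frame: Frame) -> Optional[Tuple[str, str, str]]:
    """Collapse a frame into an (agent, predicate, patient) triple when both core arguments exist."""
    agent = frame.args.get("A0")    # PropBank ARG0: prototypical agent
    patient = frame.args.get("A1")  # PropBank ARG1: prototypical patient/theme
    if agent and patient:
        return (agent, frame.predicate, patient)
    return None

# Hypothetical labeling for: "TNF-alpha activates NF-kappaB in endothelial cells."
frame = Frame(predicate="activates",
              args={"A0": "TNF-alpha", "A1": "NF-kappaB", "AM-LOC": "in endothelial cells"})
print(frame_to_relation(frame))  # ('TNF-alpha', 'activates', 'NF-kappaB')
```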

    Unrestricted Bridging Resolution

    Anaphora plays a major role in discourse comprehension and accounts for the coherence of a text. In contrast to identity anaphora which indicates that a noun phrase refers back to the same entity introduced by previous descriptions in the discourse, bridging anaphora or associative anaphora links anaphors and antecedents via lexico-semantic, frame or encyclopedic relations. In recent years, various computational approaches have been developed for bridging resolution. However, most of them only consider antecedent selection, assuming that bridging anaphora recognition has been performed. Moreover, they often focus on subproblems, e.g., only part-of bridging or definite noun phrase anaphora. This thesis addresses the problem of unrestricted bridging resolution, i.e., recognizing bridging anaphora and finding links to antecedents where bridging anaphors are not limited to definite noun phrases and semantic relations between anaphors and their antecedents are not restricted to meronymic relations. In this thesis, we solve the problem using a two-stage statistical model. Given all mentions in a document, the first stage predicts bridging anaphors by exploring a cascading collective classification model. We cast bridging anaphora recognition as a subtask of learning fine-grained information status (IS). Each mention in a text gets assigned one IS class, bridging being one possible class. The model combines the binary classifiers for minority categories and a collective classifier for all categories in a cascaded way. It addresses the multi-class imbalance problem (e.g., the wide variation of bridging anaphora and their relative rarity compared to many other IS classes) within a multi-class setting while still keeping the strength of the collective classifier by investigating relational autocorrelation among several IS classes. The second stage finds the antecedents for all predicted bridging anaphors at the same time by exploring a joint inference model. The approach models two mutually supportive tasks (i.e., bridging anaphora resolution and sibling anaphors clustering) jointly, on the basis of the observation that semantically/syntactically related anaphors are likely to be sibling anaphors, and hence share the same antecedent. Both components are based on rich linguistically-motivated features and discriminatively trained on a corpus (ISNotes) where bridging is reliably annotated. Our approaches achieve substantial improvements over the reimplementations of previous systems for all three tasks, i.e., bridging anaphora recognition, bridging anaphora resolution and full bridging resolution. The work is – to our knowledge – the first bridging resolution system that handles the unrestricted phenomenon in a realistic setting. The methods in this dissertation were originally presented in Markert et al. (2012) and Hou et al. (2013a; 2013b; 2014). The thesis gives a detailed exposition, carrying out a thorough corpus analysis of bridging and conducting a detailed comparison of our models to others in the literature, and also presents several extensions of the aforementioned papers
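    A bare-bones sketch of the two-stage structure described above, with placeholder callables standing in for the cascaded collective classifier (stage 1) and the joint inference model (stage 2); everything here is illustrative scaffolding, not the thesis's actual models:

```python
from typing import Callable, Dict, List, Optional

def full_bridging_resolution(
    mentions: List[str],
    is_classifier: Callable[[str, List[str]], str],                # stage 1: mention -> IS class
    antecedent_ranker: Callable[[str, List[str]], Optional[str]],  # stage 2: anaphor -> antecedent
) -> Dict[str, Optional[str]]:
    """Stage 1 keeps the mentions labeled 'bridging'; stage 2 links each of them to an antecedent."""
    bridging_anaphors = [m for m in mentions if is_classifier(m, mentions) == "bridging"]
    return {anaphor: antecedent_ranker(anaphor, mentions) for anaphor in bridging_anaphors}

# Toy stand-ins so the skeleton runs end to end ("the door" bridges to "a house").
toy_is_classifier = lambda mention, _mentions: "bridging" if mention == "the door" else "other"
toy_ranker = lambda anaphor, mentions: "a house" if anaphor == "the door" else None
print(full_bridging_resolution(["a house", "the windows", "the door"], toy_is_classifier, toy_ranker))
# {'the door': 'a house'}
```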