143 research outputs found

    Filtering Microarray Correlations by Statistical Literature Analysis Yields Potential Hypotheses for Lactation Research

    Background: Recent studies have demonstrated that the cyclical nature of mouse lactation is mirrored at the transcriptome level of the mammary glands, but making sense of microarray results requires analysis of large amounts of biological information, which is increasingly difficult to access as the literature grows. Extraction of protein-protein interactions from text by statistical and natural language processing has been shown to be useful in managing the literature. Correlation between gene expression across a series of samples is a simple method for analyzing microarray data, as genes with related functions have been found to exhibit similar expression profiles. Microarrays have been used to examine the transcriptome of mouse lactation, showing that the cyclic nature of the lactation cycle observed histologically is reflected at the transcription level. However, no study to date has used text mining to sieve microarray analysis to generate new hypotheses for further research in lactational biology.
    Results: Our results demonstrated that a previously reported protein name co-occurrence method (5-mention PubGene), which was not based on a hypothesis testing framework, is generally statistically more significant than the 99th percentile of the Poisson distribution-based method of calculating co-occurrence. It agrees with previous methods using natural language processing to extract protein-protein interactions from text, as more than 96% of the interactions found by natural language processing methods overlap with the results from the 5-mention PubGene method. However, less than 2% of the gene co-expressions analyzed by microarray were found from direct co-occurrence or interaction information extraction from the literature. At the same time, combining microarray and literature analyses, we derive a novel set of 7 potential functional protein-protein interactions that had not been previously described in the literature.
    Conclusions: We conclude that the 5-mention PubGene method is more stringent than the 99th percentile of Poisson distribution method for extracting protein-protein interactions by co-occurrence of entity names, and that literature analysis may be a potential filter for microarray analysis to isolate potentially novel hypotheses for further research.
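The two co-occurrence criteria compared in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the expected co-occurrence under independence is assumed to be n_a * n_b / n_total, and "5-mention" is assumed to mean that two names appear together in at least five abstracts.

```python
import math


def poisson_percentile(lam, q=0.99):
    """Return the smallest k with P(X <= k) >= q for X ~ Poisson(lam).

    Uses a running sum of the probability mass function, so it is only
    suitable for modest lam (math.exp(-lam) underflows for very large lam).
    """
    k = 0
    pmf = math.exp(-lam)  # P(X = 0)
    cdf = pmf
    while cdf < q:
        k += 1
        pmf *= lam / k  # P(X = k) derived from P(X = k - 1)
        cdf += pmf
    return k


def expected_cooccurrence(n_a, n_b, n_total):
    """Expected number of abstracts naming both entities if they were independent
    (an assumption made for this sketch)."""
    return n_a * n_b / n_total


def poisson_significant(n_ab, n_a, n_b, n_total, q=0.99):
    """True if the observed co-occurrence count exceeds the q-th Poisson percentile."""
    lam = expected_cooccurrence(n_a, n_b, n_total)
    return n_ab > poisson_percentile(lam, q)


def pubgene_significant(n_ab, min_mentions=5):
    """True if the pair is co-mentioned in at least min_mentions abstracts."""
    return n_ab >= min_mentions
```

With these assumptions, a pair of rarely mentioned names seen together in two abstracts can pass the Poisson criterion while failing the fixed five-mention threshold, which is one way a fixed threshold can act as the more stringent filter.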

    Text mining for metabolic reaction extraction from scientific literature

    Science relies on data in all its different forms. In molecular biology and bioinformatics in particular, large-scale data generation has taken centre stage in the form of high-throughput experiments. In line with this exponential increase of experimental data has been the near exponential growth of scientific publications. Yet where classical data mining techniques are still capable of coping with this deluge in structured data (Chapter 2), access to information found in scientific literature is still limited to search engines allowing searches at the level of keywords, titles and abstracts. However, large amounts of knowledge about biological entities and their relations are held within the body of articles. When extracted, this data can be used as evidence for existing knowledge or for hypothesis generation, making scientific literature a valuable scientific resource. Unlocking the information inside the articles requires a dedicated set of techniques and approaches tailored to the unstructured nature of free text. Analogous to the field of data mining for the analysis of structured data, the field of text mining has emerged for unstructured text, and a number of applications have been developed in that field. This thesis is about text mining in the field of metabolomics. The work focusses on strategies for accessing large collections of scientific text and on the text mining steps required to extract metabolic reactions and their constituents, enzymes and metabolites, from scientific text. Metabolic reactions are important for our understanding of metabolic processes within cells, and that information provides an important link between genotype and phenotype. Furthermore, information about metabolic reactions stored in databases is far from complete, making it an excellent target for our text mining application. In order to access the scientific publications for further analysis they can be used as flat text or loaded into database systems. 
In Chapter 2 we assessed and discussed the capabilities and performance of XML-type database systems to store and access very large collections of XML-type documents in the form of the Medline corpus, a collection of more than 20 million scientific abstracts. XML data formats are common in the field of bioinformatics and are also at the core of most web services. With the increasing amount of data stored in XML comes the need for storing and accessing the data. The database systems were evaluated on a number of aspects broadly ranging from technical requirements to ease-of-use and performance. The performance of the different XML-type database systems was measured on Medline abstract collections of increasing size and with a number of different queries. One of the queries assessed the capabilities of each database system to search the full text of each abstract, which would allow access to the information within the text without further text analysis. The results show that all database systems cope well with the small and medium dataset, but that the full dataset remains a challenge. The query possibilities also varied greatly across all studied databases. This led us to conclude that the performances and possibilities of the different database types vary greatly, also depending on the type of research question. There is no single system that outperforms the others; instead different circumstances can lead to a different optimal solution. Some of these scenarios are presented in the chapter. Among the conclusions of Chapter 2 is that conventional data mining techniques do not work for the natural language part of a publication beyond simple retrieval queries based on pattern matching. The natural language used in written text is too unstructured for that purpose and requires dedicated text mining approaches, the main research topic of this thesis. 
Two major tasks of text mining are named entity recognition, the identification of relevant entities in the text, and relation extraction, the identification of relations between those named entities. For both text mining tasks many different techniques and approaches have been developed. For the named entity recognition of enzymes and metabolites we used a dictionary-based approach (Chapter 3) and for metabolic reaction extraction a full grammar approach (Chapter 4). In Chapter 3 we describe the creation of two thesauri, one for enzymes and one for metabolites, with the specific goal of allowing named entity identification, the mapping of identified synonyms to a common identifier, for metabolic reaction extraction. In the case of the enzyme thesaurus these identifiers are Enzyme Nomenclature numbers (EC numbers), in the case of the metabolite thesaurus KEGG metabolite identifiers. These thesauri are applied to the identification of enzymes and metabolites in the text mining approach of Chapter 4. Both were created from existing data sources by a series of automated steps followed by manual curation. Compared to a previously published chemical thesaurus, created entirely with automated steps, our much smaller metabolite thesaurus performed on the same level for F-measure with a slightly higher precision. The enzyme thesaurus produced results equal to our metabolite thesaurus. The compactness of our thesauri permits the manual curation step important in guaranteeing accuracy of the thesaurus contents, whereas creation from existing resources by automated means limits the effort required for creation. We concluded that our thesauri are compact and of high quality, and that this compactness does not greatly impact recall. In Chapter 4 we studied the applicability and performance of a full parsing approach using the two thesauri described in Chapter 3 for the extraction of metabolic reactions from scientific full-text articles. 
For this we developed a text mining pipeline built around a modified dependency parser from the AGFL grammar lab, using a pattern-based approach to extract metabolic reactions from the parsing output. Results of a comparison to a modified rule-based approach by Czarnecki et al. using three previously described metabolic pathways from the EcoCyc database show a slightly lower recall compared to the rule-based approach, but higher precision. We concluded that despite its current recall our full parsing approach to metabolic reaction extraction has high precision and potential to be used to (re-)construct metabolic pathways in an automated setting. Future improvements to the grammar and relation extraction rules should allow reactions to be extracted with even higher specificity. To identify potential improvements to the recall, the effect of a number of text pre-processing steps on the performance was tested in a number of experiments. The experiment that had the most effect on performance was the conversion of schematic chemical formulas to syntactically complete sentences, allowing them to be analysed by the parser. In addition to the improvements to the text mining approach described in Chapter 4, I make suggestions in Chapter 5 for potential improvements and extensions to our full parsing approach for metabolic reaction extraction. The core focus here is the increase of recall by optimising each of the steps required for the final goal of extracting metabolic reactions from the text. Some of the discussed improvements are to increase the coverage of the used thesauri, possibly with specialist thesauri depending on the analysed literature. Another potential target is the grammar, where there is still room to increase parsing success by taking into account the characteristics of biomedical language. 
On a different level are suggestions to include some form of anaphora resolution and search across sentence boundaries to increase the amount of information extracted from literature. In the second part of Chapter 5 I make suggestions as to how to maximise the information gained from the text mining results. One of the first steps should be integration with other biomedical databases to allow integration with existing knowledge about metabolic reactions and other biological entities. Another aspect is some form of ranking or weighting of the results to be able to distinguish between high quality results useful for automated analyses and lower quality results still useful for manual approaches. Furthermore, I provide a perspective on the necessity of computational literature analysis in the form of text mining. The main reasoning here is that human annotators cannot keep up with the amount of publications, so that some form of automated analysis is unavoidable. Lastly, I discuss the role of text mining in bioinformatics and with that also the accessibility of both text mining results and the literature resources necessary to create them. An important requirement for the future of text mining is that the barriers around high-throughput access to literature for text mining applications have to be removed. With regard to accessing text mining results, there is a long way to go for many applications, including ours, before they can be used directly by biologists. A major factor is that these applications rarely feature a suitable user interface and an easy-to-use setup. To conclude, I see the main role of a text mining system like ours in gathering evidence for existing knowledge and giving insights into the nuances of the research landscape of a given topic. 
When using the results of our reaction extraction system for the identification of ‘new’ reactions it is important to go back to the actual evidence presented for extra validation and to cross-validate the predictions with other resources or experiments. Ideally text mining will be used for generation of hypotheses, in which the researcher uses text mining findings to get ideas on, in our case, new connections between metabolites and enzymes; subsequently the researcher needs to go back to the original texts for further study. In this role text mining is an essential tool on the workbench of the molecular biologist.
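The dictionary-based entity identification described in this abstract, where synonyms found in text are mapped to a common identifier, could be sketched as follows. This is an illustrative toy, not the thesis pipeline: the thesaurus holds only a handful of hand-picked entries, and the names `TOY_THESAURUS`, `build_pattern` and `tag_entities` are hypothetical.

```python
import re

# Toy thesaurus (an assumption for illustration): lower-cased synonym ->
# (entity type, identifier). Enzymes map to EC numbers, metabolites to
# KEGG compound identifiers.
TOY_THESAURUS = {
    "hexokinase": ("enzyme", "EC 2.7.1.1"),
    "glucose": ("metabolite", "C00031"),
    "d-glucose": ("metabolite", "C00031"),
    "atp": ("metabolite", "C00002"),
    "adenosine triphosphate": ("metabolite", "C00002"),
}


def build_pattern(thesaurus):
    """Compile one case-insensitive regex over all synonyms.

    Longer synonyms come first so that "d-glucose" is preferred over
    "glucose" when both could match at the same position.
    """
    terms = sorted(thesaurus, key=len, reverse=True)
    alternation = "|".join(re.escape(term) for term in terms)
    # Lookarounds keep matches from starting or ending inside a longer word.
    return re.compile(r"(?<!\w)(" + alternation + r")(?!\w)", re.IGNORECASE)


def tag_entities(text, thesaurus=TOY_THESAURUS):
    """Return (surface form, entity type, identifier) for each dictionary hit."""
    pattern = build_pattern(thesaurus)
    hits = []
    for match in pattern.finditer(text):
        surface = match.group(1)
        entity_type, identifier = thesaurus[surface.lower()]
        hits.append((surface, entity_type, identifier))
    return hits
```

Calling `tag_entities("Hexokinase phosphorylates D-glucose using ATP.")` returns one hit per recognised name, each carrying the identifier shared by all of its synonyms. A curated thesaurus of realistic size would likely need a more efficient matcher than a single regular expression.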

    Procedurally Rhetorical Verb-Centric Frame Semantics as a Knowledge Representation for Argumentation Analysis of Biochemistry Articles

    The central focus of this thesis is rhetorical moves in biochemistry articles. Kanoksilapatham has provided a descriptive theory of rhetorical moves that extends Swales' CARS model to the complete biochemistry article. The thesis begins the construction of a computational model of this descriptive theory. Attention is placed on the Methods section of the articles. We hypothesize that because authors' argumentation closely follows their experimental procedure, procedural verbs may be the guide to understanding the rhetorical moves. Our work proposes an extension to the normal (i.e., VerbNet) semantic roles especially tuned to this domain. A major contribution is a corpus of Method sections that have been marked up for rhetorical moves and semantic roles. The writing style of this genre tends to occasionally omit semantic roles, so another important contribution is a prototype ontology that provides experimental procedure knowledge for the biochemistry domain. Our computational model employs machine learning to build its models for the semantic roles and rhetorical moves, validated against a gold standard reflecting the annotation of these texts by human experts. We provide significant insights into how to derive these annotations, and as such have contributions as well to the general challenge of producing markups in the domain of biomedical science documents, where specialized knowledge is required

    Text Mining for Pathway Curation

    Biological knowledge often involves understanding the interactions between molecules, such as proteins and genes, that form functional networks called pathways. New knowledge about pathways is typically communicated through publications and later condensed into structured formats such as textbooks, pathway databases or mathematical models. However, curating updated pathway models can be labour-intensive due to the growing volume of publications. This thesis investigates text mining methods to support pathway curation. We present PEDL (Protein-Protein-Association Extraction with Deep Language Models), a machine learning model designed to extract protein-protein associations (PPAs) from biomedical text. PEDL uses distant supervision and pre-trained language models to achieve higher accuracy than the state of the art. An expert evaluation confirms its usefulness for pathway curators. We also present PEDL+, a command-line tool that allows non-expert users to efficiently extract PPAs. When applied to pathway curation tasks, 55.6% to 79.6% of PEDL+ extractions were found useful by curators. The large number of PPAs identified by text mining can be overwhelming for researchers. To help, we present PathComplete, a model that suggests potential extensions to a pathway. It is the first method based on supervised machine learning for this task, using transfer learning from pathway databases. Our evaluations show that PathComplete significantly outperforms existing methods. Finally, we generalise pathway extension from PPAs to more realistic complex events. Here, our novel method for conditional graph modification outperforms the current best by 13-24% accuracy on three benchmarks. We also present a new dataset for event-based pathway extension. Overall, our results show that deep learning-based information extraction is a promising basis for supporting pathway curators.

    Research


    Program and Proceedings: The Nebraska Academy of Sciences 1880-2009

    PROGRAM: FRIDAY, APRIL 17, 2009. REGISTRATION FOR ACADEMY, Lobby of Lecture wing, Olin Hall; Aeronautics and Space Science, Olin 249; Collegiate Academy, Biology Session A, Olin B; Earth Science, Olin 224; Collegiate Academy, Chemistry and Physics, Session A, Olin 324; Biological and Medical Sciences, Session A, Olin 112; Biological and Medical Sciences, Session B, Smith Callen Conference Center; Junior Academy, Senior High REGISTRATION, Olin Hall Lobby; NWU Health and Sciences Graduate School Fair, Olin and Smith Curtiss Halls; Junior Academy, Senior High Competition, Olin 124, Olin 131; Aeronautics and Space Science, Poster Session, Olin 249; History and Philosophy of Science, Olin 325, combined section; Teaching of Science and Math, Olin 325, combined section; MAIBEN MEMORIAL LECTURE, OLIN B, Dr. Donald Frey, Chair, Department of Family Practice, Creighton University Medical Center; LUNCH, PATIO ROOM, STORY STUDENT CENTER (pay and carry tray through cafeteria line, or pay at NAS registration desk); Policy and Program Committee Luncheon, Roundup Room; Emeriti Luncheon, Presidents Room; Aeronautics Group, Conestoga Room; Anthropology, Olin 111; Biological and Medical Sciences, Session C, Olin 112; Biological and Medical Sciences, Session D, Smith Callen Conference Center; Chemistry and Physics, Section A, Chemistry, Olin A; Chemistry and Physics, Section B, Physics, Planetarium; Collegiate Academy, Biology Session A, Olin B; Collegiate Academy, Biology Session B, Olin 249; Collegiate Academy, Chemistry and Physics, Session A, Olin 324; Junior Academy, Junior High REGISTRATION, Olin Hall Lobby; Junior Academy, Senior High Competition (Final), Olin 110; Junior Academy, Junior High Competition, Olin 124, Olin 131; NJAS Board/Teacher Meeting, Olin 219; Junior Academy, General Awards Presentations, Smith Callen Conference Center; BUSINESS MEETING, OLIN B; SOCIAL HOUR for Members, Spouses, and Guests, First United Methodist Church, 2723 N 50th Street, Lincoln, NE; ANNUAL BANQUET and Presentation of Awards and Scholarships, First United Methodist Church, 2723 N 50th Street, Lincoln, N