
    Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data

    Microarrays can capture gene expression activity for thousands of genes simultaneously and thus make it possible to analyze cell physiology and disease processes at the molecular level. The interpretation of microarray gene expression experiments benefits from knowledge of the analyzed genes and proteins and of the biochemical networks in which they play a role. The trend is towards data analysis methods that integrate diverse data types. Currently, the most comprehensive biomedical knowledge source is a large repository of free-text articles. Text mining makes it possible to automatically extract and use information from texts. This thesis addresses two key aspects, biomedical text mining and gene expression data analysis, with the focus on providing high-quality methods and data that contribute to the development of integrated analysis approaches. The work is structured in three parts. Each part begins by providing the relevant background, and each chapter describes the developed methods as well as applications and results.
    Part I deals with biomedical text mining. Chapter 2 summarizes the relevant background of text mining: fundamentals, important text mining tasks, applications and particularities of text mining in the biomedical domain, and evaluation issues. In Chapter 3, a method for generating high-quality gene and protein name dictionaries is described. The analysis of the generated dictionaries revealed important properties of individual nomenclatures and the underlying databases (Fundel and Zimmer, 2006). The dictionaries are publicly available via a Wiki, a web service, and several client applications (Szugat et al., 2005). In Chapter 4, methods for the dictionary-based recognition of gene and protein names in texts and their mapping onto unique database identifiers are described. These methods make it possible to extract information from texts and to integrate text-derived information with data from other sources. Three named entity identification systems have been set up, two of them building upon the previously existing tool ProMiner (Hanisch et al., 2003). All of them have shown very good performance in the BioCreAtIvE challenges (Fundel et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007). In Chapter 5, a new method for relation extraction (Fundel et al., 2007) is presented. It was applied to the largest collection of biomedical literature abstracts, yielding a comprehensive network of human gene and protein relations. A classification approach (Küffner et al., 2006) can be used to specify relation types further, e.g., as activating, direct physical, or gene-regulatory relations.
    Part II deals with gene expression data analysis. Gene expression data needs to be processed so that differentially expressed genes can be identified. This processing consists of several sequential steps; two important ones are normalization, which aims at removing systematic variance between measurements, and quantification of differential expression by p-value and fold-change determination. Numerous methods exist for these tasks. Chapter 6 describes the relevant background of gene expression data analysis; it presents the biological and technical principles of microarrays and gives an overview of the most relevant data processing steps. Finally, it provides a short introduction to osteoarthritis, which is the focus of the analyzed gene expression data sets. In Chapter 7, quality criteria for the selection of normalization methods are described, and a method for the identification of differentially expressed genes is proposed that is appropriate for data with large intensity variance between spots representing the same gene (Fundel et al., 2005b). Furthermore, a system is described that selects an appropriate combination of feature selection method and classifier, and thus identifies genes which lead to good classification results and show consistent behavior in different sample subgroups (Davis et al., 2006). The analysis of several gene expression data sets dealing with osteoarthritis is described in Chapter 8. This chapter contains the biomedical analysis of relevant disease processes and distinct disease stages (Aigner et al., 2006a), and a comparison of various microarray platforms and osteoarthritis models.
    Part III deals with integrated approaches and thus provides the connection between Parts I and II. Chapter 9 gives an overview of different types of integrated data analysis approaches, with a focus on approaches that integrate gene expression data with manually compiled data, large-scale networks, or text mining. In Chapter 10, a method for the identification of genes which are consistently regulated and have a coherent literature background (Küffner et al., 2005) is described. This method indicates how gene and protein name identification and gene expression data can be integrated to return clusters which contain genes relevant for the respective experiment, together with literature information that supports interpretation. Finally, Chapter 11 presents ideas on how the described methods can contribute to current research, as well as possible future directions.
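The dictionary-based recognition described in Chapter 4 can be illustrated with a minimal sketch. The longest-match strategy and the toy synonym dictionary below are illustrative assumptions, not the thesis's actual algorithm or data:

```python
# Hypothetical sketch of dictionary-based gene/protein name identification:
# map synonyms to database identifiers, then scan text with a greedy
# longest-match over word windows. Names and identifiers are invented.

def build_index(synonym_dict):
    """Map each lower-cased synonym to its database identifier."""
    index = {}
    for gene_id, synonyms in synonym_dict.items():
        for syn in synonyms:
            index[syn.lower()] = gene_id
    return index

def find_gene_mentions(text, index, max_words=3):
    """Greedy longest-match search over word windows of the text."""
    words = text.split()
    mentions = []
    i = 0
    while i < len(words):
        matched = False
        for n in range(min(max_words, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n]).lower().strip(".,;:")
            if candidate in index:
                mentions.append((candidate, index[candidate]))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return mentions

synonyms = {"TP53": ["p53", "tumor protein p53"], "EGFR": ["EGFR", "HER1"]}
index = build_index(synonyms)
print(find_gene_mentions("Mutations in tumor protein p53 and HER1 are common.", index))
```

A real system such as ProMiner additionally handles ambiguous and case-sensitive synonyms with specialized detection variants, which this sketch omits.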

    Monthly hydrometeorological ensemble prediction of streamflow droughts and corresponding drought indices

    Streamflow droughts, characterized by low runoff as a consequence of a drought event, affect numerous aspects of life. Economic sectors impacted by low streamflow include power production, agriculture, tourism, water quality management and shipping. These sectors could potentially benefit from forecasts of streamflow drought events, even of short events on monthly time scales or below. Numerical hydrometeorological models have increasingly been used to forecast low streamflow and have become the focus of recent research. Here, we consider daily ensemble runoff forecasts for the river Thur, which has its source in the Swiss Alps. We focus on the evaluation of low streamflow and of derived indices such as duration, severity and magnitude, characterizing streamflow droughts up to a lead time of one month.
    The ECMWF VarEPS 5-member ensemble reforecast, which covers 18 yr, is used as forcing for the hydrological model PREVAH. A thorough verification reveals that, compared to probabilistic peak-flow forecasts, which show skill up to a lead time of two weeks, forecasts of streamflow droughts are skilful over the entire forecast range of one month. For forecasts at the lower end of the runoff regime, the quality of the initial state seems to be crucial for good forecast quality in the longer range. It is shown that the states used in this study to initialize the forecasts satisfy this requirement. The forecasts of streamflow drought indices, derived from the ensemble forecasts, could be beneficially included in a decision-making process. This holds for probabilistic forecasts of streamflow drought events falling below a daily varying threshold based on a quantile derived from a runoff climatology. Although the forecasts have a tendency to overpredict streamflow droughts, it is shown that the relative economic value of the ensemble forecasts reaches up to 60%, provided a forecast user is able to take preventive action based on the forecast.
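The drought definition used above, i.e. runoff falling below a daily varying threshold derived from a quantile of the runoff climatology, can be sketched as follows. The function, the quantile level and the data layout are illustrative assumptions, not the study's implementation:

```python
# Illustrative sketch (not the authors' code): flag streamflow drought days
# as runoff below a daily varying climatological quantile threshold, and
# summarize the event by duration, severity and magnitude.
import numpy as np

def drought_indices(runoff, climatology, q=0.2):
    """runoff: 1-D array of daily runoff for the forecast period.
    climatology: 2-D array (years x days) of historical runoff for the
    same calendar days, from which the daily threshold is derived."""
    threshold = np.quantile(climatology, q, axis=0)  # daily varying threshold
    deficit = np.maximum(threshold - runoff, 0.0)    # shortfall below threshold
    in_drought = runoff < threshold
    duration = int(in_drought.sum())                 # number of drought days
    severity = float(deficit.sum())                  # accumulated deficit
    magnitude = severity / duration if duration else 0.0
    return duration, severity, magnitude
```

Applied to each ensemble member separately, such indices yield the probabilistic drought forecasts whose economic value is assessed in the study.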

    Automatic Bat Call Classification using Transformer Networks

    Automatically identifying bat species from their echolocation calls is a difficult but important task for monitoring bats and the ecosystems they live in. Major challenges in automatic bat call identification are high call variability, similarities between species, interfering calls and a lack of annotated data. Many currently available models suffer from relatively poor performance on real-life data because they were trained on single-call datasets and, moreover, are often too slow for real-time classification. Here, we propose a Transformer architecture for multi-label classification with potential applications in real-time classification scenarios. We train our model on synthetically generated multi-species recordings, created by merging multiple bat calls into a single recording with multiple simultaneous calls. Our approach achieves a single-species accuracy of 88.92% (F1-score of 84.23%) and a multi-species macro F1-score of 74.40% on our test set. In comparison to three other tools on the independent and publicly available dataset ChiroVox, our model achieves at least 25.82% better accuracy for single-species classification and at least 6.9% better macro F1-score for multi-species classification.
    Comment: Volume 78, December 2023, 10228
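The synthetic training-data idea, overlaying single-species recordings and attaching a multi-hot label, can be sketched minimally. Array shapes and the simple additive mixing are assumptions for illustration; the paper's actual augmentation pipeline is not specified here:

```python
# Hedged sketch of generating a multi-species training example by merging
# single-species call recordings (equal-length signal arrays) and building
# a multi-hot target vector for multi-label classification.
import numpy as np

def mix_recordings(recordings, labels, num_species):
    """recordings: array of shape (k, n_samples), one row per source call.
    labels: species index for each source recording."""
    mixed = np.sum(recordings, axis=0)   # overlay the calls
    target = np.zeros(num_species)
    for lab in labels:
        target[lab] = 1.0                # multi-hot label vector
    return mixed, target
```

Training on such mixtures lets a multi-label model learn to report every species present, which the macro F1-score then evaluates per class.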

    Development and Evaluation of a Therapeutic Training Phantom for Flexible Endoscopy with a Focus on Interventional Hemostasis

    Introduction: GI bleeds are acute emergencies and represent the most frequent indication for emergency endoscopic interventions. Flexible endoscopy is the first-line approach for the diagnosis and treatment of GI bleeding. Several methods of endoscopic hemostasis are distinguished: mechanical techniques, injection, and thermal techniques. Experience in handling the flexible endoscope and knowledge of the available hemostasis techniques markedly improve the therapeutic outcome. The aim is to improve patient care through the early introduction of endoscope training units. Methods: A bleeding phantom free of animal material, with artificial mucosa and submucosa (patch model and stomach model), was developed, on which the following hemostasis techniques can be trained: clip application, injection, and APC. The patch model was integrated into a training course consisting of a theoretical and a practical part. The complete training was tested by eleven medical students and evaluated via a questionnaire. The students' learning effect was assessed with two evaluation forms, immediately after the training and three months later. Results: The development of an animal-material-free bleeding model was successful. Two simulator types resulted: a stomach model and a patch model. The hemostasis techniques listed above can be performed on these models (see Methods). Injection produces a well-demarcated wheal. Various clip variants adhere well to the artificial tissue. APC application produces superficial eschar formation. Eleven medical students in the clinical phase of their studies completed a training session with the patch model. They rated the model (grade 1.5) and the training positively.
    Discussion: The demand for realistic phantoms for training in bleeding situations is high, and the demand for animal-material-free solutions for such training simulations is growing. Integrating hemostasis training into the education and continuing training of physicians appears sensible, since even students subjectively benefit after a single training session. Use of the developed bleeding phantom for training endoscopic hemostasis has already been tested successfully in small groups.

    Homogenisation of a gridded snow water equivalent climatology for Alpine terrain: methodology and applications

    Gridded snow water equivalent (SWE) data sets are valuable for estimating snow water resources and for verifying different model systems, e.g. hydrological, land surface or atmospheric models. However, changing data availability represents a considerable challenge when trying to derive consistent time series for SWE products. In an attempt to improve product consistency, we first evaluated the differences between two climatologies of SWE grids that were calculated on the basis of data from 110 and 203 stations, respectively. The "shorter" climatology (2001–2009) was produced using 203 stations (map203) and the "longer" one (1971–2009) using 110 stations (map110). Relative to map203, map110 underestimated SWE, especially at higher elevations and at the end of the winter season. We tested the potential of quantile mapping to compensate for the mapping errors of map110 relative to map203. During a 9 yr calibration period from 2001 to 2009, for which both map203 and map110 were available, the method successfully refined the spatial and temporal SWE representation in map110 by making seasonal, regional and altitude-related distinctions. Expanding the calibration to the full 39 yr showed that the general underestimation of map110 with respect to map203 could be removed for the whole winter. The calibrated SWE maps fitted the reference (map203) well when averaged over regions and time periods, with a mean error of approximately zero. However, deviations between the calibrated maps and map203 were observed for single grid cells and years. When we examined three regions in more detail, we found that the calibration had the largest effect in the region with the highest proportion of catchment areas above 2000 m a.s.l., and that the general underestimation of map110 compared to map203 could be removed for the entire snow season. The added value of the calibrated SWE climatology is illustrated with practical examples: the verification of a hydrological model, the estimation of snow resource anomalies and the predictability of runoff through SWE.
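The core of empirical quantile mapping, correcting one product toward the distribution of another over a common calibration period, can be sketched briefly. The function below is a generic textbook form with invented data, not the study's seasonal, regional and altitude-stratified implementation:

```python
# Minimal empirical quantile-mapping sketch: map each value's quantile in a
# source reference distribution (e.g. map110, calibration period) onto the
# corresponding quantile of a target reference (e.g. map203).
import numpy as np

def quantile_map(values, source_ref, target_ref, n_quantiles=100):
    """Correct `values` drawn from the source distribution toward the target."""
    probs = np.linspace(0.0, 1.0, n_quantiles)
    src_q = np.quantile(source_ref, probs)   # source quantile function
    tgt_q = np.quantile(target_ref, probs)   # target quantile function
    # locate each value's quantile in the source, then read off the target
    p = np.interp(values, src_q, probs)
    return np.interp(p, probs, tgt_q)
```

In the study this correction was additionally conditioned on season, region and elevation, which is what allowed the systematic underestimation of map110 to be removed.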

    Normalization and Gene p-Value Estimation: Issues in Microarray Data Processing

    Introduction: Numerous methods exist for the basic processing, e.g., normalization, of microarray gene expression data. These methods have an important effect on the final analysis outcome. It is therefore crucial to select methods appropriate for a given dataset in order to ensure the validity and reliability of expression data analysis. Furthermore, biological interpretation requires expression values for genes, which are often represented by several spots or probe sets on a microarray. How to best integrate spot/probe-set values into gene values has so far been somewhat neglected.
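The spot-to-gene integration problem raised above can be illustrated with a minimal sketch. The median is one common, robust aggregation choice used here for illustration; it is not necessarily the method the article advocates:

```python
# Illustrative sketch: collapse several probe-level expression values per
# gene into a single gene-level value via the median.
from collections import defaultdict
from statistics import median

def probes_to_genes(probe_values, probe_to_gene):
    """probe_values: {probe_id: expression}; probe_to_gene: {probe_id: gene}."""
    per_gene = defaultdict(list)
    for probe, value in probe_values.items():
        per_gene[probe_to_gene[probe]].append(value)
    return {gene: median(vals) for gene, vals in per_gene.items()}
```

The choice of aggregation matters: a mean is sensitive to a single outlier spot, whereas the median tolerates the large intensity variance between spots representing the same gene.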

    ProMiner: rule-based protein and gene entity recognition

    doi:10.1186/1471-2105-6-S1-S14. From the supplement "A critical assessment of text mining methods in molecular biology" (editors: Christian Blaschke, Lynette Hirschman, Alfonso Valencia, Alexander Yeh; report).
    Background: Identification of gene and protein names in biomedical text is a challenging task, as the corresponding nomenclature has evolved over time. This has led to multiple synonyms for individual genes and proteins, as well as names that may be ambiguous with other gene names or with general English words. The Gene List Task of the BioCreAtIvE challenge evaluation enables comparison of systems addressing the problem of protein and gene name identification on common benchmark data.
    Methods: The ProMiner system uses a pre-processed synonym dictionary to identify potential name occurrences in biomedical text and associate protein and gene database identifiers with the detected matches. It follows a rule-based approach, and its search algorithm is geared towards the recognition of multi-word names [1]. To account for the large number of ambiguous synonyms in the considered organisms, the system has been extended to use specific variants of the detection procedure for highly ambiguous and case-sensitive synonyms. Based on all detected synonyms fo

    Impact of Entity Graphs on Extracting Semantic Relations

    Relation extraction (RE) between a pair of entity mentions in text is an important and challenging task, especially for open-domain relations. Generally, relations are extracted based on lexical and syntactic information at the sentence level. However, global information about known entities has not yet been explored for the RE task. In this paper, we propose to extract a graph of entities from the overall corpus and to compute features on this graph that are able to capture evidence of relationships holding between a pair of entities. The proposed features boost RE performance significantly when combined with linguistic features.
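The corpus-level entity graph and pair features described above can be sketched as follows. The co-occurrence edge definition and the two example features (shared neighbours, direct-edge flag) are illustrative choices, not the paper's exact feature set:

```python
# Hedged sketch: build an undirected entity graph from document-level
# co-occurrence, then compute simple graph features for a candidate pair.
from collections import defaultdict
from itertools import combinations

def build_entity_graph(documents):
    """documents: iterable of sets of entity names; entities co-occurring
    in the same document are connected by an edge."""
    graph = defaultdict(set)
    for entities in documents:
        for a, b in combinations(sorted(entities), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def pair_features(graph, e1, e2):
    shared = len(graph[e1] & graph[e2])   # common neighbours in the corpus
    direct = 1 if e2 in graph[e1] else 0  # pair co-occurs somewhere
    return {"shared_neighbours": shared, "direct_edge": direct}
```

Features of this kind supply the global, corpus-level evidence that purely sentence-level lexical and syntactic features miss.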

    Gene and protein nomenclature in public databases

    BACKGROUND: Frequently, several alternative names are in use for biological objects such as genes and proteins. Applications like manual literature search, automated text mining, named entity identification, gene/protein annotation, and linking of knowledge from different information sources require knowledge of all names used to refer to a given gene or protein. Various organism-specific and general public databases aim at organizing knowledge about genes and proteins; these databases can be used to derive gene and protein name dictionaries. So far, little is known about the differences between databases in terms of size, ambiguity and overlap. RESULTS: We compiled five gene and protein name dictionaries for each of the five model organisms (yeast, fly, mouse, rat, and human) from different organism-specific and general public databases. We analyzed the degree of ambiguity of gene and protein names within and between dictionaries and with respect to a lexicon of common English words and domain-related non-gene terms, and we compared the data sources in terms of the size of the extracted dictionaries and the overlap of synonyms between them. The study shows that the number of genes/proteins and synonyms covered in individual databases varies significantly for a given organism, and that the degree of ambiguity of synonyms varies significantly between organisms. Furthermore, it shows that, despite considerable co-curation efforts, the overlap of synonyms between different data sources is rather moderate, and that the degree of ambiguity of gene names with respect to common English words and domain-related non-gene terms depends on the considered organism. CONCLUSION: These results indicate that combining the data contained in different databases allows the generation of gene and protein name dictionaries that contain significantly more names in actual use than dictionaries obtained from individual data sources.
    Furthermore, curation of the combined dictionaries considerably increases size and decreases ambiguity. The entries of the curated synonym dictionary are available for manual querying, editing, and PubMed- or Google-search via the ProThesaurus-wiki. For automated querying via custom software, we offer a web service and an exemplary client application.
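The two central quantities of the analysis above, within-dictionary ambiguity and between-dictionary synonym overlap, can be sketched concretely. The dictionary contents below are invented examples; case-insensitive matching is an assumption:

```python
# Sketch of the ambiguity/overlap analysis: synonyms mapping to more than
# one gene within a dictionary, and the synonym overlap between two
# dictionaries (case-insensitive).
from collections import defaultdict

def ambiguous_synonyms(dictionary):
    """dictionary: {gene_id: [synonyms]}. Returns synonyms used by >1 gene."""
    owners = defaultdict(set)
    for gene_id, synonyms in dictionary.items():
        for syn in synonyms:
            owners[syn.lower()].add(gene_id)
    return {syn for syn, genes in owners.items() if len(genes) > 1}

def synonym_overlap(dict_a, dict_b):
    """Fraction of dict_a's synonyms also present in dict_b."""
    syns_a = {s.lower() for v in dict_a.values() for s in v}
    syns_b = {s.lower() for v in dict_b.values() for s in v}
    return len(syns_a & syns_b) / len(syns_a) if syns_a else 0.0
```

Applied across data sources, such measures reveal the moderate synonym overlap and organism-dependent ambiguity reported in the study.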