6 research outputs found

    Figure mining for biomedical research

    Motivation: Figures from biomedical articles contain valuable information that is difficult to reach without specialized tools. Currently, there is no search engine that can retrieve specific figure types. Results: This study describes a retrieval method that takes advantage of principles in image understanding, text mining and optical character recognition (OCR) to retrieve figure types defined conceptually. A search engine was developed to retrieve tables and figure types to aid computational and experimental research. Availability: http://iossifovlab.cshl.edu/figurome

    LINNAEUS: A species name identification system for biomedical literature

    Background: The task of recognizing and identifying species names in biomedical literature has recently been regarded as critical for a number of applications in text and data mining, including gene name recognition, species-specific document retrieval, and semantic enrichment of biomedical articles. Results: In this paper we describe an open-source species name recognition and normalization software system, LINNAEUS, and evaluate its performance relative to several automatically generated biomedical corpora, as well as a novel corpus of full-text documents manually annotated for species mentions. LINNAEUS uses a dictionary-based approach (implemented as an efficient deterministic finite-state automaton) to identify species names and a set of heuristics to resolve ambiguous mentions. When compared against our manually annotated corpus, LINNAEUS performs with 94% recall and 97% precision at the mention level, and 98% recall and 90% precision at the document level. Our system successfully solves the problem of disambiguating uncertain species mentions, with 97% of all mentions in PubMed Central full-text documents resolved to unambiguous NCBI taxonomy identifiers. Conclusions: LINNAEUS is an open source, stand-alone software system capable of recognizing and normalizing species name mentions with speed and accuracy, and can therefore be integrated into a range of bioinformatics and text-mining applications. The software and manually annotated corpus can be downloaded freely at http://linnaeus.sourceforge.net/.
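The dictionary-based matching idea in the abstract above can be illustrated with a small sketch. This is not the LINNAEUS implementation: it assumes a character trie (a simple form of deterministic automaton), a toy six-entry dictionary, greedy longest-match, and a plain word-boundary check. The names `TrieNode`, `build_trie`, `find_mentions` and `TOY_DICTIONARY` are hypothetical, and the ambiguity-resolution heuristics mentioned in the abstract are left out.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    """One automaton state: outgoing character transitions plus an optional ID."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.species_id: Optional[str] = None  # set only where a name ends


# Toy dictionary (assumption for illustration): name -> NCBI taxonomy ID.
TOY_DICTIONARY = {
    "homo sapiens": "9606",
    "human": "9606",
    "mus musculus": "10090",
    "mouse": "10090",
    "escherichia coli": "562",
    "e. coli": "562",
}


def build_trie(dictionary: Dict[str, str]) -> TrieNode:
    """Insert every lower-cased name into the trie, character by character."""
    root = TrieNode()
    for name, species_id in dictionary.items():
        node = root
        for ch in name.lower():
            node = node.children.setdefault(ch, TrieNode())
        node.species_id = species_id
    return root


def find_mentions(text: str, root: TrieNode) -> List[Tuple[int, int, str]]:
    """Return (start, end, species_id) for the longest match at each position."""
    lowered = text.lower()  # assumes lower-casing keeps string length (true for ASCII)
    mentions: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(lowered):
        # A mention may only start at a word boundary.
        if i > 0 and lowered[i - 1].isalnum():
            i += 1
            continue
        node, j, best = root, i, None
        while j < len(lowered) and lowered[j] in node.children:
            node = node.children[lowered[j]]
            j += 1
            # Accept only if a dictionary name ends here, at a word boundary.
            at_boundary = j == len(lowered) or not lowered[j].isalnum()
            if node.species_id is not None and at_boundary:
                best = (i, j, node.species_id)
        if best is not None:
            mentions.append(best)
            i = best[1]  # resume scanning after the matched mention
        else:
            i += 1
    return mentions
```

Calling `find_mentions(text, build_trie(TOY_DICTIONARY))` yields character offsets and identifiers, so synonyms such as "human" and "Homo sapiens" normalize to the same ID. A production system would need a far larger dictionary and the disambiguation step that this sketch omits.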

    Open Source Analysis of Biomedical Figures

    With a selection of biomedical literature available for open access, a natural pairing seems to be the use of open source software to automatically analyze content, in particular, the content of figures. Considering the large number of possible tools and approaches, we choose to focus on the recognition of printed characters. As the problem of optical character recognition (OCR) under reasonable conditions is considered to be solved, and as open source software is fully capable of isolating the location of characters and identifying most of them accurately, we instead use OCR as an application area for the relatively recent development of compressive sampling, and in particular a fast implementation called compressive sensing matching pursuit (CoSaMP). Compressive sampling enables recovery of a signal from noisy measurements if certain rigorous mathematical conditions hold on previously measured samples, the mathematical conditions stating that measured samples must be essentially nearly perpendicular, orthogonal, to each other. For OCR, we investigate approximating such nearly orthogonal samples by selecting random curves, then using CoSaMP to determine a sparse number of samples approximating character shapes. We compare the accuracy of three different methods of applying CoSaMP to the problem of matching a blurred character to one of a set of previously sampled characters. We show numerically that selecting random curves does not satisfy the strict mathematical conditions for compressive sampling theory to guarantee optimal solutions. However, character matching strategies using CoSaMP transformed characters can be developed whose accuracy is roughly comparable to a baseline comparison of blurred characters with original characters, suggesting that OCR is an example where the performance of compressive sampling methods declines gracefully as conditions are weakened on the sampling matrix.
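For readers unfamiliar with CoSaMP, the core iteration can be sketched in a few lines of NumPy. This is a generic textbook-style version, not the authors' OCR pipeline. It assumes a random Gaussian sampling matrix, which, unlike the random-curve samples studied in the abstract, is known to satisfy the near-orthogonality condition with high probability. The function name `cosamp` and its parameters are illustrative choices.

```python
import numpy as np


def cosamp(A: np.ndarray, y: np.ndarray, s: int,
           max_iter: int = 50, tol: float = 1e-10) -> np.ndarray:
    """Estimate an s-sparse x from measurements y = A @ x (sketch of CoSaMP)."""
    n = A.shape[1]
    x = np.zeros(n)
    residual = y.copy()
    for _ in range(max_iter):
        # 1. Signal proxy: correlate the residual with every column of A.
        proxy = A.T @ residual
        # 2. Identify the 2s largest proxy entries, merge with current support.
        omega = np.argsort(np.abs(proxy))[-2 * s:]
        support = np.union1d(omega, np.flatnonzero(x))
        # 3. Least-squares fit restricted to the merged support.
        coef = np.linalg.lstsq(A[:, support], y, rcond=None)[0]
        b = np.zeros(n)
        b[support] = coef
        # 4. Prune: keep only the s largest coefficients.
        keep = np.argsort(np.abs(b))[-s:]
        x = np.zeros(n)
        x[keep] = b[keep]
        # 5. Update the residual and stop once it is negligible.
        residual = y - A @ x
        if np.linalg.norm(residual) <= tol * np.linalg.norm(y):
            break
    return x
```

With a well-conditioned matrix, for example 80 Gaussian measurements of a 5-sparse signal of length 200, the loop typically converges in a handful of iterations. When the sampling matrix violates the theory's conditions, as the abstract reports for random curves, the same iteration still runs but carries no recovery guarantee.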

    Integrative bioinformatics applications for complex human disease contexts

    This thesis presents new methods for analyzing high-throughput data from modern sources in the context of complex human diseases, using a bioinformatics analysis workflow as an example. New measurement techniques improve the resolution with which cellular and molecular processes can be monitored. While RNA sequencing (RNA-seq) measures mRNA expression, single-cell RNA-seq (scRNA-seq) resolves it on a per-cell basis. Long-read sequencing is increasingly used in genomics. Imaging mass spectrometry (IMS) measures protein levels in tissues with spatial resolution. All of these techniques pose specific challenges that need to be addressed with new computational methods. Collecting knowledge with contextual annotations is important for integrative data analyses. Such knowledge is available through large literature repositories, from which information such as miRNA-gene interactions can be extracted using text mining methods. Once this information is aggregated in new databases, specific questions can be answered with traceable evidence. Combining experimental data with these databases opens new possibilities for integrative methods and for answering questions relevant to complex human diseases. Several data sources are made available, such as literature for text mining miRNA-gene interactions (Chapter 2), next- and third-generation sequencing data for genomics and transcriptomics (Chapters 4.1, 5), and IMS for spatially resolved proteomics (Chapter 4.4). For these data sources, new methods for information extraction and pre-processing are developed. For instance, third-generation sequencing runs can be monitored and evaluated using the poreSTAT and sequ-into methods. The integrative (downstream) analyses make use of these heterogeneous data sources. The cPred method (Chapter 4.2) for cell type prediction from scRNA-seq data was successfully applied in the context of the SARS-CoV-2 pandemic. 
The robust differential expression (DE) analysis pipeline RoDE (Chapter 6.1) contains a large set of methods for (differential) data analysis, reporting and visualization of RNA-seq data. The accessibility of bioinformatics software is discussed using practical applications (Chapter 3). The developed miRNA-gene interaction database gives valuable insights into atherosclerosis-relevant processes and serves as the regulatory network for predicting active miRNA regulators in RoDE (Chapter 6.1). The cPred predictions, RoDE results, scRNA-seq and IMS data are unified as input for the 3D index Aorta3D (Chapter 6.2), which makes atherosclerosis-related datasets browsable. Finally, the scRNA-seq analysis with subsequent cPred cell type prediction, and the robust analysis of bulk RNA-seq datasets, led to novel insights into COVID-19. Taken together, the methods discussed improve integrative analysis for complex human disease contexts at essential points.

    Automated and Improved Search Query Effectiveness Design for Systematic Literature Reviews

    This research investigates strategies for automating the systematic literature review (SLR) process. SLR is a valuable research method that follows a comprehensive, transparent, and reproducible research methodology. SLRs are at the heart of evidence-based research in various research domains, from healthcare to software engineering. They allow researchers to systematically collect and integrate empirical evidence in response to a focused research question, setting the foundation for future research. SLRs also help researchers learn about the state of the art and enrich their knowledge of a research topic. Given their demonstrated value, SLRs are becoming an increasingly popular type of publication in different disciplines. Despite the valuable contributions of SLRs to science, performing timely, reliable, comprehensive, and unbiased SLRs is a challenging endeavour. With the rapid growth in primary research published every year, SLRs might fail to provide complete coverage of existing evidence and even end up being outdated by the time of publication. These challenges have sparked motivation and discussion in research communities to explore automation techniques to support the SLR process. In investigating automatic methods for supporting the systematic review process, this thesis addresses three main areas. First, by conducting a systematic literature review, we identified the state of the art of automation techniques that are employed to facilitate the systematic review process. Then, in the second study, we identified the real challenges researchers face when conducting SLRs, through an empirical study. We also identified solutions that help researchers overcome these challenges, as well as researchers' concerns about adopting automation techniques in SLR practice. 
Finally, in the third study, we leveraged the findings of our previous studies to investigate a solution to facilitate the SLR search process. We evaluated the proposed method experimentally.