162 research outputs found

A graph-search framework for associating gene identifiers with documents

Author: A Yeh
AM Cohen
C Zhai
Consortium TGO
D Hanisch
E Hatcher
E Minkov
Einat Minkov
F Sha
J Crim
K Franzén
K Fundel
K Humphreys
L Hirschman
M Collins
M Craven
R Bunescu
RI Kondor
T Rindflesch
U Leser
William W Cohen
WW Cohen
Y Altun
Y Freund
Z Kou
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: One step in the model organism database curation process is to find, for each article, the identifier of every gene discussed in the article. We consider a relaxation of this problem suitable for semi-automated systems, in which each article is associated with a ranked list of possible gene identifiers, and experimentally compare methods for solving this geneId-ranking problem. In addition to baseline approaches based on combining named entity recognition (NER) systems with a "soft dictionary" of gene synonyms, we evaluate a graph-based method which combines the outputs of multiple NER systems, as well as other sources of information, and a learning method for reranking the output of the graph-based method. RESULTS: We show that NER systems with similar F-measure performance can have significantly different performance when used with a soft dictionary for geneId-ranking. The graph-based approach can outperform any of its component NER systems, even without learning, and learning can further improve the performance of the graph-based ranking approach. CONCLUSION: The utility of an NER system for geneId-finding may not be accurately predicted by its entity-level F1 performance, the most common performance measure. GeneId-ranking systems are best implemented by combining several NER systems. With appropriate combination methods, usefully accurate geneId-ranking systems can be constructed based on easily available resources, without resorting to problem-specific, engineered components.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Discovering Complex Relationships between Drugs and Diseases

Author: Sharma Ranjana
Publication venue: North Dakota State University
Publication date: 01/01/2011

Finding the complex semantic relations between existing drugs and new diseases can open a new route for drug development. Most drugs that have found new uses were discovered by serendipity. The prediction of drug uses for more than one disease should therefore be done systematically, by studying the semantic relations between drugs and diseases and the other entities involved in those relations. To study these complex semantic relations, an application was developed that integrates heterogeneous data in different formats from different public databases available online. A high-level ontology was also developed to integrate the data, and only the fields required for the current study were used. The data was collected from sources such as DrugBank, UniProt/SwissProt, GeneCards and OMIM. Most of these are standard data sources and have been used by the National Center for Biotechnology Information of the National Institutes of Health. The data was parsed and integrated, and complex semantic relations were discovered. This is a simple and novel effort which may find uses in the development of new drug targets and in polypharmacology.

NDSU Libraries Institutional Repository

Systematic approaches to mine, predict and visualize biological functions

Author: Chang Yi-Chien
Publication date: 12/02/2016

With advances in high-throughput technologies and next-generation sequencing, the amount of genomic and proteomic data is dramatically increasing in the post-genomic era. One of the biggest challenges that has arisen is the connection of sequences to their activities and the understanding of their cellular functions and interactions. In this dissertation, I present three different strategies for mining, predicting and visualizing biological functions. In the first part, I present the COMputational BRidges to EXperiments (COMBREX) project, which facilitates the functional annotation of microbial proteins by leveraging the power of the scientific community. The goal is to bring computational biologists and biochemists together to expand our knowledge. A database-driven web portal has been built to serve as a hub for the community. Predicted annotations will be deposited into the database, and the recommendation system will guide biologists to the predictions whose experimental validation will be most beneficial to our knowledge of microbial proteins. In addition, by taking advantage of the rich content, we develop a web service to help community members enrich their genome annotations. In the second part, I focus on identifying the genes for enzyme activities that lack genetic details in the major biological databases. Protein sequences are unknown for about one-third of the characterized enzyme activities listed in the EC system, the so-called orphan enzymes. Our approach considers the similarities between enzyme activities, enabling us to deal with broad types of orphan enzymes in eukaryotes. I apply our framework to human orphan enzymes and show that we can successfully fill the knowledge gaps in the human metabolic network. In the last part, I construct a platform for visually analyzing the ecosystem-level metabolic network. Most microbes live in a multiple-species environment. The underlying nutrient exchange can be seen as a dynamic ecosystem-level metabolic network. The complexity of the network poses new visualization challenges. Using the data predicted by Computation Of Microbial Ecosystems in Time and Space (COMETS), I demonstrate that our platform is a powerful tool for investigating the interactions of the microbial community. We apply it to the exploration of a simulated microbial ecosystem in the human gut. The result reflects both existing knowledge and novel mutualistic interactions, such as the nutrient exchanges between E. coli, C. difficile and L. acidophilus.

Boston University Institutional Repository (OpenBU)

LAITOR - Literature Assistant for Identification of Terms co-Occurrences and Relationships

Author: Andrade-Navarro Miguel A
Barbosa-Silva Adriano
Fontaine Jean-Fred
Magalhães Ivan LF
Ortega J Miguel
Pavlopoulos Georgios A
Schneider Reinhard
Soldatos Theodoros G
Publication venue: BioMed Central
Publication date: 01/01/2010

BACKGROUND: Biological knowledge is represented in scientific literature that often describes the function of genes/proteins (bioentities) in terms of their interactions (biointeractions). Such bioentities are often related to biological concepts of interest that are specific to a particular research field. Therefore, the study of the current literature about a selected topic deposited in public databases facilitates the generation of novel hypotheses associating a set of bioentities to a common context. RESULTS: We created a text mining system (LAITOR: Literature Assistant for Identification of Terms co-Occurrences and Relationships) that analyses co-occurrences of bioentities, biointeractions, and other biological terms in MEDLINE abstracts. The method accounts for the position of the co-occurring terms within sentences or abstracts. The system detected abstracts mentioning protein-protein interactions in a standard test (BioCreative II IAS test data) with a precision of 0.82-0.89 and a recall of 0.48-0.70. We illustrate the application of LAITOR to the detection of plant response genes in a dataset of 1000 abstracts relevant to the topic. CONCLUSIONS: Text mining tools combining the extraction of interacting bioentities and biological concepts with network displays can be helpful in developing reasonable hypotheses in different scientific backgrounds.
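To make the idea of sentence-level co-occurrence analysis concrete, here is a minimal Python sketch. It is not LAITOR's implementation: the function name `sentence_cooccurrences`, the naive sentence splitter, the case-insensitive substring matching and the placeholder terms (`geneA`, `geneB`, `drought stress`) are all assumptions made for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def sentence_cooccurrences(abstract, terms):
    """Count how often each pair of terms appears in the same sentence.

    Hypothetical helper: matching is a plain case-insensitive substring
    test, which is far cruder than a real bioentity tagger.
    """
    counts = Counter()
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", abstract)
    for sentence in sentences:
        lowered = sentence.lower()
        # Sort so that each pair is always stored in the same order.
        present = sorted(t for t in terms if t.lower() in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts


# Made-up text with placeholder entity names, not real MEDLINE content.
example = (
    "geneA binds geneB during drought stress. "
    "geneB is induced by drought stress."
)
pairs = sentence_cooccurrences(example, ["geneA", "geneB", "drought stress"])
print(pairs)
```

Counting per sentence rather than per abstract is one simple way to take term position into account; a real system would likely also weight pairs by where in the sentence or abstract they occur.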

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Open Repository and Bibliography - Luxembourg

The gene normalization task in BioCreative III

Author: A McCallum
AA Morgan
AP Dawid
AS Schwartz
B Settles
B Turner
C Lindberg
Cheng-Ju Kuo
Chih-Hsuan Wei
Chun-Nan Hsu
CN Hsu
D Hong-Jie
D Rebholz-Schuhmann
David Campos
DD Lewis
Dina Vishnyakova
E Agirre
F Leitner
F Rinaldi
Fabio Rinaldi
Feifan Liu
H Liu
Han-Cheol Cho
HD Carroll
Hong-Jie Dai
Hongfang Liu
Hung-Yu Kao
Illes Solt
J Hakenberg
J Whitechill
Jingchen Liu
Karin Verspoor
Kevin M Livingston
KG Dowell
L Hirschman
L Smith
M Ashburner
M Gerner
M Hall
M Huang
Manabu Torii
Martin Gerner
Martin Romacker
ME Colosimo
Minlie Huang
Naoaki Okazaki
P Donmez
P Ruch
P Smyth
P Welinder
Padmini Srinivasan
Patrick Ruch
R Leaman
R Snow
Richard Tzong-Han Tsai
S Bhattacharya
S Brin
S Matos
S Sarntivijai
Sanmitra Bhattacharya
Sergio Matos
Shashank Agarwal
T Kappeler
T Zhang
TH Haveliwala
VC Raykar
VS Sheng
W John Wilbur
X Wang
Z Lu
Zhiyong Lu
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance.
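The reported improvements are consistent with relative gains of the composite system over the best single-team TAP-k scores. The short Python check below reproduces that arithmetic from the figures quoted in the abstract; the function and variable names are illustrative assumptions, and the relative-gain formula is inferred from the numbers rather than stated in the source.

```python
def relative_improvement(baseline, improved):
    """Relative gain of `improved` over `baseline`, in percent."""
    return (improved - baseline) / baseline * 100.0


# TAP-k scores on the gold standard, as quoted in the abstract (keyed by k).
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}

# Assumed formula: (composite - best_team) / best_team, rounded to one decimal.
gains = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}
print(gains)
```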

Crossref

Springer - Publisher Connector

PubMed Central

ZORA

Visualizing Cluster-specific Genes from Single-cell Transcriptomics Data Using Association Plots

Author: Fadakar S.
Gralinska E.
Kohl C.
Vingron M.
Publication venue: Elsevier BV
Publication date: 07/03/2022

Visualizing single-cell transcriptomics data in an informative way is a major challenge in biological data analysis. Clustering of cells is a prominent analysis step and the results are usually visualized in a planar embedding of the cells using methods like PCA, t-SNE, or UMAP. Given a cluster of cells, one frequently searches for the genes highly expressed specifically in that cluster. At this point, visualization is usually replaced by studying a list of differentially expressed genes. Association Plots are derived from correspondence analysis and constitute a planar visualization of the features which characterize a given cluster of observations. We have adapted Association Plots to address the challenge of visualizing cluster-specific genes in large single-cell data sets. Our method is made available as a free R package called APL. We demonstrate the application of APL and Association Plots to single-cell RNA-seq data on two example data sets. First, we present how to delineate novel marker genes using Association Plots with the example of Peripheral Blood Mononuclear Cell data. Second, we show how to annotate cell clusters with known cell types using Association Plots and a predefined list of marker genes. To do this we will use data from the human cell atlas of fetal gene expression. Results from Association Plots will also be compared to methods for deriving differentially expressed genes, and we will show the integration of APL with Gene Ontology Enrichment.

MPG.PuRe

LAITOR - Literature Assistant for Identification of Terms co-Occurrences and Relationships

Author: Andrade-Navarro M.A.
Barbosa-Silva A.
Fontaine J.F.
Magalhaes I.L.
Ortega J.M.
Pavlopoulos G.A.
Schneider R.
Soldatos T.G.
Publication venue: Springer Science and Business Media LLC
Publication date: 01/02/2010

BACKGROUND: Biological knowledge is represented in scientific literature that often describes the function of genes/proteins (bioentities) in terms of their interactions (biointeractions). Such bioentities are often related to biological concepts of interest that are specific to a particular research field. Therefore, the study of the current literature about a selected topic deposited in public databases facilitates the generation of novel hypotheses associating a set of bioentities to a common context. RESULTS: We created a text mining system (LAITOR: Literature Assistant for Identification of Terms co-Occurrences and Relationships) that analyses co-occurrences of bioentities, biointeractions, and other biological terms in MEDLINE abstracts. The method accounts for the position of the co-occurring terms within sentences or abstracts. The system detected abstracts mentioning protein-protein interactions in a standard test (BioCreative II IAS test data) with a precision of 0.82-0.89 and a recall of 0.48-0.70. We illustrate the application of LAITOR to the detection of plant response genes in a dataset of 1000 abstracts relevant to the topic. CONCLUSIONS: Text mining tools combining the extraction of interacting bioentities and biological concepts with network displays can be helpful in developing reasonable hypotheses in different scientific backgrounds.

MDC Repository

A critical evaluation of network and pathway based classifiers for outcome prediction in breast cancer

Author: A Subramanian
C Desmedt
Christine Staiger
D Hanahan
E Lee
F Reyal
G Abraham
GR Mishra
Gunnar W. Klau
HY Chuang
I Ulitsky
IW Taylor
Joaquín Dopazo
K Chin
KR Brown
L Ein-Dor
L Tian
LD Miller
LFA Wessels
LJ van’t Veer
Lodewyk F. A. Wessels
M Kanehisa
Marcus Dittrich
MH van Vliet
MJ van de Vijver
ML Gatza
MT Dittrich
P Dao
Raul Kooter
S Loi
S Ma
SA Chowdhury
Sidney Cadot
Tobias Müller
TSK Prasad
V Popovici
Y Pawitan
Y Wang
Publication venue: Public Library of Science (PLoS)
Publication date: 01/10/2011

Recently, several classifiers that combine primary tumor data, like gene expression data, and secondary data sources, such as protein-protein interaction networks, have been proposed for predicting outcome in breast cancer. In these approaches, new composite features are typically constructed by aggregating the expression levels of several genes. The secondary data sources are employed to guide this aggregation. Although many studies claim that these approaches improve classification performance over single gene classifiers, the gain in performance is difficult to assess. This stems mainly from the fact that different breast cancer data sets and validation procedures are employed to assess the performance. Here we address these issues by employing a large cohort of six breast cancer data sets as a benchmark set and by performing an unbiased evaluation of the classification accuracies of the different approaches. Contrary to previous claims, we find that composite feature classifiers do not outperform simple single gene classifiers. We investigate the effect of (1) the number of selected features; (2) the specific gene set from which features are selected; (3) the size of the training set and (4) the heterogeneity of the data set on the performance of composite feature and single gene classifiers. Strikingly, we find that randomization of secondary data sources, which destroys all biological information in these sources, does not result in a deterioration in performance of composite feature classifiers. Finally, we show that when a proper correction for gene set size is performed, the stability of single gene sets is similar to the stability of composite feature sets. Based on these results there is currently no reason to prefer prognostic classifiers based on composite features over single gene classifiers for predicting outcome in breast cancer.

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

VU Research Portal

CWI's Institutional Repository

Directory of Open Access Journals

PubMed Central

Online-Publikations-Server der Universität Würzburg

FigShare

Enabling Complex Semantic Queries to Bioinformatics Databases through Intuitive Search Over Data

Author: Sima Ana Claudia
Publication venue: Université de Lausanne, Faculté de biologie et médecine
Publication date: 26/10/2020

Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data already publicly available. However, the heterogeneity of the existing data sources still poses significant challenges for achieving interoperability among biological databases. Furthermore, merely solving the technical challenges of data integration, for example through the use of common data representation formats, leaves open the larger problem: the steep learning curve required for understanding the data models of each public source, as well as the technical language through which the sources can be queried and joined. As a consequence, most of the available biological data remain practically unexplored today. In this thesis, we address these problems jointly, by first introducing an ontology-based data integration solution in order to mitigate the data source heterogeneity problem. We illustrate through the concrete example of Bgee, a gene expression data source, how relational databases can be exposed as virtual Resource Description Framework (RDF) graphs, through relational-to-RDF mappings. This has the important advantage that the original data source can remain unmodified, while still becoming interoperable with external RDF sources. We complement our methods with applied case studies designed to guide domain experts in formulating expressive federated queries targeting the integrated data across the domains of evolutionary relationships and gene expression. More precisely, we introduce two comparative analyses, first within the same domain (using orthology data from multiple, interoperable data sources) and second across domains, in order to study the relation between expression change and evolution rate following a duplication event. Finally, in order to bridge the semantic gap between users and data, we design and implement Bio-SODA, a question answering system over domain knowledge graphs that does not require training data for translating user questions to SPARQL. Bio-SODA uses a novel ranking approach that combines syntactic and semantic similarity, while also incorporating node centrality metrics to rank candidate matches for a given user question. Our results in testing Bio-SODA across several real-world databases that span multiple domains (both within and outside bioinformatics) show that it can answer complex, multi-fact queries, beyond the current state-of-the-art in the more well-studied open-domain question answering.

Serveur académique lausannois