320 research outputs found

Co-occurrence based meta-analysis of scientific texts: retrieving biological relationships between genes

Author: Dorssers L.C.J. (Lambert)
Eijk C.C. (Christiaan) van der
Jelier R. (Rob)
Jenster G.W. (Guido)
Kors J.A. (Jan)
Mons B. (Barend)
Mulligen E.M. (Erik) van
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2005

MOTIVATION: The advent of high-throughput experiments in molecular biology creates a need for methods to efficiently extract and use information for large numbers of genes. Recently, the associative concept space (ACS) has been developed for the representation of information extracted from biomedical literature. The ACS is a Euclidean space in which thesaurus concepts are positioned and the distances between concepts indicate their relatedness. The ACS uses co-occurrence of concepts as a source of information. In this paper we evaluate how well the system can retrieve functionally related genes and we compare its performance with a simple gene co-occurrence method. RESULTS: To assess the performance of the ACS we composed a test set of five groups of functionally related genes. With the ACS, good scores were obtained for four of the five groups. When compared to the gene co-occurrence method, the ACS is capable of revealing more functional biological relations and can achieve results with less literature available per gene. Hierarchical clustering was performed on the ACS output, as a potential aid to users, and was found to provide useful clusters. Our results suggest that the algorithm can be of value for researchers studying large numbers of genes. AVAILABILITY: The ACS program is available upon request from the authors.
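
The simple gene co-occurrence baseline mentioned in this abstract can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the ACS and not the authors' implementation: it assumes each abstract has already been reduced to a set of gene symbols, and it scores a gene pair by the number of abstracts mentioning both genes. The `abstracts` toy data and the function name `cooccurrence_counts` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstracts):
    """Count, for every unordered gene pair, the abstracts mentioning both genes."""
    counts = Counter()
    for genes in abstracts:
        # Sort so that each pair is stored under one canonical ordering.
        for pair in combinations(sorted(set(genes)), 2):
            counts[pair] += 1
    return counts


# Hypothetical toy data: one set of gene symbols per abstract.
abstracts = [
    {"BRCA1", "BRCA2", "TP53"},
    {"BRCA1", "TP53"},
    {"ESR1", "BRCA1"},
]

counts = cooccurrence_counts(abstracts)
for pair, n in counts.most_common():
    print(pair, n)
```

A raw count like this favours genes with a large literature, which is consistent with the abstract's remark that the ACS can achieve results with less literature available per gene.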

EUR Research Repository

Erasmus University Digital Repository

Overview of BioCreative II gene normalization

Author: Cohen Aaron M
Cohen K Bretonnel
Divoli Anna
Fluck Juliane
Fundel Katrin
Hakenberg Jörg
Hirschman Lynette
Hsu Chun-Nan
Krauthammer Michael
Lau William W
Leaman Robert
Liu Heng-hui
Liu Hongfang
Lu Zhiyong
Morgan Alexander A
Ruch Patrick
Schuemie Martijn
Sun Chengjie
Torres Rafael
Wang Xinglong
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert; genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%. Results: Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers obtained improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers. Conclusion: Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to human experts, at over 90% agreement. These results show promise as tools to link the literature with biological databases
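
The figures quoted for the 'maximum recall' system can be checked with a few lines of arithmetic. The sketch below assumes the standard balanced F-measure (harmonic mean of precision and recall), which matches the abstract's description of "balanced precision and recall"; the function name `f_measure` is illustrative and not taken from the paper.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values stated in the abstract for the pooled 'maximum recall' system.
identified, gold_total = 763, 785
recall = identified / gold_total      # about 0.97, as reported
precision = 0.23                      # as reported

print(round(recall, 2))
print(round(f_measure(precision, recall), 2))
```

With precision this low, the balanced F-measure of the pooled system stays well below the 0.80 to 0.81 achieved by the best individual systems, even though its recall is far higher.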

Springer - Publisher Connector

EUR Research Repository

Erasmus University Digital Repository

The gene normalization task in BioCreative III

Author: A McCallum
AA Morgan
AP Dawid
AS Schwartz
B Settles
B Turner
C Lindberg
Cheng-Ju Kuo
Chih-Hsuan Wei
Chun-Nan Hsu
CN Hsu
D Hong-Jie
D Rebholz-Schuhmann
David Campos
DD Lewis
Dina Vishnyakova
E Agirre
F Leitner
F Rinaldi
Fabio Rinaldi
Feifan Liu
H Liu
Han-Cheol Cho
HD Carroll
Hong-Jie Dai
Hongfang Liu
Hung-Yu Kao
Illes Solt
J Hakenberg
J Whitechill
Jingchen Liu
Karin Verspoor
Kevin M Livingston
KG Dowell
L Hirschman
L Smith
M Ashburner
M Gerner
M Hall
M Huang
Manabu Torii
Martin Gerner
Martin Romacker
ME Colosimo
Minlie Huang
Naoaki Okazaki
P Donmez
P Ruch
P Smyth
P Welinder
Padmini Srinivasan
Patrick Ruch
R Leaman
R Snow
Richard Tzong-Han Tsai
S Bhattacharya
S Brin
S Matos
S Sarntivijai
Sanmitra Bhattacharya
Sergio Matos
Shashank Agarwal
T Kappeler
T Zhang
TH Haveliwala
VC Raykar
VS Sheng
W John Wilbur
X Wang
Z Lu
Zhiyong Lu
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: We report the Gene Normalization (GN) challenge in BioCreative III where participating teams were asked to return a ranked list of identifiers of the genes detected in full-text articles. For training, 32 fully and 500 partially annotated articles were prepared. A total of 507 articles were selected as the test set. Due to the high annotation cost, it was not feasible to obtain gold-standard human annotations for all test articles. Instead, we developed an Expectation Maximization (EM) algorithm approach for choosing a small number of test articles for manual annotation that were most capable of differentiating team performance. Moreover, the same algorithm was subsequently used for inferring ground truth based solely on team submissions. We report team performance on both gold standard and inferred ground truth using a newly proposed metric called Threshold Average Precision (TAP-k). RESULTS: We received a total of 37 runs from 14 different teams for the task. When evaluated using the gold-standard annotations of the 50 articles, the highest TAP-k scores were 0.3297 (k=5), 0.3538 (k=10), and 0.3535 (k=20), respectively. Higher TAP-k scores of 0.4916 (k=5, 10, 20) were observed when evaluated using the inferred ground truth over the full test set. When combining team results using machine learning, the best composite system achieved TAP-k scores of 0.3707 (k=5), 0.4311 (k=10), and 0.4477 (k=20) on the gold standard, representing improvements of 12.4%, 21.8%, and 26.6% over the best team results, respectively. CONCLUSIONS: By using full text and being species non-specific, the GN task in BioCreative III has moved closer to a real literature curation task than similar tasks in the past and presents additional challenges for the text mining community, as revealed in the overall team results. 
By evaluating teams using the gold standard, we show that the EM algorithm allows team submissions to be differentiated while keeping the manual annotation effort feasible. Using the inferred ground truth we show measures of comparative performance between teams. Finally, by comparing team rankings on gold standard vs. inferred ground truth, we further demonstrate that the inferred ground truth is as effective as the gold standard for detecting good team performance
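
The reported improvements of the composite system over the best team results follow directly from the TAP-k scores in the abstract. The sketch below only reproduces that arithmetic; it does not implement the TAP-k metric itself, and the helper name `relative_improvement` is illustrative.

```python
# TAP-k scores on the gold standard, as stated in the abstract (keyed by k).
best_team = {5: 0.3297, 10: 0.3538, 20: 0.3535}
composite = {5: 0.3707, 10: 0.4311, 20: 0.4477}


def relative_improvement(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


improvements = {
    k: round(relative_improvement(best_team[k], composite[k]), 1)
    for k in best_team
}
print(improvements)  # {5: 12.4, 10: 21.8, 20: 26.6}
```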

Springer - Publisher Connector

Biomedical word sense disambiguation with ontologies and metadata: automation meets accuracy

Author: Alexopoulou Dimitra
Andreopoulos Bill
Dietze Heiko
Doms Andreas
Gandon Fabien
Hakenberg Jörg
Khelif Khaled
Schroeder Michael
Wächter Thomas
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: Ontology term labels can be ambiguous and have multiple senses. While this is no problem for human annotators, it is a challenge for automated methods that identify ontology terms in text. Classical approaches to word sense disambiguation use co-occurring words or terms. However, most treat ontologies as simple terminologies, without making use of the ontology structure or the semantic similarity between terms. Another useful source of information for disambiguation is metadata. Here, we systematically compare three approaches to word sense disambiguation, which use ontologies and metadata, respectively. Results: The 'Closest Sense' method assumes that the ontology defines multiple senses of the term. It computes the shortest path of co-occurring terms in the document to one of these senses. The 'Term Cooc' method defines a log-odds ratio for co-occurring terms including co-occurrences inferred from the ontology structure. The 'MetaData' approach trains a classifier on metadata. It does not require any ontology, but requires training data, which the other methods do not. To evaluate these approaches we defined a manually curated training corpus of 2600 documents for seven ambiguous terms from the Gene Ontology and MeSH. All approaches over all conditions achieve 80% success rate on average. The 'MetaData' approach performed best with 96%, when trained on high-quality data. Its performance deteriorates as quality of the training data decreases. The 'Term Cooc' approach performs better on Gene Ontology (92% success) than on MeSH (73% success) as MeSH is not a strict is-a/part-of, but rather a loose is-related-to hierarchy. The 'Closest Sense' approach achieves on average 80% success rate. Conclusion: Metadata is valuable for disambiguation, but requires high-quality training data. Closest Sense requires no training, but a large, consistently modelled ontology, which are two opposing conditions. Term Cooc achieves greater than 90% success given a consistently modelled ontology. Overall, the results show that well-structured ontologies can play a very important role in improving disambiguation. Availability: The three benchmark datasets created for the purpose of disambiguation are available in Additional file 1 (1471-2105-10-28-S1.txt). Additional file 1: Benchmark datasets used in the experiments. The three corpora (High quality/Low quantity corpus; Medium quality/Medium quantity corpus; Low quality/High quantity corpus) are given in the form of PubMed identifiers (PMID) for True/False cases for the 7 ambiguous terms examined (GO/MeSH/UMLS identifiers are also given).

Springer - Publisher Connector

Directory of Open Access Journals

INRIA a CCSD electronic archive server

SJSU ScholarWorks

Calling on a million minds for community annotation in WikiProteins.

Author: Ashburner Michael
Bairoch Amos
Barris Nickolas
Berkeley Alfred
Borner Katy
Chichester Christine
Cockerill Matthew
den Dunnen Johan
Hermjakob Henning
Lewis Suzanna
Meijssen Gerard
Melton William
Moeller Erik
Mons Albert
Mons Barend
Musen Mark
Pacheco Roberto
Packer Abel
Roes Peter Jan
van Mulligen Erik
van Ommen Gert-Jan
Wales Jimmy
Weeber Marc
Publication venue: Genome Biol
Publication date: 01/01/2008

WikiProteins enables community annotation in a Wiki-based system. Extracts of major data sources have been fused into an editable environment that links out to the original sources. Data from community edits create automatic copies of the original data. Semantic technology captures concepts co-occurring in one sentence and thus potential factual statements. In addition, indirect associations between concepts have been calculated. We call on a 'million minds' to annotate a 'million concepts' and to collect facts from the literature with the reward of collaborative knowledge discovery. The system is available for beta testing at http://www.wikiprofessional.org. Rights: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief, you may copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.

Springer - Publisher Connector

IUScholarWorks (University of Indiana)

EUR Research Repository

Erasmus University Digital Repository

Identifying overrepresented concepts in gene lists from literature: a statistical approach based on Poisson mixture model

Author: Chee Brant
He Xin
Ling Xu
Sarma Moushumi Sen
Schatz Bruce
Zhai Chengxiang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Large-scale genomic studies often identify large gene lists, for example, the genes sharing the same expression patterns. The interpretation of these gene lists is generally achieved by extracting concepts overrepresented in the gene lists. This analysis often depends on manual annotation of genes based on controlled vocabularies, in particular, Gene Ontology (GO). However, the annotation of genes is a labor-intensive process, and the vocabularies are generally incomplete, leaving some important biological domains inadequately covered. Results: We propose a statistical method that uses the primary literature, i.e. free text, as the source to perform overrepresentation analysis. The method is based on a mixture-model statistical framework and addresses the methodological flaws in several existing programs. We implemented this method within a literature mining system, BeeSpace, taking advantage of its analysis environment and added features that facilitate the interactive analysis of gene sets. Through experimentation with several datasets, we showed that our program can effectively summarize the important conceptual themes of large gene sets, even when traditional GO-based analysis does not yield informative results. Conclusions: We conclude that the current work will provide biologists with a tool that effectively complements the existing ones for overrepresentation analysis from genomic experiments. Our program, Genelist Analyzer, is freely available at: http://workerbee.igb.uiuc.edu:8080/BeeSpace/Search.jsp

Springer - Publisher Connector

Directory of Open Access Journals

The implicitome: A resource for rationalizing gene-disease associations

Author: Bruskiewich R. (Richard)
Dunnen J.T. (Johan) den
Aten E. (Emmelien)
Good B.M. (Benjamin M.)
Haagen H.H.H.B.M. (Herman) van
Hettne K.M. (Kristina)
Hoen P.A.C. (Peter) 't
Kaliyaperumal R. (Rajaram)
Kors J.A. (Jan)
Laros J.F.J. (Jeroen F.)
Li T.S. (Tong Shu)
Mina E. (Eleni)
Mons B. (Barend)
Roos M. (Marco)
Schuemie M.J. (Martijn)
Schultes E. (Erik)
Su A.I. (Andrew I.)
Tatum Z. (Zuotian)
Thompson M. (Mark)
Van Der Horst E. (Eelke)
Van Mulligen E.M. (Erik M.)
Van Ommen G.-J.B. (Gert-Jan B.)
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/02/2016

High-throughput experimental methods such as medical sequencing and genome-wide association studies (GWAS) identify increasingly large numbers of potential relations between genetic variants and diseases. Both biological complexity (millions of potential gene-disease associations) and the accelerating rate of data production necessitate computational approaches to prioritize and rationalize potential gene-disease relations. Here, we use concept profile technology to expose from the biomedical literature both explicitly stated gene-disease relations (the explicitome) and a much larger set of implied gene-disease associations (the implicitome). Implicit relations are largely unknown to, or are even unintended by the original authors, but they vastly extend the reach of existing

Erasmus University Digital Repository

Disentangling categorical relationships through a graph of co-occurrences

Author: Araujo Lourdes
Arenas Alex
Borge-Holthoefer Javier
Capitán José A.
Cuesta José A.
Martínez-Romo Juan
Publication venue: 'American Physical Society (APS)'
Publication date: 01/01/2011

The mesoscopic structure of complex networks has proven a powerful level of description to understand the linchpins of the system represented by the network. Nevertheless, the mapping of a series of relationships between elements, in terms of a graph, is sometimes not straightforward. Given that all the information we would extract using complex network tools depends on this initial graph, it is mandatory to preprocess the data to build it in the most accurate manner. Here we propose a procedure to build a network, attending only to statistically significant relations between constituents. We use a paradigmatic example of word associations to show the development of our approach. Analyzing the modular structure of the obtained network we are able to disentangle categorical relations, disambiguating words with success that is comparable to the best algorithms designed to the same end. We acknowledge financial support through Grant No. FIS2009-13364-C02-01, Holopedia (Grant No. TIN2010-21128-C02-01), MOSAICO (Grant No. FIS2006-01485), PRODIEVO (Grant No. FIS2011-22449), and Complexity-NET RESINEE, all of them from Ministerio de Educación y Ciencia in Spain, as well as support from Research Networks MODELICO-CM (Grant No. S2009/ESP-1691) and MA2VICMR (Grant No. S2009/TIC-1542) from Comunidad de Madrid, and Network 2009-SGR-838 from Generalitat de Catalunya.

Universidad Carlos III de Madrid e-Archivo

Proceedings of the 2008 MidSouth Computational Biology and Bioinformatics Society (MCBIOS) Conference

Author: A Churbanov
A Fujita
A Gyenesei
A Hijikata
A Rawat
A Shipra
AA Ptitsyn
AW Schreiber
B Roux
CA Bottoms
CB Giles
D Quest
D Sean
D Wilkins
Dawn Wilkins
ES Chen
G Gamberoni
H Hong
H Liu
H Meng
H Xu
HM Bovelstad
I Fishel
I Medina
James C Fuscoe
Jonathan D Wren
JS Yuan
JS Zielinski
JW Fan
K Thomson
L Guo
L Hertzberg
L Narlikar
L Shi
LK Schnackenberg
LL Elo
M Chae
M Landry
M Mete
M Pirooznia
MA Hibbs
MD Dyer
MF Burkart
MG Dozmorov
MK Das
N Mei
ND Mukhopadhyay
O Uzuner
P Li
P Minguez
QH Zhu
R Loganantharaj
RL Frank
RS Wang
S Gao
S Martin
S Sonnenburg
S Winters-Hilt
S Yuan
SB Montgomery
SM Bridges
Stephen Winters-Hilt
Susan Bridges
T Huan
T Lee
V Kulkarni
V Nagarajan
VI Torvik
WK Lim
WS Sanders
X Chen
Y Ding
Y Gusev
Y Huang
Y Lin
Yuriy Gusev
Z Su
Z Yu
Publication venue: BioMed Central
Publication date: 01/01/2008

Springer - Publisher Connector

Directory of Open Access Journals