Search CORE

466 research outputs found

Biomedical Chinese-English CLIR Using an Extended CMeSH Resource to Expand Queries

Author: Ananiadou S
Thompson P
Wang X
Publication venue
Publication date: 01/01/2012
Field of study

The University of Manchester - Institutional Repository

Automatic categorization of diverse experimental information in the bioscience literature

Author: Brown Nick
Chen Wen
Davis Paul
Fang Ruihua
Fernandes Jolene
Gelbart William M.
Marygold Steven J.
Matthews Beverley
Millburn Gillian
Schindelman Gary
Sternberg Paul W.
Tuli Mary Ann
Van Auken Kimberly
Wang Xiaodong
Zhang Haiyan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012
Field of study

Background: Curation of information from bioscience literature into biological knowledge databases is a crucial way of capturing experimental information in a computable form. During the biocuration process, a critical first step is to identify from all published literature the papers that contain results for a specific data type the curator is interested in annotating. This step normally requires curators to manually examine many papers to ascertain which few contain information of interest and thus, is usually time consuming. We developed an automatic method for identifying papers containing these curation data types among a large pool of published scientific papers based on the machine learning method Support Vector Machine (SVM). This classification system is completely automatic and can be readily applied to diverse experimental data types. It has been in use in production for automatic categorization of 10 different experimental datatypes in the biocuration process at WormBase for the past two years and it is in the process of being adopted in the biocuration process at FlyBase and the Saccharomyces Genome Database (SGD). We anticipate that this method can be readily adopted by various databases in the biocuration community and thereby greatly reducing time spent on an otherwise laborious and demanding task. We also developed a simple, readily automated procedure to utilize training papers of similar data types from different bodies of literature such as C. elegans and D. melanogaster to identify papers with any of these data types for a single database. This approach has great significance because for some data types, especially those of low occurrence, a single corpus often does not have enough training papers to achieve satisfactory performance. Results: We successfully tested the method on ten data types from WormBase, fifteen data types from FlyBase and three data types from Mouse Genomics Informatics (MGI). It is being used in the curation work flow at WormBase for automatic association of newly published papers with ten data types including RNAi, antibody, phenotype, gene regulation, mutant allele sequence, gene expression, gene product interaction, overexpression phenotype, gene interaction, and gene structure correction. Conclusions: Our methods are applicable to a variety of data types with training set containing several hundreds to a few thousand documents. It is completely automatic and, thus can be readily incorporated to different workflow at different literature-based databases. We believe that the work presented here can contribute greatly to the tremendous task of automating the important yet labor-intensive biocuration effort

Crossref

Springer - Publisher Connector

Harvard University - DASH

Caltech Authors

Chinese-English Cross-Lingual Information Retrieval in Biomedicine Using Ontology-Based Query Expansion

Author: Wang Xinkai
Publication venue
Publication date: 01/08/2012
Field of study

The University of Manchester - Institutional Repository

Promoting ranking diversity for genomics search with relevance-novelty combined model

Author: Hu Xiaohua
Huang Jimmy Xiangji
Li Zhoujun
Yin Xiaoshi
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Springer - Publisher Connector

PubMed Central

Gene Ontology density estimation and discourse analysis for automatic GeneRiF extraction

Author: Ehrler Frédéric
Gobeill Julien
Mottaz Anaïs
Ruch Patrick
Tbahriti Imad
Veuthey Anne-Lise
Publication venue: BioMed Central
Publication date: 01/01/2008
Field of study

Abstract Background This paper describes and evaluates a sentence selection engine that extracts a GeneRiF (Gene Reference into Functions) as defined in ENTREZ-Gene based on a MEDLINE record. Inputs for this task include both a gene and a pointer to a MEDLINE reference. In the suggested approach we merge two independent sentence extraction strategies. The first proposed strategy (LASt) uses argumentative features, inspired by discourse-analysis models. The second extraction scheme (GOEx) uses an automatic text categorizer to estimate the density of Gene Ontology categories in every sentence; thus providing a full ranking of all possible candidate GeneRiFs. A combination of the two approaches is proposed, which also aims at reducing the size of the selected segment by filtering out non-content bearing rhetorical phrases. Results Based on the TREC-2003 Genomics collection for GeneRiF identification, the LASt extraction strategy is already competitive (52.78%). When used in a combined approach, the extraction task clearly shows improvement, achieving a Dice score of over 57% (+10%). Conclusions Argumentative representation levels and conceptual density estimation using Gene Ontology contents appear complementary for functional annotation in proteomics.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Archive ouverte UNIGE

Is searching full text more effective than searching abstracts?

Author: A Singhal
A Trotman
B Larsen
B Rafkind
B Sigurbjörnsson
C Clarke
CW Cleverdon
CW Gay
D Demner-Fushman
D Metzler
DK Harman
G Salton
G Salton
G Salton
H Shatkay
H Yu
H Yu
H Yu
I Tbahriti
J Dean
J Kamps
Jimmy Lin
JM Ponte
JP Callan
K Seki
K Sparck Jones
L Hunter
LA Barroso
M Kaszkiel
M Krallinger
M Lalmas
M Wang
MA Hearst
MJ Schuemie
P Ogilvie
P Zweigenbaum
P Zweigenbaum
PK Shah
R Wilkinson
S Brin
S Ghemawat
S Tellex
SE Robertson
SE Robertson
T Elsayed
WJ Wilbur
WR Hersh
X Liu
Z Kou
Publication venue: BioMed Central
Publication date: 01/01/2009
Field of study

Abstract Background With the growing availability of full-text articles online, scientists and other consumers of the life sciences literature now have the ability to go beyond searching bibliographic records (title, abstract, metadata) to directly access full-text content. Motivated by this emerging trend, I posed the following question: is searching full text more effective than searching abstracts? This question is answered by comparing text retrieval algorithms on MEDLINE® abstracts, full-text articles, and spans (paragraphs) within full-text articles using data from the TREC 2007 genomics track evaluation. Two retrieval models are examined: <it>bm25 </it>and the ranking algorithm implemented in the open-source Lucene search engine. Results Experiments show that treating an entire article as an indexing unit does not consistently yield higher effectiveness compared to abstract-only search. However, retrieval based on spans, or paragraphs-sized segments of full-text articles, consistently outperforms abstract-only search. Results suggest that highest overall effectiveness may be achieved by combining evidence from spans and full articles. Conclusion Users searching full text are more likely to find relevant articles than searching only abstracts. This finding affirms the value of full text collections for text retrieval and provides a starting point for future work in exploring algorithms that take advantage of rapidly-growing digital archives. Experimental results also highlight the need to develop distributed text retrieval algorithms, since full-text articles are significantly longer than abstracts and may require the computational resources of multiple machines in a cluster. The MapReduce programming model provides a convenient framework for organizing such computations.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digital Repository at the University of Maryland

Data-poor categorization and passage retrieval for Gene Ontology Annotation in Swiss-Prot

Author: Ehrler Frédéric
Geissbühler Antoine
Jimeno Antonio
Ruch Patrick
Publication venue: BioMed Central
Publication date: 01/01/2005
Field of study

Abstract Background In the context of the BioCreative competition, where training data were very sparse, we investigated two complementary tasks: 1) given a Swiss-Prot triplet, containing a protein, a GO (Gene Ontology) term and a relevant article, extraction of a short passage that justifies the GO category assignement; 2) given a Swiss-Prot pair, containing a protein and a relevant article, automatic assignement of a set of categories. Methods Sentence is the basic retrieval unit. Our classifier computes a distance between each sentence and the GO category provided with the Swiss-Prot entry. The Text Categorizer computes a distance between each GO term and the text of the article. Evaluations are reported both based on annotator judgements as established by the competition and based on mean average precision measures computed using a curated sample of Swiss-Prot. Results Our system achieved the best recall and precision combination both for passage retrieval and text categorization as evaluated by official evaluators. However, text categorization results were far below those in other data-poor text categorization experiments The top proposed term is relevant in less that 20% of cases, while categorization with other biomedical controlled vocabulary, such as the Medical Subject Headings, we achieved more than 90% precision. We also observe that the scoring methods used in our experiments, based on the retrieval status value of our engines, exhibits effective confidence estimation capabilities. Conclusion From a comparative perspective, the combination of retrieval and natural language processing methods we designed, achieved very competitive performances. Largely data-independent, our systems were no less effective that data-intensive approaches. These results suggests that the overall strategy could benefit a large class of information extraction tasks, especially when training data are missing. However, from a user perspective, results were disappointing. Further investigations are needed to design applicable end-user text mining tools for biologists.</p

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

University of Melbourne Institutional Repository

Biomedical Question Answering: A Survey of Approaches and Challenges

Author: Chen Mosha
Huang Songfang
Jin Qiao
Liu Xiaozhong
Tan Chuanqi
Xiong Guangzhi
Ying Huaiyuan
Yu Qianlan
Yu Sheng
Yuan Zheng
Publication venue
Publication date: 08/09/2021
Field of study

Automatic Question Answering (QA) has been successfully applied in various domains such as search engines and chatbots. Biomedical QA (BQA), as an emerging QA task, enables innovative applications to effectively perceive, access and understand complex biomedical knowledge. There have been tremendous developments of BQA in the past two decades, which we classify into 5 distinctive approaches: classic, information retrieval, machine reading comprehension, knowledge base and question entailment approaches. In this survey, we introduce available datasets and representative methods of each BQA approach in detail. Despite the developments, BQA systems are still immature and rarely used in real-life settings. We identify and characterize several key challenges in BQA that might lead to this issue, and discuss some potential future directions to explore.Comment: In submission to ACM Computing Survey

arXiv.org e-Print Archive