
15 research outputs found

The effect of word sense disambiguation accuracy on literature based discovery

Author: AJ Jimeno-Yepes
AR Aronson
D West
DR Swanson
DR Swanson
E Agirre
E Agirre
H Liu
J Preiss
J Preiss
J Preiss
Judita Preiss
M Carpuat
M Carpuat
M Rimmer
M Sanderson
M Stevenson
M Weeber
M Weeber
M Weeber
M Yetisgen-Yildiz
Mark Stevenson
O Bodenreider
P Resnik
RN Kostoff
W Cheng
YS Chan
Z Zhong
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

Background: The volume of research published in the biomedical domain has increasingly led researchers to focus on specific areas of interest, so that connections between findings are missed. Literature based discovery (LBD) attempts to address this problem by searching for previously unnoticed connections between published information (also known as “hidden knowledge”). A common approach is to identify hidden knowledge via shared linking terms. However, biomedical documents are highly ambiguous, which can lead LBD systems to overgenerate hidden knowledge by hypothesising connections through different meanings of linking terms. Word Sense Disambiguation (WSD) aims to resolve ambiguities in text by identifying the meaning of ambiguous terms. This study explores the effect of WSD accuracy on LBD performance. Methods: An existing LBD system is employed and four approaches to WSD of biomedical documents are integrated with it. The accuracy of each WSD approach is determined by comparing its output against a standard benchmark. Evaluation of the LBD output is carried out using a timeslicing approach, where hidden knowledge is generated from articles published prior to a certain cutoff date and a gold standard is extracted from publications after the cutoff date. Results: WSD accuracy varies depending on the approach used. The connection between the performance of the LBD and WSD systems is analysed to reveal a correlation between WSD accuracy and LBD performance. Conclusion: This study reveals that LBD performance is sensitive to WSD accuracy. It is therefore concluded that WSD has the potential to improve the output of LBD systems by reducing the amount of spurious hidden knowledge that is generated. It is also suggested that further improvements in WSD accuracy have the potential to improve LBD accuracy.
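The timeslicing evaluation described in this abstract can be illustrated with a minimal sketch. It assumes a toy co-occurrence model in which each document is reduced to a list of terms and a "linking term" is any term shared by two others. It is not the LBD system used in the study, and the function names (`cooccurring_pairs`, `hidden_knowledge`, `timeslice_evaluate`) and the placeholder terms are hypothetical.

```python
from itertools import combinations


def cooccurring_pairs(documents):
    """Return the unordered term pairs that co-occur in at least one document."""
    pairs = set()
    for terms in documents:
        for a, b in combinations(sorted(set(terms)), 2):
            pairs.add((a, b))
    return pairs


def hidden_knowledge(documents):
    """Propose pairs (a, c) that share a linking term but never co-occur directly."""
    direct = cooccurring_pairs(documents)
    neighbours = {}
    for a, b in direct:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    candidates = set()
    for linked in neighbours.values():
        # Any two terms linked through the same term form a candidate connection.
        for a, c in combinations(sorted(linked), 2):
            if (a, c) not in direct:
                candidates.add((a, c))
    return candidates


def timeslice_evaluate(dated_documents, cutoff_year):
    """Generate candidates before the cutoff and score them against later pairs."""
    before = [terms for year, terms in dated_documents if year < cutoff_year]
    after = [terms for year, terms in dated_documents if year >= cutoff_year]
    candidates = hidden_knowledge(before)
    # Gold standard: pairs that first co-occur after the cutoff date.
    gold = cooccurring_pairs(after) - cooccurring_pairs(before)
    hits = candidates & gold
    precision = len(hits) / len(candidates) if candidates else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# Placeholder corpus (illustrative only): (publication year, terms in document).
corpus = [
    (1985, ["term_a", "term_b"]),
    (1985, ["term_b", "term_c"]),
    (1985, ["term_a", "term_d"]),
    (1990, ["term_a", "term_c"]),
]
print(timeslice_evaluate(corpus, cutoff_year=1988))
```

In this toy setting, an ambiguous linking term behaves like two distinct terms merged into one, which adds spurious candidates and lowers precision. That is the overgeneration effect that WSD is meant to reduce.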

University of Salford Institutional Repository

Crossref

Springer - Publisher Connector

PubMed Central

White Rose Research Online

Identification of highly related references about gene-disease association

Author: A Faro
A Veloso
A Özgür
AJ Jimeno-Yepes
C Baral
C Perez-Iratxeta
Chia-Chun Shih
CO Tudor
D Hristovski
D Kwon
G Gonzalez
H-W Chun
I Amini
J Kim
J Xu
J Zhao
J-Y Yeh
KA Gray
L Zhang
N Tiffin
N Tiffin
R Frijters
R Rak
R-L Liu
R-L Liu
Rey-Long Liu
S Gerani
S Kim
SE Robertson
ST Ahmed
T Joachims
T Tao
T-Y Liu
TC Wiegers
WA Cheung
Y Hu
Z Cao
Z Lu
Z Lu
Publication venue: 'Springer Science and Business Media LLC'

Crossref

GeneRIF indexing: sentence selection based on machine learning

Author: Aronson AR
Jimeno-Yepes AJ
Mork JG
Sticco JC
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 31/05/2013

BACKGROUND: A Gene Reference Into Function (GeneRIF) describes novel functionality of genes. GeneRIFs are available from the National Center for Biotechnology Information (NCBI) Gene database. GeneRIF indexing is performed manually, and the intention of our work is to provide methods to support creating the GeneRIF entries. The creation of GeneRIF entries involves the identification of the genes mentioned in MEDLINE® citations and the sentences describing a novel function. RESULTS: We have compared several learning algorithms and several features extracted or derived from MEDLINE sentences to determine whether a sentence should be selected for GeneRIF indexing. Features are derived from the sentences themselves or by using mechanisms that augment the information they provide, for example assigning a discourse label using a previously trained model. We show that machine learning approaches with specific feature combinations achieve results close to one of the annotators. We have evaluated different feature sets and learning algorithms. In particular, Naïve Bayes achieves better performance with a selection of features similar to one used in related work, which considers the location of the sentence, the discourse of the sentence and the functional terminology in it. CONCLUSIONS: The current performance is at a level similar to human annotation and it shows that machine learning can be used to automate the task of sentence selection for GeneRIF annotation. The current experiments are limited to the human species. We would like to see how the methodology can be extended to other species, specifically the normalization of gene mentions in other species.

University of Melbourne Institutional Repository

Studying the correlation between different word sense disambiguation methods and summarization effectiveness in biomedical texts

Author: Aronson AR
Diaz A
Jimeno-Yepes AJ
Plaza L
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 26/08/2011

BACKGROUND: Word sense disambiguation (WSD) attempts to solve lexical ambiguities by identifying the correct meaning of a word based on its context. WSD has been demonstrated to be an important step in knowledge-based approaches to automatic summarization. However, the correlation between the accuracy of the WSD methods and the summarization performance has never been studied. RESULTS: We present three existing knowledge-based WSD approaches and a graph-based summarizer. Both the WSD approaches and the summarizer employ the Unified Medical Language System (UMLS) Metathesaurus as the knowledge source. We first evaluate WSD directly, by comparing the prediction of the WSD methods to two reference sets: the NLM WSD dataset and the MSH WSD collection. We next apply the different WSD methods as part of the summarizer, to map documents onto concepts in the UMLS Metathesaurus, and evaluate the summaries that are generated. The results obtained by the different methods in both evaluations are studied and compared. CONCLUSIONS: It has been found that the use of WSD techniques has a positive impact on the results of our graph-based summarizer, and that, when both the WSD and summarization tasks are assessed over large and homogeneous evaluation collections, there exists a correlation between the overall results of the WSD and summarization tasks. Furthermore, the best WSD algorithm in the first task also tends to be the best one in the second. However, we also found that the improvement achieved by the summarizer is not directly correlated with the WSD performance. The most likely reason is that the errors in disambiguation are not equally important but depend on the relative salience of the different concepts in the document to be summarized.

University of Melbourne Institutional Repository

MeSH indexing based on automatically generated summaries

Author: Aronson AR
Diaz A
Jimeno-Yepes AJ
Mork JG
Plaza L
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 26/06/2013

BACKGROUND: MEDLINE citations are manually indexed at the U.S. National Library of Medicine (NLM) using as reference the Medical Subject Headings (MeSH) controlled vocabulary. For this task, the human indexers read the full text of the article. Due to the growth of MEDLINE, the NLM Indexing Initiative explores indexing methodologies that can support the task of the indexers. Medical Text Indexer (MTI) is a tool developed by the NLM Indexing Initiative to provide MeSH indexing recommendations to indexers. Currently, the input to MTI is MEDLINE citations, title and abstract only. Previous work has shown that using full text as input to MTI increases recall, but decreases precision sharply. We propose using summaries generated automatically from the full text for the input to MTI to use in the task of suggesting MeSH headings to indexers. Summaries distill the most salient information from the full text, which might increase the coverage of automatic indexing approaches based on MEDLINE. We hypothesize that if the results were good enough, manual indexers could possibly use automatic summaries instead of the full texts, along with the recommendations of MTI, to speed up the process while maintaining high quality of indexing results. RESULTS: We have generated summaries of different lengths using two different summarizers, and evaluated the MTI indexing on the summaries using different algorithms: MTI, individual MTI components, and machine learning. The results are compared to those of full text articles and MEDLINE citations. Our results show that automatically generated summaries achieve similar recall but higher precision compared to full text articles. Compared to MEDLINE citations, summaries achieve higher recall but lower precision. CONCLUSIONS: Our results show that automatic summaries produce better indexing than full text articles. Summaries produce similar recall to full text but much better precision, which seems to indicate that automatic summaries can efficiently capture the most important contents within the original articles. The combination of MEDLINE citations and automatically generated summaries could improve the recommendations suggested by MTI. On the other hand, indexing performance might be dependent on the MeSH heading being indexed. Summarization techniques could thus be considered as a feature selection algorithm that might have to be tuned individually for each MeSH heading.
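The precision and recall comparison reported in this abstract can be made concrete with a small sketch. It assumes that indexing recommendations and the indexers' reference headings are both plain sets of heading strings. The function name `precision_recall` and the example heading sets are hypothetical and purely illustrative; they are not MTI output or results from the paper.

```python
def precision_recall(recommended, reference):
    """Score a set of recommended headings against the manually assigned ones."""
    recommended, reference = set(recommended), set(reference)
    true_positives = len(recommended & reference)
    precision = true_positives / len(recommended) if recommended else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Made-up headings for one article (illustrative only).
reference = {"Humans", "Neoplasms", "Mutation", "Signal Transduction"}
from_summary = {"Humans", "Neoplasms", "Mutation", "Apoptosis"}
from_full_text = {"Humans", "Neoplasms", "Mutation", "Apoptosis", "Mice", "Cell Line"}

print("summary:  ", precision_recall(from_summary, reference))
print("full text:", precision_recall(from_full_text, reference))
```

With these made-up sets, the longer input recovers the same correct headings but adds extra incorrect ones, so recall is unchanged while precision drops. This is the pattern the abstract describes for full text versus summaries.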

University of Melbourne Institutional Repository

Biomedical word sense disambiguation with bidirectional long short-term memory and attention-based neural networks

Author: A Duque
A Graves
AJ Jimeno-Yepes
AJ Yepes
Canlin Zhang
Daniel Biś
F Gers
Guergana K. Savova
H Liu
H Liu
H Xu
H Yu
L Bottou
R Navigli
S Festag
S Hochreiter
Xiuwen Liu
Y Wang
Zhe He
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Passage-Based Bibliographic Coupling: An Inter-Article Similarity Measure for Biomedical Articles

Author: A Elkiss
A Ritchie
AJ Jimeno-Yepes
B Aljaber
CO Tudor
EV Bernstam
F Janssens
F Janssens
FM Ortuño
H Small
HD White
HG Small
J Lin
KG Becker
KW Boyack
KW Boyack
KW Boyack
MJ Schuemie
MM Kessler
P Glenisson
R.-L. Liu
Rey-Long Liu
S Liu
TC Wiegers
Wolfgang Glanzel
X Liu
X Liu
Z Lu
Publication venue: 'Public Library of Science (PLoS)'

Crossref

Text mining to support gene ontology curation and vice versa

Author: A Doms
A Singhal
A Yeh
AA Morgan
AJ Jimeno-Yepes
AL Veuthey
BP Anton
C Blaschke
CJ Mungall
CL Mills
D Ferrucci
D Rebholz-Schuhmann
E Pasche
EB Camon
EC Dimmer
F Ehrler
F Sebastiani
J Burger
J Lin
L Bell
L Hirschman
L Perfetto
L Smith
M Lupu
MA Bauer
P Hainaut
P Ruch
PD Lena
S Abdou
SIB Swiss Institute of Bioinformatics Members
W Hersh
WS Campbell
YL Yip
Z Zeng
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 24/11/2016

In this chapter, we explain how text mining can support the curation of molecular biology databases dealing with protein functions. We also show how curated data can play a disruptive role in the development of text mining methods. We review a decade of efforts to improve the automatic assignment of Gene Ontology (GO) descriptors, the reference ontology for the characterization of genes and gene products. To illustrate the high potential of this approach, we compare the performances of an automatic text categorizer and show a large improvement of +225% in both precision and recall on benchmarked data. We argue that automatic text categorization functions can ultimately be embedded into a Question-Answering (QA) system to answer questions related to protein functions. Because GO descriptors can be relatively long and specific, traditional QA systems cannot answer such questions. A new type of QA system, so-called Deep QA, which uses machine learning methods trained with curated contents, is thus emerging. Finally, future advances of text mining instruments are directly dependent on the availability of high-quality annotated contents at every curation step. Database workflows must start explicitly recording all the data they curate and ideally also some of the data they do not curate.

Crossref

Hes-so: ArODES Open Archive (University of Applied Sciences and Arts Western Switzerland / Haute école spécialisée de Suisse occidentale / FH Westschweiz)

Springer - Publisher Connector

MeSH Now: automatic MeSH indexing at PubMed scale via learning to rank

Author: A Jimeno-Yepes
A Mottaz
A Névéol
A Névéol
A Névéol
AJ Jimeno-Yepes
AR Aronson
BL Humphreys
C Perez-Iratxeta
C Quoc
D Metzler
D Trieschnigg
DR Masys
G Tsatsaronis
G Tsoumakas
I Partalas
J Lin
JG Mork
JG Mork
JH Friedman
JL D’Souza
JP DeShazo
K Auken Van
K Liu
K Liu
KW Boyack
LD Gruppen
M Huang
MA Sartor
ME Funk
ME Ruiz
MR Tennant
NR Smalheiser
P Ruch
PF Brown
PJ Huber
Q Wu
R Islamaj Dogan
S Bhattacharya
S Peng
S Sohn
S Zhu
SC Burrows
SD Jani
T Nakazato
T Ono
T-Y Liu
VI Torvik
W Liu
WA Cheung
WJWK Wilbur
Y Mao
Y Mao
Z Lu
Z Lu
Publication venue: 'Springer Science and Business Media LLC'

Crossref

miRiaD: A Text Mining Tool for Detecting Associations of microRNAs with Diseases

Author: AJ Jimeno-Yepes
B Xie
BY Jiang
C Blenkiron
C Burges
C Gong
C Tudor
CG Chapman
CH Wei
EW Myers
H Dweep
H Naeem
IS Vlachos
J Moura
K Järvelin
KJ Rayner
M Gori
MR Fabian
Q Jiang
Q Wang
R Leaman
S Greco
S Maciotta
SD Hsu
T Colangelo
T Li
X Xu
X Yu
Y Li
Y Peng
Y Peng
Y Xu
Publication venue: 'Springer Science and Business Media LLC'

Crossref