
White Rose Research Online

Identifying Patients with Pneumonia from Free-Text Intensive Care Unit Reports

Author: Brad J Glavan
Fei Xia
Lucy Vanderwende
Mark M Wurfel
Meliha Yetisgen-Yildiz (melihay@u.washington.edu)
Publication venue
Publication date: 01/01/2011
Field of study

Abstract: Clinical research studying critical illness phenotypes relies on the identification of clinical syndromes defined by consensus definitions. Pneumonia is a prime example. Historically, identifying pneumonia has required manual chart review, which is a time- and resource-intensive process. The overall research goal of our work is to develop automated approaches that accurately identify critical illness phenotypes. In this paper, we describe our approach to the identification of pneumonia from electronic medical records, present our preliminary results, and describe future steps.

CiteSeerX

Exploring relation types for literature-based discovery

Author: Aronson
Bodenreider
Cohen
Cohen
de Marneffe
Fader
Gordon
Hearst
Hristovski
Hristovski
Hu
Hu
Judita Preiss
Kosto
Mark Stevenson
Petrič
Pratt
Rindesch
Robert Gaizauskas
Smalheiser
Smalheiser
Smalheiser
Smalheiser
Swanson
Swanson
Swanson
Swanson
Swanson
Swanson
Thaicharoen
Tsujii
Tsuruoka
Weeber
Weeber
Yetisgen-Yildiz
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2014
Field of study

Objective: Literature-based discovery (LBD) aims to identify “hidden knowledge” in the medical literature by: (1) analyzing documents to identify pairs of explicitly related concepts (terms), then (2) hypothesizing novel relations between pairs of unrelated concepts that are implicitly related via a shared concept to which both are explicitly related. Many LBD approaches use simple techniques to identify semantically weak relations between concepts, for example, document co-occurrence. These generate huge numbers of hypotheses, difficult for humans to assess. More complex techniques rely on linguistic analysis, for example, shallow parsing, to identify semantically stronger relations. Such approaches generate fewer hypotheses, but may miss hidden knowledge. The authors investigate this trade-off in detail, comparing techniques for identifying related concepts to discover which are most suitable for LBD. Materials and methods: A generic LBD system that can utilize a range of relation types was developed. Experiments were carried out comparing a number of techniques for identifying relations. Two approaches were used for evaluation: replication of existing discoveries and the “time slicing” approach. Results: Previous LBD discoveries could be replicated using relations based either on document co-occurrence or linguistic analysis. Using relations based on linguistic analysis generated many fewer hypotheses, but a significantly greater proportion of them were candidates for hidden knowledge. Discussion and Conclusion: The use of linguistic analysis-based relations improves accuracy of LBD without overly damaging coverage. LBD systems often generate huge numbers of hypotheses, which are infeasible to manually review. Improving their accuracy has the potential to make these systems significantly more usable.
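The two-step process described in this abstract can be illustrated with a minimal sketch. It assumes the weakest relation type mentioned, document co-occurrence, and a toy corpus invented for illustration; the function names `build_relations` and `hypothesize` are hypothetical and are not taken from the authors' system.

```python
from collections import defaultdict
from itertools import combinations


def build_relations(documents):
    """Step 1: treat two concepts as explicitly related if they share a document."""
    related = defaultdict(set)
    for concepts in documents:
        for a, b in combinations(sorted(set(concepts)), 2):
            related[a].add(b)
            related[b].add(a)
    return related


def hypothesize(related, source):
    """Step 2: propose targets that share a linking concept with the source
    but are not explicitly related to it. Returns {target: linking concepts}."""
    hypotheses = defaultdict(set)
    for linking in related[source]:
        for target in related[linking]:
            if target != source and target not in related[source]:
                hypotheses[target].add(linking)
    return dict(hypotheses)


# Toy corpus (invented): each document is reduced to the concepts it mentions.
documents = [
    ["fish oil", "blood viscosity"],
    ["blood viscosity", "Raynaud disease"],
    ["fish oil", "platelet aggregation"],
    ["platelet aggregation", "Raynaud disease"],
    ["fish oil", "omega-3"],
]

related = build_relations(documents)
result = hypothesize(related, "fish oil")
print(result)
```

On a realistic corpus, co-occurrence relations of this kind produce very large hypothesis sets, which is the trade-off the abstract discusses. A relation extractor based on linguistic analysis could replace `build_relations` without changing `hypothesize`.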

CiteSeerX

University of Salford Institutional Repository

White Rose Research Online

The effect of word sense disambiguation accuracy on literature based discovery

Author: AJ Jimeno-Yepes
AR Aronson
D West
DR Swanson
DR Swanson
E Agirre
E Agirre
H Liu
J Preiss
J Preiss
J Preiss
Judita Preiss
M Carpuat
M Carpuat
M Rimmer
M Sanderson
M Stevenson
M Weeber
M Weeber
M Weeber
M Yetisgen-Yildiz
Mark Stevenson
O Bodenreider
P Resnik
RN Kostoff
W Cheng
YS Chan
Z Zhong
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016
Field of study

Background: The volume of research published in the biomedical domain has increasingly led to researchers focussing on specific areas of interest and connections between findings being missed. Literature based discovery (LBD) attempts to address this problem by searching for previously unnoticed connections between published information (also known as “hidden knowledge”). A common approach is to identify hidden knowledge via shared linking terms. However, biomedical documents are highly ambiguous, which can lead LBD systems to overgenerate hidden knowledge by hypothesising connections through different meanings of linking terms. Word Sense Disambiguation (WSD) aims to resolve ambiguities in text by identifying the meaning of ambiguous terms. This study explores the effect of WSD accuracy on LBD performance. Methods: An existing LBD system is employed and four approaches to WSD of biomedical documents are integrated with it. The accuracy of each WSD approach is determined by comparing its output against a standard benchmark. Evaluation of the LBD output is carried out using a time-slicing approach, where hidden knowledge is generated from articles published prior to a certain cutoff date and a gold standard is extracted from publications after the cutoff date. Results: WSD accuracy varies depending on the approach used. The connection between the performance of the LBD and WSD systems is analysed to reveal a correlation between WSD accuracy and LBD performance. Conclusion: This study reveals that LBD performance is sensitive to WSD accuracy. It is therefore concluded that WSD has the potential to improve the output of LBD systems by reducing the amount of spurious hidden knowledge that is generated. It is also suggested that further improvements in WSD accuracy have the potential to improve LBD accuracy.
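The time-slicing evaluation described in this abstract can be sketched as follows. This is an illustrative reading, not the authors' implementation: it assumes the literature has already been reduced to dated concept pairs, and the function name `time_slice_eval`, the cutoff year and the data are all hypothetical.

```python
def time_slice_eval(dated_pairs, cutoff, hypotheses):
    """Score hypotheses against concept pairs that first appear after the cutoff.

    dated_pairs: iterable of (year, (concept_a, concept_b))
    hypotheses:  iterable of (concept_a, concept_b) proposed from pre-cutoff data
    Returns (precision, recall).
    """
    before = {frozenset(pair) for year, pair in dated_pairs if year <= cutoff}
    after = {frozenset(pair) for year, pair in dated_pairs if year > cutoff}
    gold = after - before  # connections that only become explicit later
    proposed = {frozenset(h) for h in hypotheses} - before
    hits = proposed & gold
    precision = len(hits) / len(proposed) if proposed else 0.0
    recall = len(hits) / len(gold) if gold else 0.0
    return precision, recall


# Invented example data.
dated_pairs = [
    (2000, ("A", "B")),
    (2001, ("B", "C")),
    (2005, ("A", "C")),  # new after the cutoff: part of the gold standard
    (2006, ("A", "D")),  # new after the cutoff: part of the gold standard
    (2006, ("A", "B")),  # already known before the cutoff: excluded
]
precision, recall = time_slice_eval(dated_pairs, 2002, [("A", "C"), ("C", "E")])
print(precision, recall)
```

Spurious hypotheses, such as those created by linking through different senses of an ambiguous term, lower the precision in a setup like this, which is how WSD accuracy can affect the LBD score.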

University of Salford Institutional Repository

White Rose Research Online

Table 4: Hamming loss, precision, accuracy, recall and F1-score for BoW and BoC varying the length of the training sequence in multi-labelled UVigoMED corpus.

Author: Aronson
Blei
Blizard
Bloehdorn
Bodenreider
Dai
Deerwester
Egozi
Elkin
Gabrilovich
Gabrilovich
Godbole
Harris
Hearst
Huang
Joachims
Jonquet
Kang
Kim
Landauer
Levelt
Lipscomb
Lowe
Medelyan
Milne
Pedregosa
Phan
Porter
Rigutini
Sahlgren
Sahlgren
Salton
Schapire
Sebastiani
Settles
Stock
Tsao
Tsoumakas
Täckström
Vivaldi
Wang
Wang
Yang
Yetisgen-Yildiz
Zhang
Zheng
Zhou
Zhou
Zhou
Publication venue: 'PeerJ'
Publication date
Field of study
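For readers unfamiliar with the multi-label metrics named in the Table 4 caption above, the sketch below computes them for 0/1 label vectors. The listing does not say which averaging the source uses, so micro-averaging for precision, recall and F1 and exact-match (subset) accuracy are assumptions here, and the function name `multilabel_scores` is hypothetical.

```python
def multilabel_scores(y_true, y_pred):
    """y_true, y_pred: non-empty lists of equal-length 0/1 label vectors."""
    tp = fp = fn = wrong = total = exact = 0
    for truth, pred in zip(y_true, y_pred):
        exact += truth == pred  # whole label set predicted correctly
        for t, p in zip(truth, pred):
            total += 1
            wrong += t != p
            tp += t == 1 and p == 1
            fp += t == 0 and p == 1
            fn += t == 1 and p == 0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "hamming_loss": wrong / total,  # fraction of label slots predicted wrongly
        "subset_accuracy": exact / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Two documents, three candidate labels each (invented data).
scores = multilabel_scores([[1, 0, 1], [0, 1, 0]], [[1, 0, 0], [0, 1, 1]])
print(scores)
```

Lower is better for Hamming loss, and higher is better for the other four scores.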

Semi-automated screening of biomedical citations for systematic reviews

Author: A Aronson
A Blum
A Cohen
A Wilcox
B Settles
B Wallace
Byron C Wallace
C Blake
C Cole
C Counsell
Carla Brodley
Chih-Chung
Christopher H Schmid
CJL Chih-Wei Hsu
D Chen
DD Lewis
E Perrin
F Camous
G Druck
G Schohn
H Kilicoglu
Joseph Lau
K Brinker
KS Goh
KS Jones
L Breiman
L Hunter
M Barza
M Chung
M Yetisgen-Yildiz
N Japkowicz
P Wheeler
P Zweigenbaum
S Dasgupta
S Ertekin
S Kotsiantis
S Tong
T Joachims
T Terasawa
Thomas A Trikalinos
VN Vapnik
W Yu
Y Aphinyanaphongs
YAC Aphinyanaphongs
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Improving protein function prediction methods with integrated literature data

Author: A Karimpour-Fard
A Vazquez
A Vinayagam
Aaron P Gabow
AK Ramani
B Schwikowski
BTF Alako
C Brun
C von Mering
Debra S Goldberg
E Nabieva
HW Mewes
I Xenarios
J Rual
K Tsuda
L Hunter
L Hunter
L Tanabe
Lawrence E Hunter
M Ashburner
M Aubry
M Chagoyen
M Huynen
M Krallinger
M Krallinger
M Pelligri
M Yetisgen-Yildiz
OG Troyanskaya
P Srinivasan
PM Bowers
R Cilibrasi
R Hoffmann
S Letovsky
S Raychaudhuri
Sonia M Leach
T Schlitt
T Tanabe
TK Jenssen
U Karaoz
William A Baumgartner
Y Ofran
Publication venue: BioMed Central
Publication date: 01/01/2008
Field of study

Abstract. Background: Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity. Results: We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably-sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes which global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10% which yield the best trade-off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial. Conclusion: Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily-available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit.

Public Library of Science (PLOS)

Integrated Bio-Entity Network: A System for Biological Knowledge Discovery

Author: A Ceol
A Chatr-aryamontri
A Coulet
A Grote
A Koike
A Mottaz
A Rzhetsky
A Yuryev
B Aranda
C Alfarano
C Blaschke
C Friedman
C Stark
CB Giles
CF Schaefer
D Barrell
D Hristovski
D Maglott
D Maglott
D Tikk
DR Swanson
EW Dijkstra
F Leitner
G Gonzalez
GR Mishra
H Liu
I Iossifov
I Vastrik
J Bjorne
JD Wren
Jinfeng Zhang
JO Korbel
Jun S. Liu
K Du
K Han
KD Pruitt
L Gong
L Salwinski
Lindsey Bell
LJ Jensen
LS Wong
M Ashburner
M Castagna
M Devignes
M Huang
M Kanehisa
M Krallinger
M Krallinger
M Kuhn
M Kuhn
M Yetisgen-Yildiz
MG Kann
N Daraselia
N Sierro
OL Griffith
P Pagel
P Shahi
P Srinivasan
QC Bui
QC Bui
R Apweiler
R Chowdhary
R Crnich
R Frijters
R Hoffmann
R Hoffmann
R Saetre
Rajesh Chowdhary
S Gama-Castro
S Mathivanan
S Naidu
S Yilmaz
T Beuming
TH Cormen
TS Keshava Prasad
V Matys
Xufeng Niu
Y Li
Y Wang
Ying Xu
Z Gao
Z Huang
Publication venue: Public Library of Science
Publication date: 27/06/2011
Field of study

A significant part of our biological knowledge is centered on relationships between biological entities (bio-entities) such as proteins, genes, small molecules, pathways, gene ontology (GO) terms and diseases. Information on bio-entity relationships accumulates at an increasing rate and is archived in different forms in scattered places. Most of it is buried in scientific literature as unstructured text. Organizing heterogeneous information in a structured form not only facilitates study of biological systems using integrative approaches, but also allows discovery of new knowledge in an automatic and systematic way. In this study, we performed a large-scale integration of bio-entity relationship information from both databases containing manually annotated, structured information and automatic information extraction from unstructured text in scientific literature. The relationship information we integrated in this study includes protein–protein interactions, protein/gene regulations, protein–small molecule interactions, protein–GO relationships, protein–pathway relationships, and pathway–disease relationships. The relationship information is organized in a graph data structure, named integrated bio-entity network (IBN), where the vertices are the bio-entities and edges represent their relationships. Under this framework, graph theoretic algorithms can be designed to perform various knowledge discovery tasks. We designed breadth-first search with pruning (BFSP) and most probable path (MPP) algorithms to automatically generate hypotheses, that is, the indirect relationships with high probabilities in the network. We show that IBN can be used to generate plausible hypotheses, which not only help to better understand the complex interactions in biological systems, but also provide guidance for experimental designs.
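The abstract does not give the details of the MPP algorithm, so the following is only a sketch of one standard way to find a most probable path. It assumes each edge carries an independent probability in (0, 1] and runs Dijkstra's algorithm on the negative log of those probabilities. The graph, the probabilities and the function name `most_probable_path` are invented for illustration.

```python
import heapq
import math


def most_probable_path(graph, source, target):
    """graph: {node: {neighbor: edge probability in (0, 1]}}.

    Returns (probability, path), or (0.0, []) if target is unreachable.
    """
    # Maximizing a product of probabilities equals minimizing the sum of -log(p).
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, prob in graph.get(node, {}).items():
            new_cost = cost - math.log(prob)
            if new_cost < best.get(neighbor, math.inf):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return 0.0, []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return math.exp(-best[target]), path


# Invented bio-entity network with made-up edge probabilities.
network = {
    "drug": {"proteinA": 0.9, "proteinB": 0.5},
    "proteinA": {"pathway": 0.8},
    "proteinB": {"pathway": 0.9},
    "pathway": {"disease": 0.7},
}
probability, path = most_probable_path(network, "drug", "disease")
print(probability, path)
```

In this toy network the indirect drug-to-disease relationship is a hypothesis, and the path through `proteinA` is preferred because the product of its edge probabilities is larger than that of the path through `proteinB`.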

Public Library of Science (PLOS)

Literature Mining for the Discovery of Hidden Connections between Drugs, Genes and Diseases

Author: AA Morgan
AC Nicholson
AJ Perez
Andrey Rzhetsky
AP Weetman
B Dell'Osso
B Rapoport
B Vaidya
BA Imhof
BT Alako
C Blaschke
C Nielsen
C Puozzo
CJ McDougle
CR Faltynek
D Chaussabel
D Denys
D Hristovski
D Olive
D Shao
DB Kell
DR Swanson
DR Swanson
E Yung
EC Butcher
EC Butcher
GR Hajer
H Kakeya
H Shatkay
HP Fischer
I Kola
J Han
J Kuhlmann
JA Wagner
Jacob de Vlieg
JD Wren
JD Wren
K Kajinami
K Miguita
K Njung'e
K Tomiyama
K Vandenborre
L Prokunina
LJ Jensen
M Briley
M Briley
M Campillos
M Hayashi
M Imoto
M Inazu
M Kamata
M Sugiyama
M Yetisgen-Yildiz
MA Andrade
MA Andrade
Marianne van Vugt
N Daraselia
NR Smalheiser
PD Pelton
PR Newby
R Frijters
R Frijters
R Frijters
R Homayouni
R Jelier
RA DiGiacomo
Raoul Frijters
René van Schaik
Ruben Smeets
RY Mukhtar
S Gordon
S Morikawa
S Raychaudhuri
S Raychaudhuri
SN Vaishnavi
SS Fuller
T Fawcett
T Hiramatsu
T Ito
T Shokawa
T Tabata
TK Jenssen
TT Ashburn
U Kaneyuki
WA Colburn
WK Goodman
Wynand Alkema
Y Ichimaru
Y Sugimoto
Y Tamori
Publication venue: Public Library of Science
Publication date: 01/01/2010
Field of study

The scientific literature represents a rich source for retrieval of knowledge on associations between biomedical concepts such as genes, diseases and cellular processes. A commonly used method to establish relationships between biomedical concepts from literature is co-occurrence. Apart from its use in knowledge retrieval, the co-occurrence method is also well suited to discovering new, hidden relationships between biomedical concepts following a simple ABC principle, in which A and C have no direct relationship, but are connected via shared B-intermediates. In this paper we describe CoPub Discovery, a tool that mines the literature for new relationships between biomedical concepts. Statistical analysis using ROC curves showed that CoPub Discovery performed well over a wide range of settings and keyword thesauri. We subsequently used CoPub Discovery to search for new relationships between genes, drugs, pathways and diseases. Several of the newly found relationships were validated using independent literature sources. In addition, new predicted relationships between compounds and cell proliferation were validated and confirmed experimentally in an in vitro cell proliferation assay. The results show that CoPub Discovery is able to identify novel associations between genes, drugs, pathways and diseases that have a high probability of being biologically valid. This makes CoPub Discovery a useful tool to unravel the mechanisms behind disease, to find novel drug targets, or to find novel applications for existing drugs.

Radboud Repository

Explicitly searching for useful inventions: dynamic relatedness and the costs of connecting versus synthesizing

Author: A Fishman
A Lemelin
A Nerkar
A Nerkar
A Verbeek
B Looy Van
BL Milman
C Baldwin
C Gay
C St. John
C Sternitzke
Chihmao Hsieh
CM Hsieh
D Hristovski
D Sahal
DR Swanson
DR Swanson
E Garfield
EJ Iversen
F Narin
F Narin
F Narin
FM Gollop
G Atallah
G Dosi
G Gavetti
H Small
H Small
HC Livesay
HW Park
I Cockburn
I Wartburg von
J Barney
J Callaert
J Dorroh
J Hausman
J Lanjouw
J Long
J.-P. S. Ruth
JA Schumpeter
K Brouthers
K Palepu
K Pavitt
K Pavitt
L Bornmann
L Fleming
L Fleming
L Lombardo
LC Ribeiro
M Acosta
M Albert
M Gary
M Liu
M Lubatkin
M Meyer
M Meyer
M Schilling
M Yetisgen-Yildiz
M-H Huang
MD Gordon
MH MacRoberts
MP Carpenter
MS Meyer
P Adler
P Collins
P Faucompré
P Thomas
PB Maurseth
R Amit
R Baron
R Baron
R Burt
R Burt
R Davis
R Levin
RA D’Aveni
RJW Tijssen
RK Lindsay
RP Merges
RP Rumelt
S Bhattacharya
S Brusoni
S Shane
S Tamada
S Thomke
S-J Wang
ScS Lo
SD Bass
SE Khilji
T Landauer
U Schmoch
U Schmoch
W Bijker
WB Arthur
X Tong
Y-G Lee
Y-G Lee
Y-H Cheng
Publication venue: Springer Netherlands
Publication date: 01/01/2010
Field of study

Inventions combine technological features. When features are barely related, burdensomely broad knowledge is required to identify the situations that they share. When features are overly related, burdensomely broad knowledge is required to identify the situations that distinguish them. Thus, according to my first hypothesis, when features are moderately related, the costs of connecting and costs of synthesizing are cumulatively minimized, and the most useful inventions emerge. I also hypothesize that continued experimentation with a specific set of features is likely to lead to the discovery of decreasingly useful inventions; the earlier-identified connections reflect the more common consumer situations. Covering data from all industries, the empirical analysis provides broad support for the first hypothesis. Regressions to test the second hypothesis are inconclusive when examining industry types individually. Yet, this study represents an exploratory investigation, and future research should test refined hypotheses with more sophisticated data, such as that found in literature-based discovery research