
21 research outputs found

Features generated for computational splice-site prediction correspond to functional elements

Author: A Goren
AJ McCullough
AL Blum
C Gooding
C Mathe
D Koller
G Kol
G Yeo
GE Crooks
H Liu
J Královicová
K Chua
K Han
KK Nelson
L Cartegni
L Mariño-Ramírez
Lise Getoor
LP Lim
LR Coulter
M Pertea
MB Stadler
ML Hastings
R Guigo
R Islamaj
R Islamaj Dogan
R Kohavi
R Singh
Rezarta Islamaj Dogan
S Degroeve
Stephen M Mount
T Zhang
W John Wilbur
WG Fairbrother
XH Zhang
Y Yang
Z Wang
ZM Zheng
Publication venue: BioMed Central
Publication date: 01/10/2007

Abstract. Background: Accurate selection of splice sites during the splicing of precursors to messenger RNA requires both relatively well-characterized signals at the splice sites and auxiliary signals in the adjacent exons and introns. We previously described a feature generation algorithm (FGA) that is capable of achieving high classification accuracy on human 3' splice sites. In this paper, we extend the splice-site prediction to 5' splice sites and explore the generated features for biologically meaningful splicing signals. Results: We present examples from the observed features that correspond to known signals, both core signals (including the branch site and pyrimidine tract) and auxiliary signals (including GGG triplets and exon splicing enhancers). We present evidence that features identified by FGA include splicing signals not found by other methods. Conclusion: Our generated features capture known biological signals in the expected sequence interval flanking splice sites. The method can be easily applied to other species and to similar classification problems, such as tissue-specific regulatory elements, polyadenylation sites, promoters, etc.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Digital Repository at the University of Maryland

Using Noun Phrases for Navigating Biomedical Literature on Pubmed: How Many Updates Are We Losing Track of?

Author: A Névéol
A Rzhetsky
Andrey Rzhetsky
BM Fonseca
C Jacquemin
C Manning
CD Manning
D Beeferman
D Rebholz-Schuhmann
D Shotton
D Srikrishna
D Trieschnigg
Devabhaktuni Srikrishna
DR Hunter
GF Cooper
J Evans
J Lin
JPA Ionnidis
M Muin
M Weeber
Marc A. Coram
MH MacRoberts
MJ Schuemie
N Tran
O Bodenreider
P Srinivasan
PL Elkin
Q He
Q Li
R Islamaj Dogan
R Schifanella
RA DiGiacomo
S Bird
T Rindflesch
T Wachter
V Sintchenko
W Kim
Y Huang
Z Lu
Z Sun
Publication venue: Public Library of Science
Publication date: 14/09/2011

Author-supplied citations are a fraction of the related literature for a paper. The “related citations” list on PubMed is typically dozens or hundreds of results long and does not offer hints about why these results are related. Using noun phrases derived from the sentences of the paper, we show it is possible to navigate more transparently to PubMed updates through search terms that can associate a paper with its citations. The algorithm to generate these search terms involved automatically extracting noun phrases from the paper using natural language processing tools, and ranking them by the number of occurrences in the paper compared to the number of occurrences on the web. We define search queries having at least one instance of overlap between the author-supplied citations of the paper and the top 20 search results as citation validated (CV). When the overlapping citations were written by the same authors as the paper itself, we define the query as CV-S; when they were written by different authors, we define it as CV-D. For a systematic sample of 883 papers on PubMed Central, at least one of the search terms for 86% of the papers is CV-D, versus 65% for the top 20 PubMed “related citations.” We hypothesize that these quantities, computed for the 20 million papers on PubMed, would differ from these percentages by no more than 5%. Averaged across all 883 papers, 5 search terms are CV-D, 10 search terms are CV-S, and 6 unique citations validate these searches. Potentially related literature uncovered by citation-validated searches (either CV-S or CV-D) is on the order of ten per paper, and many more if the remaining searches that are not citation-validated are taken into account. The significance and relationship of each search result to the paper can only be vetted and explained by a researcher with knowledge of or interest in that paper.
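The ranking and validation steps described in this abstract can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the function names, the ratio-style score with add-one smoothing, and the choice to return both labels when a query overlaps both kinds of citation are assumptions, since the abstract does not spell out those details.

```python
from typing import Dict, List, Sequence, Set


def rank_phrases(paper_counts: Dict[str, int], web_counts: Dict[str, int]) -> List[str]:
    """Order noun phrases so that those frequent in the paper but rare on the web come first.

    Assumed score: in-paper count divided by (web count + 1).
    """
    def score(phrase: str) -> float:
        return paper_counts[phrase] / (web_counts.get(phrase, 0) + 1)

    return sorted(paper_counts, key=score, reverse=True)


def cv_labels(results: Sequence[str], cited_same: Set[str], cited_diff: Set[str],
              k: int = 20) -> Set[str]:
    """Return the citation-validation labels for one search query.

    results:    ranked identifiers returned by the search
    cited_same: author-supplied citations written by the paper's own authors
    cited_diff: author-supplied citations written by other authors
    """
    top = set(results[:k])  # only the top-k results count toward validation
    labels: Set[str] = set()
    if top & cited_same:
        labels.add("CV-S")
    if top & cited_diff:
        labels.add("CV-D")
    return labels


# Hypothetical counts and identifiers, for illustration only.
ranked = rank_phrases({"splice site": 5, "the method": 5},
                      {"splice site": 9, "the method": 999})
labels = cv_labels(["a", "b", "c"], cited_same={"b"}, cited_diff={"z"})
```

An empty set returned by `cv_labels` corresponds to a search that is not citation-validated.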

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text

Author: A Abi-Haidar
A Ceol
A Chatr-aryamontri
A Cohen
A Kolchinsky
A Lourenco
A McCallum
A Ng
A Yeh
Alfonso Valencia
AM Cohen
Andrew Chatr-aryamontri
Andrew Winter
Ashish V Tendulkar
B Aranda
B Settles
BP Suomela
C Blaschke
C Elkan
C Stark
Charles Elkan
D Bauer
D Salgado
David Salgado
E Marcotte
F Ehrler
F Leitner
F Rinaldi
Fabio Rinaldi
Feifan Liu
Florian Leitner
G Andrew
Gerold Schneider
Gianni Cesareni
GL Poulter
Graciela Gonzalez
H Daumé III
H Hermjakob
H Shatkay
H Wang
Hagit Shatkay
HK Rekapalli
I Donaldson
J Lin
Jean-Fred Fontaine
JR Curran
Keith Noto
KG Dowell
L Tanabe
Leonardo Briganti
Livia Perfetto
Luana Licata
Luis Rocha
Luisa Castagnoli
M Hall
M Harris
M Hollander
M Krallinger
M Oberoi
Marta Iannuccelli
Martin Krallinger
Miguel A Andrade-Navarro
Miguel Vazquez
Mike Tyers
P Wang
R Chowdhary
R Hoffmann
Rafal Rak
Rezarta Islamaj Dogan
Robert Leaman
S Kim
S Matos
S Orchard
Sergio Matos
Shashank Agarwal
Sun Kim
T Kappeler
T Ono
T Zhang
W Baumgartner
W Hersh
W John Wilbur
W Wilbur
Xinglong Wang
Y Niu
Y Sasaki
Z Cao
Zhiyong Lu
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: Determining the usefulness of biomedical text mining systems requires realistic task definition and data selection criteria without artificial constraints, measuring performance aspects that go beyond traditional metrics. The BioCreative III Protein-Protein Interaction (PPI) tasks were motivated by such considerations, trying to address aspects including how the end user would oversee the generated output, for instance by providing ranked results, textual evidence for human interpretation or measuring time savings by using automated systems. Detecting articles describing complex biological events like PPIs was addressed in the Article Classification Task (ACT), where participants were asked to implement tools for detecting PPI-describing abstracts. For this purpose, the BCIII-ACT corpus was provided, which includes training, development and test sets of over 12,000 PPI-relevant and non-relevant PubMed abstracts, labeled manually by domain experts, together with the recorded human classification times. The Interaction Method Task (IMT) went beyond abstracts and required mining for associations between more than 3,500 full text articles and interaction detection method ontology concepts that had been applied to detect the PPIs reported in them.
RESULTS: A total of 11 teams participated in at least one of the two PPI tasks (10 in the ACT and 8 in the IMT) and a total of 62 people were involved either as participants or in preparing data sets and evaluating these tasks. Per task, each team was allowed to submit five runs offline and another five online via the BioCreative Meta-Server. From the 52 runs submitted for the ACT, the highest Matthews correlation coefficient (MCC) score measured was 0.55 at an accuracy of 89%, and the best AUC iP/R was 68%. Most ACT teams explored machine learning methods; some of them also used lexical resources like MeSH terms, PSI-MI concepts or particular lists of verbs and nouns, and some integrated NER approaches. For the IMT, a total of 42 runs were evaluated by comparing systems against manually generated annotations done by curators from the BioGRID and MINT databases. The highest AUC iP/R achieved by any run was 53%, and the best MCC score was 0.55. For competitive systems with an acceptable recall (above 35%), the macro-averaged precision ranged between 50% and 80%, with a maximum F-score of 55%.
CONCLUSIONS: The results of the ACT task of BioCreative III indicate that classification of large unbalanced article collections reflecting the real class imbalance is still challenging. Nevertheless, text-mining tools that report ranked lists of relevant articles for manual selection can potentially reduce the time needed to identify half of the relevant articles to less than 1/4 of the time when compared to unranked results. Detecting associations between full text articles and interaction detection method PSI-MI terms (IMT) is more difficult than might be anticipated. This is due to the variability of method term mentions, errors resulting from pre-processing of articles provided as PDF files, and the heterogeneity and different granularity of method term concepts encountered in the ontology. However, combining the sophisticated techniques developed by the participants with supporting evidence strings derived from the articles for human interpretation could result in practical modules for biological annotation workflows.
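Because the abstract reports MCC alongside accuracy on an unbalanced collection, a small worked example may help show why the two can diverge. The sketch below uses the standard MCC formula; the confusion-matrix counts are hypothetical and are not taken from the BioCreative III results, and returning 0.0 for an undefined denominator is a common convention rather than something the abstract specifies.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all items that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient computed from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when any marginal total is zero
    return (tp * tn - fp * fn) / denom


# Hypothetical unbalanced test set: 100 relevant and 900 non-relevant abstracts.
tp, fn = 50, 50    # half of the relevant abstracts are found
fp, tn = 50, 850   # a small share of non-relevant abstracts are flagged

acc = accuracy(tp, tn, fp, fn)   # 0.9
score = mcc(tp, tn, fp, fn)      # 4/9, roughly 0.44
```

With these made-up counts the classifier is 90% accurate yet reaches an MCC of only about 0.44, because accuracy is dominated by the large non-relevant class.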

Crossref

Springer - Publisher Connector

Monash University Research Portal

User intent behind medical queries

Author: Boyer C.
Dogan R. Islamaj
Meats E.
Palotti J.
Publication venue: Association for Computing Machinery (ACM)

Crossref

Understanding PubMed(R) user search behavior through log analysis

Author: A. Neveol
G. C. Murray
Madle
R. Islamaj Dogan
Roy
Z. Lu
Publication venue: Oxford University Press (OUP)

Crossref

Analyzing genomic data: understanding the genome

Author: Becker RA
Chang F
DeLisi C
Islamaj Dogan R
Kalow W
Karolchik D
Madupu R
Moore GE
Schrödinger E
Wilkinson MD
Publication venue: Wiley

Crossref

PubMed Phrases, an open set of coherent phrases for searching biomedical literature

Author: A Resnick
A Smith
CD Manning
DM Blei
HJ Larson
L Yeganova
M Gambhir
R Islamaj
R Islamaj Dogan
R Murphy
RA Baeza-Yates
S Kim
S Robertson
S. Kim
W Kim
WG Kim
Y Benjamini
Publication venue: Springer Science and Business Media LLC

Crossref

BioC: a minimalist approach to interoperability for biomedical text processing

Author: A. Valencia
C. H. Wu
D. C. Comeau
F. Leitner
F. Rinaldi
Hu
Johnson
K. B. Cohen
K. Verspoor
Liu
M. Krallinger
M. Torii
P. Ciccarese
R. Islamaj Dogan
Rebholz-Schuhmann
Rinaldi
Sohn
T. C. Wiegers
W. J. Wilbur
Wang
Wei
Y. Peng
Z. Lu
Publication venue: Oxford University Press (OUP)
Publication date: 01/01/2013

A vast amount of scientific information is encoded in natural language text, and the quantity of such text has become so great that it is no longer economically feasible to have a human as the first step in the search process. Natural language processing and text mining tools have become essential to facilitate the search for and extraction of information from text. This has led to vigorous research efforts to create useful tools and to create humanly labeled text corpora, which can be used to improve such tools. To encourage combining these efforts into larger, more powerful and more capable systems, a common interchange format to represent, store and exchange the data in a simple manner between different language processing systems and text mining tools is highly desirable. Here we propose a simple extensible mark-up language format to share text documents and annotations. The proposed annotation approach allows a large number of different annotations to be represented including sentences, tokens, parts of speech, named entities such as genes or diseases and relationships between named entities. In addition, we provide simple code to hold this data, read it from and write it back to extensible mark-up language files and perform some sample processing. We also describe completed as well as ongoing work to apply the approach in several directions. Code and data are available at http://bioc.sourceforge.net/. Database URL: http://bioc.sourceforge.net
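To make the idea of an XML interchange format for documents and annotations concrete, here is a minimal round-trip sketch using Python's standard `xml.etree.ElementTree`. The element and attribute names (`collection`, `document`, `passage`, `annotation`, `infon`, `location`) and both helper functions are illustrative assumptions; the abstract does not list the schema, so the code and DTD at the project URL above remain the authority on the actual format.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def build_collection(doc_id: str, text: str,
                     annotations: List[Tuple[int, int, str]]) -> ET.Element:
    """Wrap one passage of text and its (offset, length, type) annotations in XML.

    Element names are assumed for illustration, not taken from the abstract.
    """
    collection = ET.Element("collection")
    document = ET.SubElement(collection, "document")
    ET.SubElement(document, "id").text = doc_id
    passage = ET.SubElement(document, "passage")
    ET.SubElement(passage, "offset").text = "0"
    ET.SubElement(passage, "text").text = text
    for i, (start, length, label) in enumerate(annotations):
        ann = ET.SubElement(passage, "annotation", id=str(i))
        ET.SubElement(ann, "infon", key="type").text = label
        ET.SubElement(ann, "location", offset=str(start), length=str(length))
        ET.SubElement(ann, "text").text = text[start:start + length]
    return collection


def read_annotations(xml_string: str) -> List[Tuple[str, str]]:
    """Parse the XML back and return (type, covered text) for each annotation."""
    root = ET.fromstring(xml_string)
    return [(ann.find("infon").text, ann.find("text").text)
            for ann in root.iter("annotation")]


# Hypothetical example sentence with two named-entity annotations.
sentence = "BRCA1 is linked to breast cancer."
xml_string = ET.tostring(
    build_collection("doc1", sentence, [(0, 5, "gene"), (19, 13, "disease")]),
    encoding="unicode",
)
entities = read_annotations(xml_string)
```

Storing annotations as character offsets next to the unmodified passage text is what lets independent tools add their own layers (tokens, entities, relations) without rewriting each other's output.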

Crossref

PubMed Central

ZORA

University of Melbourne Institutional Repository

A TWO-STAGE EVOLUTIONARY APPROACH FOR EFFECTIVE CLASSIFICATION OF HYPERSENSITIVE DNA SEQUENCES

Author: AMARDA SHEHU
Fan R.-E.
Habib T.
Huang C.-L.
Islamaj-Dogan R.
KENNETH A. DE JONG
Koza J.
Leslie C.
Noble W. S.
Schölkopf B.
UDAY KAMATH
Vapnik V. N.
Publication venue: World Scientific Pub Co Pte Lt

Crossref

Better duplicate detection for systematic reviewers: evaluation of Systematic Review Assistant-Deduplication Module

Author: AW Chan
B Smith
BC Wallace
C Lefebvre
C Ramos-Remus
I Chalmers
J Kleijnen
JP Ioannidis
JP Rovers
M Sampson
MC Sievert
P Glasziou
P Royle
R Islamaj Dogan
T Odaka
X Qi
Y Jiang
Publication venue: Springer Science and Business Media LLC

Crossref