
53 research outputs found

Is Bagging Effective in the Classification of Small-Sample Genomic and Proteomic Data?

Author: Braga-Neto UM
Vu TT
Publication venue: BioMed Central
Publication date: 01/01/2009

There has been considerable interest recently in the application of bagging to the classification of both gene-expression data and protein-abundance mass spectrometry data. The approach is often justified by the improvement it produces in the performance of unstable, overfitting classification rules under small-sample situations. However, the question of real practical interest is whether the ensemble scheme will improve the performance of those classifiers sufficiently to beat the performance of single stable, nonoverfitting classifiers in the case of small-sample genomic and proteomic data sets. To investigate that question, we conducted a detailed empirical study using publicly available data sets from published genomic and proteomic studies. We observed that, under t-test and RELIEF filter-based feature selection, bagging generally does a good job of improving the performance of unstable, overfitting classifiers, such as CART decision trees and neural networks, but that the improvement was not sufficient to beat the performance of single stable, nonoverfitting classifiers, such as diagonal and plain linear discriminant analysis, or 3-nearest neighbors. Furthermore, as expected, the ensemble method did not improve the performance of these stable classifiers significantly. Representative experimental results are presented and discussed in this work.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Reliable Classifier to Differentiate Primary and Secondary Acute Dengue Infection Based on IgG ELISA

Author: A Igarashi
AV Vorndam
DH Clarke
DJ Gubler
DJ Gubler
DM Morens
DS Burke
DW Vaughn
E Chungue
Ernesto T. A. Marques
G Kuno
JG Rigau-Perez
Lisa F. P. Ng
M Cordeiro
Marli Tenório Cordeiro
MG Guzman
MP Miagostovich
Rita Maria Ribeiro Nogueira
RS Lanciotti
S Matheus
S Schilling
SB Halstead
SB Halstead
Ulisses Braga-Neto
UM Braga-Neto
UM Braga-Neto
Publication venue: Public Library of Science
Publication date: 01/01/2009

Dengue virus infection causes a wide spectrum of illness, ranging from sub-clinical to severe disease. Severe dengue is associated with sequential viral infections. A strict definition of primary versus secondary dengue infections requires a combination of several tests performed at different stages of the disease, which is not practical. We developed a simple method to classify dengue infections as primary or secondary based on the levels of dengue-specific IgG. A group of 109 dengue infection patients were classified as having primary or secondary dengue infection on the basis of a strict combination of results from assays of antigen-specific IgM and IgG, isolation of virus, and detection of the viral genome by PCR tests performed on multiple samples collected from each patient over a period of 30 days. The dengue-specific IgG levels of all samples from 59 of the patients were analyzed by linear discriminant analysis (LDA), and one- and two-dimensional classifiers were designed. The one-dimensional classifier was estimated by bolstered resubstitution error estimation to have 75.1% sensitivity and 92.5% specificity. The two-dimensional classifier was designed by also taking into consideration the number of days after the onset of symptoms, with an estimated sensitivity and specificity of 91.64% and 92.46%. The performance of the two-dimensional classifier was validated using an independent test set of standard samples from the remaining 50 patients. The classifications of the independent set of samples determined by the two-dimensional classifier were further validated by comparison with two other dengue classification methods: the hemagglutination inhibition (HI) assay and an in-house anti-dengue IgG-capture ELISA method.
The decisions made with the two-dimensional classifier were in 100% accordance with the HI assay and 96% with the in-house ELISA. Once acute dengue infection has been determined, a 2-D classifier based on common dengue virus IgG kits can reliably distinguish primary and secondary dengue infections. Software for calculation and validation of the 2-D classifier is made available for download.

Public Library of Science (PLOS)

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Directory of Open Access Journals

PubMed Central

Texas A&M Repository

On optimal Bayesian classification and risk estimation under multiple classes

Author: A Zollanvari
B Efron
B Efron
B Efron
B Hanczar
B Hanczar
BE Boser
C Cortes
C-C Chang
CM Bishop
ER Dougherty
H Xu
H Xu
JM Knight
L Devroye
LA Dalton
LA Dalton
LA Dalton
LA Dalton
LA Dalton
LA Dalton
LA Dalton
Lori A. Dalton
MJ van de Vijver
Mohammadmahdi R. Yousefi
MR Yousefi
MR Yousefi
MR Yousefi
MS Esfahani
NL Johnson
S Kotz
UM Braga-Neto
UM Braga-Neto
Publication venue: Springer Science and Business Media LLC

Crossref

Small-Sample Error Estimation for Bagged Classification Rules

Author: A Assareh
A Bhattacharjee
A Statnikov
B Efron
B Efron
B Efron
B Wu
B Zhang
B-L Adam
EC Gunther
G Izmirlian
G Martínez-Muñoz
HJ Issaq
L Breiman
L Breiman
L Xu
LJ Van't Veer
MJ van de Vijver
P Geurts
R Díaz-Uriarte
RE Banfield
RE Schapire
RO Duda
S Alvarez
T Bylander
TT Vu
U Braga-Neto
U Braga-Neto
U Braga-Neto
UM Braga-Neto
W Tong
Y Freund
Publication venue: Hindawi Limited
Publication date: 01/01/2010

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Texas A&M Repository

Using gene expression profiles from peripheral blood to identify asymptomatic responses to acute respiratory viral infections

Author: A Grishin
A Statnikov
AK Zaas
Alexander Statnikov
CF Aliferis
CF Aliferis
CF Aliferis
CF Wright
Constantin F Aliferis
GY Chen
J Dresios
Jörn-Hendrik Weitkamp
KA Carlson
Lauren McVoy
Nikita I Lytkin
O Kepp
O Ramilo
RJ Schneider
RR Novoa
T Ohman
UM Braga-Neto
VN Vapnik
Z Yang
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: A recent study reported that gene expression profiles from peripheral blood samples of healthy subjects prior to viral inoculation were indistinguishable from profiles of subjects who received viral challenge but remained asymptomatic and uninfected. If true, this implies that the host immune response does not have a molecular signature. Given the high sensitivity of microarray technology, we were intrigued by this result and hypothesized that it was an artifact of data analysis. Findings: Using acute respiratory viral challenge microarray data, we developed a molecular signature that for the first time allowed for an accurate differentiation between uninfected subjects prior to viral inoculation and subjects who remained asymptomatic after the viral challenge. Conclusions: Our findings suggest that molecular signatures can be used to characterize immune responses to viruses and may improve our understanding of susceptibility to viral infection, with possible implications for vaccine development.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Stratification bias in low signal microarray studies

Author: A Dupuy
A Molinaro
AP Bradley
Brian J Parker
C Ambroise
D Berrar
D Hand
F Provost
F Provost
F Provost
IH Witten
J Hanley
J Platt
J Swets
J Swets
Justin Bedo
L van 't Veer
P Flach
R Duda
R Kohavi
R Simon
S Dudoit
S Keerthi
S Varma
SG Baker
Simon Günter
T Dietterich
T Fawcett
T Hastie
T Sing
UM Braga-Neto
Publication venue: BioMed Central
Publication date: 01/01/2007

Crossref

Springer - Publisher Connector

PubMed Central

Expanding the Understanding of Biases in Development of Clinical-Grade Molecular Signatures: A Case Study in Acute Respiratory Viral Infections

Author: A Rangarajan
A Statnikov
A Statnikov
A Statnikov
A Statnikov
A Statnikov
AK Zaas
AK Zaas
Alexander Statnikov
AM Glas
C Ambroise
CF Aliferis
CF Aliferis
CF Aliferis
Constantin F. Aliferis
EE Ntzani
ER DeLong
F Azuaje
FJ Gonzalez
GG Jackson
I Guyon
I Guyon
I Tsamardinos
J Pearl
J Pearl
JA Sparano
JT Leek
Jörn-Hendrik Weitkamp
KA Baggerly
Lauren McVoy
LM Cope
Nikita I. Lytkin
O Ramilo
R Kohavi
R Simon
RA Irizarry
RA Irizarry
RL Somorjai
TW Anderson
UM Braga-Neto
Vladimir Brusic
VN Vapnik
WE Johnson
Y Benjamini
Y Benjamini
Z Liu
Publication venue: Public Library of Science
Publication date: 01/06/2011

The promise of modern personalized medicine is to use molecular and clinical information to better diagnose, manage, and treat disease on an individual patient basis. These functions are predominantly enabled by molecular signatures, which are computational models for predicting phenotypes and other responses of interest from high-throughput assay data. Data analytics is a central component of molecular signature development and can jeopardize the entire process if conducted incorrectly. While exploratory data analysis may tolerate suboptimal protocols, clinical-grade molecular signatures are subject to vastly stricter requirements. Closing the gap between standards for exploratory versus clinically successful molecular signatures entails a thorough understanding of possible biases in the data analysis phase and developing strategies to avoid them. Using a recently introduced data-analytic protocol as a case study, we provide an in-depth examination of the poorly studied biases of data-analytic protocols related to signature multiplicity, biomarker redundancy, data preprocessing, and validation of signature reproducibility. The methodology and results presented in this work are aimed at expanding the understanding of these data-analytic biases that affect development of clinically robust molecular signatures. Several recommendations follow from the current study. First, all molecular signatures of a phenotype should be extracted to the extent possible, in order to provide comprehensive and accurate grounds for understanding disease pathogenesis. Second, redundant genes should generally be removed from final signatures to facilitate reproducibility and decrease manufacturing costs. Third, data preprocessing procedures should be designed so as not to bias biomarker selection. Finally, molecular signatures developed and applied on different phenotypes and populations of patients should be treated with great caution.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Factors Influencing the Statistical Power of Complex Data Analysis Protocols for Molecular Signature Development from Microarray Data

Author: A Bhattacharjee
A Butte
A Dupuy
A Potti
A Rosenwald
A Statnikov
A Statnikov
A Statnikov
Alexander Statnikov
AM Glas
B Freidlin
Bryan E. Shepherd
CF Aliferis
Constantin F. Aliferis
CX Ling
DG Beer
DJ Hand
EJ Yeoh
EL Lehmann
FE Harrell Jr
Frank E. Harrell
G Casella
Ioannis Tsamardinos
JA Sparano
Jonathan S. Schildcrout
JP Ioannidis
KK Dobbin
KK Dobbin
L Ein-Dor
L Shi
LA Habel
LJ van't Veer
M Saerens
MD Radmacher
ME Burczynski
MJ Marton
ML Lee
N Iizuka
P Baldi
PI Good
R Kohavi
R Simon
RE Fan
S Michiels
S Mukherjee
S Paik
S Paik
S Ramaswamy
SL Pomeroy
T Bammler
T Hastie
TR Golub
TS Furey
UM Braga-Neto
Vladimir B. Bajic
VN Vapnik
W Jiang
Publication venue: Public Library of Science
Publication date: 17/03/2009

Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development. We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator, and event balancing) have large and compounding effects on statistical power. The effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer outcome prediction datasets and supplementary simulations, and by contrasting them with prior analyses of the same data. The findings of the present study have two important practical implications. First, high-throughput studies, by avoiding under-powered data analysis protocols, can achieve substantial economies in the sample required to demonstrate statistical significance of predictive signal. Factors that affect power are identified and studied. Much less sample than previously thought may be sufficient for exploratory studies as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Re-Annotation Is an Essential Step in Systems Biology Modeling of Functional Genomics Data

Author: A Harel
A Hutloff
AM Schnoes
Bart H. J. van den Berg
BH van den Berg
C Smith
CA Ouzounis
CE Jones
CE Rudd
CH Wu
D Barrell
D Devos
D Kemmer
DA Benson
DP Wall
E Eyras
E Quevillon
F Meurens
Fiona M. McCarthy
FM McCarthy
G Moreno-Hagelsieb
H Zhou
ICGS Consortium
Iddo Friedberg
J Burnside
JC Camus
JR Wortman
K Sellheyer
KM Kim
L Tian
LL Chen
M Andersson
M Andersson
M Ashburner
M Pruess
M Schena
M Vidric
ME van Berkel
MK Richardson
N Daraselia
N Gupta
N Rocques
O Gundogdu
PB Neerincx
PE Neiman
R Apweiler
R Edgar
RA Shilling
S Washietl
SE Brenner
Shane C. Burgess
SL Salzberg
Susan J. Lamont
T Barrett
TJ Buza
TJ Buza
UM Braga-Neto
V Wood
X Wang
YP de Jong
Publication venue: Public Library of Science
Publication date: 01/01/2010

One motivation of systems biology research is to understand gene functions and interactions from functional genomics data such as that derived from microarrays. Up-to-date structural and functional annotations of genes are an essential foundation of systems biology modeling. We propose that the first essential step in any systems biology modeling of functional genomics data, especially for species with recently sequenced genomes, is gene structural and functional re-annotation. To demonstrate the impact of such re-annotation, we structurally and functionally re-annotated a microarray developed, and previously used, as a tool for disease research. We quantified the impact of this re-annotation on the array based on the total numbers of structural and functional annotations, the Gene Annotation Quality (GAQ) score, and canonical pathway coverage. We next quantified the impact of re-annotation on systems biology modeling using a previously published experiment that used this microarray. We show that re-annotation improves the quantity and quality of structural and functional annotations, allows more comprehensive Gene Ontology-based modeling, and improves pathway coverage for both the whole array and a differentially expressed mRNA subset. Our results also demonstrate that re-annotation can result in a different knowledge outcome from that derived from previously published research findings. We propose that, because of this, re-annotation should be considered an essential first step for deriving value from functional genomics data.

Digital Repository @ Iowa State University (ISU)

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Scholars Junction - Mississippi State University Institutional Repository

An Improved, Bias-Reduced Probabilistic Functional Gene Network of Baker's Yeast, Saccharomyces cerevisiae

Background: Probabilistic functional gene networks are powerful theoretical frameworks for integrating heterogeneous functional genomics and proteomics data into objective models of cellular systems. Such networks provide syntheses of millions of discrete experimental observations, spanning DNA microarray experiments, physical protein interactions, genetic interactions, and comparative genomics; the resulting networks can then be easily applied to generate testable hypotheses regarding specific gene functions and associations. Methodology/Principal Findings: We report a significantly improved version (v. 2) of a probabilistic functional gene network [1] of baker's yeast, Saccharomyces cerevisiae. We describe our optimization methods and illustrate their effects in three major areas: the reduction of functional bias in network training reference sets, the application of a probabilistic model for calculating confidences in pair-wise protein physical or genetic interactions, and the introduction of simple thresholds that eliminate many false positive mRNA co-expression relationships. Using the network, we predict and experimentally verify the function of the yeast RNA binding protein Puf6 in 60S ribosomal subunit biogenesis. Conclusions/Significance: YeastNet v. 2, constructed using these optimizations together with additional data, shows significant reduction in bias and improvements in precision and recall, in total covering 102,803 linkages among 5,483 yeast proteins (95% of the validated proteome). YeastNet is available from http://www.yeastnet.org. This work was supported by grants from the N.S.F. (IIS-0325116, EIA-0219061), N.I.H. (GM06779-01, GM076536-01), Welch (F-1515), and a Packard Fellowship (EMM). These agencies were not involved in the design and conduct of the study, in the collection, analysis, and interpretation of the data, or in the preparation, review, or approval of the manuscript. Cellular and Molecular Biolog

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Texas ScholarWorks