Feature Augmentation via Nonparametrics and Selection (FANS) in High Dimensional Classification
We propose a high dimensional classification method that involves
nonparametric feature augmentation. Knowing that marginal density ratios are
the most powerful univariate classifiers, we use the ratio estimates to
transform the original feature measurements. Subsequently, penalized logistic
regression is invoked, taking as input the newly transformed or augmented
features. This procedure trains models equipped with local complexity and
global simplicity, thereby avoiding the curse of dimensionality while creating
a flexible nonlinear decision boundary. The resulting method is called Feature
Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by
generalizing the Naive Bayes model, writing the log ratio of joint densities as
a linear combination of those of marginal densities. It is related to
generalized additive models, but has better interpretability and computability.
Risk bounds are developed for FANS. In numerical analysis, FANS is compared
with competing methods, so as to provide a guideline on its best application
domain. Real data analysis demonstrates that FANS performs very competitively
on benchmark email spam and gene expression data sets. Moreover, FANS is
implemented by an extremely fast algorithm through parallel computing.
Comment: 30 pages, 2 figures
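The pipeline described in this abstract (estimate marginal density ratios, replace each feature by its estimated log ratio, then fit a penalized logistic regression on the transformed features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian kernel with fixed bandwidth, the smoothing constant `eps`, the clipping of log ratios, the use of separate halves of the training data for the two stages, and the proximal-gradient solver are all choices made here, and the synthetic data are invented for the demonstration. No parallel computing is attempted.

```python
import math
import random

def kde(sample, bandwidth):
    """Gaussian kernel density estimate built from a 1-D sample."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)
    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm
    return density

def fit_augmenter(X, y, bandwidth=0.5, eps=1e-3, clip=5.0):
    """Estimate per-feature class-conditional densities and return a function
    mapping a row to its clipped marginal log density ratios (assumed settings)."""
    pairs = []
    for j in range(len(X[0])):
        f = kde([row[j] for row, lab in zip(X, y) if lab == 1], bandwidth)
        g = kde([row[j] for row, lab in zip(X, y) if lab == 0], bandwidth)
        pairs.append((f, g))
    def transform(x):
        out = []
        for (f, g), v in zip(pairs, x):
            log_ratio = math.log((f(v) + eps) / (g(v) + eps))
            out.append(max(-clip, min(clip, log_ratio)))
        return out
    return transform

def sigmoid(t):
    # Clamp the argument so math.exp cannot overflow.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, t))))

def soft_threshold(v, t):
    return math.copysign(max(abs(v) - t, 0.0), v)

def l1_logistic(Z, y, lam=0.05, step=0.02, iters=300):
    """L1-penalized logistic regression via proximal gradient descent."""
    n, p = len(Z), len(Z[0])
    b0, beta = 0.0, [0.0] * p
    for _ in range(iters):
        g0, grad = 0.0, [0.0] * p
        for z, lab in zip(Z, y):
            r = sigmoid(b0 + sum(b * v for b, v in zip(beta, z))) - lab
            g0 += r / n
            for j in range(p):
                grad[j] += r * z[j] / n
        b0 -= step * g0  # the intercept is not penalized
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return b0, beta

def make_data(n, rng):
    """Synthetic data: feature 0 differs in mean, feature 1 in variance,
    and the remaining three features are pure noise."""
    X, y = [], []
    for i in range(n):
        lab = i % 2
        x0 = rng.gauss(1.5 if lab else -1.5, 1.0)
        x1 = rng.gauss(0.0, 0.5 if lab else 2.0)
        X.append([x0, x1] + [rng.gauss(0.0, 1.0) for _ in range(3)])
        y.append(lab)
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(100, rng)

# First half: density estimation. Second half: penalized regression.
half = len(X_train) // 2
augment = fit_augmenter(X_train[:half], y_train[:half])
Z_fit = [augment(x) for x in X_train[half:]]
b0, beta = l1_logistic(Z_fit, y_train[half:])

def predict(x):
    z = augment(x)
    return 1 if b0 + sum(b * v for b, v in zip(beta, z)) > 0 else 0

accuracy = sum(predict(x) == lab for x, lab in zip(X_test, y_test)) / len(y_test)
print("coefficients:", [round(b, 3) for b in beta])
print("test accuracy:", accuracy)
```

Because each transformed feature is a nonlinear function of a single original feature, the fitted decision boundary is nonlinear in the original measurements while the final model stays linear in the augmented ones, which is one way to read the abstract's "local complexity and global simplicity".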
Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review
A variety of genome-wide profiling techniques are available to probe
complementary aspects of genome structure and function. Integrative analysis of
heterogeneous data sources can reveal higher-level interactions that cannot be
detected based on individual observations. A standard integration task in
cancer studies is to identify altered genomic regions that induce changes in
the expression of the associated genes based on joint analysis of genome-wide
gene expression and copy number profiling measurements. In this review, we
provide a comparison among various modeling procedures for integrating
genome-wide profiling data of gene copy number and transcriptional alterations
and highlight common approaches to genomic data integration. A transparent
benchmarking procedure is introduced to quantitatively compare the cancer gene
prioritization performance of the alternative methods. The benchmarking
algorithms and data sets are available at http://intcomp.r-forge.r-project.org
Comment: PDF file including supplementary material. 9 pages. Preprint
Challenges of Big Data Analysis
Big Data bring new opportunities to modern society and challenges to data
scientists. On one hand, Big Data hold great promise for discovering subtle
population patterns and heterogeneities that are not possible with small-scale
data. On the other hand, the massive sample size and high dimensionality of Big
Data introduce unique computational and statistical challenges, including
scalability and storage bottlenecks, noise accumulation, spurious correlation,
incidental endogeneity, and measurement errors. These challenges are
distinctive and require new computational and statistical paradigms. This
article gives an overview of the salient features of Big Data and of how these
features drive paradigm change in statistical and computational methods as
well as in computing architectures. We also provide various new perspectives on
Big Data analysis and computation. In particular, we emphasize the
viability of the sparsest solution in a high-confidence set and point out that
exogeneity assumptions in most statistical methods for Big Data cannot be
validated due to incidental endogeneity. They can lead to wrong statistical
inferences and consequently wrong scientific conclusions.
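Spurious correlation, one of the challenges this abstract lists, can be illustrated with a small simulation: when many predictors are generated independently of a response, the largest absolute sample correlation among them still grows as more predictors are examined. The sketch below is only an illustration; the sample size, the number of predictors, and the random seed are assumptions chosen here and do not come from the article.

```python
import math
import random

def pearson(u, v):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

rng = random.Random(1)
n, p = 50, 1000  # assumed sizes: few observations, many predictors

# The response and every predictor are drawn independently,
# so all true correlations are exactly zero.
y = [rng.gauss(0.0, 1.0) for _ in range(n)]
abs_corr = []
for _ in range(p):
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    abs_corr.append(abs(pearson(x, y)))

max_small = max(abs_corr[:10])  # best of the first 10 predictors
max_large = max(abs_corr)       # best of all 1000 predictors
print("max |corr| over 10 predictors:  ", round(max_small, 3))
print("max |corr| over 1000 predictors:", round(max_large, 3))
```

A variable-selection procedure that keeps the most correlated predictor would select a pure-noise variable here, which is why high dimensionality calls for extra care.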
A method for analyzing censored survival phenotype with gene expression data
Background: Survival time is an important clinical trait in many disease studies. Previous work has shown a relationship between patients' gene expression profiles and survival time. However, due to the censoring of survival time and the high dimensionality of gene expression data, effective and unbiased selection of a gene expression signature to predict survival probabilities requires further study.
Method: We propose a method for an integrated study of survival time and gene expression. The method is a two-step procedure. In the first step, a moderate number of genes are pre-selected using correlation or liquid association (LA); imputation and transformation methods are employed for the correlation/LA calculation. In the second step, the dimension of the predictors is further reduced using the modified sliced inverse regression for censored data (censorSIR).
Results: The new method is tested on both simulated and real data. For the real data application, we used a set of 295 breast cancer patients and found a linear combination of 22 gene expression profiles that are significantly correlated with patients' survival rate.
Conclusion: By an appropriate combination of feature selection and dimension reduction, we find a method of identifying gene expression signatures that is effective for survival prediction.
A genome-wide association study in chronic thromboembolic pulmonary hypertension and the ADAMTS13-VWF axis
Chronic thromboembolic pulmonary hypertension (CTEPH) is an important and severe consequence of pulmonary embolism (PE), resulting from failure of thrombus resolution. Identifying genetic risk factors for CTEPH would provide important insights into pathobiology and might allow risk-stratification following PE. A genome-wide association study (GWAS) was performed in 1250 CTEPH patients, 1492 healthy controls and ~7 million single-nucleotide polymorphisms to identify novel disease loci.
The ABO locus was identified as the most significant common variant genetic association with CTEPH in both a discovery and validation cohort. The A1 subgroup of ABO was enriched in CTEPH and this may result in multiple functional consequences including variation in plasma von Willebrand factor (VWF) levels.
Abnormalities in haemostasis are implicated in CTEPH pathobiology, including elevated levels of VWF, which is cleaved by ADAMTS13 (a disintegrin and metalloproteinase with a thrombospondin type 1 motif, member 13). The ADAMTS13-VWF axis was investigated in 208 CTEPH patients including its relationship to ABO blood groups and ADAMTS13 genetic variants.
Plasma ADAMTS13 levels are markedly reduced in CTEPH. This is independent of pulmonary hypertension, disease severity or systemic inflammation. Plasma VWF levels were confirmed to be markedly increased in CTEPH. These findings implicate dysregulation of the ADAMTS13-VWF axis in CTEPH pathobiology.