5,178 research outputs found

Feature Augmentation via Nonparametrics and Selection (FANS) in High Dimensional Classification

Author: Fan Jianqing
Feng Yang
Jiang Jiancheng
Tong Xin
Publication date: 02/01/2015

We propose a high dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing. Comment: 30 pages, 2 figures
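
The two stages described in this abstract (transform each feature by an estimated marginal log density ratio, then fit a penalized logistic regression on the transformed features) can be illustrated with a rough sketch. This is not the authors' implementation. It assumes a fixed-bandwidth Gaussian kernel density estimate, a plain proximal-gradient solver for the L1 penalty, a simple half/half data split, and a toy data set. All function names, the bandwidth, and the tuning values below are hypothetical choices made for illustration only.

```python
import math
import random


def gaussian_kde(sample, bandwidth):
    """Return a fixed-bandwidth Gaussian kernel density estimate of `sample`."""
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in sample) / norm

    return density


def fit_log_ratio_transforms(X, y, bandwidth=0.5, eps=1e-12):
    """Per feature j, build x -> log f_j(x) - log g_j(x) from class-1 / class-0 KDEs."""
    transforms = []
    for j in range(len(X[0])):
        f = gaussian_kde([row[j] for row, lab in zip(X, y) if lab == 1], bandwidth)
        g = gaussian_kde([row[j] for row, lab in zip(X, y) if lab == 0], bandwidth)
        # eps (an assumption of this sketch) guards against log(0) in the tails.
        transforms.append(lambda x, f=f, g=g: math.log(f(x) + eps) - math.log(g(x) + eps))
    return transforms


def augment(X, transforms):
    """Apply the per-feature log-ratio transforms to every row."""
    return [[t(v) for t, v in zip(transforms, row)] for row in X]


def soft_threshold(w, t):
    """Proximal operator of the L1 penalty."""
    return math.copysign(max(abs(w) - t, 0.0), w)


def fit_l1_logistic(Z, y, lam=0.05, lr=0.05, n_iter=300):
    """L1-penalized logistic regression via proximal gradient descent."""
    n, p = len(Z), len(Z[0])
    w, b = [0.0] * p, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * p, 0.0
        for row, lab in zip(Z, y):
            score = b + sum(wj * zj for wj, zj in zip(w, row))
            score = max(-30.0, min(30.0, score))  # avoid overflow in exp
            err = 1.0 / (1.0 + math.exp(-score)) - lab
            grad_b += err / n
            for j in range(p):
                grad_w[j] += err * row[j] / n
        b -= lr * grad_b
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b


def predict(Z, w, b):
    """Classify as 1 when the linear score on augmented features is positive."""
    return [1 if b + sum(wj * zj for wj, zj in zip(w, row)) > 0 else 0 for row in Z]


def make_toy_data(n, rng):
    """Toy data: feature 0 is bimodal for class 1, feature 1 is pure noise."""
    X, y = [], []
    for i in range(n):
        lab = i % 2
        if lab == 1:
            signal = rng.choice([-3.0, 3.0]) + rng.gauss(0.0, 0.5)
        else:
            signal = rng.gauss(0.0, 1.0)
        X.append([signal, rng.gauss(0.0, 1.0)])
        y.append(lab)
    return X, y


rng = random.Random(0)
X_train, y_train = make_toy_data(200, rng)
X_test, y_test = make_toy_data(200, rng)

# Assumption of this sketch: densities are estimated on one half of the
# training data and the regression is fitted on the other half.
half = len(X_train) // 2
transforms = fit_log_ratio_transforms(X_train[:half], y_train[:half])
w, b = fit_l1_logistic(augment(X_train[half:], transforms), y_train[half:])

pred = predict(augment(X_test, transforms), w, b)
accuracy = sum(p == t for p, t in zip(pred, y_test)) / len(y_test)
print("weights:", w, "intercept:", b, "test accuracy:", accuracy)
```

The toy classes cannot be separated by a linear boundary in the raw first feature, yet a linear model on the log-ratio feature can separate them, which is the sense in which the decision boundary is nonlinear in the original measurements.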

arXiv.org e-Print Archive

Princeton University Open Access Repository

Identification of a Sjogren's syndrome susceptibility locus at OAS1 that influences isoform switching, protein expression, and responsiveness to type I interferons

Author: et al.
Huang Andrew J. W.
Publication venue: Digital Commons@Becker
Publication date: 01/01/2017

Digital Commons@Becker

Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review

Author: Akavia
Andrews
Baasiri
Chin
Dai
De Bie
Futreal
H.-U. Klein
Haverty
Hawkins
Hyman
Johnson
Kao
L. Lahti
M. Dugas
M. Schafer
McLendon
Menezes
Mullighan
Mullighan
Myllykangas
Olshen
Ortiz-Estevez
Phillips
Qin
S. Bicciato
Solvang
Soneson
Stranger
van Wieringen
van Wieringen
Publication venue: 'Oxford University Press (OUP)'
Publication date: 20/11/2011

A variety of genome-wide profiling techniques are available to probe complementary aspects of genome structure and function. Integrative analysis of heterogeneous data sources can reveal higher-level interactions that cannot be detected based on individual observations. A standard integration task in cancer studies is to identify altered genomic regions that induce changes in the expression of the associated genes based on joint analysis of genome-wide gene expression and copy number profiling measurements. In this review, we provide a comparison among various modeling procedures for integrating genome-wide profiling data of gene copy number and transcriptional alterations and highlight common approaches to genomic data integration. A transparent benchmarking procedure is introduced to quantitatively compare the cancer gene prioritization performance of the alternative methods. The benchmarking algorithms and data sets are available at http://intcomp.r-forge.r-project.org. Comment: PDF file including supplementary material. 9 pages. Preprint

arXiv.org e-Print Archive

Crossref

PubMed Central

Wageningen University & Research Publications

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

Challenges of Big Data Analysis

Author: Fan Jianqing
Han Fang
Liu Han
Publication venue: 'Oxford University Press (OUP)'
Publication date: 06/02/2014

Big Data bring new opportunities to modern society and challenges to data scientists. On one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that are not possible with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottleneck, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require a new computational and statistical paradigm. This article gives an overview of the salient features of Big Data and of how these features drive a paradigm change in statistical and computational methods as well as computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogenous assumptions in most statistical methods for Big Data cannot be validated due to incidental endogeneity. They can lead to wrong statistical inferences and consequently wrong scientific conclusions.

arXiv.org e-Print Archive

CiteSeerX

Princeton University Open Access Repository

A method for analyzing censored survival phenotype with gene expression data

Author: Chen Chun-Houh
Li Ker-Chau
Sun Wei
Wu Tongtong
Yuan Shinsheng
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: Survival time is an important clinical trait for many disease studies. Previous work has shown a relationship between patients' gene expression profiles and survival time. However, due to the censoring effects of survival time and the high dimensionality of gene expression data, effective and unbiased selection of a gene expression signature to predict survival probabilities requires further study. Method: We propose a method for an integrated study of survival time and gene expression. This method can be summarized as a two-step procedure: in the first step, a moderate number of genes are pre-selected using correlation or liquid association (LA). Imputation and transformation methods are employed for the correlation/LA calculation. In the second step, the dimension of the predictors is further reduced using the modified sliced inverse regression for censored data (censorSIR). Results: The new method is tested on both simulated and real data. For the real data application, we employed a set of 295 breast cancer patients and found a linear combination of 22 gene expression profiles that is significantly correlated with patients' survival rate. Conclusion: By an appropriate combination of feature selection and dimension reduction, we find a method of identifying gene expression signatures which is effective for survival prediction.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Carolina Digital Repository

Digital Repository at the University of Maryland

A genome-wide association study in chronic thromboembolic pulmonary hypertension and the ADAMTS13-VWF axis

Author: Newnham Michael
Publication venue: University of Cambridge
Publication date: 03/06/2019

Chronic thromboembolic pulmonary hypertension (CTEPH) is an important and severe consequence of pulmonary embolism (PE), resulting from failure of thrombus resolution. Identifying genetic risk factors for CTEPH would provide important insights into pathobiology and might allow risk-stratification following PE. A genome-wide association study (GWAS) was performed in 1250 CTEPH patients, 1492 healthy controls and ~7 million single-nucleotide polymorphisms to identify novel disease loci. The ABO locus was identified as the most significant common variant genetic association with CTEPH in both a discovery and validation cohort. The A1 subgroup of ABO was enriched in CTEPH and this may result in multiple functional consequences including variation in plasma von Willebrand factor (VWF) levels. Abnormalities in haemostasis are implicated in CTEPH pathobiology, including elevated levels of VWF, which is cleaved by ADAMTS13 (a disintegrin and metalloproteinase with a thrombospondin type 1 motif, member 13). The ADAMTS13-VWF axis was investigated in 208 CTEPH patients including its relationship to ABO blood groups and ADAMTS13 genetic variants. Plasma ADAMTS13 levels are markedly reduced in CTEPH. This is independent of pulmonary hypertension, disease severity or systemic inflammation. Plasma VWF levels were confirmed to be markedly increased in CTEPH. These findings implicate dysregulation of the ADAMTS13-VWF axis in CTEPH pathobiology

Apollo (Cambridge)