Scalable Data Augmentation for Deep Learning

Author: Polson Nicholas G.
Sokolov Vadim O.
Wang Yuexi
Publication date: 22/03/2019

Scalable Data Augmentation (SDA) provides a framework for training deep learning models using auxiliary hidden layers. Scalable MCMC is available for network training and inference. SDA provides a number of computational advantages over traditional algorithms: it avoids backtracking and local modes, and it can perform optimization with stochastic gradient descent (SGD) in TensorFlow. Standard deep neural networks with logit, ReLU and SVM activation functions are straightforward to implement. To illustrate our architectures and methodology, we use Pólya-Gamma logit data augmentation for a number of standard datasets. Finally, we conclude with directions for future research.
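
For readers unfamiliar with the Pólya-Gamma logit augmentation named in this abstract, the standard integral identity behind it is reproduced here. The abstract itself does not state the identity, so treating this as the exact form used in the paper is an assumption. The symbols a, b, \kappa, \psi and \omega are notation chosen for this note.

```latex
% Polya-Gamma identity for logistic-type likelihoods, with b > 0 and
% p(\omega) the density of a PG(b, 0) random variable:
\[
  \frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}}
  = 2^{-b}\, e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega)\, d\omega,
  \qquad \kappa = a - \frac{b}{2}.
\]
% Conditional on \omega the right-hand side is Gaussian in \psi, and the
% conditional law of the auxiliary variable is \omega \mid \psi \sim PG(b, \psi).
```

Because the augmented likelihood is Gaussian in \psi once \omega is given, each logit layer reduces to a weighted least-squares style update, which is the usual reason this augmentation is attractive.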

arXiv.org e-Print Archive

Feature Augmentation via Nonparametrics and Selection (FANS) in High Dimensional Classification

Author: Fan Jianqing
Feng Yang
Jiang Jiancheng
Tong Xin
Publication date: 02/01/2015

We propose a high dimensional classification method that involves nonparametric feature augmentation. Knowing that marginal density ratios are the most powerful univariate classifiers, we use the ratio estimates to transform the original feature measurements. Subsequently, penalized logistic regression is invoked, taking as input the newly transformed or augmented features. This procedure trains models equipped with local complexity and global simplicity, thereby avoiding the curse of dimensionality while creating a flexible nonlinear decision boundary. The resulting method is called Feature Augmentation via Nonparametrics and Selection (FANS). We motivate FANS by generalizing the Naive Bayes model, writing the log ratio of joint densities as a linear combination of those of marginal densities. It is related to generalized additive models, but has better interpretability and computability. Risk bounds are developed for FANS. In numerical analysis, FANS is compared with competing methods, so as to provide a guideline on its best application domain. Real data analysis demonstrates that FANS performs very competitively on benchmark email spam and gene expression data sets. Moreover, FANS is implemented by an extremely fast algorithm through parallel computing. Comment: 30 pages, 2 figures
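
The feature transformation described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a Gaussian kernel density estimate with a fixed bandwidth, and the function names (`gaussian_kde`, `fit_log_ratio_transforms`, `augment`) and the `eps` guard are hypothetical. It covers only the augmentation step; the penalized logistic regression that consumes the transformed features is omitted.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a 1-D Gaussian kernel density estimate built from `samples`."""
    norm = len(samples) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        return sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples) / norm

    return density

def fit_log_ratio_transforms(X, y, bandwidth=0.5, eps=1e-12):
    """For each feature j, build x -> log(f_j(x) / g_j(x)), where f_j and g_j
    are kernel estimates of the class-1 and class-0 marginal densities.
    `bandwidth` and `eps` are illustrative defaults, not values from the paper."""
    transforms = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        f, g = gaussian_kde(pos, bandwidth), gaussian_kde(neg, bandwidth)
        # Bind f and g as defaults so each lambda keeps its own densities.
        transforms.append(lambda x, f=f, g=g: math.log((f(x) + eps) / (g(x) + eps)))
    return transforms

def augment(X, transforms):
    """Replace every feature value by its estimated marginal log density ratio."""
    return [[t(v) for t, v in zip(transforms, row)] for row in X]

# Toy data: feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [0.2, 4.8], [1.0, 5.1], [1.2, 4.9]]
y = [0, 0, 1, 1]
Z = augment(X, fit_log_ratio_transforms(X, y))
```

On the toy data, the transformed first feature is negative for class-0 rows and positive for class-1 rows, which is the signal a downstream linear classifier would pick up.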

arXiv.org e-Print Archive

Princeton University Open Access Repository

Feature selection and classification of imbalanced datasets. Application to PET images of children with Autistic Spectrum Disorders

Author: Boddaert Nathalie
Cachia Arnaud
Chabane Nadia
Duchesnay Edouard
Mangin Jean-François
Martinot Jean-Luc
Zilbovicius Monica
Publication venue: HAL CCSD
Publication date: 03/05/2011

Learning with discriminative methods is generally based on minimizing the misclassification of training samples, which may be unsuitable for imbalanced datasets where the recognition might be biased in favor of the most numerous class. This problem can be addressed with a generative approach, which typically requires more parameters to be determined leading to reduced performances in high dimension. In such situations, dimension reduction becomes a crucial issue. We propose a feature selection / classification algorithm based on generative methods in order to predict the clinical status of a highly imbalanced dataset made of PET scans of forty-five low-functioning children with autism spectrum disorders (ASD) and thirteen non-ASD low-functioning children. ASDs are typically characterized by impaired social interaction, narrow interests, and repetitive behaviours, with a high variability in expression and severity. The numerous findings revealed by brain imaging studies suggest that ASD is associated with a complex and distributed pattern of abnormalities that makes the identification of a shared and common neuroimaging profile a difficult task. In this context, our goal is to identify the rest functional brain imaging abnormalities pattern associated with ASD and to validate its efficiency in individual classification. The proposed feature selection algorithm detected a characteristic pattern in the ASD group that included a hypoperfusion in the right Superior Temporal Sulcus (STS) and a hyperperfusion in the contralateral postcentral area. Our algorithm allowed for a significantly accurate (88%), sensitive (91%) and specific (77%) prediction of clinical category. For this imbalanced dataset, with only 13 control scans, the proposed generative algorithm outperformed other state-of-the-art discriminant methods. The high predictive power of the characteristic pattern, which has been automatically identified on whole brains without any priors, confirms previous findings concerning the role of STS in ASD. This work offers exciting possibilities for early autism detection and/or the evaluation of treatment response in individual patients.
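
To make the three reported rates concrete, the sketch below computes accuracy, sensitivity and specificity from confusion-matrix counts. The abstract does not report a confusion matrix; the counts used here (41 of 45 ASD scans and 10 of 13 control scans correctly classified) are an assumption, chosen only because they are consistent with the rounded percentages and the group sizes given in the abstract.

```python
def classification_rates(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)   # share of positive (ASD) cases detected
    specificity = tn / (tn + fp)   # share of negative (control) cases detected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return accuracy, sensitivity, specificity

# Hypothetical counts, NOT taken from the paper: 45 ASD scans, 13 control scans.
acc, sens, spec = classification_rates(tp=41, fn=4, tn=10, fp=3)
print(f"accuracy={acc:.0%} sensitivity={sens:.0%} specificity={spec:.0%}")
# prints: accuracy=88% sensitivity=91% specificity=77%
```

With classes this unbalanced, a classifier that labelled every scan as ASD would already reach 45/58, about 78% accuracy, with 0% specificity, which is why sensitivity and specificity are reported alongside accuracy.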

HAL Descartes

HAL-CEA

Hal-Diderot

Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

Author: Marko Nicholas
Razzaghi Talayeh
Roderick Oleg
Safro Ilya
Publication venue: Public Library of Science (PLoS)
Publication date: 07/04/2016

This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries and with imbalance in the classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques of data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results. Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
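
The cost-sensitive SVM mentioned in this abstract can be illustrated with a small sketch. It is not the authors' multilevel algorithm and does not include the imputation step: it only shows a linear SVM whose hinge loss is scaled by a per-class cost, trained by plain subgradient descent. The function names, the hyperparameters (`lam`, `lr`, `epochs`) and the toy data are all hypothetical.

```python
def train_weighted_linear_svm(X, y, class_weight, lam=0.01, lr=0.1, epochs=200):
    """Subgradient descent on
        lam/2 * ||w||^2 + (1/n) * sum_i c[y_i] * max(0, 1 - y_i * (w . x_i + b)),
    with labels in {+1, -1} and class_weight mapping each label to its cost."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]   # gradient of the L2 penalty
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:                # only margin violators contribute
                cost = class_weight[yi] / n
                for j in range(d):
                    grad_w[j] -= cost * yi * xi[j]
                grad_b -= cost * yi
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    """Return +1 or -1 according to the sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Toy imbalanced data: six negatives, two positives, one feature.
X = [[-3.0], [-2.5], [-2.0], [-1.5], [-1.0], [-1.2], [1.5], [2.0]]
y = [-1, -1, -1, -1, -1, -1, 1, 1]
# Misclassifying the minority class costs three times as much.
w, b = train_weighted_linear_svm(X, y, class_weight={1: 3.0, -1: 1.0})
```

Setting the minority-class cost to the inverse class frequency, as done here with 3.0 for a 6-to-2 split, is one common heuristic; the abstract does not say how the paper chooses its costs.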

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare