High-dimensional statistical learning: Roots, justifications, and potential machineries
High-dimensional data generally refer to data in which the number of variables is larger than the sample size. Analyzing such datasets poses great challenges for classical statistical learning, because the finite-sample performance of classical methods does not live up to the classical asymptotic premise, in which the sample size grows unboundedly while the dimensionality of the observations stays fixed.
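A quick numerical sketch (our illustration, not from the abstract) makes the problem concrete: once the number of variables exceeds the sample size, the sample covariance matrix is necessarily singular, so any classical method that inverts it breaks down.

```python
# Minimal sketch: with p variables and n < p samples, the sample
# covariance matrix has rank at most n - 1 and cannot be inverted.
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                      # n samples, p variables (p > n)
X = rng.standard_normal((n, p))     # hypothetical data matrix

S = np.cov(X, rowvar=False)         # p x p sample covariance
rank = np.linalg.matrix_rank(S)
print(f"covariance shape: {S.shape}, rank: {rank}")  # rank <= n - 1 < p
```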
An Automated Bayesian Framework for Integrative Gene Expression Analysis and Predictive Medicine
Motivation: This work constructs a closed-loop Bayesian network framework for predictive medicine via integrative analysis of publicly available gene expression findings pertaining to various diseases. Results: An automated pipeline was successfully constructed. Integrative models were built from gene expression data obtained from GEO experiments relating to four different diseases using Bayesian statistical methods. Many of these models demonstrated a high level of accuracy and predictive ability. The approach described in this paper can be applied to any complex disorder and can include any number and type of genome-scale studies.
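The abstract does not detail the models, but the basic Bayesian update underlying any such disease-prediction network can be sketched in a few lines; all probabilities below are invented for illustration, not taken from the paper or from GEO data.

```python
# Hypothetical sketch of the Bayesian update such a pipeline rests on:
# posterior probability of disease given that a marker gene is over-expressed.
p_disease = 0.01                 # prior P(disease), illustrative
p_over_given_d = 0.80            # P(gene over-expressed | disease)
p_over_given_nd = 0.10           # P(gene over-expressed | no disease)

evidence = p_over_given_d * p_disease + p_over_given_nd * (1 - p_disease)
posterior = p_over_given_d * p_disease / evidence
print(f"P(disease | over-expressed) = {posterior:.3f}")  # about 0.075
```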
Analytic Study of Performance of Error Estimators for Linear Discriminant Analysis with Applications in Genomics
Error estimation is required to assess the accuracy of a designed classifier, an issue that is critical in biomarker discovery for disease diagnosis and prognosis in genomics and proteomics. This dissertation is concerned with the analytical formulation of the joint distribution of the true misclassification error and two of its commonly used estimators, resubstitution and leave-one-out, as well as their marginal and mixed moments, in the context of the Linear Discriminant Analysis (LDA) classification rule. In the first part of this dissertation, we obtain the joint sampling distribution of the actual and estimated errors under a general parametric Gaussian assumption. Exact results are provided in the univariate case and an accurate approximation is obtained in the multivariate case. We show how these results can be applied to the computation of conditional bounds and the regression of the actual error given the observed error estimate. In practice, the parameters of the Gaussian distributions that figure in these expressions are unknown and must be estimated. Plugging the usual maximum-likelihood estimates of these parameters into the exact theoretical expressions yields a sample-based approximation to the joint distribution, as well as sample-based methods to estimate upper conditional bounds. In the second part of this dissertation, exact analytical expressions for the bias, variance, and root-mean-square (RMS) error of the resubstitution and leave-one-out estimators in the univariate Gaussian model are derived. All probabilistic characteristics of an error estimator are given by its joint distribution with the true error. Partial information is contained in their mixed moments, in particular the second mixed moment, while marginal information about an error estimator is contained in its marginal moments, in particular its mean and variance. Since we are interested in estimator accuracy and wish to measure it by the RMS, we need the second-order moments, marginal and mixed, with the true error. In the multivariate case, using the double-asymptotic approach under the assumption that the common covariance matrix of the Gaussian model is known, analytical expressions are derived for the first moments, second moments, and the mixed moment with the actual error of the resubstitution and leave-one-out estimators. These expressions provide accurate small-sample approximations, as demonstrated by numerical comparisons. Application of the results is discussed in the context of genomics.
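For readers who want the empirical counterparts of the two estimators studied, here is a sketch, separate from the dissertation's analytic derivations, that computes the resubstitution and leave-one-out error estimates for LDA on synthetic Gaussian data using scikit-learn; the sample sizes and class means are arbitrary choices.

```python
# Resubstitution (training error, optimistically biased) versus
# leave-one-out (nearly unbiased, high variance) for LDA.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
n_per_class, p = 15, 2
X = np.vstack([rng.normal(0.0, 1.0, (n_per_class, p)),   # class 0
               rng.normal(1.0, 1.0, (n_per_class, p))])  # class 1
y = np.repeat([0, 1], n_per_class)

clf = LinearDiscriminantAnalysis().fit(X, y)
resub_error = 1.0 - clf.score(X, y)                       # error on the training data itself
loo_error = 1.0 - cross_val_score(LinearDiscriminantAnalysis(),
                                  X, y, cv=LeaveOneOut()).mean()
print(f"resubstitution: {resub_error:.3f}, leave-one-out: {loo_error:.3f}")
```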
Incorporating prior knowledge induced from stochastic differential equations in the classification of stochastic observations
In classification, prior knowledge is incorporated in a Bayesian framework by assuming that the feature-label distribution belongs to an uncertainty class of feature-label distributions governed by a prior distribution. A posterior
distribution is then derived from the prior and the sample data. An optimal Bayesian classifier (OBC) minimizes the expected misclassification error relative to the posterior distribution. From an application perspective, prior
construction is critical
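Written out in standard notation (the symbols below are conventional choices, not fixed by the abstract): with ε_θ(ψ) the true error of classifier ψ under the feature-label distribution indexed by θ, C the set of classifiers considered, and π* the posterior over the uncertainty class Θ, the OBC criterion is:

```latex
% OBC criterion as stated in the abstract: minimize the expected true
% error with respect to the posterior over the uncertainty class.
\psi_{\mathrm{OBC}}
  = \arg\min_{\psi \in \mathcal{C}} \mathbb{E}_{\pi^*}\!\left[\varepsilon_\theta(\psi)\right]
  = \arg\min_{\psi \in \mathcal{C}} \int_{\Theta} \varepsilon_\theta(\psi)\,\pi^*(\theta)\,d\theta
```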
A Bayesian Translational Framework for Knowledge Propagation, Discovery, and Integration Under Specific Contexts
The immense corpus of biomedical literature existing today poses challenges for information search and integration. Many links between pieces of knowledge occur, or are significant, only under certain contexts rather than across the entire corpus. This study proposes using networks of ontology concepts, linked based on their co-occurrences in annotations of biomedical-literature abstracts and descriptions of experiments, to draw conclusions from context-specific queries and to better integrate existing knowledge. In particular, a Bayesian network framework is constructed that links related terms from two biomedical ontologies under the queried context concept. Edges in such a Bayesian network allow associations between biomedical concepts to be quantified and inference to be made about the existence of some concepts given prior information about others. This approach could potentially be a powerful inferential tool for context-specific queries, applicable to ontologies in other fields as well.
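As a minimal illustration of the co-occurrence statistics such edges rest on (our sketch, with invented annotation sets), the conditional probability of one concept given another can be estimated directly from joint annotation counts:

```python
# Each document is represented by its set of ontology-concept annotations;
# the annotation sets below are invented for illustration.
annotations = [
    {"lung cancer", "smoking", "EGFR"},
    {"lung cancer", "EGFR"},
    {"smoking", "COPD"},
    {"lung cancer", "smoking"},
]

def cond_prob(b, a, docs):
    """Empirical P(b | a): fraction of documents annotated with a that also have b."""
    with_a = [d for d in docs if a in d]
    if not with_a:
        return 0.0
    return sum(b in d for d in with_a) / len(with_a)

print(cond_prob("EGFR", "lung cancer", annotations))  # 2/3
```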
Edge-Aware Spatial Denoising Filtering Based on a Psychological Model of Stimulus Similarity
Noise reduction is a fundamental operation in image quality enhancement. In recent years, a large body of techniques at the crossroads of statistics and functional analysis has been developed to minimize the blurring artifact introduced by the denoising process. Recent studies focus on edge-aware filters because of their tendency to preserve image structures. In this study, we adopt a psychological model of similarity based on Shepard's generalization law and introduce a new signal-dependent window selection technique. Such a focus is warranted because blurring is essentially a cognitive act related to the human perception of physical stimuli (pixels). The proposed windowing technique can be used to implement a wide range of edge-aware spatial denoising filters, thereby transforming them into nonlocal filters. We evaluate the proposed method in simulations on both synthetic and real image samples, quantifying the enhancement in signal strength, noise suppression, and structural preservation in terms of the Peak Signal-to-Noise Ratio (PSNR), Mean Square Error (MSE), and Structural Similarity (SSIM) index, respectively. In our experiments, incorporating the proposed windowing technique into the design of mean, median, and non-local means filters substantially reduces the MSE while simultaneously increasing the PSNR and the SSIM.
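The paper's exact windowing algorithm is not reproduced here, but the core idea of Shepard-style weighting can be sketched: perceived similarity decays exponentially with stimulus distance, so each neighbor is weighted by the exponential of its negative intensity difference from the center pixel before averaging, which suppresses noise while leaving sharp edges largely intact. The radius and scale parameters below are illustrative choices, not the paper's.

```python
# Edge-aware mean filter with Shepard-similarity weights (illustrative sketch).
import numpy as np

def shepard_weighted_mean(img, radius=2, scale=20.0):
    """Weight each neighbor by exp(-|intensity difference| / scale), then average."""
    out = np.empty_like(img, dtype=float)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - radius), min(h, i + radius + 1)
            j0, j1 = max(0, j - radius), min(w, j + radius + 1)
            patch = img[i0:i1, j0:j1].astype(float)
            weights = np.exp(-np.abs(patch - float(img[i, j])) / scale)
            out[i, j] = np.sum(weights * patch) / np.sum(weights)
    return out
```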
An Efficient Method to Estimate the Optimum Regularization Parameter in RLDA
Motivation: The biomarker discovery process in high-throughput genomic profiles has presented the statistical learning community with a challenging problem, namely learning when the number of variables is comparable to or exceeds the sample size. In these settings, many classical techniques, including linear discriminant analysis (LDA), falter. The poor performance of LDA is attributed to the ill-conditioned nature of the sample covariance matrix when the dimension and sample size are comparable. To alleviate this problem, regularized LDA (RLDA) has classically been proposed, in which the sample covariance matrix is replaced by its ridge estimate. However, the performance of RLDA depends heavily on the regularization parameter used in the ridge estimate of the sample covariance matrix.
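RLDA's basic move is easy to sketch (this shows the ridge estimate itself, not the paper's method for choosing the optimal parameter): add gamma times the identity to the pooled sample covariance so that it becomes invertible, then form the usual LDA discriminant. The dimensions and the value of gamma below are arbitrary.

```python
# Ridge-regularized LDA on synthetic two-class Gaussian data.
import numpy as np

rng = np.random.default_rng(0)
n, p, gamma = 30, 80, 0.5          # p > n per class, so the pooled S is rank-deficient
X0 = rng.normal(0.0, 1.0, (n, p))  # class 0 sample
X1 = rng.normal(0.3, 1.0, (n, p))  # class 1 sample

S = 0.5 * (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False))  # pooled covariance, rank <= 2(n-1) < p
S_ridge = S + gamma * np.eye(p)    # ridge estimate: positive definite, hence invertible
w = np.linalg.solve(S_ridge, X1.mean(axis=0) - X0.mean(axis=0))  # RLDA discriminant direction

def predict(x):
    """Assign x to class 1 when the discriminant exceeds the midpoint threshold."""
    return int(w @ (x - 0.5 * (X0.mean(axis=0) + X1.mean(axis=0))) > 0)
```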
Robust Prediction-Based Analysis for Genome-Wide Association and Expression Studies
Here we describe a prediction-based framework to analyze omic data and generate models for both disease diagnosis and identification of cellular pathways that are significant in complex diseases. Our framework differs from previous analyses in its use of underlying biology (cellular pathways/gene sets) to produce predictive feature-disease models. In our study of alcoholism, lung cancer, and schizophrenia, we demonstrate the framework's ability to robustly analyze omic data of multiple types and sources, identify significant feature sets, and produce accurate predictive models.
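The abstract gives no implementation details, but the pathway-level feature construction it alludes to might look roughly like the sketch below, in which each gene set is summarized by its mean expression and a simple classifier is fit on the resulting pathway features; the gene sets, data, and model choice here are all invented.

```python
# One feature per pathway: mean expression over its member genes.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
genes = [f"g{i}" for i in range(100)]
pathways = {"pathway_A": genes[:10], "pathway_B": genes[40:55]}  # invented gene sets

X_genes = rng.standard_normal((60, len(genes)))   # 60 samples x 100 genes
y = rng.integers(0, 2, 60)                        # case/control labels
idx = {g: i for i, g in enumerate(genes)}

X_path = np.column_stack([X_genes[:, [idx[g] for g in members]].mean(axis=1)
                          for members in pathways.values()])
model = LogisticRegression().fit(X_path, y)       # predictive model on pathway features
```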
The Illusion of Distribution-Free Small-Sample Classification in Genomics
Classification has emerged as a major area of investigation in bioinformatics owing to the desire to discriminate phenotypes, in particular disease conditions, using high-throughput genomic data. While many classification rules have been posed, there is a paucity of error estimation rules and an even greater paucity of theory concerning error estimation accuracy. This is problematic because the worth of a classifier depends mainly on its error rate. It is commonplace in bioinformatics papers to have a classification rule applied to a small labeled data set and the error of the resulting classifier estimated on the same data set, most often via cross-validation, without any assumptions being made on the underlying feature-label distribution. Concomitant with this lack of distributional assumptions is the absence of any statement regarding the accuracy of the error estimate. Without such a measure of accuracy, the most common one being the root-mean-square (RMS), the error estimate is essentially meaningless and the worth of the entire paper is questionable. The concomitance of an absence of distributional assumptions and of a measure of error estimation accuracy is assured in small-sample settings because, even when distribution-free bounds exist (and that is rare), the sample sizes required under the bounds are so large as to make them useless for small samples. Thus, distributional bounds are necessary and the distributional assumptions need to be stated. Owing to the epistemological dependence of classifiers on the accuracy of their estimated errors, scientifically meaningful distribution-free classification in high-throughput, small-sample biology is an illusion.
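For concreteness, the RMS the abstract refers to can be written out: with ε the true error of the designed classifier and ε̂ its estimate, the RMS decomposes into the deviation variance and the squared bias.

```latex
% RMS of an error estimator relative to the true error: the standard
% decomposition into deviation variance plus squared bias,
% where Bias(\hat{\varepsilon}) = E[\hat{\varepsilon} - \varepsilon].
\mathrm{RMS}(\hat{\varepsilon})
  = \sqrt{\mathbb{E}\!\left[(\hat{\varepsilon} - \varepsilon)^{2}\right]}
  = \sqrt{\mathrm{Var}(\hat{\varepsilon} - \varepsilon) + \mathrm{Bias}(\hat{\varepsilon})^{2}}
```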