21 research outputs found
On the explanatory power of principal components
We show that if we have an orthogonal base () in a
-dimensional vector space, and select vectors and
such that the vectors traverse the origin, then the probability of
being closer to all the vectors in the base than to is at
least 1/2 and converges as increases to infinity to a normal distribution
on the interval [-1,1]; i.e., . This result has
relevant consequences for Principal Components Analysis in the context of
regression and other learning settings, if we take the orthogonal base as the
direction of the principal components. Comment: 10 pages, 3 figures
Cross-validation and Peeling Strategies for Survival Bump Hunting using Recursive Peeling Methods
We introduce a framework to build a survival/risk bump hunting model with a
censored time-to-event response. Our Survival Bump Hunting (SBH) method is
based on a recursive peeling procedure that uses a specific survival peeling
criterion derived from non/semi-parametric statistics such as the
hazards-ratio, the log-rank test or the Nelson-Aalen estimator. To optimize the
tuning parameter of the model and validate it, we introduce an objective
function based on survival or prediction-error statistics, such as the log-rank
test and the concordance error rate. We also describe two alternative
cross-validation techniques adapted to the joint task of decision-rule making
by recursive peeling and survival estimation. Numerical analyses show the
importance of replicated cross-validation and the differences between criteria
and techniques in both low and high-dimensional settings. Although several
non-parametric survival models exist, none addresses the problem of directly
identifying local extrema. We show how SBH efficiently estimates extreme
survival/risk subgroups unlike other models. This provides an insight into the
behavior of commonly used models and suggests alternatives to be adopted in
practice. Finally, our SBH framework was applied to a clinical dataset. In it,
we identified subsets of patients characterized by clinical and demographic
covariates with a distinct extreme survival outcome, for which tailored medical
interventions could be made. An R package `PRIMsrc` is available on CRAN and
GitHub. Comment: Keywords: Exploratory Survival/Risk Analysis, Survival/Risk
Estimation & Prediction, Non-Parametric Method, Cross-Validation, Bump
Hunting, Rule-Induction Method
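One of the statistics named in the abstract above, the Nelson-Aalen estimator of the cumulative hazard, can be sketched in a few lines. This is a minimal illustration only, not the PRIMsrc implementation: the function name `nelson_aalen` and the toy data are assumptions made for the example. It uses the standard definition, a running sum of (events at time t) / (subjects at risk just before t).

```python
from collections import Counter


def nelson_aalen(times, events):
    """Return [(t, H(t))] at each distinct event time.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if right-censored
    (Hypothetical helper for illustration; not part of any package.)
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)      # subjects leaving the risk set at each time
    at_risk = len(times)
    cumulative = 0.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # hazard increment: events at t over subjects still at risk
            cumulative += d / at_risk
            curve.append((t, cumulative))
        # events and censored subjects at t both leave the risk set
        at_risk -= leaving[t]
    return curve


# Toy data (made up): five subjects, two of them censored.
times = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]
curve = nelson_aalen(times, events)
for t, h in curve:
    print(f"t={t}: H(t)={h:.3f}")
```

In a peeling procedure of the kind described, a statistic like this would be recomputed on the samples remaining inside the box after each peel; how SBH combines it with the other criteria is not specified here.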
Unsupervised Bump Hunting Using Principal Components
Principal Components Analysis is a widely used technique for dimension
reduction and characterization of variability in multivariate populations. Our
interest lies in studying when and why the rotation to principal components can
be used effectively within a response-predictor set relationship in the context
of mode hunting. Specifically focusing on the Patient Rule Induction Method
(PRIM), we first develop a fast version of this algorithm (fastPRIM) under
normality which facilitates the theoretical studies to follow. Using basic
geometrical arguments, we then demonstrate how the PC rotation of the predictor
space alone can in fact generate improved mode estimators. Simulation results
are used to illustrate our findings. Comment: 24 pages, 9 figures
Metabolomics of ApcMin/+ mice genetically susceptible to intestinal cancer
BACKGROUND: To determine how diets high in saturated fat could increase polyp formation in the mouse model of intestinal neoplasia, ApcMin/+, we conducted large-scale metabolome analysis and association study of colon and small intestine polyp formation from plasma and liver samples of ApcMin/+ vs. wild-type littermates, kept on low vs. high-fat diet. Label-free mass spectrometry was used to quantify untargeted plasma and acyl-CoA liver compounds, respectively. Differences in contrasts of interest were analyzed statistically by unsupervised and supervised modeling approaches, namely Principal Component Analysis and Linear Model of analysis of variance. Correlation between plasma metabolite concentrations and polyp numbers was analyzed with a zero-inflated Generalized Linear Model. RESULTS: Plasma metabolome in parallel to promotion of tumor development comprises a clearly distinct profile in ApcMin/+ mice vs. wild type littermates, which is further altered by high-fat diet. Further, functional metabolomics pathway and network analyses in ApcMin/+ mice on high-fat diet revealed associations between polyp formation and plasma metabolic compounds including those involved in amino-acids metabolism as well as nicotinamide and hippuric acid metabolic pathways. Finally, we also show changes in liver acyl-CoA profiles, which may result from a combination of ApcMin/+-mediated tumor progression and high fat diet. The biological significance of these findings is discussed in the context of intestinal cancer progression. CONCLUSIONS: These studies show that high-throughput metabolomics combined with appropriate statistical modeling and large scale functional approaches can be used to monitor and infer changes and interactions in the metabolome and genome of the host under controlled experimental conditions. Further these studies demonstrate the impact of diet on metabolic pathways and its relation to intestinal cancer progression. 
Based on our results, metabolic signatures and metabolic pathways of polyposis and intestinal carcinoma have been identified, which may serve as useful targets for the development of therapeutic interventions.
Metabolomics of ApcMin/+ Mice Genetically Susceptible to Intestinal Cancer
Background: To determine how diets high in saturated fat could increase polyp formation in the mouse model of intestinal neoplasia, ApcMin/+, we conducted large-scale metabolome analysis and association study of colon and small intestine polyp formation from plasma and liver samples of ApcMin/+ vs. wild-type littermates, kept on low vs. high-fat diet. Label-free mass spectrometry was used to quantify untargeted plasma and acyl-CoA liver compounds, respectively. Differences in contrasts of interest were analyzed statistically by unsupervised and supervised modeling approaches, namely Principal Component Analysis and Linear Model of analysis of variance. Correlation between plasma metabolite concentrations and polyp numbers was analyzed with a zero-inflated Generalized Linear Model. Results: Plasma metabolome in parallel to promotion of tumor development comprises a clearly distinct profile in ApcMin/+ mice vs. wild type littermates, which is further altered by high-fat diet. Further, functional metabolomics pathway and network analyses in ApcMin/+ mice on high-fat diet revealed associations between polyp formation and plasma metabolic compounds including those involved in amino-acids metabolism as well as nicotinamide and hippuric acid metabolic pathways. Finally, we also show changes in liver acyl-CoA profiles, which may result from a combination of ApcMin/+-mediated tumor progression and high fat diet. The biological significance of these findings is discussed in the context of intestinal cancer progression. Conclusions: These studies show that high-throughput metabolomics combined with appropriate statistical modeling and large scale functional approaches can be used to monitor and infer changes and interactions in the metabolome and genome of the host under controlled experimental conditions. Further these studies demonstrate the impact of diet on metabolic pathways and its relation to intestinal cancer progression. 
Based on our results, metabolic signatures and metabolic pathways of polyposis and intestinal carcinoma have been identified, which may serve as useful targets for the development of therapeutic interventions. © 2014 Dazard et al.; licensee BioMed Central Ltd.
Studying genetic determinants of natural variation in human gene expression using Bayesian ANOVA
Standard genetic mapping techniques scan chromosomal segments for location of genetic linkage and association signals. The majority of these methods consider only correlations at single markers and/or phenotypes with explicit detailing of the genetic structure. These methods tend to be limited by their inability to consider the effect of large numbers of model variables jointly. In contrast, we propose a Bayesian analysis of variance (ANOVA) method to categorize individuals based on similarity of multidimensional profiles and attempt to analyze all variables simultaneously. Using Problem 1 of the Genetic Analysis Workshop 15 data set, we demonstrate the method's utility for joint analysis of gene expression levels and single-nucleotide polymorphism genotypes. We show that the method extracts similar information to that of previous genetic mapping analyses, and suggest extensions of the method for mining unique information not previously found.
The dynamics of E1A in regulating networks and canonical pathways in quiescent cells
Background: Adenoviruses force quiescent cells to re-enter the cell cycle to replicate their DNA, and for the most part, this is accomplished after they express the E1A protein immediately after infection. In this context, E1A is believed to inactivate cellular proteins (e.g., p130) that are known to be involved in the silencing of E2F-dependent genes that are required for cell cycle entry. However, the potential perturbation of these types of genes by E1A relative to their functions in regulatory networks and canonical pathways remains poorly understood. Findings: We have used DNA microarrays analyzed with Bayesian ANOVA for microarray (BAM) to assess changes in gene expression after E1A alone was introduced into quiescent cells from a regulated promoter. Approximately 2,401 genes were significantly modulated by E1A, and of these, 385 and 1033 met the criteria for generating networks and functional and canonical pathway analysis respectively, as determined by using Ingenuity Pathway Analysis software. After focusing on the highest-ranking cellular processes and regulatory networks that were responsive to E1A in quiescent cells, we observed that many of the up-regulated genes were associated with DNA replication, the cell cycle and cellular compromise. We also identified a cadre of up-regulated genes with no previous connection to E1A, including genes that encode components of global DNA repair systems and DNA damage checkpoints. Among the down-regulated genes, we found that many were involved in cell signalling, cell movement, and cellular proliferation. 
Remarkably, a subset of these was also associated with p53-independent apoptosis, and the putative suppression of this pathway may be necessary in the viral life cycle until sufficient progeny have been produced. Conclusions: These studies have identified for the first time a large number of genes that are relevant to E1A's activities in promoting quiescent cells to re-enter the cell cycle in order to create an optimum environment for adenoviral replication.
Local Sparse Bump Hunting
The search for structures in real datasets, e.g., in the form of bumps, components, classes or clusters, is important as these often reveal underlying phenomena leading to scientific discoveries. One of these tasks, known as bump hunting, is to locate domains of a multidimensional input space where the target function assumes local maxima without pre-specifying their total number. A number of related methods already exist, yet are challenged in the context of high dimensional data. We introduce a novel supervised and multivariate bump hunting strategy for exploring modes or classes of a target function of many continuous variables. This addresses the issues of correlation, interpretability, and high-dimensionality (p ≫ n case), while making minimal assumptions. The method is based upon a divide and conquer strategy, combining a tree-based method, a dimension reduction technique, and the Patient Rule Induction Method (PRIM). Important to this task, we show how to estimate the PRIM meta-parameters. Using accuracy evaluation procedures such as cross-validation and ROC analysis, we show empirically how the method outperforms a naive PRIM as well as competitive non-parametric supervised and unsupervised methods in the problem of class discovery. The method has practical application especially in the case of noisy high-throughput data. It is applied to a class discovery problem in a colon cancer micro-array dataset aimed at identifying tumor subtypes in the metastatic stage. Supplemental Materials are available online.
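The peeling idea behind PRIM, on which the method above builds, can be illustrated with a simplified sketch. This is not the local sparse bump hunting algorithm and not a full PRIM: it omits the pasting step, ignores ties, and stops as soon as no peel raises the in-box mean (an assumption made here for brevity). The function name `prim_peel`, the defaults `alpha=0.1` and `min_support=0.2`, and the toy data are all hypothetical.

```python
def prim_peel(X, y, alpha=0.1, min_support=0.2):
    """Greedy top-down peeling: shrink a box to raise the mean of y inside it.

    X : list of rows (lists of floats); y : list of floats.
    alpha : fraction of in-box points removed per peel (assumed default).
    min_support : smallest allowed share of the data left in the box.
    Returns (box, indices), where box[j] = [low, high] for variable j.
    """
    n, p = len(X), len(X[0])
    inside = list(range(n))
    box = [[min(r[j] for r in X), max(r[j] for r in X)] for j in range(p)]

    def mean(idx):
        return sum(y[i] for i in idx) / len(idx)

    while True:
        k = max(1, int(alpha * len(inside)))  # points to peel this step
        best = None
        for j in range(p):
            order = sorted(inside, key=lambda i: X[i][j])
            # candidate peels: drop the k lowest or the k highest along j
            for keep, side in ((order[k:], 0), (order[:-k], 1)):
                if not keep or len(keep) < min_support * n:
                    continue
                m = mean(keep)
                if best is None or m > best[0]:
                    best = (m, j, side, keep)
        # simplification: stop when no admissible peel improves the mean
        if best is None or best[0] <= mean(inside):
            return box, sorted(inside)
        _, j, side, inside = best
        vals = [X[i][j] for i in inside]
        box[j][side] = min(vals) if side == 0 else max(vals)


# Toy data (made up): y is high only when the first variable is in [8, 13].
X = [[float(i), float((7 * i) % 20)] for i in range(20)]
y = [1.0 if 8 <= i <= 13 else 0.0 for i in range(20)]
box, idx = prim_peel(X, y)
print("box:", box, "support:", len(idx) / len(X))
```

Peeling a small fraction at a time is what makes the procedure "patient": each step commits to little, so later peels can still correct the direction of the search.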
Abstract C018: Disparity subtyping: Bringing precision medicine closer to disparity science
The genomics revolution also spawned the dawn of precision medicine. As in the National Research Council definition, if its promise is fully realized, then more accurate decisions about individual patient treatment decisions and outcomes will be possible. Disparities researchers have also begun looking to the precision medicine paradigm with the hope that some incorporation of its principles will allow for a more focused and precise path forward to reduce population disparities. While the emphasis may switch to populations from individuals, central to the paradigm still is the ability to classify individuals into subpopulations who differ in meaningful ways with respect to underlying biology and outcomes. Identification of these subpopulations is an active area of precision medicine research. For instance, there are countless papers on molecular subtyping of various cancer phenotypes. How to do such a thing in disparity science has proven elusive since it requires identifying disparity subpopulations, which is a somewhat abstract concept. In this paper we present two different strategies: level set identification and peeling. The former is based on a recursive partitioning algorithm combined with clustering of similar partitions; the latter adopts a strategy of sequentially searching for and then extracting extreme difference subgroups in a population. Using a series of simulation studies and then also studying various cancer outcomes from The Cancer Genome Atlas (TCGA) repository, we demonstrate that such disparity subtypes can indeed be found, characterized, and then validated on test data. Citation Format: J. Sunil Rao, Huilin Yu, Jean-Eudes Dazard. Disparity subtyping: Bringing precision medicine closer to disparity science [abstract]. In: Proceedings of the Eleventh AACR Conference on the Science of Cancer Health Disparities in Racial/Ethnic Minorities and the Medically Underserved; 2018 Nov 2-5; New Orleans, LA. 
Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2020;29(6 Suppl):Abstract nr C018.
Local sparse bump hunting reveals molecular heterogeneity of colon tumors
The question of molecular heterogeneity and of tumoral phenotype in cancer remains unresolved. To understand the underlying molecular basis of this phenomenon, we analyzed genome-wide expression data of colon cancer metastasis samples, as these tumors are the most advanced and hence would be anticipated to be the most likely heterogeneous group of tumors, potentially exhibiting the maximum amount of genetic heterogeneity. Casting a statistical net around such a complex problem proves difficult because of the high dimensionality and multicollinearity of the gene expression space, combined with the fact that genes act in concert with one another and that not all genes surveyed might be involved. We devise a strategy to identify distinct subgroups of samples and determine the genetic/molecular signature that defines them. This involves use of the local sparse bump hunting algorithm, which provides a much more optimal and biologically faithful transformed space within which to search for bumps. In addition, thanks to the variable selection feature of the algorithm, we derived a novel sparse gene expression signature, which appears to divide all colon cancer patients into two populations: a population whose expression pattern can be molecularly encompassed within the bump and an outlier population that cannot be. Although all patients within any given stage of the disease, including the metastatic group, appear clinically homogeneous, our procedure revealed two subgroups in each stage with distinct genetic/molecular profiles. We also discuss implications of such a finding in terms of early detection, diagnosis and prognosis.