Consistent Testing for Recurrent Genomic Aberrations
Genomic aberrations, such as somatic copy number alterations, are frequently
observed in tumor tissue. Recurrent aberrations, occurring in the same region
across multiple subjects, are of interest because they may highlight genes
associated with tumor development or progression. A number of tools have been
proposed to assess the statistical significance of recurrent DNA copy number
aberrations, but their statistical properties have not been carefully studied.
Cyclic shift testing, a permutation procedure using independent random shifts
of genomic marker observations on the genome, has been proposed to identify
recurrent aberrations, and is potentially useful for a wider variety of
purposes, including identifying regions with methylation aberrations or
overrepresented in disease association studies. For data following a
countable-state Markov model, we prove the asymptotic validity of cyclic shift
p-values under a fixed sample size regime as the number of observed markers
tends to infinity. We illustrate cyclic shift testing for a variety of data
types, producing biologically relevant findings for three publicly available
datasets.
Comment: 35 pages, 7 figures
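The cyclic shift idea described in this abstract can be illustrated with a minimal sketch. It assumes a samples-by-markers matrix of aberration indicators, uses the per-marker column sum as the test statistic, and compares it with the maximum column sum under independent random shifts of each sample. The function name `cyclic_shift_pvalues`, the choice of statistic, and the add-one p-value correction are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def cyclic_shift_pvalues(data, n_perm=1000, seed=0):
    """Hypothetical helper: max-statistic cyclic shift p-values per marker.

    data: array of shape (n_samples, n_markers), e.g. 0/1 aberration calls.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    n_samples, n_markers = data.shape

    # Observed statistic: how strongly each marker is aberrant across samples.
    observed = data.sum(axis=0)

    null_max = np.empty(n_perm)
    for b in range(n_perm):
        # Shift each sample independently; this keeps the within-sample
        # ordering of markers but breaks alignment across samples.
        shifts = rng.integers(0, n_markers, size=n_samples)
        shifted = np.stack([np.roll(row, s) for row, s in zip(data, shifts)])
        null_max[b] = shifted.sum(axis=0).max()

    # Add-one correction so that no p-value is exactly zero (an assumption).
    exceed = (null_max[None, :] >= observed[:, None]).sum(axis=1)
    pvals = (1 + exceed) / (n_perm + 1)
    return observed, pvals
```

Because each row is rotated rather than shuffled, the serial dependence between neighbouring markers within a sample is preserved, which is the property that distinguishes this scheme from a full permutation of markers.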
Sparse integrative clustering of multiple omics data sets
High resolution microarrays and second-generation sequencing platforms are
powerful tools to investigate genome-wide alterations in DNA copy number,
methylation and gene expression associated with a disease. An integrated
genomic profiling approach measures multiple omics data types simultaneously in
the same set of biological samples. Such an approach renders an integrated data
resolution that would not be available with any single data type. In this
study, we use penalized latent variable regression methods for joint modeling
of multiple omics data types to identify common latent variables that can be
used to cluster patient samples into biologically and clinically relevant
disease subtypes. We consider lasso [J. Roy. Statist. Soc. Ser. B 58 (1996)
267-288], elastic net [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
301-320] and fused lasso [J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2005)
91-108] methods to induce sparsity in the coefficient vectors, revealing
important genomic features that have significant contributions to the latent
variables. An iterative ridge regression is used to compute the sparse
coefficient vectors. In model selection, a uniform design [Monographs on
Statistics and Applied Probability (1994) Chapman & Hall] is used to seek
"experimental" points that are scattered uniformly across the search domain for
efficient sampling of tuning parameter combinations. We compared our method to
sparse singular value decomposition (SVD) and penalized Gaussian mixture model
(GMM) using both real and simulated data sets. The proposed method is applied
to integrate genomic, epigenomic and transcriptomic data for subtype analysis
in breast and lung cancer data sets.
Comment: Published at http://dx.doi.org/10.1214/12-AOAS578 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Combining chromosomal arm status and significantly aberrant genomic locations reveals new cancer subtypes
Many types of tumors exhibit chromosomal losses or gains, as well as local
amplifications and deletions. Within any given tumor type, sample-specific
amplifications and deletions are also observed. Typically, a region that is
aberrant in more tumors, or whose copy number change is stronger, would be
considered as a more promising candidate to be biologically relevant to cancer.
We sought an intuitive method to define such aberrations and prioritize
them. We define V, the volume associated with an aberration, as the product of
three factors: (a) the fraction of patients with the aberration, (b) the
aberration's length and (c) its amplitude. Our algorithm compares the values of V derived
from real data to a null distribution obtained by permutations, and yields the
statistical significance (p-value) of the measured value of V. We detected
genetic locations that were significantly aberrant and combined them with
chromosomal arm status to create a succinct fingerprint of the tumor genome.
This genomic fingerprint is used to visualize the tumors, highlighting events
that are co-occurring or mutually exclusive. We apply the method to three
different public array CGH datasets of Medulloblastoma and Neuroblastoma, and
demonstrate its ability to detect chromosomal regions that were known to be
altered in the tested cancer types, as well as to suggest new genomic locations
to be tested. We identified a potential new subtype of Medulloblastoma, which
is analogous to Neuroblastoma type 1.
Comment: 34 pages, 3 figures; to appear in Cancer Informatics
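The volume score V defined in this abstract (fraction of patients times length times amplitude) can be sketched as follows. The abstract does not say how an aberration is called or how the permutations are drawn, so the calling threshold, the use of the mean log ratio as amplitude, and the within-sample shuffling of markers are all assumptions made for illustration. The function names `volume` and `volume_pvalue` are hypothetical.

```python
import numpy as np

def volume(log_ratios, start, end, threshold=0.3):
    """V = fraction of aberrant samples * region length * mean amplitude.

    A sample counts as aberrant if its mean log ratio over markers
    [start, end) exceeds `threshold` (an assumed calling rule for gains).
    """
    region_mean = np.asarray(log_ratios, dtype=float)[:, start:end].mean(axis=1)
    aberrant = region_mean > threshold
    if not aberrant.any():
        return 0.0
    fraction = aberrant.mean()                  # (a) fraction of patients
    length = end - start                        # (b) length, in markers
    amplitude = region_mean[aberrant].mean()    # (c) amplitude
    return float(fraction * length * amplitude)

def volume_pvalue(log_ratios, start, end, n_perm=500, threshold=0.3, seed=0):
    """Permutation p-value for V; markers are shuffled within each sample."""
    rng = np.random.default_rng(seed)
    log_ratios = np.asarray(log_ratios, dtype=float)
    observed = volume(log_ratios, start, end, threshold)
    exceed = 0
    for _ in range(n_perm):
        # Shuffle every row independently (assumed null model).
        permuted = rng.permuted(log_ratios, axis=1)
        if volume(permuted, start, end, threshold) >= observed:
            exceed += 1
    return observed, (1 + exceed) / (n_perm + 1)
```

For example, a region of 2 markers with amplitude 1.0 in half of the samples gives V = 0.5 * 2 * 1.0 = 1.0 under these definitions.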
Modeling association between DNA copy number and gene expression with constrained piecewise linear regression splines
DNA copy number and mRNA expression are widely used data types in cancer
studies, which, combined, provide more insight than either alone. Whereas in
existing literature the form of the relationship between these two types of
markers is fixed a priori, in this paper we model their association. We employ
piecewise linear regression splines (PLRS), which combine good interpretation
with sufficient flexibility to identify any plausible type of relationship. The
specification of the model leads to estimation and model selection in a
constrained, nonstandard setting. We provide methodology for testing the effect
of DNA on mRNA and choosing the appropriate model. Furthermore, we present a
novel approach to obtain reliable confidence bands for constrained PLRS, which
incorporates model uncertainty. The procedures are applied to colorectal and
breast cancer data. Common assumptions are found to be potentially misleading
for biologically relevant genes. More flexible models may bring more insight into
the interaction between the two markers.
Comment: Published at http://dx.doi.org/10.1214/12-AOAS605 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
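A piecewise linear regression spline of the kind named in this abstract can be written with a truncated-linear basis: an intercept, a linear term, and one hinge term max(x - k, 0) per knot k. The sketch below fits that basis by ordinary least squares only. It deliberately omits the parameter constraints, model selection, and confidence bands that the abstract describes, and the function names `plrs_design` and `fit_plrs` are hypothetical.

```python
import numpy as np

def plrs_design(x, knots):
    """Design matrix with columns [1, x, (x - k1)_+, (x - k2)_+, ...]."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x]
    cols += [np.maximum(x - k, 0.0) for k in knots]
    return np.column_stack(cols)

def fit_plrs(x, y, knots):
    """Unconstrained least-squares fit (the paper's constraints are not applied)."""
    design = plrs_design(x, knots)
    coef, *_ = np.linalg.lstsq(design, np.asarray(y, dtype=float), rcond=None)
    return coef

# Usage: copy number x vs. expression y that is flat below 0 and rises above it.
x = np.linspace(-1.0, 1.0, 41)
y = 0.5 + 2.0 * np.maximum(x, 0.0)
coef = fit_plrs(x, y, knots=[0.0])  # intercept, slope below knot, slope change at knot
```

Each hinge coefficient is the change in slope at its knot, so the slope of a given segment is the sum of the linear coefficient and the hinge coefficients of all knots to its left.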
Comparison of TCGA and GENIE genomic datasets for the detection of clinically actionable alterations in breast cancer.
Whole exome sequencing (WES), targeted gene panel sequencing and single nucleotide polymorphism (SNP) arrays are increasingly used for the identification of actionable alterations that are critical to cancer care. Here, we compared The Cancer Genome Atlas (TCGA) and the Genomics Evidence Neoplasia Information Exchange (GENIE) breast cancer genomic datasets (array and next generation sequencing (NGS) data) in detecting genomic alterations in clinically relevant genes. We performed an in silico analysis to determine the concordance in the frequencies of actionable mutations and copy number alterations/aberrations (CNAs) in the two most common breast cancer histologies, invasive lobular and invasive ductal carcinoma. We found that targeted sequencing identified a larger number of mutational hotspots and clinically significant amplifications that would have been missed by WES and SNP arrays in many actionable genes such as PIK3CA, EGFR, AKT3, FGFR1, ERBB2, ERBB3 and ESR1. The striking differences between the number of mutational hotspots and CNAs generated from these platforms highlight a number of factors that should be considered in the interpretation of array and NGS-based genomic data for precision medicine. Targeted panel sequencing was preferable to WES to define the full spectrum of somatic mutations present in a tumor
A decision-theoretic approach for segmental classification
This paper is concerned with statistical methods for the segmental
classification of linear sequence data where the task is to segment and
classify the data according to an underlying hidden discrete state sequence.
Such analysis is commonplace in the empirical sciences including genomics,
finance and speech processing. In particular, we are interested in answering
the following question: given data $y$ and a statistical model $\pi(x,y)$ of
the hidden states $x$, what should we report as the prediction $\hat{x}$ under
the posterior distribution $\pi(x|y)$? That is, how should you make a
prediction of the underlying states? We demonstrate that traditional approaches
such as reporting the most probable state sequence or most probable set of
marginal predictions can give undesirable classification artefacts and offer
limited control over the properties of the prediction. We propose a decision
theoretic approach using a novel class of Markov loss functions and report
$\hat{x}$ via the principle of minimum expected loss (maximum expected
utility). We demonstrate that the sequence of minimum expected loss under the
Markov loss function can be enumerated exactly using dynamic programming
methods and that it offers flexibility and performance improvements over
existing techniques. The result is generic and applicable to any probabilistic
model on a sequence, such as Hidden Markov models, change point or product
partition models.
Comment: Published at http://dx.doi.org/10.1214/13-AOAS657 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Spatial clustering of array CGH features in combination with hierarchical multiple testing
We propose a new approach for clustering DNA features using array CGH data
from multiple tumor samples. We distinguish data-collapsing: joining contiguous
DNA clones or probes with extremely similar data into regions, from clustering:
joining contiguous, correlated regions based on a maximum likelihood principle.
The model-based clustering algorithm accounts for the apparent spatial patterns
in the data. We evaluate the randomness of the clustering result by a cluster
stability score in combination with cross-validation. Moreover, we argue that
the clustering really captures spatial genomic dependency by showing that
coincidental clustering of independent regions is very unlikely. Using the
region and cluster information, we combine testing of these for association
with a clinical variable in a hierarchical multiple testing approach. This
allows for interpreting the significance of both regions and clusters while
controlling the Family-Wise Error Rate simultaneously. We prove that in the
context of permutation tests and permutation-invariant clusters it is allowed
to perform clustering and testing on the same data set. Our procedures are
illustrated on two cancer data sets.