Remembering Leo Breiman
I published an interview of Leo Breiman in Statistical Science [Olshen
(2001)], and also the solution to a problem concerning almost sure convergence
of binary tree-structured estimators in regression [Olshen (2007)]. The former
summarized much of my thinking about Leo up to five years before his death. I
discussed the latter with Leo and dedicated that paper to his memory.
Therefore, this note is on other topics. In preparing it I am reminded how much
I miss this man of so many talents and interests. I miss him not because I
always agreed with him, but instead because his comments about statistics in
particular and life in general always elicited substantial reflection on my part.

Comment: Published at http://dx.doi.org/10.1214/10-AOAS385 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
Successive Standardization of Rectangular Arrays
In this note we illustrate, and develop further with mathematics and examples,
the work on successive standardization (or normalization) studied earlier by
the same authors in Olshen and Rajaratnam (2010) and Olshen and Rajaratnam
(2011). Thus, we deal with successive iterations applied to
rectangular arrays of numbers, where to avoid technical difficulties an array
has at least three rows and at least three columns. Without loss of
generality, an iteration begins with operations on columns: first subtract the
mean of each column; then divide by its standard deviation. The iteration
continues with the same two operations done successively for rows. These four
operations applied in sequence complete one iteration; one then iterates again
and again. In Olshen and Rajaratnam (2010) it was argued that if arrays are
made up of real numbers, then the set for which convergence of these successive
iterations fails has Lebesgue measure 0. The limiting array has row and column
means 0, row and column standard deviations 1. A basic result on convergence
given in Olshen and Rajaratnam (2010) is true, though the argument in Olshen
and Rajaratnam (2010) is faulty. The result is stated in the form of a theorem
here, and the argument for the theorem is correct. Moreover, many graphics
given in Olshen and Rajaratnam (2010) suggest that but for a set of entries of
any array with Lebesgue measure 0, convergence is very rapid, eventually
exponentially fast in the number of iterations. Because we learned this set of
rules from Bradley Efron, we call it "Efron's algorithm." More importantly, the
rapidity of convergence is illustrated by numerical examples.
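The iteration described above is easy to state in code. The following is an illustrative numpy sketch under the stated rules (subtract column means, divide by column standard deviations, then the same for rows, repeated), not the authors' own implementation; population standard deviations (ddof=0) are an assumption.

```python
import numpy as np

def successive_standardization(X, n_iter=500):
    """One iteration: demean and scale columns, then demean and scale rows.
    For almost every starting array the iterates converge to an array whose
    row and column means are 0 and whose standard deviations are 1."""
    X = np.asarray(X, dtype=float).copy()
    for _ in range(n_iter):
        X = X - X.mean(axis=0)                 # subtract each column mean
        X = X / X.std(axis=0)                  # divide by column std devs
        X = X - X.mean(axis=1, keepdims=True)  # subtract each row mean
        X = X / X.std(axis=1, keepdims=True)   # divide by row std devs
    return X

rng = np.random.default_rng(0)
Y = successive_standardization(rng.normal(size=(5, 4)))
```

After the final row step, row means and standard deviations are exactly 0 and 1; by the convergence result, the column means and standard deviations are also 0 and 1 up to numerical tolerance.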
Successive normalization of rectangular arrays
Standard statistical techniques often require transforming data to have mean
0 and standard deviation 1. Typically, this process of "standardization" or
"normalization" is applied across subjects when each subject produces a single
number. High throughput genomic and financial data often come as rectangular
arrays where each coordinate in one direction concerns subjects who might have
different status (case or control, say), and each coordinate in the other
designates "outcome" for a specific feature, for example, "gene," "polymorphic
site" or some aspect of financial profile. It may happen, when analyzing data
that arrive as a rectangular array, that one requires BOTH the subjects and the
features to be "on the same footing." Thus there may be a need to standardize
across rows and columns of the rectangular matrix. There arises the question as
to how to achieve this double normalization. We propose and investigate the
convergence of what seems to us a natural approach to successive normalization
which we learned from our colleague Bradley Efron. We also study the
implementation of the method on simulated data and on data that arose from
scientific experimentation.

Comment: Published at http://dx.doi.org/10.1214/09-AOS743 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org). With Correction.
A Generalized Unimodality
Generalization of unimodality for random objects taking values in finite-dimensional vector spaces.
Almost surely consistent nonparametric regression from recursive partitioning schemes
Presented here are results on almost sure convergence of estimators of regression functions subject to certain moment restrictions. Two somewhat different notions of almost sure convergence are studied: unconditional, and conditional given a training sample. The estimators are local means derived from certain recursive partitioning schemes.
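To illustrate what a local-mean estimator over a recursive partition looks like, here is a minimal one-dimensional sketch. The splitting rule (dyadic halving of the covariate interval down to a minimum leaf size) is an assumption chosen for illustration, not one of the schemes analyzed in the paper.

```python
import numpy as np

def local_mean_tree(x, y, lo=0.0, hi=1.0, min_leaf=10):
    """Recursively halve [lo, hi]; each leaf predicts its local mean of y."""
    if len(x) <= min_leaf:
        return ('leaf', y.mean() if len(y) else 0.0)
    mid = (lo + hi) / 2.0
    left = x < mid
    return ('split', mid,
            local_mean_tree(x[left], y[left], lo, mid, min_leaf),
            local_mean_tree(x[~left], y[~left], mid, hi, min_leaf))

def predict(tree, x):
    """Walk to the leaf cell containing x and return its local mean."""
    while tree[0] == 'split':
        tree = tree[2] if x < tree[1] else tree[3]
    return tree[1]

rng = np.random.default_rng(0)
x = rng.uniform(size=2000)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.1, size=2000)
tree = local_mean_tree(x, y)
# local means should track the regression function sin(2*pi*x)
err = abs(predict(tree, 0.25) - 1.0)
```

As the sample grows and the cells shrink (while each still holds enough points), the local means approach the regression function, which is the flavor of consistency the paper studies.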
GENOME WIDE DNA METHYLATION PROFILING IS PREDICTIVE OF OUTCOME IN JUVENILE MYELOMONOCYTIC LEUKEMIA
Gene Expression Differences between Enriched Normal and Chronic Myelogenous Leukemia Quiescent Stem/Progenitor Cells and Correlations with Biological Abnormalities
In comparing gene expression of normal and CML CD34+ quiescent (G0) cells, 292 genes were downregulated and 192 genes upregulated in the CML G0 cells. The differentially expressed genes were grouped according to their reported functions, and correlations were sought with biological differences previously observed between the same groups. The most relevant findings include the following. (i) CML G0 cells are in a more advanced stage of development and more poised to proliferate than normal G0 cells. (ii) When CML G0 cells are stimulated to proliferate, they differentiate and mature more rapidly than their normal counterparts. (iii) Whereas normal G0 cells form only granulocyte/monocyte colonies when stimulated by cytokines, CML G0 cells form a combination of the above and erythroid clusters and colonies. (iv) Prominin-1 is the gene most downregulated in CML G0 cells, and this appears to be associated with the spontaneous formation of erythroid colonies by CML progenitors without EPO.
A classification model for distinguishing copy number variants from cancer-related alterations
Background: Both somatic copy number alterations (CNAs) and germline copy number variants (CNVs) that are prevalent in healthy individuals can appear as recurrent changes in comparative genomic hybridization (CGH) analyses of tumors. In order to identify important cancer genes, CNAs and CNVs must be distinguished. Although the Database of Genomic Variants (DGV) contains a list of all known CNVs, there is no standard methodology to use the database effectively.

Results: We develop a prediction model that distinguishes CNVs from CNAs based on the information contained in the DGV and several other variables, including segment length, height, closeness to a telomere or centromere, and occurrence in other patients. The models are fitted on data from glioblastoma and their corresponding normal samples that were collected as part of The Cancer Genome Atlas project and hybridized to Agilent 244K arrays.

Conclusions: Using the DGV alone, CNVs in the test set can be correctly identified with about 85% accuracy if the outliers are removed before segmentation and with 72% accuracy if the outliers are included; additional variables improve the prediction by about 2-3% and 12%, respectively. Final models applied to data from ovarian tumors have about 90% accuracy with all the variables and 86% accuracy with the DGV alone.
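A sketch of the kind of classifier described: logistic regression on per-segment features (DGV overlap, segment length, height, closeness to a telomere or centromere, recurrence in other patients). The feature distributions, the label-generating coefficients, and the gradient-descent fit below are all invented for illustration; this is not the authors' fitted model or their data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
X = np.column_stack([
    rng.integers(0, 2, n).astype(float),  # overlaps a DGV-listed variant?
    rng.exponential(1.0, n),              # segment length (arbitrary scale)
    rng.normal(0.0, 1.0, n),              # segment height (mean log-ratio)
    rng.uniform(0.0, 1.0, n),             # closeness to telomere/centromere
    rng.poisson(2.0, n).astype(float),    # occurrence in other patients
])
# Synthetic labels: germline CNVs (label 1) tend to overlap the DGV and to
# recur in other patients; these coefficients are illustrative assumptions.
true_logit = -3.0 + 4.0 * X[:, 0] + 0.8 * X[:, 4]
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

# Fit logistic regression by plain gradient descent on the log-loss.
Xb = np.column_stack([np.ones(n), X])   # prepend an intercept column
w = np.zeros(Xb.shape[1])
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-(Xb @ w)))
    w -= 0.05 * (Xb.T @ (p - y)) / n    # gradient of mean log-loss
pred = 1.0 / (1.0 + np.exp(-(Xb @ w))) > 0.5
accuracy = float((pred == y.astype(bool)).mean())
```

On real segmented CGH data the same shape of model would be fitted to labeled training segments and evaluated on held-out tumors, as the abstract describes.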