90 research outputs found
Multiple tests of association with biological annotation metadata
We propose a general and formal statistical framework for multiple tests of
association between known fixed features of a genome and unknown parameters of
the distribution of variable features of this genome in a population of
interest. The known gene-annotation profiles, corresponding to the fixed
features of the genome, may concern Gene Ontology (GO) annotation, pathway
membership, regulation by particular transcription factors, nucleotide
sequences, or protein sequences. The unknown gene-parameter profiles,
corresponding to the variable features of the genome, may be, for example,
regression coefficients relating possibly censored biological and clinical
outcomes to genome-wide transcript levels, DNA copy numbers, and other
covariates. A generic question of great interest in current genomic research
regards the detection of associations between biological annotation metadata
and genome-wide expression measures. This biological question may be translated
into multiple tests of hypotheses concerning association measures between
gene-annotation profiles and gene-parameter profiles. A general and rigorous
formulation of the statistical inference question allows us to apply the
multiple hypothesis testing methodology developed in [Multiple Testing
Procedures with Applications to Genomics (2008) Springer, New York] and related
articles, to control a broad class of Type I error rates, defined as
generalized tail probabilities and expected values for arbitrary functions of
the numbers of Type I errors and rejected hypotheses. The resampling-based
single-step and stepwise multiple testing procedures of [Multiple Testing
Procedures with Applications to Genomics (2008) Springer, New York] take into
account the joint distribution of the test statistics and provide Type I error
control in testing problems involving general data generating distributions
(with arbitrary dependence structures among variables), null hypotheses, and
test statistics.
Comment: Published at http://dx.doi.org/10.1214/193940307000000446 in the IMS
Collections (http://www.imstat.org/publications/imscollections.htm) by the
Institute of Mathematical Statistics (http://www.imstat.org).
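For concreteness, write V_n for the number of Type I errors and R_n for the number of rejected hypotheses (notation introduced here, consistent with the description above but not quoted from the paper). The generalized tail probability and generalized expected value error rates then take the form
\[
gTP(q, g) = \Pr\{ g(V_n, R_n) > q \}, \qquad gEV(g) = E[\, g(V_n, R_n) \,],
\]
so that $g(v, r) = v$ with $q = 0$ recovers the family-wise error rate $\Pr(V_n > 0)$, $g(v, r) = v$ with $q = k$ gives the generalized family-wise error rate $\Pr(V_n > k)$, and $g(v, r) = v/r$ yields tail probability and expected value error rates for the proportion of false positives among the rejected hypotheses.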
Regulatory Motif Finding by Logic Regression
Multiple transcription factors coordinately control transcriptional regulation of genes in eukaryotes. Although multiple computational methods consider the identification of individual transcription factor binding sites (TFBSs), very few focus on the interactions between these sites. We consider finding transcription factor binding sites and their context-specific interactions using microarray gene expression data. We devise a hybrid approach called LogicMotif, which combines a TFBS identification method with the logic regression methodology of Ruczinski et al. (2003). LogicMotif has two steps: first, potential binding sites are identified from the transcription control regions of genes of interest. Various available methods can be used in this first step when the genes of interest can be divided into groups, such as up- and down-regulated genes. For this step, we also develop a simple univariate regression and extension method, MFURE, to extract candidate TFBSs from a large number of genes when microarray gene expression data are available. MFURE provides an alternative for this step when partitioning of the genes into disjoint groups is not preferred. This first step aims to identify individual sites within gene groups of interest or sites that are correlated with the gene expression outcome. In the second step, logic regression is used to build a predictive model of the outcome of interest (either gene expression or up/down regulation) using these potential sites. This two-step approach creates a rich, diverse set of potential binding sites in the first step and, in the second step, builds regression or classification models using logic regression, which is particularly good at identifying complex interactions.
LogicMotif is applied to two publicly available data sets. A genome-wide gene expression data set of Saccharomyces cerevisiae is used for validation. The regression models obtained are interpretable, and the biological implications are in agreement with known results. This analysis suggests that LogicMotif provides biologically more reasonable regression models than previous analyses of this data set with standard linear regression methods. Another data set of Saccharomyces cerevisiae illustrates the use of LogicMotif in classification questions by building a model that discriminates between up- and down-regulated genes in iron copper deficiency. LogicMotif identified an inductive and two repressor motifs in this data set. The inductive motif matches the binding site of the transcription factor Aft1p, which has a key role in regulation of the uptake process. One of the novel repressor sites is highly present in transcription control regions of FeS genes. This site could represent a TFBS for an unknown transcription factor involved in repression of genes encoding FeS proteins in iron deficiency. We established the stability of the method with respect to the type of outcome variable by using both continuous and binary outcome variables for this data set. Our results indicate that logic regression, used in combination with binding site identification methods that operate on clusters or groups of genes or with our proposed method MFURE, is a powerful and flexible alternative to linear regression-based motif finding methods.
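To make the two-step structure concrete, the following deliberately simplified sketch (not the LogicMotif implementation, and not MFURE) builds a binary gene-by-motif presence matrix from naive substring matching and then replaces full logic regression by a greedy search over pairwise AND/OR combinations of the motif indicators; the candidate motifs, toy sequences, and outcome values are all hypothetical.

# Conceptual sketch of the two-step LogicMotif idea (not the authors' code).
# Step 1 is reduced to naive substring matching for a few hypothetical candidate
# motifs; step 2 replaces full logic regression by a greedy search over pairwise
# AND/OR combinations of the binary motif indicators.
import itertools
import numpy as np

def motif_presence(upstream_seqs, candidate_motifs):
    """Binary genes x motifs matrix: 1 if the motif occurs in the upstream sequence."""
    return np.array([[int(m in seq) for m in candidate_motifs] for seq in upstream_seqs])

def greedy_logic_features(X, y, n_terms=2):
    """Greedily pick Boolean AND/OR combinations of columns of X that correlate with y."""
    candidates = []
    for i, j in itertools.combinations(range(X.shape[1]), 2):
        candidates.append(((i, j, "AND"), X[:, i] & X[:, j]))
        candidates.append(((i, j, "OR"), X[:, i] | X[:, j]))
    def score(f):
        f = f.astype(float)
        return 0.0 if f.std() == 0 else abs(np.corrcoef(f, y)[0, 1])
    candidates.sort(key=lambda c: score(c[1]), reverse=True)
    return candidates[:n_terms]

# Hypothetical toy data: 6 upstream sequences, continuous expression outcome.
seqs = ["ACGTTTGC", "TTGCAAAT", "ACGTAAAT", "GGGGCCCC", "ACGTTTGCAAAT", "CCCCGGGG"]
y = np.array([2.1, 1.8, 2.5, 0.1, 3.0, 0.2])
X = motif_presence(seqs, ["ACGT", "TTGC", "AAAT"])
for (i, j, op), feat in greedy_logic_features(X, y):
    print(f"motif{i} {op} motif{j}:", feat, "corr =", round(np.corrcoef(feat.astype(float), y)[0, 1], 2))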
Supervised Detection of Conserved Motifs in DNA Sequences with cosmo
A number of computational methods have been proposed for identifying transcription factor binding sites from a set of unaligned sequences that are thought to share the motif in question. Here we introduce an algorithm, called cosmo, that allows this search to be supervised by specifying a set of constraints that the position weight matrix of the unknown motif must satisfy. Such constraints may be formulated, for example, on the basis of prior knowledge about the structure of the transcription factor in question. The algorithm is based on the same two-component multinomial mixture model used by MEME, but relies more strongly on the likelihood principle instead of more ad hoc criteria such as the E-value. The intensity parameter in the ZOOPS and TCM models, for instance, is estimated based on a profile-likelihood approach, and the width of the unknown motif is selected based on BIC. These changes allow cosmo to outperform MEME even in the absence of any constraints, as evidenced by 2- to 3-fold greater sensitivity in some simulation studies. Additional improvements in performance can be achieved by selecting the model type (OOPS, ZOOPS, or TCM) data-adaptively or by supplying correctly specified constraints, especially if the motif appears only as a weak signal in the data. The algorithm can data-adaptively choose between working in a given constrained model or in the completely unconstrained model, guarding against the risk of supplying misspecified constraints. Simulation studies suggest that this approach can offer 3 to 3.5 times greater sensitivity than MEME. The algorithm has been implemented in the form of a stand-alone C program as well as a web application that can be accessed at http://cosmoweb.berkeley.edu. An R package is available through Bioconductor (http://bioconductor.org).
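As a point of reference for the model class mentioned above, the sketch below implements a generic EM algorithm for an OOPS-style two-component multinomial mixture (one motif occurrence per sequence). It is not the cosmo implementation: it imposes no constraints on the position weight matrix, fixes the motif width instead of selecting it by BIC, and does not touch the ZOOPS/TCM intensity parameter; the toy sequences are hypothetical.

# Minimal EM for the two-component multinomial mixture behind OOPS-style motif
# finding (one motif occurrence per sequence); a generic textbook sketch, not cosmo.
import numpy as np

BASES = "ACGT"

def encode(seq):
    return np.array([BASES.index(b) for b in seq])

def oops_em(seqs, w, n_iter=50, rng=np.random.default_rng(0)):
    X = [encode(s) for s in seqs]
    background = np.bincount(np.concatenate(X), minlength=4) + 1.0
    background = background / background.sum()
    pwm = rng.dirichlet(np.ones(4), size=w)          # w x 4 matrix, random start
    for _ in range(n_iter):
        counts = np.full((w, 4), 0.25)               # pseudocounts
        for x in X:
            n_starts = len(x) - w + 1
            # E-step: posterior over the start position of the single occurrence
            logod = np.array([np.sum(np.log(pwm[np.arange(w), x[p:p + w]])
                                     - np.log(background[x[p:p + w]]))
                              for p in range(n_starts)])
            post = np.exp(logod - logod.max())
            post /= post.sum()
            # M-step contribution: expected base counts at each motif column
            for p, z in enumerate(post):
                counts[np.arange(w), x[p:p + w]] += z
        pwm = counts / counts.sum(axis=1, keepdims=True)
    return pwm

# Hypothetical toy sequences, each containing the planted motif TGACTC.
toy = ["AATGACTCGG", "CCTGACTCAA", "GGTGACTCTT", "TTTGACTCCC"]
print(np.round(oops_em(toy, w=6), 2))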
Asymptotically Optimal Model Selection Method with Right Censored Outcomes
Over the last two decades, non-parametric and semi-parametric approaches that adapt well-known techniques such as regression methods to the analysis of right censored data, e.g. right censored survival data, have become popular in the statistics literature. However, the problem of choosing the best model (predictor) among a set of proposed models (predictors) in the right censored data setting has not gained much attention. In this paper, we develop a new cross-validation based model selection method to select among predictors of right censored outcomes such as survival times. The proposed method considers the risk of a given predictor based on the training sample as a parameter of the full data distribution in a right censored data model. Then, the doubly robust locally efficient estimation method or an ad hoc inverse probability of censoring weighting method, as presented in Robins and Rotnitzky (1992) and van der Laan and Robins (2002), is used to estimate this conditional risk parameter based on the validation sample. We prove that, under general conditions, the proposed cross-validated selector is asymptotically equivalent to an oracle benchmark selector based on the true data generating distribution. The presented method covers model selection with right censored data in prediction (univariate and multivariate) and density/hazard estimation problems.
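The weighting idea behind the validation-sample risk estimate can be illustrated as follows. This is a bare-bones sketch, not the proposed method: the censoring distribution is estimated by a covariate-free Kaplan-Meier fit that ignores ties, squared error loss is used, a single training/validation split stands in for full cross-validation, and the doubly robust augmentation is omitted; the data and the two candidate predictors are simulated and hypothetical.

# Sketch of inverse-probability-of-censoring-weighted (IPCW) validation risk for
# choosing among predictors of a right censored outcome.
import numpy as np

def censoring_km(time, delta):
    """Kaplan-Meier estimate of G(t) = P(C >= t): censorings are the 'events'."""
    order = np.argsort(time)
    t, d = time[order], delta[order]
    surv, times, values = 1.0, [], []
    for i in range(len(t)):
        if d[i] == 0:                         # a censoring 'event'
            surv *= 1.0 - 1.0 / (len(t) - i)  # the at-risk set shrinks with i
        times.append(t[i]); values.append(surv)
    times, values = np.array(times), np.array(values)
    def G(u):                                 # left limit G(u-), truncated away from 0
        k = np.searchsorted(times, u, side="left")
        return 1.0 if k == 0 else max(float(values[k - 1]), 1e-3)
    return G

def ipcw_risk(pred, time, delta, G):
    """IPCW estimate of E[(T - pred(X))^2]: uncensored points weighted by 1/G(T-)."""
    w = delta / np.array([G(t) for t in time])
    return float(np.mean(w * (time - pred) ** 2))

# Simulated data: true times exponential in a covariate, independent censoring.
rng = np.random.default_rng(1)
n = 400
x = rng.normal(size=n)
t_true = rng.exponential(np.exp(0.5 * x))
c = rng.exponential(2.0, size=n)
time, delta = np.minimum(t_true, c), (t_true <= c).astype(int)

# Two hypothetical candidate predictors of T; the one with the smaller validation
# IPCW risk would be selected (a single split stands in for V-fold cross-validation).
train, val = np.arange(300), np.arange(300, n)
G = censoring_km(time[train], delta[train])
candidates = {"ignore x": np.full(len(val), time[train][delta[train] == 1].mean()),
              "use x":    np.exp(0.5 * x[val])}
for name, pred in candidates.items():
    print(name, round(ipcw_risk(pred, time[val], delta[val], G), 3))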
Multiple Tests of Association with Biological Annotation Metadata
We propose a general and formal statistical framework for multiple tests of association between known fixed features of a genome and unknown parameters of the distribution of variable features of this genome in a population of interest. The known fixed gene-annotation profiles, corresponding to the fixed features of the genome, may concern Gene Ontology (GO) annotation, pathway membership, regulation by particular transcription factors, nucleotide sequences, or protein sequences. The unknown gene-parameter profiles, corresponding to the variable features of the genome, may be, for example, regression coefficients relating genome-wide transcript levels or DNA copy numbers to possibly censored biological and clinical outcomes and covariates. A generic question of great interest in current genomic research, regarding the detection of associations between biological annotation metadata and genome-wide expression measures, may then be translated into multiple tests of hypotheses concerning association measures between gene-annotation and gene-parameter profiles. A general and rigorous formulation of the statistical inference question allows us to apply the multiple testing methodology developed in Dudoit and van der Laan (2006) and related articles, to control a broad class of Type I error rates, in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses, and test statistics. Resampling-based single-step and stepwise multiple testing procedures, which take into account the joint distribution of the test statistics, are provided to control Type I error rates defined as tail probabilities for arbitrary functions of the numbers of false positives and rejected hypotheses.
The proposed statistical and computational methods are illustrated using the acute lymphoblastic leukemia (ALL) microarray dataset of Chiaretti et al. (2004), with the aim of relating GO annotation to differential gene expression between B-cell ALL with the BCR/ABL fusion and cytogenetically normal NEG B-cell ALL. The sensitivity of the identified lists of GO terms to the choice of association parameter between GO annotation and differential gene expression demonstrates the importance of translating the biological question in terms of suitable gene-annotation profiles, gene-parameter profiles, and association measures. In particular, the results show the limitations of binary gene-parameter profiles of differential expression indicators, which are still the norm for combined GO annotation and microarray data analyses. Procedures based on such binary gene-parameter profiles tend to be conservative and lack robustness with respect to the estimator for the set of differentially expressed genes.
WWW companion: www.stat.berkeley.edu/~sandrine/Docs/Papers/DFF06/DFF.htm
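The following sketch illustrates one concrete instance of the testing problem: binary gene-annotation profiles (columns of a gene-by-GO-term indicator matrix), a gene-parameter profile of per-gene test statistics, a correlation-type association measure, and single-step maxT adjusted p-values for family-wise error rate control. A gene-label permutation null is used as a simple stand-in for the bootstrap-based joint null distribution of the procedures referenced above, and all data are simulated.

# Illustrative single-step maxT testing of association between binary annotation
# profiles and a gene-parameter profile; a simplified stand-in, not the paper's procedures.
import numpy as np

def association_stats(annot, param):
    """Standardized correlation between each annotation column and the parameter profile."""
    a = (annot - annot.mean(axis=0)) / annot.std(axis=0)
    p = (param - param.mean()) / param.std()
    return a.T @ p / np.sqrt(len(param))

def maxT_adjusted_pvalues(annot, param, n_perm=2000, seed=0):
    """Single-step maxT adjusted p-values for family-wise error rate control."""
    rng = np.random.default_rng(seed)
    obs = np.abs(association_stats(annot, param))
    null_max = np.array([np.abs(association_stats(annot, rng.permutation(param))).max()
                         for _ in range(n_perm)])
    return np.array([(null_max >= t).mean() for t in obs])

# Simulated toy example: 500 genes, 20 annotation terms, the first term enriched
# among genes with large parameter values.
rng = np.random.default_rng(42)
param = rng.normal(size=500)
annot = rng.binomial(1, 0.2, size=(500, 20)).astype(float)
annot[:, 0] = (param + rng.normal(scale=1.0, size=500) > 1.0).astype(float)
print(np.round(maxT_adjusted_pvalues(annot, param), 3))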
Estimation of the Bivariate Survival Function with Generalized Bivariate Right Censored Data Structures
We propose a bivariate survival function estimator for a general right censored data structure that includes a time dependent covariate process. First, an initial estimator that generalizes Dabrowska's (1988) estimator is introduced. We obtain this estimator with a general methodology for constructing estimating functions in censored data models. The initial estimator is guaranteed to improve on Dabrowska's estimator and remains consistent and asymptotically linear under informative censoring schemes if the censoring mechanism is estimated consistently. We then construct an orthogonalized estimating function, which results in a more robust and efficient estimator than our initial estimator. A simulation study demonstrates the performance of the proposed estimators.
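For orientation only, the sketch below computes a naive inverse-probability-of-censoring-weighted bivariate survival estimate under two extra simplifying assumptions that the proposed estimators do not need: both components are censored by a single common censoring time that is independent of the pair and of covariates, and the time dependent covariate process is ignored. It is not the initial or orthogonalized estimator of the abstract; all data are simulated.

# Naive IPCW bivariate survival estimate under a common, independent censoring time.
import numpy as np

def km_survival(time, event):
    """Kaplan-Meier survival curve, evaluated as a right-continuous step function (ties ignored)."""
    order = np.argsort(time)
    t, e = time[order], event[order]
    surv, times, values = 1.0, [], []
    for i in range(len(t)):
        if e[i] == 1:
            surv *= 1.0 - 1.0 / (len(t) - i)
        times.append(t[i]); values.append(surv)
    times, values = np.array(times), np.array(values)
    def S(u):
        k = np.searchsorted(times, u, side="right")
        return 1.0 if k == 0 else float(values[k - 1])
    return S

def ipcw_bivariate_survival(y1, y2, d1, d2, t1, t2):
    """(1/n) sum I(Y1 > t1, Y2 > t2) / G_hat(max(t1, t2)), with G_hat the censoring KM."""
    # The common censoring time C is observed whenever at least one component is
    # censored, and is itself right censored by max(T1, T2) otherwise.
    G = km_survival(np.maximum(y1, y2), 1 - d1 * d2)
    g = max(G(max(t1, t2)), 1e-3)             # truncate to avoid exploding weights
    return float(np.mean((y1 > t1) & (y2 > t2)) / g)

# Simulated dependent pair (shared frailty) with a common independent censoring time.
rng = np.random.default_rng(3)
n = 2000
frailty = rng.gamma(2.0, 0.5, size=n)
T1, T2 = rng.exponential(frailty), rng.exponential(frailty)
C = rng.exponential(1.5, size=n)
y1, y2 = np.minimum(T1, C), np.minimum(T2, C)
d1, d2 = (T1 <= C).astype(int), (T2 <= C).astype(int)
print("IPCW estimate of S(0.5, 0.5):", round(ipcw_bivariate_survival(y1, y2, d1, d2, 0.5, 0.5), 3))
print("Monte Carlo truth           :", round(float(np.mean((T1 > 0.5) & (T2 > 0.5))), 3))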
Identification of Regulatory Elements Using A Feature Selection Method
Many methods have been described to identify regulatory motifs in the transcription control regions of genes that exhibit similar patterns of gene expression across a variety of experimental conditions. Here we focus on a single experimental condition and utilize gene expression data to identify sequence motifs associated with genes that are activated under this experimental condition. We use a linear model with two-way interactions to model gene expression as a function of sequence features (words) present in presumptive transcription control regions. The most relevant features are selected by a feature selection method called stepwise selection with Monte Carlo cross-validation. We apply this method to a publicly available dataset of the yeast Saccharomyces cerevisiae, focusing on the 800 basepairs immediately upstream of each gene's translation start site (the upstream control region, UCR). We successfully identify regulatory motifs that are known to be active under the experimental conditions analyzed, and find additional significant sequences that may represent novel regulatory motifs. We also discuss a complementary method that utilizes gene expression data from a single microarray experiment and allows averaging over a variety of experimental conditions as an alternative to motif finding methods that act on clusters of co-expressed genes.
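A toy version of the selection strategy is sketched below: overlapping word counts from upstream sequences serve as features, and words are added to an ordinary least squares model by forward stepwise selection scored by Monte Carlo cross-validation (repeated random training/validation splits). Two-way interactions and the actual 800-bp UCR extraction are omitted, and the sequences, outcome, and word length are hypothetical.

# Toy sketch: word-count features plus forward stepwise selection scored by
# Monte Carlo cross-validation; not the paper's full model with interactions.
import itertools
import numpy as np

def word_counts(seqs, k=3):
    words = ["".join(w) for w in itertools.product("ACGT", repeat=k)]
    X = np.array([[sum(s[i:i + k] == w for i in range(len(s) - k + 1)) for w in words]
                  for s in seqs], dtype=float)
    return X, words

def mc_cv_mse(X, y, n_splits=20, val_frac=0.3, seed=0):
    """Average validation MSE of ordinary least squares over random splits."""
    rng = np.random.default_rng(seed)
    n, errs = len(y), []
    for _ in range(n_splits):
        idx = rng.permutation(n)
        val, train = idx[:int(val_frac * n)], idx[int(val_frac * n):]
        A = np.column_stack([np.ones(len(train)), X[train]])
        beta, *_ = np.linalg.lstsq(A, y[train], rcond=None)
        pred = np.column_stack([np.ones(len(val)), X[val]]) @ beta
        errs.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(errs))

def forward_stepwise(X, y, words, max_terms=3):
    chosen = []
    while len(chosen) < max_terms:
        scores = {j: mc_cv_mse(X[:, chosen + [j]], y)
                  for j in range(X.shape[1]) if j not in chosen}
        best = min(scores, key=scores.get)
        if chosen and scores[best] >= mc_cv_mse(X[:, chosen], y):
            break                              # no further improvement
        chosen.append(best)
    return [words[j] for j in chosen]

# Hypothetical data: expression driven by the count of one particular word.
rng = np.random.default_rng(7)
seqs = ["".join(rng.choice(list("ACGT"), size=200)) for _ in range(120)]
X, words = word_counts(seqs, k=3)
y = 1.5 * X[:, words.index("TGA")] + rng.normal(scale=0.5, size=len(seqs))
print(forward_stepwise(X, y, words))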
Multiple Testing Methods For ChIP-Chip High Density Oligonucleotide Array Data
Cawley et al. (2004) have recently mapped the locations of binding sites for three transcription factors along human chromosomes 21 and 22 using ChIP-Chip experiments. ChIP-Chip experiments are a new approach to the genome-wide identification of transcription factor binding sites and consist of chromatin (Ch) immunoprecipitation (IP) of transcription factor-bound genomic DNA followed by high density oligonucleotide hybridization (Chip) of the IP-enriched DNA. We investigate the ChIP-Chip data structure and propose methods for inferring the location of transcription factor binding sites from these data. The proposed methods involve testing, for each probe, whether it is part of a bound sequence or not, using a scan statistic that takes into account the spatial structure of the data. Different multiple testing procedures are considered for controlling the family-wise error rate and false discovery rate. A nested Bonferroni adjustment, which is more powerful than the traditional Bonferroni adjustment when the test statistics are dependent, is discussed. Simulation studies show that taking into account the spatial structure of the data substantially improves the sensitivity of the multiple testing procedures. Application of the proposed methods to ChIP-Chip data for transcription factor p53 identified many potential target binding regions along human chromosomes 21 and 22. Among these identified regions, 18% fall within a 3kb vicinity of the 5'UTR of a known gene or CpG island, and 31% fall between the codon start site and the codon end site of a known gene but not inside an exon. More than half of these potential target sequences contain the p53 consensus binding site or very close matches to it. Moreover, these target segments include the 13 experimentally verified p53 binding regions of Cawley et al. (2004), as well as 49 additional regions that show higher hybridization signal than these 13 experimentally verified regions.
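The probe-level scan idea can be caricatured as follows: probe statistics are smoothed over a window of neighboring probes, and the resulting scan statistics are screened with a standard Benjamini-Hochberg adjustment. This replaces the paper's null distribution, nested Bonferroni adjustment, and ChIP-Chip preprocessing with textbook stand-ins, and the probe statistics below are simulated.

# Simplified scan statistic over probe-level statistics plus a BH adjustment;
# an illustration of the windowing idea only, not the paper's procedures.
import numpy as np
from scipy.stats import norm

def scan_statistics(stats, half_width=3):
    """Standardized moving average of probe statistics over 2*half_width + 1 probes."""
    w = 2 * half_width + 1
    smoothed = np.convolve(stats, np.ones(w) / w, mode="same")
    return smoothed * np.sqrt(w)      # ~N(0,1) in the interior if probe stats are iid N(0,1) under the null

def benjamini_hochberg(pvals, alpha=0.05):
    """Boolean rejection vector for Benjamini-Hochberg control of the FDR at level alpha."""
    m = len(pvals)
    order = np.argsort(pvals)
    passed = np.nonzero(np.sort(pvals) <= alpha * np.arange(1, m + 1) / m)[0]
    reject = np.zeros(m, dtype=bool)
    if len(passed):
        reject[order[: passed[-1] + 1]] = True
    return reject

# Simulated probe-level statistics along a chromosome with one 'bound' region.
rng = np.random.default_rng(11)
stats = rng.normal(size=1000)
stats[480:500] += 1.5                 # enrichment over 20 consecutive probes
p = norm.sf(scan_statistics(stats))   # one-sided p-values: enrichment only
print("flagged probes:", np.nonzero(benjamini_hochberg(p))[0])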
Recurrent Events Analysis in the Presence of Time Dependent Covariates and Dependent Censoring
Recurrent events models have lately received a lot of attention in the literature. The majority of approaches discussed show the consistency of parameter estimates under the assumption that censoring is independent of the recurrent events process of interest conditional on the covariates included in the model. We provide an overview of available recurrent events analysis methods and present an inverse probability of censoring weighted estimator for the regression parameters in the Andersen-Gill model that is commonly used for recurrent event analysis. This estimator remains consistent under informative censoring if the censoring mechanism is estimated consistently, and generally improves on the naive estimator for the Andersen-Gill model in the case of independent censoring. We illustrate the bias of ad hoc estimators in the presence of informative censoring with a simulation study and provide a data analysis of recurrent lung exacerbations in cystic fibrosis patients when some patients are lost to follow-up.
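The sketch below shows only the construction of inverse-probability-of-censoring weights for (start, stop] interval data under a covariate-free censoring model estimated by Kaplan-Meier; the weights would then be passed to a weighted Andersen-Gill fit in a separate step. The proposed estimator allows the censoring mechanism to depend on (time dependent) covariates, which this sketch does not, and the follow-up times and intervals are hypothetical.

# Covariate-free IPCW weight construction for Andersen-Gill style interval data.
import numpy as np

def censoring_km(follow_up_end, dropout):
    """Kaplan-Meier estimate of K(t) = P(C > t), treating dropout times as the 'events'."""
    order = np.argsort(follow_up_end)
    t, e = follow_up_end[order], dropout[order]
    surv, times, values = 1.0, [], []
    for i in range(len(t)):
        if e[i] == 1:
            surv *= 1.0 - 1.0 / (len(t) - i)
        times.append(t[i]); values.append(surv)
    times, values = np.array(times), np.array(values)
    def K(u):
        k = np.searchsorted(times, u, side="right")
        return 1.0 if k == 0 else float(values[k - 1])
    return K

# Hypothetical subject-level end of follow-up, with an indicator of whether it was
# due to dropout (the censoring being modeled, 1) or administrative end of study (0).
follow_up_end = np.array([5.0, 8.0, 3.5, 10.0, 6.0])
dropout = np.array([1, 0, 1, 0, 1])
K = censoring_km(follow_up_end, dropout)

# Hypothetical (start, stop] at-risk intervals for the recurrent event process; each
# interval gets weight 1 / K_hat(stop), truncated to avoid exploding weights.
intervals = [(0.0, 2.0), (2.0, 5.0), (0.0, 3.5), (0.0, 4.0), (4.0, 8.0)]
weights = [1.0 / max(K(stop), 1e-3) for _, stop in intervals]
print(np.round(weights, 2))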