1,451 research outputs found

Gains in Power from Structured Two-Sample Tests of Means on Graphs

Author: Dudoit Sandrine
Jacob Laurent
Neuvial Pierre
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2010

We consider multivariate two-sample tests of means, where the location shift between the two populations is expected to be related to a known graph structure. An important application of such tests is the detection of differentially expressed genes between two patient populations, as shifts in expression levels are expected to be coherent with the structure of graphs reflecting gene properties such as biological process, molecular function, regulation, or metabolism. For a fixed graph of interest, we demonstrate that accounting for graph structure can yield more powerful tests under the assumption of smooth distribution shift on the graph. We also investigate the identification of non-homogeneous subgraphs of a given large graph, which poses both computational and multiple testing problems. The relevance and benefits of the proposed approach are illustrated on synthetic data and on breast cancer gene expression data analyzed in the context of KEGG pathways.

arXiv.org e-Print Archive

Collection Of Biostatistics Research Archive

Multiple tests of association with biological annotation metadata

Author: Mark J. Van Der Laan
Sandrine Dudoit
Sunduz Keles
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2008

We propose a general and formal statistical framework for multiple tests of association between known fixed features of a genome and unknown parameters of the distribution of variable features of this genome in a population of interest. The known gene-annotation profiles, corresponding to the fixed features of the genome, may concern Gene Ontology (GO) annotation, pathway membership, regulation by particular transcription factors, nucleotide sequences, or protein sequences. The unknown gene-parameter profiles, corresponding to the variable features of the genome, may be, for example, regression coefficients relating possibly censored biological and clinical outcomes to genome-wide transcript levels, DNA copy numbers, and other covariates. A generic question of great interest in current genomic research regards the detection of associations between biological annotation metadata and genome-wide expression measures. This biological question may be translated as the test of multiple hypotheses concerning association measures between gene-annotation profiles and gene-parameter profiles. A general and rigorous formulation of the statistical inference question allows us to apply the multiple hypothesis testing methodology developed in [Multiple Testing Procedures with Applications to Genomics (2008) Springer, New York] and related articles, to control a broad class of Type I error rates, defined as generalized tail probabilities and expected values for arbitrary functions of the numbers of Type I errors and rejected hypotheses. 
The resampling-based single-step and stepwise multiple testing procedures of [Multiple Testing Procedures with Applications to Genomics (2008) Springer, New York] take into account the joint distribution of the test statistics and provide Type I error control in testing problems involving general data generating distributions (with arbitrary dependence structures among variables), null hypotheses, and test statistics. Comment: Published at http://dx.doi.org/10.1214/193940307000000446 in the IMS Collections (http://www.imstat.org/publications/imscollections.htm) by the Institute of Mathematical Statistics (http://www.imstat.org).

arXiv.org e-Print Archive

CiteSeerX

Crossref

GenomeGraphs: integrated genomic data visualization with R.

Author: Bullard James
Dudoit Sandrine
Durinck Steffen
Spellman Paul T
Publication venue: eScholarship, University of California
Publication date: 01/01/2009

Background: Biological studies involve a growing number of distinct high-throughput experiments to characterize samples of interest. There is a lack of methods to visualize these different genomic datasets in a versatile manner. In addition, genomic data analysis requires integrated visualization of experimental data along with constantly changing genomic annotation and statistical analyses. Results: We developed GenomeGraphs, as an add-on software package for the statistical programming environment R, to facilitate integrated visualization of genomic datasets. GenomeGraphs uses the biomaRt package to perform on-line annotation queries to Ensembl and translates these to gene/transcript structures in viewports of the grid graphics package. This allows genomic annotation to be plotted together with experimental data. GenomeGraphs can also be used to plot custom annotation tracks in combination with different experimental data types together in one plot using the same genomic coordinate system. Conclusion: GenomeGraphs is a flexible and extensible software package which can be used to visualize a multitude of genomic datasets within the statistical programming environment R.

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

Against extinction: a legacy of native Hawaiian resistance literature

Author: Dudoit D. Māhealani
Publication date: 01/01/1999

ScholarSpace at University of Hawai'i at Manoa

Quantification and Visualization of LD Patterns and Identification of Haplotype Blocks

Author: Dudoit Sandrine
Wang Yan
Publication venue: Collection of Biostatistics Research Archive
Publication date: 29/06/2004

Classical measures of linkage disequilibrium (LD) between two loci, based only on the joint distribution of alleles at these loci, present noisy patterns. In this paper, we propose a new distance-based LD measure, R, which takes into account multilocus haplotypes around the two loci in order to exploit information from neighboring loci. The LD measure R yields a matrix of pairwise distances between markers, based on the correlation between the lengths of shared haplotypes among chromosomes around these markers. Data analysis demonstrates that visualization of LD patterns through the R matrix reveals more deterministic patterns, with much less noise, than using classical LD measures. Moreover, the patterns are highly compatible with recently suggested models of haplotype block structure. We propose to apply the new LD measure to define haplotype blocks through cluster analysis. Specifically, we present a distance-based clustering algorithm, DHPBlocker, which performs hierarchical partitioning of an ordered sequence of markers into disjoint and adjacent blocks with a hierarchical structure. The proposed method integrates information on the two main existing criteria in defining haplotype blocks, namely, LD and haplotype diversity, through the use of silhouette width and description length as cluster validity measures, respectively. The new LD measure and clustering procedure are applied to single nucleotide polymorphism (SNP) datasets from the human 5q31 region (Daly et al. 2001) and the class II region of the human major histocompatibility complex (Jeffreys et al. 2001). Our results are in good agreement with published results. In addition, analyses performed on different subsets of markers indicate that the method is robust with regard to the allele frequency and density of the genotyped markers.
Unlike previously proposed methods, our new cluster-based method can uncover hierarchical relationships among blocks and can be applied to polymorphic DNA markers or amino acid sequence data.

Collection Of Biostatistics Research Archive

Optimal feature selection for sparse linear discriminant analysis and its applications in gene expression data

Author: Anderson
Baiqi Miao
Barry
Bickel
Cai
Candes
Cheng Wang
Donoho
Dudoit
Dudoit
Fan
Fan
Fan
Goeman
Golub
Hess
Lai
Li
Longbing Cao
Mai
Shao
Srivastava
Tibshirani
Tong
Wu
Yeung
Zuber
Publication venue: 'Elsevier BV'
Publication date: 01/01/2013

This work studies the theoretical rules of feature selection in linear discriminant analysis (LDA), and a new feature selection method is proposed for sparse linear discriminant analysis. An $\ell_1$ minimization method is used to select the important features from which the LDA will be constructed. The asymptotic results of this proposed two-stage LDA (TLDA) are studied, demonstrating that TLDA is an optimal classification rule whose convergence rate is the best compared to existing methods. The experiments on simulated and real datasets are consistent with the theoretical results and show that TLDA performs favorably in comparison with current methods. Overall, TLDA uses a lower minimum number of features or genes than other approaches to achieve a better result with a reduced misclassification rate. Comment: 20 pages, 3 figures, 5 tables, accepted by Computational Statistics and Data Analysis.

arXiv.org e-Print Archive

Crossref

OPUS - University of Technology Sydney

The Discursive Effects of the Haiku-based SADUPA Poetry Technique in Palliative Care

Author: Dudoit Eric
Paul Melanie
Santarpia Alfonso
Publication venue: 'Informa UK Limited'
Publication date: 01/01/2015

This qualitative study seeks to present the discursive effects of SADUPA, a new poetry-based technique centered on haiku, in the context of psycho-oncological treatment. The technique is used with a terminal cancer patient, Mr. A. The psychological processes involved with and the poetic writings arising from the technique are discussed. In particular, the discursive variations in Mr. A’s narrative of his illness are described as they occurred before and after his poetry writing. The authors suggest that writing workshops based on the brief poetic structures of the haiku can enable patients to produce a larger and more singular narrative about their end-of-life experiences.

HAL AMU