Using the R Package crlmm for Genotyping and Copy Number Estimation

Author: Benilton Carvalho
Ingo Ruczinski
Matthew E. Ritchie
Rafael A. Irizarry
Robert B. Scharpf

Genotyping platforms such as Affymetrix can be used to assess genotype-phenotype as well as copy number-phenotype associations at millions of markers. While genotyping algorithms are largely concordant when assessed on HapMap samples, tools to assess copy number changes are more variable and often discordant. One explanation for the discordance is that copy number estimates are susceptible to systematic differences between groups of samples that were processed at different times or by different labs. Analysis algorithms that do not adjust for batch effects are prone to spurious measures of association. The R package crlmm implements a multilevel model that adjusts for batch effects and provides allele-specific estimates of copy number. This paper illustrates a workflow for the estimation of allele-specific copy number and integration of the marker-level estimates with complementary Bioconductor software for inferring regions of copy number gain or loss. All analyses are performed in the statistical environment R.

Research Papers in Economics

A BAYESIAN MODEL FOR CROSS-STUDY DIFFERENTIAL GENE EXPRESSION

Author: Nobel Andrew B.
Parmigiani Giovanni
Scharpf Robert B.
Tjelmeland Håkon
Publication venue: Collection of Biostatistics Research Archive
Publication date: 21/11/2007

In this paper we define a hierarchical Bayesian model for microarray expression data collected from several studies and use it to identify genes that show differential expression between two conditions. Key features include shrinkage across both genes and studies, and flexible modeling that allows for interactions between platforms and the estimated effect, and for both concordant and discordant differential expression across studies. We evaluated the performance of our model in a comprehensive fashion, using both artificial data and a split-sample validation approach that provides an agnostic assessment of the model's behavior not only under the null hypothesis but also under a realistic alternative. The simulation results from the artificial data demonstrate the advantages of a Bayesian model. The 1-AUC values for the Bayesian model are roughly half of the corresponding values for a more direct combination of t- or SAM-statistics. Furthermore, the simulations provide guidelines for when the Bayesian model is most likely to be useful. Most noticeably, in small studies the Bayesian model generally outperforms other methods when evaluated by AUC, FDR, and MDR across a range of simulation parameters, and this difference diminishes for larger sample sizes in the individual studies. The split-study validation illustrates appropriate shrinkage of the Bayesian model in the absence of platform-, sample-, and annotation-differences that otherwise complicate experimental data analyses. Finally, we fit our model to four breast cancer studies employing different technologies (cDNA and Affymetrix) to estimate differential expression in estrogen receptor positive tumors versus negative ones. Software and data for reproducing our analysis are publicly available.
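
For readers unfamiliar with the 1-AUC figure quoted in this abstract, the following is a minimal sketch of how AUC can be computed for any gene-ranking score. It is not the authors' software: the function name `auc`, the toy scores, and the labels are all hypothetical, and the pairwise (Mann-Whitney) formulation is just one standard way to compute the quantity.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: higher means "more likely differentially expressed".
    labels: truthy for truly differential genes, falsy otherwise.
    Both inputs are hypothetical illustration data, not from the paper.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example: four genes, two of them truly differential.
toy_scores = [0.9, 0.8, 0.3, 0.2]
toy_labels = [1, 0, 1, 0]
one_minus_auc = 1.0 - auc(toy_scores, toy_labels)
```

Under this reading, halving 1-AUC means halving the fraction of (differential, non-differential) gene pairs that the score ranks in the wrong order.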

Collection Of Biostatistics Research Archive

A HIDDEN MARKOV MODEL FOR JOINT ESTIMATION OF GENOTYPE AND COPY NUMBER IN HIGH-THROUGHPUT SNP CHIPS

Author: Parmigiani Giovanni
Pevsner Jonathan
Ruczinski Ingo
Scharpf Robert B.
Publication venue: Collection of Biostatistics Research Archive
Publication date: 28/02/2007

Amplifications and deletions of chromosomal DNA, as well as copy-neutral loss of heterozygosity, have been associated with disease processes. High-throughput single nucleotide polymorphism (SNP) arrays are useful for making genome-wide estimates of copy number and genotype calls. Because neighboring SNPs in high-throughput SNP arrays are likely to have dependent copy number and genotype due to the underlying haplotype structure and linkage disequilibrium, hidden Markov models (HMMs) may be useful for improving genotype calls and copy number estimates that do not incorporate information from nearby SNPs. We improve previous approaches that utilize an HMM framework for inference in high-throughput SNP arrays by integrating copy number, genotype calls, and the corresponding confidence scores when available. Using simulated data, we demonstrate how confidence scores control smoothing in a probabilistic framework. Software for fitting HMMs to SNP array data is available in the R package ICE.
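
To illustrate how an HMM borrows information from neighboring SNPs, here is a minimal Viterbi sketch for a toy copy-number HMM. It is not the ICE package and does not use genotype calls or confidence scores; the state set, emission means, standard deviation, and `stay_prob` are hypothetical values chosen only for illustration.

```python
import math


def viterbi_copy_number(log_ratios, states=(1, 2, 3),
                        means=(-0.5, 0.0, 0.4), sd=0.2, stay_prob=0.99):
    """Most likely copy-number path under a toy HMM (all parameters hypothetical).

    Emissions are Gaussian around a state-specific mean log ratio; transitions
    favor staying in the same state, which is what smooths isolated outliers.
    """
    n_states = len(states)
    switch_prob = (1.0 - stay_prob) / (n_states - 1)
    log_trans = [
        [math.log(stay_prob if i == j else switch_prob) for j in range(n_states)]
        for i in range(n_states)
    ]

    def log_emit(x, k):
        # Gaussian log density, dropping the constant shared by all states.
        return -0.5 * ((x - means[k]) / sd) ** 2 - math.log(sd)

    # Initialize with a uniform prior over states.
    score = [math.log(1.0 / n_states) + log_emit(log_ratios[0], k)
             for k in range(n_states)]
    back = []
    for x in log_ratios[1:]:
        new_score, pointers = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: score[i] + log_trans[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + log_emit(x, j))
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    k = max(range(n_states), key=lambda i: score[i])
    path = [k]
    for pointers in reversed(back):
        k = pointers[k]
        path.append(k)
    path.reverse()
    return [states[k] for k in path]


# A single low value is smoothed away; a run of low values is called a deletion.
outlier_path = viterbi_copy_number([0.0, 0.0, -0.5, 0.0, 0.0])
deletion_path = viterbi_copy_number([0.0] * 5 + [-0.5] * 5 + [0.0] * 5)
```

In this sketch a larger `stay_prob` means stronger smoothing; the abstract describes letting per-SNP confidence scores play that moderating role, which is not modeled here.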

Collection Of Biostatistics Research Archive

When Should One Subtract Background Fluorescence in Two Color Microarrays?

Author: Iacobuzio-Donahue Christine A.
Parmigiani Giovanni
Scharpf Robert B.
Sneddon Julie B.
Publication venue: Collection of Biostatistics Research Archive
Publication date: 20/07/2005

Two color microarrays are a powerful tool for genomic analysis, but have noise components that make inferences regarding gene expression inefficient and potentially misleading. Background fluorescence, whether attributable to non-specific binding or other sources, is an important component of noise. The decision to subtract fluorescence surrounding spots of hybridization from spot fluorescence has been controversial, with no clear criteria for determining circumstances that may favor, or disfavor, background subtraction. While it is generally accepted that subtracting background reduces bias but increases variance in the estimates of the ratios of interest, no formal analysis of the bias-variance trade-off of background subtraction has been undertaken. In this paper, we use simulation to systematically examine the bias-variance trade-off under a variety of possible experimental conditions. Our simulation is based on data obtained from two self versus self microarray experiments and is free of distributional assumptions. Our results identify factors that are important for determining whether to background subtract, including the correlation of foreground to background intensity ratios. Using these results we develop recommendations for diagnostic visualizations that can help decisions about background subtraction.
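
The bias-variance trade-off described in this abstract can be seen in a small simulation. The sketch below is not the authors' procedure: their simulation resamples self-versus-self data without distributional assumptions, whereas this one assumes additive Gaussian noise, and every name and number in it (`simulate_log_ratios`, the signal levels, `noise_sd`) is hypothetical.

```python
import math
import random
import statistics


def simulate_log_ratios(n_spots=2000, red=400.0, green=200.0,
                        background=100.0, noise_sd=20.0, seed=1):
    """Compare log2 ratios with and without background subtraction.

    Assumed model (illustration only): foreground = signal + background + noise,
    measured background = background + noise, with independent Gaussian noise.
    """
    rng = random.Random(seed)
    raw, subtracted = [], []
    for _ in range(n_spots):
        fg_r = red + background + rng.gauss(0, noise_sd)
        fg_g = green + background + rng.gauss(0, noise_sd)
        bg_r = background + rng.gauss(0, noise_sd)
        bg_g = background + rng.gauss(0, noise_sd)
        # Skip spots where a log ratio would be undefined.
        if min(fg_r, fg_g, fg_r - bg_r, fg_g - bg_g) <= 0:
            continue
        raw.append(math.log2(fg_r / fg_g))
        subtracted.append(math.log2((fg_r - bg_r) / (fg_g - bg_g)))

    truth = math.log2(red / green)

    def summarize(values):
        return {"bias": statistics.fmean(values) - truth,
                "variance": statistics.variance(values)}

    return summarize(raw), summarize(subtracted)


raw_summary, subtracted_summary = simulate_log_ratios()
```

With these assumed settings the unsubtracted ratio is pulled toward zero on the log scale (larger bias), while the subtracted ratio is noisier (larger variance), which is the trade-off the paper examines under more realistic conditions.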

Collection Of Biostatistics Research Archive

USING THE R PACKAGE crlmm FOR GENOTYPING AND COPY NUMBER ESTIMATION

Author: Carvalho Benilton
Irizarry Rafael
Ritchie Matthew E.
Ruczinski Ingo
Scharpf Robert B.
Publication venue: Collection of Biostatistics Research Archive
Publication date: 29/09/2010

Genotyping platforms such as Affymetrix can be used to assess genotype-phenotype as well as copy number-phenotype associations at millions of markers. While genotyping algorithms are largely concordant when assessed on HapMap samples, tools to assess copy number changes are more variable and often discordant. One explanation for the discordance is that copy number estimates are susceptible to systematic differences between groups of samples that were processed at different times or by different labs. Analysis algorithms that do not adjust for batch effects are prone to spurious measures of association. The R package crlmm implements a multilevel model that adjusts for batch effects and provides allele-specific estimates of copy number. This paper illustrates a workflow for the estimation of allele-specific copy number, develops marker- and study-level summaries of batch effects, and demonstrates how the marker-level estimates can be integrated with complementary Bioconductor software for inferring regions of copy number gain or loss. All analyses are performed in the statistical environment R. A compendium for reproducing the analysis is available from the author’s website (http://www.biostat.jhsph.edu/~rscharpf/crlmmCompendium/index.html).

Collection Of Biostatistics Research Archive

Using the R Package crlmm for Genotyping and Copy Number Estimation

Author: Carvalho Benilton
Irizarry Rafael A.
Ritchie Matthew E.
Ruczinski Ingo
Scharpf Robert B.
Publication venue: Foundation for Open Access Statistics
Publication date: 01/05/2011

Directory of Open Access Journals

PubMed Central

Journal of Statistical Software

A MULTILEVEL MODEL TO ADDRESS BATCH EFFECTS IN COPY NUMBER USING SNP ARRAYS

Author: Carvalho Benilton
Chakravarti Aravinda
Doan Betty
Irizarry Rafael A.
Ruczinski Ingo
Scharpf Robert B.
Publication venue: Collection of Biostatistics Research Archive
Publication date: 29/06/2009

Submicroscopic changes in chromosomal DNA copy number dosage are common and have been implicated in many heritable diseases and cancers. Recent high-throughput technologies have a resolution that permits the detection of segmental changes in DNA copy number that span thousands of base pairs across the genome. Genome-wide association studies (GWAS) may simultaneously screen for copy number-phenotype and SNP-phenotype associations as part of the analytic strategy. However, genome-wide array analyses are particularly susceptible to batch effects as the logistics of preparing DNA and processing thousands of arrays often involve multiple laboratories and technicians, or changes over calendar time to the reagents and laboratory equipment. Failure to adjust for batch effects can lead to incorrect inference and requires inefficient post-hoc quality control procedures that exclude regions that are associated with batch. Our work extends previous model-based approaches for copy number estimation by explicitly modeling batch effects and using shrinkage to improve locus-specific estimates of copy number uncertainty. Key features of this approach include the use of diallelic genotype calls from experimental data to estimate batch- and locus-specific parameters of background and signal without the requirement of training data. We illustrate these ideas using a study of bipolar disease and a study of chromosome 21 trisomy. The former has batch effects that dominate much of the observed variation in quantile-normalized intensities, while the latter illustrates the robustness of our approach to datasets where as many as 25% of the samples have altered copy number. Locus-specific estimates of copy number can be plotted on the copy-number scale to investigate mosaicism and guide the choice of appropriate downstream approaches for smoothing the copy number as a function of physical position. The software is open source and implemented in the R package CRLMM available at Bioconductor (http://www.bioconductor.org).
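
The idea of using genotype calls to estimate batch- and locus-specific background and signal can be sketched for a single locus as follows. This is a deliberately simplified stand-in, not the crlmm implementation: it assumes allele intensity is linear in allele copy number within a batch, fits that line by ordinary least squares without any shrinkage, and uses hypothetical names (`fit_batch_lines`, `estimate_copies`) and toy intensities.

```python
from collections import defaultdict


def fit_batch_lines(records):
    """Fit intensity = background + slope * allele_copies separately per batch.

    records: iterable of (batch, allele_copies, intensity) for one locus, where
    allele_copies comes from diallelic genotype calls (e.g. BB=0, AB=1, AA=2
    copies of allele A). Plain least squares; no shrinkage is applied here.
    """
    grouped = defaultdict(list)
    for batch, copies, intensity in records:
        grouped[batch].append((copies, intensity))

    params = {}
    for batch, pairs in grouped.items():
        n = len(pairs)
        mean_c = sum(c for c, _ in pairs) / n
        mean_i = sum(i for _, i in pairs) / n
        sxx = sum((c - mean_c) ** 2 for c, _ in pairs)
        if sxx == 0:
            raise ValueError(f"batch {batch!r} has only one genotype class")
        sxy = sum((c - mean_c) * (i - mean_i) for c, i in pairs)
        slope = sxy / sxx
        params[batch] = (mean_i - slope * mean_c, slope)  # (background, slope)
    return params


def estimate_copies(intensity, batch, params):
    """Map an intensity back to the copy-number scale using its batch's line."""
    background, slope = params[batch]
    return (intensity - background) / slope


# Toy data: the same intensity means different copy numbers in different batches.
toy_records = [
    ("batch_a", 0, 100.0), ("batch_a", 1, 300.0), ("batch_a", 2, 500.0),
    ("batch_b", 0, 200.0), ("batch_b", 1, 500.0), ("batch_b", 2, 800.0),
]
toy_params = fit_batch_lines(toy_records)
copies_in_a = estimate_copies(500.0, "batch_a", toy_params)
copies_in_b = estimate_copies(500.0, "batch_b", toy_params)
```

The toy data show why ignoring batch can mislead: an intensity of 500 corresponds to two copies in one batch and one copy in the other, so pooling batches would suggest a copy number difference that is purely technical.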

Collection Of Biostatistics Research Archive

A Bayesian Model for Cross-Study Differential Gene Expression

Author: Nobel Andrew B.
Parmigiani Giovanni
Scharpf Robert B.
Tjelmeland Håkon
Publication date: 01/01/2009

In this paper we define a hierarchical Bayesian model for microarray expression data collected from several studies and use it to identify genes that show differential expression between two conditions. Key features include shrinkage across both genes and studies, and flexible modeling that allows for interactions between platforms and the estimated effect, as well as concordant and discordant differential expression across studies. We evaluated the performance of our model in a comprehensive fashion, using both artificial data and a “split-study” validation approach that provides an agnostic assessment of the model's behavior not only under the null hypothesis, but also under a realistic alternative. The simulation results from the artificial data demonstrate the advantages of the Bayesian model. The 1 – AUC values for the Bayesian model are roughly half of the corresponding values for a direct combination of t- and SAM-statistics. Furthermore, the simulations provide guidelines for when the Bayesian model is most likely to be useful. Most noticeably, in small studies the Bayesian model generally outperforms other methods when evaluated by AUC, FDR, and MDR across a range of simulation parameters, and this difference diminishes for larger sample sizes in the individual studies. The split-study validation illustrates appropriate shrinkage of the Bayesian model in the absence of platform-, sample-, and annotation-differences that otherwise complicate experimental data analyses. Finally, we fit our model to four breast cancer studies employing different technologies (cDNA and Affymetrix) to estimate differential expression in estrogen receptor positive tumors versus negative ones. Software and data for reproducing our analysis are publicly available.

PubMed Central

Carolina Digital Repository

Cross-platform Comparison of Two Pancreatic Cancer Phenotypes

Author: Campagna Domenico
Cope Leslie
Garrett-Mayer Elizabeth
Iacobuzio-Donahue Christine A.
Lakkur Sindhu
Parmigiani Giovanni
Ruczinski Ingo
Scharpf Robert B.
Publication venue: Libertas Academica
Publication date: 01/11/2010

Model-based approaches for combining gene expression data from multiple high-throughput platforms can be sensitive to technological artifacts when the number of samples in each platform is small. This paper proposes simple tools for quantifying concordance in a small study of pancreatic cancer cell lines with an emphasis on visualizations that uncover intra- and inter-platform variation. Using this approach, we identify several transcripts from the integrative analysis whose over- or under-expression in pancreatic cancer cell lines was validated by qPCR.

Crossref

Directory of Open Access Journals

PubMed Central

Hemizygous Deletion on Chromosome 3p26.1 Is Associated with Heavy Smoking among African American Subjects in the COPDGene Study

Author: Beaty Terri H.
Begum Ferdouse
Cho Michael H.
Crapo James D.
Hetmanski Jacqueline B.
Hokanson John E.
Lutz Sharon M.
Parker Margaret M.
Ruczinski Ingo
Scharpf Robert B.
Silverman Edwin K.
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2016

Many well-powered genome-wide association studies have identified genetic determinants of self-reported smoking behaviors and measures of nicotine dependence, but most have not considered the role of structural variants, such as copy number variants (CNVs), influencing these phenotypes. Here, we included 2,889 African American and 6,187 non-Hispanic White subjects from the COPDGene cohort (http://www.copdgene.org) to carefully investigate the role of polymorphic CNVs across the genome on various measures of smoking behavior. We identified a CNV component (a hemizygous deletion) on chromosome 3p26.1 associated with two quantitative phenotypes related to smoking behavior among African Americans. This polymorphic hemizygous deletion is significantly associated with pack-years and cigarettes smoked per day among African American subjects in the COPDGene study. We sought evidence of replication in African Americans from the population-based Atherosclerosis Risk in Communities (ARIC) study. While we observed similar CNV counts, the extent of exposure to cigarette smoking among ARIC subjects was quite different and the smaller sample size of heavy smokers in ARIC severely limited statistical power, so we were unable to replicate our findings from the COPDGene cohort. Nevertheless, meta-analyses of COPDGene and ARIC study subjects strengthened our association signal. Moreover, a few linkage studies have reported suggestive linkage to the 3p26.1 region, and a few genome-wide association studies (GWAS) have reported that markers in the gene (GRM7) nearest to this 3p26.1 area of polymorphic deletions are associated with measures of nicotine dependence among subjects of European ancestry.

Harvard University - DASH

Directory of Open Access Journals

PubMed Central

FigShare