320 research outputs found

    A Bayesian calibration model for combining different pre-processing methods in Affymetrix chips

    BACKGROUND: In gene expression studies a key role is played by the so called "pre-processing", a series of steps designed to extract the signal and account for the sources of variability due to the technology used rather than to biological differences between the RNA samples. At the moment there is no commonly agreed gold standard pre-processing method and each researcher has the responsibility to choose one method, incurring the risk of false positive and false negative features arising from the particular method chosen. RESULTS: We propose a Bayesian calibration model that makes use of the information provided by several pre-processing methods and we show that this model gives a better assessment of the 'true' unknown differential expression between two conditions. We demonstrate how to estimate the posterior distribution of the differential expression values of interest from the combined information. CONCLUSION: On simulated data and on the spike-in Latin Square dataset from Affymetrix the Bayesian calibration model proves to have more power than each pre-processing method. Its biological interest is demonstrated through an experimental example on publicly available data

    Methods for evaluating gene expression from Affymetrix microarray datasets

    BACKGROUND: Affymetrix high density oligonucleotide expression arrays are widely used across all fields of biological research for measuring genome-wide gene expression. An important step in processing oligonucleotide microarray data is to produce a single value for the gene expression level of an RNA transcript using one of a growing number of statistical methods. The challenge for the researcher is to decide on the most appropriate method to use to address a specific biological question with a given dataset. Although several research efforts have focused on assessing performance of a few methods in evaluating gene expression from RNA hybridization experiments with different datasets, the relative merits of the methods currently available in the literature for evaluating genome-wide gene expression from Affymetrix microarray data collected from real biological experiments remain actively debated. RESULTS: The present study reports a comprehensive survey of the performance of all seven commonly used methods in evaluating genome-wide gene expression from a well-designed experiment using Affymetrix microarrays. The experiment profiled eight genetically divergent barley cultivars each with three biological replicates. The dataset so obtained confers a balanced and idealized structure for the present analysis. The methods were evaluated on their sensitivity for detecting differentially expressed genes, reproducibility of expression values across replicates, and consistency in calling differentially expressed genes. The number of genes detected as differentially expressed among methods differed by a factor of two or more at a given false discovery rate (FDR) level.
Moreover, we propose the use of genes containing single feature polymorphisms (SFPs) as an empirical test for comparison among methods for the ability to detect true differential gene expression on the basis that SFPs largely correspond to cis-acting expression regulators. The PDNN method demonstrated superiority over all other methods in every comparison, whilst the default Affymetrix MAS5.0 method was clearly inferior. CONCLUSION: A comprehensive assessment of seven commonly used data extraction methods based on an extensive barley Affymetrix gene expression dataset has shown that the PDNN method has superior performance for the detection of differentially expressed genes.

    Empirical Bayes models for multiple probe type microarrays at the probe level

    BACKGROUND: When analyzing microarray data a primary objective is often to find differentially expressed genes. With empirical Bayes and penalized t-tests the sample variances are adjusted towards a global estimate, producing more stable results compared to ordinary t-tests. However, for Affymetrix type data a clear dependency between variability and intensity-level generally exists, even for logged intensities, most clearly for data at the probe level but also for probe-set summaries such as the MAS5 expression index. As a consequence, adjustment towards a global estimate results in an intensity-level dependent false positive rate. RESULTS: We propose two new methods for finding differentially expressed genes, Probe level Locally moderated Weighted median-t (PLW) and Locally Moderated Weighted-t (LMW). Both methods use an empirical Bayes model taking the dependency between variability and intensity-level into account. A global covariance matrix is also used allowing for differing variances between arrays as well as array-to-array correlations. PLW is specially designed for Affymetrix type arrays (or other multiple-probe arrays). Instead of making inference on probe-set summaries, comparisons are made separately for each perfect-match probe and are then summarized into one score for the probe-set. CONCLUSION: The proposed methods are compared to 14 existing methods using five spike-in data sets. For RMA and GCRMA processed data, PLW has the most accurate ranking of regulated genes in four out of the five data sets, and LMW consistently performs better than all examined moderated t-tests when used on RMA, GCRMA, and MAS5 expression indexes.
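The abstract above contrasts its locally moderated statistics with the usual adjustment of sample variances towards a global estimate. As a rough illustration of that baseline only (not of PLW or LMW), the sketch below shrinks a gene's sample variance towards a prior variance and computes a two-group moderated t-statistic. The function names and the prior values are hypothetical; in practice the prior would be estimated from all genes.

```python
import math


def moderated_variance(s2, df, s2_prior, df_prior):
    """Shrink a sample variance towards a prior (global) variance.

    Weighted average of the two variances, weighted by their degrees
    of freedom. s2_prior and df_prior are assumed to be given here.
    """
    return (df_prior * s2_prior + df * s2) / (df_prior + df)


def moderated_t(mean_diff, s2, df, n1, n2, s2_prior, df_prior):
    """Two-group t-statistic using the shrunken variance."""
    s2_tilde = moderated_variance(s2, df, s2_prior, df_prior)
    return mean_diff / math.sqrt(s2_tilde * (1.0 / n1 + 1.0 / n2))
```

Because the prior here is a single global value, a gene with low intensity and a gene with high intensity are shrunk towards the same target, which is the behaviour the abstract identifies as problematic when variability depends on intensity.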

    Identification and validation of copy number variants using SNP genotyping arrays from a large clinical cohort.

    BACKGROUND: Genotypes obtained with commercial SNP arrays have been extensively used in many large case-control or population-based cohorts for SNP-based genome-wide association studies for a multitude of traits. Yet, these genotypes capture only a small fraction of the variance of the studied traits. Genomic structural variants (GSV) such as Copy Number Variation (CNV) may account for part of the missing heritability, but their comprehensive detection requires either next-generation arrays or sequencing. Sophisticated algorithms that infer CNVs by combining the intensities from SNP-probes for the two alleles can already be used to extract a partial view of such GSV from existing data sets. RESULTS: Here we present several advances to facilitate the latter approach. First, we introduce a novel CNV detection method based on a Gaussian Mixture Model. Second, we propose a new algorithm, PCA merge, for combining copy-number profiles from many individuals into consensus regions. We applied both our new methods as well as existing ones to data from 5612 individuals from the CoLaus study who were genotyped on Affymetrix 500K arrays. We developed a number of procedures in order to evaluate the performance of the different methods. This includes comparison with previously published CNVs as well as using a replication sample of 239 individuals, genotyped with Illumina 550K arrays. We also established a new evaluation procedure that employs the fact that related individuals are expected to share their CNVs more frequently than randomly selected individuals. The ability to detect both rare and common CNVs provides a valuable resource that will facilitate association studies exploring potential phenotypic associations with CNVs. 
CONCLUSION: Our new methodologies for CNV detection and their evaluation will help in extracting additional information from the large amount of SNP-genotyping data on various cohorts and use this to explore structural variants and their impact on complex traits

    The Shivplot: a graphical display for trend elucidation and exploratory analysis of microarray data

    BACKGROUND: High-throughput systems are powerful tools for the life science research community. The complexity and volume of data from these systems, however, demand special treatment. Graphical tools are needed to evaluate many aspects of the data throughout the analysis process because plots can provide quality assessments for thousands of values simultaneously. The utility of a plot, in turn, is contingent on both its interpretability and its efficiency. RESULTS: The shivplot, a graphical technique motivated by microarrays but applicable to any replicated high-throughput data set, is described. The plot capitalizes on the strengths of three well-established plotting graphics – a boxplot, a distribution density plot, and a variability vs intensity plot – by effectively combining them into a single representation. CONCLUSION: The utility of the new display is illustrated with microarray data sets. The proposed graph, retaining all the information of its precursors, conserves space and minimizes redundancy, but also highlights features of the data that would be difficult to appreciate from the individual display components. We recommend the use of the shivplot both for exploratory data analysis and for the communication of experimental data in publications

    A methodology for global validation of microarray experiments

    BACKGROUND: DNA microarrays are popular tools for measuring gene expression of biological samples. This ever increasing popularity is ensuring that a large number of microarray studies are conducted, many of which have data publicly available for mining by other investigators. Under most circumstances, validation of differential expression of genes is performed on a gene-by-gene basis. Thus, it is not possible to generalize validation results to the remaining majority of non-validated genes or to evaluate the overall quality of these studies. RESULTS: We present an approach for the global validation of DNA microarray experiments that will allow researchers to evaluate the general quality of their experiment and to extrapolate validation results of a subset of genes to the remaining non-validated genes. We illustrate why the popular strategy of selecting only the most differentially expressed genes for validation generally fails as a global validation strategy and propose random-stratified sampling as a better gene selection method. We also illustrate shortcomings of often-used validation indices such as overlap of significant effects and the correlation coefficient and recommend the concordance correlation coefficient (CCC) as an alternative. CONCLUSION: We provide recommendations that will enhance validity checks of microarray experiments while minimizing the need to run a large number of labour-intensive individual validation assays.
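The concordance correlation coefficient recommended in this abstract can be illustrated with a minimal sketch. It assumes Lin's standard definition, 2·s_xy / (s_x² + s_y² + (mean_x − mean_y)²), with variances computed using 1/n; the function name and the example values are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient for paired values.

    Unlike the Pearson correlation, it is reduced by any shift or
    scale difference between the two sets of measurements.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return 2.0 * cov_xy / (var_x + var_y + (mean_x - mean_y) ** 2)


# Hypothetical log fold changes: array values vs. a validation assay
# that reads one unit higher for every gene.
array_values = [1.0, 2.0, 3.0, 4.0]
assay_values = [2.0, 3.0, 4.0, 5.0]
ccc_shifted = concordance_correlation(array_values, assay_values)
```

In this example the Pearson correlation would be exactly 1, while the CCC is lower because of the constant offset, which is the kind of disagreement a plain correlation coefficient cannot detect.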

    Estimation and correction of non-specific binding in a large-scale spike-in experiment

    A combined statistical analysis using the MAS5 PM-MM, GC-NSB and PDNN methods to generate probeset values from microarray data results in an improved ability to detect differential expression and estimates of false discovery rates compared with the individual methods

    Microarray-based expression profiling: improving data mining and the links to biological knowledge pools

    Having identified differentially regulated genes, the final and most labour intensive part of the analysis process is drawing biological conclusions and hypotheses about the data. A novel solution is presented which combines experimental data with curated annotation sources and analysis tools to assist the researcher in exploring the information contained within their dataset.

    Statistical methodologies for the analysis and normalization of RIP-Chip data

    Doctoral thesis, Statistics and Operational Research (Biostatistics and Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2011. Pre-mRNA splicing is an essential step in post-transcriptional gene expression control involving protein splicing factors such as PTB and U2AF65; the latter is exported to the cytoplasm and involved in other cellular functions. PTB- and U2AF65-associated mRNAs were identified under native conditions by immunoprecipitation and hybridization on chip (RIP-Chip) technology using the Affymetrix GeneChip® Human Genome U133 Plus 2.0. The aim of this thesis is to develop statistical methodologies for low level analysis and enriched gene selection in RIP-Chip experiments. When the most common methodologies for quality assessment, low level analysis (background adjustment, normalization and summarization) and detection of differentially expressed genes (DEG) are applied to RIP-Chip data, the results obtained differ. This probably happens because usually more than 20% of the mRNAs are enriched, while methods for normalization and identification of DEG are developed on the assumption that only a small proportion of genes (1% or 5%, say) are differentially expressed. Also, methods for detecting differentially expressed genes may not be the most adequate for enriched gene selection. This thesis implements a background correction method inspired by a non-specific hybridization method used for pre-processing ChIP-Chip data. Linear regression models are used in each array to model the non-specific hybridization. Probe intensities on the array are standardized using their predicted intensity and the variance of similar predicted intensities. The standardized probe intensities showed no need for further normalization, so the scores could be directly compared. A probe set score, a probe set enrichment value and its p-value are proposed for enriched gene selection.
The genes selected using this new method are practically the same as the ones found experimentally. Additionally, a new rank-based methodology is presented for enriched gene selection and applied to the proposed probe set scores. Both methodologies had high accuracy when applied to the Spike-In U133 dataset, which is used to benchmark methodologies for analysing Affymetrix microarrays. In recent years, high-throughput techniques have been developed for biological research. These techniques have evolved to provide the scientific community with instruments such as high-capacity sequencers, which yield millions of DNA fragments at once; tandem mass spectrometers, which allow the identification of proteins or complete proteomes; and microarray hybridization, used to determine gene expression by identifying the mRNAs present in the cell at a specific moment. Microarrays are a technique used to quantify gene expression and to analyse fragments of genes, proteins or metabolites. They have also been used to clarify specific elements of the Central Dogma of Molecular Biology involved in transcriptional control; in the search for data explaining how gene expression starts from DNA; or how mRNA, in association with ribosomes, is translated into proteins in the cytoplasm of the cell. Given the biological framework described above, Chapter 1 introduces the aspects of biology related to the RIP-Chip data used in this thesis, obtained by Gama-Carvalho et al. [2006], where the aim is to identify the mRNAs associated with PTB and U2AF65 under native conditions. These two RNA-binding proteins are part of the post-transcriptional control of gene expression in eukaryotic cells.
This chapter begins by introducing concepts of the molecular biology of the cell, such as the central dogma of molecular biology, in which the processes of transcription and translation are essential to sustain the life of the cell and in which the control of gene expression is a fundamental mechanism of cell regulation. As part of the control of gene expression, Chapter 1 gives an overview of pre- and post-transcriptional control of gene expression. Pre-mRNA splicing is an essential step in post-transcriptional gene expression control and involves splicing factors such as the proteins PTB and U2AF65, with U2AF65 being exported to the cytoplasm and involved in other cellular functions. Chapter 1 shows how the RIP-Chip data for PTB and U2AF65 were obtained and gives a brief description of the methodology used by Gama-Carvalho et al. [2006] in their RIP-Chip experiment. It shows how the investigation of mRNAs associated with PTB and U2AF65 under native conditions was carried out by immunoprecipitation (IP) after adding a specific monoclonal antibody (Bb7 anti-PTB mAb or anti-U2AF65 MC3), followed by RNA extraction, polyadenylation, reverse transcription, end labelling and PCR amplification. The cDNAs generated were hybridized to the Affymetrix GeneChip Human Genome U133 Plus 2.0 [Gama-Carvalho et al., 2006]. This chapter describes microarray technology, in particular the characteristics of the Affymetrix microarrays used in the RIP-Chip experiment performed by Gama-Carvalho et al. [2006]. Next, Chapter 2 presents some of the most common methods for microarray data analysis and the results of their performance on the data of Gama-Carvalho et al. [2006]. For background correction, the robust linear model (RMA) of Irizarry et al. [2003a] and a modification of it (GCRMA) proposed by Wu et al. [2004] were used, on PM (Perfect Match) probes only.
Normalization was performed by quantile normalization and probe summarization was done using median polish [Irizarry et al., 2003a]. Alternatively, the data were pre-processed using the dChip program: PM only; using the invariant set normalization method [Li and Wong, 2001]; and the model-based method of Li and Wong [2001] to compute expression levels. For comparison purposes, the data obtained after RMA background correction, quantile normalization and median polish summarization were used. Based on these data, enriched genes were selected using the following Bioconductor tools: limma (fits a linear model for each gene); eBayes (computes the moderated t-statistic, the F-statistic and the B-statistic, the log posterior odds); decideTests with a p-value < 0.05 (uses multiple testing to determine whether each statistic in a matrix of t-statistics should be considered significantly different from zero [Smyth, 2004]); and RankProd with FDR < 0.05 (a non-parametric test that detects items consistently ranked at the top of the list [Breitling et al., 2004]). These results were compared with those obtained with the dChip program using a false discovery rate (FDR) < 0.05 and a p-value < 0.05 [Li and Wong, 2003]. The results presented in Chapter 2 show how different methodologies applied to the data of Gama-Carvalho et al. [2006] produced different results. Part of the differences is mainly due to the fact that more than 20% of the mRNAs are enriched, whereas common normalization methods assume small differences between them.
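The quantile normalization step mentioned in this part of the abstract can be sketched as follows. This is a minimal, plain-Python illustration of the general technique, not the thesis code or the Bioconductor implementation; the function name is hypothetical, each inner list stands for one array, and ties are broken by position rather than averaged, which is a simplification.

```python
def quantile_normalize(arrays):
    """Force every array to share the same empirical distribution.

    arrays: list of equal-length lists, one list of intensities per array.
    The reference distribution is the mean of the sorted arrays, and each
    value is replaced by the reference value at its own rank.
    """
    n = len(arrays[0])
    if any(len(a) != n for a in arrays):
        raise ValueError("all arrays must have the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [
        sum(s[rank] for s in sorted_arrays) / len(arrays) for rank in range(n)
    ]
    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity.
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


example = quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]])
```

After normalization every array contains the same set of values, only in a different order. That is reasonable when few genes change, and it is one way to see why the abstract reports difficulties when more than 20% of the mRNAs are enriched.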
Since the main objective of this thesis was to develop statistical methodologies for low-level analysis and enriched gene selection in RIP-Chip experiments, Chapter 3 presents the implementation of a new background correction method inspired by a non-specific hybridization method used for pre-processing ChIP-Chip data [Johnson et al., 2006]. Linear regression models were used to model non-specific hybridization on each microarray, representing interactions between every three consecutive nucleotides in the probe sequence. Probe intensities were standardized using their predicted intensity and the variance of probes with similar predicted intensities. The new approach proposed here uses the information from each microarray independently, and the standardized intensity values showed no need for further normalization. The microarrays can therefore be compared directly [Barreto-Hernandez et al., 2011]. Chapter 3 also presents a probe set score, the definition of a probe set enrichment value (ENRval) and the corresponding p-values for enriched gene selection [Barreto-Hernandez et al., 2011]. The enriched genes obtained using this methodology, for both the PTB and the U2AF65 RIP-Chip data, agree with the genes identified experimentally by Gama-Carvalho et al. [2006]. Finally, Chapter 3 presents the development of a new non-parametric rank-based methodology, implemented for enriched gene selection and applied to the scores proposed in this chapter. This methodology takes into account the variability of the standardized intensity within each probe set, rather than using the summary value of each probe set (ENRval). Also in this chapter, the methodologies developed in this thesis for enriched gene selection are applied to the data from the Spike-In experiment.
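The probe standardization described in this part of the abstract can be illustrated with a rough sketch. It assumes that the predicted intensities from the per-array regression are already available, that "similar" probes are found by sorting on predicted intensity and cutting into fixed-size bins, and that the spread is the standard deviation of the observed intensities in each bin. The function name, the binning rule and the bin size are all hypothetical choices made for illustration; the thesis may define them differently.

```python
from statistics import pstdev


def standardize_probes(observed, predicted, bin_size=3):
    """Standardize probe intensities against their predicted values.

    Each probe gets (observed - predicted) / sd, where sd is the
    standard deviation of observed intensities among probes with
    similar predicted intensity (same bin after sorting by prediction).
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    n = len(observed)
    order = sorted(range(n), key=lambda i: predicted[i])
    scores = [0.0] * n
    for start in range(0, n, bin_size):
        members = order[start:start + bin_size]
        values = [observed[i] for i in members]
        sd = pstdev(values) if len(values) > 1 else 0.0
        for i in members:
            # A bin with no spread carries no information; score it as 0.
            scores[i] = (observed[i] - predicted[i]) / sd if sd > 0 else 0.0
    return scores
```

Because every array is standardized using only its own probes, the resulting scores are on a common scale without a separate between-array normalization step, which matches the motivation given in the abstract.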
This dataset was built some years ago and is used in the development and comparison of methods for differential gene expression analysis [Irizarry et al., 2003b]. The Spike-In U133 experiment comprises 42 transcripts added to a complex human transcriptome at concentrations ranging from 0.125 pM to 512 pM, corresponding to 14 separate hybridizations with three technical replicates. The transcripts were included in the experiment in a classic Latin square design [Irizarry et al., 2003b]. For the comparative analysis, three different Spike-In hybridizations were selected (hybridizations 1, 8 and 14) and used to simulate enrichment differences in RIP-Chip experiments as follows: 1 as Control and 8 as IP; 1 as IP and 14 as Control. The two methodologies developed in this thesis for enriched gene selection show high accuracy when applied to the Spike-In U133 data. Fundação para a Ciência e a Tecnologia (FCT), through projects PEst-OE/MAT/UI0006/2011 and PTDC/MAT/64353/2006.