1,115 research outputs found

    Stability and aggregation of ranked gene lists

    Ranked gene lists are highly unstable in the sense that similar measures of differential gene expression may yield very different rankings, and that a small change in the data set usually affects the resulting gene list considerably. Stability issues have long been under-considered in the literature, but they have become a hot topic in the last few years, perhaps as a consequence of increasing skepticism about the reproducibility and clinical applicability of molecular research findings. In this article, we review existing approaches for assessing the stability of ranked gene lists and the related problem of aggregation, give some practical recommendations, and warn against potential misuse of these methods. This overview is illustrated through an application to a recent leukemia data set using the freely available Bioconductor package GeneSelector.
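    The instability described in this abstract can be illustrated with a small sketch. It is not the GeneSelector API or any method from the article; it assumes a deliberately simple ranking criterion (absolute difference in group means) and measures how much of the top-k list survives bootstrap resampling of the samples. All function names here (`rank_genes`, `top_k_overlap`, `bootstrap_stability`) are hypothetical.

```python
import random
import statistics

def rank_genes(group_a, group_b):
    """Rank gene indices by absolute difference in group means, largest first.

    Each group is a list of samples; each sample is a list of expression values.
    This criterion is an illustrative assumption, not the article's method.
    """
    n_genes = len(group_a[0])
    scores = []
    for g in range(n_genes):
        mean_a = statistics.fmean([sample[g] for sample in group_a])
        mean_b = statistics.fmean([sample[g] for sample in group_b])
        scores.append(abs(mean_a - mean_b))
    return sorted(range(n_genes), key=lambda g: scores[g], reverse=True)

def top_k_overlap(ranking_1, ranking_2, k):
    """Fraction of genes shared by the top-k positions of two rankings."""
    return len(set(ranking_1[:k]) & set(ranking_2[:k])) / k

def bootstrap_stability(group_a, group_b, k, n_boot=100, seed=0):
    """Mean top-k overlap between the original ranking and bootstrap rankings."""
    rng = random.Random(seed)
    reference = rank_genes(group_a, group_b)
    overlaps = []
    for _ in range(n_boot):
        # Resample samples with replacement within each group.
        boot_a = [rng.choice(group_a) for _ in group_a]
        boot_b = [rng.choice(group_b) for _ in group_b]
        overlaps.append(top_k_overlap(reference, rank_genes(boot_a, boot_b), k))
    return statistics.fmean(overlaps)
```

    A value near 1 suggests the top-k list is insensitive to resampling; values well below 1 indicate the kind of instability the abstract describes.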

    Bayesian meta-analysis for identifying periodically expressed genes in fission yeast cell cycle

    The effort to identify genes with periodic expression during the cell cycle from genome-wide microarray time series data has been ongoing for a decade. However, the lack of rigorous modeling of periodic expression, as well as the lack of a comprehensive model for integrating information across genes and experiments, has impaired the accurate identification of periodically expressed genes. To address the problem, we introduce a Bayesian model to integrate multiple independent microarray data sets from three recent genome-wide cell cycle studies on fission yeast. A hierarchical model was used for data integration. To facilitate efficient Monte Carlo sampling from the joint posterior distribution, we develop a novel Metropolis-Hastings group move. A surprising finding from our integrated analysis is that more than 40% of the genes in fission yeast are significantly periodically expressed, far exceeding the 10-15% of genes reported in the current literature. This calls for a reconsideration of the periodically expressed gene detection problem. Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org) at http://dx.doi.org/10.1214/09-AOAS300.

    Detecting disease-associated genes with confounding variable adjustment and the impact on genomic meta-analysis: With application to major depressive disorder

    Background: Detecting candidate markers in transcriptomic studies often encounters difficulties in complex diseases, particularly when overall signals are weak and sample size is small. Covariates including demographic, clinical and technical variables are often confounded with the underlying disease effects, which further hampers accurate biomarker detection. Our motivating example came from an analysis of five microarray studies in major depressive disorder (MDD), a heterogeneous psychiatric illness with mostly uncharacterized genetic mechanisms. Results: We applied a random intercept model to account for confounding variables and case-control paired design. A variable selection scheme was developed to determine the effective confounders in each gene. Meta-analysis methods were used to integrate information from five studies and post hoc analyses enhanced biological interpretations. Simulations and application results showed that the adjustment for confounding variables and meta-analysis improved detection of biomarkers and associated pathways. Conclusions: The proposed framework simultaneously considers correction for confounding variables, selection of effective confounders, random effects from paired design and integration by meta-analysis. The approach improved disease-related biomarker and pathway detection, which greatly enhanced understanding of MDD neurobiology. The statistical framework can be applied to similar experimental designs encountered in other complex and heterogeneous diseases.

    Robust Algorithms for Detecting Hidden Structure in Biological Data

    Biological data, such as molecular abundance measurements and protein sequences, harbor complex hidden structure that reflects their underlying biological mechanisms. For example, high-throughput abundance measurements provide a snapshot of the global state of a living cell, while homologous protein sequences encode the residue-level logic of the proteins' function and provide a snapshot of the evolutionary trajectory of the protein family. In this work I describe algorithmic approaches and analysis software I developed for uncovering hidden structure in both kinds of data. Clustering is an unsupervised machine learning technique commonly used to map the structure of data collected in high-throughput experiments, such as quantification of gene expression by DNA microarrays or short-read sequencing. Clustering algorithms always yield a partitioning of the data, but relying on a single partitioning solution can lead to spurious conclusions. In particular, noise in the data can cause objects to fall into the same cluster by chance rather than due to meaningful association. In the first part of this thesis I demonstrate approaches to clustering data robustly in the presence of noise and apply robust clustering to analyze the transcriptional response to injury in a neuron cell. In the second part of this thesis I describe identifying hidden specificity-determining residues (SDPs) from alignments of protein sequences descended through gene duplication from a common ancestor (paralogs) and apply the approach to identify numerous putative SDPs in bacterial transcription factors in the LacI family. Finally, I describe and demonstrate a new algorithm for reconstructing the history of duplications by which paralogs descended from their common ancestor. This algorithm addresses the complexity of such reconstruction due to indeterminate or erroneous homology assignments made by sequence alignment algorithms and to the vast prevalence of divergence through speciation over divergence through gene duplication in protein evolution.

    Complete gene expression profiling of Saccharopolyspora erythraea using GeneChip DNA microarrays

    The recently published Saccharopolyspora erythraea genome sequence diverges considerably from those of streptomycetes in gene organization and function, confirming the remarkable potential of S. erythraea for producing many other secondary metabolites in addition to erythromycin. To investigate, at the whole-transcriptome level, how S. erythraea genes are modulated, a DNA microarray was specifically designed and constructed on the S. erythraea strain NRRL 2338 genome sequence, and the expression profiles of 6494 ORFs were monitored during growth in complex liquid medium.

    Statistical methods for transcriptomics: From microarrays to RNA-seq

    Transcriptomics studies the expression level of genes under different experimental conditions in order to identify the genes associated with a given phenotype, as well as the regulatory relationships between genes. Omics data are characterized by containing information on thousands of variables in a sample with few observations. The most common high-throughput technologies for measuring the expression level of thousands of genes simultaneously are microarrays and, more recently, RNA sequencing (RNA-seq). This thesis deals with the evaluation, adaptation and development of statistical models for the analysis of gene expression data, whether estimated with microarrays or with RNA-seq. The study is approached with both univariate and multivariate tools and methods. Tarazona Campos, S. (2014). Statistical methods for transcriptomics: From microarrays to RNA-seq [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/48485

    Distance-based methods for detecting associations in structured data with applications in bioinformatics

    In bioinformatics applications, samples of biological variables of interest can take a variety of structures. For instance, in this thesis we consider vector-valued observations of multiple gene expression and genetic markers, curve-valued gene expression time courses, and graph-valued functional connectivity networks within the brain. This thesis considers three problems routinely encountered when dealing with such variables: detecting differences between populations, detecting predictive relationships between variables, and detecting association between variables. Distance-based approaches to these problems are considered, offering great flexibility over alternative approaches, such as traditional multivariate approaches which may be inappropriate. The notion of distance has been widely adopted in recent years to quantify the dissimilarity between samples, and suitable distance measures can be applied depending on the nature of the data and on the specific objectives of the study. For instance, for gene expression time courses modeled as time-dependent curves, distance measures can be specified to capture biologically meaningful aspects of these curves which may differ. On obtaining a distance matrix containing all pairwise distances between the samples of a given variable, many distance-based testing procedures can then be applied. The main inhibitor of their effective use in bioinformatics is that p-values are typically estimated by using Monte Carlo permutations. Thousands or even millions of tests need to be performed simultaneously, and time/computational constraints lead to a low number of permutations being enumerated for each test. The contributions of this thesis include the proposal of two new distance-based statistics, the DBF statistic for the problem of detecting differences between populations, and the GRV coefficient for the problem of detecting association between variables. In each case approximate null distributions are derived, allowing estimation of p-values with reduced computational cost, and through simulation these are shown to work well for a range of distances and data types. The tests are also demonstrated to be competitive with existing approaches. For the problem of detecting predictive relationships between variables, the approximate null distribution is derived for the routinely used distance-based pseudo F test, and through simulation this is shown to work well for a range of distances and data types. All tests are applied to real datasets, including a longitudinal human immune cell M. tuberculosis dataset, an Alzheimer’s disease dataset, and an ovarian cancer dataset.
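    The computational bottleneck this abstract refers to can be sketched as follows. This is not the DBF statistic or the GRV coefficient from the thesis; it assumes a generic stand-in statistic (mean between-group distance minus mean within-group distance) and the usual Monte Carlo estimate p = (hits + 1) / (n_perm + 1). The function names are hypothetical.

```python
import random

def between_minus_within(dist, labels):
    """Mean between-group distance minus mean within-group distance.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    Assumes at least two groups, with some group holding two or more samples,
    so that both averages are defined. This is an illustrative statistic only.
    """
    between, within = [], []
    n = len(labels)
    for i in range(n):
        for j in range(i + 1, n):
            target = within if labels[i] == labels[j] else between
            target.append(dist[i][j])
    return sum(between) / len(between) - sum(within) / len(within)

def permutation_p_value(dist, labels, n_perm=999, seed=0):
    """Monte Carlo permutation p-value for the statistic above."""
    rng = random.Random(seed)
    observed = between_minus_within(dist, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Permuting labels breaks any link between group and distance.
        rng.shuffle(shuffled)
        if between_minus_within(dist, shuffled) >= observed:
            hits += 1
    # The smallest attainable value is 1 / (n_perm + 1).
    return (hits + 1) / (n_perm + 1)
```

    Because the estimate cannot fall below 1 / (n_perm + 1), a small number of permutations per test limits how small a p-value can be reported, which is the constraint that motivates the approximate null distributions mentioned in the abstract.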

    Knowledge-based methods that handle complex dependency structures: applications to gene expression data

    Microarray gene expression data are usually associated with a large number of correlated variables measured on few samples. This type of data typically contains high levels of noise, and the biological signals may be difficult to extract. The classical approach for analysing gene expression data is to test individual genes for differential expression. This basically implies performing tests on possibly thousands of dependent variables while incorrectly assuming statistical independence. The probability of making false positive discoveries is accordingly high, the results of the analysis may be difficult to reproduce, and the outcome may be a list of biologically unrelated genes that leaves very much to the imagination. An increasing number of publications have therefore started to focus on incorporating prior biological information about gene dependencies in the analysis of gene expression data. Vast amounts of knowledge about relationships between genes based on previous studies are available. The motivations for analysing the data in light of this information include increased sensitivity and robustness of the analysis, better reproducibility of the results and easier interpretation. The prior information can for example be groups of genes with a similar function, or gene networks that describe some relationship between genes. With this information in hand, the focus can be turned from identifying important individual genes to identifying larger groups of important genes that are also related. The aim of this thesis has been to improve and adapt existing methods to accommodate gene expression data from various types of experimental designs, in addition to developing novel procedures that incorporate prior information. A central part of this work has been concerned with significance testing in data sets with few and dependent samples. Most existing methods in this field use permutation tests to assess significance when the distribution of the test statistics is unknown. This is however problematic in data sets with very small sample sizes and complex experimental designs. In paper I we adopt a popular method for analysing gene sets, and replace the permutation test with a rotation test to adapt it to small sample sizes. Papers III and IV introduce improvements to the method in paper I by adapting it to data from complex experimental designs and time series data. In paper II we propose a novel method that uses gene networks to improve test statistics for individual genes.