1,115 research outputs found
Stability and aggregation of ranked gene lists
Ranked gene lists are highly unstable, in the sense that similar measures of differential gene expression may yield very different rankings and that a small change in the data set usually affects the resulting gene list considerably. Stability issues were long under-considered in the literature, but they have become a hot topic in the last few years, perhaps as a consequence of increasing skepticism about the reproducibility and clinical applicability of molecular research findings. In this article, we review existing approaches for assessing the stability of ranked gene lists and the related problem of aggregation, give some practical recommendations, and warn against potential misuse of these methods. This overview is illustrated through an application to a recent leukemia data set using the freely available Bioconductor package GeneSelector.
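As an illustration of the kind of stability assessment the abstract describes, the following minimal sketch ranks genes on the full data, re-ranks them on random subsamples, and reports the mean top-k overlap. It is not the GeneSelector API (which is an R package). The function names, the ranking statistic (absolute difference in group means), and the leave-out subsampling scheme are all assumptions chosen for brevity.

```python
import random
from statistics import mean


def rank_genes(data, labels):
    """Rank gene indices by absolute difference in group means (largest first).

    `data` is a list of rows, one per gene; `labels` holds 0/1 per sample.
    The statistic is a stand-in; any differential-expression score could be used.
    """
    g0 = [i for i, lab in enumerate(labels) if lab == 0]
    g1 = [i for i, lab in enumerate(labels) if lab == 1]
    scores = []
    for gene, row in enumerate(data):
        diff = abs(mean(row[i] for i in g1) - mean(row[i] for i in g0))
        scores.append((diff, gene))
    # Descending score; ties broken by gene index so the ranking is deterministic.
    scores.sort(key=lambda t: (-t[0], t[1]))
    return [gene for _, gene in scores]


def topk_overlap(rank_a, rank_b, k):
    """Fraction of genes shared by the top-k positions of two rankings."""
    return len(set(rank_a[:k]) & set(rank_b[:k])) / k


def subsample_stability(data, labels, k, n_rounds=50, leave_out=1, seed=0):
    """Mean top-k overlap between the full-data ranking and subsample rankings.

    Each round drops `leave_out` samples from each group (hypothetical scheme).
    """
    rng = random.Random(seed)
    full = rank_genes(data, labels)
    g0 = [i for i, lab in enumerate(labels) if lab == 0]
    g1 = [i for i, lab in enumerate(labels) if lab == 1]
    overlaps = []
    for _ in range(n_rounds):
        keep = sorted(rng.sample(g0, len(g0) - leave_out)
                      + rng.sample(g1, len(g1) - leave_out))
        sub_data = [[row[i] for i in keep] for row in data]
        sub_labels = [labels[i] for i in keep]
        overlaps.append(topk_overlap(full, rank_genes(sub_data, sub_labels), k))
    return mean(overlaps)
```

A value near 1 suggests the top of the list survives small perturbations of the data; a value near 0 suggests it does not. The choice of k and of the perturbation scheme both affect the result, so such a score is best read comparatively.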
Bayesian meta-analysis for identifying periodically expressed genes in fission yeast cell cycle
The effort to identify genes with periodic expression during the cell cycle
from genome-wide microarray time series data has been ongoing for a decade.
However, the lack of rigorous modeling of periodic expression as well as the
lack of a comprehensive model for integrating information across genes and
experiments has impaired the effort for the accurate identification of
periodically expressed genes. To address the problem, we introduce a Bayesian
model to integrate multiple independent microarray data sets from three recent
genome-wide cell cycle studies on fission yeast. A hierarchical model was used
for data integration. In order to facilitate an efficient Monte Carlo sampling
from the joint posterior distribution, we develop a novel Metropolis--Hastings
group move. A surprising finding from our integrated analysis is that more than
40% of the genes in fission yeast are significantly periodically expressed,
far exceeding the 10--15% of genes reported in the current literature.
This calls for a reconsideration of the periodically expressed gene detection
problem.
Comment: Published at http://dx.doi.org/10.1214/09-AOAS300 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
Detecting disease-associated genes with confounding variable adjustment and the impact on genomic meta-analysis: With application to major depressive disorder
Background: Detecting candidate markers in transcriptomic studies often encounters difficulties in complex diseases, particularly when overall signals are weak and sample size is small. Covariates including demographic, clinical and technical variables are often confounded with the underlying disease effects, which further hampers accurate biomarker detection. Our motivating example came from an analysis of five microarray studies in major depressive disorder (MDD), a heterogeneous psychiatric illness with mostly uncharacterized genetic mechanisms.
Results: We applied a random intercept model to account for confounding variables and case-control paired design. A variable selection scheme was developed to determine the effective confounders in each gene. Meta-analysis methods were used to integrate information from five studies and post hoc analyses enhanced biological interpretations. Simulations and application results showed that the adjustment for confounding variables and meta-analysis improved detection of biomarkers and associated pathways.
Conclusions: The proposed framework simultaneously considers correction for confounding variables, selection of effective confounders, random effects from paired design and integration by meta-analysis. The approach improved disease-related biomarker and pathway detection, which greatly enhanced understanding of MDD neurobiology. The statistical framework can be applied to similar experimental design encountered in other complex and heterogeneous diseases.
Robust Algorithms for Detecting Hidden Structure in Biological Data
Biological data, such as molecular abundance measurements and protein
sequences, harbor complex hidden structure that reflects its underlying
biological mechanisms. For example, high-throughput abundance measurements
provide a snapshot of the global state of a living cell, while homologous
protein sequences encode the residue-level logic of the proteins' function
and provide a snapshot of the evolutionary trajectory of the protein family.
In this work I describe algorithmic approaches and analysis software I
developed for uncovering hidden structure in both kinds of data.
Clustering is an unsupervised machine learning technique commonly used
to map the structure of data collected in high-throughput experiments,
such as quantification of gene expression by DNA microarrays or
short-read sequencing. Clustering algorithms always yield a partitioning
of the data, but relying on a single partitioning solution can lead to
spurious conclusions. In particular, noise in the data can cause objects
to fall into the same cluster by chance rather than due to meaningful
association. In the first part of this thesis I demonstrate approaches to
clustering data robustly in the presence of noise and apply robust clustering
to analyze the transcriptional response to injury in a neuron cell.
In the second part of this thesis I describe identifying hidden specificity
determining residues (SDPs) from alignments of protein sequences descended
through gene duplication from a common ancestor (paralogs) and apply the
approach to identify numerous putative SDPs in bacterial transcription
factors in the LacI family. Finally, I describe and demonstrate a new
algorithm for reconstructing the history of duplications by which paralogs
descended from their common ancestor. This algorithm addresses the
complexity of such reconstruction due to indeterminate or erroneous
homology assignments made by sequence alignment algorithms and to the
vast prevalence of divergence through speciation over divergence through
gene duplication in protein evolution.
Complete gene expression profiling of Saccharopolyspora erythraea using GeneChip DNA microarrays
The Saccharopolyspora erythraea genome sequence, recently published, presents considerable divergence from those of streptomycetes in gene organization and function, confirming the remarkable potential of S. erythraea for producing many other secondary metabolites in addition to erythromycin. In order to investigate, at whole transcriptome level, how S. erythraea genes are modulated, a DNA microarray was specifically designed and constructed on the S. erythraea strain NRRL 2338 genome sequence, and the expression profiles of 6494 ORFs were monitored during growth in complex liquid medium
Statistical methods for transcriptomics: From microarrays to RNA-seq
Transcriptomics studies the expression level of genes under different experimental conditions in order to identify the genes associated with a given phenotype, as well as the regulatory relationships between genes. Omics data are characterized by containing information on thousands of variables in a sample with few observations. The most common high-throughput technologies for measuring the expression level of thousands of genes simultaneously are microarrays and, more recently, RNA sequencing (RNA-seq).
This thesis deals with the evaluation, adaptation and development of statistical models for the analysis of gene expression data, whether estimated by microarrays or by RNA-seq. The study is approached with both univariate and multivariate tools and methods.
Tarazona Campos, S. (2014). Statistical methods for transcriptomics: From microarrays to RNA-seq [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/48485 (Thesis; Extraordinary Doctoral Thesis Awards)
Distance-based methods for detecting associations in structured data with applications in bioinformatics
In bioinformatics applications samples of biological variables of interest can take a variety
of structures. For instance, in this thesis we consider vector-valued observations
of multiple gene expression and genetic markers, curve-valued gene expression time
courses, and graph-valued functional connectivity networks within the brain. This
thesis considers three problems routinely encountered when dealing with such variables:
detecting differences between populations, detecting predictive relationships
between variables, and detecting association between variables.
Distance-based approaches to these problems are considered, offering great flexibility
over alternative approaches, such as traditional multivariate approaches which
may be inappropriate. The notion of distance has been widely adopted in recent years
to quantify the dissimilarity between samples, and suitable distance measures can be
applied depending on the nature of the data and on the specific objectives of the study.
For instance, for gene expression time courses modeled as time-dependent curves, distance
measures can be specified to capture biologically meaningful aspects of these
curves which may differ. On obtaining a distance matrix containing all pairwise distances
between the samples of a given variable, many distance-based testing procedures
can then be applied. The main inhibitor of their effective use in bioinformatics is that
p-values are typically estimated by using Monte Carlo permutations. Thousands or
even millions of tests need to be performed simultaneously, and time/computational
constraints lead to a low number of permutations being enumerated for each test.
The contributions of this thesis include the proposal of two new distance-based
statistics, the DBF statistic for the problem of detecting differences between populations,
and the GRV coefficient for the problem of detecting association between
variables. In each case approximate null distributions are derived, allowing estimation
of p-values with reduced computational cost, and through simulation these are shown to work well for a range of distances and data types. The tests are also demonstrated
to be competitive with existing approaches. For the problem of detecting predictive
relationships between variables, the approximate null distribution is derived for the
routinely used distance-based pseudo F test, and through simulation this is shown to
work well for a range of distances and data types. All tests are applied to real datasets,
including a longitudinal human immune cell M. tuberculosis dataset, an Alzheimer’s
disease dataset, and an ovarian cancer dataset.
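To make the computational bottleneck mentioned in the abstract concrete, here is a minimal sketch of a Monte Carlo permutation p-value computed from a distance matrix. It uses the standard one-way distance-based pseudo-F statistic as a stand-in. It does not implement the thesis's DBF statistic, GRV coefficient, or approximate null distributions, and the function names and defaults are assumptions.

```python
import random


def pseudo_f(dist, groups):
    """One-way distance-based pseudo-F from a symmetric distance matrix.

    SS_total  = (1/N)   * sum of squared distances over all pairs
    SS_within = sum over groups of (1/n_g) * squared distances within the group
    F = (SS_among / (a - 1)) / (SS_within / (N - a)), with a groups.
    Assumes SS_within > 0 and at least two groups.
    """
    n = len(groups)
    labels = sorted(set(groups))
    a = len(labels)
    ss_total = sum(dist[i][j] ** 2
                   for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for lab in labels:
        idx = [i for i, g in enumerate(groups) if g == lab]
        ss_within += sum(dist[i][j] ** 2
                         for p, i in enumerate(idx) for j in idx[p + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permutation_p_value(dist, groups, n_perm=999, seed=0):
    """Estimate a p-value by shuffling group labels n_perm times.

    The +1 terms count the observed labelling as one permutation, so the
    smallest attainable p-value is 1 / (n_perm + 1).
    """
    rng = random.Random(seed)
    observed = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

Because the estimate cannot fall below 1 / (n_perm + 1), reaching the very small thresholds required after correcting for thousands or millions of tests needs a correspondingly large number of permutations per test. That is the cost an approximate null distribution is intended to avoid.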
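Kunnskapsbaserte metoder som håndterer komplekse avhengighetsstrukturer : anvendelser på genekspresjonsdata [Knowledge-based methods that handle complex dependency structures: applications to gene expression data]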
Microarray gene expression data are usually associated with a large number of correlated variables measured on few samples. This type of data typically contains high levels of noise, and the biological signals may be difficult to extract. The classical approach for analysing gene expression data is to test individual genes for differential expression. This basically implies performing tests on possibly thousands of dependent variables while incorrectly assuming statistical independence. The probability of making false positive discoveries is accordingly high, the results of the analysis may be difficult to reproduce, and the outcome may be a list of biologically unrelated genes that leaves very much to the imagination.
An increasing number of publications have therefore started to focus on incorporating prior biological information about gene dependencies in the analysis of gene expression data. Vast amounts of knowledge about relationships between genes based on previous studies are available. The motivations for analysing the data in light of this information include increased sensitivity and robustness of the analysis, better reproducibility of the results and easier interpretation. The prior information can, for example, be groups of genes with a similar function, or gene networks that describe some relationship between genes. With this information in hand, the focus can be turned from identifying important individual genes to identifying larger groups of important genes that are also related.
The aim of this thesis has been to improve and adapt existing methods to accommodate gene expression data from various types of experimental designs, in addition to developing novel procedures that incorporate prior information. A central part of this work has been concerned with significance testing in data sets with few and dependent samples. Most existing methods in this field use permutation tests to assess significance when the distribution of the test statistics is unknown. This is, however, problematic in data sets with very small sample sizes and complex experimental designs. In paper I we adopt a popular method for analysing gene sets, and replace the permutation test with a rotation test to adapt it to small sample sizes. Papers III and IV introduce improvements to the method in paper I by adapting it to data from complex experimental designs and time series data. In paper II we propose a novel method that uses gene networks to improve test statistics for individual genes.