Genome comparison using Gene Ontology (GO) with statistical testing
BACKGROUND: Automated comparison of complete sets of genes encoded in two genomes can provide insight into the genetic basis of differences in biological traits between species. Gene ontology (GO) is used as a common vocabulary to annotate genes for comparison. Current approaches calculate the fold of unweighted or weighted differences between two species at the high-level GO functional categories. However, to ensure the reliability of the differences detected, it is important to evaluate their statistical significance. It is also useful to search for differences at all levels of GO. RESULTS: We propose a statistical approach to find reliable differences between the complete sets of genes encoded in two genomes at all levels of GO. The genes are first assigned GO terms from BLAST searches against genes with known GO assignments, and for each GO term the abundance of genes in the two genomes is compared using a chi-squared test followed by false discovery rate (FDR) correction. We applied this method to find statistically significant differences between two cyanobacteria, Synechocystis sp. PCC6803 and Anabaena sp. PCC7120. We then studied how the set of identified differences varies when different BLAST cutoffs are used. We also studied how the results vary when only subsets of the genes were used in the comparison of human vs. mouse and that of Saccharomyces cerevisiae vs. Schizosaccharomyces pombe. CONCLUSION: There is a surprising lack of statistical approaches for comparing complete genomes at all levels of GO. With the rapid increase in the number of sequenced genomes, we hope that the approach we proposed and tested can make a valuable contribution to comparative genomics.
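The per-term comparison described in this abstract (a chi-squared test for each GO term, followed by FDR correction) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a 2×2 table per term (genes with and without the term in each genome), a Pearson chi-squared statistic without continuity correction, and the Benjamini-Hochberg procedure as the FDR method. The function names, the term labels, and the counts in the usage note are all hypothetical.

```python
import math


def chi2_2x2_pvalue(a, b, c, d):
    """P-value of a Pearson chi-squared test (1 degree of freedom, no
    continuity correction) on the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 1.0  # degenerate table: no evidence of a difference
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Survival function of a chi-squared variable with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def compare_genomes(counts, total_a, total_b, alpha=0.05):
    """counts maps a term label to (genes in genome A, genes in genome B).
    Returns term -> (p-value, adjusted p-value, significant at alpha)."""
    terms = list(counts)
    pvals = [
        chi2_2x2_pvalue(counts[t][0], total_a - counts[t][0],
                        counts[t][1], total_b - counts[t][1])
        for t in terms
    ]
    qvals = benjamini_hochberg(pvals)
    return {t: (p, q, q <= alpha) for t, p, q in zip(terms, pvals, qvals)}
```

With made-up counts, `compare_genomes({"term_A": (50, 10), "term_B": (20, 20)}, 100, 100)` flags `term_A` and not `term_B`. For terms with very small counts, an exact test may be a better choice than the chi-squared approximation.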
Multiple tests of association with biological annotation metadata
We propose a general and formal statistical framework for multiple tests of
association between known fixed features of a genome and unknown parameters of
the distribution of variable features of this genome in a population of
interest. The known gene-annotation profiles, corresponding to the fixed
features of the genome, may concern Gene Ontology (GO) annotation, pathway
membership, regulation by particular transcription factors, nucleotide
sequences, or protein sequences. The unknown gene-parameter profiles,
corresponding to the variable features of the genome, may be, for example,
regression coefficients relating possibly censored biological and clinical
outcomes to genome-wide transcript levels, DNA copy numbers, and other
covariates. A generic question of great interest in current genomic research
regards the detection of associations between biological annotation metadata
and genome-wide expression measures. This biological question may be translated
as the test of multiple hypotheses concerning association measures between
gene-annotation profiles and gene-parameter profiles. A general and rigorous
formulation of the statistical inference question allows us to apply the
multiple hypothesis testing methodology developed in [Multiple Testing
Procedures with Applications to Genomics (2008) Springer, New York] and related
articles, to control a broad class of Type I error rates, defined as
generalized tail probabilities and expected values for arbitrary functions of
the numbers of Type I errors and rejected hypotheses. The resampling-based
single-step and stepwise multiple testing procedures of [Multiple Testing
Procedures with Applications to Genomics (2008) Springer, New York] take into
account the joint distribution of the test statistics and provide Type I error
control in testing problems involving general data generating distributions
(with arbitrary dependence structures among variables), null hypotheses, and
test statistics. Comment: Published at http://dx.doi.org/10.1214/193940307000000446 in the IMS
Collections (http://www.imstat.org/publications/imscollections.htm) by the
Institute of Mathematical Statistics (http://www.imstat.org)
Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis
A prespecified set of genes may be enriched, to varying degrees, for genes
that have altered expression levels relative to two or more states of a cell.
Knowing the enrichment of gene sets defined by functional categories, such as
gene ontology (GO) annotations, is valuable for analyzing the biological
signals in microarray expression data. A common approach to measuring
enrichment is by cross-classifying genes according to membership in a
functional category and membership on a selected list of significantly altered
genes. A small Fisher's exact test p-value, for example, in this 2×2
table is indicative of enrichment. Other category analysis methods retain the
quantitative gene-level scores and measure significance by referring a
category-level statistic to a permutation distribution associated with the
original differential expression problem. We describe a class of random-set
scoring methods that measure distinct components of the enrichment signal. The
class includes Fisher's test based on selected genes and also tests that
average gene-level evidence across the category. Averaging and selection
methods are compared empirically using Affymetrix data on expression in
nasopharyngeal cancer tissue, and theoretically using a location model of
differential expression. We find that each method has a domain of superiority
in the state space of enrichment problems, and that both methods have benefits
in practice. Our analysis also addresses two problems related to
multiple-category inference, namely, that equally enriched categories are not
detected with equal probability if they are of different sizes, and also that
there is dependence among category statistics owing to shared genes. Random-set
enrichment calculations do not require Monte Carlo for implementation. They are
made available in the R package allez. Comment: Published at http://dx.doi.org/10.1214/07-AOAS104 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
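The averaging idea in this abstract, and the remark that no Monte Carlo is needed, can be illustrated with a short sketch. It assumes the category score is the mean of gene-level scores and standardizes it against a random set of the same size drawn without replacement, using the standard finite-population mean and variance. This is an illustration under those assumptions, not the allez package's code; `random_set_z` and the toy scores are hypothetical.

```python
import math


def random_set_z(scores, category):
    """Standardized average score of a category against a random gene set
    of the same size drawn without replacement from all scored genes.

    scores:   dict mapping gene id -> gene-level score
    category: iterable of gene ids belonging to the category
    """
    G = len(scores)
    members = [g for g in set(category) if g in scores]
    m = len(members)
    if m == 0 or m >= G:
        raise ValueError("category must contain between 1 and G-1 scored genes")
    values = list(scores.values())
    mu = sum(values) / G                            # mean over all genes
    sigma2 = sum((v - mu) ** 2 for v in values) / G  # population variance
    xbar = sum(scores[g] for g in members) / m       # category average
    # Variance of the mean of m draws without replacement from G values.
    var = (sigma2 / m) * (G - m) / (G - 1)
    if var == 0:
        raise ValueError("all gene-level scores are identical")
    return (xbar - mu) / math.sqrt(var)
```

If the gene-level score is a 0/1 indicator of membership on the selected list, the same moments are those of the hypergeometric count, which is one way to see how selection and averaging approaches fit in a single class.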
Common CHD8 Genomic Targets Contrast With Model-Specific Transcriptional Impacts of CHD8 Haploinsufficiency.
The packaging of DNA into chromatin determines the transcriptional potential of cells and is central to eukaryotic gene regulation. Case sequencing studies have revealed mutations to proteins that regulate chromatin state, known as chromatin remodeling factors, with causal roles in neurodevelopmental disorders. Chromodomain helicase DNA binding protein 8 (CHD8) encodes a chromatin remodeling factor with among the highest de novo loss-of-function mutation rates in patients with autism spectrum disorder (ASD). However, mechanisms associated with CHD8 pathology have yet to be elucidated. We analyzed published transcriptomic data across CHD8 in vitro and in vivo knockdown and knockout models and CHD8 binding across published ChIP-seq datasets to identify convergent mechanisms of gene regulation by CHD8. Differentially expressed genes (DEGs) across models varied, but overlap was observed between downregulated genes involved in neuronal development and function, cell cycle, chromatin dynamics, and RNA processing, and between upregulated genes involved in metabolism and immune response. Considering the variability in transcriptional changes and the cells and tissues represented across ChIP-seq analysis, we found a surprisingly consistent set of high-affinity CHD8 genomic interactions. CHD8 was enriched near promoters of genes involved in basic cell functions and gene regulation. Overlap between high-affinity CHD8 targets and DEGs shows that reduced dosage of CHD8 directly relates to decreased expression of cell cycle, chromatin organization, and RNA processing genes, but only in a subset of studies. This meta-analysis verifies CHD8 as a master regulator of gene expression and reveals a consistent set of high-affinity CHD8 targets across human, mouse, and rat in vivo and in vitro studies. These conserved regulatory targets include many genes that are also implicated in ASD. 
Our findings suggest a model in which perturbation of dosage-sensitive CHD8 genomic interactions with a highly conserved set of regulatory targets leads to model-specific downstream transcriptional impacts.
Sex differences in DNA methylation assessed by 450 K BeadChip in newborns.
Background: DNA methylation is an important epigenetic mark that can potentially link early life exposures to adverse health outcomes later in life. Host factors like sex and age strongly influence biological variation of DNA methylation, but characterization of these relationships is still limited, particularly in young children. Methods: In a sample of 111 Mexican-American subjects (58 girls, 53 boys), we interrogated DNA methylation differences by sex at birth using the 450 K BeadChip in umbilical cord blood specimens, adjusting for cell composition. Results: We observed that ~3% of CpG sites were differentially methylated between girls and boys at birth (FDR P < 0.05). Of those CpGs, 3031 were located on autosomes, and 82.8% of those were hypermethylated in girls compared to boys. Beyond individual CpGs, we found 3604 sex-associated differentially methylated regions (DMRs) where the majority (75.8%) had higher methylation in girls. Using pathway analysis, we found that sex-associated autosomal CpGs were significantly enriched for gene ontology terms related to nervous system development and behavior. Among hits in our study, 35.9% had been previously reported as sex-associated CpG sites in other published human studies. Further, for replicated hits, the direction of the association with methylation was highly concordant (98.5-100%) with previous studies. Conclusions: To our knowledge, this is the first reported epigenome-wide analysis by sex at birth that examined DMRs and adjusted for confounding by cell composition. We confirmed previously reported trends that methylation profiles are sex-specific even in autosomal genes, and also identified novel sex-associated CpGs in our methylome-wide analysis immediately after birth, a critical yet relatively unstudied developmental window.
Yeast Features: Identifying Significant Features Shared Among Yeast Proteins for Functional Genomics
Background
High throughput yeast functional genomics experiments are revealing associations among tens to hundreds of genes using numerous experimental conditions. To fully understand how the identified genes might be involved in the observed system, it is essential to consider the widest range of biological annotation possible. Biologists often start their search by collating the annotation provided for each protein within databases such as the Saccharomyces Genome Database, manually comparing them for similar features, and empirically assessing their significance. Such tasks can be automated, and more precise calculations of the significance can be determined using established probability measures. 
Results
We developed Yeast Features, an intuitive online tool to help establish the significance of finding a diverse set of shared features among a collection of yeast proteins. A total of 18,786 features from the Saccharomyces Genome Database are considered, including annotation based on the Gene Ontology's molecular function, biological process and cellular compartment, as well as conserved domains, protein-protein and genetic interactions, complexes, metabolic pathways, phenotypes and publications. The significance of shared features is estimated using a hypergeometric probability, but novel options exist to improve the significance by adding background knowledge of the experimental system. For instance, increased statistical significance is achieved in gene deletion experiments because interactions with essential genes will never be observed. We further demonstrate the utility by suggesting the functional roles of the indirect targets of an aminoglycoside with a known mechanism of action, and also the targets of an herbal extract with a previously unknown mode of action. The identification of shared functional features may also be used to propose novel roles for proteins of unknown function, including a role in protein synthesis for YKL075C.
Conclusions
Yeast Features (YF) is an easy to use web-based application (http://software.dumontierlab.com/yeastfeatures/) which can identify and prioritize features that are shared among a set of yeast proteins. This approach is shown to be valuable in the analysis of complex data sets, in which the extracted associations revealed significant functional relationships among the gene products.
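The hypergeometric probability mentioned in the Results above can be sketched as an upper-tail calculation: the chance of seeing at least k proteins with a given feature in a query set of n proteins, when K of the N background proteins carry that feature. This is a generic illustration, not the Yeast Features implementation; the function name is hypothetical and the numbers in the usage note are invented. Restricting the background (for example, excluding essential genes in a deletion screen, as the abstract describes) would correspond to choosing smaller N and K.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) where X counts feature-carrying proteins in a query set.

    k: proteins in the query set that share the feature
    n: size of the query set
    K: proteins in the background that carry the feature
    N: size of the background
    """
    if not (0 <= K <= N and 0 <= n <= N):
        raise ValueError("require 0 <= K <= N and 0 <= n <= N")
    lowest = max(k, 0, n - (N - K))   # smallest attainable count at or above k
    highest = min(n, K)               # largest attainable count
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i)
                     for i in range(lowest, highest + 1))
    return favourable / total
```

For example, `hypergeom_upper_tail(5, 5, 5, 10)` is the probability that all five proteins in a query set carry a feature held by half of a ten-protein background. When many features are tested at once, the resulting p-values would normally need a multiple-testing correction.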

GOexpress: an R/Bioconductor package for the identification and visualisation of robust gene ontology signatures through supervised learning of gene expression data
Background: Identification of gene expression profiles that differentiate experimental groups is critical for discovery and analysis of key molecular pathways and also for selection of robust diagnostic or prognostic biomarkers. While integration of differential expression statistics has been used to refine gene set enrichment analyses, such approaches are typically limited to single gene lists resulting from simple two-group comparisons or time-series analyses. In contrast, functional class scoring and machine learning approaches provide powerful alternative methods to leverage molecular measurements for pathway analyses, and to compare continuous and multi-level categorical factors. Results: We introduce GOexpress, a software package for scoring and summarising the capacity of gene ontology features to simultaneously classify samples from multiple experimental groups. GOexpress integrates normalised gene expression data (e.g., from microarray and RNA-seq experiments) and phenotypic information of individual samples with gene ontology annotations to derive a ranking of genes and gene ontology terms using a supervised learning approach. The default random forest algorithm allows interactions between all experimental factors, and competitive scoring of expressed genes to evaluate their relative importance in classifying predefined groups of samples. Conclusions: GOexpress enables rapid identification and visualisation of ontology-related gene panels that robustly classify groups of samples and supports both categorical (e.g., infection status, treatment) and continuous (e.g., time-series, drug concentrations) experimental factors. 
The use of standard Bioconductor extension packages and publicly available gene ontology annotations facilitates straightforward integration of GOexpress within existing computational biology pipelines. Department of Agriculture, Food and the Marine; European Commission - Seventh Framework Programme (FP7); Science Foundation Ireland; University College Dublin
Adaptations in energy metabolism and gene family expansions revealed by comparative transcriptomics of three Chagas disease triatomine vectors
Background: Chagas disease is a parasitic infection caused by Trypanosoma cruzi. It is an important public health problem affecting around seven to eight million people in the Americas. A large number of hematophagous triatomine insect species, occupying diverse natural and human-modified ecological niches, transmit this disease. Triatomines are long-living hemipterans that have evolved to exploit different habitats to associate with their vertebrate hosts. Understanding the molecular basis of the extreme physiological conditions including starvation tolerance and longevity could provide insights for developing novel control strategies. We describe the normalized cDNA, full body transcriptome analysis of three main vectors in North, Central and South America, Triatoma pallidipennis, T. dimidiata and T. infestans. Results: Two-thirds of the de novo assembled transcriptomes map to the Rhodnius prolixus genome and proteome. A Triatoma expansion of the calycin family and two types of protease inhibitors, pacifastins and cystatins, were identified. A high number of transcriptionally active class I transposable elements was documented in T. infestans, compared with T. dimidiata and T. pallidipennis. Sequence identity in Triatoma-R. prolixus 1:1 orthologs revealed high sequence divergence in four enzymes participating in gluconeogenesis, glycogen synthesis and the pentose phosphate pathway, indicating high evolutionary rates of these genes. Also, molecular evidence suggesting positive selection was found for several genes of the oxidative phosphorylation I, III and V complexes. Conclusions: Protease inhibitors and calycin-coding gene expansions provide insights into rapidly evolving processes of protease regulation and haematophagy. 
Higher evolutionary rates in enzymes that exert metabolic flux control towards anabolism and evidence for positive selection in oxidative phosphorylation complexes might represent genetic adaptations, possibly related to prolonged starvation, oxidative stress tolerance, longevity, and hematophagy and flight reduction. Overall, this work generated novel hypotheses related to biological adaptations to extreme physiological conditions and diverse ecological niches that sustain Chagas disease transmission.
Affiliation: Martínez Barnetche, Jesús. Instituto Nacional de Salud Pública; México
Affiliation: Lavore, Andres Esteban. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro de Investigaciones y Transferencia del Noroeste de la Provincia de Buenos Aires. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Centro de Investigaciones y Transferencia del Noroeste de la Provincia de Buenos Aires; Argentina. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Centro de Bioinvestigaciones (Sede Pergamino); Argentina
Affiliation: Beliera, Melina Daniela. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro de Investigaciones y Transferencia del Noroeste de la Provincia de Buenos Aires. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Centro de Investigaciones y Transferencia del Noroeste de la Provincia de Buenos Aires; Argentina. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Centro de Bioinvestigaciones (Sede Pergamino); Argentina
Affiliation: Téllez Sosa, Juan. Instituto Nacional de Salud Pública; México
Affiliation: Zumaya Estrada, Federico A. Instituto Nacional de Salud Pública; México
Affiliation: Palacio, Victorio Gabriel. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro de Investigaciones y Transferencia del Noroeste de la Provincia de Buenos Aires. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Centro de Investigaciones y Transferencia del Noroeste de la Provincia de Buenos Aires; Argentina. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Centro de Bioinvestigaciones (Sede Pergamino); Argentina
Affiliation: Godoy Lozano, Ernestina. Instituto Nacional de Salud Pública; México
Affiliation: Rivera Pomar, Rolando. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro de Investigaciones y Transferencia del Noroeste de la Provincia de Buenos Aires. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Centro de Investigaciones y Transferencia del Noroeste de la Provincia de Buenos Aires; Argentina. Universidad Nacional del Noroeste de la Provincia de Buenos Aires. Centro de Bioinvestigaciones (Sede Pergamino); Argentina
Affiliation: Rodríguez, Mario Henry. Instituto Nacional de Salud Pública; México