Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis
A prespecified set of genes may be enriched, to varying degrees, for genes
that have altered expression levels relative to two or more states of a cell.
Knowing the enrichment of gene sets defined by functional categories, such as
gene ontology (GO) annotations, is valuable for analyzing the biological
signals in microarray expression data. A common approach to measuring
enrichment is by cross-classifying genes according to membership in a
functional category and membership on a selected list of significantly altered
genes. A small Fisher's exact test p-value, for example, in this 2×2
table is indicative of enrichment. Other category analysis methods retain the
quantitative gene-level scores and measure significance by referring a
category-level statistic to a permutation distribution associated with the
original differential expression problem. We describe a class of random-set
scoring methods that measure distinct components of the enrichment signal. The
class includes Fisher's test based on selected genes and also tests that
average gene-level evidence across the category. Averaging and selection
methods are compared empirically using Affymetrix data on expression in
nasopharyngeal cancer tissue, and theoretically using a location model of
differential expression. We find that each method has a domain of superiority
in the state space of enrichment problems, and that both methods have benefits
in practice. Our analysis also addresses two problems related to
multiple-category inference, namely, that equally enriched categories are not
detected with equal probability if they are of different sizes, and also that
there is dependence among category statistics owing to shared genes. Random-set
enrichment calculations do not require Monte Carlo for implementation. They are
made available in the R package allez.
Comment: Published at http://dx.doi.org/10.1214/07-AOAS104 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
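The selection and averaging approaches contrasted in this abstract can be illustrated with a minimal Python sketch. It is not the allez implementation: the function names, the dictionary layout of the gene scores, and the one-sided form of Fisher's test are assumptions made for illustration. The averaging score standardizes the category mean using the mean and variance it would have if the category were a random set of the same size drawn without replacement, which is why no Monte Carlo step is needed.

```python
from math import comb, sqrt


def fisher_enrichment_pvalue(n_genes, n_selected, n_category, n_overlap):
    """One-sided Fisher's exact test for the 2x2 table (in category) x (on selected list).

    Returns P(overlap >= n_overlap) under the hypergeometric null.
    """
    max_overlap = min(n_selected, n_category)
    total = comb(n_genes, n_category)
    tail = sum(
        comb(n_selected, k) * comb(n_genes - n_selected, n_category - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def random_set_z(scores, category):
    """Averaging score: category mean standardized against a random set of equal size.

    `scores` maps gene -> gene-level score; `category` is a collection of gene names.
    Mean and variance come from sampling without replacement from the finite
    population of all gene scores.
    """
    n_genes = len(scores)
    m = len(category)
    mu = sum(scores.values()) / n_genes
    sigma2 = sum((s - mu) ** 2 for s in scores.values()) / n_genes
    category_mean = sum(scores[g] for g in category) / m
    variance = (sigma2 / m) * (n_genes - m) / (n_genes - 1)
    return (category_mean - mu) / sqrt(variance)
```

If the gene-level scores are 0/1 indicators of membership on the selected list, the category sum follows the same hypergeometric law used by Fisher's test, so both functions can be read as members of one random-set family.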
htsint: a Python library for sequencing pipelines that combines data through gene set generation
Background: Sequencing technologies provide a wealth of details in terms of genes, expression, splice variants, polymorphisms, and other features. A standard step in sequencing analysis pipelines is to put genomic or transcriptomic features into a context of known functional information, but the relationships between ontology terms are often ignored. For RNA-Seq, considering genes and their genetic variants at the group level offers a convenient way both to integrate annotation data and to detect small coordinated changes between experimental conditions, which gene-level analyses are known to miss.
Results: We introduce the high throughput data integration tool, htsint, as an extension to the commonly used gene set enrichment frameworks. The central aim of htsint is to compile annotation information from one or more taxa in order to calculate functional distances among all genes in a specified gene space. Spectral clustering is then used to partition the genes, thereby generating functional modules. The gene space can range from a targeted list of genes, like a specific pathway, all the way to an ensemble of genomes. Given a collection of gene sets and a count matrix of transcriptomic features (e.g. expression, polymorphisms), the gene sets produced by htsint can be tested for 'enrichment' or conditional differences using one of a number of commonly available packages.
Conclusion: The database and bundled tools to generate functional modules were designed with sequencing pipelines in mind, but the toolkit nature of htsint allows it to also be used in other areas of genomics. The software is freely available as a Python library through GitHub at https://github.com/ajrichards/htsint.
MorphDB : prioritizing genes for specialized metabolism pathways and gene ontology categories in plants
Recent times have seen an enormous growth of "omics" data, of which high-throughput gene expression data are arguably the most important from a functional perspective. Despite huge improvements in computational techniques for the functional classification of gene sequences, common similarity-based methods often fall short of providing full and reliable functional information. Recently, the combination of comparative genomics with approaches in functional genomics has received considerable interest for gene function analysis, leveraging both gene-expression-based guilt-by-association methods and annotation efforts in closely related model organisms. Besides the identification of missing genes in pathways, these methods also typically enable the discovery of biological regulators (i.e., transcription factors or signaling genes). A previously built guilt-by-association method is MORPH, which was proven to be an efficient algorithm that performs particularly well in identifying and prioritizing missing genes in plant metabolic pathways. Here, we present MorphDB, a resource where MORPH-based candidate genes for large-scale functional annotations (Gene Ontology, MapMan bins) are integrated across multiple plant species. Besides a gene-centric query utility, we present a comparative network approach that enables researchers to efficiently browse MORPH predictions across functional gene sets and species, facilitating efficient gene discovery and candidate gene prioritization. MorphDB is available at http://bioinformatics.psb.ugent.be/webtools/morphdb/morphDB/index/. We also provide a toolkit, named "MORPH bulk" (https://github.com/arzwa/morph-bulk), for running MORPH in bulk mode on novel data sets, enabling researchers to apply MORPH to their own species of interest.
1st INCF Workshop on Genetic Animal Models for Brain Diseases
The INCF Secretariat organized a workshop to focus on the “role of neuroinformatics in the processes of building, evaluating, and using genetic animal models for brain diseases” in Stockholm, December 13–14, 2009. Eight scientists specialized in the fields of neuroinformatics, database, ontologies, and brain disease participated together with two representatives of the National Institutes of Health and the European Union, as well as three observers of the national INCF nodes of Norway, Poland, and the United Kingdom.
A statistical framework for testing functional categories in microarray data
Ready access to emerging databases of gene annotation and functional pathways
has shifted assessments of differential expression in DNA microarray studies
from single genes to groups of genes with shared biological function. This
paper takes a critical look at existing methods for assessing the differential
expression of a group of genes (functional category), and provides some
suggestions for improved performance. We begin by presenting a general
framework, in which the set of genes in a functional category is compared to
the complementary set of genes on the array. The framework includes tests for
overrepresentation of a category within a list of significant genes, and
methods that consider continuous measures of differential expression. Existing
tests are divided into two classes. Class 1 tests assume gene-specific measures
of differential expression are independent, despite overwhelming evidence of
positive correlation. Analytic and simulated results are presented that
demonstrate Class 1 tests are strongly anti-conservative in practice. Class 2
tests account for gene correlation, typically through array permutation that by
construction has proper Type I error control for the induced null. However,
both Class 1 and Class 2 tests use a null hypothesis that all genes have the
same degree of differential expression. We introduce a more sensible and
general (Class 3) null under which the profile of differential expression is
the same within the category and complement. Under this broader null, Class 2
tests are shown to be conservative. We propose standard bootstrap methods for
testing against the Class 3 null and demonstrate they provide valid Type I
error control and more power than array permutation in simulated datasets and
real microarray experiments.
Comment: Published at http://dx.doi.org/10.1214/07-AOAS146 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
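The array permutation idea behind the Class 2 tests in this abstract can be sketched as follows. This is a toy illustration, not the paper's procedure: the gene-level score (difference in group means), the category statistic (mean score inside the category minus mean score in the complement), the function names, and the data layout are all assumptions. Permuting sample labels rather than gene labels keeps each gene's expression vector intact, so correlation between genes is preserved under the null. The bootstrap test the authors propose for the Class 3 null is not shown.

```python
import random
from statistics import mean


def gene_scores(expr, labels):
    """Difference in mean expression between samples labelled 1 and samples labelled 0."""
    scores = {}
    for gene, values in expr.items():
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        scores[gene] = mean(group1) - mean(group0)
    return scores


def category_statistic(expr, labels, category):
    """Mean gene score inside the category minus mean gene score in its complement."""
    scores = gene_scores(expr, labels)
    inside = [s for g, s in scores.items() if g in category]
    outside = [s for g, s in scores.items() if g not in category]
    return mean(inside) - mean(outside)


def array_permutation_pvalue(expr, labels, category, n_perm=500, seed=0):
    """One-sided p-value from shuffling sample (array) labels."""
    rng = random.Random(seed)
    observed = category_statistic(expr, labels, category)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # group sizes are unchanged by shuffling
        if category_statistic(expr, shuffled, category) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)
```

A call such as `array_permutation_pvalue(expr, labels, {"a", "b"})` expects `expr` to map each gene name to a list of expression values in the same sample order as `labels`.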
Correlated fragile site expression allows the identification of candidate fragile genes involved in immunity and associated with carcinogenesis
Common fragile sites (CFSs) are specific regions in the human genome that are
particularly prone to genomic instability under conditions of replicative
stress. Several investigations support the view that common fragile sites play
a role in carcinogenesis. We discuss a genome-wide approach based on graph
theory and Gene Ontology vocabulary for the functional characterization of
common fragile sites and for the identification of genes that contribute to
tumour cell biology. CFSs were assembled in a network based on a simple measure
of correlation among common fragile site patterns of expression. By applying
robust measurements to capture in quantitative terms the nontriviality of the
network, we identified several topological features clearly indicating
departure from the Erdős–Rényi random graph model. The most important outcome
was the presence of an unexpectedly large connected component far below the
percolation threshold. Most of the best characterized common fragile sites
belonged to this connected component. By filtering this connected component
with Gene Ontology, statistically significant shared functional features were
detected. Common fragile sites were found to be enriched for genes associated
with the immune response and with mechanisms involved in tumour progression such as
extracellular space remodeling and angiogenesis. Our results support the
hypothesis that fragile sites serve a function; we propose that fragility is
linked to a coordinated regulation of fragile gene expression.
Comment: 18 pages, accepted for publication in BMC Bioinformatics.
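The network construction described in this abstract can be sketched under stated assumptions. The abstract says only that a simple measure of correlation was used, so the Pearson correlation, the fixed threshold, the site labels, and the function names below are hypothetical choices for illustration. In an Erdős–Rényi random graph a giant component is expected only once the mean degree exceeds 1, so reporting the largest component size next to the mean degree is one way to check for a departure from that model.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlation_network(profiles, threshold):
    """Link two sites when the correlation of their expression patterns reaches the threshold."""
    graph = {site: set() for site in profiles}
    for u, v in combinations(profiles, 2):
        if pearson(profiles[u], profiles[v]) >= threshold:
            graph[u].add(v)
            graph[v].add(u)
    return graph


def largest_component_size(graph):
    """Size of the largest connected component, found by depth-first search."""
    seen = set()
    best = 0
    for start in graph:
        if start in seen:
            continue
        stack = [start]
        seen.add(start)
        size = 0
        while stack:
            node = stack.pop()
            size += 1
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def mean_degree(graph):
    """Average number of neighbours per node."""
    return sum(len(neighbours) for neighbours in graph.values()) / len(graph)
```

Here `profiles` is assumed to map each fragile site to a vector of expression indicators across individuals, for example `{"site_A": [1, 0, 1, 0], "site_B": [1, 0, 1, 0]}`.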
Conserved noncoding sequences highlight shared components of regulatory networks in dicotyledonous plants
Conserved noncoding sequences (CNSs) in DNA are reliable pointers to regulatory elements controlling gene expression. Using a comparative genomics approach with four dicotyledonous plant species (Arabidopsis thaliana, papaya [Carica papaya], poplar [Populus trichocarpa], and grape [Vitis vinifera]), we detected hundreds of CNSs upstream of Arabidopsis genes. Distinct positioning, length, and enrichment for transcription factor binding sites suggest these CNSs play a functional role in transcriptional regulation. The enrichment of transcription factors within the set of genes associated with CNSs is consistent with the hypothesis that together they form part of a conserved transcriptional network whose function is to regulate other transcription factors and control development. We identified a set of promoters where regulatory mechanisms are likely to be shared between the model organism Arabidopsis and other dicots, providing areas of focus for further research.
Statistical approaches of gene set analysis with quantitative trait loci for high-throughput genomic studies.
Gene set analysis has recently become the first choice for gaining insights into the complex biology underlying diseases through high-throughput genomic studies, such as microarrays, bulk RNA-sequencing, and single-cell RNA-sequencing. It also reduces the complexity of statistical analysis and enhances the explanatory power of the obtained results. However, the statistical structure and steps common to these approaches have not yet been comprehensively discussed, which limits their utility. Hence, a comprehensive overview of the available gene set analysis approaches used for different high-throughput genomic studies is provided. Gene sets are usually analyzed on the basis of gene ontology terms, known biological pathways, and similar annotations, which may not establish any formal relation between genotype and trait-specific phenotype. Further, in plant biology and breeding, gene set analysis with trait-specific quantitative trait loci (QTL) data is considered a great source for biological knowledge discovery. Therefore, innovative statistical approaches are developed for analyzing and interpreting gene expression data from microarray and RNA-sequencing studies in the context of gene sets with trait-specific QTL. The utility of the developed approaches is studied on multiple real gene expression datasets obtained from various microarray and RNA-sequencing studies. Selecting gene sets through differential expression analysis is the primary step of gene set analysis, and it can be achieved using gene selection methods. The existing methods for such analysis in high-throughput studies, such as microarray and RNA-sequencing studies, suffer from serious limitations. For instance, for microarrays, most of the available methods are based on either relevancy or redundancy measures.
Through these methods, genes are ranked on a single microarray expression dataset, which leads to the selection of spuriously associated and redundant gene sets. Therefore, newer and innovative differential expression methods have been developed for microarray and single-cell RNA-sequencing studies to identify gene sets for gene set analysis and other downstream analyses. Furthermore, several methods specifically designed for single-cell data have been developed in the literature for differential expression analysis. To provide guidance on choosing an appropriate tool or developing a new one, it is necessary to review the performance of the existing methods. Hence, a comprehensive overview, classification, and comparative study of the available single-cell methods is undertaken to study their unique features, underlying statistical models, and shortcomings in real applications. Moreover, to address one of the shortcomings (i.e., higher dropout events due to lower cell capture rates), an improved statistical method for downstream analysis of single-cell data has been developed. From the users' point of view, the developed statistical methods are implemented in various software tools and made publicly available. These methods and tools will help experimental biologists and genome researchers to analyze their experimental data more objectively and efficiently. Moreover, the limitations and shortcomings of the available methods are reported in this study, and these need to be addressed by statisticians and biologists collectively to develop efficient approaches. These new approaches will be able to analyze high-throughput genomic data more efficiently to better understand biological systems and increase the specificity, sensitivity, utility, and relevance of high-throughput genomic studies.