htsint: a Python library for sequencing pipelines that combines data through gene set generation
Background: Sequencing technologies provide a wealth of details in terms of genes, expression, splice variants, polymorphisms, and other features. A standard for sequencing analysis pipelines is to put genomic or transcriptomic features into a context of known functional information, but the relationships between ontology terms are often ignored. For RNA-Seq, considering genes and their genetic variants at the group level provides a convenient way both to integrate annotation data and to detect small coordinated changes between experimental conditions, which gene-level analyses are known to miss.
Results: We introduce the high throughput data integration tool, htsint, as an extension to the commonly used gene set enrichment frameworks. The central aim of htsint is to compile annotation information from one or more taxa in order to calculate functional distances among all genes in a specified gene space. Spectral clustering is then used to partition the genes, thereby generating functional modules. The gene space can range from a targeted list of genes, like a specific pathway, all the way to an ensemble of genomes. Given a collection of gene sets and a count matrix of transcriptomic features (e.g. expression, polymorphisms), the gene sets produced by htsint can be tested for 'enrichment' or conditional differences using one of a number of commonly available packages.
Conclusion: The database and bundled tools to generate functional modules were designed with sequencing pipelines in mind, but the toolkit nature of htsint allows it to also be used in other areas of genomics. The software is freely available as a Python library through GitHub at https://github.com/ajrichards/htsint
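The partitioning step described in the Results can be illustrated with a minimal sketch. It is not htsint's implementation, which this listing does not show: it is a generic two-way spectral partition, assuming a small symmetric distance matrix, a Gaussian kernel to turn distances into affinities, and the unnormalised graph Laplacian. The function name `spectral_bipartition`, the `sigma` parameter, and the toy distances are all hypothetical.

```python
import math


def spectral_bipartition(dist, sigma=0.5, iters=500):
    """Split items into two groups by the sign of the Fiedler vector.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Generic sketch only; not taken from the htsint code base.
    """
    n = len(dist)
    # Gaussian kernel: small distance -> affinity near 1 (no self-loops).
    w = [[0.0 if i == j else math.exp(-dist[i][j] ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in w]
    # Gershgorin bound: every eigenvalue of L = D - W is <= 2 * max degree.
    shift = 2 * max(deg)
    # Start from a centred ramp so the vector is orthogonal to the constant one.
    v = [i - (n - 1) / 2 for i in range(n)]
    for _ in range(iters):
        # Power iteration on (shift * I - L); its leading non-constant
        # eigenvector is the Fiedler vector of L.
        v = [(shift - deg[i]) * v[i] + sum(w[i][j] * v[j] for j in range(n))
             for i in range(n)]
        mean = sum(v) / n
        v = [x - mean for x in v]  # remove the constant component
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0:
            break
        v = [x / norm for x in v]
    return [0 if x < 0 else 1 for x in v]


# Toy functional distances for four hypothetical genes: two tight pairs.
toy_dist = [
    [0.0, 0.1, 1.0, 1.0],
    [0.1, 0.0, 1.0, 1.0],
    [1.0, 1.0, 0.0, 0.1],
    [1.0, 1.0, 0.1, 0.0],
]
toy_labels = spectral_bipartition(toy_dist)
print(toy_labels)
```

Genes that share a label form one candidate module. A real pipeline would more likely use a library eigen-solver and more than two clusters; the pure-Python power iteration here only keeps the sketch dependency-free.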
Probabilistic methods in the analysis of protein interaction networks
The Functional Consequences of Variation in Transcription Factor Binding
One goal of human genetics is to understand how the information for precise
and dynamic gene expression programs is encoded in the genome. The interactions
of transcription factors (TFs) with DNA regulatory elements clearly play an
important role in determining gene expression outputs, yet the regulatory logic
underlying functional transcription factor binding is poorly understood. Many
studies have focused on characterizing the genomic locations of TF binding, yet
it is unclear to what extent TF binding at any specific locus has functional
consequences with respect to gene expression output. To evaluate the context of
functional TF binding we knocked down 59 TFs and chromatin modifiers in one
HapMap lymphoblastoid cell line. We then identified genes whose expression was
affected by the knockdowns. We intersected the gene expression data with
transcription factor binding data (based on ChIP-seq and DNase-seq) within 10
kb of the transcription start sites of expressed genes. This combination of
data allowed us to infer functional TF binding. On average, 14.7% of genes
bound by a factor were differentially expressed following the knockdown of that
factor, suggesting that most interactions between TF and chromatin do not
result in measurable changes in gene expression levels of putative target
genes. We found that functional TF binding is enriched in regulatory elements
that harbor a large number of TF binding sites, at sites with predicted higher
binding affinity, and at sites that are enriched in genomic regions annotated
as active enhancers. Comment: 30 pages, 6 figures (7 supplemental figures and 6 supplemental tables
available upon request to [email protected]). Submitted to PLoS
Genetic
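The intersection step in the abstract above (genes bound by a factor versus genes differentially expressed after that factor's knockdown) can be sketched as follows. This is an illustrative calculation only, not the authors' pipeline: the function name, the factor labels, and the gene identifiers are made up, and the numbers bear no relation to the reported 14.7%.

```python
def functional_fraction(bound_genes, de_genes):
    """Fraction of bound genes that are also differentially expressed."""
    bound = set(bound_genes)
    if not bound:
        return 0.0
    return len(bound & set(de_genes)) / len(bound)


# Hypothetical toy data: genes bound by each factor, and genes
# differentially expressed after knocking that factor down.
bound_by_factor = {
    "TF_A": {"g1", "g2", "g3", "g4"},
    "TF_B": {"g2", "g5"},
}
de_after_knockdown = {
    "TF_A": {"g1", "g9"},
    "TF_B": {"g5", "g7"},
}

fractions = {
    tf: functional_fraction(bound_by_factor[tf], de_after_knockdown[tf])
    for tf in bound_by_factor
}
# Average over factors, mirroring the "on average" summary in the abstract.
average_fraction = sum(fractions.values()) / len(fractions)
print(fractions, average_fraction)
```

Averaging the per-factor fractions gives each factor equal weight regardless of how many genes it binds. Pooling all bound genes before dividing would be a different summary.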
Modelling the evolution of transcription factor binding preferences in complex eukaryotes
Transcription factors (TFs) exert their regulatory action by binding to DNA
with specific sequence preferences. However, different TFs can partially share
their binding sequences due to their common evolutionary origin. This
`redundancy' of binding defines a way of organizing TFs in `motif families' by
grouping TFs with similar binding preferences. Since these ultimately define
the TF target genes, the motif family organization entails information about
the structure of transcriptional regulation as it has been shaped by evolution.
Focusing on the human TF repertoire, we show that a one-parameter evolutionary
model of the Birth-Death-Innovation type can explain the empirical
partition of TFs into motif families, and allows us to highlight the relevant
evolutionary forces at the origin of this organization. Moreover, the model
allows us to pinpoint a few deviations from the neutral scenario it assumes: three
over-expanded families (including HOX and FOX genes), a set of `singleton' TFs
for which duplication seems to be selected against, and a higher-than-average
rate of diversification of the binding preferences of TFs with a Zinc Finger
DNA binding domain. Finally, a comparison of the TF motif family organization
in different eukaryotic species suggests an increase of redundancy of binding
with organism complexity. Comment: 14 pages, 5 figures. Minor changes. Final version, accepted for
publicatio
Co-evolutionary networks of genes and cellular processes across fungal species
Two new measures of evolution are used to study co-evolutionary networks of fungal genes and cellular processes; links between co-evolution and co-functionality are revealed
A scale of functional divergence for yeast duplicated genes revealed from analysis of the protein-protein interaction network
BACKGROUND: Studying the evolution of the function of duplicated genes usually implies an estimation of the extent of functional conservation/divergence between duplicates from comparison of actual sequences. This only reveals the possible molecular function of genes without taking into account their cellular function(s). We took into consideration this latter dimension of gene function to approach the functional evolution of duplicated genes by analyzing the protein-protein interaction network in which their products are involved. For this, we derived a functional classification of the proteins using PRODISTIN, a bioinformatics method allowing comparison of protein function. Our work focused on the duplicated yeast genes, remnants of an ancient whole-genome duplication. RESULTS: Starting from 4,143 interactions, we analyzed 41 duplicated protein pairs with the PRODISTIN method. We showed that duplicated pairs behaved differently in the classification with respect to their interactors. The different observed behaviors allowed us to propose a functional scale of conservation/divergence for the duplicated genes, based on interaction data. By comparing our results to the functional information carried by GO annotations and sequence comparisons, we showed that the interaction network analysis reveals functional subtleties, which are not discernible by other means. Finally, we interpreted our results in terms of evolutionary scenarios. CONCLUSIONS: Our analysis might provide a new way to analyse the functional evolution of duplicated genes and constitutes the first attempt of protein function evolutionary comparisons based on protein-protein interactions
Emerging model species driven by transcriptomics
This work is focused on 'emerging model species', i.e. question-driven model species which have sufficient molecular resources to investigate a specific phenomenon in molecular biology, developmental biology, molecular ecology and evolution or related molecular fields. This thesis shows how transcriptomic data can be generated, analyzed, and used to investigate such phenomena of interest even in species lacking a reference genome. The initial ButterflyBase resource has proven to be useful to researchers of species without a reference genome but is limited to the Lepidoptera and supports only the older Sanger sequencing technologies. Thanks to Next Generation Sequencing, transcriptome sequencing is more cost-effective, but the bottleneck of transcriptomic projects is now the bioinformatic analysis and data mining/dissemination. Therefore, this work continues by presenting novel and innovative approaches which effectively overcome this bottleneck. The est2assembly software produces deeply annotated reference transcriptomes stored in the Chado database. The Drupal Bioinformatic Server Framework and genes4all provide a species-neutral and innovative approach to building standardized online databases and associated web services. All public insect mRNA data were analyzed with est2assembly and genes4all to produce InsectaCentral. With InsectaCentral, a powerful resource is now available to assist molecular biology in any question-driven model insect species. The software presented here was developed according to specifications of the General Model Organism Database (GMOD) community. All software specifications are species-neutral and can be seamlessly deployed to assist any research community. Further, through a case studies chapter, it becomes apparent that the transcriptomic approach is more cost-effective than a genomic approach and that sequence-driven evolutionary biology will therefore benefit sooner from it
Chapter Functional Annotation of Rare Genetic Variants
Genome-wide association studies have successfully identified a growing number of
common variants that robustly associate with a wide range of complex diseases and
phenotypes. In the majority of cases though, the variants are predicted to have small to
modest effect sizes, and, due to the technologies used, many of the signals discovered
so far may not be the causal loci. As rare variation studies begin to explore the lower
ranges of the allele frequency spectrum, using whole genome or whole exome
sequencing to capture a larger proportion of variants, we expect to find variants with a
more direct causal role in the phenotype(s) of interest. Interpreting possible functional
mechanisms linking variants with phenotypes will become increasingly important
Computational functional annotation of crop genomics using hierarchical orthologous groups
Improving agronomically important traits, such as yield, is important in order to meet the ever growing demands of increased crop production. Knowledge of the genes that have an effect on a given trait can be used to enhance genomic selection by prediction of biologically interesting loci. Candidate genes that are strongly linked to a desired trait can then be targeted by transformation or genome editing. This application of prioritisation of genetic material can accelerate crop improvement. However, the application of this is currently limited due to the lack of accurate annotations and methods to integrate experimental data with evolutionary relationships. Hierarchical orthologous groups (HOGs) provide nested groups of genes that enable the comparison of highly diverged and similar species in a consistent manner. Over 2,250 species are included in the OMA project, resulting in over 600,000 HOGs. This thesis provides the required methodology and a tool to exploit this rich source of information, in the HOGPROP algorithm. The potential of this is then demonstrated in mining crop genome data, from metabolic QTL studies and utilising Gene Ontology (GO) annotations as well as ChEBI terms (Chemical Entities of Biological Interest) in order to prioritise candidate causal genes. Gauging the performance of the tool is also important. When considering GO annotations, the CAFA series of community experiments has provided the most extensive benchmarking to-date. However, this has not fully taken into account the incomplete knowledge of protein function – the open world assumption (OWA). This will require extra negative annotations, for which one such source has been identified based on expertly curated gene phylogenies. These negative annotations are then utilised in the proposed, OWA-compliant, improved framework for benchmarking. 
The results show that current benchmarks tend to focus on the general terms, which means that conclusions are not merely uninformative, but misleading