99 research outputs found
Domestic chickens activate a piRNA defense against avian leukosis virus
PIWI-interacting RNAs (piRNAs) protect the germ line by targeting transposable elements (TEs) through base-pair complementarity. We do not know how piRNAs co-evolve with TEs in chickens. Here we report that all active TEs in the chicken germ line are targeted by piRNAs, and as TEs lose their activity, the corresponding piRNAs erode away. We observed de novo piRNA birth as the host responds to a recent retroviral invasion. Avian leukosis virus (ALV) endogenized prior to chicken domestication, remains infectious, and threatens the poultry industry. Domestic fowl produce piRNAs targeting ALV from one ALV provirus that was known to render its host ALV resistant. This proviral locus does not produce piRNAs in undomesticated wild chickens. Our findings uncover rapid piRNA evolution reflecting contemporary TE activity, identify a new piRNA acquisition modality by activating a pre-existing genomic locus, and extend piRNA defense roles to include the period when endogenous retroviruses are still infectious. DOI: http://dx.doi.org/10.7554/eLife.24695.00
Computational regulatory genomics : motifs, networks, and dynamics
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012. This electronic version was submitted by the student author. The certified thesis is available in the Institute Archives and Special Collections. Cataloged from student-submitted PDF version of thesis. Includes bibliographical references (p. 147-169). Gene regulation, the process responsible for taking a static genome and producing the diversity and complexity of life, is largely mediated through the sequence-specific binding of regulators. The short, degenerate nature of the recognized elements and the unknown rules through which they interact make deciphering gene regulation a significant challenge. In this thesis, we utilize comparative genomics and other approaches to exploit large-scale experimental datasets and better understand the sequence elements and regulators responsible for regulatory programs. In particular, we develop new computational approaches to (1) predict the binding sites of regulators using the genomes of many closely related species; (2) understand the sequence motifs associated with transcription factors; (3) discover and characterize microRNAs, an important class of regulators; (4) use static predictions for binding sites in conjunction with chromatin modifications to better understand the dynamics of regulation; and (5) systematically validate the predicted motif instances using a massively parallel reporter assay. We find that the predictions made by our algorithms are of high quality and are comparable to those made by leading experimental approaches. Moreover, we find that experimental and computational approaches are often complementary. Regions experimentally identified to be bound by a factor can be species and cell line specific, but they lack the resolution and unbiased nature of our predictions.
Experimentally identified miRNAs have unmistakable signs of being processed, but cannot provide the same insights our machine learning framework does. Further emphasizing the importance of integration, combining chromatin mark annotations and gene expression from multiple cell types with our static motif instances allows for increasing our power and making additional biologically relevant insights. We successfully apply the algorithms in this thesis to 29 mammals and 12 flies and expect them to be applicable to other clades of eukaryotic species. Moreover, we find that our performance has not yet plateaued and believe these methods will continue to be relevant as sequencing becomes increasingly commonplace and thousands of genomes become available. by Pouya Kheradpour. Ph.D
Exploiting gene expression and protein data for predicting remote homology and tissue specificity
In this thesis I describe my investigations of applying machine learning methods to high throughput experimental and predicted biological data. The importance of such analysis as a means of making inferences about biological functions is widely acknowledged in the bioinformatics community. Specifically, this work makes three novel contributions based on the systematic analysis of publicly archived data of protein sequences, three dimensional structures, gene expression and functional annotations: (a) remote homology detection based on amino acid sequences and secondary structures; (b) the analysis of tissue-specific gene expression for predictive signals in the sequence and secondary structure of the resulting protein product; and (c) a study of ageing in the fruit fly, a commonly used model organism, in which tissue specific and whole-organism gene expression changes are contrasted. In the problem of remote homology detection, a kernel-based method that combines pairwise alignment scores of amino acid sequences and secondary structures is shown to improve the prediction accuracies in a benchmark task defined using the Structural Classification of Proteins (SCOP) database. While the task of predicting SCOP superfamilies should be regarded as an easy one, with not much room for performance improvement, it is still widely accepted as the gold standard due to careful manual annotation by experts in the subject of protein evolution. A similar method is introduced to investigate whether tissue specificity of gene expression is correlated with the sequence and secondary structure of the resulting protein product. An information theoretic approach is adopted for sorting fruit fly and mouse genes according to their tissue specificity based on gene expression data. A classifier is then trained to predict the degree of specificity for these genes.
The study concludes that the tissue specificity of gene expression is correlated with the sequence, and to a certain extent, with the secondary structure of the gene’s protein product. The sorted list of genes introduced in the previous chapter is used to investigate the tissue specificity of transcript profiles obtained from a study of ageing in the fruit fly. The same list is utilised to investigate how filtering tissue-restricted genes affects gene set enrichment analysis in the ageing study, and to examine the specificity of age-associated genes identified in the literature. The conclusion drawn in this chapter is that categorisation of genes according to their tissue specificity using Shannon’s information theory is useful for the interpretation of whole-fly gene expression data
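As an illustration of the information-theoretic sorting mentioned in this abstract, the following is a minimal sketch that ranks genes by the Shannon entropy of their expression distribution across tissues. The abstract does not state the exact measure used in the thesis, so the entropy definition, the function names, and the toy expression values are all assumptions made for illustration only.

```python
import math


def tissue_entropy(expression):
    """Shannon entropy (in bits) of one gene's expression across tissues.

    `expression` maps tissue name -> non-negative expression level.
    Low entropy means expression is concentrated in few tissues
    (tissue-specific); high entropy means broad expression.
    """
    total = sum(expression.values())
    if total <= 0:
        raise ValueError("gene has no expression in any tissue")
    entropy = 0.0
    for level in expression.values():
        if level > 0:  # zero-expression tissues contribute nothing
            p = level / total
            entropy -= p * math.log2(p)
    return entropy


def rank_by_specificity(genes):
    """Return gene names sorted from most to least tissue-specific."""
    return sorted(genes, key=lambda name: tissue_entropy(genes[name]))


# Hypothetical toy data, not taken from the thesis.
genes = {
    "geneA": {"brain": 100.0, "gut": 0.0, "muscle": 0.0, "testis": 0.0},
    "geneB": {"brain": 25.0, "gut": 25.0, "muscle": 25.0, "testis": 25.0},
    "geneC": {"brain": 50.0, "gut": 50.0, "muscle": 0.0, "testis": 0.0},
}
ranking = rank_by_specificity(genes)
```

In this toy case a gene expressed in a single tissue has entropy 0 and a gene expressed evenly in four tissues has entropy 2 bits, so the sorted list runs from tissue-restricted to broadly expressed genes.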
Putting the Pieces Together: Exons and piRNAs: A Dissertation
Analysis of gene expression has undergone a technological revolution. What was impossible 6 years ago is now routine. High-throughput DNA sequencing machines capable of generating hundreds of millions of reads allow, indeed force, a major revision toward the study of the genome’s functional output—the transcriptome. This thesis examines the history of DNA sequencing, measurement of gene expression by sequencing, isoform complexity driven by alternative splicing and mammalian piRNA precursor biogenesis. Examination of these topics is framed around development of a novel RNA-templated DNA-DNA ligation assay (SeqZip) that allows for efficient analysis of abundant, complex, and functional long RNAs. The discussion focuses on the future of transcriptome analysis, development and applications of SeqZip, and challenges presented to biomedical researchers by extremely large and rich datasets
Systematically Mapping the Epigenetic Context Dependence of Transcription Factor Binding
At the core of gene regulatory networks are transcription factors (TFs) that recognize specific DNA sequences and target distinct gene sets. Characterizing the DNA binding specificity of all TFs is a prerequisite for understanding global gene regulatory logic, which in recent years has resulted in the development of high-throughput methods that probe TF specificity in vitro and are now routinely used to inform or interpret in vivo studies. Despite the broad success of such methods, several challenges remain, two of which are addressed in this thesis.
Genomic DNA can harbor different epigenetic marks that have the potential to alter TF binding, the most prominent being CpG methylation. Given the vast number of modified CpGs in the human genome and an increasing body of literature suggesting a link between epigenetic changes and genome instability, or the onset of diseases such as cancer, methods that can characterize the sensitivity of TFs to DNA methylation are needed to mechanistically interpret its impact on gene expression. We developed a high-throughput in vitro method (EpiSELEX-seq) that probes TF binding to unmodified and modified DNA sequences in competition, resulting in high-resolution maps of TF binding preferences. We found that methylation sensitivity can vary between TFs of the same structural family and is dependent on the position of the 5mCpG within the TF binding site. The importance of our in vitro profiling of methylation sensitivity is demonstrated by the preference of human p53 tetramers for 5mCpGs within its binding site core. This previously unknown, stabilizing effect is also detectable in p53 ChIP-seq data when comparing methylated and unmethylated sites genome-wide.
A second impediment to predicting TF binding is our limited understanding of i) how cooperative participation of a TF in different complexes can alter its binding preference, and ii) how the detailed shape of DNA aids in creating a substrate for adaptive multi-TF binding. To address these questions in detail, we studied the in vitro binding preferences of three D. melanogaster homeodomain TFs: Homothorax (Hth), Extradenticle (Exd) and one of the eight Hox proteins. In vivo, Hth occurs in two splice forms: with (HthFL) and without (HthHM) the DNA binding domain (DBD). HthHM-Exd itself is a Hox cofactor that has been shown to induce latent sequence specificity upon complex formation with Hox proteins. There are three possible complexes that can be formed, all potentially having specific target genes: HthHM-Exd-Hox, HthFL-Exd-Hox, and HthFL-Exd. We characterized the in vitro binding preferences of each of these by developing new computational approaches to analyze high-throughput SELEX-seq data. We found distinct orientation and spacing preferences for HthFL-Exd-Hox, alternative recognition modes that depend on the affinity class a sequence falls into, and a strong preference for a narrow DNA minor groove near Exd's N-terminal DBD. Strikingly, this shape readout is crucial to stabilize the HthHM-Exd-Hox complex in the absence of a Hth DBD and can thus be used to distinguish HthHM from HthFL isoform binding. Mutating the amino acids responsible for the shape readout by Exd and reinserting the engineered protein into the fly genome allowed us to classify in vivo binding sites based on ChIP-seq signal comparison between “shape-mutant” and wild-type Exd.
In summary, the research presented here has investigated TF binding preferences beyond sequence context by combining novel high-throughput experimental and computational methods. This interdisciplinary approach has enabled us to study binding preferences of TF complexes with respect to the epigenetic landscape of their cognate binding sites. Our novel mechanistic insights into DNA shape readout have provided a new avenue of exploiting guided protein engineering to probe how specific TFs interact with their co-factors in a cellular context, and how flanking genomic sequence helps determine which multi-TF complexes will form and which binding mode a complex adopts
Molecular Mechanisms of piRNA Biogenesis and Function in Drosophila: A Dissertation
In the Drosophila germ line, PIWI-interacting RNAs (piRNAs) ensure genomic stability by silencing endogenous selfish genetic elements such as retrotransposons and repetitive sequences.
We examined the genetic requirements for the biogenesis and function of piRNAs in both the female and male germ lines. We found that piRNAs function through the PIWI, rather than the AGO, family Argonaute proteins, and that the production of piRNAs requires neither microRNA (miRNA) nor small interfering RNA (siRNA) pathway machinery. These findings allowed the discovery of the third conserved small RNA silencing pathway, which is distinct from both the miRNA and RNAi pathways in its mechanisms of biogenesis and function.
We also found piRNAs in flies are modified. We determined that the chemical structure of the 3´-terminal modification is a 2´-O-methyl group, and also demonstrated that the same modification occurs on the 3´ termini of siRNAs in flies. Furthermore, we identified the RNA methyltransferase Drosophila Hen1, which catalyzes 2´-O-methylation on both siRNAs and piRNAs. Our data suggest that 2´-O-methylation by Hen1 is the final step of biogenesis of both the siRNA pathway and piRNA pathway.
Studies from the Hannon Lab and the Siomi Lab suggest a ping-pong amplification loop for piRNA biogenesis and function in the Drosophila germline. In this model, an antisense piRNA, bound to Aubergine or Piwi, triggers production of a sense piRNA bound to the PIWI protein Argonaute3 (Ago3). In turn, the new piRNA is envisioned to produce a second antisense piRNA. We isolated the loss-of-function mutations in ago3, allowing a direct genetic test of this model. We found that Ago3 acts to amplify piRNA pools and to enforce on them an antisense bias, increasing the number of piRNAs that can act to silence transposons. Moreover, we also discovered a second Ago3-independent piRNA pathway in somatic ovarian follicle cells, suggesting a role for piRNAs beyond the germ line
Applied Bioinformatics for ncRNA Characterization - Case Studies Combining Next Generation Sequencing & Genomics
Non-coding RNAs (ncRNAs) present a diverse class of functional molecules inherent in virtually all forms of cellular life. Besides the canonical protein-encoding mRNAs, the role of these abundant transcripts has been overlooked for decades. Defined by their highly conserved structure, ncRNAs are resistant to degradation and perform various regulatory functions. Despite the poor sequence conservation, comparative genomics can be employed to identify homologous ncRNAs based on their structure in related species. Through the availability of next generation sequencing techniques, a rich corpus of datasets is available which grants a detailed look into cellular processes. The combination of genomic and transcriptomic data allows for a detailed understanding of molecular mechanisms as well as characterization of individual gene functions and their evolution. However, analytical processing of modern high-throughput data is only made viable through optimized bioinformatic algorithms and reproducible automation pipelines.
This thesis consists of four major parts highlighting the diverse roles of ncRNAs concerning the transcription process viewed from different vantage points. The first part concerns an unusually long untranslated region in Rhodobacter which harbors a ncRNA that regulates the expression of the downstream division cell wall cluster. Second, the degradation of 6S RNA in Bacillus subtilis is experimentally reconstructed to shed light on this final part of the RNA life cycle. This ncRNA is ubiquitous among bacteria and known to be a global transcription regulator itself. Next, the focus moves to the eukaryotic system and RNase P, an ancient ribozyme that is involved in tRNA maturation. Due to differences in composition with an optional RNA and multiple protein subunits, its phylogenetic distribution and deviant characteristics throughout the eukaryotic lineage are examined in order to trace its evolution. Finally, a diverse subgroup of non-translated RNAs are circRNAs, which recently received increased attention due to their abundance in neural tissue. Resulting from post-transcriptional back-splicing events, circRNAs compete with their host gene for expression. In a zoological study of social insects, circRNAs were identified in honeybees for the first time. The goal was to find task-related differences in circRNA expression between nurse bees and foragers and thus pinpoint potential functions of these elusive ncRNAs.
The combination of genomic methods and transcriptomic data makes in-depth functional analysis of ncRNAs possible and enables us to understand the molecular mechanisms on multiple levels. Through structural predictions, a riboswitch-like transcriptional control of UpsM was revealed that is unique to Rhodobacteraceae. Transcriptomic analysis exposed that 6S RNA is primarily processed by RNase J1 for maturation and degraded at internal loops by RNase Y. Evolutionary comparison of organellar RNase P revealed that the RNA subunit is potentially less conserved than thought, while organellar protein-only variants are widespread, potentially due to horizontal gene transfer. In the case of circRNA, an entire group of ncRNAs was characterized in the social model organism of honeybees, and evidence of at least one gene where circRNA levels are significantly reduced during nurse-to-forager transition could be shown. Moreover, an unexpected link between elevated DNA methylation and RNA circularization was discovered. The bioinformatic findings in all of these cases provide a foundation for further experimental research and illustrate how scientific endeavors cannot be automated completely but require rigorous investigation with customized tools
Bioinformatic analysis of genome-scale data reveals insights into host-pathogen interactions in farm animals
This thesis documents the contribution of my bioinformatics research activities, including
novel software development, to a range of research projects aimed at investigating the
interactions between bacterial and viral pathogens and their hosts. The focus is largely on
farm animal species and their pathogens, although some of the research has a wider
scientific impact.
RNA interference (RNAi) refers to a variety of related regulatory pathways present in
animals, plants and insects. The major pathways are microRNAs (miRNAs), small-interfering
RNAs (siRNAs) and PIWI-interacting RNAs (piRNAs). Marek’s disease virus is an important
pathogen of poultry, causing T-cell lymphoma. We identified the presence and expression
patterns of several MDV-encoded microRNAs, including the identification of 5 novel
microRNAs. We also showed that not only do virus-encoded microRNAs dominate the
miRNome within chicken cells, but also that specific host-microRNAs are down-regulated.
We also identify novel virus-encoded microRNAs in other Herpesviridae and provide the
first evidence of miRNA evolution by duplication in viruses. In related work, we present a
novel microRNA generated by the canonical miRNA biogenesis pathway in Avian Leukosis
Virus, another avian oncogenic virus, and publish data showing the expression pattern of
known chicken microRNAs across a range of important avian cells. Two of the other RNAi
pathways (siRNA and piRNA) form an important part of the antiviral response in
arthropods. We have published work demonstrating an siRNA antiviral response to
bluetongue virus and Schmallenberg virus in cells from the Culicoides midge, an important
insect vector, as well as work demonstrating the importance of the piRNA pathway in the
antiviral response to Semliki Forest virus (SFV). Further work on flaviviruses in ticks
demonstrates the active suppression of the siRNA response by Langat Virus, as well as a key
difference between the siRNA responses in mosquitoes compared to ticks.
Salmonella is one of the most important zoonoses, with an estimated 1.4 million cases of
human salmonellosis per annum in the USA alone. Salmonella infections of farm animals
are an important route into the human food chain. This thesis presents work on the
comparative structure and function of 13 fimbrial operons within Salmonella enterica
serovar Enteritidis as well as a genomic comparison of that serovar with Salmonella
enterica serovar Gallinarum, a chicken-specific serovar. We characterised the global
expression profile of Salmonella enterica serovar Typhimurium during colonization of the
chicken intestine, and we have published the genomes of four strains of Salmonella
enterica serovars of well-defined virulence in food-producing animals. Our work in this
area led to us publishing an important and comprehensive review of the automatic
annotation of bacterial genomes.
Finally, I present work on novel software development: ProGenExpress, a software tool
that allows the easy and accurate integration and visualisation of quantitative data with the
genome annotation of bacteria; Meta4, a web application that allows data sharing of
bacterial genome annotations from metagenomes; CORNA, a software tool that allows
scientists to link together microRNA targets, gene expression and functional annotation;
viRome, a software tool for the analysis of siRNA and piRNA responses in virus-infection
studies; DetectiV, a software tool for the analysis of pathogen-detection microarray data;
and poRe, a software tool that enables users to organise and analyse nanopore sequencing
dat
- …