28 research outputs found

    SeqAn: An efficient, generic C++ library for sequence analysis

    Background: The use of novel algorithmic techniques is pivotal to many important problems in life science. For example the sequencing of the human genome [1] would not have been possible without advanced assembly algorithms. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there is a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use. Results: To remedy this trend we propose the use of SeqAn, a library of efficient data types and algorithms for sequence analysis in computational biology. SeqAn comprises implementations of existing, practical state-of-the-art algorithmic components to provide a sound basis for algorithm testing and development. In this paper we describe the design and content of SeqAn and demonstrate its use by giving two examples. In the first example we show an application of SeqAn as an experimental platform by comparing different exact string matching algorithms. The second example is a simple version of the well-known MUMmer tool rewritten in SeqAn. Results indicate that our implementation is very efficient and versatile to use. Conclusion: We anticipate that SeqAn greatly simplifies the rapid development of new bioinformatics tools by providing a collection of readily usable, well-designed algorithmic components which are fundamental for the field of sequence analysis. This leverages not only the implementation of new algorithms, but also enables a sound analysis and comparison of existing algorithms.

    Systematic identification of conserved motif modules in the human genome

    Background: The identification of motif modules, groups of multiple motifs frequently occurring in DNA sequences, is one of the most important tasks necessary for annotating the human genome. Current approaches to identifying motif modules are often restricted to searches within promoter regions or rely on multiple genome alignments. However, the promoter regions only account for a limited number of locations where transcription factor binding sites can occur, and multiple genome alignments often cannot align binding sites with their true counterparts because of the short and degenerative nature of these transcription factor binding sites. Results: To identify motif modules systematically, we developed a computational method for the entire non-coding regions around human genes that does not rely upon the use of multiple genome alignments. First, we selected orthologous DNA blocks approximately 1-kilobase in length based on discontiguous sequence similarity. Next, we scanned the conserved segments in these blocks using known motifs in the TRANSFAC database. Finally, a frequent pattern mining technique was applied to identify motif modules within these blocks. In total, with a false discovery rate cutoff of 0.05, we predicted 3,161,839 motif modules, 90.8% of which are supported by various forms of functional evidence. Compared with experimental data from 14 ChIP-seq experiments, on average, our methods predicted 69.6% of the ChIP-seq peaks with TFBSs of multiple TFs. Our findings also show that many motif modules have distance preference and order preference among the motifs, which further supports the functionality of these predictions. Conclusions: Our work provides a large-scale prediction of motif modules in mammals, which will facilitate the understanding of gene regulation in a systematic way.

    Mouse Transgenesis Identifies Conserved Functional Enhancers and cis-Regulatory Motif in the Vertebrate LIM Homeobox Gene Lhx2 Locus

    The vertebrate Lhx2 is a member of the LIM homeobox family of transcription factors. It is essential for the normal development of the forebrain, eye, olfactory system and liver as well as for the differentiation of lymphoid cells. However, despite the highly restricted spatio-temporal expression pattern of Lhx2, nothing is known about its transcriptional regulation. In mammals and chicken, Crb2, Dennd1a and Lhx2 constitute a conserved linkage block, while the intervening Dennd1a is lost in the fugu Lhx2 locus. To identify functional enhancers of Lhx2, we predicted conserved noncoding elements (CNEs) in the human, mouse and fugu Crb2-Lhx2 loci and assayed their function in transgenic mice at E11.5. Four of the eight CNE constructs tested functioned as tissue-specific enhancers in specific regions of the central nervous system and the dorsal root ganglia (DRG), recapitulating partial and overlapping expression patterns of Lhx2 and Crb2 genes. There was considerable overlap in the expression domains of the CNEs, which suggests that the CNEs are either redundant enhancers or regulating different genes in the locus. Using a large set of CNEs (810 CNEs) associated with transcription factor-encoding genes that express predominantly in the central nervous system, we predicted four over-represented 8-mer motifs that are likely to be associated with expression in the central nervous system. Mutation of one of them in a CNE that drove reporter expression in the neural tube and DRG abolished expression in both domains, indicating that this motif is essential for expression in these domains. The failure of the four functional enhancers to recapitulate the complete expression pattern of Lhx2 at E11.5 indicates that there must be other Lhx2 enhancers that are either located outside the region investigated or divergent in mammals and fishes. Other approaches such as sequence comparison between multiple mammals are required to identify and characterize such enhancers.

    Pregnane X Receptor and Yin Yang 1 Contribute to the Differential Tissue Expression and Induction of CYP3A5 and CYP3A4

    The hepato-intestinal induction of the detoxifying enzymes CYP3A4 and CYP3A5 by the xenosensing pregnane X receptor (PXR) constitutes a key adaptive response to oral drugs and dietary xenobiotics. In contrast to CYP3A4, CYP3A5 is additionally expressed in several, mostly steroidogenic organs, which creates potential for induction-driven disturbances of the steroid homeostasis. Using cell lines and mice transgenic for a CYP3A5 promoter we demonstrate that the CYP3A5 expression in these organs is non-inducible and independent from PXR. Instead, it is enabled by the loss of a suppressing yin yang 1 (YY1)-binding site from the CYP3A5 promoter which occurred in haplorrhine primates. This YY1 site is conserved in CYP3A4, but its inhibitory effect can be offset by PXR acting on response elements such as XREM. Taken together, the loss of the YY1 binding site from promoters of the CYP3A5 gene lineage during primate evolution may have enabled the utilization of CYP3A5 both in the adaptive hepato-intestinal response to xenobiotics and as a constitutively expressed gene in other organs. Our results thus constitute a first description of uncoupling induction from constitutive expression for a major detoxifying enzyme. They also suggest an explanation for the considerable tissue expression differences between CYP3A5 and CYP3A4.

    Melanism in Peromyscus Is Caused by Independent Mutations in Agouti

    Identifying the molecular basis of phenotypes that have evolved independently can provide insight into the ways genetic and developmental constraints influence the maintenance of phenotypic diversity. Melanic (darkly pigmented) phenotypes in mammals provide a potent system in which to study the genetic basis of naturally occurring mutant phenotypes because melanism occurs in many mammals, and the mammalian pigmentation pathway is well understood. Spontaneous alleles of a few key pigmentation loci are known to cause melanism in domestic or laboratory populations of mammals, but in natural populations, mutations at one gene, the melanocortin-1 receptor (Mc1r), have been implicated in the vast majority of cases, possibly due to its minimal pleiotropic effects. To investigate whether mutations in this or other genes cause melanism in the wild, we investigated the genetic basis of melanism in the rodent genus Peromyscus, in which melanic mice have been reported in several populations. We focused on two genes known to cause melanism in other taxa, Mc1r and its antagonist, the agouti signaling protein (Agouti). While variation in the Mc1r coding region does not correlate with melanism in any population, in a New Hampshire population, we find that a 125-kb deletion, which includes the upstream regulatory region and exons 1 and 2 of Agouti, results in a loss of Agouti expression and is perfectly associated with melanic color. In a second population from Alaska, we find that a premature stop codon in exon 3 of Agouti is associated with a similar melanic phenotype. These results show that melanism has evolved independently in these populations through mutations in the same gene, and suggest that melanism produced by mutations in genes other than Mc1r may be more common than previously thought.

    Assessing Computational Methods of Cis-Regulatory Module Prediction

    Computational methods attempting to identify instances of cis-regulatory modules (CRMs) in the genome face a challenging problem of searching for potentially interacting transcription factor binding sites while knowledge of the specific interactions involved remains limited. Without a comprehensive comparison of their performance, the reliability and accuracy of these tools remain unclear. Faced with a large number of different tools that address this problem, we summarized and categorized them based on search strategy and input data requirements. Twelve representative methods were chosen and applied to predict CRMs from the Drosophila CRM database REDfly, and across the human ENCODE regions. Our results show that the optimal choice of method varies depending on species and composition of the sequences in question. When discriminating CRMs from non-coding regions, those methods considering evolutionary conservation have a stronger predictive power than methods designed to be run on a single genome. Different CRM representations and search strategies rely on different CRM properties, and different methods can complement one another. For example, some favour homotypical clusters of binding sites, while others perform best on short CRMs. Furthermore, most methods appear to be sensitive to the composition and structure of the genome to which they are applied. We analyze the principal features that distinguish the methods that performed well, identify weaknesses leading to poor performance, and provide a guide for users. We also propose key considerations for the development and evaluation of future CRM-prediction methods.

    Intronic Cis-Regulatory Modules Mediate Tissue-Specific and Microbial Control of angptl4/fiaf Transcription

    The intestinal microbiota enhances dietary energy harvest leading to increased fat storage in adipose tissues. This effect is caused in part by the microbial suppression of intestinal epithelial expression of a circulating inhibitor of lipoprotein lipase called Angiopoietin-like 4 (Angptl4/Fiaf). To define the cis-regulatory mechanisms underlying intestine-specific and microbial control of Angptl4 transcription, we utilized the zebrafish system in which host regulatory DNA can be rapidly analyzed in a live, transparent, and gnotobiotic vertebrate. We found that zebrafish angptl4 is transcribed in multiple tissues including the liver, pancreatic islet, and intestinal epithelium, which is similar to its mammalian homologs. Zebrafish angptl4 is also specifically suppressed in the intestinal epithelium upon colonization with a microbiota. In vivo transgenic reporter assays identified discrete tissue-specific regulatory modules within angptl4 intron 3 sufficient to drive expression in the liver, pancreatic islet β-cells, or intestinal enterocytes. Comparative sequence analyses and heterologous functional assays of angptl4 intron 3 sequences from 12 teleost fish species revealed differential evolution of the islet and intestinal regulatory modules. High-resolution functional mapping and site-directed mutagenesis defined the minimal set of regulatory sequences required for intestinal activity. Strikingly, the microbiota suppressed the transcriptional activity of the intestine-specific regulatory module similar to the endogenous angptl4 gene. These results suggest that the microbiota might regulate host intestinal Angptl4 protein expression and peripheral fat storage by suppressing the activity of an intestine-specific transcriptional enhancer. This study provides a useful paradigm for understanding how microbial signals interact with tissue-specific regulatory networks to control the activity and evolution of host gene transcription.

    A Wide Extent of Inter-Strain Diversity in Virulent and Vaccine Strains of Alphaherpesviruses

    Alphaherpesviruses are widespread in the human population, and include herpes simplex virus 1 (HSV-1) and 2, and varicella zoster virus (VZV). These viral pathogens cause epithelial lesions, and then infect the nervous system to cause lifelong latency, reactivation, and spread. A related veterinary herpesvirus, pseudorabies (PRV), causes similar disease in livestock that results in significant economic losses. Vaccines developed for VZV and PRV serve as useful models for the development of an HSV-1 vaccine. We present full genome sequence comparisons of the PRV vaccine strain Bartha, and two virulent PRV isolates, Kaplan and Becker. These genome sequences were determined by high-throughput sequencing and assembly, and present new insights into the attenuation of a mammalian alphaherpesvirus vaccine strain. We find many previously unknown coding differences between PRV Bartha and the virulent strains, including changes to the fusion proteins gH and gB, and over forty other viral proteins. Inter-strain variation in PRV protein sequences is much closer to levels previously observed for HSV-1 than for the highly stable VZV proteome. Almost 20% of the PRV genome contains tandem short sequence repeats (SSRs), a class of nucleic acid motifs whose length-variation has been associated with changes in DNA binding site efficiency, transcriptional regulation, and protein interactions. We find SSRs throughout the herpesvirus family, and provide the first global characterization of SSRs in viruses, both within and between strains. We find SSR length variation between different isolates of PRV and HSV-1, which may provide a new mechanism for phenotypic variation between strains. Finally, we detected a small number of polymorphic bases within each plaque-purified PRV strain, and we characterize the effect of passage and plaque-purification on these polymorphisms. These data add to growing evidence that even plaque-purified stocks of stable DNA viruses exhibit limited sequence heterogeneity, which likely seeds future strain evolution.