236 research outputs found

    Finding sequence motifs with Bayesian models incorporating positional information: an application to transcription factor binding sites

    Background: Biologically active sequence motifs often have positional preferences with respect to a genomic landmark. For example, many known transcription factor binding sites (TFBSs) occur within an interval [-300, 0] bases upstream of a transcription start site (TSS). Although some programs for identifying sequence motifs exploit positional information, most of them model it only implicitly and with ad hoc methods, making them unsuitable for general motif searches. Results: A-GLAM, a user-friendly computer program for identifying sequence motifs, now incorporates a Bayesian model systematically combining sequence and positional information. A-GLAM's predictions with and without positional information were compared on two human TFBS datasets, each containing sequences corresponding to the interval [-2000, 0] bases upstream of a known TSS. A rigorous statistical analysis showed that positional information significantly improved the prediction of sequence motifs, and an extensive cross-validation study showed that A-GLAM's model was robust against mild misspecification of its parameters. As expected, when sequences in the datasets were successively truncated to the intervals [-1000, 0], [-500, 0] and [-250, 0], positional information aided motif prediction less and less, but never hurt it significantly. Conclusion: Although sequence truncation is a viable strategy when searching for biologically active motifs with a positional preference, a probabilistic model (used reasonably) generally provides a superior and more robust strategy, particularly when the sequence motifs' positional preferences are not well characterized.

    Scanning sequences after Gibbs sampling to find multiple occurrences of functional elements

    BACKGROUND: Many DNA regulatory elements occur as multiple instances within a target promoter. Gibbs sampling programs for finding DNA regulatory elements de novo can be prohibitively slow in locating all instances of such an element in a sequence set. RESULTS: We describe an improvement to the A-GLAM computer program, which predicts regulatory elements within DNA sequences with Gibbs sampling. The improvement adds an optional "scanning step" after Gibbs sampling. Gibbs sampling produces a position specific scoring matrix (PSSM). The new scanning step resembles an iterative PSI-BLAST search based on the PSSM. First, it assigns an "individual score" to each subsequence of appropriate length within the input sequences using the initial PSSM. Second, it computes an E-value from each individual score, to assess the agreement between the corresponding subsequence and the PSSM. Third, it permits subsequences with E-values falling below a threshold to contribute to the underlying PSSM, which is then updated using the Bayesian calculus. A-GLAM iterates its scanning step to convergence, at which point no new subsequences contribute to the PSSM. After convergence, A-GLAM reports predicted regulatory elements within each sequence in order of increasing E-values, so users have a statistical evaluation of the predicted elements in a convenient presentation. Thus, although the Gibbs sampling step in A-GLAM finds at most one regulatory element per input sequence, the scanning step can now rapidly locate further instances of the element in each sequence. CONCLUSION: Datasets from experiments determining the binding sites of transcription factors were used to evaluate the improvement to A-GLAM. Typically, the datasets included several sequences containing multiple instances of a regulatory motif. The improvements to A-GLAM permitted it to predict the multiple instances
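The scanning step described in this abstract (score every window with the PSSM, let windows that pass a cutoff contribute to the PSSM, repeat until no new windows are added) can be illustrated with a minimal sketch. This is not A-GLAM's implementation. The function names, the uniform background, the pseudocount, and above all the raw log-odds score cutoff used here in place of A-GLAM's E-value threshold are assumptions made for illustration only.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PSSM from equal-length sites (uniform background is an assumption)."""
    width = len(sites[0])
    pssm = []
    for col in range(width):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {base: math.log2(counts[base] / total / background) for base in ALPHABET}
        )
    return pssm


def score(pssm, window):
    """Sum of per-column log-odds values for one window."""
    return sum(column[base] for column, base in zip(pssm, window))


def scan_to_convergence(sequences, seed_hits, width, threshold, max_iter=50):
    """Grow the hit set until a scan adds no new (sequence index, start) pairs.

    `threshold` is a raw score cutoff, a hypothetical stand-in for the
    E-value threshold mentioned in the abstract.
    """
    hits = set(seed_hits)
    for _ in range(max_iter):
        sites = [sequences[i][p:p + width] for i, p in sorted(hits)]
        pssm = build_pssm(sites)
        passing = {
            (i, p)
            for i, seq in enumerate(sequences)
            for p in range(len(seq) - width + 1)
            if score(pssm, seq[p:p + width]) >= threshold
        }
        if passing <= hits:  # converged: nothing new contributes to the PSSM
            break
        hits |= passing
    return sorted(hits)


if __name__ == "__main__":
    seqs = ["TTTTGATAAGTTTTGATAAGTT", "CCCCGATAAGCCCC", "ACGTACGTACGT"]
    # One seed site per sequence, as a Gibbs sampling step might supply.
    print(scan_to_convergence(seqs, [(0, 4), (1, 4)], width=6, threshold=5.5))
```

In this toy input the first sequence carries a second copy of the seed motif, which the scan picks up in its first pass; the second pass adds nothing, so the loop stops. Overlapping hits are not filtered, which a real tool would likely need to handle.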

    G-Anchor: A novel approach for whole-genome comparative mapping utilising evolutionary conserved DNA sequences

    Background: Cross-species whole-genome sequence alignment is a critical first step for genome comparative analyses ranging from the detection of sequence variants to studies of chromosome evolution. Animal genomes are large and complex, and whole-genome alignment is a computationally intense process, requiring expensive high performance computing systems due to the need to explore extensive local alignments. With hundreds of sequenced animal genomes available now from multiple projects there is an increasing demand for genome comparative analyses. Results: Here we introduce G-Anchor, a new, fast, and efficient pipeline that uses a strictly limited but highly effective set of local sequence alignments to anchor (or map) an animal genome to another species' reference genome. G-Anchor makes novel use of a databank of highly conserved DNA sequence elements. We demonstrate how these elements may be aligned to a pair of genomes, creating anchors. These anchors enable the rapid mapping of scaffolds from a de novo assembled genome to chromosome assemblies of a reference species. Our results demonstrate that G-Anchor can successfully anchor a vertebrate genome onto a phylogenetically related reference species genome using a desktop or laptop computer within a few hours, and with comparable accuracy to that achieved by a highly accurate whole-genome alignment tool such as LASTZ. G-Anchor thus makes whole-genome comparisons accessible to researchers with limited computational resources. Conclusions: G-Anchor is a ready-to-use tool for anchoring a pair of vertebrate genomes. It may be used with large genomes that contain a significant fraction of evolutionarily conserved DNA sequences, and that are not highly repetitive, polyploid or excessively fragmented. G-Anchor is not a substitute for whole-genome aligning software but can be used for fast and accurate initial genome comparisons.
G-Anchor is freely available via https://github.com/vasilislenis/G-Anchor

    Exopolysaccharide-associated protein sorting in environmental organisms: the PEP-CTERM/EpsH system. Application of a novel phylogenetic profiling heuristic

    BACKGROUND: Protein translocation to the proper cellular destination may be guided by various classes of sorting signals recognizable in the primary sequence. Detection in some genomes, but not others, may reveal sorting system components by comparison of the phylogenetic profile of the class of sorting signal to that of various protein families. RESULTS: We describe a short C-terminal homology domain, sporadically distributed in bacteria, with several key characteristics of protein sorting signals. The domain includes a near-invariant motif Pro-Glu-Pro (PEP). This possible recognition or processing site is followed by a predicted transmembrane helix and a cluster rich in basic amino acids. We designate this domain PEP-CTERM. It tends to occur multiple times in a genome if it occurs at all, with a median count of eight instances; Verrucomicrobium spinosum has sixty-five. PEP-CTERM-containing proteins generally contain an N-terminal signal peptide and exhibit high diversity and little homology to known proteins. All bacteria with PEP-CTERM have both an outer membrane and exopolysaccharide (EPS) production genes. By a simple heuristic for screening phylogenetic profiles in the absence of pre-formed protein families, we discovered that a homolog of the membrane protein EpsH (exopolysaccharide locus protein H) occurs in a species when PEP-CTERM domains are found. The EpsH family contains invariant residues consistent with a transpeptidase function. Most PEP-CTERM proteins are encoded by single-gene operons preceded by large intergenic regions. In the Proteobacteria, most of these upstream regions share a DNA sequence, a probable cis-regulatory site that contains a sigma-54 binding motif. The phylogenetic profile for this DNA sequence exactly matches that of three proteins: a sigma-54-interacting response regulator (PrsR), a transmembrane histidine kinase (PrsK), and a TPR protein (PrsT). 
CONCLUSION: These findings are consistent with the hypothesis that PEP-CTERM and EpsH form a protein export sorting system, analogous to the LPXTG/sortase system of Gram-positive bacteria, and correlated to EPS expression. It occurs preferentially in bacteria from sediments, soils, and biofilms. The novel method that led to these findings, partial phylogenetic profiling, requires neither global sequence clustering nor arbitrary similarity cutoffs and appears to be a rapid, effective alternative to other profiling methods

    A clustering property of highly-degenerate transcription factor binding sites in the mammalian genome

    Transcription factor binding sites (TFBSs) are short DNA sequences interacting with transcription factors (TFs), which regulate gene expression. Due to the relatively short length of such binding sites, it is largely unclear how the specificity of protein–DNA interaction is achieved. Here, we have performed a genome-wide analysis of TFBS-like sequences for the transcriptional repressor, RE1 Silencing Transcription Factor (REST), as well as for several other representative mammalian TFs (c-myc, p53, HNF-1 and CREB). We find a nonrandom distribution of inexact sites for these TFs, referred to as highly-degenerate TFBSs, that are enriched around the cognate binding sites. Comparisons among human, mouse and rat orthologous promoters reveal that these highly-degenerate sites are conserved significantly more than expected by random chance, suggesting their positive selection during evolution. We propose that this arrangement provides a favorable genomic landscape for functional target site selection

    Integration of host-pathogen functional genomics data into the chromosome-level genome assembly of turbot (Scophthalmus maximus)

    Disease resilience is of utmost relevance for turbot aquaculture. Several infectious diseases, covering a broad spectrum from viruses and bacteria to different parasites, have been identified by industry. Since they increase mortality rates, reduce feed conversion ratios and slow down growth rate, genetic breeding programs for increasing disease resilience are recognized as a useful alternative for controlling pathologies. For this, knowledge of the genetic basis underlying resilience, obtained using genomic tools, is essential to develop the most effective breeding strategies. In the present study, we compiled the existing genomic information generated in the last decade to construct an integrated atlas of candidate genes and genomic regions involved in pathogen resistance against the main turbot industrial pathogens (Aeromonas salmonicida, Philasterides dicentrarchi, Enteromyxum scophthalmi and the VHS virus) within the chromosome-level turbot genome assembly recently released. The information comprises reannotated differentially expressed genes (DEG) in different tissues along temporal series, QTL markers associated with important productive traits (disease resistance and growth) and signatures of domestic or wild selection, represented by runs of homozygosity (ROHi) islands and outlier markers for divergent selection. Most genetic features were successfully relocated in the turbot assembly including 81.1% of the total DEGs, plus all QTL markers, ROHi and outlier markers. The updated annotation of DEGs for resistance to each pathology demonstrated significant changes. While the new annotation of 53–83% of the DEGs was coherent with the original, roughly 10–24% showed imprecise annotations in both assembly versions, ∼5% lost their original annotation and 2–24% were newly annotated.
Functional enrichment revealed mostly functions related to immune response, such as chemotaxis, apoptosis regulation, leukocyte differentiation, cell adhesion, iron homeostasis and vascular permeability. Some DEGs, such as celsr1a (cadherin EGF LAG seven-pass G-type receptor 1), fgg (fibrinogen gamma chain) and c1qtnf9 (C1q and TNF related 9) were found near pathogen-associated QTL markers. Also, some shared DEGs for resistance to all pathogens were positioned near QTL markers or ROHi, such as hamp (hepcidin-1), plg (plasminogen) and a fibrinogen alpha chain-like gene. Overall, our results provide an integrative insight into the genetic architecture of turbot response to a range of pathogens that could prove useful for future genomic studies to benefit aquaculture breeding programs.