
    Discovering Patterns from Sequences with Applications to Protein-Protein and Protein-DNA Interaction

    Understanding Protein-Protein and Protein-DNA interaction is of fundamental importance in deciphering gene regulation and other biological processes in living cells. Traditionally, new interaction knowledge is discovered through biochemical experiments that are often labor intensive, expensive and time-consuming. Thus, computational approaches are preferred. Due to the abundance of sequence data available today, sequence-based interaction analysis has become one of the most readily applicable and cost-effective methods. One important problem in sequence-based analysis is to identify the functional regions from a set of sequences within the same family or demonstrating similar biological functions in experiments. The rationale is that throughout evolution the functional regions normally remain conserved (intact), allowing them to be identified as patterns from a set of sequences. However, there are also mutations such as substitution, insertion, and deletion in these functional regions. Existing methods, such as those based on position weight matrices, assume that the functional regions have a fixed width and thus cannot identify functional regions with mutations, particularly those with insertion or deletion mutations. Recently, Aligned Pattern Clustering (APCn) was introduced to identify functional regions as Aligned Pattern Clusters (APCs) by grouping and aligning patterns with variable width. Nevertheless, APCn cannot discover functional regions with substitution, insertion and/or deletion mutations, since their frequencies of occurrence are too low to be considered as patterns. To overcome such an impasse, this thesis proposes a new APC discovery algorithm known as Pattern-Directed Aligned Pattern Clustering (PD-APCn).
By first discovering seed patterns from the input sequence data, with their sequence positions located and recorded on an address table, PD-APCn can use the seed patterns to direct the incremental extension of functional regions with minor mutations. By grouping the aligned extended patterns, PD-APCn can recruit patterns adaptively and efficiently with variable width without relying on exhaustive optimal search. Experiments on synthetic datasets with different sizes and noise levels showed that PD-APCn can identify the implanted pattern with mutations, outperforming the popular existing motif-finding software MEME with much higher recall and F-measure and a computational speed-up of up to 665 times. When PD-APCn was applied to datasets from the Cytochrome C and Ubiquitin protein families, all key binding sites conserved in the families were captured in the APC outputs. In sequence-based interaction analysis, there is also a lack of a model for co-occurring functional regions with mutations, where co-occurring functional regions between interaction sequences are indicative of binding sites. This thesis proposes a new representation model, Co-Occurrence APCs, to capture co-occurring functional regions with mutations from interaction sequences in database transaction format. Applications to Protein-DNA and Protein-Protein interaction validated the capability of Co-Occurrence APCs. In Protein-DNA interaction, a new representation model, Protein-DNA Co-Occurrence APC, was developed for modeling Protein-DNA binding cores. The new model is more compact than the traditional one-to-one pattern associations, as it packs many-to-many associations in one model, yet it is detailed enough to allow site-specific variants. An algorithm, based on Co-Support Score, was also developed to discover Protein-DNA Co-Occurrence APCs from Protein-DNA interaction sequences. This algorithm is 1600x faster in run-time than its contemporaries.
New Protein-DNA binding cores indicated by Protein-DNA Co-Occurrence APCs were also discovered via homology modeling as a proof-of-concept. In Protein-Protein interaction, a new representation model, Protein-Protein Co-Occurrence APC, was developed for modeling the co-occurring sequence patterns in Protein-Protein Interaction between two protein sequences. A new algorithm, WeMine-P2P, was developed for sequence-based Protein-Protein Interaction machine learning prediction by constructing feature vectors leveraging Protein-Protein Co-Occurrence APCs, based on novel scores such as Match Score, MaxMatch Score and APC-PPI score. Through 40 independent experiments, it outperformed the well-known algorithm, PIPE2, which also uses co-occurring functional regions while not allowing variable widths and mutations. Both applications on Protein-Protein and Protein-DNA interaction have indicated the potential use of Co-Occurrence APC for exploring other types of biosequence interaction in the future
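
The address-table idea mentioned in this abstract can be illustrated with a minimal sketch. This is not PD-APCn itself: the function names `build_address_table` and `extend_right`, the fixed seed length `k`, the `min_support` threshold, and the toy sequences are all assumptions made for illustration. The sketch only shows how recording where each seed occurs makes it cheap to inspect the flanking residues when a pattern is extended.

```python
from collections import defaultdict

def build_address_table(sequences, k, min_support):
    """Map each k-mer to the (sequence index, offset) pairs where it occurs.

    Only k-mers found in at least `min_support` distinct sequences are kept
    (a hypothetical stand-in for a real pattern-discovery criterion).
    """
    table = defaultdict(list)
    for seq_idx, seq in enumerate(sequences):
        for pos in range(len(seq) - k + 1):
            table[seq[pos:pos + k]].append((seq_idx, pos))
    return {
        kmer: hits
        for kmer, hits in table.items()
        if len({seq_idx for seq_idx, _ in hits}) >= min_support
    }

def extend_right(sequences, hits, k):
    """Return the residue just to the right of each recorded seed occurrence."""
    flank = []
    for seq_idx, pos in hits:
        nxt = pos + k
        seq = sequences[seq_idx]
        flank.append(seq[nxt] if nxt < len(seq) else "-")  # "-" marks sequence end
    return flank

# Toy input (made up): two sequences share the seeds CHA and HAG.
toy = ["MKCHAGT", "TTCHAGV", "ACHSGTT"]
seeds = build_address_table(toy, k=3, min_support=2)
right_of_hag = extend_right(toy, seeds["HAG"], k=3)
```

Here the column to the right of the seed `HAG` holds `T` in one sequence and `V` in the other, which is the kind of low-frequency variation that a seed-directed extension could still recruit even though neither extended string is frequent on its own.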

    NNAlign: A Web-Based Prediction Method Allowing Non-Expert End-User Discovery of Sequence Motifs in Quantitative Peptide Data

    Recent advances in high-throughput technologies have made it possible to generate both gene and protein sequence data at an unprecedented rate and scale thereby enabling entirely new “omics”-based approaches towards the analysis of complex biological processes. However, the amount and complexity of data that even a single experiment can produce seriously challenges researchers with limited bioinformatics expertise, who need to handle, analyze and interpret the data before it can be understood in a biological context. Thus, there is an unmet need for tools allowing non-bioinformatics users to interpret large data sets. We have recently developed a method, NNAlign, which is generally applicable to any biological problem where quantitative peptide data is available. This method efficiently identifies underlying sequence patterns by simultaneously aligning peptide sequences and identifying motifs associated with quantitative readouts. Here, we provide a web-based implementation of NNAlign allowing non-expert end-users to submit their data (optionally adjusting method parameters), and in return receive a trained method (including a visual representation of the identified motif) that subsequently can be used as prediction method and applied to unknown proteins/peptides. We have successfully applied this method to several different data sets including peptide microarray-derived sets containing more than 100,000 data points

    CATCHprofiles: Clustering and Alignment Tool for ChIP Profiles

    Chromatin Immuno Precipitation (ChIP) profiling detects in vivo protein-DNA binding, and has revealed a large combinatorial complexity in the binding of chromatin associated proteins and their post-translational modifications. To fully explore the spatial and combinatorial patterns in ChIP-profiling data and detect potentially meaningful patterns, the areas of enrichment must be aligned and clustered, which is an algorithmically and computationally challenging task. We have developed CATCHprofiles, a novel tool for exhaustive pattern detection in ChIP profiling data. CATCHprofiles is built upon a computationally efficient implementation for the exhaustive alignment and hierarchical clustering of ChIP profiling data. The tool features a graphical interface for examination and browsing of the clustering results. CATCHprofiles requires no prior knowledge about functional sites, detects known binding patterns “ab initio”, and enables the detection of new patterns from ChIP data at a high resolution, exemplified by the detection of asymmetric histone and histone modification patterns around H2A.Z-enriched sites. CATCHprofiles' capability for exhaustive analysis combined with its ease-of-use makes it an invaluable tool for explorative research based on ChIP profiling data

    ANALYSIS OF THE CIS-REGULATORY ELEMENT LEXICON IN UPSTREAM GENE PROMOTERS OF ARABIDOPSIS THALIANA AND ORYZA SATIVA

    AN ABSTRACT OF THE DISSERTATION OF BELAN M. KHALIL, for the Doctor of Philosophy degree in Plant Biology, presented July 11, 2018, at Southern Illinois University Carbondale. TITLE: ANALYSIS OF THE CIS-REGULATORY ELEMENT LEXICON IN UPSTREAM GENE PROMOTERS OF ARABIDOPSIS THALIANA AND ORYZA SATIVA. MAJOR PROFESSOR: Dr Matt Geisler. Gene expression in plants is partly regulated through an interaction of trans-acting factors with the promoter regions of the gene. Trans-acting factor binding sites consist of short nucleotide sequences most often present in the upstream promoter region. These binding sites, the cis-regulatory elements (CREs), vary in structure, complexity and function. In binding to trans-acting factors, CREs connect genes to signalling and regulatory pathways that affect plant growth, development, and response to the environment. Like words in a language, CREs and thus promoters can be analyzed by looking for spelling (patterns of nucleotides) associated with meaning (functions). Considering CREs as words in a language, this kind of analysis provides a great opportunity for a comprehensive understanding of promoter language. Identification and characterization of CREs are challenging both experimentally and bioinformatically, and have previously been accomplished by discovering degenerate words, with ambiguous nucleotides. This kind of result implicitly hypothesizes that binding of a specific trans-acting factor is somewhat promiscuous (or sloppy) and that all words represented by a degenerate pattern are equally good at binding. In this study, we unpack the “degeneracy hypothesis” by systematically considering each combination of letters independently for CRE function. Our results demonstrate that not all degenerate combinations of published CREs have the same effect on gene expression.
All 65,536 possible 8 bp CRE words were systematically searched for and compared in the 500 bp and 1000 bp upstream promoters of all genes in Arabidopsis thaliana and Oryza sativa, respectively. The function of each CRE was evaluated by statistically comparing the presence or absence of the element in the promoter with that gene's response (induction or suppression) to stimuli in 1691 publicly available transcriptomes of differential gene expression data. Arabidopsis, a model dicot plant, had a much larger number of such data sets than rice; however, rice was chosen as a comparison because it had the largest number of datasets for a monocot, the most distantly related plant group with sufficient data available. A comprehensive list of 8 bp words associated with differential gene expression, linguistically known as a lexicon, was retrieved for both species by establishing that the presence of a CRE significantly increased the likelihood of differential expression by at least one stimulus. The lexicons were composed of 641 and 856 CREs in Arabidopsis and rice, respectively, and only 78 CREs were shared between the two lexicons. The CRE lexicons were then characterized for their strength and breadth of response, occurrence frequency, sequence complexity, and sequence conservation between the two species. In Arabidopsis, the evening element (EE) showed the strongest response to a cold stress transcriptome (p-value 10^-99). In rice, the element AAACCCTA showed the strongest response to a tissue-specific transcriptome (p-value 10^-79). The breadth of response varied between the two species due to the number of transcriptomes used in the study. The elements AAACCCTA and GCGGCGGA were significantly correlated with 197 and 58 transcriptomes in Arabidopsis and rice, respectively. At the other end of the breadth scale there were also many CREs with a very restricted response: 291 and 258 CREs in Arabidopsis and rice, respectively, were significantly correlated with a single stimulus.
Occurrence frequency revealed that the most abundant CREs in Arabidopsis and rice genes were TATA box and TATA box-like CREs. The structure of the CREs in the lexicon also varied. CREs were distributed over seven levels of complexity. Level one comprised CREs having 8 copies of the same nucleotide, and level seven comprised CREs having two copies of the same nucleotide. In Arabidopsis, out of 641 CREs, 314 were of level 6 complexity, which means having 3 copies of the same nucleotide. In rice, the largest group in the lexicon, 263 CREs, was of level 5 complexity, which means having 4 copies of the same nucleotide. Each CRE of the lexicon was correlated to at least one experimental condition in the differential gene expression data, but many were correlated to multiple and often related conditions such as drought, temperature and salinity. Therefore, each CRE was assigned a “meaning”, i.e. the associated stimuli, thus providing a sort of CRE function dictionary in addition to the lexicon itself. Many CREs possessed different meanings (termed homographs in language), and in many cases the meanings of different CREs overlapped like language synonyms. Shared meanings (synonyms) were often found among CREs with strong sequence similarity (homonyms or homophones), although not in all cases. Treated as linguistic properties, CRE homonymity and synonymity were used to explore the hypothesis “all CRE synonyms are also homonyms and all CRE homonyms are also synonyms.” To this end, a single CRE was compared to all possible CREs with only one letter mismatch in their sequences, which were considered homonyms. Each CRE's meaning was converted to a matrix of stimuli to generate clusters of synonyms that were analyzed for similarity of spelling (sequence). This analysis showed that not all homonyms are synonyms; however, most synonyms are homonyms.
Furthermore, despite a search of all one-letter mismatches among homonyms, many of the functional homonyms shared a smaller 4-5 bp core sequence and only varied at the flanks. That synonyms are homonyms in the language of promoters raises a question: how did this evolve? Duplication of transcription factors in the genome generated transcription factor families where each family member shares the same core domain, usually a DNA recognition site. We here propose that CREs also duplicate during the gene duplication process, building CRE families in parallel. Members of CRE families may show different connectivity and affinity to individual members of a transcription factor family. In environmental sensors and developmental decision panel, this association of two families of interaction factors is called a dense overlapping region (or DOR) and is a highly overrepresented network topology in biological systems. This also explains the degeneracy of initially discovered CREs. In fact, only a portion of the nucleotide combinations implied by a degenerate CRE is bioactive; the degenerate CRE represents an overlap of different members of a CRE family, which arises as part of family expansion and diversification through compensatory mutations as the family of transcription factors expanded and diversified. We also extensively studied CREs involved in abiotic stress and identified shared elements among abiotic stresses as well as abiotic stress-specific CREs. Furthermore, CREs follow a time-sensitive response rule, which means some CREs participate in gene expression regulation only at a certain period during the course of exposure to the abiotic stress
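
The counting behind this abstract can be checked with a short sketch. The figure of 65,536 words is 4^8, and the seven complexity levels follow from the abstract's examples if the level is read as 9 minus the count of the most frequent nucleotide (8 copies gives level 1, 3 copies gives level 6, 2 copies gives level 7). That reading, and the helper names `all_words`, `complexity_level` and `one_mismatch_neighbours`, are assumptions for illustration and are not taken from the dissertation.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def all_words(k=8):
    """Yield every DNA word of length k (4**k words in total)."""
    for letters in product(ALPHABET, repeat=k):
        yield "".join(letters)

def complexity_level(word):
    """Assumed reading: level = len(word) + 1 - count of the commonest base.

    For 8 bp words this gives level 1 (one base repeated 8 times)
    through level 7 (no base occurs more than twice).
    """
    return len(word) + 1 - max(Counter(word).values())

def one_mismatch_neighbours(word):
    """All words differing from `word` at exactly one position ("homonyms")."""
    return [
        word[:i] + base + word[i + 1:]
        for i, old in enumerate(word)
        for base in ALPHABET
        if base != old
    ]

n_words = sum(1 for _ in all_words(8))
levels = Counter(complexity_level(w) for w in all_words(8))
neighbours = one_mismatch_neighbours("AAACCCTA")
```

Under this reading an 8 bp word always has a most frequent base occurring at least twice, so levels 1 to 7 cover every word, and each word has 8 x 3 = 24 one-mismatch neighbours to compare against.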

    STAMP: a web tool for exploring DNA-binding motif similarities

    STAMP is a newly developed web server that is designed to support the study of DNA-binding motifs. STAMP may be used to query motifs against databases of known motifs; the software aligns input motifs against the chosen database (or alternatively against a user-provided dataset), and lists of the highest-scoring matches are returned. Such similarity-search functionality is expected to facilitate the identification of transcription factors that potentially interact with newly discovered motifs. STAMP also automatically builds multiple alignments, familial binding profiles and similarity trees when more than one motif is inputted. These functions are expected to enable evolutionary studies on sets of related motifs and fixed-order regulatory modules, as well as illustrating similarities and redundancies within the input motif collection. STAMP is a highly flexible alignment platform, allowing users to ‘mix-and-match’ between various implemented comparison metrics, alignment methods (local or global, gapped or ungapped), multiple alignment strategies and tree-building methods. Motifs may be inputted as frequency matrices (in many of the commonly used formats), consensus sequences, or alignments of known binding sites. STAMP also directly accepts the output files from 12 supported motif-finders, enabling quick interpretation of motif-discovery analyses. STAMP is available at http://www.benoslab.pitt.edu/stam
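
To make the idea of aligning one motif against another concrete, here is a minimal sketch of an ungapped alignment between two frequency matrices. It is not STAMP's code: the choice of column-wise Pearson correlation as the comparison metric, the `min_overlap` parameter, and the function names are assumptions for illustration only. Each motif is a list of columns, and each column holds the frequencies of A, C, G and T.

```python
from math import sqrt

def pearson(col_a, col_b):
    """Pearson correlation between two motif columns (0.0 if either is flat)."""
    n = len(col_a)
    mean_a = sum(col_a) / n
    mean_b = sum(col_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(col_a, col_b))
    var_a = sum((a - mean_a) ** 2 for a in col_a)
    var_b = sum((b - mean_b) ** 2 for b in col_b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def best_ungapped_alignment(motif_a, motif_b, min_overlap=2):
    """Slide motif_b along motif_a and return (best score, offset).

    `offset` is the column of motif_a that the first column of motif_b
    is placed on; the score sums the correlations of overlapping columns.
    """
    best = (float("-inf"), 0)
    lo = -(len(motif_b) - min_overlap)
    hi = len(motif_a) - min_overlap
    for offset in range(lo, hi + 1):
        score = 0.0
        for j, col_b in enumerate(motif_b):
            i = j + offset
            if 0 <= i < len(motif_a):
                score += pearson(motif_a[i], col_b)
        best = max(best, (score, offset))
    return best

# Toy motifs (made up): consensus ACGT versus consensus CGT.
A, C, G, T = [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
score, offset = best_ungapped_alignment([A, C, G, T], [C, G, T])
```

In this toy case the shorter motif aligns best one column in from the start of the longer one, where all three columns match. A gapped or local variant would need a dynamic-programming alignment instead of this simple sliding window.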

    Finding subtypes of transcription factor motif pairs with distinct regulatory roles

    DNA sequences bound by a transcription factor (TF) are presumed to contain sequence elements that reflect its DNA binding preferences and its downstream-regulatory effects. Experimentally identified TF binding sites (TFBSs) are usually similar enough to be summarized by a ‘consensus’ motif, representative of the TF DNA binding specificity. Studies have shown that groups of nucleotide TFBS variants (subtypes) can contribute to distinct modes of downstream regulation by the TF via differential recruitment of cofactors. A TFA may bind to TFBS subtypes a1 or a2 depending on whether it associates with cofactors TFB or TFC, respectively. While some approaches can discover motif pairs (dyads), none address the problem of identifying ‘variants’ of dyads. TFs are key components of multiple regulatory pathways targeting different sets of genes perhaps with different binding preferences. Identifying the discriminating TF–DNA associations that lead to the differential downstream regulation is thus essential. We present DiSCo (Discovery of Subtypes and Cofactors), a novel approach for identifying variants of dyad motifs (and their respective target sequence sets) that are instrumental for differential downstream regulation. Using both simulated and experimental datasets, we demonstrate how current motif discovery can be successfully leveraged to address this question