
    Dynamic use of multiple parameter sets in sequence alignment

    The level of conservation between two homologous sequences often varies among sequence regions; functionally important domains are more conserved than the remaining regions. Thus, multiple parameter sets should be used when aligning homologous sequences: a stringent parameter set for highly conserved regions and a moderate parameter set for weakly conserved regions. We describe an alignment algorithm that allows dynamic use of multiple parameter sets with different levels of stringency when computing an optimal alignment of two sequences. The algorithm dynamically considers various candidate alignments, partitions each candidate alignment into sections, and determines the most appropriate set of parameter values for each section of the alignment. The algorithm and its local alignment version are implemented in a computer program named GAP4. The local alignment algorithm in GAP4, that in its predecessor GAP3, and an ordinary local alignment program, SIM, were evaluated on 257 716 pairs of homologous sequences from 100 protein families. On 168 475 of the 257 716 pairs (a rate of 65.4%), alignments from GAP4 were more statistically significant than alignments from GAP3 and SIM.
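To make the idea of parameter-set stringency concrete, the following minimal sketch scores the same pair of sequences under two parameter sets. It is a plain Smith-Waterman local alignment with a linear gap penalty, not the GAP4 algorithm, which chooses a parameter set per alignment section. The names `ParamSet`, `STRINGENT`, `MODERATE` and all score values are assumptions made for illustration.

```python
from typing import NamedTuple


class ParamSet(NamedTuple):
    """Hypothetical scoring parameters (illustrative values, not GAP4's)."""
    match: int
    mismatch: int
    gap: int


def local_score(a: str, b: str, p: ParamSet) -> int:
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = p.match if a[i - 1] == b[j - 1] else p.mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            cur[j] = max(0, prev[j - 1] + sub, prev[j] + p.gap, cur[j - 1] + p.gap)
            best = max(best, cur[j])
        prev = cur
    return best


# Assumed example parameter sets: harsh vs. lenient penalties.
STRINGENT = ParamSet(match=2, mismatch=-6, gap=-8)
MODERATE = ParamSet(match=2, mismatch=-1, gap=-2)
```

Under these assumed values, a single mismatch cuts the stringent alignment short, while the moderate set extends through it; a method in the spirit of the abstract would pick between such sets section by section rather than globally.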

    A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites

    Given a set of known binding sites for a specific transcription factor, it is possible to build a model of the transcription factor binding site, usually called a motif model, and use this model to search for other sites that bind the same transcription factor. Typically, this search is performed using a position-specific scoring matrix (PSSM), also known as a position weight matrix. In this paper we analyze a set of eukaryotic transcription factor binding sites and show that there is extensive clustering of similar k-mers in eukaryotic motifs, owing to both functional and evolutionary constraints. The apparent limitations of probabilistic models in representing complex nucleotide dependencies lead us to a graph-based representation of motifs. When deciding whether a candidate k-mer is part of a motif, we base our decision not on how well the k-mer conforms to a model of the motif as a whole, but on how similar it is to specific, known k-mers in the motif. We elucidate the reasons why we expect graph-based methods to perform well on motif data. Our MotifScan algorithm shows greatly improved performance over the prevalent PSSM-based method for the detection of eukaryotic motifs.
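The contrast between scoring against a whole-motif model and comparing to specific known k-mers can be sketched as follows. This is not the MotifScan algorithm; it only illustrates, under simple assumptions, why a PSSM (which treats positions independently) cannot tell a mixed k-mer from a known one. The function names, the pseudocount, the uniform background, and the toy sites are all hypothetical.

```python
import math


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PSSM from equal-length DNA sites (uniform background assumed)."""
    pssm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2(counts[b] / total / background) for b in "ACGT"})
    return pssm


def pssm_score(pssm, kmer):
    """Sum of per-position log-odds scores; positions are treated independently."""
    return sum(col[base] for col, base in zip(pssm, kmer))


def nearest_site_distance(sites, kmer):
    """Hamming distance from the candidate to the closest known site."""
    return min(sum(x != y for x, y in zip(site, kmer)) for site in sites)


# Toy motif with two clusters of sites (made-up data).
sites = ["AACC", "GGTT"]
pssm = build_pssm(sites)
```

With these toy sites, the candidate `AATT` receives the same PSSM score as the known site `AACC`, yet it is two substitutions away from every known site, which is the kind of dependency a per-k-mer comparison can capture.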

    Genomics and Computational Molecular Biology

    this article permits us to mention only the most recent and major advances in techniques for gene identification. However, there are a number of other reviews and compendiums that cover this area in more depth [25, **26, 27, **28, 29, 30, 31, 32, 33]. In addition, Table 1 includes a list of Web pointers to most of the bioinformatics methods that are presented in this review.
    Computational methods for gene identification
    The first step in gene identification is the location of coding regions or open reading frames (ORFs). This task is simplified in bacteria due to the absence of splicing. Sequencing errors and translational frameshifting [34] can lead to partial protein sequences or interrupted open reading frames, but these are often resolved during the early steps of gene identification by sequence similarity with proteins from other organisms [35, 36, 37, 38, 39, 40]. In the absence of homologous sequences in other organisms, and especially with short bacterial genes, one can often identify biologically significant coding regions using probabilistic gene models (hidden Markov models) [41, 42].
    Pairwise sequence homology
    Given a database of potential open reading frames, a large number of methods can be used to define the biological function of the putative proteins. The most commonly applied methods search for sequence similarity of the translated open reading frames with a database of known protein sequences [43, 44, **45, 46]. The search for gene function is usually carried out at the protein level to eliminate the redundancy of the genetic code. In addition, the use of amino acid substitution matrices that describe the acceptable replacements permits the discovery of even distantly related protein homologies [47, 48]. One of the most sensitive methods for comparing two ..
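The first step named in this excerpt, locating open reading frames in an unspliced (bacterial-style) sequence, can be sketched as a simple scan for ATG-to-stop stretches. This is a minimal illustration, not one of the reviewed tools: it searches only the forward strand, assumes the standard stop codons, and the function name and `min_codons` threshold are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end) pairs for ATG...stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon; `min_codons` counts the
    codons from ATG up to, but not including, the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # resume the search after the stop codon
    return sorted(orfs)


# Made-up sequence containing a single ORF in the third reading frame.
example = "CCATGAAATTTTAGGG"
```

A fuller treatment would also scan the reverse complement and, as the excerpt notes, would use similarity searches or probabilistic gene models to decide which candidate ORFs are biologically significant.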

    Highly Specific Protein Sequence Motifs for Genome Analysis

    We present a novel method for discovering conserved sequence motifs from families of aligned protein sequences. The method has been implemented as a computer program called EMOTIF (http://motif.stanford.edu/emotif). Given an aligned set of protein sequences, EMOTIF generates a set of motifs with a wide range of specificities and sensitivities. EMOTIF can also generate motifs that describe possible subfamilies of a protein superfamily. A disjunction of such motifs can often represent the entire superfamily with high specificity and sensitivity. We have used EMOTIF to generate sets of motifs from all 7,000 protein alignments in the BLOCKS and PRINTS databases. The resulting database, called IDENTIFY (http://motif.stanford.edu/identify), contains over 50,000 motifs. For each alignment, the database contains several motifs whose probability of matching a false positive ranges from 10^-10 to 10^-5. Highly specific motifs are well suited for searching entire proteomes, while gen..

    Discovering Empirically Conserved Amino Acid Substitution Groups in Databases of Protein Families

    This paper introduces a method for identifying amino acid substitution groups that are conserved empirically in aligned positions from databases of protein families. Existing approaches view amino acid substitution as a pairwise phenomenon and characterize it using substitution matrices. In contrast, the method presented here identifies subsets of amino acids that are conserved empirically using a conditional distribution matrix, which contains entries for every combination of individual amino acids and subsets of amino acids. Each row in the conditional distribution matrix contains the distribution of amino acids in those aligned positions that contain a given subset of amino acids. The algorithm converts a database of protein families into a conditional distribution matrix and then examines each possible substitution group for evidence of conservation. A substitution group is empirically conserved when it has characteristics of compactness and isolation, meaning that am..
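One row of the conditional distribution matrix described above can be sketched as follows: for a given subset of amino acids, collect the aligned positions (columns) that contain every member of the subset and tabulate the amino acids found there. This is only one reading of the abstract. Weighting by raw residue counts, the function name, and the toy columns are assumptions, and gap characters are not handled.

```python
from collections import Counter


def conditional_distribution(columns, group):
    """Distribution of amino acids over columns containing every member of `group`.

    `columns` is a list of strings, one per aligned position; residues are
    weighted by raw count, which is an assumption of this sketch.
    """
    counts = Counter()
    for col in columns:
        if set(group) <= set(col):
            counts.update(col)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


# Made-up alignment columns, one string per aligned position.
columns = ["IIVL", "ILVV", "KKRR", "DDEE"]
row = conditional_distribution(columns, "IV")
```

In this toy example, the row for the subset {I, V} places all of its mass on I, V and L, hinting that those residues co-occur in the same positions; the paper's compactness and isolation criteria are not reproduced here.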