54 research outputs found

Dynamic use of multiple parameter sets in sequence alignment

Author: Douglas L. Brutlag
Xiaoqiu Huang
Publication venue: Oxford University Press
Publication date: 19/12/2006

The level of conservation between two homologous sequences often varies among sequence regions; functionally important domains are more conserved than the remaining regions. Thus, multiple parameter sets should be used in alignment of homologous sequences with a stringent parameter set for highly conserved regions and a moderate parameter set for weakly conserved regions. We describe an alignment algorithm to allow dynamic use of multiple parameter sets with different levels of stringency in computation of an optimal alignment of two sequences. The algorithm dynamically considers various candidate alignments, partitions each candidate alignment into sections, and determines the most appropriate set of parameter values for each section of the alignment. The algorithm and its local alignment version are implemented in a computer program named GAP4. The local alignment algorithm in GAP4, that in its predecessor GAP3, and an ordinary local alignment program SIM were evaluated on 257 716 pairs of homologous sequences from 100 protein families. On 168 475 of the 257 716 pairs (a rate of 65.4%), alignments from GAP4 were more statistically significant than alignments from GAP3 and SIM
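The motivation in this abstract, that conservation varies along an alignment, can be illustrated with a small sketch. This is not the GAP4 algorithm, which chooses parameter sets inside the alignment computation itself. It is only a post-hoc heuristic, assumed here for illustration, that labels fixed-size windows of an existing gapped alignment as calling for a "stringent" or "moderate" parameter set. The function names, the window size, and the identity cutoff are all hypothetical.

```python
def window_identity(a, b, window):
    """Yield (start, fraction_identical) for consecutive windows of an
    existing gapped alignment given as two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    for start in range(0, len(a), window):
        pairs = list(zip(a[start:start + window], b[start:start + window]))
        # A gap aligned to a gap is not counted as a match.
        matches = sum(x == y and x != "-" for x, y in pairs)
        yield start, matches / len(pairs)


def choose_parameter_sets(a, b, window=5, cutoff=0.7):
    """Label each window 'stringent' if it is highly conserved,
    otherwise 'moderate'. The cutoff is an arbitrary illustrative value."""
    return [
        (start, "stringent" if identity >= cutoff else "moderate")
        for start, identity in window_identity(a, b, window)
    ]


if __name__ == "__main__":
    print(choose_parameter_sets("ACGTACGTAC", "ACGTATTTCC"))
```

A dynamic-programming method such as the one the abstract describes would make this choice jointly with the alignment rather than after it.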

Digital Repository @ Iowa State University (ISU)

CiteSeerX

Crossref

PubMed Central

Bayesian Segmentation of Protein Secondary Structure

Author: Scott C. Schmidler
Jun S. Liu
Douglas L. Brutlag
Publication venue: Mary Ann Liebert Inc

Crossref

A graph-based motif detection algorithm models complex nucleotide dependencies in transcription factor binding sites

Author: Brian T. Naughton
Eugene Fratkin
Serafim Batzoglou
Douglas L. Brutlag
Publication venue: Oxford University Press
Publication date: 13/11/2006

Given a set of known binding sites for a specific transcription factor, it is possible to build a model of the transcription factor binding site, usually called a motif model, and use this model to search for other sites that bind the same transcription factor. Typically, this search is performed using a position-specific scoring matrix (PSSM), also known as a position weight matrix. In this paper we analyze a set of eukaryotic transcription factor binding sites and show that there is extensive clustering of similar k-mers in eukaryotic motifs, owing to both functional and evolutionary constraints. The apparent limitations of probabilistic models in representing complex nucleotide dependencies lead us to a graph-based representation of motifs. When deciding whether a candidate k-mer is part of a motif or not, we base our decision not on how well the k-mer conforms to a model of the motif as a whole, but how similar it is to specific, known k-mers in the motif. We elucidate the reasons why we expect graph-based methods to perform well on motif data. Our MotifScan algorithm shows greatly improved performance over the prevalent PSSM-based method for the detection of eukaryotic motifs
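The PSSM search that this abstract takes as its baseline can be sketched as follows. This is a minimal illustration, assuming a uniform background and a pseudocount of 1. The last function shows the general idea of comparing a candidate against individual known k-mers rather than against a model of the whole motif, using plain Hamming distance as a stand-in. It is not the MotifScan algorithm, and all function names are hypothetical.

```python
import math
from collections import Counter


def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds position-specific scoring matrix from
    equal-length DNA binding sites."""
    total = len(sites) + 4 * pseudocount
    pssm = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        pssm.append({
            base: math.log2(((counts[base] + pseudocount) / total) / background)
            for base in "ACGT"
        })
    return pssm


def pssm_score(pssm, kmer):
    """Sum the per-position log-odds scores of a k-mer."""
    return sum(column[base] for column, base in zip(pssm, kmer))


def scan(pssm, sequence, threshold):
    """Return (offset, k-mer) for every window scoring at or above threshold."""
    k = len(pssm)
    return [
        (i, sequence[i:i + k])
        for i in range(len(sequence) - k + 1)
        if pssm_score(pssm, sequence[i:i + k]) >= threshold
    ]


def min_hamming(kmer, known_sites):
    """Distance from a candidate to its closest known k-mer
    (a simplified stand-in for per-k-mer similarity)."""
    return min(sum(a != b for a, b in zip(kmer, site)) for site in known_sites)


if __name__ == "__main__":
    sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
    pssm = build_pssm(sites)
    print(scan(pssm, "GGGTATAATCCC", threshold=5.0))
    print(min_hamming("TATAAA", sites))
```

The PSSM treats positions as independent, which is the limitation the abstract points to when it discusses nucleotide dependencies.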

Crossref

PubMed Central

Genomics and Computational Molecular Biology

Author: Douglas L. Brutlag
Publication date: 01/01/1998

this article permits us to mention only the most recent and major advances in techniques for gene identification. However, there are a number of other reviews and compendiums that cover this area in more depth [25, **26, 27, **28, 29, 30, 31, 32, 33]. In addition, Table 1 includes a list of Web pointers to most of the bioinformatics methods that are presented in this review.

Computational methods for gene identification

The first step in gene identification is the location of coding regions or open reading frames (ORFs). This task is simplified in bacteria due to the absence of splicing. Sequencing errors and translational frameshifting [34] can lead to partial protein sequences or interrupted open reading frames, but these are often resolved during the early steps of gene identification by sequence similarity with proteins from other organisms [35, 36, 37, 38, 39, 40]. In the absence of homologous sequences in other organisms, and especially with short bacterial genes, one can often identify biologically significant coding regions using probabilistic gene models (hidden Markov models) [41, 42].

Pairwise sequence homology

Given a database of potential open reading frames, a large number of methods can be used to define the biological function of the putative proteins. The most commonly applied methods search for sequence similarity of the translated open reading frames with a database of known protein sequences [43, 44, **45, 46]. The search for gene function is usually carried out at the protein level to eliminate the redundancy of the genetic code. In addition, the use of amino acid substitution matrices that describe the acceptable replacements permits the discovery of even distantly related protein homologies [47, 48]. One of the most sensitive methods for comparing two ..
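The first step described in this excerpt, locating open reading frames in an unspliced (bacterial) sequence, can be sketched in a few lines. This is a minimal illustration, assuming the standard ATG start codon and the three standard stop codons, and scanning only the forward strand. The function name and the minimum-length parameter are hypothetical, and real gene finders add reverse-strand scanning, alternative start codons, and the probabilistic models the excerpt mentions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for each forward-strand ORF that runs
    from an ATG to the first in-frame stop codon. `end` is exclusive and
    includes the stop codon. `min_codons` counts codons before the stop."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATTTGGGTAACC"))
```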

CiteSeerX

Highly Specific Protein Sequence Motifs for Genome Analysis

Author: Craig Nevill-Manning
Thomas D. Wu
Douglas L. Brutlag
Publication date: 01/01/1998

We present a novel method for discovering conserved sequence motifs from families of aligned protein sequences. The method has been implemented as a computer program called EMOTIF (http://motif.stanford.edu/emotif). Given an aligned set of protein sequences, EMOTIF generates a set of motifs with a wide range of specificities and sensitivities. EMOTIF can also generate motifs that describe possible subfamilies of a protein superfamily. A disjunction of such motifs can often represent the entire superfamily with high specificity and sensitivity. We have used EMOTIF to generate sets of motifs from all 7,000 protein alignments in the BLOCKS and PRINTS databases. The resulting database, called IDENTIFY (http://motif.stanford.edu/identify), contains over 50,000 motifs. For each alignment, the database contains several motifs whose probability of matching a false positive ranges from 10^-10 to 10^-5. Highly specific motifs are well suited for searching entire proteomes, while gen..
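The idea of turning an aligned block into a motif can be sketched as a regular expression built column by column. This is only an illustration under stated assumptions, not the EMOTIF program: the substitution groups below are hypothetical placeholders, the input is assumed to be an ungapped block, and only one motif is produced rather than a range of specificities.

```python
import re

# Hypothetical substitution groups, chosen only for illustration.
GROUPS = ["ILV", "KR", "DE", "FWY", "ST"]


def column_pattern(column):
    """Map one alignment column to a literal residue, the smallest
    allowed group covering it, or a wildcard."""
    residues = set(column)
    if len(residues) == 1:
        return next(iter(residues))
    covering = [group for group in GROUPS if residues <= set(group)]
    if covering:
        return "[" + min(covering, key=len) + "]"
    return "."


def build_motif(alignment):
    """Build a regular-expression motif from equal-length aligned sequences."""
    return "".join(column_pattern(column) for column in zip(*alignment))


if __name__ == "__main__":
    motif = build_motif(["GKSTL", "GRSAI", "GKTGV"])
    print(motif)
    print(bool(re.search(motif, "MAGKSQLP")))
```

Replacing more columns with wildcards makes a motif more sensitive and less specific, which is the trade-off the abstract describes.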

CiteSeerX

2004 Structural, Function and Evolutionary Genomics

Author: Douglas L. Brutlag
Nancy Ryan Gray
Publication venue: Gordon Research Conferences
Publication date: 23/03/2005

This Gordon conference will cover the areas of structural, functional and evolutionary genomics. It will take a systematic approach to genomics, examining the evolution of proteins, protein functional sites, protein-protein interactions, regulatory networks, and metabolic networks. Emphasis will be placed on what we can learn from comparative genomics and entire genomes and proteomes

UNT Digital Library

Discovering Empirically Conserved Amino Acid Substitution Groups in Databases of Protein Families

Author: Douglas L. Brutlag
Thomas D. Wu

This paper introduces a method for identifying amino acid substitution groups that are conserved empirically in aligned positions from databases of protein families. Existing approaches view amino acid substitution as a pairwise phenomenon and characterize it using substitution matrices. In contrast, the method presented here identifies subsets of amino acids that are conserved empirically using a conditional distribution matrix, which contains entries for every combination of individual amino acids and subsets of amino acids. Each row in the conditional distribution matrix contains the distribution of amino acids in those aligned positions that contain a given subset of amino acids. The algorithm converts a database of protein families into a conditional distribution matrix and then examines each possible substitution group for evidence of conservation. A substitution group is empirically conserved when it has characteristics of compactness and isolation, meaning that am..
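One row of the conditional distribution matrix described above can be sketched as follows. This is one reading of the abstract, assumed for illustration: a column "contains" a subset when every member of the subset occurs in it, and the row pools residue counts over all such columns. The function name and the toy columns are hypothetical, and the sketch computes a single row rather than enumerating every subset.

```python
from collections import Counter


def conditional_row(columns, subset):
    """Return the amino acid distribution over all aligned columns
    that contain every residue in `subset` (gaps are ignored)."""
    required = set(subset)
    counts = Counter()
    for column in columns:
        residues = [r for r in column if r != "-"]
        if required <= set(residues):
            counts.update(residues)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


if __name__ == "__main__":
    # Each string is one aligned position (column) across four sequences.
    columns = ["IIVV", "ILVL", "KKRR", "IVLM"]
    print(conditional_row(columns, "IV"))
```

In this toy input, leucine appears often in columns that contain both isoleucine and valine, which is the kind of evidence that would suggest extending the group.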

CiteSeerX