189 research outputs found

    Generalized affine gap costs for protein sequence alignment

    Based on the observation that a single mutational event can delete or insert multiple residues, affine gap costs for sequence alignment charge a penalty for the existence of a gap, and a further length-dependent penalty. From structural or multiple alignments of distantly related proteins, it has been observed that conserved residues frequently fall into ungapped blocks separated by relatively nonconserved regions. To take advantage of this structure, a simple generalization of affine gap costs is proposed that allows nonconserved regions to be effectively ignored. The distribution of scores from local alignments using these generalized gap costs is shown empirically to follow an extreme value distribution. Examples are presented for which generalized affine gap costs yield superior alignments from the standpoints both of statistical significance and of alignment accuracy. Guidelines for selecting generalized affine gap costs are discussed, as is their possible application to multiple alignment. Proteins 32:88-96, 1998. © 1998 Wiley-Liss, Inc.
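
The affine scheme described in this abstract (one charge for a gap existing, plus a charge that grows with its length) can be sketched as follows. This is a minimal illustration, not the paper's generalized scheme: the function name, the default values, and the convention that the length term is charged per residue are all assumptions made for the example.

```python
def affine_gap_cost(length: int, gap_open: float = 11.0, gap_extend: float = 1.0) -> float:
    """Cost of one gap spanning `length` residues under affine gap costs.

    Assumed convention: cost = gap_open + gap_extend * length.
    Some programs instead charge gap_open + gap_extend * (length - 1).
    The default values are illustrative only.
    """
    if length < 0:
        raise ValueError("gap length must be non-negative")
    if length == 0:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * length


if __name__ == "__main__":
    # One 10-residue gap is far cheaper than ten separate 1-residue gaps,
    # reflecting that a single mutational event can remove many residues.
    print(affine_gap_cost(10))      # one long gap
    print(10 * affine_gap_cost(1))  # ten short gaps
```

With these assumed parameters a single long gap costs much less than many short ones, which is the behaviour the existence penalty is meant to produce.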

    Retrieval accuracy, statistical significance and compositional similarity in protein sequence database searches

    Protein sequence database search programs may be evaluated both for their retrieval accuracy (the ability to separate meaningful from chance similarities) and for the accuracy of their statistical assessments of reported alignments. However, methods for improving statistical accuracy can degrade retrieval accuracy by discarding compositional evidence of sequence relatedness. This evidence may be preserved by combining essentially independent measures of alignment and compositional similarity into a unified measure of sequence similarity. A version of the BLAST protein database search program, modified to employ this new measure, outperforms the baseline program in both retrieval and statistical accuracy on ASTRAL, a SCOP-based test set.

    PSI-BLAST pseudocounts and the minimum description length principle

    Position specific score matrices (PSSMs) are derived from multiple sequence alignments to aid in the recognition of distant protein sequence relationships. The PSI-BLAST protein database search program derives the column scores of its PSSMs with the aid of pseudocounts, added to the observed amino acid counts in a multiple alignment column. In the absence of theory, the number of pseudocounts used has been a completely empirical parameter. This article argues that the minimum description length principle can motivate the choice of this parameter. Specifically, for realistic alignments, the principle supports the practice of using a number of pseudocounts essentially independent of alignment size. However, it also implies that more highly conserved columns should use fewer pseudocounts, increasing the inter-column contrast of the implied PSSMs. A new method for calculating pseudocounts that significantly improves PSI-BLAST's retrieval accuracy is now employed by default.
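
The role of pseudocounts in a PSSM column can be illustrated with a simplified sketch. It assumes the pseudocounts are distributed according to background frequencies and that scores are base-2 log-odds; PSI-BLAST's actual pseudocount distribution and the method proposed in the article are more involved. The function name, the toy two-letter alphabet, and all parameter values are hypothetical.

```python
import math


def column_scores(counts, background, pseudocount_total=5.0):
    """Log-odds scores for one alignment column, smoothed with pseudocounts.

    counts:            observed residue counts in the column
    background:        background frequency of each residue (sums to 1)
    pseudocount_total: number of pseudocounts spread over the residues in
                       proportion to `background` (a simplifying assumption)
    """
    if pseudocount_total <= 0:
        raise ValueError("pseudocount_total must be positive")
    observed_total = sum(counts.values())
    scores = {}
    for residue, p in background.items():
        # Estimated frequency: observed counts plus this residue's pseudocounts.
        q = (counts.get(residue, 0) + pseudocount_total * p) / (
            observed_total + pseudocount_total
        )
        scores[residue] = math.log2(q / p)
    return scores


if __name__ == "__main__":
    toy_background = {"A": 0.5, "B": 0.5}  # toy alphabet, not real amino acids
    conserved_column = {"A": 10}
    print(column_scores(conserved_column, toy_background, pseudocount_total=2.0))
    print(column_scores(conserved_column, toy_background, pseudocount_total=100.0))
```

Using fewer pseudocounts leaves the conserved residue with a higher score and the unobserved residue with a lower one, which is the increased contrast the abstract refers to for highly conserved columns.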

    Composition-based statistics and translated nucleotide searches: Improving the TBLASTN module of BLAST

    BACKGROUND: TBLASTN is a mode of operation for BLAST that aligns protein sequences to a nucleotide database translated in all six frames. We present the first description of the modern implementation of TBLASTN, focusing on new techniques that were used to implement composition-based statistics for translated nucleotide searches. Composition-based statistics use the composition of the sequences being aligned to generate more accurate E-values, which allows for a more accurate distinction between true and false matches. Until recently, composition-based statistics were available only for protein-protein searches. They are now available as a command line option for recent versions of TBLASTN and as an option for TBLASTN on the NCBI BLAST web server. RESULTS: We evaluate the statistical and retrieval accuracy of the E-values reported by a baseline version of TBLASTN and by two variants that use different types of composition-based statistics. To test the statistical accuracy of TBLASTN, we ran 1000 searches using scrambled proteins from the mouse genome and a database of human chromosomes. To test retrieval accuracy, we modernize and adapt to translated searches a test set previously used to evaluate the retrieval accuracy of protein-protein searches. We show that composition-based statistics greatly improve the statistical accuracy of TBLASTN, at a small cost to the retrieval accuracy. CONCLUSION: TBLASTN is widely used, as it is common to wish to compare proteins to chromosomes or to libraries of mRNAs. Composition-based statistics improve the statistical accuracy, and therefore the reliability, of TBLASTN results. The algorithms used by TBLASTN are not widely known, and some of the most important are reported here. The data used to test TBLASTN are available for download and may be useful in other studies of translated search algorithms.
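
Translating a nucleotide sequence in all six frames, as mentioned in this abstract, can be sketched as below. This is a generic illustration using the standard genetic code, not TBLASTN's implementation; the function names and the frame numbering (+1 to +3 for the given strand, -1 to -3 for its reverse complement) are assumptions made for the example.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Map frame number (+1..+3, -1..-3) to its translated protein string."""
    forward = dna.upper()
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in sorted(six_frame_translations("ATGGCCTAA").items()):
        print(frame, protein)
```

A translated search would then align the protein query against each of the six translations; that alignment step is outside the scope of this sketch.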

    Comparative analysis of ammonia monooxygenase (amoA) genes in the water column and sediment-water interface of two lakes and the Baltic Sea

    The functional gene amoA was used to compare the diversity of ammonia-oxidizing bacteria (AOB) in the water column and sediment-water interface of the two freshwater lakes Plußsee and Schöhsee and the Baltic Sea. Nested amplifications were used to increase the sensitivity of amoA detection, and to amplify a 789-bp fragment from which clone libraries were prepared. The larger part of the sequences was only distantly related to any of the cultured AOB and is considered to represent new clusters of AOB within the Nitrosomonas/Nitrosospira group. Almost all sequences from the water column of the Baltic Sea and from 1-m depth of Schöhsee were related to different Nitrosospira clusters 0 and 2, respectively. The majority of sequences from Plußsee and Schöhsee were associated with sequences from Chesapeake Bay, from a previous study of Plußsee and from rice roots in Nitrosospira-like cluster A, which lacks sequences from the Baltic Sea. Two groups of sequences from Baltic Sea sediment were related to clonal sequences from other brackish/marine habitats in the purely environmental Nitrosospira-like cluster B and the Nitrosomonas-like cluster. This confirms previous results from 16S rRNA gene libraries that indicated the existence of hitherto uncultivated AOB in lake and Baltic Sea samples, and showed a differential distribution of AOB along the water column and sediment of these environments.

    Chromosomal-level assembly of the Asian Seabass genome using long sequence reads and multi-layered scaffolding

    We report here the ~670 Mb genome assembly of the Asian seabass (Lates calcarifer), a tropical marine teleost. We used long-read sequencing augmented by transcriptomics, optical and genetic mapping along with shared synteny from closely related fish species to derive a chromosome-level assembly with a contig N50 size over 1 Mb and scaffold N50 size over 25 Mb that span ~90% of the genome. The population structure of the L. calcarifer species complex was analyzed by re-sequencing 61 individuals representing various regions across the species' native range. SNP analyses identified high levels of genetic diversity and confirmed earlier indications of a population stratification comprising three clades with signs of admixture apparent in the South-East Asian population. The quality of the Asian seabass genome assembly far exceeds that of any other fish species, and will serve as a new standard for fish genomics.
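
The contig and scaffold N50 figures quoted in this abstract follow the usual definition: the length L such that sequences of length L or longer contain at least half of the assembled bases. A minimal sketch of that calculation, with a hypothetical function name and made-up lengths rather than data from the assembly:

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("at least one length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total.
        if running >= half_total:
            return length


if __name__ == "__main__":
    toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]  # made-up lengths
    print(n50(toy_contigs))
```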

    Specificity of DNA-binding by the FAX-1 and NHR-67 nuclear receptors of Caenorhabditis elegans is partially mediated via a subclass-specific P-box residue

    BACKGROUND: The nuclear receptors of the NR2E class play important roles in pattern formation and nervous system development. Based on a phylogenetic analysis of DNA-binding domains, we define two conserved groups of orthologous NR2E genes: the NR2E1 subclass, which includes C. elegans nhr-67, Drosophila tailless and dissatisfaction, and vertebrate Tlx (NR2E2, NR2E4, NR2E1), and the NR2E3 subclass, which includes C. elegans fax-1 and vertebrate PNR (NR2E5, NR2E3). PNR and Tll nuclear receptors have been shown to bind the hexamer half-site AAGTCA, instead of the hexamer AGGTCA recognized by most other nuclear receptors, suggesting unique DNA-binding properties for NR2E class members. RESULTS: We show that NR2E3 subclass member FAX-1, unlike NHR-67 and other NR2E1 subclass members, binds to hexamer half-sites with relaxed specificity: it will bind hexamers with the sequence ANGTCA, although it prefers a purine to a pyrimidine at the second position. We use site-directed mutagenesis to demonstrate that the difference between FAX-1 and NHR-67 binding preference is partially mediated by a conserved subclass-specific asparagine or aspartate residue at position 19 of the DNA-binding domain. This amino acid position is part of the "P box" that plays a critical role in defining binding site specificity and has been shown to make hydrogen-bond contacts to the second position of the hexamer in co-crystal structures for other nuclear receptors. The relaxed specificity allows FAX-1 to bind a much larger repertoire of half-sites than NHR-67. While NR2E1 class proteins bind both monomeric and dimeric sites, the NR2E3 class proteins bind only dimeric sites. The presence of a single strong site adjacent to a very weak site allows dimeric FAX-1 binding, further increasing the number of dimeric binding sites to which FAX-1 may bind in vivo. CONCLUSION: These findings identify subclass-specific DNA-binding specificities and dimerization properties for the NR2E1 and NR2E3 subclasses. For the NR2E1 protein NHR-67, Asp-19 permits binding to AAGTCA half-sites, while Asn-19 permits binding to AGGTCA half-sites. The apparent conservation of DNA-binding properties between vertebrate and nematode NR2E receptors allows for the possibility of evolutionarily-conserved regulatory patterns.
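
The relaxed half-site ANGTCA described in this abstract, where N is any base, can be located in a sequence with a simple pattern scan. The sketch below is only an illustration of the motif: the function name and the test sequence are made up, it scans the given strand only, and it says nothing about binding strength or dimeric site spacing.

```python
import re

# A, any base, then GTCA. The lookahead lets overlapping matches be reported.
RELAXED_HALF_SITE = re.compile(r"(?=(A[ACGT]GTCA))")


def find_half_sites(dna: str):
    """Return (0-based start, matched hexamer) for every ANGTCA occurrence."""
    return [
        (match.start(), match.group(1))
        for match in RELAXED_HALF_SITE.finditer(dna.upper())
    ]


if __name__ == "__main__":
    toy_sequence = "ttAAGTCAccAGGTCAggATGTCA"  # made-up sequence
    for start, hexamer in find_half_sites(toy_sequence):
        print(start, hexamer)
```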

    Arsenic resistance in the archaeon "Ferroplasma acidarmanus" : new insights into the structure and evolution of the ars genes

    Arsenic resistance in the acidophilic iron-oxidizing archaeon "Ferroplasma acidarmanus" was investigated. F. acidarmanus is native to arsenic-rich environments, and culturing experiments confirm a high level of resistance to both arsenite and arsenate. Analyses of the complete genome revealed protein-encoding regions related to known arsenic-resistance genes. Genes encoding ArsR (arsenite-sensitive regulator) and ArsB (arsenite-efflux pump) homologues were found located on a single operon. A gene encoding an ArsA relative (anion-translocating ATPase) located apart from the arsRB operon was also identified. Arsenate-resistance genes encoding proteins homologous to the arsenate reductase ArsC and the phosphate-specific transporter Pst were not found, indicating that additional unknown arsenic-resistance genes exist for arsenate tolerance. Phylogenetic analyses of ArsA-related proteins suggest separate evolutionary lines for these proteins and offer new insights into the formation of the arsA gene. The ArsB-homologous protein of F. acidarmanus had a high degree of similarity to known ArsB proteins. An evolutionary analysis of ArsB homologues across a number of species indicated a clear relationship in close agreement with 16S rRNA evolutionary lines. These results support a hypothesis of arsenic resistance developing early in the evolution of life. Peer Reviewed. http://deepblue.lib.umich.edu/bitstream/2027.42/42444/1/s00792-002-0303-6.pd