Distances and classification of amino acids for different protein secondary structures
Window profiles of amino acids in protein sequences are taken as a
description of the amino acid environment. The relative entropy or
Kullback-Leibler distance derived from profiles is used as a measure of
dissimilarity for comparison of amino acids and secondary structure
conformations. Distance matrices of amino acid pairs at different conformations
are obtained, which display a non-negligible dependence of amino acid
similarity on conformations. Based on the conformation-specific distances,
clustering analysis for amino acids is conducted. Comment: 15 pages, 8 figures
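As an illustration of the relative entropy named in the abstract above, the following minimal Python sketch computes the Kullback-Leibler distance between two discrete profiles. The function names, the base-2 logarithm, and the symmetrized variant are assumptions made for this example; the abstract does not say which form the authors use.

```python
import math


def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits between two discrete distributions.

    Terms with p_i == 0 contribute nothing. Every q_i must be positive
    wherever p_i is positive (in practice, add pseudocounts to the profiles).
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrized variant (an assumption here), so that d(p, q) == d(q, p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)


if __name__ == "__main__":
    # Toy 4-letter profiles; a real window profile would have 20 entries.
    profile_a = [0.4, 0.3, 0.2, 0.1]
    profile_b = [0.25, 0.25, 0.25, 0.25]
    print(kl_divergence(profile_a, profile_b))
    print(symmetric_kl(profile_a, profile_b))
```

Because the plain relative entropy is not symmetric, a symmetrized or otherwise adjusted form is usually needed before the values can fill a distance matrix for clustering.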
High-throughput discovery of rare human nucleotide polymorphisms by Ecotilling
Human individuals differ from one another at only ∼0.1% of nucleotide positions, but these single nucleotide differences account for most heritable phenotypic variation. Large-scale efforts to discover and genotype human variation have been limited to common polymorphisms. However, these efforts overlook rare nucleotide changes that may contribute to phenotypic diversity and genetic disorders, including cancer. Thus, there is an increasing need for high-throughput methods to robustly detect rare nucleotide differences. Toward this end, we have adapted the mismatch discovery method known as Ecotilling for the discovery of human single nucleotide polymorphisms. To increase throughput and reduce costs, we developed a universal primer strategy and implemented algorithms for automated band detection. Ecotilling was validated by screening 90 human DNA samples for nucleotide changes in 5 gene targets and by comparing results to public resequencing data. To increase throughput for discovery of rare alleles, we pooled samples 8-fold and found Ecotilling to be efficient relative to resequencing, with a false negative rate of 5% and a false discovery rate of 4%. We identified 28 new rare alleles, including some that are predicted to damage protein function. The detection of rare damaging mutations has implications for models of human disease
A methodology for determining amino-acid substitution matrices from set covers
We introduce a new methodology for the determination of amino-acid
substitution matrices for use in the alignment of proteins. The new methodology
is based on a pre-existing set cover on the set of residues and on the
undirected graph that describes residue exchangeability given the set cover.
For fixed functional forms indicating how to obtain edge weights from the set
cover and, after that, substitution-matrix elements from weighted distances on
the graph, the resulting substitution matrix can be checked for performance
against some known set of reference alignments and for given gap costs. Finding
the appropriate functional forms and gap costs can then be formulated as an
optimization problem that seeks to maximize the performance of the substitution
matrix on the reference alignment set. We give computational results on the
BAliBASE suite using a genetic algorithm for optimization. Our results indicate
that it is possible to obtain substitution matrices whose performance is either
comparable to or surpasses that of several others, depending on the particular
scenario under consideration
Simplified amino acid alphabets based on deviation of conditional probability from random background
The primitive data for deducing the Miyazawa-Jernigan contact energy or
BLOSUM score matrix consists of pair frequency counts. Each amino acid
corresponds to a conditional probability distribution. Based on the deviation
of such conditional probability from random background, a scheme for reduction
of amino acid alphabet is proposed. An evident discrepancy is observed
between the reduced alphabets obtained from the raw Miyazawa-Jernigan and
BLOSUM residue pair counts. Taking the homologous
sequence database SCOP40 as a test set, we detect homology with the obtained
coarse-grained substitution matrices. It is verified that the reduced alphabets
obtained preserve well the information contained in the original 20-letter
alphabet. Comment: 9 pages, 3 figures
Discovery of chemically induced mutations in rice by TILLING
BACKGROUND: Rice is both a food source for a majority of the world's population and an important model system. Available functional genomics resources include targeted insertion mutagenesis and transgenic tools. While these can be powerful, a non-transgenic, unbiased targeted mutagenesis method that can generate a range of allele types would add considerably to the analysis of the rice genome. TILLING (Targeting Induced Local Lesions in Genomes), a general reverse genetic technique that combines traditional mutagenesis with high throughput methods for mutation discovery, is such a method. RESULTS: To apply TILLING to rice, we developed two mutagenized rice populations. One population was developed by treatment with the chemical mutagen ethyl methanesulphonate (EMS), and the other with a combination of sodium azide plus methyl-nitrosourea (Az-MNU). To find induced mutations, target regions of 0.7–1.5 kilobases were PCR amplified using gene specific primers labeled with fluorescent dyes. Heteroduplexes were formed through denaturation and annealing of PCR products, mismatches digested with a crude preparation of CEL I nuclease and cleaved fragments visualized using denaturing polyacrylamide gel electrophoresis. In 10 target genes screened, we identified 27 nucleotide changes in the EMS-treated population and 30 in the Az-MNU population. CONCLUSION: We estimate that the density of induced mutations is two- to threefold higher than previously reported rice populations (about 1/300 kb). By comparison to other plants used in public TILLING services, we conclude that the populations described here would be suitable for use in a large scale TILLING project
VENN, a tool for titrating sequence conservation onto protein structures
Residue conservation is an important, established method for inferring protein function, modularity and specificity. It is important to recognize that it is the 3D spatial orientation of residues that drives sequence conservation. Considering this, we have built a new computational tool, VENN that allows researchers to interactively and graphically titrate sequence homology onto surface representations of protein structures. Our proposed titration strategies reveal critical details that are not readily identified using other existing tools. Analyses of a bZIP transcription factor and receptor recognition of Fibroblast Growth Factor using VENN revealed key specificity determinants. Weblink: http://sbtools.uchc.edu/venn/
Optimal neighborhood indexing for protein similarity search
Background: Similarity inference, one of the main bioinformatics tasks, has to face an exponential growth of the biological data. A classical approach used to cope with this data flow involves heuristics with large seed indexes. In order to speed up this technique, the index can be enhanced by storing additional information to limit the number of random memory accesses. However, this improvement leads to a larger index that may become a bottleneck. In the case of protein similarity search, we propose to decrease the index size by reducing the amino acid alphabet.
Results: The paper presents two main contributions. First, we show that an optimal neighborhood indexing combining an alphabet reduction and a longer neighborhood leads to a 35% reduction in the memory involved in the process, without sacrificing either the quality of results or the computational time. Second, our approach led us to develop a new kind of substitution score matrices and their associated e-value parameters. In contrast to usual matrices, these matrices are rectangular since they compare amino acid groups from different alphabets. We describe the method used for computing those matrices and we provide some typical examples that can be used in such comparisons. Supplementary data can be found on the website http://bioinfo.lifl.fr/reblosum.
Conclusions: We propose a practical index size reduction of the neighborhood data, that does not negatively affect the performance of large-scale search in protein sequences. Such an index can be used in any study involving large protein data. Moreover, rectangular substitution score matrices and their associated statistical parameters can have applications in any study involving an alphabet reduction
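To make the idea of alphabet reduction concrete, the sketch below builds a plain word index over a reduced alphabet. It is not the neighborhood index described in the abstract above: the grouping in `GROUPS`, the function names, and the word length are all hypothetical choices for illustration. It only shows why merging amino acids into groups shrinks the number of possible index keys.

```python
from collections import defaultdict

# Hypothetical grouping of the 20 amino acids, for illustration only.
GROUPS = ["AG", "C", "DENQ", "FWY", "H", "ILMV", "KR", "P", "ST"]

# Map each amino acid to the first letter of its group.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_seq(seq):
    """Rewrite a protein sequence over the reduced alphabet."""
    return "".join(REDUCE[aa] for aa in seq)


def build_index(seq, w):
    """Index every length-w word of the reduced sequence by start position."""
    reduced = reduce_seq(seq)
    index = defaultdict(list)
    for i in range(len(reduced) - w + 1):
        index[reduced[i:i + w]].append(i)
    return index


if __name__ == "__main__":
    index = build_index("MKRLIVKRST", 3)
    for word, positions in sorted(index.items()):
        print(word, positions)
    # Upper bound on distinct keys: 9**3 for this grouping versus 20**3.
    print(len(GROUPS) ** 3, 20 ** 3)
```

With fewer distinct keys, the index table itself is smaller, at the cost of more positions per key; the abstract reports that a longer neighborhood compensates for this loss of selectivity.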
Towards Reliable Automatic Protein Structure Alignment
A variety of methods have been proposed for structure similarity calculation,
which are called structure alignment or superposition. One major shortcoming in
current structure alignment algorithms is in their inherent design, which is
based on local structure similarity. In this work, we propose a method to
incorporate global information in obtaining optimal alignments and
superpositions. Our method, when applied to optimizing the TM-score and the GDT
score, produces significantly better results than current state-of-the-art
protein structure alignment tools. Specifically, if the highest TM-score found
by TMalign is lower than 0.6 and the highest TM-score found by one of the
tested methods is higher than 0.5, there is a probability of 42% that
TMalign failed to find TM-scores higher than 0.5, while the same probability
is reduced to 2% if our method is used. This could significantly improve the
accuracy of fold detection if the cutoff TM-score of 0.5 is used.
In addition, existing structure alignment algorithms focus on structure
similarity alone and simply ignore other important similarities, such as
sequence similarity. Our approach has the capacity to incorporate multiple
similarities into the scoring function. Results show that sequence similarity
aids in finding high quality protein structure alignments that are more
consistent with eye-examined alignments in HOMSTRAD. Even when structure
similarity itself fails to find alignments with any consistency with
eye-examined alignments, our method remains capable of finding alignments
highly similar to, or even identical to, eye-examined alignments. Comment: Peer-reviewed and presented as part of the 13th Workshop on
Algorithms in Bioinformatics (WABI2013)
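For readers unfamiliar with the TM-score used as the cutoff in the abstract above, here is a minimal sketch of its standard definition, evaluated on a fixed alignment and superposition. It does not implement the optimization method of the abstract. The function names and the 0.5 floor on the distance scale for short chains are assumptions of this example.

```python
def tm_d0(l_target):
    """Length-dependent distance scale d0 of the TM-score.

    The floor of 0.5 for short chains is an assumption of this sketch.
    """
    if l_target <= 15:
        return 0.5
    return max(1.24 * (l_target - 15) ** (1.0 / 3.0) - 1.8, 0.5)


def tm_score(distances, l_target):
    """TM-score for one fixed superposition.

    distances: distances (in angstroms) between aligned residue pairs
    l_target:  length of the target protein used for normalization
    """
    d0 = tm_d0(l_target)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / l_target


if __name__ == "__main__":
    # 80 of 100 target residues aligned, each pair 2 angstroms apart.
    print(tm_score([2.0] * 80, 100))
```

The reported TM-score of an alignment is the maximum of this quantity over superpositions, which is why a search method can miss a score above the 0.5 cutoff even when one exists.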
The Gumbel pre-factor k for gapped local alignment can be estimated from simulations of global alignment
The optimal gapped local alignment score of two random sequences follows a Gumbel distribution. The Gumbel distribution has two parameters, the scale parameter λ and the pre-factor k. Presently, the basic local alignment search tool (BLAST) programs (BLASTP (BLAST for proteins), PSI-BLAST, etc.) all use time-consuming computer simulations to determine the Gumbel parameters. Because the simulations must be done offline, BLAST users are restricted in their choice of alignment scoring schemes. The ultimate aim of this paper is to speed the simulations, to determine the Gumbel parameters online, and to remove the corresponding restrictions on BLAST users. Simulations for the scale parameter λ can be as much as five times faster if they use global instead of local alignment [R. Bundschuh (2002) J. Comput. Biol., 9, 243–260]. Unfortunately, the acceleration does not extend to determining the Gumbel pre-factor k, because k has no known mathematical relationship to global alignment. This paper relates k to global alignment and exploits the relationship to show that for the BLASTP defaults, 10 000 realizations with sequences of average length 140 suffice to estimate both Gumbel parameters λ and k within the errors required (λ, 0.8%; k, 10%). For the BLASTP defaults, simulations for both Gumbel parameters now take less than 30 s on a 2.8 GHz Pentium 4 processor
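The role of the two Gumbel parameters in the abstract above can be sketched as follows, using the usual form of the tail for local alignment scores, P(S ≥ x) ≈ 1 − exp(−k·m·n·e^(−λx)), for sequences of lengths m and n. The function names and the numeric values of `lam` and `k` below are placeholders chosen for this example and are not taken from the paper.

```python
import math


def expected_hits(score, m, n, lam, k):
    """Expected number of chance alignments scoring at least `score`."""
    return k * m * n * math.exp(-lam * score)


def p_value(score, m, n, lam, k):
    """Gumbel tail probability 1 - exp(-E), computed stably via expm1."""
    return -math.expm1(-expected_hits(score, m, n, lam, k))


if __name__ == "__main__":
    # Placeholder parameters for illustration only.
    lam, k = 0.267, 0.041
    for score in (30, 40, 50):
        e = expected_hits(score, 140, 140, lam, k)
        print(score, e, p_value(score, 140, 140, lam, k))
```

Because λ sits in the exponent while k only scales the result, a given relative error in λ affects the p-value far more than the same error in k, which is consistent with the tighter tolerance quoted for λ (0.8%) than for k (10%).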
Candida albicans repetitive elements display epigenetic diversity and plasticity
Transcriptionally silent heterochromatin is associated with repetitive DNA. It is poorly understood whether and how heterochromatin differs between different organisms and whether its structure can be remodelled in response to environmental signals. Here, we address this question by analysing the chromatin state associated with DNA repeats in the human fungal pathogen Candida albicans. Our analyses indicate that, contrary to model systems, each type of repetitive element is assembled into a distinct chromatin state. Classical Sir2-dependent hypoacetylated and hypomethylated chromatin is associated with the rDNA locus while telomeric regions are assembled into a weak heterochromatin that is only mildly hypoacetylated and hypomethylated. Major Repeat Sequences, a class of tandem repeats, are assembled into an intermediate chromatin state bearing features of both euchromatin and heterochromatin. Marker gene silencing assays and genome-wide RNA sequencing reveal that C. albicans heterochromatin represses expression of repeat-associated coding and non-coding RNAs. We find that telomeric heterochromatin is dynamic and remodelled upon an environmental change. Weak heterochromatin is associated with telomeres at 30 °C, while robust heterochromatin is assembled over these regions at 39 °C, a temperature mimicking moderate fever in the host. Thus in C. albicans, differential chromatin states control gene expression and epigenetic plasticity is linked to adaptation