
    Epigenetics & chromatin: Interactions and processes

    From 11 to 13 March 2013, BioMed Central will be hosting its inaugural conference, Epigenetics & Chromatin: Interactions and Processes, at Harvard Medical School, Cambridge, MA, USA. Epigenetics & Chromatin has now launched a special article series based on the general themes of the conference.

    Positive Selection of Iris, a Retroviral Envelope–Derived Host Gene in Drosophila melanogaster

    Eukaryotic genomes can usurp enzymatic functions encoded by mobile elements for their own use. A particularly interesting kind of acquisition involves the domestication of retroviral envelope genes, which confer infectious membrane-fusion ability to retroviruses. So far, these examples have been limited to vertebrate genomes, including primates where the domesticated envelope is under purifying selection to assist placental function. Here, we show that in Drosophila genomes, a previously unannotated gene (CG4715, renamed Iris) was domesticated from a novel, active Kanga lineage of insect retroviruses at least 25 million years ago, and has since been maintained as a host gene that is expressed in all adult tissues. Iris and the envelope genes from Kanga retroviruses are homologous to those found in insect baculoviruses and gypsy and roo insect retroviruses. Two separate envelope domestications from the Kanga and roo retroviruses have taken place, in fruit fly and mosquito genomes, respectively. Whereas retroviral envelopes are proteolytically cleaved into the ligand-interaction and membrane-fusion domains, Iris appears to lack this cleavage site. In the takahashii/suzukii species groups of Drosophila, we find that Iris has tandemly duplicated to give rise to two genes (Iris-A and Iris-B). Iris-B has significantly diverged from the Iris-A lineage, primarily because of the “invention” of an intron de novo in what was previously exonic sequence. Unlike domesticated retroviral envelope genes in mammals, we find that Iris has been subject to strong positive selection between Drosophila species. The rapid, adaptive evolution of Iris is sufficient to unambiguously distinguish the phylogenies of three closely related sibling species of Drosophila (D. simulans, D. sechellia, and D. mauritiana), a discriminative power previously described only for a putative “speciation gene.” Iris represents the first instance of a retroviral envelope–derived host gene outside vertebrates. It is also the first example of a retroviral envelope gene that has been found to be subject to positive selection following its domestication. The unusual selective pressures acting on Iris suggest that it is an active participant in an ongoing genetic conflict. We propose a model in which Iris has “switched sides,” having been recruited by host genomes to combat baculoviruses and retroviruses, which employ homologous envelope genes to mediate infection.

    Pairwise alignment incorporating dipeptide covariation

    Motivation: Standard algorithms for pairwise protein sequence alignment make the simplifying assumption that amino acid substitutions at neighboring sites are uncorrelated. This assumption allows implementation of fast algorithms for pairwise sequence alignment, but it ignores information that could conceivably increase the power of remote homolog detection. We examine the validity of this assumption by constructing extended substitution matrices that encapsulate the observed correlations between neighboring sites, by developing an efficient and rigorous algorithm for pairwise protein sequence alignment that incorporates these local substitution correlations, and by assessing the ability of this algorithm to detect remote homologies. Results: Our analysis indicates that local correlations between substitutions are not strong on average. Furthermore, incorporating local substitution correlations into pairwise alignment did not lead to a statistically significant improvement in remote homology detection. Therefore, the standard assumption that individual residues within protein sequences evolve independently of neighboring positions appears to be an efficient and appropriate approximation.
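    As an illustration of the standard assumption discussed in this abstract, the sketch below scores a global alignment with dynamic programming in which every aligned pair of residues contributes independently of its neighbors. It is not the extended algorithm the authors describe. The function names, the toy `simple_score` function, and the linear gap penalty are all assumptions made for the example.

```python
def simple_score(x, y):
    """Toy substitution score (an assumption, not a real matrix such as BLOSUM)."""
    return 2 if x == y else -1


def global_alignment_score(a, b, score=simple_score, gap=-4):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Each column of the alignment is scored on its own, which is exactly the
    'neighboring sites are uncorrelated' assumption described in the abstract.
    """
    # prev[j] holds the best score for aligning the first i-1 residues of a
    # with the first j residues of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align a[i-1] with b[j-1]
                prev[j] + gap,                            # gap in b
                curr[j - 1] + gap,                        # gap in a
            )
        prev = curr
    return prev[-1]
```

    Capturing correlations between neighboring sites would require the score of one column to depend on the adjacent column, which this recurrence deliberately does not do.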

    Distances and classification of amino acids for different protein secondary structures

    Window profiles of amino acids in protein sequences are taken as a description of the amino acid environment. The relative entropy or Kullback-Leibler distance derived from profiles is used as a measure of dissimilarity for comparison of amino acids and secondary structure conformations. Distance matrices of amino acid pairs at different conformations are obtained, which display a non-negligible dependence of amino acid similarity on conformations. Based on the conformation-specific distances, a clustering analysis of amino acids is conducted. Comment: 15 pages, 8 figures
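    The relative entropy mentioned in this abstract can be sketched as follows for two discrete distributions, such as two amino acid frequency profiles. The abstract does not say how the profiles are built or whether the distance is symmetrized, so the pseudocount smoothing, the base-2 logarithm, and the `symmetric_kl` helper are assumptions made for the example.

```python
import math


def kl_divergence(p, q, pseudocount=1e-9):
    """Relative entropy D(p || q) in bits between two discrete distributions.

    A small pseudocount (an assumption of this sketch) is added to every
    entry so that zero frequencies do not produce log(0) or division by zero.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    p_s = [x + pseudocount for x in p]
    q_s = [x + pseudocount for x in q]
    p_sum, q_sum = sum(p_s), sum(q_s)
    return sum(
        (a / p_sum) * math.log2((a / p_sum) / (b / q_sum))
        for a, b in zip(p_s, q_s)
    )


def symmetric_kl(p, q):
    """Average of D(p || q) and D(q || p), one common way to symmetrize."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

    Computing `symmetric_kl` for every pair of profiles would give a distance matrix of the kind that clustering methods take as input.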

    Evidence of Influence of Genomic DNA Sequence on Human X Chromosome Inactivation

    A significant number of human X-linked genes escape X chromosome inactivation and are thus expressed from both the active and inactive X chromosomes. The basis for escape from inactivation and the potential role of the X chromosome primary DNA sequence in determining a gene's X inactivation status are unclear. Using a combination of the X chromosome sequence and a comprehensive X inactivation profile of more than 600 genes, two independent yet complementary approaches were used to systematically investigate the relationship between X inactivation and DNA sequence features. First, statistical analyses revealed that a number of repeat features, including long interspersed nuclear element (LINE) and mammalian-wide interspersed repeat repetitive elements, are significantly enriched in regions surrounding transcription start sites of genes that are subject to inactivation, while Alu repetitive elements and short motifs containing ACG/CGT are significantly enriched in those that escape inactivation. Second, linear support vector machine classifiers constructed using primary DNA sequence features were used to correctly predict the X inactivation status for >80% of all X-linked genes. We further identified a small set of features that are important for accurate classification, among which LINE-1 and LINE-2 content show the greatest individual discriminatory power. Finally, as few as 12 features can be used for accurate support vector machine classification. Taken together, these results suggest that features of the underlying primary DNA sequence of the human X chromosome may influence the spreading and/or maintenance of X inactivation.

    Simplified amino acid alphabets based on deviation of conditional probability from random background

    The primitive data for deducing the Miyazawa-Jernigan contact energy or BLOSUM score matrix consist of pair frequency counts. Each amino acid corresponds to a conditional probability distribution. Based on the deviation of this conditional probability from a random background, a scheme for reducing the amino acid alphabet is proposed. A clear discrepancy is observed between the reduced alphabets obtained from the raw Miyazawa-Jernigan and BLOSUM residue pair counts. Taking the homologous sequence database SCOP40 as a test set, we detect homology with the resulting coarse-grained substitution matrices. The reduced alphabets are found to preserve well the information contained in the original 20-letter alphabet. Comment: 9 pages, 3 figures

    Optimal neighborhood indexing for protein similarity search

    Background: Similarity inference, one of the main bioinformatics tasks, must cope with the exponential growth of biological data. A classical approach used to cope with this data flow involves heuristics with large seed indexes. In order to speed up this technique, the index can be enhanced by storing additional information to limit the number of random memory accesses. However, this improvement leads to a larger index that may become a bottleneck. In the case of protein similarity search, we propose to decrease the index size by reducing the amino acid alphabet.
    Results: The paper presents two main contributions. First, we show that an optimal neighborhood indexing combining an alphabet reduction and a longer neighborhood reduces the memory involved in the process by 35% without sacrificing result quality or computation time. Second, our approach led us to develop a new kind of substitution score matrices and their associated e-value parameters. In contrast to usual matrices, these matrices are rectangular since they compare amino acid groups from different alphabets. We describe the method used for computing those matrices and we provide some typical examples that can be used in such comparisons. Supplementary data can be found on the website http://bioinfo.lifl.fr/reblosum.
    Conclusions: We propose a practical index size reduction of the neighborhood data that does not negatively affect the performance of large-scale search in protein sequences. Such an index can be used in any study involving large protein data. Moreover, rectangular substitution score matrices and their associated statistical parameters can have applications in any study involving an alphabet reduction.
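    To make the idea of alphabet reduction concrete, the sketch below maps the 20 amino acids onto 10 groups and builds a seed index keyed on k-mers of the reduced sequence. It is only an illustration: the grouping in `GROUPS`, the seed length, and the function names are hypothetical and are not taken from the paper, which stores neighborhood data alongside the seeds rather than bare positions.

```python
from collections import defaultdict

# Hypothetical 10-group reduction of the 20-letter amino acid alphabet.
# Each group is represented by its first letter.
GROUPS = ["AST", "C", "DE", "FWY", "G", "H", "ILMV", "KR", "NQ", "P"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(seq):
    """Rewrite a protein sequence in the reduced alphabet."""
    return "".join(REDUCE[aa] for aa in seq)


def build_seed_index(seq, k=3):
    """Map each reduced k-mer to the list of positions where it starts."""
    index = defaultdict(list)
    reduced = reduce_sequence(seq)
    for pos in range(len(reduced) - k + 1):
        index[reduced[pos:pos + k]].append(pos)
    return dict(index)
```

    With 10 groups instead of 20 letters, the number of possible keys for a seed of length k drops from 20^k to 10^k, which is the kind of saving that makes a smaller index possible.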

    Convolutional LSTM Networks for Subcellular Localization of Proteins

    Machine learning is widely used to analyze biological sequence data. Non-sequential models such as SVMs or feed-forward neural networks are often used, although they have no natural way of handling sequences of varying length. Recurrent neural networks such as the long short-term memory (LSTM) model, on the other hand, are designed to handle sequences. In this study we demonstrate that LSTM networks predict the subcellular location of proteins given only the protein sequence with high accuracy (0.902), outperforming current state-of-the-art algorithms. We further improve the performance by introducing convolutional filters and experiment with an attention mechanism that lets the LSTM focus on specific parts of the protein. Lastly, we introduce new visualizations of both the convolutional filters and the attention mechanisms and show how they can be used to extract biologically relevant knowledge from the LSTM networks.