730 research outputs found

    Improving algorithms of gene prediction in prokaryotic genomes, metagenomes, and eukaryotic transcriptomes

    Next-generation sequencing has generated an enormous amount of DNA and RNA sequence data that potentially carries volumes of genetic information, e.g. protein-coding genes. The thesis is divided into three main parts describing i) GeneMarkS-2, ii) GeneMarkS-T, and iii) MetaGeneTack. In prokaryotic genomes, ab initio gene finders can predict genes with high accuracy. However, the error rate is not negligible and is largely species-specific. Most errors in gene prediction are made in genes located in genomic regions with atypical GC composition, e.g. genes in pathogenicity islands. We describe a new algorithm, GeneMarkS-2, that uses local GC-specific heuristic models for scoring individual ORFs in the first step of analysis. Predicted atypical genes are retained and serve as ‘external’ evidence in subsequent runs of self-training. GeneMarkS-2 also controls the quality of the training process by effectively selecting optimal orders of the Markov chain models as well as duration parameters in the hidden semi-Markov model. GeneMarkS-2 has shown significantly improved accuracy compared with other state-of-the-art gene prediction tools. Massively parallel sequencing of RNA transcripts by next-generation technology (RNA-Seq) provides a large amount of RNA reads that can be assembled into a full transcriptome. We have developed a new tool, GeneMarkS-T, for ab initio identification of protein-coding regions in RNA transcripts. Unsupervised estimation of the algorithm's parameters makes several steps of conventional gene prediction protocols unnecessary, most importantly the manually curated preparation of training sets. We have demonstrated that GeneMarkS-T self-training is robust to errors in assembled transcripts, and that the accuracy of GeneMarkS-T in identifying protein-coding regions and, particularly, in predicting gene starts compares favorably to other existing methods.
Frameshift (FS) prediction is important for analysis and biological interpretation of metagenomic sequences. Reads in metagenomic samples are prone to sequencing errors. Insertion and deletion errors that change the coding frame impair the accurate identification of protein-coding genes. Accurate frameshift prediction requires a sufficient amount of data to estimate parameters of species-specific statistical models of protein-coding and non-coding regions. However, this data is not available; all we have is metagenomic sequences of unknown origin. The challenge of ab initio FS detection is, therefore, twofold: (i) to find a way to infer the necessary model parameters and (ii) to identify positions of frameshifts (if any). We describe a new tool, MetaGeneTack, which uses a heuristic method to estimate parameters of the sequence models used in the FS detection algorithm. It was shown on several test sets that the performance of MetaGeneTack FS detection is comparable to or better than that of the earlier program FragGeneScan. (Ph.D. thesis)
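The abstract above mentions scoring individual ORFs with models chosen by local GC composition. As a rough illustration of the first half of that idea only, the following Python sketch finds forward-strand ORFs and assigns each one to a GC-content bin. It is not the GeneMarkS-2 algorithm: the function names, the ATG-only start codon, the 10% bin width and the minimum ORF length are all assumptions made for this example.

```python
# Minimal sketch (NOT GeneMarkS-2): locate simple ORFs and bin them by GC content.
# Assumptions: forward strand only, ATG as the only start codon, 10% GC bins.

START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end) pairs for ATG...stop ORFs in the three forward frames.

    `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i  # first in-frame ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ORF in this frame
    return orfs


def gc_percent(seq):
    """Integer GC percentage (floored) of a sequence."""
    seq = seq.upper()
    if not seq:
        return 0
    return 100 * (seq.count("G") + seq.count("C")) // len(seq)


def gc_bin(seq, width_pct=10):
    """Index of the GC bin an ORF falls into; a GC-specific model
    (hypothetical here) could then be looked up by this index."""
    return gc_percent(seq) // width_pct


genome = "CCATGGCGCGCTAAGG"
for start, end in find_orfs(genome):
    orf = genome[start:end]
    print(start, end, orf, gc_percent(orf), gc_bin(orf))
```

In a real gene finder the bin index would select a set of model parameters used to score the ORF; here it is only computed and printed.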

    Peptide vocabulary analysis reveals ultra-conservation and homonymity in protein sequences

    A new algorithm is presented for vocabulary analysis (word detection) in texts of human origin. It performs at 60%–70% overall accuracy and greater than 80% accuracy for longer words, and approximately 85% sensitivity on Alice in Wonderland, a considerable improvement on previous methods. When applied to protein sequences, it detects short sequences analogous to words in human texts, i.e. intolerant to changes in spelling (mutation), and relatively context-independent in their meaning (function). Some of these are homonyms of up to 7 amino acids, which can assume different structures in different proteins. Others are ultra-conserved stretches of up to 18 amino acids within proteins of less than 40% overall identity, reflecting extreme constraint or convergent evolution. Different species are found to have qualitatively different major peptide vocabularies, e.g. some are dominated by large gene families, while others are rich in simple repeats or dominated by internally repetitive proteins. This suggests the possibility of a peptide vocabulary signature, analogous to genome signatures in DNA. Homonyms may be useful in detecting convergent evolution and positive selection in protein evolution. Ultra-conserved words may be useful in identifying structures intolerant to substitution over long periods of evolutionary time.

    A new census of protein tandem repeats and their relationship with intrinsic disorder

    Protein tandem repeats (TRs) are often associated with immunity-related functions and diseases. Since the last census of protein TRs in 1999, the number of curated proteins has increased more than seven-fold and new TR prediction methods have been published. TRs appear to be enriched with intrinsic disorder and vice versa. The significance and the biological reasons for this association are unknown. Here, we characterize protein TRs across all kingdoms of life and their overlap with intrinsic disorder in unprecedented detail. Using state-of-the-art prediction methods, we estimate that 50.9% of proteins contain at least one TR, often located at the sequence flanks. A positive linear correlation between the proportion of TRs and protein length was observed universally. Eukaryotes in general have more TRs, but when the difference in length is taken into account the difference is quite small. TRs were enriched with disorder-promoting amino acids and were inside intrinsically disordered regions. Many such TRs were homorepeats. Our results support that TRs mostly originate by duplication and are involved in essential functions such as transcription processes, structural organization, electron transport and iron-binding. In viruses, TRs are found in proteins essential for virulence.

    CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences

    Background: It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding for protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches to solve this problem based on codon-shuffling have been limited to exploring only an infinitesimal fraction of the sequence space and by their use of parametric approximations. Results: We present a novel O(N (log N)^2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to assess the exact non-parametric p-value of whether a given motif is over- or under-represented. We demonstrate that our method identifies known functional motifs more accurately than sampling and parametric-based approaches in a variety of coding datasets of various size, including ChIP-seq data for the transcription factors NRSF and GABP. Conclusions: CodingMotif provides a theoretically and empirically demonstrated advance for the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions.
A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar
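To make the quantity in the CodingMotif abstract concrete, the Python sketch below computes the exact distribution of motif occurrence counts over all synonymous codings of a short peptide, using a small dynamic program over codons. It is a simplified illustration, not the published CodingMotif algorithm: codons are chosen independently (no dinucleotide bias), there is no sparse convolution speed-up, the codon table `SYNONYMS` is a partial subset of the standard genetic code, and uniform codon usage is an assumption chosen for the example.

```python
from collections import defaultdict
from fractions import Fraction

# Partial standard codon table, enough for the toy example (assumption: subset only).
SYNONYMS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
}

# Assumed background model: uniform usage among synonymous codons.
UNIFORM = {
    aa: {codon: Fraction(1, len(codons)) for codon in codons}
    for aa, codons in SYNONYMS.items()
}


def motif_count_distribution(peptide, motif, codon_probs):
    """Exact P(count = k) of (possibly overlapping) motif occurrences over
    all codings of `peptide`, with codons drawn independently."""
    m = len(motif)
    # DP state: (last m-1 nucleotides, occurrences so far) -> probability
    states = {("", 0): Fraction(1)}
    for aa in peptide:
        nxt = defaultdict(Fraction)
        for (suffix, count), p in states.items():
            for codon, q in codon_probs[aa].items():
                window = suffix + codon
                # Count only occurrences that end inside the new codon.
                new = sum(
                    1
                    for end in range(len(suffix) + 1, len(window) + 1)
                    if end >= m and window[end - m:end] == motif
                )
                keep = window[-(m - 1):] if m > 1 else ""
                nxt[(keep, count + new)] += p * q
        states = nxt
    dist = defaultdict(Fraction)
    for (_, count), p in states.items():
        dist[count] += p
    return dict(dist)


def upper_tail_p(dist, observed):
    """Exact p-value for seeing at least `observed` occurrences."""
    return sum(p for k, p in dist.items() if k >= observed)


dist = motif_count_distribution("MKF", "AA", UNIFORM)
print(dist, upper_tail_p(dist, 2))
```

Keeping only the last `len(motif) - 1` nucleotides in the state is what lets the recursion avoid enumerating every synonymous sequence; exact fractions are used so the resulting p-value has no floating-point error.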

    Pan-archaeal analysis of C/D box sRNA biogenesis and methylation targets

    Post-transcriptional modifications of RNA molecules occur in all three domains of life and influence RNA stability and functionality. The most numerous modifications are 2'-O-methylations at the ribose moiety and pseudouridylations. In archaea, modified bases are abundant in ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs). The introduction of both modifications is guided by small RNAs that are incorporated into ribonucleoprotein complexes (RNPs). 2'-O-methylations are guided by C/D box sRNAs in archaea. C/D box sRNAs are characterized by the conserved sequence elements boxC/C' (consensus sequence: RUGAUGA) and boxD/D' (consensus sequence: CUGA). Upon C/D box sRNA folding, both sequence elements base-pair (boxC with boxD and boxC' with boxD'), which results in the formation of two kink-turn motifs that are stabilized by binding of the protein L7Ae. The sequences between the two kink-turn elements show complementarity to the sequences of the target RNA and thereby serve as guide sequences that determine the sites of 2'-O-methylation. The modifications are introduced site-specifically at the nucleotide of the target RNA that is complementary to the fifth nucleotide upstream of the boxD/D' motif by the methyltransferase fibrillarin. Based on the guide sequences, C/D box sRNA targets of seven archaea were predicted and mapped onto the consensus structure of the 16S and 23S rRNA. Conserved methylation hotspots were observed in ancient core regions of the rRNAs that are important for ribosome integrity and functionality and that are not protected by ribosomal proteins. Therefore, the modifications might contribute to the folding, structural stabilization and function of the rRNAs. The biogenesis of archaeal C/D box sRNAs is largely unknown as independent promoters cannot be identified for the majority of the C/D box sRNA genes. 
The analysis of C/D box sRNA genes in six archaeal model organisms revealed diverse genetic contexts, providing opportunities for transcription without the necessity of an independent promoter. C/D box sRNA genes localize e.g. in the 5' or 3'-UTR of flanking protein-coding regions, and polycistronic C/D box sRNA transcripts exist. Plasmid-based C/D box sRNA in vivo analyses were performed in Sulfolobus acidocaldarius, in which C/D box sRNA gene variants with their native or random upstream and downstream sequences were used to identify C/D box sRNA stabilization and maturation requirements. The analyses revealed that the maturation of C/D box sRNAs occurs independently of the upstream and downstream sequences. The integrity of the k-turn is important for C/D box sRNA stability. Archaeal C/D box sRNAs exhibit transcriptional plasticity, and their maturation is suggested to include the action of unspecific exoribonucleases. Complete degradation might be prevented by co-transcriptional L7Ae binding or complete C/D box sRNP assembly. Circular forms of the C/D box sRNAs were identified in several hyperthermophilic archaea, and the circularization reaction should protect RNAs from degradation. C/D box sRNA gene upstream and downstream sequences were shown not to be required for circularization, but the responsible RNA ligase remains to be identified. Thus, this thesis provides insights into the transcription and maturation of archaeal C/D box sRNAs and highlights conserved 2'-O-methylation patterns in archaeal rRNAs.
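The targeting rule stated in this abstract (methylation at the target nucleotide paired with the fifth nucleotide upstream of the boxD/D' motif) can be sketched in a few lines of Python. This is an illustrative toy, not the prediction pipeline used in the thesis: the function names, the fixed guide length of 10 nt, the requirement for perfect Watson-Crick complementarity (no G-U wobble or mismatches), and the way the box D motif is located are all assumptions.

```python
# Toy sketch of the "fifth nucleotide upstream of box D" rule.
# Assumptions: RNA alphabet ACGU, perfect Watson-Crick pairing, fixed guide length.

COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]


def predict_methylation_site(srna, target, box_d="CUGA", guide_len=10):
    """Return the 0-based index in `target` predicted to be 2'-O-methylated,
    or None if no box D or no perfectly complementary site is found."""
    # First box D motif that leaves room for a full guide upstream of it.
    box = srna.find(box_d, guide_len)
    if box == -1:
        return None
    guide = srna[box - guide_len:box]
    pos = target.find(reverse_complement(guide))
    if pos == -1:
        return None
    # Antiparallel pairing: guide[j] pairs with target[pos + guide_len - 1 - j].
    # The fifth nucleotide upstream of box D is guide[guide_len - 5],
    # which therefore pairs with target[pos + 4].
    return pos + 4


srna = "GAUGAACGUACGGAUCUGA"   # made-up sRNA: box C-like start, guide, box D
target = "GGAUCCGUACGUAA"      # made-up target containing the complementary site
print(predict_methylation_site(srna, target))
```

The index arithmetic follows from the antiparallel orientation of the guide-target duplex: the guide nucleotide adjacent to box D pairs with the first nucleotide of the matched target site, so the fifth upstream nucleotide pairs four positions further along.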

    Genetics of Halophilic Microorganisms

    Halophilic microorganisms are found in all domains of life and thrive in hypersaline (high salt content) environments. These unusual microbes have been a subject of study for many years due to their interesting properties and physiology. Studies of the genetics of halophilic microorganisms (from gene expression and regulation to genomics) have provided insight into how life can exist at high salinity levels. Here, we highlight recent studies that advance the knowledge of biological function through examination of the genetics of halophilic microorganisms and their viruses.

    Functional Sites in Structure and Sequence. Protein Active Sites and miRNA Target Recognition

    The number of protein three-dimensional structures is increasing steeply, and structural genomics projects aim to solve the structures for all proteins as a means to understanding function. In the first part of my thesis, I developed a method for the comparison of local structural patterns (e.g. enzyme active sites) that provides a reliable statistical measure to discern meaningful matches from noise. The method is complementary to structural alignment as it is able to confirm functional similarities suggested by an overall similar structure but also detects functional similarities between different folds. An easy-to-use interface is available on the Internet for functional annotation of protein structures (http://pints.embl.de). In the second part of my thesis, I present a computational screen for microRNA (miRNA) targets in Drosophila. miRNAs are short RNAs that inhibit translation of target messenger RNAs in animals by binding to complementary sites in their 3' untranslated regions. Target predictions were urgently needed as targets were known for only three of the more than 700 miRNAs. Of my predictions, six were validated experimentally and others are likely to be functional, making the results a useful resource for miRNA research. The screen extended miRNA function to pathway control, nervous system development and regulation of metabolism, and revealed that one miRNA typically regulates several targets but also that one gene is likely to be targeted by several miRNAs.

    SPPS: A Sequence-Based Method for Predicting Probability of Protein-Protein Interaction Partners

    Background: The molecular network sustained by different types of interactions among proteins is widely manifested as the fundamental driving force of cellular operations. Many biological functions are determined by the crosstalk between proteins rather than by the characteristics of their individual components. Thus, the searches for protein partners in global networks are imperative when attempting to address the principles of biology. Results: We have developed a web-based tool, “Sequence-based Protein Partners Search” (SPPS), to explore interacting partners of proteins by searching over a large repertoire of proteins across many species. SPPS provides a database containing more than 60,000 protein sequences with annotations and a protein-partner search engine in two modes (Single Query and Multiple Query). Two interacting proteins of human FBXO6 protein have been found using the service in the study. In addition, users can refine potential protein partner hits by using annotations and possible interactive network in the SPPS web server. Conclusions: SPPS provides a new type of tool to facilitate the identification of direct or indirect protein partners which may guide scientists on the investigation of new signaling pathways. The SPPS server is available to the public a