    Identification of protein coding genes in genomes with statistical functions based on the circular code

    A new statistical approach using functions based on the circular code correctly classifies more than 93% of bases in protein-coding genes and non-coding genes of human sequences. Based on this statistical study, a research software package called "Analysis of Coding Genes" (ACG) has been developed for identifying protein genes in genomes and for determining their frame. Furthermore, ACG also allows an evaluation of the length of protein genes, their position in the genome, their position relative to one another, and the prediction of internal frames in protein genes.

    PhyloCSF: a comparative genomics method to distinguish protein-coding and non-coding regions

    As high-throughput transcriptome sequencing provides evidence for novel transcripts in many species, there is a renewed need for accurate methods to classify small genomic regions as protein-coding or non-coding. We present PhyloCSF, a novel comparative genomics method that analyzes a multi-species nucleotide sequence alignment to determine whether it is likely to represent a conserved protein-coding region, based on a formal statistical comparison of phylogenetic codon models. We show that PhyloCSF's classification performance in 12-species _Drosophila_ genome alignments exceeds all other methods we compared in a previous study, and we provide a software implementation for use by the community. We anticipate that this method will be widely applicable as the transcriptomes of many additional species, tissues, and subcellular compartments are sequenced, particularly in the context of ENCODE and modENCODE.

    Nucleotide sequence and genomic organization of an ophiovirus associated with lettuce big-vein disease

    The complete nucleotide sequence of an ophiovirus associated with lettuce big-vein disease has been elucidated. The genome consisted of four RNA molecules of approximately 7.8, 1.7, 1.5 and 1.4 kb. Virus particles were shown to contain nearly equimolar amounts of RNA molecules of both polarities. The 5'- and 3'-terminal ends of the RNA molecules are largely, but not perfectly, complementary to each other. The virus genome contains seven open reading frames. Database searches with the putative viral products revealed homologies with the RNA-dependent RNA polymerases of rhabdoviruses and Ranunculus white mottle virus, and the capsid protein of Citrus psorosis virus. The gene encoding the viral polymerase appears to be located on RNA segment 1, while the nucleocapsid protein is encoded by RNA3. No significant sequence similarities were observed with other viral proteins. In spite of the morphological resemblance with species in the genus Tenuivirus, the ophioviruses appear not to be evolutionarily closely related to this genus or to any other viral genus.

    Evolution of the G+C content frontier in the rat cytomegalovirus genome

    Within the 230138 bp of the rat cytomegalovirus (RCMV) genome, the G+C content changes abruptly at position 142644, constituting a G+C content frontier. To the left of this point, overall G+C content is 69.2%, and to the right it is only 47.6%. A region of extremely low G+C content (33.8%) is found in the 5 kb immediately to the right of the frontier, in which there are no predicted coding sequences. To the right of position 147501, the G+C content rises and predicted coding sequences reappear. However, these genes are much shorter (average 848 bp, 50% G+C) than those in the left two-thirds of the genome (average 1462 bp, 70% G+C). Whole genome alignment of several viruses indicates that the initial ultra-low G+C region appeared in the common ancestor of the genera Cytomegalovirus and Muromegalovirus, and that the lowering of G+C in the right third has been a subsequent process in the lineage leading to RCMV. The left two-thirds of RCMV has stop codon occurrences at 67.5% of their expected level, based on a modified Markov chain model of stop codon distribution, and the corresponding figure for the right third is 78%. Therefore, despite heavy mutation pressure, selective constraint has operated in the right third of the RCMV genome to maintain a degree of gene length unusual for such low G+C sequences.

    Combining in silico prediction and ribosome profiling in a genome-wide search for novel putatively coding sORFs

    Background: It was long assumed that proteins are at least 100 amino acids (AAs) long. Moreover, the detection of short translation products (e.g. coded from small Open Reading Frames, sORFs) is very difficult, as the short length makes it hard to distinguish true coding ORFs from ORFs occurring by chance. Nevertheless, over the past few years many such non-canonical genes (with ORFs < 100 AAs) have been discovered in different organisms such as Arabidopsis thaliana, Saccharomyces cerevisiae, and Drosophila melanogaster. Thanks to advances in sequencing, bioinformatics and computing power, it is now possible to scan the genome with unprecedented scrutiny, for example in a search for this type of small ORF. Results: Using bioinformatics methods, we performed a systematic search for putatively functional sORFs in the Mus musculus genome. A genome-wide scan detected all sORFs, which were subsequently analyzed for their coding potential, based on evolutionary conservation at the AA level, and ranked using a Support Vector Machine (SVM) learning model. The ranked sORFs are finally overlapped with ribosome profiling data, hinting at sORF translation. All candidates are visually inspected using an in-house developed genome browser. In this way, dozens of highly conserved sORFs targeted by ribosomes were identified in the mouse genome, putatively encoding micropeptides. Conclusion: Our combined genome-wide approach leads to the prediction of a comprehensive but manageable set of putatively coding sORFs, a very important first step towards the identification of a new class of bioactive peptides, called micropeptides.
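
The genome-wide scan step described above can be pictured with a minimal sketch that lists ORFs shorter than 100 codons on one strand. Everything here is an assumption for illustration: the `find_sorfs` name, the ATG-only start rule, the standard stop codons, the forward-strand-only scan, and the lower length bound. The conservation scoring, SVM ranking, and ribosome profiling overlap from the abstract are not reproduced.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_sorfs(seq: str, min_codons: int = 10, max_codons: int = 100):
    """Return (start, end, frame) for short ORFs on the forward strand.

    An ORF runs from the first ATG seen in a frame to the next in-frame
    stop codon. `end` is exclusive and includes the stop codon. The length
    test counts codons without the stop, i.e. the number of amino acids.
    """
    seq = seq.upper()
    hits = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # open a candidate ORF at the first ATG
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if min_codons <= n_codons < max_codons:
                    hits.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return sorted(hits)


# Toy input: a 12-codon ORF (ATG + 11 x GCT) followed by a TAA stop.
toy = "CC" + "ATG" + "GCT" * 11 + "TAA" + "G"
sorfs = find_sorfs(toy)
```

A fuller scan would also cover the reverse complement and alternative start codons, both of which are left out here to keep the sketch short.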

    Ribosome signatures aid bacterial translation initiation site identification

    Background: While methods for annotation of genes are increasingly reliable, the exact identification of translation initiation sites remains a challenging problem. Since the N-termini of proteins often contain regulatory and targeting information, developing a robust method for start site identification is crucial. Ribosome profiling reads show distinct patterns of read length distributions around translation initiation sites. These patterns are typically lost in standard ribosome profiling analysis pipelines, when reads from footprints are adjusted to determine the specific codon being translated. Results: Utilising these signatures in combination with nucleotide sequence information, we build a model capable of predicting translation initiation sites and demonstrate its high accuracy using N-terminal proteomics. Applying this to prokaryotic translatomes, we re-annotate translation initiation sites and provide evidence of N-terminal truncations and extensions of previously annotated coding sequences. These re-annotations are supported by the presence of structural and sequence-based features next to N-terminal peptide evidence. Finally, our model identifies 61 novel genes previously undiscovered in the Salmonella enterica genome. Conclusions: Signatures within ribosome profiling read length distributions can be used in combination with nucleotide sequence information to provide accurate genome-wide identification of translation initiation sites.

    Environmental shaping of codon usage and functional adaptation across microbial communities.

    Microbial communities represent the largest portion of the Earth's biomass. Metagenomics projects use high-throughput sequencing to survey these communities and shed light on genetic capabilities that enable microbes to inhabit every corner of the biosphere. Metagenome studies are generally based on (i) classifying and ranking functions of identified genes; and (ii) estimating the phyletic distribution of constituent microbial species. To understand microbial communities at the systems level, it is necessary to extend these studies beyond the species' boundaries and capture higher levels of metabolic complexity. We evaluated 11 metagenome samples and demonstrated that microbes inhabiting the same ecological niche share common preferences for synonymous codons, regardless of their phylogeny. By exploring concepts of translational optimization through codon usage adaptation, we demonstrated that community-wide bias in codon usage can be used as a prediction tool for lifestyle-specific genes across the entire microbial community, effectively considering microbial communities as meta-genomes. These findings set up a 'functional metagenomics' platform for the identification of genes relevant for adaptations of entire microbial communities to environments. Our results provide valuable arguments in defining the concept of microbial species through the context of their interactions within the community.
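
To make "preferences for synonymous codons" concrete, the sketch below computes, for a single coding sequence, the share each codon takes among the codons observed for the same amino acid. It assumes the standard genetic code (NCBI translation table 1) and an in-frame input; the function name and toy sequence are hypothetical, and the community-level statistics used in the study are not reproduced.

```python
from collections import Counter, defaultdict

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def synonymous_codon_fractions(cds: str) -> dict[str, float]:
    """Return codon -> fraction of its amino acid's occurrences in `cds`.

    The sequence is read in frame from position 0. Codons containing
    characters other than A, C, G, T are ignored.
    """
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    per_aa_total = defaultdict(int)
    for codon, n in counts.items():
        if codon in CODON_TABLE:
            per_aa_total[CODON_TABLE[codon]] += n
    return {
        codon: n / per_aa_total[CODON_TABLE[codon]]
        for codon, n in counts.items()
        if codon in CODON_TABLE
    }


# Toy input: alanine encoded twice by GCT and once by GCC, lysine once by AAA.
fractions = synonymous_codon_fractions("GCTGCTGCCAAA")
```

Comparing such per-codon fractions between genes or samples is one simple way to look for shared codon preferences; pooling counts over many genes before normalising would reduce noise from short sequences.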

    Expansion of tandem repeats in sea anemone Nematostella vectensis proteome: A source for gene novelty?

    Background: The complete proteome of the starlet sea anemone, Nematostella vectensis, provides insights into gene invention dating back to the Cnidarian-Bilaterian ancestor. With the addition of the complete proteomes of Hydra magnipapillata and Monosiga brevicollis, the investigation of proteins having unique features in early metazoan life has become practical. We focused on the properties and the evolutionary trends of tandem repeat (TR) sequences in Cnidaria proteomes. Results: We found that 11-16% of N. vectensis proteins contain tandem repeats. Most TRs cover 150 amino acid segments that are comprised of basic units of 5-20 amino acids. In total, the N. vectensis proteome has about 3300 unique TR-units, but only a small fraction of them are shared with H. magnipapillata, M. brevicollis, or mammalian proteomes. The overall abundance of these TRs stands out relative to that of 14 proteomes representing the diversity among eukaryotes and within the metazoan world. TR-units are characterized by a unique composition of amino acids, with cysteine and histidine being over-represented. Structurally, most TR-segments are associated with coiled and disordered regions. Interestingly, 80% of the TR-segments can be read in more than one open reading frame. For over 100 of them, translation of the alternative frames would result in long proteins. Most domain families that are characterized as repeats in eukaryotes are found in the TR-proteomes from Nematostella and Hydra. Conclusions: While most TR-proteins have originated from prediction tools and are still awaiting experimental validations, supportive evidence exists for hundreds of TR-units in Nematostella. The existence of TR-proteins in early metazoan life may have served as a robust mode for novel genes with previously overlooked structural and functional characteristics.

    Bioinformatic Analyses of Unique (Orphan) Core Genes of the Genus Acidithiobacillus: Functional Inferences and Use As Molecular Probes for Genomic and Metagenomic/Transcriptomic Interrogation

    Indexed in: Web of Science. Using phylogenomic and gene compositional analyses, five highly conserved gene families have been detected in the core genome of the phylogenetically coherent genus Acidithiobacillus of the class Acidithiobacillia. These core gene families are absent in the closest extant genus Thermithiobacillus tepidarius that subtends the Acidithiobacillus genus and roots the deepest in this class. The predicted proteins encoded by these core gene families are not detected by a BLAST search in the NCBI non-redundant database of more than 90 million proteins using a relaxed cut-off of 1.0e-5. None of the five families has a clear functional prediction. However, bioinformatic scrutiny, using pI prediction, motif/domain searches, cellular location predictions, genomic context analyses, and chromosome topology studies, together with previously published transcriptomic and proteomic data, suggests that some may have functions associated with membrane remodeling during cell division, perhaps in response to pH stress. Despite the high level of amino acid sequence conservation within each family, there is sufficient nucleotide variation of the respective genes to permit the use of the DNA sequences to distinguish different species of Acidithiobacillus, making them useful additions to the armamentarium of tools for phylogenetic analysis. Since the protein families are unique to the Acidithiobacillus genus, they can also be leveraged as probes to detect the genus in environmental metagenomes and metatranscriptomes, including industrial biomining operations and acid mine drainage (AMD). http://journal.frontiersin.org/article/10.3389/fmicb.2016.02035/ful