54 research outputs found

    Identification of putative regulatory upstream ORFs in the yeast genome using heuristics and evolutionary conservation.

    BACKGROUND: The translational efficiency of an mRNA can be modulated by upstream open reading frames (uORFs) present in certain genes. A uORF can attenuate translation of the main ORF by interfering with translational reinitiation at the main start codon. uORFs also occur by chance in the genome, in which case they do not have a regulatory role. Since the sequence determinants for functional uORFs are not understood, it is difficult to discriminate functional from spurious uORFs by sequence analysis. RESULTS: We have used comparative genomics to identify novel uORFs in yeast with a high likelihood of having a translational regulatory role. We examined uORFs, previously shown to play a role in regulation of translation in Saccharomyces cerevisiae, for evolutionary conservation within seven Saccharomyces species. Inspection of the set of conserved uORFs yielded the following three characteristics useful for discrimination of functional from spurious uORFs: a length between 4 and 6 codons, a distance from the start of the main ORF between 50 and 150 nucleotides, and finally a lack of overlap with, and clear separation from, neighbouring uORFs. These derived rules are inherently associated with uORFs with properties similar to the GCN4 locus, and may not detect most uORFs of other types. uORFs with high scores based on these rules showed a much higher evolutionary conservation than randomly selected uORFs. In a genome-wide scan in S. cerevisiae, we found 34 conserved uORFs from 32 genes that we predict to be functional; subsequent analysis showed the majority of these to be located within transcripts. A total of 252 genes were found containing conserved uORFs with properties indicative of a functional role; all but 7 are novel. Functional content analysis of this set identified an overrepresentation of genes involved in transcriptional control and development. 
CONCLUSION: Evolutionary conservation of uORFs in yeasts can be traced up to 100 million years of separation. The conserved uORFs have certain characteristics with respect to length, distance from each other and from the main start codon, and folding energy of the sequence. These newly found characteristics can be used to facilitate detection of other conserved uORFs. RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license which is similar to the 'Creative Commons Attribution Licence'. In brief you may: copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.
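The three selection rules listed in the abstract above (length, distance to the main start codon, and separation from neighbouring uORFs) can be illustrated with a minimal sketch. Several details are assumptions made for illustration only and are not taken from the paper: length is counted in sense codons including the start codon and excluding the stop, distance is measured from the end of the uORF stop codon to the main start codon, and "clear separation" is reduced to a plain non-overlap test. The names `find_uorfs` and `passes_rules` are hypothetical.

```python
from typing import List, NamedTuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


class UORF(NamedTuple):
    start: int   # 0-based index of the A of the uORF ATG within the leader
    end: int     # index one past the last base of the stop codon
    codons: int  # sense codons: start codon included, stop codon excluded


def find_uorfs(leader: str) -> List[UORF]:
    """Return every ATG-initiated ORF that terminates inside the leader.

    `leader` is the transcript sequence upstream of the main start codon.
    """
    leader = leader.upper()
    found = []
    for i in range(len(leader) - 2):
        if leader[i:i + 3] != "ATG":
            continue
        # Walk in frame from the codon after the ATG until the first stop.
        for j in range(i + 3, len(leader) - 2, 3):
            if leader[j:j + 3] in STOP_CODONS:
                found.append(UORF(i, j + 3, (j - i) // 3))
                break
    return found


def passes_rules(u: UORF, others: List[UORF], leader_len: int) -> bool:
    """Apply the three heuristics under the assumptions stated in the lead-in."""
    distance = leader_len - u.end  # assumed definition of distance to main ATG
    isolated = all(o == u or o.end <= u.start or o.start >= u.end
                   for o in others)
    return 4 <= u.codons <= 6 and 50 <= distance <= 150 and isolated


if __name__ == "__main__":
    leader = "CC" + "ATG" + "GCT" * 4 + "TAA" + "C" * 60
    uorfs = find_uorfs(leader)
    for u in uorfs:
        print(u, passes_rules(u, uorfs, len(leader)))
```

A conservation filter across related species, which is central to the study, is not modelled here.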

    Automated group assignment in large phylogenetic trees using GRUNT: GRouping, Ungrouping, Naming Tool

    Background: Accurate taxonomy is best maintained if species are arranged as hierarchical groups in phylogenetic trees. This is especially important as trees grow larger as a consequence of a rapidly expanding sequence database. Hierarchical group names are typically manually assigned in trees, an approach that becomes unfeasible for very large topologies. Results: We have developed an automated iterative procedure for delineating stable (monophyletic) hierarchical groups in large (or small) trees and naming those groups according to a set of sequentially applied rules. In addition, we have created an associated ungrouping tool for removing existing groups that do not meet user-defined criteria (such as monophyly). The procedure is implemented in a program called GRUNT (GRouping, Ungrouping, Naming Tool) and has been applied to the current release of the Greengenes (Hugenholtz) 16S rRNA gene taxonomy comprising more than 130,000 taxa. Conclusion: GRUNT will assist researchers requiring comprehensive hierarchical grouping of large tree topologies in, for example, database curation, microarray design and pangenome assignments. The application is available at the Greengenes website [1].
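The monophyly criterion mentioned in the abstract above can be illustrated with a small sketch. This is not GRUNT's implementation: it assumes a rooted tree written as nested tuples of leaf names, and the helper names `leaf_sets` and `is_monophyletic` are hypothetical. A group is treated as monophyletic when some internal node (or leaf) has exactly that set of leaves beneath it.

```python
from typing import FrozenSet, Iterable, List, Tuple, Union

# A rooted tree is either a leaf name or a tuple of subtrees (assumed encoding).
Tree = Union[str, Tuple["Tree", ...]]


def leaf_sets(tree: Tree, out: List[FrozenSet[str]]) -> FrozenSet[str]:
    """Append the leaf set of every node to `out` (post-order).

    Returns the leaf set of the node passed in.
    """
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def is_monophyletic(tree: Tree, group: Iterable[str]) -> bool:
    """True if some node of the rooted tree spans exactly the given leaves."""
    clades: List[FrozenSet[str]] = []
    leaf_sets(tree, clades)
    return frozenset(group) in clades


if __name__ == "__main__":
    tree = ((("a", "b"), "c"), ("d", "e"))
    print(is_monophyletic(tree, {"a", "b"}))  # a and b form a clade
    print(is_monophyletic(tree, {"a", "c"}))  # a and c do not
```

An ungrouping pass in this spirit would drop any named group for which the check fails; the real tool's rules for naming and iteration are not reproduced here.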

    Estimating DNA coverage and abundance in metagenomes using a gamma approximation

    Motivation: Shotgun sequencing generates large numbers of short DNA reads from either an isolated organism or, in the case of metagenomics projects, from the aggregate genome of a microbial community. These reads are then assembled based on overlapping sequences into larger, contiguous sequences (contigs). The feasibility of assembly and the coverage achieved (reads per nucleotide or distinct sequence of nucleotides) depend on several factors: the number of reads sequenced, the read length and the relative abundances of their source genomes in the microbial community. A low coverage suggests that most of the genomic DNA in the sample has not been sequenced, but it is often difficult to estimate either the extent of the uncaptured diversity or the amount of additional sequencing that would be most efficacious. In this work, we regard a metagenome as a population of DNA fragments (bins), each of which may be covered by one or more reads. We employ a gamma distribution to model this bin population due to its flexibility and ease of use. When a gamma approximation can be found that adequately fits the data, we may estimate the number of bins that were not sequenced and that could potentially be revealed by additional sequencing. We evaluated the performance of this model using simulated metagenomes and demonstrate its applicability on three recent metagenomic datasets
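To make the idea of estimating unsequenced bins concrete, here is a minimal sketch under an assumption that is not stated in the abstract: the expected number of reads per bin follows a gamma distribution with shape k and scale θ, and the observed read count of a bin is Poisson given that expectation. Under this gamma-Poisson mixture the probability that a bin receives no reads is (1 + θ)^(-k). The function names and the example parameter values are hypothetical, and fitting k and θ to real data is left out.

```python
def zero_read_fraction(shape: float, scale: float) -> float:
    """P(bin receives zero reads) under a gamma-Poisson mixture.

    E[exp(-lam)] for lam ~ Gamma(shape, scale) equals (1 + scale) ** -shape.
    """
    return (1.0 + scale) ** (-shape)


def estimate_unseen_bins(observed_bins: int, shape: float, scale: float) -> float:
    """Estimate how many bins were missed, given the number that were seen.

    observed_bins = total_bins * (1 - p0), so total_bins = observed / (1 - p0).
    """
    p0 = zero_read_fraction(shape, scale)
    total_bins = observed_bins / (1.0 - p0)
    return total_bins - observed_bins


if __name__ == "__main__":
    # Illustrative values only: shape 1 and scale 1 give p0 = 0.5.
    print(estimate_unseen_bins(observed_bins=100, shape=1.0, scale=1.0))
```

The estimate is only as good as the fitted gamma parameters, which matches the abstract's caveat that an adequate gamma approximation must first be found.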

    ERBB3 is a marker of a ganglioneuroblastoma/ganglioneuroma-like expression profile in neuroblastic tumours

    Background: Neuroblastoma (NB) tumours are commonly divided into three cytogenetic subgroups. However, by unsupervised principal components analysis of gene expression profiles we recently identified four distinct subgroups, r1-r4. In the current study we characterized these different subgroups in more detail, with a specific focus on the fourth divergent tumour subgroup (r4). Methods: Expression microarray data from four international studies corresponding to 148 neuroblastic tumour cases were subject to division into four expression subgroups using a previously described 6-gene signature. Differentially expressed genes between groups were identified using Significance Analysis of Microarray (SAM). Next, gene expression network modelling was performed to map signalling pathways and cellular processes representing each subgroup. Findings were validated at the protein level by immunohistochemistry and immunoblot analyses. Results: We identified several significantly up-regulated genes in the r4 subgroup of which the tyrosine kinase receptor ERBB3 was most prominent (fold change: 132–240). By gene set enrichment analysis (GSEA) the constructed gene network of ERBB3 (n = 38 network partners) was significantly enriched in the r4 subgroup in all four independent data sets. ERBB3 was also positively correlated to the ErbB family members EGFR and ERBB2 in all data sets, and a concurrent overexpression was seen in the r4 subgroup. Further studies of histopathology categories using a fifth data set of 110 neuroblastic tumours, showed a striking similarity between the expression profile of r4 to ganglioneuroblastoma (GNB) and ganglioneuroma (GN) tumours. In contrast, the NB histopathological subtype was dominated by mitotic regulating genes, characterizing unfavourable NB subgroups in particular. The high ErbB3 expression in GN tumour types was verified at the protein level, and showed mainly expression in the mature ganglion cells.
Conclusions: This study demonstrates the importance of performing unsupervised clustering and subtype discovery of data sets prior to analyses to avoid a mixture of tumour subtypes, which may otherwise give distorted results and lead to incorrect conclusions. The current study identifies ERBB3 as a clear-cut marker of a GNB/GN-like expression profile, and we suggest a 7-gene expression signature (including ERBB3) as a complement to histopathology analysis of neuroblastic tumours. Further studies of ErbB3 and other ErbB family members and their role in neuroblastic differentiation and pathogenesis are warranted.

    Bridging the gap between systems biology and medicine

    Systems biology has matured considerably as a discipline over the last decade, yet some of the key challenges separating current research efforts in systems biology and clinically useful results are only now becoming apparent. As these gaps are better defined, the new discipline of systems medicine is emerging as a translational extension of systems biology. How is systems medicine defined? What are relevant ontologies for systems medicine? What are the key theoretic and methodologic challenges facing computational disease modeling? How are inaccurate and incomplete data, and uncertain biologic knowledge best synthesized in useful computational models? Does network analysis provide clinically useful insight? We discuss the outstanding difficulties in translating a rapidly growing body of data into knowledge usable at the bedside. Although core-specific challenges are best met by specialized groups, it appears fundamental that such efforts should be guided by a roadmap for systems medicine drafted by a coalition of scientists from the clinical, experimental, computational, and theoretic domains

    Integration of phenotypic metadata and protein similarity in Archaea using a spectral bipartitioning approach

    In order to simplify and meaningfully categorize large sets of protein sequence data, it is commonplace to cluster proteins based on the similarity of those sequences. However, it quickly becomes clear that the sequence flexibility allowed a given protein varies significantly among different protein families. The degree to which sequences are conserved not only differs for each protein family, but also is affected by the phylogenetic divergence of the source organisms. Clustering techniques that use similarity thresholds for protein families do not always allow for these variations and thus cannot be confidently used for applications such as automated annotation and phylogenetic profiling. In this work, we applied a spectral bipartitioning technique to all proteins from 53 archaeal genomes. Comparisons between different taxonomic levels allowed us to study the effects of phylogenetic distances on cluster structure. Likewise, by associating functional annotations and phenotypic metadata with each protein, we could compare our protein similarity clusters with both protein function and associated phenotype. Our clusters can be analyzed graphically and interactively online

    Inferring evolution in bacteria using Markov chains and genomic signatures

    This thesis concerns the development of methods and models in evolutionary molecular biology. The techniques are also applicable to other similar biological problems. The first contribution is a novel classifier using fixed and variable length Markov chains that can discriminate between bacterial DNA of different species. The classifier assumes that the composition of oligomers, DNA words, is species-specific and represents global features of the species, a so-called genomic signature. The direct applications of such a classifier are identification of horizontal gene transfer and binning of metagenomic data. The former has been the primary focus as it is one of the central processes in the evolution of bacteria. We suggest a new method for locking the number of parameters in a variable length Markov model and propose a method for rejecting false candidates of horizontal gene transfer events. The second contribution is a novel estimator for finding the prediction suffix tree of a variable length Markov chain. This new estimator is highly efficient in finding the correct state-space and we show that it compares favorably to a popular estimator in terms of the predictive likelihood. The third contribution is to the analysis of gene order rearrangements in bacteria. We recapitulate previous results on expected distances and derive new ones for cases that have recently gained support in the literature, such as symmetrical and short reversals. We also describe new categories of gene order patterns and show how these can be explained with models using short, symmetric and uniformly distributed transpositions and reversals. The fourth contribution is a part of the Greengenes project, which is a chimera-free database of 16S rDNA sequences.
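The genomic-signature idea described above can be sketched with a fixed-order Markov chain classifier. This is an illustration only, not the thesis's classifier: it covers the fixed-order case and omits variable length chains and prediction suffix trees. The order, the additive pseudocount, and the function names `train`, `log_likelihood` and `classify` are all assumptions made for the example.

```python
import math
from collections import defaultdict
from typing import Callable, Dict

ALPHABET = "ACGT"
Model = Callable[[str, str], float]  # (context, next base) -> log probability


def train(seq: str, order: int = 2, pseudo: float = 1.0) -> Model:
    """Estimate P(base | previous `order` bases) with additive smoothing."""
    counts: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1.0

    def log_prob(context: str, base: str) -> float:
        seen = counts.get(context, {})
        total = sum(seen.values()) + pseudo * len(ALPHABET)
        return math.log((seen.get(base, 0.0) + pseudo) / total)

    return log_prob


def log_likelihood(seq: str, model: Model, order: int = 2) -> float:
    """Sum of log transition probabilities along the sequence."""
    return sum(model(seq[i - order:i], seq[i]) for i in range(order, len(seq)))


def classify(seq: str, models: Dict[str, Model], order: int = 2) -> str:
    """Return the label of the model under which the fragment is most likely."""
    return max(models, key=lambda name: log_likelihood(seq, models[name], order))


if __name__ == "__main__":
    # Toy "genomes" with very different oligomer composition.
    models = {"speciesA": train("AT" * 200), "speciesB": train("GC" * 200)}
    print(classify("ATATATATAT", models))
```

In a horizontal gene transfer setting, a fragment that scores markedly better under another species' model than under its host's model would be a candidate for further testing.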

    Expected Gene Order Distances and Model Selection in Bacteria

    The most parsimonious distances calculated in pairwise gene order comparisons cannot accurately reflect the true number of events separating two species unless the number of changes is small. It is better to use expected distances. In this study we recapitulate previous results and derive new expected distances for models that have gained support in other studies, such as symmetrical reversal distances and short reversals. Further, we investigate the patterns of dotplots between species of bacteria with the purpose of model selection in gene order problems. We find several categories of data which can be explained by carefully weighing the contributions of reversals, transpositions, symmetric reversals, single gene transpositions, and single gene reversals.
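Why a parsimony-style distance saturates can be seen in a small simulation. The sketch below is not the study's model: it uses breakpoint counts as a simple proxy for rearrangement distance, draws reversals with uniformly random endpoints, and works on a linear (not circular) signed gene order. The function names and parameter values are hypothetical. Each reversal creates at most two breakpoints and a genome of n genes has at most n + 1, so the count stops growing once many reversals have been applied.

```python
import random
from typing import List


def apply_reversal(perm: List[int], i: int, j: int) -> List[int]:
    """Reverse perm[i..j] inclusive and flip the orientation of each gene."""
    return perm[:i] + [-g for g in reversed(perm[i:j + 1])] + perm[j + 1:]


def breakpoints(perm: List[int]) -> int:
    """Count adjacencies that differ from the identity order 1..n.

    The order is framed with 0 and n + 1 so that the ends are also compared.
    """
    framed = [0] + perm + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


def simulate(n: int, k: int, seed: int = 0) -> int:
    """Apply k uniformly random reversals to the identity; return breakpoints."""
    rng = random.Random(seed)
    perm = list(range(1, n + 1))
    for _ in range(k):
        i = rng.randrange(n)
        j = rng.randrange(i, n)
        perm = apply_reversal(perm, i, j)
    return breakpoints(perm)


if __name__ == "__main__":
    for k in (1, 5, 20, 100):
        # Breakpoints are bounded by min(2k, n + 1), so they lag behind k.
        print(k, simulate(n=30, k=k))
```

An expected-distance estimator would invert this saturating relationship to recover a plausible number of events from the observed count.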