    Ribosome signatures aid bacterial translation initiation site identification

    Background: While methods for annotation of genes are increasingly reliable, the exact identification of translation initiation sites remains a challenging problem. Since the N-termini of proteins often contain regulatory and targeting information, developing a robust method for start site identification is crucial. Ribosome profiling reads show distinct patterns of read length distributions around translation initiation sites. These patterns are typically lost in standard ribosome profiling analysis pipelines when reads from footprints are adjusted to determine the specific codon being translated. Results: Utilising these signatures in combination with nucleotide sequence information, we build a model capable of predicting translation initiation sites and demonstrate its high accuracy using N-terminal proteomics. Applying this model to prokaryotic translatomes, we re-annotate translation initiation sites and provide evidence of N-terminal truncations and extensions of previously annotated coding sequences. These re-annotations are supported by the presence of structural and sequence-based features alongside N-terminal peptide evidence. Finally, our model identifies 61 novel genes previously undiscovered in the Salmonella enterica genome. Conclusions: Signatures within ribosome profiling read length distributions can be used in combination with nucleotide sequence information to provide accurate genome-wide identification of translation initiation sites.
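
    As a rough illustration of the kind of signature described above, the following Python sketch tabulates read counts by offset from a candidate start codon and by read length, keeping the length information that codon-level adjustment would discard. The function name `length_profile`, the window, the length range and the (5' position, length) read representation are assumptions made for illustration; the abstract does not specify the features or the model actually used.

```python
from collections import Counter

def length_profile(reads, start, window=(-20, 20), lengths=range(20, 41)):
    """Build a flat feature vector of read counts around a candidate start.

    reads   : iterable of (five_prime_position, read_length) tuples,
              assumed to lie on the forward strand (hypothetical format).
    start   : coordinate of the first base of the candidate start codon.
    window  : inclusive (lowest, highest) offset of the 5' end from start.
    lengths : read lengths to keep as separate feature columns.
    """
    lo, hi = window
    counts = Counter()
    for five_prime, length in reads:
        offset = five_prime - start
        if lo <= offset <= hi and length in lengths:
            counts[(offset, length)] += 1
    # Fixed ordering: offsets vary slowest, read lengths vary fastest.
    return [counts[(o, l)] for o in range(lo, hi + 1) for l in lengths]
```

    A vector of this form could be concatenated with sequence-derived features before being passed to a classifier; which classifier to use is left open here.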

    NATsDB: Natural Antisense Transcripts DataBase

    Natural antisense transcripts (NATs) are reverse complementary, at least in part, to the sequences of other endogenous sense transcripts. Most NATs are transcribed from the opposite strand of their sense partners. They regulate sense genes at multiple levels and are implicated in various diseases. Using an improved whole-genome computational pipeline, we identified abundant cis-encoded exon-overlapping sense–antisense (SA) gene pairs in human (7356), mouse (6806), fly (1554), and eight other eukaryotic species (total 6534). We developed NATsDB (Natural Antisense Transcripts DataBase) to enable efficient browsing, searching and downloading of what is currently the most comprehensive collection of SA genes, grouped into six classes based on their overlapping patterns. NATsDB also includes non-exon-overlapping bidirectional (NOB) genes and non-bidirectional (NBD) genes. To facilitate the study of functions, regulation and possible pathological implications, NATsDB includes extensive information about gene structures, poly(A) signals and tails, phastCons conservation, homologues in other species, repeat elements, expressed sequence tag (EST) expression profiles and OMIM disease association. NATsDB supports interactive graphical display of the alignment of all supporting EST and mRNA transcripts of the SA and NOB genes to the genomic loci. It supports advanced search by species, gene name, sequence accession number, chromosome location, coding potential, OMIM association and sequence similarity.
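
    The notion of a cis-encoded exon-overlapping SA pair can be sketched in a few lines of Python: two genes on the same chromosome, on opposite strands, with at least one pair of overlapping exons. This is only a minimal illustration, not the NATsDB pipeline; the gene record layout (`name`, `chrom`, `strand`, `exons`) and the function names are hypothetical, and coordinates are assumed to be inclusive.

```python
from itertools import combinations

def exons_overlap(exons_a, exons_b):
    """Return True if any exon in exons_a overlaps any exon in exons_b.

    Exons are (start, end) tuples with inclusive coordinates.
    """
    return any(a_start <= b_end and b_start <= a_end
               for a_start, a_end in exons_a
               for b_start, b_end in exons_b)

def find_sa_pairs(genes):
    """List name pairs of genes on opposite strands of the same chromosome
    whose exons overlap (a simplified sense-antisense criterion)."""
    pairs = []
    for g1, g2 in combinations(genes, 2):
        same_locus = g1["chrom"] == g2["chrom"]
        opposite = g1["strand"] != g2["strand"]
        if same_locus and opposite and exons_overlap(g1["exons"], g2["exons"]):
            pairs.append((g1["name"], g2["name"]))
    return pairs
```

    The all-pairs comparison is quadratic in the number of genes; a genome-scale version would more plausibly sort genes by position or use an interval index.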

    JProGO: a novel tool for the functional interpretation of prokaryotic microarray data using Gene Ontology information

    A novel program suite was implemented for the functional interpretation of high-throughput gene expression data based on the identification of Gene Ontology (GO) nodes. The analysis focuses on the interpretation of microarray data from prokaryotes. Three well-established statistical methods were employed to identify groups of genes with a significantly altered expression profile: the threshold value-based Fisher's exact test, and the threshold value-independent Kolmogorov–Smirnov test and Student's t-test. In addition, we apply the rank-based unpaired Wilcoxon's test to GO-based microarray data interpretation. Further features of the program include recognition of alternative gene names and correction for multiple testing. The results are visualized interactively both as a table and as a GO subgraph including all significant nodes. Currently, JProGO enables the analysis of microarray data from more than 20 different prokaryotic species, including all important model organisms, and thus constitutes a useful web service for the microbial research community. JProGO is freely accessible via the web.
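
    To make the threshold-based approach concrete, the sketch below computes a one-sided Fisher's exact test p-value for the over-representation of a GO node among genes that pass an expression threshold, followed by a Bonferroni adjustment. It is an illustration under stated assumptions rather than JProGO's implementation: the abstract does not say which multiple-testing correction is used, so Bonferroni is an assumed choice, and the function names are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    N : total number of genes on the array
    K : genes annotated to the GO node
    n : genes passing the expression threshold
    k : genes passing the threshold that are also in the GO node
    Returns P(X >= k) under random sampling without replacement.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

def bonferroni(p_values):
    """Bonferroni adjustment (assumed here; capped at 1.0)."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]
```

    In practice one p-value would be computed per GO node, and the adjusted list would then be filtered at the chosen significance level.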

    Taking into account nucleosomes for predicting gene expression

    The eukaryotic genome is organized in a chain of nucleosomes that consist of 145-147 bp of DNA wrapped around a histone octamer protein core. Binding of transcription factors (TFs) to nucleosomal DNA is frequently impeded, which makes it a challenging task to calculate TF occupancy at a given regulatory genomic site for predicting gene expression. Here, we review methods to calculate TF binding to DNA in the presence of nucleosomes. The main theoretical problems are (i) the computation speed, which becomes a bottleneck when partial unwrapping of DNA from the nucleosome is considered, (ii) the perturbation of the binding equilibrium by the activity of ATP-dependent chromatin remodelers, which translocate nucleosomes along the DNA, and (iii) the model parameterization from high-throughput sequencing data and fluorescence microscopy experiments in living cells. We discuss strategies that address these issues to efficiently compute transcription factor binding in chromatin. © 2013 Elsevier Inc.
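
    The simplest version of the competition described above can be written as a three-state equilibrium for a single site that is either free, bound by the TF, or covered by a nucleosome. The sketch below is a deliberately reduced toy model: it ignores partial unwrapping, remodeler activity and neighbouring sites, all of which the review identifies as the hard parts, and the parameter names are assumptions made for illustration.

```python
def tf_occupancy(k_tf, c_tf, k_nuc, c_nuc):
    """Equilibrium probability that a single site is bound by the TF.

    The site has three mutually exclusive states with statistical weights
      free: 1, TF-bound: k_tf * c_tf, nucleosome-covered: k_nuc * c_nuc
    where k_* are binding constants and c_* are free concentrations
    (hypothetical parameters, in mutually consistent units).
    """
    w_tf = k_tf * c_tf
    w_nuc = k_nuc * c_nuc
    return w_tf / (1.0 + w_tf + w_nuc)
```

    Setting `k_nuc` or `c_nuc` to zero recovers the ordinary single-ligand binding curve, which gives a quick check that the nucleosome term only ever lowers TF occupancy.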

    An Integrative Method for Accurate Comparative Genome Mapping

    We present MAGIC, an integrative and accurate method for comparative genome mapping. Our method consists of two phases: preprocessing for identifying “maximal similar segments,” and mapping for clustering and classifying these segments. MAGIC's main novelty lies in its biologically intuitive clustering approach, which aims both to compute reorder-free segments and to identify orthologous segments. In the process, MAGIC efficiently handles ambiguities resulting from duplications that occurred before the speciation of the considered organisms from their most recent common ancestor. We demonstrate both MAGIC's robustness and its scalability: the former is assessed with respect to its initial input and its parameter values, the latter by applying MAGIC to distantly related organisms and to large genomes. We compare MAGIC to other comparative mapping methods and provide a detailed analysis of the differences between them. Our improvements allow a comprehensive study of the diversity of genetic repertoires resulting from large-scale mutations, such as indels and duplications, explicitly including transposable and phage elements. The strength of our method is demonstrated by detailed statistics computed for each type of these large-scale mutations. MAGIC enabled us to conduct a comprehensive analysis of the different forces shaping prokaryotic genomes from different clades, and to quantify the importance of novel gene content introduced by horizontal gene transfer relative to gene duplication in bacterial genome evolution. We use these results to investigate the breakpoint distribution in several prokaryotic genomes.

    Mass Spectrometry in the Elucidation of the Glycoproteome of Bacterial Pathogens

    Presently, some three hundred post-translational modifications are known to occur in bacteria in vivo. Many of these modifications play critical roles in the regulation of proteins and control key biological processes. N- and O-glycosylation, among the most predominant of these modifications, are now known to be present in bacteria (and archaea), although they were long believed to be limited to eukaryotes. In a number of human pathogens, these glycans have been found attached to the surfaces of pilin, flagellin and other surface and secreted proteins, where they have been shown to play a role in the virulence of these bacteria. Mass spectrometry characterization of these glycosylation events has been the key enabling technology for these findings. This review looks at the use of mass spectrometry for the detection and mapping of these modifications within microorganisms, with particular reference to the human pathogens Campylobacter jejuni and Mycobacterium tuberculosis. The overall aim of this review is to give a basic understanding of the current ‘state-of-the-art’ of the key techniques, principles and technologies, including bioinformatics tools, involved in the analysis of glycosylation modifications.

    Bacillus anthracis genome organization in light of whole transcriptome sequencing

    Emerging knowledge of whole prokaryotic transcriptomes could validate a number of theoretical concepts introduced in the early days of genomics. What are the rules connecting gene expression levels with sequence determinants such as quantitative scores of promoters and terminators? Are translation efficiency measures, e.g. codon adaptation index and RBS score, related to gene expression? We used whole transcriptome shotgun sequencing of the bacterial pathogen Bacillus anthracis to assess the correlation of gene expression level with promoter, terminator and RBS scores, codon adaptation index, as well as with a new measure of gene translational efficiency, average translation speed. We compared computational predictions of operon topologies with the transcript borders inferred from RNA-Seq reads. Transcriptome mapping may also improve existing gene annotation. Upon assessing the accuracy of the current annotation of protein-coding genes in the B. anthracis genome, we have shown that the transcriptome data indicate the existence of more than a hundred genes that are missing from the annotation though predicted by an ab initio gene finder. Interestingly, we observed that many pseudogenes possess not only a sequence with detectable coding potential but also promoters that maintain transcriptional activity.
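
    Of the measures listed above, the codon adaptation index is the easiest to sketch: it is the geometric mean of per-codon relative adaptiveness weights along a coding sequence. The Python example below assumes the weights have already been derived from a reference set of highly expressed genes and are supplied as a dictionary; the function name and the choice to skip codons without a weight are assumptions for illustration, not details taken from the study.

```python
import math

def codon_adaptation_index(sequence, weights):
    """Geometric mean of relative adaptiveness weights over a coding sequence.

    sequence : coding DNA string, read in frame from its first base
    weights  : dict mapping codon -> relative adaptiveness in (0, 1]
               (assumed to be precomputed from a reference gene set)
    Codons absent from `weights` are skipped; returns NaN if none remain.
    """
    usable = len(sequence) - len(sequence) % 3  # drop a trailing partial codon
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        return float("nan")
    return math.exp(sum(logs) / len(logs))
```

    Per-gene values computed this way could then be compared with expression levels inferred from RNA-Seq read counts, for example through a rank correlation.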