    Genome display tool: visualizing features in complex data sets

    BACKGROUND: The enormity of the information contained in large data sets makes it difficult to develop intuitive understanding. It would be useful to have software that allows visualization of possible correlations between properties that can be associated with a core data set. In the case of bacterial genomes, existing visualization tools focus on either global properties such as variations in composition or detailed local displays of the features that comprise the annotation. It is not easy to visualize other information in the context of this core information. RESULTS: The Genome Display Tool (GDT), a Java-based software package, allows the user to simultaneously view the distribution of multiple attributes pertaining to genes and intragenic regions in a single bacterial genome using different colours and shapes on a single screen. The display represents each gene by a small box whose placement correlates with its physical position in the genome. The size of the boxes is dynamically allocated based on the number of genes, and a zoom feature allows close-up inspection of regions of interest. The display is interfaced with an MS-Access relational database and can display any feature in the database that can be represented by discrete values. Data is readily added to the database from an MS-Excel spreadsheet. The functionality of GDT is demonstrated by comparing the results of two predictions of recent horizontal transfer events in the genome of Synechocystis PCC-6803. The resulting display allows the user to immediately see how much agreement exists between the two methods and also visualize how genes in various categories (e.g., predicted by both methods, by one method, etc.) are distributed in the genome. CONCLUSION: The GDT software provides the user with a powerful tool that allows development of an intuitive understanding of the relative distribution of features in a large data set.
As additional features are added to the data set, the number of possible correlations that can be visualized grows rapidly. Although described here for use in bacterial genomics, the principle is general and similar software might be useful in other contexts such as patient studies.
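
The two display ideas in this abstract, sizing gene boxes from the gene count and grouping genes by agreement between two predictions, can be sketched as follows. This is a minimal illustration in Python, not GDT's Java code: the function names, the square-box grid layout, and the category labels are all assumptions made for the example.

```python
import math


def box_side(n_genes: int, width: int, height: int) -> int:
    """Largest square box side (in pixels) such that n_genes boxes
    fit on a width x height canvas laid out row by row.
    Hypothetical layout rule, not taken from GDT itself."""
    if n_genes <= 0:
        raise ValueError("n_genes must be positive")
    # The box area times the gene count cannot exceed the canvas area,
    # so this is an upper bound on the side length.
    side = int(math.sqrt(width * height / n_genes))
    # Shrink until the whole boxes that fit per row and per column suffice.
    while side > 0 and (width // side) * (height // side) < n_genes:
        side -= 1
    if side == 0:
        raise ValueError("canvas too small for this many genes")
    return side


def box_position(index: int, side: int, width: int) -> tuple[int, int]:
    """Top-left (x, y) of the box for the gene at this genome-order index."""
    per_row = width // side
    row, col = divmod(index, per_row)
    return col * side, row * side


def agreement_category(gene: str, method_a: set[str], method_b: set[str]) -> str:
    """Discrete category a display could map to a colour or shape."""
    in_a, in_b = gene in method_a, gene in method_b
    if in_a and in_b:
        return "both"
    if in_a:
        return "method A only"
    if in_b:
        return "method B only"
    return "neither"
```

Zooming corresponds to calling `box_side` with a smaller gene count for the same canvas, which yields larger boxes.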

    Microbial identification by mass cataloging

    BACKGROUND: The public availability of over 180,000 bacterial 16S ribosomal RNA (rRNA) sequences has facilitated microbial identification and classification using hybridization and other molecular approaches. In their usual format, such assays are based on the presence of unique subsequences in the target RNA and require a priori knowledge of what organisms are likely to be in a sample. They are thus limited in generality when analyzing an unknown sample. Herein, we demonstrate the utility of catalogs of masses to characterize the bacterial 16S rRNA(s) in any sample. Sample nucleic acids are digested with a nuclease of known specificity and the products characterized using mass spectrometry. The resulting catalogs of masses can subsequently be compared to the masses known to occur in previously sequenced 16S rRNAs, allowing organism identification. Alternatively, if the organism is not in the existing database, it will still be possible to determine its genetic affinity relative to the known organisms. RESULTS: Ribonuclease T1 and ribonuclease A digestion patterns were calculated for 1,921 complete 16S rRNAs. Oligoribonucleotides generated by RNase T1 of length 9 and longer produce sufficient diversity of masses to be informative. In addition, individual fragments or combinations thereof can be used to recognize the presence of specific organisms in a complex sample. In this regard, 140 strains out of 1,921 organisms (7.3%) could be identified by the presence of a unique RNase T1-generated oligoribonucleotide mass. Combinations of just two and three oligoribonucleotide masses allowed 54% and 72% of the specific strains to be identified, respectively. An initial algorithm for recovering likely organisms present in complex samples is also described. CONCLUSION: The use of catalogs of compositions (masses) of characteristic oligoribonucleotides for microbial identification appears extremely promising.
RNase T1 is more useful than ribonuclease A in generating characteristic masses, though RNase A produces oligomers which are more readily distinguished due to the large mass difference between A and G. Identification of multiple species in mixtures is also feasible. Practical applicability of the method depends on high-performance mass spectrometric determination, and/or use of methods that increase the one dalton (Da) mass difference between uracil and cytosine.
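
The cataloging idea can be illustrated with a short in-silico digestion. The sketch below is only an approximation of what the abstract describes: RNase T1 is modelled as cutting after every G (RNase A can be modelled with `cut_after="CU"`), each fragment is reduced to its base composition, and the masses use rounded average nucleotide residue values with terminal groups ignored. All function names, the toy sequences, and the mass table are assumptions for illustration.

```python
from collections import Counter

# Approximate average residue masses in Da (assumption: rounded values,
# terminal phosphate/hydroxyl groups ignored). Note that U and C differ
# by about 1 Da, while A and G differ by about 16 Da.
RESIDUE_MASS = {"A": 329.2, "C": 305.2, "G": 345.2, "U": 306.2}


def digest(rna: str, cut_after: str = "G") -> list[str]:
    """Cut the RNA on the 3' side of every base listed in cut_after."""
    fragments, current = [], []
    for base in rna.upper():
        current.append(base)
        if base in cut_after:
            fragments.append("".join(current))
            current = []
    if current:  # trailing fragment that does not end in a cut base
        fragments.append("".join(current))
    return fragments


def composition(fragment: str) -> tuple[int, int, int, int]:
    """Counts of (A, C, G, U); fragments with equal composition share a mass."""
    counts = Counter(fragment)
    return tuple(counts.get(base, 0) for base in "ACGU")


def approx_mass(fragment: str) -> float:
    return round(sum(RESIDUE_MASS[base] for base in fragment), 1)


def catalog(rna: str, min_length: int = 9, cut_after: str = "G") -> set:
    """Set of compositions of all digestion products of at least min_length."""
    return {composition(f) for f in digest(rna, cut_after) if len(f) >= min_length}


def unique_signatures(catalogs: dict[str, set]) -> dict[str, set]:
    """For each organism, the compositions found in no other organism."""
    result = {}
    for name, own in catalogs.items():
        others = set().union(*(c for n, c in catalogs.items() if n != name))
        result[name] = own - others
    return result
```

For example, `digest("AUGCCGAU")` gives `["AUG", "CCG", "AU"]`, and `"ACUG"` and `"CAUG"` map to the same composition, which is the degeneracy that mass measurement cannot resolve.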

    RECOVIR Software for Identifying Viruses

    Most single-stranded RNA (ssRNA) viruses mutate rapidly to generate a large number of strains with highly divergent capsid sequences. Determining the capsid residues or nucleotides that uniquely characterize these strains is critical in understanding the strain diversity of these viruses. RECOVIR (an acronym for "recognize viruses") software predicts the strains of some ssRNA viruses from their limited sequence data. Novel phylogenetic-tree-based databases of protein or nucleic acid residues that uniquely characterize these virus strains are created. Strains of input virus sequences (partial or complete) are predicted through residue-wise comparisons with the databases. RECOVIR uses unique characterizing residues to automatically identify strains from partial or complete capsid sequences of picornaviruses and caliciviruses, two of the most highly diverse ssRNA virus families. Partition-wise comparisons of the database residues with the corresponding residues of more than 300 complete and partial sequences of these viruses resulted in correct strain identification for all of these sequences. This study shows the feasibility of creating databases of hitherto unknown residues uniquely characterizing the capsid sequences of two of the most highly divergent ssRNA virus families. These databases enable automated strain identification from partial or complete capsid sequences of these human and animal pathogens.
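
The residue-wise comparison can be pictured with a toy lookup. The sketch below is not RECOVIR's implementation: it assumes a flat, hypothetical database that maps each strain to its characterizing residues keyed by alignment position, and it ignores the phylogenetic-tree organization and partitioning mentioned in the abstract. Strain names, positions, and residues are invented for the example.

```python
from typing import Optional


def predict_strain(
    query: str,
    offset: int,
    database: dict[str, dict[int, str]],
) -> Optional[str]:
    """Return the strain whose characterizing residues all match the query.

    query    -- partial or complete sequence, already aligned
    offset   -- alignment position of the first residue of query
    database -- hypothetical {strain: {alignment position: residue}}
    Only positions covered by the query are compared. Among fully matching
    strains, the one with the most covered positions wins.
    """
    best, best_hits = None, 0
    end = offset + len(query)
    for strain, residues in database.items():
        covered = {pos: res for pos, res in residues.items() if offset <= pos < end}
        if not covered:
            continue  # the query says nothing about this strain
        if all(query[pos - offset] == res for pos, res in covered.items()):
            if len(covered) > best_hits:
                best, best_hits = strain, len(covered)
    return best


# Invented example data: two strains distinguished at positions 2 and 5.
toy_database = {
    "strainA": {2: "K", 5: "T"},
    "strainB": {2: "R", 5: "S"},
}
```

With this toy data, a complete query `"MAKLVT"` at offset 0 matches `strainA`, while the partial query `"LVS"` at offset 3 covers only position 5 and matches `strainB`.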

    Bacterial genotyping by 16S rRNA mass cataloging

    BACKGROUND: It has recently been demonstrated that organism identifications can be recovered from mass spectra using various methods including base-specific fragmentation of nucleic acids. Because mass spectrometry is extremely rapid and widely available, such techniques offer significant advantages in some applications. A key element in favor of mass spectrometric analysis of RNA fragmentation patterns is that a reference database for analysis of the results can be generated from sequence information. In contrast to hybridization approaches, the genetic affinity of any unknown isolate can in principle be determined within the context of all previously sequenced 16S rRNAs without prior knowledge of what the organism is. In contrast to the original RNase T1 cataloging method, when digestion products are analyzed by mass spectrometry, products with the same base composition cannot be distinguished. Hence, it is possible that organisms that are not closely related (having different underlying sequences) might be falsely identified by mass spectral coincidence. We present a convenient spectral coincidence function for expressing the degree of similarity (or distance) between any two mass spectra. Trees constructed using this function are consistent with those produced by direct comparison of primary sequences, demonstrating that the inherent degeneracy in mass spectrometric analysis of RNA fragments does not preclude correct organism identification. RESULTS: Neighbor-joining trees for important bacterial pathogens were generated using distances based on mass spectrometric observables and the spectral coincidence function. These trees demonstrate that most pathogens will be readily distinguished using mass spectrometric analyses of RNA digestion products. A more detailed, genus-level analysis of pathogens and near relatives was also performed, and it was found that assignments of genetic affinity were consistent with those obtained by direct sequence comparisons.
Finally, typical values of the coincidence between organisms were also examined with regard to phylogenetic level and sequence variability. CONCLUSION: Cluster analysis based on comparison of mass spectrometric observables using the spectral coincidence function is an extremely useful tool for determining the genetic affinity of an unknown bacterium. Additionally, fragmentation patterns can determine within hours if an unknown isolate is potentially a known pathogen among thousands of possible organisms, and if so, which one.
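
The abstract does not give the form of the spectral coincidence function, so the sketch below substitutes a simple stand-in: each spectrum is treated as a set of observed masses binned to a tolerance, and coincidence is the number of shared bins divided by the geometric mean of the two set sizes. The binning rule, the tolerance, and all names are assumptions, and the paper's actual function may differ.

```python
import math


def binned(masses: list[float], tolerance: float = 0.5) -> set[int]:
    """Map each mass to an integer bin of the given width (in Da).
    Caveat: two masses close to a bin edge can land in different bins."""
    return {round(m / tolerance) for m in masses}


def coincidence(a: list[float], b: list[float], tolerance: float = 0.5) -> float:
    """Stand-in similarity in [0, 1]: shared bins over the geometric mean
    of the two bin counts. 1.0 means identical binned spectra."""
    sa, sb = binned(a, tolerance), binned(b, tolerance)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / math.sqrt(len(sa) * len(sb))


def distance_matrix(spectra: dict[str, list[float]], tolerance: float = 0.5):
    """Names (sorted) and a square matrix of distances 1 - coincidence."""
    names = sorted(spectra)
    matrix = [
        [1.0 - coincidence(spectra[p], spectra[q], tolerance) for q in names]
        for p in names
    ]
    return names, matrix
```

A distance matrix of this kind is the sort of input a neighbor-joining implementation expects, which is how trees based on mass spectrometric observables could be built.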

    Methods for determining the genetic affinity of microorganisms and viruses

    A method is described for selecting the sub-sequences in a nucleic acid database, such as 16S rRNA, that are highly characteristic of particular groupings of bacteria, fungi, or other microorganisms on an essentially phylogenetic tree. The method is also applicable to viruses, using viral genomic RNA or DNA. A catalogue of highly characteristic sequences identified by this method is assembled to establish the genetic identity of an unknown organism. The characteristic sequences are used to design nucleic acid hybridization probes that include the characteristic sequence or its complement, or are derived from one or more characteristic sequences. A plurality of these characteristic sequences is used in hybridization to determine the phylogenetic tree position of the organism(s) in a sample. For target organisms represented in the original sequence database, a sufficient number of characteristic sequences can identify the organism to the species or subspecies level. Oligonucleotide arrays of many probes are especially preferred. A hybridization signal can comprise fluorescence, chemiluminescence, isotopic labeling, etc.; alternatively, sequences in a sample can be detected by direct means, e.g., mass spectrometry. The method's characteristic sequences can also be used to design specific PCR primers. The method uniquely identifies the phylogenetic affinity of an unknown organism without requiring prior knowledge of what is present in the sample. Even if the organism has not been previously encountered, the method still provides useful information about which phylogenetic tree bifurcation nodes encompass the organism.
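
One simple reading of "highly characteristic" is a sub-sequence present in every member of a tree grouping and absent from every sequence outside it. The sketch below implements only that strict reading with fixed-length sub-sequences (k-mers); the patent's actual selection criteria are not given in the abstract, and the names and toy sequences here are assumptions.

```python
def kmers(sequence: str, k: int) -> set[str]:
    """All sub-sequences of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def characteristic_kmers(sequences: dict[str, str], clade: set[str], k: int) -> set[str]:
    """k-mers shared by every sequence in the clade and found in none outside it."""
    inside = [kmers(seq, k) for name, seq in sequences.items() if name in clade]
    outside = [kmers(seq, k) for name, seq in sequences.items() if name not in clade]
    if not inside:
        return set()
    shared = set.intersection(*inside)
    seen_outside = set().union(*outside)
    return shared - seen_outside


def matching_clades(unknown: str, signatures: dict[str, set[str]], k: int) -> list[str]:
    """Clade labels whose characteristic k-mers occur in the unknown sequence."""
    present = kmers(unknown, k)
    return sorted(label for label, sig in signatures.items() if sig & present)


# Invented toy data: organisms "a" and "b" form one grouping, "c" lies outside it.
toy_sequences = {"a": "ACGUACGG", "b": "UACGGAUC", "c": "GGAUCCAU"}
```

Running `characteristic_kmers` for each node of a tree yields a signature per node, and `matching_clades` then reports which nodes encompass an unknown sequence even when the organism itself is not in the database.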

    Visualization of ribosomal RNA operon copy number distribution

    BACKGROUND: Results of microbial ecology studies using 16S rRNA sequence information can be deceiving due to differences in rRNA operon copy number and genome size of the detected organisms. It therefore will be useful for investigators to have a better understanding of how these two parameters differ in various organism types. In this study, the number of ribosomal operons and genome size were separately mapped onto a Bacterial phylogenetic tree. RESULTS: A representative Bacterial tree was constructed using 31 marker genes found in 578 bacterial genome sequences. Organism names are displayed on the trees using graduations of color such that similar colors indicate similar numbers of operons or genome size. The resulting images provide an intuitive understanding of how copy number and genome size vary in different Bacterial phyla. CONCLUSION: Once the phylogenetic position of a novel organism is known, the number of rRNA operons, and to a lesser extent the genome size, can be estimated by examination of the colored maps. Further detail can then be obtained for members of relevant taxa from the rrnDB database.
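
The colour graduation can be sketched as a linear interpolation between two end colours. The blue-to-red ramp, the value range, and the function name below are assumptions for illustration; the abstract does not specify the colour scale used in the published figures.

```python
def value_to_rgb(
    value: float,
    vmin: float,
    vmax: float,
    low: tuple[int, int, int] = (0, 0, 255),   # assumed colour for the smallest value
    high: tuple[int, int, int] = (255, 0, 0),  # assumed colour for the largest value
) -> tuple[int, int, int]:
    """Linearly interpolate an RGB colour so that similar values
    (operon counts or genome sizes) receive similar colours."""
    if vmax == vmin:
        t = 0.0
    else:
        # Clamp to [0, 1] so out-of-range values take the end colours.
        t = min(1.0, max(0.0, (value - vmin) / (vmax - vmin)))
    return tuple(round(lo + t * (hi - lo)) for lo, hi in zip(low, high))
```

Each organism name on the tree would then be drawn in the colour returned for its operon count, or for its genome size in the second map.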

    Stress-Driven Selection of Novel Phenotypes

    A process has been developed that can confer novel properties, such as metal resistance, to a host bacterium. This same process can also be used to produce RNAs and peptides that have novel properties, such as the ability to bind particular compounds. It is inherent in the method that the peptide or RNA will behave as expected in the target organism. Plasmid-borne mini-gene libraries coding for either a population of combinatorial peptides or stable, artificial RNAs carrying random inserts are produced. These libraries, which have no bias towards any biological function, are used to transform the organism of interest and to serve as an initial source of genetic variation for stress-driven evolution. The transformed bacteria are propagated under selective pressure in order to obtain variants with the desired properties. The process is highly distinct from in vitro methods because the variants are selected in the context of the cell while it is experiencing stress. Hence, the selected peptide or RNA will, by definition, work as expected in the target cell as the cell adapts to its presence during the selection process. Once the novel gene, which produces the sought phenotype, is obtained, it can be transferred to the main genome to increase the genetic stability in the organism. Alternatively, the cell line can be used to produce novel RNAs or peptides with selectable properties in large quantity for separate purposes. The system allows for easy, large-scale purification of the RNAs or peptide products. The process has been reduced to practice by imposing sub-inhibitory concentrations of NiCl2 on cells of the bacterium Escherichia coli that were transformed separately with the peptide library and RNA library. The evolved resistant clones were isolated, and sequences of the selected mini-gene variants were established.
Clones resistant to NiCl2 were found to carry identical plasmid variants with a functional mini-gene that specifically conferred significant nickel tolerance on the host cells. Sequencing of the selected mini-gene revealed a propensity of the encoded peptide to bind transition metal ions. Expression of the mini-gene markedly improved growth parameters of the evolved clones at sub-inhibitory concentrations of NiCl2 while being slightly detrimental in the absence of stress. Similar results have been obtained with the RNA libraries. Overall, the results demonstrate a very natural outcome of the selection experiments in which the mini-genes were expected to be either successfully integrated into bacterial genetic networks, or rejected depending upon their effect on host fitness. The approach described here can be useful as a laboratory model to study the dynamics of bacterial adaptive evolution on the molecular level. It can also provide a strategy for screening expressed DNA libraries in search of novel genes with desirable properties.