    GC-Profile: a web-based tool for visualizing and analyzing the variation of GC content in genomic sequences

    In order to understand the evolution, structure and function of genomes, it is important to know the general compositional features of DNA sequences. Based on the quadratic divergence, a new segmentation algorithm has been put forward that partitions a given genome or DNA sequence into compositionally distinct domains. With the aid of the cumulative GC profile technique, the distribution of segmentation points can be displayed intuitively. We have therefore integrated both methods into GC-Profile, an interactive web-based software system that can be used to segment prokaryotic and eukaryotic genomes. GC-Profile provides a quantitative and qualitative view of genome organization. Based on the obtained results, the relationships between the G+C content and other genomic features, such as the distributions of genes and CpG islands, can be analyzed visually. This suggests that GC-Profile would be an appropriate starting point for analyzing the isochore structure of higher eukaryotic genomes, and an intuitive tool for identifying genomic islands in prokaryotic genomes. GC-Profile is freely available at the website . In addition, precompiled binaries, together with examples and documentation, can be freely downloaded for local execution.
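The idea of reading domain boundaries off a cumulative profile can be illustrated with a minimal Python sketch. It is a simplification and not the GC-Profile algorithm itself: it accumulates the deviation of each base from the mean GC fraction and takes the most extreme point of that curve as a single candidate boundary, whereas the abstract describes a segmentation based on quadratic divergence. The function names and the toy sequence are hypothetical.

```python
def cumulative_gc_deviation(seq):
    """Running sum of (is_GC - mean GC fraction) along the sequence."""
    flags = [1 if base in "GC" else 0 for base in seq.upper()]
    if not flags:
        raise ValueError("sequence must not be empty")
    mean_gc = sum(flags) / len(flags)
    profile, total = [], 0.0
    for flag in flags:
        total += flag - mean_gc
        profile.append(total)
    return profile


def best_split(profile):
    """Length of the left segment at the most extreme point of the profile.

    The last point is skipped because the profile returns to zero there.
    """
    if len(profile) < 2:
        raise ValueError("profile needs at least two points")
    idx = max(range(len(profile) - 1), key=lambda i: abs(profile[i]))
    return idx + 1


# Hypothetical toy sequence: an AT-only domain followed by a GC-only domain.
seq = "AT" * 50 + "GC" * 50
profile = cumulative_gc_deviation(seq)
split = best_split(profile)
print(split)
```

A falling stretch of the profile marks a region poorer in G+C than the sequence average and a rising stretch marks a richer one, so turning points are candidate domain boundaries.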

    Towards pathogenomics: a web-based resource for pathogenicity islands

    Pathogenicity islands (PAIs) are genetic elements whose products are essential to the process of disease development. They have been horizontally (laterally) transferred from other microbes and are important in evolution of pathogenesis. In this study, a comprehensive database and search engines specialized for PAIs were established. The pathogenicity island database (PAIDB) is a comprehensive relational database of all the reported PAIs and potential PAI regions which were predicted by a method that combines feature-based analysis and similarity-based analysis. Also, using the PAI Finder search application, a multi-sequence query can be analyzed onsite for the presence of potential PAIs. As of April 2006, PAIDB contains 112 types of PAIs and 889 GenBank accessions containing either partial or all PAI loci previously reported in the literature, which are present in 497 strains of pathogenic bacteria. The database also offers 310 candidate PAIs predicted from 118 sequenced prokaryotic genomes. With the increasing number of prokaryotic genomes without functional inference and sequenced genetic regions of suspected involvement in diseases, this web-based, user-friendly resource has the potential to be of significant use in pathogenomics. PAIDB is freely accessible at

    The Wavelet-Based Cluster Analysis for Temporal Gene Expression Data

    A variety of high-throughput methods have made it possible to generate detailed temporal expression data for a single gene or large numbers of genes. Common methods for analysis of these large data sets can be problematic. One challenge is the comparison of temporal expression data obtained from different growth conditions where the patterns of expression may be shifted in time. We propose the use of wavelet analysis to transform the data obtained under different growth conditions to permit comparison of expression patterns from experiments that have time shifts or delays. We demonstrate this approach using detailed temporal data for a single bacterial gene obtained under 72 different growth conditions. This general strategy can be applied in the analysis of data sets of thousands of genes under different conditions
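Why a wavelet representation helps with delayed patterns can be sketched in Python under stated assumptions. The abstract does not say which wavelet or which comparison statistic the authors use; here a Haar transform and per-scale detail energies are assumed purely for illustration, and the time course is made up. In this sketch the energies at a given level are unchanged only when the circular delay is a multiple of 2 to the power of that level.

```python
import math


def haar_step(signal):
    """One level of the orthonormal Haar transform: (approximation, detail)."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail


def scale_energies(signal):
    """Energy of the Haar detail coefficients at each level, finest first.

    The signal length is assumed to be a power of two.
    """
    energies = []
    current = list(signal)
    while len(current) > 1:
        current, detail = haar_step(current)
        energies.append(sum(d * d for d in detail))
    return energies


# Hypothetical expression time course (16 samples) and a copy delayed by 4 samples.
base = [0, 1, 3, 6, 8, 6, 3, 1, 0, 0, 0, 0, 0, 0, 0, 0]
delayed = base[-4:] + base[:-4]

e_base = scale_energies(base)
e_delayed = scale_energies(delayed)
pointwise_gap = sum((a - b) ** 2 for a, b in zip(base, delayed))
print(e_base[:2], e_delayed[:2], pointwise_gap)
```

The point-by-point difference between the two series is large, while the two finest-scale energies agree, which is the kind of summary that lets shifted patterns be compared.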

    Application of Wavelet Packet Transform to detect genetic polymorphisms by the analysis of inter-Alu PCR patterns

    BACKGROUND: The analysis of inter-Alu PCR patterns obtained from human genomic DNA samples is a promising technique for the simultaneous analysis of many genomic loci flanked by Alu repetitive sequences in order to detect genetic polymorphisms. Inter-Alu PCR products may be separated and analyzed by capillary electrophoresis using an automatic sequencer that generates a complex pattern of peaks. We propose an algorithmic method based on the Haar-Walsh Wavelet Packet Transformation (WPT) for the efficient detection of fingerprint-type patterns generated by PCR-based methodologies. We tested our approach on inter-Alu patterns obtained from the genomic DNA of three pairs of monozygotic twins, expecting the inter-Alu patterns within each twin pair to differ only because of unavoidable experimental variability. In contrast, differences among samples from different twin pairs are expected to originate from genetic variability. Our goal is to automatically detect regions in the inter-Alu pattern that are likely associated with genetic polymorphisms. RESULTS: We show that the WPT algorithm provides a reliable tool for identifying sample-to-sample differences in complex peak patterns, reducing the errors and limits associated with subjective evaluation. The redundant decomposition of the WPT algorithm allows a best-basis selection procedure that maximizes the pattern differences at the lowest possible scale. Our analysis points out a few classifying signal regions that could indicate the presence of genetic polymorphisms. CONCLUSIONS: The WPT algorithm based on the Haar-Walsh wavelet is an efficient tool for unsupervised pattern classification of inter-Alu signals provided by a genetic analyzer, although it was not possible to estimate the power and false positive rate owing to the lack of a suitable database. The identification of non-reproducible peaks is usually accomplished by comparing different experimental replicates of each sample. Moreover, although we developed and optimized the algorithm to analyze patterns obtained through inter-Alu PCR, the method is theoretically applicable to any fingerprint-type pattern obtained by analyzing anonymous DNA fragments through capillary electrophoresis, and it could be usefully applied to a wide range of fingerprint-type methodologies.
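A wavelet packet decomposition differs from an ordinary wavelet transform in that both the low-pass and the high-pass outputs are split again at every level. The Python sketch below shows a Haar-based packet tree and a naive comparison of two traces. It is an illustration only: the best-basis selection mentioned in the abstract is not implemented, and the function names and peak traces are hypothetical.

```python
import math


def haar_split(x):
    """Split a signal into orthonormal Haar low-pass and high-pass halves."""
    s = 1 / math.sqrt(2)
    low = [(x[i] + x[i + 1]) * s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) * s for i in range(0, len(x), 2)]
    return low, high


def wavelet_packets(x, levels):
    """All packet nodes at the requested level (2**levels lists of coefficients).

    The signal length is assumed to be divisible by 2**levels.
    """
    nodes = [list(x)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            low, high = haar_split(node)
            next_nodes.extend([low, high])
        nodes = next_nodes
    return nodes


def largest_difference(a, b, levels):
    """(node index, position, |difference|) of the most different coefficient."""
    packets_a = wavelet_packets(a, levels)
    packets_b = wavelet_packets(b, levels)
    candidates = (
        (n, k, abs(ca - cb))
        for n, (node_a, node_b) in enumerate(zip(packets_a, packets_b))
        for k, (ca, cb) in enumerate(zip(node_a, node_b))
    )
    return max(candidates, key=lambda item: item[2])


# Hypothetical electrophoresis traces: trace_b has one extra peak near the end.
trace_a = [0, 0, 5, 5, 0, 0, 0, 0]
trace_b = [0, 0, 5, 5, 0, 0, 3, 3]
node, position, gap = largest_difference(trace_a, trace_b, levels=1)
print(node, position, gap)
```

Here the extra peak shows up as a single low-pass coefficient, so the comparison localizes the difference to one position in one node instead of relying on visual inspection of the raw traces.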

    Genome Scale Comparison of Mycobacterium avium subsp. paratuberculosis with Mycobacterium avium subsp. avium Reveals Potential Diagnostic Sequences

    The genetic similarity between Mycobacterium avium subsp. paratuberculosis and other mycobacterial species has confounded the development of M. avium subsp. paratuberculosis-specific diagnostic reagents. Random shotgun sequencing of the M. avium subsp. paratuberculosis genome in our laboratories has shown >98% sequence identity with Mycobacterium avium subsp. avium in some regions. However, an in silico comparison of the largest annotated M. avium subsp. paratuberculosis contigs, totaling 2,658,271 bp, with the unfinished M. avium subsp. avium genome has revealed 27 predicted M. avium subsp. paratuberculosis coding sequences that do not align with M. avium subsp. avium sequences. BLASTP analysis of the 27 predicted coding sequences (genes) shows that 24 do not match sequences in public sequence databases, such as GenBank. These novel sequences were examined by PCR amplification with genomic DNA from eight mycobacterial species and ten independent isolates of M. avium subsp. paratuberculosis. From these analyses, 21 genes were found to be present in all M. avium subsp. paratuberculosis isolates and absent from all other mycobacterial species tested. One region of the M. avium subsp. paratuberculosis genome contains a cluster of eight genes, arranged in tandem, that is absent in other mycobacterial species. This region spans 4.4 kb and is separated from other predicted coding regions by 1,408 bp upstream and 1,092 bp downstream. The gene upstream of this eight-gene cluster has strong similarity to mycobacteriophage integrase sequences. The GC content of this 4.4-kb region is 66%, which is similar to the rest of the genome, indicating that this region was not horizontally acquired recently. Southern hybridization analysis confirmed that this gene cluster is present only in M. avium subsp. paratuberculosis. Collectively, these studies suggest that a genomics approach will help in identifying novel M. avium subsp. paratuberculosis genes as candidate diagnostic sequences.

    A Benchmark of Parametric Methods for Horizontal Transfers Detection

    Horizontal gene transfer (HGT) appears to be important for the evolution of prokaryotic species. As a consequence, numerous parametric methods, which use only the information embedded in the genomes, have been designed to detect HGTs. Numerous reports have been published of incongruent results when different methods are applied to the same genomes. The use of artificial genomes in which all HGT parameters are controlled allows different methods to be tested under the same conditions. The results of this benchmark of 16 representative parametric methods showed a wide range of efficiencies. Some methods work very poorly whatever the type of HGT, and some depend on the conditions or on the metrics used. The best methods in terms of total errors were those using tetranucleotides as the criterion for window methods, or those using codon usage for gene-based methods, together with the Kullback-Leibler divergence metric. Window methods are very sensitive but less specific, and they detect lone isolated genes poorly. Gene-based methods, on the other hand, are often very specific but lack sensitivity. We propose using two methods in combination to get the best of each category: a gene-based one for specificity and a window-based one for sensitivity.
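A window method of the kind described can be sketched in Python as follows: each window's tetranucleotide frequencies are compared with the genome-wide frequencies using the Kullback-Leibler divergence, and high-scoring windows are candidates for horizontally acquired DNA. This is a minimal sketch and not one of the 16 benchmarked implementations. The window size, step, pseudocount and toy genome are all assumptions chosen for illustration.

```python
import math
from collections import Counter
from itertools import product

# All 256 tetranucleotides over the DNA alphabet.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def kmer_freqs(seq, pseudo=0.5):
    """Tetranucleotide frequencies with a pseudocount (assumed value) to avoid zeros."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[k] for k in KMERS) + pseudo * len(KMERS)
    return {k: (counts[k] + pseudo) / total for k in KMERS}


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) over the tetranucleotide alphabet."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in KMERS)


def window_scores(genome, window, step):
    """(start, divergence from the genome-wide composition) for each window."""
    background = kmer_freqs(genome)
    scores = []
    for start in range(0, len(genome) - window + 1, step):
        local = kmer_freqs(genome[start:start + window])
        scores.append((start, kl_divergence(local, background)))
    return scores


# Hypothetical artificial genome: a host repeat with a compositionally foreign insert.
host = "AAGCTTGCA" * 40      # 360 bp
insert = "GGGCCC" * 20       # 120 bp, placed at positions 360-479
genome = host + insert + host

scores = window_scores(genome, window=120, step=120)
top_start, top_score = max(scores, key=lambda item: item[1])
print(top_start, round(top_score, 3))
```

With non-overlapping 120 bp windows, the window that coincides with the insert receives the highest divergence. A real scan would use overlapping windows and a threshold instead of a single maximum.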

    A computational approach for identifying pathogenicity islands in prokaryotic genomes

    BACKGROUND: Pathogenicity islands (PAIs), distinct genomic segments of pathogens encoding virulence factors, represent a subgroup of genomic islands (GIs) that have been acquired by horizontal gene transfer events. Up to now, computational approaches for identifying PAIs have focused on detecting genomic regions that differ from the rest of the genome only in their base composition and codon usage. These approaches often lead to the identification of genomic islands rather than PAIs. RESULTS: We present a computational method for detecting potential PAIs in complete prokaryotic genomes by combining sequence similarities and abnormalities in genomic composition. We first collected 207 GenBank accessions containing either part or all of the reported PAI loci. In sequenced genomes, strips of PAI-homologs were defined based on the proximity of the homologs of genes in the same PAI accession. An algorithm reminiscent of a sequence-assembly procedure was then devised to merge overlapping or adjacent genomic strips into a large genomic region. Among the defined genomic regions, PAI-like regions were identified by the presence of homolog(s) of virulence genes. GIs were also postulated by calculating G+C content anomalies and codon usage bias. Of 148 prokaryotic genomes examined, 23 pathogenic and 6 non-pathogenic bacteria contained 77 candidate PAIs that partly or entirely overlap GIs. CONCLUSION: Supporting the validity of our method, the list of candidate PAIs included thirty-four PAIs previously identified in genome sequencing papers. Furthermore, in some instances, our method was able to detect entire PAIs for which only partial sequences are available. Our method proved efficient at demarcating the potential PAIs in our study. In addition, the function(s) and origin(s) of a candidate PAI can be inferred by investigating the PAI queries comprising it. Identification and analysis of potential PAIs in prokaryotic genomes will broaden our knowledge of the structure and properties of PAIs and the evolution of bacterial pathogenesis.
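The step that merges overlapping or adjacent genomic strips into larger regions can be pictured as interval merging. The Python sketch below is an assumption about how such a step might look, not the authors' implementation: strips are modeled as (start, end) coordinate pairs, and the `max_gap` parameter that defines "adjacent" is hypothetical.

```python
def merge_strips(strips, max_gap=0):
    """Merge (start, end) strips that overlap or lie within max_gap bp of each other."""
    merged = []
    for start, end in sorted(strips):
        if merged and start <= merged[-1][1] + max_gap:
            # Overlapping or adjacent: extend the current region.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


# Hypothetical strips of PAI-homologs on one chromosome (coordinates in bp).
strips = [(905, 1200), (100, 500), (5000, 5600), (450, 900)]
regions = merge_strips(strips, max_gap=10)
print(regions)
```

Sorting the strips first means each strip only has to be compared with the most recently built region, so the merge runs in a single pass after the sort.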

    PIPS: Pathogenicity Island Prediction Software

    The adaptability of pathogenic bacteria to hosts is influenced by the genomic plasticity of the bacteria, which can be increased by such mechanisms as horizontal gene transfer. Pathogenicity islands play a major role in this type of gene transfer because they are large, horizontally acquired regions that harbor clusters of virulence genes that mediate the adhesion, colonization, invasion, immune system evasion, and toxigenic properties of the acceptor organism. Currently, pathogenicity islands are mainly identified in silico based on various characteristic features: (1) deviations in codon usage, G+C content or dinucleotide frequency and (2) insertion sequences and/or tRNA genetic flanking regions together with transposase coding genes. Several computational techniques for identifying pathogenicity islands exist. However, most of these techniques are only directed at the detection of horizontally transferred genes and/or the absence of certain genomic regions of the pathogenic bacterium in closely related non-pathogenic species. Here, we present a novel software suite designed for the prediction of pathogenicity islands (pathogenicity island prediction software, or PIPS). In contrast to other existing tools, our approach is capable of utilizing multiple features for pathogenicity island detection in an integrative manner. We show that PIPS provides better accuracy than other available software packages. As an example, we used PIPS to study the veterinary pathogen Corynebacterium pseudotuberculosis, in which we identified seven putative pathogenicity islands

    The Influence of Recombination on Human Genetic Diversity

    In humans, the rate of recombination, as measured on the megabase scale, is positively associated with the level of genetic variation, as measured at the genic scale. Despite considerable debate, it is not clear whether these factors are causally linked or, if they are, whether this is driven by the repeated action of adaptive evolution or molecular processes such as double-strand break formation and mismatch repair. We introduce three innovations to the analysis of recombination and diversity: fine-scale genetic maps estimated from genotype experiments that identify recombination hotspots at the kilobase scale, analysis of an entire human chromosome, and the use of wavelet techniques to identify correlations acting at different scales. We show that recombination influences genetic diversity only at the level of recombination hotspots. Hotspots are also associated with local increases in GC content and the relative frequency of GC-increasing mutations but have no effect on substitution rates. Broad-scale association between recombination and diversity is explained through covariance of both factors with base composition. To our knowledge, these results are the first evidence of a direct and local influence of recombination hotspots on genetic variation and the fate of individual mutations. However, that hotspots have no influence on substitution rates suggests that they are too ephemeral on an evolutionary time scale to have a strong influence on broader scale patterns of base composition and long-term molecular evolution

    Bacterial genomic G + C composition-eliciting environmental adaptation

    Bacterial genomes reflect their adaptation strategies through nucleotide usage trends found in their chromosome composition. Bacteria, unlike eukaryotes, span a wide range of genomic G + C content. This wide variability may be viewed as a response to environmental adaptation. Two overarching trends are observed across bacterial genomes: the first correlates genomic G + C with environmental niche and lifestyle, while the other utilizes intra-genomic G + C incongruence to delineate horizontally transferred material. In this review, we focus on the influence of several factors, including biochemical properties, genetic flows, selection biases, and biochemical-energetic properties, in shaping genome composition. Outcomes indicate a trend toward high G + C and larger genomes in free-living organisms, as a result of more complex and varied environments (higher chance for horizontal gene transfer). Conversely, nutrient-limiting and nutrient-poor environments dictate smaller genomes of low GC in attempts to reduce replication expense. Varied processes, including translesion repair mechanisms, phage insertion and cytosine degradation, have been shown to introduce higher AT content in genomic sequences. We conclude the review with an analysis of current bioinformatics tools that seek to elicit compositional variances and highlight the practical implications of using such techniques.