120 research outputs found
Re-Assembly of the Genome of Francisella tularensis Subsp. holarctica OSU18
Francisella tularensis is a highly infectious human intracellular pathogen that is the causative agent of tularemia. It occurs in several major subtypes, including the live vaccine strain holarctica (type B). F. tularensis is classified as a category A biodefense agent in part because a relatively small number of organisms can cause severe illness. Three complete genomes of subspecies holarctica have been sequenced and deposited in public archives, of which OSU18 was the first and the only strain for which a scientific publication has appeared [1]. We re-assembled the OSU18 strain using both de novo and comparative assembly techniques, and found that the published sequence has two large inversion mis-assemblies. We generated a corrected assembly of the entire genome along with detailed information on the placement of individual reads within the assembly. This assembly will provide a more accurate basis for future comparative studies of this pathogen.
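As a rough illustration of what an inversion mis-assembly means at the sequence level, the sketch below replaces a segment of a sequence with its reverse complement. The function names, the toy sequence, and the half-open coordinate convention are assumptions made for this example; this is not the re-assembly procedure used in the study.

```python
# Illustrative sketch only: models an inversion as the reverse complement
# of a segment. Names and coordinates are hypothetical, not from the paper.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def invert_segment(seq, start, end):
    """Return seq with the half-open interval [start, end) inverted."""
    if not 0 <= start <= end <= len(seq):
        raise ValueError("segment must lie within the sequence")
    return seq[:start] + reverse_complement(seq[start:end]) + seq[end:]


if __name__ == "__main__":
    original = "AAACGTTT"
    inverted = invert_segment(original, 2, 5)
    print(inverted)
    # Inverting the same interval a second time restores the original.
    print(invert_segment(inverted, 2, 5) == original)
```

Under this simplified model, correcting an inversion amounts to inverting the same interval again; locating the interval boundaries from read placements is the hard part and is not covered here.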
Major data analysis errors invalidate cancer microbiome findings
We re-analyzed the data from a recent large-scale study that reported strong correlations between DNA signatures of microbial organisms and 33 different cancer types and that created machine-learning predictors with near-perfect accuracy at distinguishing among cancers. We found at least two fundamental flaws in the reported data and in the methods: (i) errors in the genome database and the associated computational methods led to millions of false-positive findings of bacterial reads across all samples, largely because most of the sequences identified as bacteria were instead human; and (ii) errors in the transformation of the raw data created an artificial signature, even for microbes with no reads detected, tagging each tumor type with a distinct signal that the machine-learning programs then used to create an apparently accurate classifier. Each of these problems invalidates the results, leading to the conclusion that the microbiome-based classifiers for identifying cancer presented in the study are entirely wrong. These flaws have subsequently affected more than a dozen additional published studies that used the same data and whose results are likely invalid as well.
A New Rhesus Macaque Assembly and Annotation for Next-Generation Sequencing Analyses
BACKGROUND: The rhesus macaque (Macaca mulatta) is a key species for advancing biomedical research. Like all draft mammalian genomes, the draft rhesus assembly (rheMac2) has gaps, sequencing errors and misassemblies that have prevented automated annotation pipelines from functioning correctly. Another rhesus macaque assembly, CR_1.0, is also available but is substantially more fragmented than rheMac2 with smaller contigs and scaffolds. Annotations for these two assemblies are limited in completeness and accuracy. High quality assembly and annotation files are required for a wide range of studies including expression, genetic and evolutionary analyses.
RESULTS: We report a new de novo assembly of the rhesus macaque genome (MacaM) that incorporates both the original Sanger sequences used to assemble rheMac2 and new Illumina sequences from the same animal. MacaM has a weighted average (N50) contig size of 64 kilobases, more than twice the size of the rheMac2 assembly and almost five times the size of the CR_1.0 assembly. The MacaM chromosome assembly incorporates information from previously unutilized mapping data and preliminary annotation of scaffolds. Independent assessment of the assemblies using Ion Torrent read alignments indicates that MacaM is more complete and accurate than rheMac2 and CR_1.0. We assembled messenger RNA sequences from several rhesus tissues into transcripts which allowed us to identify a total of 11,712 complete proteins representing 9,524 distinct genes. Using a combination of our assembled rhesus macaque transcripts and human transcripts, we annotated 18,757 transcripts and 16,050 genes with complete coding sequences in the MacaM assembly. Further, we demonstrate that the new annotations provide greatly improved accuracy as compared to the current annotations of rheMac2. Finally, we show that the MacaM genome provides an accurate resource for alignment of reads produced by RNA sequence expression studies.
CONCLUSIONS: The MacaM assembly and annotation files provide a substantially more complete and accurate representation of the rhesus macaque genome than rheMac2 or CR_1.0 and will serve as an important resource for investigators conducting next-generation sequencing studies with nonhuman primates.
REVIEWERS: This article was reviewed by Dr. Lutz Walter, Dr. Soojin Yi and Dr. Kateryna Makova.
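The N50 statistic quoted above can be computed from a list of contig lengths. The following is a minimal sketch, assuming the common definition (the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the total assembly size). The function name and the example lengths are hypothetical and are not taken from the MacaM data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Assumed definition: sort lengths from longest to shortest and return
    the length at which the running sum first reaches at least half of
    the total.
    """
    lengths = list(lengths)
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


if __name__ == "__main__":
    # Hypothetical contig lengths, for illustration only.
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
```

Because N50 is length-weighted, a few long contigs can raise it considerably, which is why it is usually reported alongside other measures of completeness and accuracy.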
Genome Assembly Has a Major Impact on Gene Content: A Comparison of Annotation in Two Bos Taurus Assemblies
Gene and SNP annotation are among the first and most important steps in analyzing a genome. As the number of sequenced genomes continues to grow, a key question is: how does the quality of the assembled sequence affect the annotations? We compared the gene and SNP annotations for two different Bos taurus genome assemblies built from the same data but with significant improvements in the later assembly. The same annotation software was used for annotating both sequences. While some annotation differences are expected even between high-quality assemblies such as these, we found that a staggering 40% of the genes (>9,500) varied significantly between assemblies, due in part to the availability of new gene evidence but primarily to genome mis-assembly events and local sequence variations. For instance, although the later assembly is generally superior, 660 protein coding genes in the earlier assembly are entirely missing from the later genome's annotation, and approximately 3,600 (15%) of the genes have complex structural differences between the two assemblies. In addition, 12–20% of the predicted proteins in both assemblies have relatively large sequence differences when compared to their RefSeq models, and 6–15% of bovine dbSNP records are unrecoverable in the two assemblies. Our findings highlight the consequences of genome assembly quality on gene and SNP annotation and argue for continued improvements in any draft genome sequence. We also found that tracking a gene between different assemblies of the same genome is surprisingly difficult, due to the numerous changes, both small and large, that occur in some genes. As a side benefit, our analyses helped us identify many specific loci for improvement in the Bos taurus genome assembly.
Clustering metagenomic sequences with interpolated Markov models
Background: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering reads together that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of environments interesting to many metagenomics projects.
Results: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets, and suggest a hybrid of SCIMM and supervised learning method Phymm called PHYSCIMM that performs better when evolutionarily close training genomes are available.
Conclusions: SCIMM and PHYSCIMM are highly accurate methods to cluster metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available open source from http://www.cbcb.umd.edu/software/scimm.
Yersinia pestis Evolution on a Small Timescale: Comparison of Whole Genome Sequences from North America
Yersinia pestis, the etiologic agent of plague, was responsible for several devastating epidemics throughout history and is of global importance to current public health and biodefense efforts. Y. pestis is widespread in the Western United States. Because Y. pestis was first introduced to this region just over 100 years ago, there has been little time for genetic diversity to accumulate. Recent studies based upon single nucleotide polymorphisms have begun to quantify the genetic diversity of Y. pestis in North America. To examine the evolution of Y. pestis in North America, a gapped genome sequence of CA88-4125 was generated. Sequence comparison with another North American Y. pestis strain, CO92, identified seven regions of difference (six inversions, one rearrangement), differing IS element copy numbers, and several SNPs. The relatively large number of inverted/rearranged segments suggests that North American Y. pestis strains may be undergoing inversion fixation at high rates over a short time span, contributing to higher-than-expected diversity in this region. These findings will hopefully encourage the scientific community to sequence additional Y. pestis strains from North America and abroad, leading to a greater understanding of the evolutionary history of this pathogen.
Efficient oligonucleotide probe selection for pan-genomic tiling arrays
Background: Array comparative genomic hybridization is a fast and cost-effective method for detecting, genotyping, and comparing the genomic sequence of unknown bacterial isolates. This method, as with all microarray applications, requires adequate coverage of probes targeting the regions of interest. An unbiased tiling of probes across the entire length of the genome is the most flexible design approach. However, such a whole-genome tiling requires that the genome sequence is known in advance. For the accurate analysis of uncharacterized bacteria, an array must query a fully representative set of sequences from the species' pan-genome. Prior microarrays have included only a single strain per array or the conserved sequences of gene families. These arrays omit potentially important genes and sequence variants from the pan-genome.
Results: This paper presents a new probe selection algorithm (PanArray) that can tile multiple whole genomes using a minimal number of probes. Unlike arrays built on clustered gene families, PanArray uses an unbiased, probe-centric approach that does not rely on annotations, gene clustering, or multi-alignments. Instead, probes are evenly tiled across all sequences of the pan-genome at a consistent level of coverage. To minimize the required number of probes, probes conserved across multiple strains in the pan-genome are selected first, and additional probes are used only where necessary to span polymorphic regions of the genome. The viability of the algorithm is demonstrated by array designs for seven different bacterial pan-genomes and, in particular, the design of a 385,000 probe array that fully tiles the genomes of 20 different Listeria monocytogenes strains with overlapping probes at greater than twofold coverage.
Conclusion: PanArray is an oligonucleotide probe selection algorithm for tiling multiple genome sequences using a minimal number of probes. It is capable of fully tiling all genomes of a species on a single microarray chip. These unique pan-genome tiling arrays provide maximum flexibility for the analysis of both known and uncharacterized strains. https://doi.org/10.1186/1471-2105-10-29
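The selection order described in the abstract (probes conserved across multiple strains first, extra probes only where needed) can be illustrated with a simple greedy sketch. This is not the PanArray algorithm itself: it assumes exact-match probes of a fixed length k, requires only that every base be covered at least once rather than at a consistent tiling depth, and is not guaranteed to find the smallest probe set. The function name and the toy strains are hypothetical.

```python
from collections import defaultdict


def select_probes(strains, k):
    """Greedily pick k-mer probes so every base of every strain is covered.

    Simplified illustration: candidates are all exact k-mers, ranked by
    how many strains contain them (most conserved first).
    """
    # Map each k-mer to the (strain index, start position) pairs where it occurs.
    occurrences = defaultdict(set)
    for s_idx, seq in enumerate(strains):
        for start in range(len(seq) - k + 1):
            occurrences[seq[start:start + k]].add((s_idx, start))

    def strain_count(kmer):
        return len({s_idx for s_idx, _ in occurrences[kmer]})

    # Most widely conserved first; ties broken alphabetically for determinism.
    candidates = sorted(occurrences, key=lambda km: (-strain_count(km), km))

    covered = set()  # (strain index, base position) pairs covered so far
    chosen = []
    for kmer in candidates:
        bases = {
            (s_idx, pos)
            for s_idx, start in occurrences[kmer]
            for pos in range(start, start + k)
        }
        # Keep a probe only if it covers at least one new base.
        if not bases <= covered:
            chosen.append(kmer)
            covered |= bases
    return chosen


if __name__ == "__main__":
    # Two toy "strains" that differ at a single position.
    print(select_probes(["ACGTACGT", "ACGTTCGT"], 4))
```

In the toy input, the shared probe is chosen first and the remaining probes are added only around the polymorphic position, which mirrors the ordering the abstract describes.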
Highly Pathogenic Avian Influenza Virus Subtype H5N1 in Africa: A Comprehensive Phylogenetic Analysis and Molecular Characterization of Isolates
Highly pathogenic avian influenza virus A/H5N1 was first officially reported in Africa in early 2006. Since the first outbreak in Nigeria, this virus spread rapidly to other African countries. From its emergence to early 2008, 11 African countries experienced A/H5N1 outbreaks in poultry and human cases were also reported in three of these countries. At present, little is known of the epidemiology and molecular evolution of A/H5N1 viruses in Africa. We have generated 494 full gene sequences from 67 African isolates and applied molecular analysis tools to a total of 1,152 A/H5N1 sequences obtained from viruses isolated in Africa, Europe and the Middle East between 2006 and early 2008. Detailed phylogenetic analyses of the 8 viral gene segments confirmed that 3 distinct sublineages were introduced, which have persisted and spread across the continent over this 2-year period. Additionally, our molecular epidemiological studies highlighted the association between genetic clustering and area of origin in a majority of cases. Molecular signatures unique to strains isolated in selected areas also gave us a clearer picture of the spread of A/H5N1 viruses across the continent. Mutations described as typical of human influenza viruses in the genes coding for internal proteins or associated with host adaptation and increased resistance to antiviral drugs have also been detected in the genes coding for transmembrane proteins. These findings raise concern for the possible human health risk presented by viruses with these genetic properties and highlight the need for increased efforts to monitor the evolution of A/H5N1 viruses across the African continent. They further stress how imperative it is to implement sustainable control strategies to improve animal and public health at a global level.
The genomes of two key bumblebee species with primitive eusocial organization
Background: The shift from solitary to social behavior is one of the major evolutionary transitions. Primitively eusocial bumblebees are uniquely placed to illuminate the evolution of highly eusocial insect societies. Bumblebees are also invaluable natural and agricultural pollinators, and there is widespread concern over recent population declines in some species. High-quality genomic data will inform key aspects of bumblebee biology, including susceptibility to implicated population viability threats. Results: We report the high quality draft genome sequences of Bombus terrestris and Bombus impatiens, two ecologically dominant bumblebees and widely utilized study species. Comparing these new genomes to those of the highly eusocial honeybee Apis mellifera and other Hymenoptera, we identify deeply conserved similarities, as well as novelties key to the biology of these organisms. Some honeybee genome features thought to underpin advanced eusociality are also present in bumblebees, indicating an earlier evolution in the bee lineage. Xenobiotic detoxification and immune genes are similarly depauperate in bumblebees and honeybees, and multiple categories of genes linked to social organization, including development and behavior, show high conservation. Key differences identified include a bias in bumblebee chemoreception towards gustation from olfaction, and striking differences in microRNAs, potentially responsible for gene regulation underlying social and other traits. Conclusions: These two bumblebee genomes provide a foundation for post-genomic research on these key pollinators and insect societies. Overall, gene repertoires suggest that the route to advanced eusociality in bees was mediated by many small changes in many genes and processes, and not by notable expansion or depauperation.