153 research outputs found

    De novo finished 2.8 Mbp Staphylococcus aureus genome assembly from 100 bp short and long range paired-end reads

    Motivation: Paired-end sequencing circumvents the shortness of the reads produced by second-generation sequencers and is essential for de novo assembly of genomes. However, obtaining a finished genome from short reads is still an open challenge. We present an algorithm that exploits the pairing information from inserts of potentially any length. The method determines paths through an overlap graph by using a constrained search tree. We also present a method that automatically determines suitable overlap cutoffs according to the contextual coverage, thus reducing the need for manual parameterization. Finally, we introduce an interactive mode that allows querying an assembly at targeted regions. Results: We assess our methods by assembling two Staphylococcus aureus strains that were sequenced on the Illumina platform. Using 100 bp paired-end reads and minimal manual curation, we produce a finished genome sequence for the previously undescribed isolate SGH-10-168. Availability and implementation: The presented algorithms are implemented in the standalone Edena software, freely available under the General Public License (GPLv3) at www.genomic.ch/edena.php. Contact: [email protected] Supplementary Information: Supplementary data are available at Bioinformatics online.

    BASE: a practical de novo assembler for large genomes using long NGS reads

    © 2016 The Author(s). Background: De novo genome assembly using NGS data remains a computation-intensive task, especially for large genomes. In practice, efficiency is often a primary concern and favors using a more efficient assembler like SOAPdenovo2. Yet SOAPdenovo2, based on the de Bruijn graph, fails to take full advantage of longer NGS reads (say, 150 bp to 250 bp from Illumina HiSeq and MiSeq). Assemblers that are based on string graphs (e.g., SGA), though less popular and also very slow, are more favorable for longer reads. Methods: This paper presents a new de novo assembler called BASE. It enhances the classic seed-extension approach by indexing the reads efficiently to generate adaptive seeds that have a high probability of appearing uniquely in the genome. Such seeds form the basis for BASE to build extension trees and then to use reverse validation to remove branches based on read coverage and paired-end information, resulting in high-quality consensus sequences of reads sharing the seeds. Such consensus sequences are then extended to contigs. Results: Experiments on two bacterial and four human datasets show the advantage of BASE in both contig quality and speed in dealing with longer reads. In the experiment on bacteria, two datasets with read lengths of 100 bp and 250 bp were used. Especially for the 250 bp dataset, BASE gives much better quality than SOAPdenovo2 and SGA and is similar to SPAdes. Regarding speed, BASE is consistently a few times faster than SPAdes and SGA, but still slower than SOAPdenovo2. BASE and SOAPdenovo2 are further compared using human datasets with read lengths of 100 bp, 150 bp and 250 bp. BASE shows a higher N50 for all datasets, and the improvement becomes more significant when the read length reaches 250 bp. In addition, BASE is more memory-efficient than SOAPdenovo2 when the sequencing data contain errors. Conclusions: BASE is a practically efficient tool for constructing contigs, with significant improvement in quality for long NGS reads. 
It is relatively easy to extend BASE to include scaffolding.
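
The N50 statistic used in the comparison above can be computed from contig lengths alone. The following is a minimal sketch, assuming the common definition (the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembled bases); the function name and the example lengths are hypothetical and are not taken from the BASE paper.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the running total first reaches at least half of
    the total assembled length.
    """
    if not contig_lengths:
        raise ValueError("need at least one contig length")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (total 300); 80 + 70 = 150 reaches half.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # prints 70
```

A higher N50 means that more of the assembly sits in long contigs, but the statistic says nothing about correctness, so it is usually reported alongside misassembly counts.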

    Bioinformatics analysis of bacterial pathogens from East African camels

    The camel is the most valuable livestock species in arid and semi-arid regions in the Greater Horn of Africa. Streptococcus agalactiae and Staphylococcus aureus are important pathogens for a wide range of hosts including camels, cattle and humans. Streptococcus agalactiae has been reported to cause infections of the skin, the respiratory tract, the mammary gland and the vaginal tract in camels. Staphylococcus aureus has been isolated from the nasal cavity, wound infections and mastitis in camels. Both pathogens account for a decline in the health and productivity of camels, hence causing economic losses to the inhabitants of arid and semi-arid lands. To define candidate virulence traits in these bacteria, we compared the genomes of S. agalactiae and S. aureus. We sequenced and completely assembled the genomes of two S. agalactiae isolates, ILRI005 and ILRI112, from camels with abscesses and an S. aureus isolate, ILRI_Eymole1/1, from a nasal swab of a camel in Kenya. To perform comparative analysis, we also sequenced and assembled an S. agalactiae isolate, 09mas018883, from cattle with subclinical mastitis in Sweden. Mapping assembly, de novo assembly and post-assembly genome finishing were performed to obtain completely assembled genomes. A comparative genomics approach was applied to explore the genetic heterogeneity, core genome construction and protein repertoire comparison of these novel genomes, and to highlight potential virulence factors that could have contributed to the pathogenicity of these isolates in their hosts. Newly sequenced camel S. agalactiae genomes were compared with human and cattle S. agalactiae genomes. This comparison revealed that the two camel isolates were genetically close to each other but relatively distinct from other isolates, while cattle isolate 09mas018883 was genetically closer to the human isolates. A large proportion of the isolate-specific genes of the camel S. 
agalactiae isolates was clustered in putative phage insertions and genomic islands, suggesting the lateral transfer of these putative phages. The two camel S. agalactiae isolates shared a novel potential virulence locus, the CRISPR2 (Clustered Regularly Interspaced Short Palindromic Repeats) locus. The two cattle S. agalactiae isolates and three human S. agalactiae isolates contained similar putative phage insertions. Important potential pathogenic factors found in all S. agalactiae isolates were the CRISPR1 locus, the cyl locus, the capsular polysaccharide locus and pilus islands. Phylogenetic analysis of the novel camel S. aureus genome of sequence type ST30 and previously sequenced human S. aureus genomes of Clonal Complex 30 (CC30) revealed that the camel S. aureus isolate is genetically distinct from human S. aureus isolates of the same sequence type. Important features were also identified, such as genes encoding bacterial adhesins and secretory proteins. The availability of genomic sequences of S. agalactiae and S. aureus from camels, their detailed bioinformatics analysis and the identified potential virulence factors will foster the development of control measures such as molecular diagnostic assays and vaccines for control of S. agalactiae and S. aureus infections in camels. This will ensure improvement in the health and productivity of camels.

    Impact of exposure of methicillin-resistant Staphylococcus aureus to polyhexanide in vitro and in vivo.

    Methicillin-resistant Staphylococcus aureus (MRSA) strains resistant to decolonization agents such as mupirocin and chlorhexidine increase the need to develop alternative decolonization molecules. The absence of reported adverse reactions and bacterial resistance to polyhexanide makes it an excellent choice as a topical antiseptic. In the present study we evaluated the in vitro and in vivo capacity to generate strains with reduced polyhexanide susceptibility and cross-resistance with chlorhexidine and/or antibiotics currently used in the clinic. Here we report the in vitro emergence of reduced susceptibility to polyhexanide by prolonged stepwise exposure to low concentrations in broth culture. Reduced susceptibility to polyhexanide was associated with genomic changes in the mprF and purR genes, and with concomitant decreased susceptibility to daptomycin and other cell-wall active antibiotics. However, the in vitro emergence of reduced susceptibility to polyhexanide did not result in cross-resistance to the antiseptic chlorhexidine. During in vivo polyhexanide clinical decolonization treatment, neither reduced susceptibility to polyhexanide nor cross-resistance to chlorhexidine was observed. Together, these observations suggest that polyhexanide could be used safely for decolonization of carriers of chlorhexidine-resistant S. aureus strains but highlight the need for careful use of polyhexanide at low antiseptic concentrations.

    Draft genome sequences of two Xanthomonas vesicatoria strains from the Balkan peninsula

    Xanthomonas vesicatoria causes bacterial spot disease of pepper and tomato plants. We report here the first genome sequences of X. vesicatoria strains that have been isolated from pepper plants. These data will be used for comparative genomics and will allow the development of new detection and typing tools for epidemiological surveillance.

    Genome Assembly: Novel Applications by Harnessing Emerging Sequencing Technologies and Graph Algorithms

    Genome assembly is a critical first step for biological discovery. All current sequencing technologies share the fundamental limitation that segments read from a genome are much shorter than even the smallest genomes. Traditionally, whole-genome shotgun (WGS) sequencing over-samples a single clonal (or inbred) target chromosome with segments from random positions. The amount of over-sampling is known as the coverage. Assembly software then reconstructs the target. So-called next-generation (or second-generation) sequencing has reduced the cost and increased throughput exponentially over first-generation sequencing. Unfortunately, next-generation sequences present their own challenges to genome assembly: (1) they require amplification of source DNA prior to sequencing, leading to artifacts and biased coverage of the genome; (2) they produce relatively short reads: 100-700 bp; (3) the sizeable runtime of most second-generation instruments is prohibitive for applications requiring rapid analysis, with an Illumina HiSeq 2000 instrument requiring 11 days for the sequencing reaction. Recently, successors to the second-generation instruments (third-generation) have become available. These instruments promise to alleviate many of the downsides of second-generation sequencing and can generate multi-kilobase sequences. The long sequences have the potential to dramatically improve genome and transcriptome assembly. However, the high error rate of these reads is challenging and has limited their use. To address this limitation, we introduce a novel correction algorithm and assembly strategy that utilizes shorter, high-identity sequences to correct the error in single-molecule sequences. Our approach achieves over 99% read accuracy and produces substantially better assemblies than current sequencing strategies. The availability of cheaper sequencing has made new sequencing targets, such as multiple displacement amplified (MDA) single-cells and metagenomes, popular. 
Current algorithms assume assembly of a single clonal target, an assumption that is violated in these sequencing projects. We developed Bambus 2, a new scaffolder that works for metagenomics and single-cell datasets. It can accurately detect repeats without assumptions about the taxonomic composition of a dataset. It can also identify biological variations present in a sample. We have developed a novel end-to-end analysis pipeline leveraging Bambus 2. Due to its modular nature, it is applicable to clonal, metagenomic, and MDA single-cell targets and allows a user to rapidly go from sequences to assembly, annotation, genes, and taxonomic information. We have incorporated a novel viewer, allowing a user to interactively explore the variation present in a genomic project on a laptop. Together, these developments make genome assembly applicable to novel targets while utilizing emerging sequencing technologies. As genome assembly is critical for all aspects of bioinformatics, these developments will enable novel biological discovery.
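
The coverage mentioned in this abstract is conventionally estimated as (number of reads × read length) / genome size. The sketch below works through that arithmetic; the function names and the example numbers are illustrative assumptions, not values reported in the thesis.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average number of times each base of the target is sampled."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


# Illustrative numbers only: a 5 Mbp genome sequenced with 100 bp reads.
print(expected_coverage(2_500_000, 100, 5_000_000))  # prints 50.0
print(reads_needed(50, 100, 5_000_000))              # prints 2500000
```

This is an average only; as the abstract notes, amplification can bias coverage, so individual regions may be sampled far more or far less often than the mean.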

    First draft genome sequence of the Acidovorax caeni sp. nov. type strain R-24608 (DSM 19327)

    We report the draft genome sequence of the Acidovorax caeni type strain R-24608 that was isolated from activated sludge of an aerobic-anaerobic wastewater treatment plant. The closest strain to Acidovorax caeni strain R-24608 is Acidovorax sp. strain MR-S7, with an average open reading frame (ORF) amino-acid sequence similarity of 55.4%.

    Computational Metagenomics: Network, Classification and Assembly

    Due to the rapid advance of DNA sequencing technologies over the past 10 years, large amounts of short DNA reads can be obtained quickly and cheaply. For example, a single Illumina HiSeq machine can produce several terabytes of data within a week. Metagenomics is a new scientific field that involves the analysis of genomic DNA sequences obtained directly from the environment, enabling studies of novel microbial systems. Metagenomics was made possible by high-throughput sequencing technologies. The analysis of the resulting data requires sophisticated computational analyses and data mining. In clinical settings, a fundamental goal of metagenomics is to help people diagnose and cure disease. One major bottleneck so far is how to analyze the huge noisy data sets quickly and precisely. My PhD research focuses on developing algorithms and tools to tackle these challenging and interesting computational problems. From the functional perspective, a metagenomic sample can be represented as a weighted metabolic network, in which the nodes are molecules, edges are enzymes encoded by genes, and the weights can be considered as the number of organisms providing the functions. One goal of functional comparison between metagenomic samples is to find differentially abundant metabolic subnetworks between two groups under comparison. We have developed a statistical network analysis tool, MetaPath, which uses a greedy search algorithm to find the maximum-weight subnetwork and a nonparametric permutation test to measure its statistical significance. Unlike previous approaches, MetaPath explicitly searches for significant subnetworks in the global network, enabling us to detect signatures at a finer level. In addition, we developed statistical methods that take into account the topology of the network when testing the significance of the subnetworks. Another computational problem involves classifying anonymous DNA sequences obtained from metagenomic samples. 
There are several challenges here: (1) The classification labels follow a hierarchical tree structure, in which the leaves are most specific, and the internal nodes are more general. How can we classify novel sequences that do not belong to leaf categories (species) but belong to internal groups (e.g., phylum)? (2) For each classification, how can we compute a confidence score, such that the users have a tradeoff between sensitivity and specificity? (3) How can we analyze billions of data items quickly? We have developed a novel hierarchical classifier (MetaPhyler) for the classification of anonymous DNA reads. Through simulation, MetaPhyler models the distribution of pairwise similarities within different hierarchical groups with nonparametric density estimation. The confidence score is computed as a likelihood ratio. For a query DNA sequence of arbitrary length, its similarity can be calculated through linear approximation. Through benchmark comparison, we have shown that MetaPhyler is significantly faster and more accurate than previous tools. DNA sequencing machines can only produce very short strings (e.g., 100 bp) relative to the size of a genome (e.g., a typical bacterial genome is 5 Mbp). One of the most challenging computational tasks is the assembly of millions of short reads into longer contigs, which are used as the basis of subsequent computational analyses. In this project, we have developed a comparative metagenomic assembler (MetaCompass), which utilizes previously sequenced genomes and produces long contigs through read mapping (alignment) and assembly. Given the availability of thousands of existing bacterial genomes, for a particular sample, MetaCompass first chooses a best subset as the reference based on the taxonomic composition. Then, the reads are aligned against these genomes using MUMmer-map or Bowtie2. 
Afterwards, we use a greedy algorithm for the minimum set-covering problem to build long contigs, and the consensus sequences are computed by majority rule. We also propose an iterative approach to improve the performance. Finally, MetaCompass has been successfully evaluated and tested on over 20 terabytes of metagenomic data sets generated from the Human Microbiome Project. In addition, to facilitate the identification and characterization of antibiotic resistance genes, we have created the Antibiotic Resistance Genes Database (ARDB), which provides a centralized compendium of information on antibiotic resistance. Furthermore, we have applied our tools to the analysis of a novel oral microbiome data set, and have discovered interesting functional mechanisms and ecological changes underlying the transition from health to periodontal disease of the human mouth at a system level.
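
The abstract mentions a greedy algorithm for the minimum set-covering problem. The sketch below shows the textbook greedy heuristic on a toy instance, assuming each aligned read is represented by the set of reference positions it covers; this is an illustration of the general technique, not the MetaCompass implementation, and all names and data are hypothetical.

```python
def greedy_set_cover(universe, subsets):
    """Textbook greedy heuristic for minimum set cover.

    `subsets` maps a name to the set of elements it covers. At each step
    the subset covering the most still-uncovered elements is chosen.
    Returns the chosen names and any elements that no subset can cover.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        gained = subsets[best] & uncovered
        if not gained:
            break  # the remaining elements cannot be covered by any subset
        chosen.append(best)
        uncovered -= gained
    return chosen, uncovered


# Hypothetical reads, each given as the reference positions it covers.
reads = {
    "r1": {1, 2, 3, 4},
    "r2": {3, 4, 5, 6, 7},
    "r3": {7, 8, 9, 10},
    "r4": {5, 6},
    "r5": {1, 2},
}
chosen, missing = greedy_set_cover(range(1, 11), reads)
print(chosen, missing)  # prints ['r2', 'r3', 'r1'] set()
```

The greedy choice does not guarantee a minimum cover, but it is simple and comes with a well-known logarithmic approximation bound, which is why it is a common default for large inputs.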

    Novel methods for comparing and evaluating single and metagenomic assemblies

    The current revolution in genomics has been made possible by software tools called genome assemblers, which stitch together DNA fragments “read” by sequencing machines into complete or nearly complete genome sequences. Despite decades of research in this field and the development of dozens of genome assemblers, assessing and comparing the quality of assembled genome sequences still heavily relies on the availability of independently determined standards, such as manually curated genome sequences, or independently produced mapping data. The focus of this work is to develop reference-free computational methods to accurately compare and evaluate genome assemblies. We introduce a reference-free likelihood-based measure of assembly quality which allows for an objective comparison of multiple assemblies generated from the same set of reads. We define the quality of a sequence produced by an assembler as the conditional probability of observing the sequenced reads from the assembled sequence. A key property of our metric is that the true genome sequence maximizes the score, unlike other commonly used metrics. Despite the unresolved challenges of single genome assembly, the decreasing cost of sequencing technology has led to a sharp increase in metagenomics projects over the past decade. These projects allow us to better understand the diversity and function of microbial communities found in the environment, including the ocean, Arctic regions, other living organisms, and the human body. We extend our likelihood-based framework and show that we can accurately compare assemblies of these complex bacterial communities. After an assembly has been produced, it is not easy to determine what parts of the underlying genome are missing, what parts are mistakes, and what parts are due to experimental artifacts from the sequencing machine. 
Here we introduce VALET, the first reference-free pipeline that flags regions in metagenomic assemblies that are statistically inconsistent with the data generation process. VALET detects mis-assemblies in publicly available datasets and highlights the current shortcomings in available metagenomic assemblers. By providing the computational methods for researchers to accurately evaluate their assemblies, we decrease the chance of incorrect biological conclusions and misguided future studies.
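
The idea of scoring an assembly by the probability of the reads given the assembled sequence can be illustrated with a toy model. The sketch below assumes that each read is sampled uniformly from every start position of the assembly and that bases are miscalled independently with a fixed error rate; this simplified model and all names are assumptions made for illustration, not the model defined in the thesis.

```python
import math


def read_log_likelihood(read, assembly, error_rate=0.01):
    """log P(read | assembly) under a uniform-start, independent-error model.

    `error_rate` must lie strictly between 0 and 1.
    """
    n_positions = len(assembly) - len(read) + 1
    if n_positions <= 0:
        return float("-inf")  # the read cannot be placed on the assembly
    total = 0.0
    for start in range(n_positions):
        p = 1.0
        for r, a in zip(read, assembly[start:start + len(read)]):
            # A miscalled base is assumed equally likely to be any other base.
            p *= (1 - error_rate) if r == a else error_rate / 3
        total += p
    return math.log(total / n_positions)


def assembly_log_likelihood(reads, assembly, error_rate=0.01):
    """Sum of per-read log-likelihoods, treating reads as independent."""
    return sum(read_log_likelihood(r, assembly, error_rate) for r in reads)


# Toy data: error-free 5-mers drawn from a short made-up "true" sequence.
genome = "ACGTTGCAAGGCTA"
reads = [genome[i:i + 5] for i in range(len(genome) - 4)]
wrong = genome[:6] + "A" + genome[7:]  # same length, one substituted base

true_score = assembly_log_likelihood(reads, genome)
wrong_score = assembly_log_likelihood(reads, wrong)
print(true_score > wrong_score)  # prints True
```

On this toy data the true sequence scores higher than the altered one, which mirrors the property the abstract highlights; real data would need a model of paired reads, variable read lengths and sequencing-depth effects.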