
    Do Read Errors Matter for Genome Assembly?

    While most current high-throughput DNA sequencing technologies generate short reads with low error rates, emerging sequencing technologies generate long reads with high error rates. A basic question of interest is the tradeoff between read length and error rate in terms of the information needed for the perfect assembly of the genome. Using an adversarial erasure error model, we make progress on this problem by establishing a critical read length, as a function of the genome and the error rate, above which perfect assembly is guaranteed. For several real genomes, including those from the GAGE dataset, we verify that this critical read length is not significantly greater than the read length required for perfect assembly from reads without errors. Comment: Submitted to ISIT 201

    The unseen world: environmental microbial sequencing and identification methods for ecologists

    Archaea, bacteria, microeukaryotes, and the viruses that infect them (collectively “microorganisms”) are foundational components of all ecosystems, inhabiting almost every imaginable environment and comprising the majority of the planet’s organismal and evolutionary diversity. Microorganisms play integral roles in ecosystem functioning; are important in the biogeochemical cycling of carbon (C), nitrogen (N), sulfur (S), phosphorus (P), and various metals (e.g., Barnard et al. 2005); and may be vital to ecosystem responses to large-scale climatic change (Mackelprang et al. 2011). Rarely found alone, microorganisms often form complex communities that are dynamic in space and time (Martiny et al. 2006). For these and other reasons, ecologists and environmental scientists have become increasingly interested in understanding microbial dynamics in ecosystems. Ecological studies of microbes in the environment generally focus on determining which organisms are present and what functional roles they are playing or could play. Rapid advances in molecular and bioinformatic approaches over the past decade have dramatically reduced the difficulty and cost of addressing such questions (Figure 1; WebTable 1). Yet the range of methodologies currently in use and the rapid pace of their ongoing development can be daunting for researchers unaccustomed to these technologies.

    The Vaginal Microbiome: Disease, Genetics and the Environment

    The vagina is an interactive interface between the host and the environment. Its surface is covered by a protective epithelium colonized by bacteria and other microorganisms. The ectocervix is nonsterile, whereas the endocervix and the upper genital tract are assumed to be sterile in healthy women. Therefore, the cervix serves a pivotal role as a gatekeeper to protect the upper genital tract from microbial invasion and subsequent reproductive pathology. Microorganisms that cross this barrier can cause preterm labor, pelvic inflammatory disease, and other gynecologic and reproductive disorders. Homeostasis of the microbiome in the vagina and ectocervix plays a paramount role in reproductive health. Depending on its composition, the microbiome may protect the vagina from infectious or non-infectious diseases, or it may enhance its susceptibility to them. Because of the nature of this organ, and the fact that it is continuously colonized by bacteria from birth to death, it is virtually certain that this rich environment evolved in concert with its microbial flora. Specific interactions dictated by the genetics of both the host and microbes are likely responsible for maintaining both the environment and the microbiome. However, the genetic basis of these interactions in both the host and the bacterial colonizers is currently unknown. _Lactobacillus_ species are associated with vaginal health, but the role of these species in the maintenance of health is not yet well defined. Similarly, other species, including those representing minor components of the overall flora, undoubtedly influence the ability of potential pathogens to thrive and cause disease. Gross alterations in the vaginal microbiome are frequently observed in women with bacterial vaginosis, but the exact etiology of this disorder is still unknown. 
There are also implications for vaginal flora in non-infectious conditions such as pregnancy, pre-term labor and birth, and possibly fertility and other aspects of women’s health. Conversely, the role of environmental factors in the maintenance of a healthy vaginal microbiome is largely unknown. To explore these issues, we have proposed to address the following questions:

*1.	Do the genes of the host contribute to the composition of the vaginal microbiome?* We hypothesize that genes of both host and bacteria have important impacts on the vaginal microbiome. We are addressing this question by examining the vaginal microbiomes of mono- and dizygotic twin pairs selected from the over 170,000 twin pairs in the Mid-Atlantic Twin Registry (MATR). Subsequent studies, beyond the scope of the current project, may investigate which host genes impact the microbial flora and how they do so.
*2.	What changes in the microbiome are associated with common non-infectious pathological states of the host?* We hypothesize that altered physiological (e.g., pregnancy) and pathologic (e.g., immune suppression) conditions, or environmental exposures (e.g., antibiotics) predictably alter the vaginal microbiome. Conversely, certain vaginal microbiome characteristics are thought to contribute to a woman’s risk for outcomes such as preterm delivery. We are addressing this question by recruiting study participants from the ~40,000 annual clinical visits to women’s clinics of the VCU Health System.
*3.	What changes in the vaginal microbiome are associated with relevant infectious diseases and conditions?* We hypothesize that susceptibility to infectious disease (e.g., HPV, _Chlamydia_ infection, vaginitis, vaginosis) is impacted by the vaginal microbiome. In turn, these infectious conditions clearly can affect the ability of other bacteria to colonize and cause pathology. Again, we are exploring these issues by recruiting participants from visitors to women’s clinics in the VCU Health System.

Three kinds of sequence data are generated in this project: i) rDNA sequences from vaginal microbes; ii) whole metagenome shotgun sequences from vaginal samples; and iii) whole genome shotgun sequences of bacterial clones selected from vaginal samples. The study includes samples from three vaginal sites: mid-vaginal, cervical, and introital. The data sets also include buccal and perianal samples from all twin participants. Samples from these additional sites are used to test the hypothesis of a per continuum spread of bacteria in relation to vaginal health. An extended set of clinical metadata associated with these sequences is deposited with dbGAP. We have currently collected over 4,400 samples from ~100 twins and over 450 clinical participants. We have analyzed and deposited data for 480 rDNA samples, eight whole metagenome shotgun samples, and over 50 complete bacterial genomes. These data are available to accredited investigators according to NIH and Human Microbiome Project (HMP) guidelines. The bacterial clones are deposited in the Biodefense and Emerging Infections Research Resources Repository (http://www.beiresources.org/).

In addition to the extensive sequence data obtained in this study, we are collecting metadata associated with each of the study participants. Thus, participants are asked to complete an extensive health history questionnaire at the time samples are collected. Selected clinical data associated with the visit are also obtained, and relevant information is collected from the medical records when available. These data are maintained securely in a HIPAA-compliant data system as required by VCU’s Institutional Review Board (IRB). The preponderance of these data (i.e., that judged appropriate by NIH staff and VCU’s IRB) are deposited at dbGAP (http://www.ncbi.nlm.nih.gov/gap). Selected fields of these data have been identified by NIH staff as ‘too sensitive’ and are not available in dbGAP. Individuals requiring access to these data fields are asked to contact the PI of this project or NIH Program Staff.

    A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data

    Deep shotgun sequencing and analysis of genomes, transcriptomes, amplified single-cell genomes, and metagenomes has enabled investigation of a wide range of organisms and ecosystems. However, sampling variation in short-read data sets and high sequencing error rates of modern sequencers present many new computational challenges in data interpretation. These challenges have led to the development of new classes of mapping tools and de novo assemblers. These algorithms are challenged by the continued improvement in sequencing throughput. We here describe digital normalization, a single-pass computational algorithm that systematizes coverage in shotgun sequencing data sets, thereby decreasing sampling variation, discarding redundant data, and removing the majority of errors. Digital normalization substantially reduces the size of shotgun data sets and decreases the memory and time requirements for de novo sequence assembly, all without significantly impacting content of the generated contigs. We apply digital normalization to the assembly of microbial genomic data, amplified single-cell genomic data, and transcriptomic data. Our implementation is freely available for use and modification.
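
The single-pass idea in this abstract can be illustrated with a minimal sketch. The abstract does not say how coverage is estimated, so the sketch assumes the coverage of a read is the median count of its k-mers among reads kept so far, and that a read is discarded once that estimate reaches a cutoff. The names `kmers`, `digital_normalize`, `k`, and `cutoff` are hypothetical, and an exact `Counter` stands in for whatever counting structure the authors' implementation uses.

```python
from collections import Counter
from statistics import median


def kmers(read, k):
    """Return all overlapping k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def digital_normalize(reads, k=4, cutoff=2):
    """Single pass over reads: keep a read only while its estimated
    coverage (assumed here: median k-mer count) is below `cutoff`."""
    counts = Counter()  # exact k-mer counts; an assumption made for simplicity
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k: no coverage estimate is possible
        # Coverage estimate from k-mers of the reads kept so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept
```

For example, five copies of the read `"ACGTTGCA"` followed by one `"TTTTGGGG"`, with `k=4` and `cutoff=2`, leave two copies of the first read and the single copy of the second. This shows redundant data being discarded while low-coverage reads are retained; it does not model the error removal the abstract mentions.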

    ShotgunWSD: An unsupervised algorithm for global word sense disambiguation inspired by DNA sequencing

    In this paper, we present a novel unsupervised algorithm for word sense disambiguation (WSD) at the document level. Our algorithm is inspired by a widely-used approach in the field of genetics for whole genome sequencing, known as the Shotgun sequencing technique. The proposed WSD algorithm is based on three main steps. First, a brute-force WSD algorithm is applied to short context windows (up to 10 words) selected from the document in order to generate a short list of likely sense configurations for each window. In the second step, these local sense configurations are assembled into longer composite configurations based on suffix and prefix matching. Finally, the resulting configurations are ranked by their length, and the sense of each word is chosen based on a voting scheme that considers only the top k configurations in which the word appears. We compare our algorithm with other state-of-the-art unsupervised WSD algorithms and demonstrate better performance, sometimes by a very large margin. We also show that our algorithm can yield better performance than the Most Common Sense (MCS) baseline on one data set. Moreover, our algorithm has a very small number of parameters, is robust to parameter tuning, and, unlike other bio-inspired methods, it gives a deterministic solution (it does not involve random choices). Comment: In Proceedings of EACL 201
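
The assembly and voting steps described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: a sense configuration is represented as a hypothetical pair `(start, senses)`, where `senses[i]` is the sense label of the word at position `start + i`, and the function names `merge` and `vote` and the parameters `min_overlap` and `top_k` are invented for the example. The brute-force scoring of local windows is not shown.

```python
from collections import Counter


def merge(a, b, min_overlap=1):
    """Merge configs a and b if a's suffix equals b's prefix where they overlap.

    Returns the merged (start, senses) config, or None when b does not start
    inside a, the overlap is too short, or the overlapping senses disagree.
    """
    (start_a, senses_a), (start_b, senses_b) = a, b
    offset = start_b - start_a
    overlap = len(senses_a) - offset
    if offset <= 0 or overlap < min_overlap or overlap >= len(senses_b):
        return None
    if senses_a[offset:] != senses_b[:overlap]:
        return None
    return (start_a, senses_a + senses_b[overlap:])


def vote(configs, position, top_k=3):
    """Choose the sense at `position` by majority vote among the top_k
    longest configurations that cover that position."""
    covering = [c for c in configs if c[0] <= position < c[0] + len(c[1])]
    covering.sort(key=lambda c: len(c[1]), reverse=True)  # rank by length
    votes = Counter(c[1][position - c[0]] for c in covering[:top_k])
    return votes.most_common(1)[0][0] if votes else None
```

Merging `(0, ("s1", "s2", "s3"))` with `(1, ("s2", "s3", "s4"))` yields one configuration covering positions 0 to 3, because the two agree on positions 1 and 2. Ranking by length before voting means longer assembled configurations, which are consistent across more windows, decide the sense first.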

    Optimal Assembly for High Throughput Shotgun Sequencing

    We present a framework for the design of optimal assembly algorithms for shotgun sequencing under the criterion of complete reconstruction. We derive a lower bound on the read length and the coverage depth required for reconstruction in terms of the repeat statistics of the genome. Building on earlier works, we design a de Bruijn graph-based assembly algorithm which can achieve very close to the lower bound for repeat statistics of a wide range of sequenced genomes, including the GAGE datasets. The results are based on a set of necessary and sufficient conditions on the DNA sequence and the reads for reconstruction. The conditions can be viewed as the shotgun sequencing analogue of Ukkonen-Pevzner's necessary and sufficient conditions for Sequencing by Hybridization. Comment: 26 pages, 18 figure