316 research outputs found

    Use of Whole Genome Shotgun Sequencing for the Analysis of Microbial Communities in Arabidopsis thaliana Leaves

    Microorganisms, including all Bacteria and Archaea and some Eukaryotes, inhabit every imaginable habitat on the planet, from water vents in the deep ocean to extreme environments of high temperature and salinity. Microbes also constitute the most diverse group of organisms in terms of genetic information, metabolic function, and taxonomy. Furthermore, many of these microbes establish complex interactions with each other and with multicellular organisms. The collection of microbes that share a body space with a plant or animal is called the microbiota, and their genetic information is called the microbiome. The microbiota has emerged as a crucial determinant of a host’s overall health, and understanding it has become essential in many biological fields. In mammals, the gut microbiota has been linked to important diseases such as diabetes, inflammatory bowel disease, and dementia. In plants, the microbiota can provide protection against certain pathogens or confer resistance against harsh environmental conditions such as drought. Furthermore, the leaves of plants represent one of the largest surface areas that can potentially be colonized by microbes. The advent of sequencing technologies has allowed researchers to study microbial communities at unprecedented resolution and scale. By targeting individual loci such as the 16S rDNA locus in bacteria, many species can be studied simultaneously, along with properties such as relative abundance, without the need to isolate target taxa individually. Decreasing costs of DNA sequencing have also led to whole genome shotgun sequencing, where random fragments of DNA are sequenced instead of a single locus or a small number of loci. This approach, referred to as metagenomics, effectively renders the entire microbiome accessible to study. Consequently, many more areas of investigation are open, such as the exploration of within-host genetic diversity, functional analysis, or assembly of individual genomes from metagenomes.
In this study, I describe the analysis of metagenomic sequencing data from microbial communities in leaves of wild Arabidopsis thaliana individuals from southwest Germany. As a model organism, A. thaliana not only is accessible in the wild but also has a rich body of previous research in plant-microbe interactions. In the first section, I describe how whole genome shotgun sequencing of leaf DNA extracts can be used to accurately describe the taxonomic composition of the microbial community of individual hosts. The nature of whole genome shotgun sequencing is exploited to estimate true microbial abundances, which cannot be done with amplicon sequencing. I show that this community varies across hosts, although some trends emerge, such as the dominance of the bacterial genera Pseudomonas and Sphingomonas. Moreover, despite the variation between individuals, I explore the influence of site of origin and host genotype. Finally, metagenomic assembly is applied to individual samples, showing the limitations of WGS in plant leaves. In the second section, I explore the genomic diversity of the most abundant genera: Pseudomonas and Sphingomonas. I use a core genome approach where a set of common genes is obtained from previously sequenced and assembled genomes. Thereafter, the gene sequences of the core genome are used as a reference for short-read mapping. Based on these mappings, individual strain mixtures are inferred from the frequency distribution of non-reference bases at each detected single nucleotide polymorphism (SNP). Finally, the SNPs are used to derive the population structure of strain mixtures across samples and relative to known reference genomes. In conclusion, this thesis provides insights into the use of metagenomic sequencing to study microbial populations in wild plants. I identify the strengths and weaknesses of using whole genome sequencing for this purpose, as well as a way to study strain-level dynamics of prevalent taxa within a single host.
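The abstract above mentions inferring strain mixtures from the frequency of non-reference bases at each SNP. The following Python sketch illustrates only that per-site frequency calculation. It is not the thesis's pipeline: the function name, the `pileup` dictionary, and the read bases in it are hypothetical, made up purely for illustration.

```python
from collections import Counter


def non_reference_frequency(ref_base, observed_bases):
    """Return the fraction of read bases at one site that differ from the reference.

    Bases other than A, C, G, T (for example N) are ignored.
    Returns None when no usable base covers the site.
    """
    counts = Counter(b.upper() for b in observed_bases if b.upper() in "ACGT")
    depth = sum(counts.values())
    if depth == 0:
        return None
    return (depth - counts[ref_base.upper()]) / depth


# Hypothetical pileup: position -> (reference base, bases seen in mapped reads).
pileup = {
    101: ("A", "AAAAAAGGGG"),  # 4 of 10 reads carry a non-reference G
    250: ("C", "CCCCCCTTTT"),  # 4 of 10 reads carry a non-reference T
    377: ("G", "GGGGGGGGGG"),  # no non-reference bases
}

freqs = {pos: non_reference_frequency(ref, bases)
         for pos, (ref, bases) in pileup.items()}

for pos, freq in sorted(freqs.items()):
    print(pos, freq)
```

In this toy input, two sites share the same non-reference frequency. Recurring frequencies of that kind across many SNPs are what one might expect if a minority strain made up a fixed share of the sample, though a real analysis would also have to account for sequencing error and uneven coverage.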

    Knowledge-Driven Methods for Geographic Information Extraction in the Biomedical Domain

    Accounting for over a third of all emerging and re-emerging infections, viruses represent a major public health threat, which researchers and epidemiologists across the world have been attempting to contain for decades. Recently, genomics-based surveillance of viruses through methods such as virus phylogeography has grown into a popular tool for infectious disease monitoring. When conducting such surveillance studies, researchers need to manually retrieve geographic metadata denoting the location of the infected host (LOIH) of viruses from public sequence databases such as GenBank and from any publication related to their study. The large volume of semi-structured and unstructured information that must be reviewed, along with the ambiguity of geographic locations, makes this task especially challenging. Prior work has demonstrated that the majority of GenBank records lack sufficient geographic granularity concerning the LOIH of viruses. As a result, reviewing full-text publications is often necessary for conducting in-depth analysis of virus migration, which can be a very time-consuming process. Moreover, integrating geographic metadata pertaining to the LOIH of viruses from different sources, including different fields in GenBank records as well as full-text publications, and normalizing the integrated metadata to unique identifiers for subsequent analysis are also challenging tasks, often requiring expert domain knowledge. Automated information extraction (IE) methods could therefore significantly accelerate this process, positively impacting public health research. However, very few research studies have attempted to use IE methods in this domain.
This work explores the use of novel knowledge-driven geographic IE heuristics for extracting, integrating, and normalizing the LOIH of viruses based on information available in GenBank and related publications. When evaluated on manually annotated test sets, the methods were found to be highly accurate and adequate for addressing this challenging problem. The work also presents GeoBoost, a pioneering software system for georeferencing GenBank records, as well as a large-scale database containing over two million virus GenBank records georeferenced using the algorithms introduced here. The methods, database, and software developed here could help support diverse public health domains focusing on sequence-informed virus surveillance, thereby enhancing existing platforms for controlling and containing disease outbreaks. Dissertation/Thesis, Doctoral Dissertation, Biomedical Informatics 201

    Pool-seq analysis for the identification of polymorphisms in bacterial strains and utilization of the variants for protein database creation

    Pooled sequencing (Pool-seq) is the sequencing of a single library that contains DNA pooled from different samples. It is a cost-effective alternative to individual whole genome sequencing. In this study, we utilized Pool-seq to sequence 100 Streptococcus pyogenes strains in two pools to identify polymorphisms and create variant protein databases for shotgun proteomics analysis. We investigated the efficacy of the pooling strategy and of the four tools used for variant calling by using individual sequence data of six of the strains in the pools as well as 3407 publicly available strains from the European Nucleotide Archive. Besides the raw sequence data from the public repository, we also extracted polymorphisms from 19 publicly available complete S. pyogenes genomes and compared the variations against our pools. In total, 78955 variants (76981 SNPs and 1725 INDELs) were identified from the two pools. Of these, ∼60.5% and 95.7% were discovered in the complete genomes and the European Nucleotide Archive data, respectively. Collectively, the four variant calling tools recovered the majority (∼96.5%) of the variants found in the six individual strains, suggesting that Pool-seq is a robust approach for variation discovery. Variants from the pools that fell in coding regions and had nonsynonymous effects constituted 24% and were used to create variant protein databases for shotgun proteomics analysis. These variant databases improved protein identification in mass spectrometry analysis.

    Deciphering Taxa-function Relationships in Population-level Studies of Human Gut Microbiomes

    The human gut microbiome is a complex and dynamic ecosystem, featuring a multitude of microbes all interacting with their host in an elaborate manner. Even though this exchange is often mediated through microbial metabolic and functional outputs, such as the production of certain metabolites, environmental exposures and host lifestyle are highly influential in shaping the presence of microbial species irrespective of their individual roles. As such, a comprehensive understanding of the microbiome requires researchers to examine taxonomic abundance and function together. Assessing microbial contributions to important ecosystem services can enable the identification of robust functions supported by a variety of species, or of important keystone taxa that are associated with a disease-causing biochemical pathway. The primary objective of this thesis is to assess different approaches for investigating the taxa-function relationship and to evaluate its value in providing unique biological insights. First, we leveraged densely collected multi-omics data from the New Hampshire Birth Cohort Study to identify genus-metabolite pairs that are core to infant gut microbiomes. Second, we developed a novel statistical method that enables integrating taxa-function relationships in epidemiological studies. Third, we assessed microbial phenotypic traits as a potential source for defining interpretable and human-centric microbiome function.

    Global Survey of Organ and Organelle Protein Expression in Mouse: Combined Proteomic and Transcriptomic Profiling

    Summary: Organs and organelles represent core biological systems in mammals, but the diversity in protein composition remains unclear. Here, we combine subcellular fractionation with exhaustive tandem mass spectrometry-based shotgun sequencing to examine the protein content of four major organellar compartments (cytosol, membranes [microsomes], mitochondria, and nuclei) in six organs (brain, heart, kidney, liver, lung, and placenta) of the laboratory mouse, Mus musculus. Using rigorous statistical filtering and machine-learning methods, the subcellular localization of 3274 of the 4768 proteins identified was determined with high confidence, including 1503 previously uncharacterized factors, while tissue selectivity was evaluated by comparison to previously reported mRNA expression patterns. This molecular compendium, fully accessible via a searchable web-browser interface, serves as a reliable reference of the expressed tissue and organelle proteomes of a leading model mammal.

    Collaborative Cross Graphical Genome

    Reference genomes are the foundation of most bioinformatic pipelines. They are conventionally represented as a set of single-sequence assembled contigs, referred to as linear genomes. The rapid growth of sequencing technologies has driven the advent of pangenomes that integrate multiple genome assemblies in a single representation. Graphs are commonly used in pangenome models. However, there are challenges for graph-based pangenome representations and operations. This dissertation introduces methods for reference pangenome construction, genomic feature annotation, and tools for analyzing population-scale sequence data based on a graphical pangenome model. We first develop a genome registration tool for constructing a reference pangenome model by merging multiple linear genome assemblies and annotations into a graphical genome. Secondly, we develop a graph-based coordinate framework and discuss the strategies for referring to, annotating, and comparing genomic features in a graphical pangenome model. We demonstrate that the graph coordinate system simplifies assembly and annotation updates, identifying and segmenting updated sequences in a specific genomic region. Thirdly, we develop an alignment-free method to analyze population-scale sequence data based on a pangenome model. We demonstrate the application of our methods by constructing pangenome models for a mouse genetic reference population, Collaborative Cross. The pangenome framework proposed in this dissertation simplified the maintenance and management of massive genomic data and established a novel data structure for analyzing, visualizing, and comparing genomic features in an intra-specific population. Doctor of Philosoph

    Investigation of intergenic regions of Mycoplasma hyopneumoniae and development of statistical methods for analyzing small-scale RT-qPCR assays

    The transcriptional activity of intergenic (IG) regions of Mycoplasma hyopneumoniae strain 232 was studied via two-color microarrays and quantitative real-time polymerase chain reactions (RT-qPCR). Two types of microarrays were constructed, one consisting of PCR products and the other of synthesized oligonucleotides. The PCR-array consisted of 994 PCR products (probes) covering 98% (683/698) of the total open reading frames (ORFs) of strain 232, five structural ribosomal RNA probes, and 159 IG probes for 112 of 215 IG regions greater than 124 bp. The oligonucleotide-array consisted of 528 oligonucleotide probes ranging in size between 50 and 60 bp, and was designed for IG regions for which PCR products were not constructed or which were 50-124 bp long. Transcriptional signals were identified in 93.6% (321/343) of the IG regions larger than 49 bp. From these IG regions with transcriptional activity, five large (>500 bp) IG regions and the region upstream of dnaK were chosen for further analysis by RT-qPCR. A novel method to compare the relative quantity estimates of several different targets was developed for the RT-qPCR assays, and various methods were investigated to obtain error estimates of the fold change and relative quantity by applying top-down or bottom-up statistical approaches for two different experimental designs. The results from these assays indicate that no single transcriptional start site can account for transcriptional activity within IG regions. Transcription can end abruptly at the end of an ORF, but this does not seem to occur at high frequency. Rather, transcription continues past the end of the ORF, with RNA polymerase gradually releasing the template. Transcription can also be initiated within IG regions in the absence of accepted promoter-like sequences.
Also, when conducting small-scale RT-qPCR studies, the error in estimation of amplification efficiency should not be ignored in determining statistically significant differences. An assay design which uses serial dilutions of each individual sample to determine the amplification efficiency of a target sequence is favored over an assay design which uses the Stock I methodology to evaluate target sequence amplification efficiencies. In summary, methods to analyze the transcriptional activity of M. hyopneumoniae have been developed, and the results have shown that IG regions are transcriptionally active and under some regulatory control.
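The abstract above favors estimating amplification efficiency from serial dilutions of each sample. As a rough illustration, the Python sketch below uses the widely used standard-curve calculation: fit Cq against log10 of the relative input amount, then take efficiency as 10^(-1/slope) - 1. This is an assumption about the general approach, not the statistical method developed in the thesis, and the dilution series and Cq values are synthetic.

```python
import math


def amplification_efficiency(relative_amounts, cq_values):
    """Estimate qPCR amplification efficiency from a serial dilution series.

    Fits Cq = intercept + slope * log10(relative amount) by ordinary least
    squares and returns (slope, efficiency), where
    efficiency = 10 ** (-1 / slope) - 1 (1.0 means perfect doubling per cycle).
    """
    xs = [math.log10(a) for a in relative_amounts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cq_values) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cq_values))
    slope = sxy / sxx
    return slope, 10 ** (-1 / slope) - 1


# Synthetic 10-fold dilution series (hypothetical values): with perfect
# doubling, each 10-fold dilution delays Cq by log2(10), about 3.32 cycles.
relative_amounts = [1.0, 0.1, 0.01, 0.001]
cq_values = [20.0 + k * math.log2(10) for k in range(4)]

slope, efficiency = amplification_efficiency(relative_amounts, cq_values)
print(f"slope = {slope:.3f}, efficiency = {efficiency:.3f}")
```

With real data the fitted slope carries its own uncertainty, and the abstract's point is that this uncertainty should be propagated into fold-change estimates rather than ignored. The sketch returns only the point estimate.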

    Investigating Genetic (In)Compatibility Between Temperate Phages and CRISPR-Cas Systems in Staphylococcus aureus

    Prokaryotic organisms employ various mechanisms for defending against parasitism by viruses and other mobile genetic elements. One form of defense comprises the adaptive immune systems derived from clustered, regularly interspaced, short palindromic repeat (CRISPR) loci and CRISPR-associated (cas) genes. CRISPR-Cas immune systems enable the acquisition of heritable resistance to specific mobile genetic elements on the basis of nucleic acid sequence recognition, but do not necessarily discriminate between target elements which are burdensome and those which are beneficial. My thesis is concerned with the consequences of CRISPR-Cas immunity directed at a particular breed of bacterial DNA viruses, known as temperate phages, which cause both harmful (lytic) and benign (lysogenic) infections under different conditions. Initial studies investigating prokaryotic CRISPR-Cas immunity seemed to indicate that functional, DNA-targeting systems cannot stably co-exist with their target elements in vivo. For example, in studies where immunity was directed at temperate phages, DNA-targeting CRISPR-Cas systems were found to prevent both lysogenic and lytic infections except when targeting was altogether abrogated via mutation or inhibition of the CRISPR-Cas system. The first part of my thesis work includes in vivo experiments which challenged the generality of this view, with regard to the different types of DNA-targeting CRISPR-Cas systems. Namely, I demonstrated that a staphylococcal branch of the ‘type III’ CRISPR-Cas systems is capable of tolerating lysogenic infections by specific temperate phages which are otherwise targeted during lytic infections. I further established that the capacity for conditional temperate phage tolerance results from a transcription-dependent targeting modality which was not anticipated for this particular DNA-targeting type III system. 
In contrast, I observed only the expected genetic escape outcomes when temperate phages were targeted by a ‘type II’ CRISPR-Cas system with a transcription-independent (Cas9-based) DNA targeting modality. These findings laid the groundwork for subsequent studies of CRISPR-Cas immunity to phages in Staphylococcus aureus hosts, and guided my colleagues towards in vitro characterization of the type III system’s transcription-dependent targeting mechanism. CRISPR-Cas systems have been identified in about 50% of sequenced bacterial genomes, and the factors which influence this distribution are still not fully understood. My description of conditional tolerance by a staphylococcal, type III CRISPR-Cas system illustrated that, in principle, these particular systems could stably co-exist with their temperate phage target elements in lysogenic hosts while maintaining their ability to protect against lytic infections. During the second part of my thesis work, I set out to define additional phenotypic consequences for the lysogenized lineages of S. aureus which maintain conditional tolerance, in an effort to better understand how this phenomenon might influence the distribution and stability of type III systems among natural isolates. Notably, I found that the maintenance of certain temperate-phage-targeting systems can incur fitness costs in lysogenic populations. I showed, furthermore, that these costs are potentially greater if more than one temperate phage is targeted in populations of double lysogens, but that they can be alleviated by mutations which do not abrogate phage targeting during lytic infections. Collectively, these findings imply that long-term maintenance of type III systems in natural populations of lysogens might require additional evolutionary fine-tuning, particularly among lineages which are prone to multiple infection.