60 research outputs found

    A Primer on Metagenomics

    Metagenomics is a discipline that enables the genomic study of uncultured microorganisms. Faster, cheaper sequencing technologies and the ability to sequence uncultured microbes sampled directly from their habitats are expanding and transforming our view of the microbial world. Distilling meaningful information from the millions of new genomic sequences presents a serious challenge to bioinformaticians. In cultured microbes, the genomic data come from a single clone, making sequence assembly and annotation tractable. In metagenomics, the data come from heterogeneous microbial communities, sometimes containing more than 10,000 species, with the sequence data being noisy and partial. From sampling, to assembly, to gene calling and function prediction, bioinformatics faces new demands in interpreting voluminous, noisy, and often partial sequence data. Although metagenomics is a relative newcomer to science, the past few years have seen an explosion in computational methods applied to metagenomic-based research. It is therefore not within the scope of this article to provide an exhaustive review. Rather, we provide here a concise yet comprehensive introduction to the current computational requirements presented by metagenomics, and review the recent progress made. We also note whether there is software that implements any of the methods presented here, and briefly review its utility. Nevertheless, it would be useful if readers of this article would avail themselves of the comment section provided by this journal, and relate their own experiences. Finally, the last section of this article provides a few representative studies illustrating different facets of recent scientific discoveries made using metagenomics

    Sequence data mining and characterisation of unclassified microbial diversity

    In the last two decades, sequencing has become increasingly affordable and a routine tool to study the microbial community of a given environment. Metagenomics has revolutionised the way microbes are identified and studied in this age of biological data science because it provides a relatively unbiased view of the composition of microbial communities we interact with every day, which are integral to our ecosystem. These technological advances have led to an exponential growth of raw data repositories that save, distribute and archive these metagenomic datasets. Since metagenomics presents the ultimate opportunity to capture, explore and identify uncultivated microbial genomic sequences, these metagenomic datasets harbour a large proportion of unknown sequences that do not bear any similarity to known sequences readily available in the standard sequence data repositories. The aim of this thesis was to systematically catalogue, quantify and potentially characterise the unknown sequences embedded within the metagenomic datasets. To this end, a comprehensive, portable, modular framework called UnXplore was developed to determine the proportion of unknown sequences included in human microbiome datasets. UnXplore was applied to a range of different human microbiomes and showed that on average 2% of assembled sequences were categorised as unknown meaning that they did not bear any sequence similarity to known sequences. A third of the unknown sequences were shown to contain large open reading frames indicating the coding potential and biological origin of the unknowns. Furthermore, a small proportion of these potentially coding sequences were shown to have functional similarities as they were deemed to contain known protein domain signatures. These results indicated that unknown sequences captured through the UnXplore framework were not artefacts and were indeed of biological origin. 
To test this formally, supervised k-mer-based machine learning models were devised, tested and validated. These models are currently distributed in a package called TetraPredX that can accurately predict whether a sequence originated from bacteria, archaea, virus or plasmid. TetraPredX models were applied to the unknown sequence dataset and revealed that the majority of unknown sequences are of biological origin. Furthermore, TetraPredX results demonstrated that >70% of all long unknown sequences (i.e. >1 kb) are likely to be of virus origin, indicating an unexplored diversity of viruses that is yet to be fully characterised and classified. In order to catalogue the diversity of virus sequences in the human microbiome samples analysed here, an extensive virus discovery analysis was carried out on the contigs assembled through UnXplore. This helped to characterise a vast diversity of prokaryotic, eukaryotic and unclassified virus sequences captured in a range of human microbiomes. The results obtained here demonstrate the need to systematically interrogate metagenomic datasets to fully comprehend and compile the presence of both known and unknown uncultivated microbes within them. A comprehensive survey of metagenomic datasets carried out in this manner would provide a more complete picture of the known and unknown organisms that surround us.
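
As an illustration of the kind of feature a k-mer-based classifier might consume, the sketch below computes a normalised k-mer frequency vector for a DNA sequence. It assumes k = 4 (tetranucleotides), which the package name suggests but the abstract does not state; the function name `kmer_frequencies` is hypothetical and this is not the TetraPredX implementation.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq: str, k: int = 4) -> dict[str, float]:
    """Return normalised frequencies of all 4**k DNA k-mers in seq.

    Assumption: k defaults to 4 (tetranucleotides); the abstract only
    says the models are k-mer-based.
    """
    seq = seq.upper()
    # Enumerate every possible k-mer over the unambiguous DNA alphabet.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing ambiguous bases (e.g. N) are not in `kmers`
    # and are therefore excluded from the total.
    total = sum(counts[km] for km in kmers)
    return {km: (counts[km] / total if total else 0.0) for km in kmers}
```

A vector like this, computed per contig, could then be passed to any standard supervised classifier; which classifier the thesis used is not stated here.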

    On Computable Protein Functions

    Proteins are biological machines that perform the majority of functions necessary for life. Nature has evolved many different proteins, each of which performs a subset of an organism’s functional repertoire. One aim of biology is to solve the sparse, high-dimensional problem of annotating all proteins with their true functions. Experimental characterisation remains the gold standard for assigning function, but is a major bottleneck due to resource scarcity. In this thesis, we develop a variety of computational methods to predict protein function, reduce the functional search space for proteins, and guide the design of experimental studies. Our methods take two distinct approaches: protein-centric methods that predict the functions of a given protein, and function-centric methods that predict which proteins perform a given function. We applied our methods to help solve a number of open problems in biology. First, we identified new proteins involved in the progression of Alzheimer’s disease using proteomics data of brains from a fly model of the disease. Second, we predicted novel plastic hydrolase enzymes in a large data set of 1.1 billion protein sequences from metagenomes. Finally, we optimised a neural network method that extracts a small number of informative features from protein networks, which we used to predict functions of fission yeast proteins.

    Dynamics of Marine Microbial Metabolism and Physiology at Station Aloha.

    Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017

    Machine-learning-based identification of factors that influence molecular virus-host interactions

    Viruses are the cause of many infectious diseases, including the pandemic diseases acquired immune deficiency syndrome (AIDS) and coronavirus disease 2019 (COVID-19). During the infection cycle, viruses invade host cells and trigger a series of virus-host interactions with different directionality. Some of these interactions disrupt host immune responses or promote the expression of viral proteins and exploitation of the host system, and are thus considered ‘pro-viral’. Other interactions display ‘pro-host’ traits, principally the immune response, to control or inhibit viral replication. Concomitant pro-viral and pro-host molecular interactions on the same host molecule suggest more complex virus-host conflicts and genetic signatures that are crucial to host immunity. In this work, machine-learning-based prediction of virus-host interaction directionality was examined by using data from Human immunodeficiency virus type 1 (HIV-1) infection. Host immune responses to viral infections are mediated by interferons (IFNs) in the initial stage of the immune response to infection. IFNs induce the expression of many IFN-stimulated genes (ISGs), which make the host cell refractory to further infection. We propose that there are many features associated with the up-regulation of human genes in the context of IFN-α stimulation, and that these features make ISGs predictable using machine-learning models. In order to overcome the interference of host immune responses and replicate successfully, viruses adopt multiple strategies to avoid being detected by cellular sensors and to hijack the machinery of host transcription or translation. Here, the strategy of mimicry of host-like short linear motifs (SLiMs) by the virus was investigated by using the example of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The integration of in silico experiments and analyses in this thesis demonstrates an interactive and intimate relationship between viruses and their hosts. 
Findings here contribute to the identification of host dependency and antiviral factors. They are of great importance not only to the ongoing COVID-19 pandemic but also to the understanding of future disease outbreaks

    Bioinformatics approaches to study antibiotics resistance emergence across levels of biological organization.

    The Review on Antimicrobial Resistance predicts that in thirty years infections with antibiotic-resistant microorganisms will become one of the leading causes of death. The discovery of new antibiotics has so far been too slow to ensure continuous use of antibiotics in the face of growing resistance. Therefore, efforts to curb resistance emergence gain in importance. These efforts comprise two complementary strategies. The first focuses on the mechanisms of resistance emergence, in the hope that it would enable development of pharmacological agents constraining resistance emergence. The second aims at improving antibiotic use practices, based on studies of the impact of antibiotics on resistance emergence within patient populations. Antibiotic resistance emerges in bacterial cells, negatively influences the human gut microbiome, and transfers between people. Hence, antibiotic resistance has impacts across several levels of biological organization. This thesis describes four projects, which concerned various aspects of antibiotics resistance. The first two projects deal with basic resistance emergence mechanisms, on the level of bacterial strains and bacterial consortia, whereas the other two deal with finding better practices for antibiotic use on a population level. During the first project, I analyzed changes in genomes of MRSA strains isolated from several patients throughout antibiotic therapies and developing MRSA infections. I observed changes in number and types of virulence factors responsible for interacting with the human body, which are attributed to mobile genetic elements. In the second project, I showed that, prompted by antibiotic therapy, within the human gut microbiome resistance transfers from bacterial genomes onto plasmids, prophages, and free phages. 
Hence, resistance emergence depends not only on the antibiotic therapy but also on the state of the gut microbiome, which again results from the patients’ overall health and previous antibiotic therapies. The third project, SATURN, employed machine learning methods for a large set of data regarding patients’ demographics, comorbidities, antibiotic therapies, surgeries, and colonization with multi-drug resistant bacteria. The final classifiers were made available on the AskSaturn website where the doctors can compare antibiotic therapies based on the probability of colonization with multi-drug resistant bacteria. The fourth project, Tübiom, focused on the antibiotic-influenced gut microbiomes of the healthy population. The first two projects rely on genome and metagenome sequencing data. For them, I designed specialized bioinformatics analysis pipelines. The latter two projects use mixed data, which were analyzed with machine learning algorithms. These projects also involved web development and data visualization. Although each of the projects requires different data and methods, each of them provides a crucial part in a pipeline aiming at utilizing gut microbiome information in medical practice to constrain resistance emergence

    Genomic studies on the impact of host/virus interaction in EBV infection using massively parallel high throughput sequencing

    Epstein-Barr virus is one of the most common viral infections in humans and, once acquired, persists within its host throughout their life. EBV therefore represents an extremely successful virus, having evolved complex strategies to evade the host’s innate and adaptive immune response during both initial and persistent stages of infection. While infection is mostly harmless in the majority of cases, EBV has the ability to be oncogenic in some individuals, and is associated with a wide range of malignancies as well as non-cancerous diseases. To generate new and useful insights into the evolution of EBV interactions with its host, a hybridization-based target enrichment methodology was optimised to enable whole genome sequencing of EBV directly from clinical samples. This allowed the generation of whole genome sequences of EBV directly from blood for the first time. This methodology was subsequently applied to a number of distinct EBV sample collections and the resulting data used to investigate the intra- and inter-host variation in various clinical settings, such as infectious mononucleosis and immunosuppression with chronic EBV infection. Additionally, the number of available whole genomes from East Asia is expanded by eleven (unique) novel genomes from primary infection from an NPC-non-endemic area. These sequences were used for a comparative analysis between NPC- and non-NPC-derived EBV genomes and a number of sites were determined differentiating these two groups. Finally, comparative genomic analyses of world-wide EBV strain diversity were performed using genome sequences generated here in conjunction with a large number of publicly available EBV genome sequences. The comprehensive data sets generated, which included measures of diversity, selection, and linkage, were used to identify potential targets of T cell immunity. 
In addition, the population structure of EBV was analysed to better understand the forces that have shaped the evolution of EBV

    Metabolic engineering approaches reveal widespread physiological functions of membrane lipids for Saccharomyces cerevisiae

    The lipid composition of biological membranes can differ significantly between organisms and even between organelles of the same cell in terms of lipid compounds and specific ratios of lipid classes. Every membrane thus features a characteristic lipid composition that is thought to regulate its physicochemical properties and cellular function by providing lipid environments supporting the integrity of membrane-localized protein machinery and membrane-associated processes. Chapter I gives a brief overview of the interlinkage between the chemical nature of membrane lipids, the structural and functional organization as well as the physicochemical properties of lipid bilayers, and their influence on membrane-embedded proteins. Studies to gain detailed knowledge on how membrane lipid composition influences the physiology of cells and regulates cellular processes require tools to manipulate lipid composition in vivo. By employing metabolic engineering approaches based on titratable gene expression tools, sets of Saccharomyces cerevisiae strains in which membrane lipid composition is under experimental control were engineered. The study described in Chapter II addresses OLE1, encoding the sole fatty acid desaturase of budding yeast, to control the extent of acyl unsaturation of fatty acids incorporated in phospholipids. This approach revealed cellular roles for the physical state of cell membranes, so-called membrane fluidity, in yeast flocculation and hypoxic growth. It is shown how the endogenous lipid homeostasis machinery of budding yeast is adapted to carry out a broad response to oxygen limitation (hypoxia) and how it activates a non-canonical yeast flocculation pathway involving FLO1, which encodes cell wall glycoproteins that mediate cell-cell interactions by binding cell wall mannose residues of adjacent cells. 
In Chapter III, the previously generated strain in which expression of OLE1 is under experimental control was used as a cellular platform to assay the activity of heterologously expressed stearoyl-CoA desaturases (SCDs). Putative SCDs from the human pathogens T. brucei and T. cruzi were functionally expressed in S. cerevisiae, thereby additionally confirming their SCD activity in vivo. The presented assay might also provide a tool to screen for inhibitors of SCDs, which are interesting drug targets in the treatment of bacterial and parasitic infections in humans. The study presented in Chapter IV addresses ERG9, an essential gene involved in the ergosterol biosynthetic pathway, and uses a metabolic engineering approach to achieve control over the total sterol biosynthetic activity of the cell. Cells that allowed for manipulating the native sterol homeostasis were employed to unveil physiological effects of ergosterol and total sterol depletion on the cell’s general viability as well as on fundamental membrane-associated processes such as protein sorting and endo- and exocytosis. By combining this metabolic engineering approach and the powerful method of marker-free CRISPR/Cas9-mediated gene tagging, it was possible to establish a cellular system for investigating the impact of sterol depletion on the lateral distribution pattern of lipid-raft-associated GFP-tagged membrane proteins within the plasma membrane of yeast. Chapter V introduces a novel set of all-in-one constitutive and inducible CRISPR/Cas9 vectors that allow for easy and convenient application of the technology in S. cerevisiae. The simplicity of the inducible system is based on the possibility of introducing a desired gRNA targeting sequence with homologous recombination-mediated assembly of overlapping single-stranded oligonucleotides. 
The inducible Cas9 expression approach also introduces the novel concept of chronologically separating the cloning procedure from the actual genome editing step by preloading cells with an all-in-one CRISPR/Cas9 plasmid. This way, CRISPR/Cas9-supported genome editing can be achieved with high efficiency by simply transforming a desired preloaded target strain with the donor DNA to be genomically integrated, without the need to co-introduce any of the CRISPR system components. These novel CRISPR/Cas9 systems will help to overcome limitations often observed in challenging metabolic and genetic engineering approaches, and can be used, for example, in follow-up studies to reveal physiological roles of membrane lipids in budding yeast.