2,538 research outputs found

    Comparison of alignment software for genome-wide bisulphite sequence data

    Recent advances in next generation sequencing (NGS) technology now provide the opportunity to rapidly interrogate the methylation status of the genome. However, handling and interpreting methylation sequence data is challenging because of its large volume and the consequences of bisulphite modification. We sequenced reduced representation human genomes on the Illumina platform and efficiently mapped and visualized the data with different pipelines and software packages. We examined three pipelines for aligning bisulphite converted sequencing reads and compared their performance. We also comment on pre-processing and quality control of Illumina data. This comparison highlights differences in methods for NGS data processing and provides guidance to advance sequence-based methylation data analysis for molecular biologists.

    Strain-level resolution and pneumococcal carriage dynamics by single-molecule real-time (SMRT) sequencing of the plyNCR marker: a longitudinal study in Swiss infants

    BACKGROUND Pneumococcal carriage has often been studied from a serotype perspective; however, little is known about the strain-specific carriage and inter-strain interactions. Here, we examined the strain-level carriage and co-colonization dynamics of Streptococcus pneumoniae in a Swiss birth cohort by PacBio single-molecule real-time (SMRT) sequencing of the plyNCR marker. METHODS A total of 872 nasal swab (NS) samples were included from 47 healthy infants during the first year of life. Pneumococcal carriage was determined based on the quantitative real-time polymerase chain reaction (qPCR) targeting the lytA gene. The plyNCR marker was amplified from 214 samples having lytA-based carriage for pneumococcal strain resolution. Amplicons were sequenced using SMRT technology, and sequences were analyzed with the DADA2 pipeline. In addition, pneumococcal serotypes were determined using conventional, multiplex PCR (cPCR). RESULTS PCR-based plyNCR amplification demonstrated a 94.2% sensitivity and 100% specificity for Streptococcus pneumoniae when compared with lytA qPCR. The overall carriage prevalence was 63.8%, and pneumococcal co-colonization (≄ 2 plyNCR amplicon sequence variants (ASVs)) was detected in 38/213 (17.8%) sequenced samples with the relative proportion of the least abundant strain(s) ranging from 1.1 to 48.8% (median, 17.2%; IQR, 5.8-33.4%). The median age to first acquisition was 147 days, and having ≄ 2 siblings increased the risk of acquisition. CONCLUSION The plyNCR amplicon sequencing is species-specific and enables pneumococcal strain resolution. We therefore recommend its application for longitudinal strain-level carriage studies of Streptococcus pneumoniae.
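    The proportions reported in this abstract are simple ratios of counts. The following is a minimal sketch of how such figures are typically computed, not the authors' analysis code. The function names are illustrative, and the counts passed to `sensitivity` and `specificity` are hypothetical placeholders, because the abstract does not give the underlying confusion matrix. Only the 38/213 co-colonization ratio comes from the text.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of reference-positive samples that the test also calls positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of reference-negative samples that the test also calls negative."""
    return true_neg / (true_neg + false_pos)


def percent(numerator: int, denominator: int, digits: int = 1) -> float:
    """Express a ratio as a percentage rounded to the given number of digits."""
    return round(100 * numerator / denominator, digits)


# Co-colonization share, using the counts stated in the abstract (38 of 213 samples).
cocolonization_pct = percent(38, 213)

# Hypothetical counts, for illustration only (not taken from the study).
example_sensitivity = sensitivity(true_pos=9, false_neg=1)
example_specificity = specificity(true_neg=5, false_pos=0)
```

    Here the reference standard is assumed to be the lytA qPCR result, as stated in the abstract.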

    Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space?

    The organization and mining of malaria genomic and post-genomic data is strongly motivated by the need to predict and characterize new biological targets and new drugs. Biological targets are sought in a biological space built from the genomic data of Plasmodium falciparum, but also using the millions of genomic data from other species. Drug candidates are sought in a chemical space containing the millions of small molecules stored in public and private chemolibraries. Data management should therefore be as reliable and versatile as possible. In this context, we examined five aspects of the organization and mining of malaria genomic and post-genomic data: 1) the comparison of protein sequences, including compositionally atypical malaria sequences, 2) the high-throughput reconstruction of molecular phylogenies, 3) the representation of biological processes, particularly metabolic pathways, 4) the versatile methods to integrate genomic data, biological representations and functional profiling obtained from X-omic experiments after drug treatments, and 5) the determination and prediction of protein structures and their molecular docking with drug candidate structures. Progress toward a grid-enabled chemogenomic knowledge space is discussed. Comment: 43 pages, 4 figures, to appear in Malaria Journa

    Development of an Optimal Bioinformatics System for the Analysis of Metagenome Data Generated by Next-Generation Sequencing Instruments

    Doctoral dissertation, Seoul National University Graduate School, Interdisciplinary Program in Bioinformatics, February 2014. ìȜ종식 (Chun Jongsik). A metagenome is the total DNA extracted directly from an environment, and the purpose of metagenomics is to reveal the function of the metagenome as well as its taxonomic structure. There are two analysis approaches for metagenomics, namely the amplicon-based approach and the random shotgun-based approach. Both approaches require large numbers of sequencing reads, which Sanger sequencing could not supply. However, high-throughput sequencing at relatively low cost by Next Generation Sequencing (NGS) technologies meets the requirements of metagenomics. In addition, the advent of NGS technologies gave rise to the development of the bioinformatic algorithms necessary for processing these large and complex sequencing data. Consequently, the large amount of sequencing data obtained from NGS, together with suitable bioinformatic algorithms, has made metagenomics an essential tool for microbiology. However, limitations caused by NGS sequencing errors, short read length, and the lack of an analysis system still hinder accurate metagenome analysis. Therefore, evaluation of currently used NGS error handling algorithms and development of a systematic pipeline with more efficient algorithms are required to improve the accuracy of analysis. In this study, bioinformatic pipelines were constructed for both metagenome analysis approaches. The pipelines were designed to improve the accuracy of the final result by minimizing the effect of errors and short read length. For amplicon-based metagenomics, two different analysis pipelines were developed, one for 454 pyrosequencing and one for Illumina MiSeq. During the construction of the 454 pyrosequencing pipeline, a new error handling algorithm was developed to treat homopolymer and PCR errors. Upon completion of the pipeline construction, a household microbial community was analyzed using 454 pyrosequencing data as a case study. 
As for Illumina MiSeq data, the most appropriate sequencing conditions and sequencing target region were determined. Paired-end merging programs were evaluated, and the correlation between sequencing errors and quality was studied to correct the errors within 3 overlap regions. A novel iterative consensus clustering method was developed to correct the errors occurring ubiquitously in a single read. For the shotgun metagenomics approach, a bioinformatic analysis system for Illumina MiSeq paired-end data was constructed. Unlike the targeted amplicon sequencing reads, most of the shotgun sequencing reads are not merged; thus short reads are used for both functional and taxonomic profiling. However, a short read carries less information than longer contigs, so the use of short reads is likely to cause biased characterization of the metagenome. Therefore, the development of the analysis system focused on creating longer contigs by means of mapping and de novo assembly. For raw read mapping, a dynamic mapping genome set construction method was developed. A list of mapping genomes was selected from the taxonomic profile inferred from the ribosomal RNA profiles. The genome sequences of the selected genomes were downloaded from Ezbiocloud. By mapping raw reads to the genome sequences, longer contigs can be obtained for a relatively simple metagenome such as fecal matter. However, for complex metagenomes such as soil samples, neither mapping nor de novo assembly performed properly because of insufficient sequencing coverage and the large number of uncultured microorganisms in the metagenome. In addition to the pipeline construction, visualization tools were also developed to display the resulting taxonomic and functional profiles at the same time. A newly developed Java-based standalone sequence alignment editing application was named EzEditor. 
As both conserved functional coding sequences and the 16S rRNA gene have been used extensively in bacterial molecular phylogenetics, codon-based sequence alignment editing functions are required for the coding genes. EzEditor provides a simultaneous DNA and protein sequence alignment editing interface, which enables robust sequence alignment for both protein and rRNA sequences. EzEditor can be applied to various analyses involving molecular sequences, not only as a basic sequence editor but also for phylogenetic applications. Contents: Chapter 1, General Introduction (1.1 Bioinformatics; 1.2 Next Generation Sequencing; 1.3 Metagenomics; 1.4 Objectives of This Study). Chapter 2, Amplicon-based Metagenome Analysis Systems (2.1 Introduction; 2.2 Analysis System for 454 Pyrosequencing; 2.3 Analysis System for Illumina MiSeq; 2.4 Summary and Discussion). Chapter 3, Shotgun-based Metagenome Analysis System (3.1 Introduction, including 3.1.1 Tools for Metagenomics; 3.2 Methods; 3.3 Results; 3.4 Summary and Discussion). Chapter 4, EzEditor: A Versatile Molecular Sequence Editor for Both Ribosomal RNA and Protein Coding Genes (4.1 Overview; 4.2 Features of EzEditor, including 4.2.1 Algorithms and Models Implemented in EzEditor and 4.2.2 Miscellaneous Functions; 4.3 Summary and Discussion). Conclusions. References. Appendix I, Estimated Diversity Index of Household Microbiome. Abstract in Korean.

    The BioLighthouse: Reusable Software Design for Bioinformatics

    Advances in next-generation sequencing have accelerated the field of microbiology by making accessible a wealth of information about microbiomes. Unfortunately, microbiome experiments are among the least reproducible in terms of bioinformatics. Software tools are often poorly documented, under-maintained, and commonly have arcane dependencies requiring significant time investment to configure them correctly. Microbiome studies are multidisciplinary efforts but communication and knowledge discrepancies make accessibility, reproducibility, and transparency of computational workflows difficult. The BioLighthouse uses Ansible roles, playbooks, and modules to automate configuration and execution of bioinformatics workflows. The roles and playbooks act as virtual laboratory notebooks by documenting the provenance of a bioinformatics workflow. The BioLighthouse was tested for platform dependence and data-scale dependence with a microbial profiling pipeline. The microbial profiling pipeline consisted of Cutadapt, FLASH2, and DADA2. The pipeline was tested on 3 canola root and soil microbiome datasets with differing orders of magnitude of data: 1 sample, 10 samples, and 100 samples. Each dataset was processed by The BioLighthouse with 10 unique parameter sets and outputs were compared across 8 computing environments for a total of 240 pipeline runs. Outputs after each step in the pipeline were tested for identity using the Linux diff command to ensure reproducible results. Testing of The BioLighthouse suggested no platform or data-scale dependence. To provide an easy way of maintaining environment reproducibility in user-space, Conda and the channel Bioconda were used for virtual environments and software dependencies for configuring bioinformatics tools. 
The BioLighthouse provides a framework for developers to make their tools accessible to the research community, for bioinformaticians to build bioinformatics workflows, and for the broader research community to consume these tools at a high level while knowing the tools will execute as intended.
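    The test design described in this abstract (3 datasets, 10 parameter sets, 8 computing environments, outputs checked for identity) can be illustrated with a short sketch. This is not The BioLighthouse's own code: the dataset, parameter-set, and environment labels are hypothetical placeholders, and a SHA-256 digest comparison is used here in place of the Linux diff command mentioned in the abstract.

```python
import hashlib
from itertools import product

# Hypothetical labels; the abstract gives only the counts (3 x 10 x 8).
datasets = ["1_sample", "10_samples", "100_samples"]
parameter_sets = [f"params_{i}" for i in range(1, 11)]
environments = [f"env_{i}" for i in range(1, 9)]

# Every combination of dataset, parameter set, and environment is one pipeline run.
runs = list(product(datasets, parameter_sets, environments))


def file_digest(path: str) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def outputs_identical(paths: list[str]) -> bool:
    """True if all files have byte-identical content (same digest)."""
    return len({file_digest(p) for p in paths}) <= 1
```

    In use, `outputs_identical` would be called once per pipeline step with the paths of the same output file produced in each environment. `len(runs)` reproduces the total of 240 pipeline runs stated in the abstract.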

    Assessment of Next Generation Sequencing Technologies for De novo and Hybrid Assemblies of Challenging Bacterial Genomes

    In the past decade, tremendous progress has been made in DNA sequencing methodologies in terms of throughput, speed, and read length, along with a sharp decrease in per-base cost. These technologies, commonly referred to as next-generation sequencing (NGS), are complemented by the development of hybrid assembly approaches that can utilize multiple NGS platforms. In the first part of my dissertation I performed systematic evaluations and optimizations of nine de novo and hybrid assembly protocols across four novel microbial genomes. While each had strengths and weaknesses, optimization using multiple strategies yielded dramatic improvements in overall assembly size and quality. To select the best assembly, I also proposed a novel rDNA operon validation approach to evaluate assembly accuracy. Additionally, I investigated the capability of the third-generation PacBio sequencing platform and achieved automated finishing of Clostridium autoethanogenum without any accessory data. These complete genome sequences facilitated comparisons which revealed rDNA operons as a major limitation for short-read technologies, and also enabled comparative and functional genomics analysis. To facilitate future assessment and algorithm development for NGS technologies, we publicly released the sequence datasets for C. autoethanogenum, which span three generations of sequencing technologies and contain six types of data from four NGS platforms. To assess limitations of NGS technologies, unassembled regions within Illumina and PacBio assemblies were assessed using eight microbial genomes. This analysis confirmed rDNA operons as major breakpoints within Illumina assemblies, while gaps within PacBio assemblies appear to be an unaccounted-for event, and assembly quality is a cumulative effect of read depth, read quality, sample DNA quality, and the presence of phage DNA or mobile genetic elements. 
In a final collaborative study, an enrichment protocol was applied to isolate live endophytic bacteria from roots of the tree Populus deltoides. This protocol achieved a significant reduction in contaminating plant DNA and enabled the use of these samples for single-cell genomics analysis for the first time. Whole-genome sequencing of selected single-cell genomes was performed, assembly and contamination removal were optimized, and bioinformatic, phylogenetic, and comparative genomics analyses followed to identify unique characteristics of these uncultured microorganisms.

    Novel computational methods for studying the role and interactions of transcription factors in gene regulation

    Regulation of which genes are expressed, and when, enables the existence of different cell types sharing the same genetic code in their DNA. Erroneously functioning gene regulation can lead to diseases such as cancer. Gene regulatory programs can malfunction in several ways. Often, if a disease is caused by a defective protein, the cause is a mutation in the gene coding for the protein that renders the protein unable to perform its functions properly. However, protein-coding genes make up only about 1.5% of the human genome, and the majority of all disease-associated mutations discovered reside outside protein-coding genes. The mechanisms of action of these non-coding disease-associated mutations are far less completely understood. Binding of transcription factors (TFs) to DNA controls the rate of transcribing genetic information from the coding DNA sequence to RNA. Binding affinities of TFs to DNA have been extensively measured in vitro with methods such as SELEX (systematic evolution of ligands by exponential enrichment) and Protein Binding Microarrays (PBMs), and the genome-wide binding locations and patterns of TFs have been mapped in dozens of cell types. Despite this, our understanding of how TF binding to regulatory regions of the genome, promoters and enhancers, leads to gene expression is not at the level where gene expression could be reliably predicted based on DNA sequence only. In this work, we develop and apply computational tools to analyze and model the effects of TF-DNA binding. We also develop new methods for interpreting and understanding deep learning-based models trained on biological sequence data. In biological applications, the ability to understand how machine learning models make predictions is as important as, or even more important than, raw predictive performance. This has created a demand for approaches helping researchers extract biologically meaningful information from deep learning model predictions. 
We develop a novel computational method for determining TF binding sites genome-wide from recently developed high-resolution ChIP-exo and ChIP-nexus experiments. We demonstrate that our method performs similarly to or better than previously published methods while making fewer assumptions about the data. We also describe an improved algorithm for calling allele-specific TF-DNA binding. We utilize deep learning methods to learn features predicting transcriptional activity of human promoters and enhancers. The deep learning models are trained on massively parallel reporter gene assay (MPRA) data from human genomic regulatory elements, designed regulatory elements, and promoters and enhancers selected from a totally random pool of synthetic input DNA. This unprecedentedly large set of measurements of human gene regulatory element activities, in total more than 100 times the size of the human genome, allowed us to train models that were able to predict genomic transcription start site positions more accurately than models trained on genomic promoters, and to correctly predict effects of disease-associated promoter variants. We also found that interactions between promoters and local classical enhancers are non-specific in nature. The MPRA data integrated with extensive epigenetic measurements supports the existence of three different classes of enhancers: classical enhancers, closed chromatin enhancers and chromatin-dependent enhancers. We also show that TFs can be divided into four different, non-exclusive classes based on their activities: chromatin opening, enhancing, promoting and TSS determining TFs. Interpreting the deep learning models of human gene regulatory elements required application of several existing model interpretation tools as well as developing new approaches. Here, we describe two new methods for visualizing features and interactions learned by deep learning models. 
Firstly, we describe an algorithm for testing whether a deep learning model has learned an existing binding motif of a TF. Secondly, we visualize mutual information between pairwise k-mer distributions in sample inputs selected according to predictions by a machine learning model. This method highlights pairwise and positional dependencies learned by a machine learning model. We demonstrate the use of this model-agnostic approach with classification and regression models trained on DNA, RNA and amino acid sequences.
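The pairwise mutual information idea mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function name and parameters are hypothetical, the input sequences are assumed to have already been selected according to a model's predictions (that selection step is not shown), and mutual information is computed in bits between the k-mers observed at two fixed positions.

```python
import math
from collections import Counter


def positional_mutual_information(seqs, pos_a, pos_b, k=1):
    """Mutual information (bits) between the k-mers starting at pos_a and pos_b.

    Sequences too short to contain both k-mers are skipped.
    """
    pairs = [
        (s[pos_a:pos_a + k], s[pos_b:pos_b + k])
        for s in seqs
        if len(s) >= max(pos_a, pos_b) + k
    ]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)                 # counts of (k-mer at a, k-mer at b)
    left = Counter(a for a, _ in pairs)    # marginal counts at position a
    right = Counter(b for _, b in pairs)   # marginal counts at position b
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) / (p(a) * p(b)) simplifies to count * n / (left[a] * right[b]).
        mi += (count / n) * math.log2(count * n / (left[a] * right[b]))
    return mi


# Toy inputs: in the first set the two positions always agree,
# in the second they vary independently.
dependent = ["AA", "CC", "AA", "CC"]
independent = ["AA", "AC", "CA", "CC"]
mi_dependent = positional_mutual_information(dependent, 0, 1)
mi_independent = positional_mutual_information(independent, 0, 1)
```

Evaluating this function over all position pairs would give a matrix that can be drawn as a heat map, which is one plausible way to visualize the positional dependencies the abstract refers to.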
    • 

    corecore