
    The application of long‑read sequencing in clinical settings

    Long-read DNA sequencing technologies have evolved rapidly in recent years, and their ability to assess large and complex regions of the genome makes them ideal for clinical applications in molecular diagnosis and therapy selection, providing a valuable tool for precision medicine. Within the third-generation sequencing duopoly, Oxford Nanopore Technologies and Pacific Biosciences work towards increasing the accuracy, throughput, and portability of long-read sequencing methods while trying to keep costs low. These trade-offs have made long-read sequencing an attractive tool for use in research and clinical settings. This article provides an overview of current clinical applications and limitations of long-read sequencing and explores its potential for point-of-care testing and health care in remote settings.

    Designing a Robust and Portable Workflow for Detecting Genetic Variants Associated with Molecular Phenotypes Across Multiple Studies

    Quantitative trait locus (QTL) analysis links variation in molecular phenotype expression levels to genotype variation, and has become standard practice for understanding the molecular mechanisms underlying complex traits and diseases. A typical QTL analysis consists of multiple steps. Although a diverse set of tools is available to perform these individual analyses, the tools have so far not been integrated into a reproducible and scalable workflow that is easy to use across a wide range of computational environments. Our analysis workflow consists of three modules: the analysis starts with quantification of the phenotype of interest, proceeds with normalisation and quality control, and finishes with the QTL analysis. For the phenotype quantification and QTL mapping modules we developed pipelines following the best practices of the nf-core framework. The pipelines are containerised, open-source, extensible, and can be executed in parallel in a variety of computational environments. For the quality control module we developed a script that automatically computes quality metrics and reports them to the user. As a proof of concept, we uniformly processed more than 40 context-specific groups from more than 15 studies and discovered at least one significant eQTL for more than 9000 genes. We believe that adopting our pipelines will increase the reproducibility, portability, and robustness of QTL analysis in comparison to existing approaches.
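    The core association step of the workflow above — regressing a molecular phenotype on genotype dosage — can be sketched as follows (a minimal single-variant illustration with made-up numbers; the actual pipelines use dedicated QTL mapping software with covariates and multiple-testing control):

```python
from statistics import mean

def qtl_association(genotypes, phenotypes):
    """Simple linear regression of a molecular phenotype on genotype
    dosage (0/1/2 copies of the alternate allele).
    Returns (slope, r_squared). Real QTL mapping adds covariates,
    permutation testing, and false-discovery-rate control."""
    gm, pm = mean(genotypes), mean(phenotypes)
    sxx = sum((g - gm) ** 2 for g in genotypes)
    sxy = sum((g - gm) * (p - pm) for g, p in zip(genotypes, phenotypes))
    syy = sum((p - pm) ** 2 for p in phenotypes)
    slope = sxy / sxx
    r_squared = sxy * sxy / (sxx * syy)
    return slope, r_squared

# Toy cohort: expression rises with alternate-allele dosage.
slope, r2 = qtl_association([0, 0, 1, 1, 2, 2],
                            [1.0, 1.2, 2.1, 1.9, 3.0, 3.2])
```

    A strong slope with high r² at a variant-gene pair is what the pipeline would report as a candidate eQTL.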

    Nanopore Sequencing Technology and Tools for Genome Assembly: Computational Analysis of the Current State, Bottlenecks and Future Directions

    Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, the high error rates of the technology pose a challenge for generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they must overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages, and performance bottlenecks. It is important to understand where the current tools fall short in order to develop better ones. To this end, we 1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and 2) provide guidelines for determining the appropriate tools for each step. We analyze various combinations of different tools and expose the tradeoffs between accuracy, performance, memory usage, and scalability. We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of the bottlenecks we have identified, developers can improve the current tools or build new ones that are both accurate and fast, in order to overcome the high error rates of the nanopore sequencing technology.
    Comment: To appear in Briefings in Bioinformatics (BIB), 201
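    How coverage lets a pipeline overcome high per-read error rates can be illustrated with a toy majority-vote consensus over a read pileup (a simplified sketch; real assemblers and polishers work on alignment graphs with quality scores, not pre-aligned strings):

```python
from collections import Counter

def consensus(aligned_reads):
    """Majority-vote consensus over a pileup of error-prone reads.
    Each read is a gapped string of equal length ('-' = deletion);
    random per-read errors are outvoted when coverage is sufficient."""
    result = []
    for column in zip(*aligned_reads):
        base, _ = Counter(column).most_common(1)[0]
        if base != '-':
            result.append(base)
    return ''.join(result)

# Three noisy copies of the same sequence; each read errs at a
# different position, so the column-wise majority recovers the truth.
reads = ["ACGTAC-A", "ACCTACGA", "ACGTACGA"]
```

    With only three reads a single shared error would survive; deeper coverage makes that increasingly unlikely, which is why accuracy improves with depth.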

    Investigation of de novo mutations in human genomes using whole genome sequencing data

    De novo mutations (DNMs) are novel mutations that occur for the first time in an offspring and are not inherited from the parents. High-throughput sequencing (HTS) technologies such as whole genome sequencing (WGS) and whole exome sequencing (WES) of trios have allowed the investigation of DNMs and their role in diseases. DNMs are now known to contribute substantially to both rare monogenic and common complex disorders. Identification of DNMs from WGS is challenging, since the error rates in HTS data are much higher than the expected DNM rate. To facilitate the evaluation of existing DNM callers and the development of new ones, I developed TrioSim, the first automated tool to generate simulated WGS datasets for trios, with a feature to spike in DNMs in the offspring's WGS data. Several computational methods have been developed to call DNMs from HTS data. I performed the first systematic evaluation of current DNM callers for WGS trio data using real and simulated trio datasets and found that DNM callers have high sensitivity and can detect the majority of true DNMs. However, they suffer from very low specificity, with thousands of false positive calls made by each caller. To address this, I developed MetaDeNovo, a consensus-based ensemble computational method to call DNMs using cloud-based technologies. MetaDeNovo is a fully automated methodology that utilises existing DNM callers and integrates their results. It demonstrates much higher specificity than all other callers while maintaining high sensitivity. Congenital heart disease (CHD) is the most common birth disorder worldwide, and DNMs have been found to contribute to its causation. Most CHD cases are sporadic, suggesting a role for DNMs in a large proportion of them. I applied MetaDeNovo to detect DNMs in a WGS dataset of CHD trios to aid genetic variant prioritisation. MetaDeNovo can dramatically reduce the number of false positive DNMs compared to individual DNM callers. This has improved current practice for identifying the genetic causes of disease in such cohorts, and MetaDeNovo is applicable to trio WGS datasets of other genetic diseases. This thesis has contributed new knowledge through in-depth exploration of existing DNM callers, development of a novel tool (TrioSim) to simulate trio WGS data, and an improved, automated ensemble tool (MetaDeNovo) to identify DNMs with high specificity. MetaDeNovo demonstrates its use in identifying disease-causing mutations in a trio analysis using WGS.
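    The consensus idea behind an ensemble caller of this kind can be sketched as a voting scheme over the individual callers' outputs (a minimal illustration, not MetaDeNovo's actual integration logic; the variant tuples and threshold are hypothetical):

```python
from collections import Counter

def ensemble_calls(caller_outputs, min_callers=2):
    """Keep DNM calls supported by at least `min_callers` of the
    individual callers: caller-specific false positives are dropped,
    while true calls shared across callers are retained."""
    votes = Counter(v for calls in caller_outputs for v in set(calls))
    return {v for v, n in votes.items() if n >= min_callers}

# Variants as (chrom, pos, ref, alt); only the chr1 and chr2 calls
# are supported by two or more callers.
caller_a = {("chr1", 12345, "A", "G"), ("chr2", 999, "C", "T")}
caller_b = {("chr1", 12345, "A", "G"), ("chr3", 42, "G", "A")}
caller_c = {("chr1", 12345, "A", "G"), ("chr2", 999, "C", "T")}
consensus_dnms = ensemble_calls([caller_a, caller_b, caller_c])
```

    Because each caller's false positives are largely its own, requiring agreement raises specificity while calls found by most callers (the true DNMs) still pass, preserving sensitivity.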

    Comprehensive outline of whole exome sequencing data analysis tools available in clinical oncology

    Whole exome sequencing (WES) enables the analysis of all protein-coding sequences in the human genome. This technology enables the investigation of cancer-related genetic aberrations, which are predominantly located in exonic regions. WES delivers high-throughput results at a reasonable price. Here, we review analysis tools enabling the utilization of WES data in clinical and research settings. Technically, WES initially allows the detection of single nucleotide variants (SNVs) and copy number variations (CNVs), and data obtained through these methods can be combined and further utilized. Variant calling algorithms for SNVs range from standalone tools to machine learning-based combined pipelines. Tools for CNV detection compare the number of reads aligned to a dedicated segment. Both SNVs and CNVs help to identify mutations resulting in pharmacologically druggable alterations. The identification of homologous recombination deficiency enables the use of PARP inhibitors. Determining microsatellite instability and tumor mutation burden helps to select patients eligible for immunotherapy. To pave the way for clinical applications, we have to recognize some limitations of WES, including its restricted ability to detect CNVs, low coverage compared to targeted sequencing, and the lack of consensus regarding references and minimal application requirements. Recently, Galaxy became the leading platform for non-command-line-based WES data processing. The maturation of next-generation sequencing is reinforced by Food and Drug Administration (FDA)-approved methods for cancer screening, detection, and follow-up. WES is on the verge of becoming an affordable and sufficiently evolved technology for everyday clinical use. © 2019 by the authors. Licensee MDPI, Basel, Switzerland.
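    The read-depth comparison behind CNV detection can be sketched as follows (a simplified illustration with hypothetical depths; clinical tools additionally normalise for library size and GC content, segment adjacent bins, and apply statistical calling):

```python
import math

def cnv_log2_ratios(tumor_depth, normal_depth):
    """Per-segment log2 ratio of tumor vs. matched-normal read depth:
    ~0 means copy-neutral, positive values suggest a gain, negative
    values a loss. Depths are assumed to be already normalised."""
    return [math.log2(t / n) for t, n in zip(tumor_depth, normal_depth)]

# Middle segment has twice the tumor coverage -> log2 ratio of 1.0,
# consistent with a copy-number gain in that segment.
ratios = cnv_log2_ratios([100, 200, 100], [100, 100, 100])
```

    This is exactly the "compare the number of reads aligned to a dedicated segment" step described above, reduced to its arithmetic core.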

    ProkEvo: an automated, reproducible, and scalable framework for high-throughput bacterial population genomics analyses

    Whole genome sequence (WGS) data from bacterial species are used for a variety of applications, ranging from basic microbiological research to diagnostics and epidemiological surveillance. The availability of WGS data from hundreds of thousands of isolates of individual microbial species poses a tremendous opportunity for discovery and hypothesis-generating research into the ecology and evolution of these microorganisms. The limited flexibility, scalability, and user-friendliness of existing pipelines, however, restrict systematic, population-scale inquiry. Here, we present ProkEvo, an automated, scalable, reproducible, and open-source framework for bacterial population genomics analyses using WGS data. ProkEvo was specifically developed to achieve the following goals: (1) automation and scaling of complex combinations of computational analyses for many thousands of bacterial genomes from inputs of raw Illumina paired-end sequence reads; (2) use of workflow management systems (WMS) such as Pegasus WMS to ensure reproducibility, scalability, modularity, fault-tolerance, and robust file management throughout the process; (3) use of high-performance and high-throughput computational platforms; (4) generation of hierarchical population structure analyses based on combinations of multi-locus and Bayesian statistical approaches to classification for ecological and epidemiological inquiries; (5) association of antimicrobial resistance (AMR) genes, putative virulence factors, and plasmids from curated databases with the hierarchically related genotypic classifications; and (6) production of pan-genome annotations and data compilations that can be utilized for downstream analyses such as the identification of population-specific genomic signatures. The scalability of ProkEvo was measured with two datasets comprising significantly different numbers of input genomes (one with ~2,400 genomes, and the second with ~23,000 genomes). Depending on the dataset and the computational platform used, the running time of ProkEvo varied from ~3 to 26 days. ProkEvo can be used with virtually any bacterial species, and the Pegasus WMS uniquely facilitates the addition or removal of programs from the workflow or the modification of options within them. To demonstrate the versatility of the ProkEvo platform, we performed hierarchical population structure analyses on available genomes of three distinct pathogenic bacterial species as individual case studies. These case studies illustrate how hierarchical analyses of population structure, genotype frequencies, and the distribution of specific gene functions can be integrated into an analysis. Collectively, our study shows that ProkEvo presents a practical, viable option for scalable, automated analyses of bacterial populations, with direct applications for basic microbiology research, clinical microbiological diagnostics, and epidemiological surveillance.
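    Goal (5) — associating AMR genes with genotypic classifications — amounts, at its simplest, to tallying gene frequencies per hierarchical group. A minimal sketch with hypothetical sequence types and gene hits (not ProkEvo's implementation, which draws on curated databases and the full workflow):

```python
from collections import defaultdict

def amr_frequency_by_genotype(isolates):
    """Fraction of isolates in each genotypic group (e.g. an MLST
    sequence type or BAPS cluster) carrying each AMR gene, computed
    from per-isolate gene hits."""
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for genotype, genes in isolates:
        totals[genotype] += 1
        for gene in set(genes):
            counts[genotype][gene] += 1
    return {gt: {g: n / totals[gt] for g, n in group.items()}
            for gt, group in counts.items()}

# Hypothetical isolates as (sequence type, AMR gene hits).
isolates = [("ST19", ["tet(A)", "blaTEM-1"]),
            ("ST19", ["tet(A)"]),
            ("ST34", ["blaTEM-1"])]
freqs = amr_frequency_by_genotype(isolates)
```

    Frequencies stratified this way are what let hierarchical genotypes be screened for lineage-specific resistance signatures.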

    Comprehensive analysis of methylation data in non-model plant species

    One of the goals of plant epigenetics is detecting differential methylation that may occur following specific treatments or in variable environments. This can be achieved at single-base resolution with standard methods for whole-genome bisulfite sequencing (WGBS) and reduced representation bisulfite sequencing (RRBS). Another important goal is to exploit sequencing methods in combination with bisulfite treatment to associate genetics and epigenetics with phenotypic traits. In the past 19 years, this has become possible using so-called genome-wide association studies (GWAS) and epigenome-wide association studies (EWAS), the latter of which aim to reveal potential biomarkers linking phenotypic traits and epigenetic variation. In practice, such studies rely on software packages or “bioinformatics pipelines” which make the requisite computational processes routine and reliable. This thesis describes several such pipelines, developed within the framework of EpiDiverse, an Innovative Training Network (ITN) (https://epidiverse.eu/, accessed on 1 May 2021) carrying out comprehensive studies on pipelines for WGBS, differentially methylated region (DMR), EWAS, and single nucleotide polymorphism (SNP) analyses. Here I introduce the benchmark study of DMR tools, the EWAS pipeline, and the bioinformatics pipelines implemented within the EpiDiverse toolkit. First, by analysing simulated datasets with seven different DMR tools (metilene, methylKit, MOABS, DMRcate, Defiant, BSmooth, MethylSig) and four plant species (Aethionema arabicum, Arabidopsis thaliana, Picea abies, and Physcomitrium patens), together with the coauthors, we showed that metilene has superior performance in terms of overall precision and recall. We therefore set it as the default DMR caller in the EpiDiverse DMR pipeline. Afterwards, I introduce extended features of the EWAS pipeline beyond the GEM R package, e.g., graphical outputs, novel missing-data imputation, and compatibility with new input types. I then examine the effect of missing data using Picea abies (Norway spruce) data and show that the pipeline performs sensible missing-data imputation. Furthermore, I obtained a significant overlap between the pipeline's results and those of a Quercus lobata (valley oak) analysis. Through extensive benchmarking with various tools, a group of pipelines became publicly available, whereby the EpiDiverse toolkit suits people working with WGBS datasets (https://github.com/EpiDiverse, accessed on 1 May 2021).
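    The core of a DMR call — comparing methylation levels between groups across the cytosines of a region — can be sketched as follows (a naive threshold illustration with made-up levels; metilene and the other benchmarked tools use proper segmentation and statistical testing):

```python
from statistics import mean

def is_dmr(control, treatment, min_diff=0.25, min_sites=5):
    """Naive differentially-methylated-region call: given per-cytosine
    methylation levels (0..1) for two groups within a region, flag the
    region if it has enough covered sites and a large mean change."""
    if min(len(control), len(treatment)) < min_sites:
        return False
    return abs(mean(treatment) - mean(control)) >= min_diff

# A strongly hypomethylated region after treatment -> called as a DMR.
hypo = is_dmr([0.9, 0.8, 0.85, 0.9, 0.95],
              [0.2, 0.3, 0.25, 0.2, 0.1])
```

    Benchmarking tools against simulated data, as in the thesis, amounts to checking how often calls like this match the regions where differential methylation was actually planted.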

    A Toolkit for bulk PCR-based marker design from next-generation sequence data: application for development of a framework linkage map in bulb onion (Allium cepa L.)

    BACKGROUND: Although modern sequencing technologies permit the ready detection of numerous DNA sequence variants in any organism, converting such information into PCR-based genetic markers is hampered by a lack of simple, scalable tools. Onion is an example of an under-researched crop with a complex, heterozygous genome, where genome-based research has previously been hindered by limited sequence resources and genetic markers. RESULTS: We report the development of generic tools for large-scale web-based PCR marker design in the Galaxy bioinformatics framework, and their application for the development of next-generation genetics resources in a wide cross of bulb onion (Allium cepa L.). Transcriptome sequence resources were developed for the homozygous doubled-haploid bulb onion line ‘CUDH2150’ and the genetically distant Indian landrace ‘Nasik Red’, using 454™ sequencing of normalised cDNA libraries of leaf and shoot. Mapping of ‘Nasik Red’ reads onto ‘CUDH2150’ assemblies revealed 16836 indel and SNP polymorphisms, which were mined for portable PCR-based marker development. Tools for the detection of restriction polymorphisms and for primer set design were developed in BioPython and adapted for use in the Galaxy workflow environment, enabling large-scale and targeted assay design. Using PCR-based markers designed with these tools, a framework genetic linkage map of over 800 cM spanning all chromosomes was developed in a subset of 93 F(2) progeny from a very large F(2) family derived from the ‘Nasik Red’ x ‘CUDH2150’ inter-cross. The utility of the tools and genetic resources developed was tested by designing markers to transcription factor-like polymorphic sequences. Bin mapping these markers using a subset of 10 progeny confirmed the ability to place markers within 10 cM bins, enabling increased efficiency in marker assignment and targeted map refinement. The major genetic loci conditioning red bulb colour (R) and fructan content (Frc) were located on this map by QTL analysis. CONCLUSIONS: The generic tools developed for the Galaxy environment enable rapid development of sets of PCR assays targeting sequence variants identified from Illumina and 454 sequence data. They enable non-specialist users to validate and exploit large volumes of next-generation sequence data using basic equipment.
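    The restriction-polymorphism detection underlying such PCR-scorable (CAPS-style) markers can be sketched as follows (a minimal illustration with a toy sequence; the published tools additionally handle many enzymes, degenerate sites, and flanking primer design):

```python
def caps_candidate(ref_seq, pos, alt_base, site):
    """Check whether a SNP at index `pos` creates or abolishes a
    restriction enzyme recognition `site` (e.g. GAATTC for EcoRI),
    making the variant scorable by PCR amplification + digestion."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return (site in ref_seq) != (site in alt_seq)

# A G->A change at index 4 completes an EcoRI site (GAATTC) in this
# toy amplicon, so the two alleles digest differently.
ref = "TTGAGTTCAA"
```

    A SNP that changes digestion status between the two parental alleles is exactly what can be genotyped cheaply on a gel with basic equipment, as the abstract concludes.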