4,149 research outputs found

    Interpreting viral deep sequencing data with GLUE

    Get PDF
    Using deep sequencing technologies such as Illumina’s platform, it is possible to obtain reads from the viral RNA population revealing the viral genome diversity within a single host. A range of software tools and pipelines can transform raw deep sequencing reads into Sequence Alignment Mapping (SAM) files. We propose that interpretation tools should process these SAM files, directly translating individual reads to amino acids in order to extract statistics of interest such as the proportion of different amino acid residues at specific sites. This preserves per-read linkage between nucleotide variants at different positions within a codon location. The samReporter is a subsystem of the GLUE software toolkit which follows this direct read translation approach in its processing of SAM files. We test samReporter on a deep sequencing dataset obtained from a cohort of 241 UK HCV patients for whom prior treatment with direct-acting antivirals has failed; deep sequencing and resistance testing have been suggested to be of clinical use in this context. We compared the polymorphism interpretation results of the samReporter against an approach that does not preserve per-read linkage. We found that the samReporter was able to properly interpret the sequence data at resistance-associated locations in nine patients where the alternative approach was equivocal. In three cases, the samReporter confirmed that resistance or an atypical substitution was present at NS5A position 30. In three further cases, it confirmed that the sofosbuvir-resistant NS5B substitution S282T was absent. This suggests the direct read translation approach implemented is of value for interpreting viral deep sequencing data

    GenoMetric Query Language: A novel approach to large-scale genomic data management

    Get PDF
    Motivation: Improvement of sequencing technologies and data processing pipelines is rapidly providing sequencing data, with associated high-level features, of many individual genomes in multiple biological and clinical conditions. They allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art ‘big data’ computing strategies, with abstraction levels beyond available tool capabilities. Results: We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such it is key to genomic ‘big data’ analysis. GMQL leverages a simple data model that provides both abstractions of genomic region data and associated experimental, biological and clinical metadata and interoperability between many data formats. Based on Hadoop framework and Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets. Availability and implementation: The GMQL toolkit is freely available for non-commercial use at http://www.bioinformatics.deib.polimi.it/GMQL/. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online

    Framework for Supporting Genomic Operations

    Get PDF
    Next Generation Sequencing (NGS) is a family of technologies for reading the DNA or RNA, capable of producing whole genome sequences at an impressive speed, and causing a revolution of both biological research and medical practice. In this exciting scenario, while a huge number of specialized bio-informatics programs extract information from sequences, there is an increasing need for a new generation of systems and frameworks capable of integrating such information, providing holistic answers to the needs of biologists and clinicians. To respond to this need, we developed GMQL, a new query language for genomic data management that operates on heterogeneous genomic datasets. In this paper, we focus on three domain-specific operations of GMQL used for the efficient processing of operations on genomic regions, and we describe their efficient implementation; the paper develops a theory of binning strategies as a generic approach to parallel execution of genomic operations, and then describes how binning is embedded into two efficient implementations of the operations using Flink and Spark, two emerging frameworks for data management on the cloud

    A Low-Cost Genomic Sensor for Ocean-Observing Systems and Infectious Disease Detection

    Get PDF
    abstract: Many environmental microorganisms such as marine microbes are un-culturable; hence, they should be analyzed in situ. Even though a few in situ ocean observing instruments have been available to oceanographers, their applications are limited, because these instruments are expensive and power hungry. In this dissertation project, an inexpensive, portable, low-energy consuming, and highly quantitative microbiological genomic sensor has been developed for in situ ocean-observing systems. A novel real-time colorimetric loop-mediated isothermal amplification (LAMP) technology has been developed for quantitative detection of microbial nucleic acids. This technology was implemented on a chip-level device with an embedded inexpensive imaging device and temperature controller to achieve quantitative detection within one hour. A bubble-free liquid handling approach was introduced to avoid bubble trapping during liquid loading, a common problem in microfluidic devices. An algorithm was developed to reject the effect of bubbles generated during the reaction process, to enable more accurate nucleic acid analysis. This genomic sensor has been validated at gene and gene expression levels using Synechocystis sp. PCC 6803 genomic DNA and total RNA. Results suggest that the detection limits reached 10 copies/μL and 100 fg/μL, respectively. This approach was highly quantitative, with linear standard curves down to 103 copies/μL and 1 pg/μL, respectively. In addition to environmental microbe characterization, this genomic sensor has been employed for viral RNA quantification during an infectious disease outbreak. As the Zika fever was spreading in America, a quantitative detection of Zika virus has been performed. The results show that the genomic sensor is highly quantitative from 10 copies/μL to 105 copies/μL. This suggests that the novel nucleic acid quantification technology is sensitive, quantitative, and robust. It is a promising candidate for rapid microbe detection and quantification in routine laboratories. In the future, this genomic sensor will be implemented in in situ platforms as a core analytical module with minor modifications, and could be easily accessible by oceanographers. Deployment of this microbial genomic sensor in the field will enable new scientific advances in oceanography and provide a possible solution for infectious disease detection.Dissertation/ThesisDoctoral Dissertation Biological Design 201

    Combining DNA Methylation with Deep Learning Improves Sensitivity and Accuracy of Eukaryotic Genome Annotation

    Get PDF
    Thesis (Ph.D.) - Indiana University, School of Informatics, Computing, and Engineering, 2020The genome assembly process has significantly decreased in computational complexity since the advent of third-generation long-read technologies. However, genome annotations still require significant manual effort from scientists to produce trust-worthy annotations required for most bioinformatic analyses. Current methods for automatic eukaryotic annotation rely on sequence homology, structure, or repeat detection, and each method requires a separate tool, making the workflow for a final product a complex ensemble. Beyond the nucleotide sequence, one important component of genetic architecture is the presence of epigenetic marks, including DNA methylation. However, no automatic annotation tools currently use this valuable information. As methylation data becomes more widely available from nanopore sequencing technology, tools that take advantage of patterns in this data will be in demand. The goal of this dissertation was to improve the annotation process by developing and training a recurrent neural network (RNN) on trusted annotations to recognize multiple classes of elements from both the reference sequence and DNA methylation. We found that our proposed tool, RNNotate, detected fewer coding elements than GlimmerHMM and Augustus, but those predictions were more often correct. When predicting transposable elements, RNNotate was more accurate than both Repeat-Masker and RepeatScout. Additionally, we found that RNNotate was significantly less sensitive when trained and run without DNA methylation, validating our hypothesis. To our best knowledge, we are not only the first group to use recurrent neural networks for eukaryotic genome annotation, but we also innovated in the data space by utilizing DNA methylation patterns for prediction

    Processing genome-wide association studies within a repository of heterogeneous genomic datasets

    Get PDF
    Background Genome Wide Association Studies (GWAS) are based on the observation of genome-wide sets of genetic variants – typically single-nucleotide polymorphisms (SNPs) – in different individuals that are associated with phenotypic traits. Research efforts have so far been directed to improving GWAS techniques rather than on making the results of GWAS interoperable with other genomic signals; this is currently hindered by the use of heterogeneous formats and uncoordinated experiment descriptions. Results To practically facilitate integrative use, we propose to include GWAS datasets within the META-BASE repository, exploiting an integration pipeline previously studied for other genomic datasets that includes several heterogeneous data types in the same format, queryable from the same systems. We represent GWAS SNPs and metadata by means of the Genomic Data Model and include metadata within a relational representation by extending the Genomic Conceptual Model with a dedicated view. To further reduce the gap with the descriptions of other signals in the repository of genomic datasets, we perform a semantic annotation of phenotypic traits. Our pipeline is demonstrated using two important data sources, initially organized according to different data models: the NHGRI-EBI GWAS Catalog and FinnGen (University of Helsinki). The integration effort finally allows us to use these datasets within multisample processing queries that respond to important biological questions. These are then made usable for multi-omic studies together with, e.g., somatic and reference mutation data, genomic annotations, epigenetic signals. Conclusions As a result of our work on GWAS datasets, we enable 1) their interoperable use with several other homogenized and processed genomic datasets in the context of the META-BASE repository; 2) their big data processing by means of the GenoMetric Query Language and associated system. Future large-scale tertiary data analysis may extensively benefit from the addition of GWAS results to inform several different downstream analysis workflows

    A transcriptome-wide antitermination mechanism sustaining identity of embryonic stem cells

    Get PDF

    The Era of Next-Generation Sequencing in Clinical Oncology

    Get PDF

    The Era of Next-Generation Sequencing in Clinical Oncology

    Get PDF
    • …
    corecore