10 research outputs found

    GenoMetric Query Language: A novel approach to large-scale genomic data management

    Motivation: Improvement of sequencing technologies and data processing pipelines is rapidly providing sequencing data, with associated high-level features, of many individual genomes in multiple biological and clinical conditions. They allow for data-driven genomic, transcriptomic and epigenomic characterizations, but require state-of-the-art ‘big data’ computing strategies, with abstraction levels beyond available tool capabilities. Results: We propose a high-level, declarative GenoMetric Query Language (GMQL) and a toolkit for its use. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous datasets and samples; as such it is key to genomic ‘big data’ analysis. GMQL leverages a simple data model that provides both abstractions of genomic region data and associated experimental, biological and clinical metadata, and interoperability between many data formats. Based on the Hadoop framework and the Apache Pig platform, GMQL ensures high scalability, expressivity, flexibility and simplicity of use, as demonstrated by several biological query examples on ENCODE and TCGA datasets. Availability and implementation: The GMQL toolkit is freely available for non-commercial use at http://www.bioinformatics.deib.polimi.it/GMQL/. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

    Implementing a transcription factor interaction prediction system using the genometric query language

    Novel technologies and growing interest have resulted in a large increase in the amount of data available for genomics and transcriptomics studies, both in terms of volume and contents. Biology is relying more and more on computational methods to process, investigate, and extract knowledge from this huge amount of data. In this work, we present the TICA web server (available at http://www.gmql.eu/tica/), a fast and compact tool developed to support data-driven knowledge discovery in the realm of transcription factor interaction prediction. To infer physically interacting transcription factors, TICA leverages both the GenoMetric Query Language, a novel query tool (based on the Apache Hadoop and Spark technologies) specialized in the integration and management of heterogeneous, large genomic datasets, and a statistical method for robust detection of co-locations across interval-based data. Notably, TICA allows investigators to upload and analyze their own ChIP-seq experiment datasets, comparing them either against ENCODE data or against each other, with computation time that increases linearly with dataset size and density. Using ENCODE data from three well-studied cell lines as reference, we show that TICA predictions are supported by existing biological knowledge, making the web server a reliable and efficient tool for interaction screening and data-driven hypothesis generation.

    Array-based data management for genomics

    With the huge growth of genomic data, exposing multiple heterogeneous features of genomic regions for millions of individuals, we increasingly need to support domain-specific query languages and knowledge extraction operations, capable of aggregating and comparing trillions of regions arbitrarily positioned on the human genome. While row-based models for regions can be effectively used as a basis for cloud-based implementations, in previous work we have shown that the array-based model is effective in supporting the class of region-preserving operations, i.e. operations which do not create any new region but rather compose existing ones. In this paper, we remove the above constraint, and describe an array-based implementation which applies to unrestricted region operations, as required by the GenoMetric Query Language. Specifically, we define a wide spectrum of operations over datasets which are represented using arrays, and we show that the array-based implementation scales well on Spark, also thanks to a data representation which is effectively used for supporting machine learning. Our benchmark, which uses an independent, pre-existing collection of queries, shows that in many cases the novel array-based implementation significantly improves on the performance of the row-based implementation.

    Genomic data modeling for interoperability and next generation genomic data management

    We illustrate the Genomic Data Model (GDM), a high-level, unifying data model which mediates among a variety of data formats and encodings. Thanks to GDM, thousands of samples within heterogeneous datasets can be integrated and queried, thereby facilitating the massive analysis of genomic data.
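
As an illustration only, the idea of a data model that pairs genomic region data with per-sample metadata (as the abstracts in this listing describe GDM) can be sketched in Python. The class names `Region` and `Sample`, the helper `select_by_metadata`, and all sample values are hypothetical; this is neither GMQL syntax nor the GDM specification.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Region:
    """One genomic interval plus format-specific feature values (hypothetical layout)."""
    chrom: str
    start: int
    stop: int
    strand: str = "*"
    values: Tuple = ()


@dataclass
class Sample:
    """A sample couples its regions with free-form attribute/value metadata."""
    sample_id: str
    regions: List[Region] = field(default_factory=list)
    metadata: Dict[str, str] = field(default_factory=dict)


def select_by_metadata(dataset: List[Sample], **conditions: str) -> List[Sample]:
    """Keep the samples whose metadata match every given attribute/value pair."""
    return [
        s for s in dataset
        if all(s.metadata.get(key) == value for key, value in conditions.items())
    ]


# Made-up samples, as if they came from two different source formats.
dataset = [
    Sample("s1", [Region("chr1", 100, 200, "+", (0.8,))],
           {"assay": "ChIP-seq", "cell": "K562"}),
    Sample("s2", [Region("chr2", 50, 90, "-", (12,))],
           {"assay": "RNA-seq", "cell": "K562"}),
]

chip = select_by_metadata(dataset, assay="ChIP-seq")
```

In this sketch, heterogeneous samples share one shape, so a single metadata predicate can be applied across all of them regardless of the format each sample originally came from.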

    Multi-dimensional genomic data management for region-preserving operations

    In previous work, we presented GenoMetric Query Language (GMQL), an algebraic language for querying genomic datasets, supported by Genomic Data Management System (GDMS), an open-source big data engine implemented on top of Apache Spark. GMQL datasets are represented as genomic regions (i.e. intervals of the genome, included within a start and stop position) with an associated value, representing the signal associated with that region (the most typical signals represent gene expressions, peaks of expressions, and variants relative to a reference genome). GMQL can process queries over billions of regions, organized within distinct datasets. In this paper, we focus on the efficient execution of region-preserving GMQL operations, in which the regions of the result are a subset of the regions of one of the operands; most GMQL operations are region-preserving. Chains of region-preserving operations can be efficiently executed by taking advantage of an array-based data organization, where region management can be separated from value management. We discuss this optimization in the context of the current GDMS system, which has a row-based (relational) organization and therefore requires dynamic data transformations. A similar approach applies to other application domains with interval-based data organization.
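
The separation of region management from value management can be sketched as follows. This is a minimal illustration under stated assumptions, not the GDMS implementation: `Coordinates`, `RegionView`, `select_where`, `map_values` and `materialize` are hypothetical names, and the data are made up. Coordinates are stored once; each region-preserving step only produces a new index subset and a new value array.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Coordinates:
    """Region coordinates stored once, column by column (array-style)."""
    chrom: Tuple[str, ...]
    start: Tuple[int, ...]
    stop: Tuple[int, ...]


@dataclass(frozen=True)
class RegionView:
    """A subset of regions: positions into `coords` plus one value per position."""
    coords: Coordinates
    index: Tuple[int, ...]
    values: Tuple[float, ...]


def select_where(view: RegionView, predicate: Callable[[float], bool]) -> RegionView:
    """Region-preserving filter: the result regions are a subset of the input regions."""
    kept = [(i, v) for i, v in zip(view.index, view.values) if predicate(v)]
    return RegionView(view.coords,
                      tuple(i for i, _ in kept),
                      tuple(v for _, v in kept))


def map_values(view: RegionView, fn: Callable[[float], float]) -> RegionView:
    """Region-preserving update: only the values change, the coordinates are untouched."""
    return RegionView(view.coords, view.index, tuple(fn(v) for v in view.values))


def materialize(view: RegionView) -> List[Tuple[str, int, int, float]]:
    """Convert back to row form only once, at the end of the chain."""
    c = view.coords
    return [(c.chrom[i], c.start[i], c.stop[i], v)
            for i, v in zip(view.index, view.values)]


coords = Coordinates(("chr1", "chr1", "chr2"), (100, 500, 50), (200, 650, 90))
view = RegionView(coords, (0, 1, 2), (0.5, 3.0, 8.0))

# Chain of two region-preserving steps; coordinates are never copied or rewritten.
result = materialize(map_values(select_where(view, lambda v: v >= 1.0),
                                lambda v: v * 2))
```

Because neither step creates a new region, the chain never rebuilds full rows; the row form is produced only when the final result is materialized.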

    Next generation genomic computing

    Next-generation sequencing (NGS) technologies and data processing pipelines are rapidly and inexpensively providing increasingly numerous sequencing data and associated (epi)genomic features of many individual genomes in multiple biological and clinical conditions, generally made publicly available within well-curated repositories. Answers to fundamental biomedical problems are hidden in these data; yet, their efficient management and integrative processing is becoming the biggest and most important “big data” problem of mankind. Multi-sample processing of heterogeneous information can support data-driven discoveries and biomolecular sense making, such as discovering how heterogeneous genomic, transcriptomic and epigenomic features cooperate to characterize biomolecular functions; yet, it requires state-of-the-art “big data” computing strategies, with abstractions beyond commonly used tool capabilities. We recently proposed a new paradigm in NGS data management and processing by introducing an essential Genomic Data Model (GDM) that uses a few general abstractions for genomic region data and associated experimental, biological and clinical metadata, which guarantee interoperability between existing data formats. Leveraging GDM, we developed a next-generation, high-level, declarative GenoMetric Query Language (GMQL) for genomics data; here, we demonstrate its usefulness, flexibility and simplicity of use through several biological query examples. GMQL operates downstream of raw data preprocessing pipelines and supports queries over thousands of heterogeneous samples; computational efficiency and high scalability are achieved by using parallel computing on clusters or public clouds. GDM and GMQL are applicable to federated repositories, and can be exploited to provide integrated access to curated data, made available by large consortia such as ENCODE, Epigenomics Roadmap, or TCGA, through user-friendly search services.