46 research outputs found

    The Genomedata format for storing large-scale functional genomics data

    Summary: We present a format for efficient storage of multiple tracks of numeric data anchored to a genome. The format allows fast random access to hundreds of gigabytes of data, while retaining a small disk space footprint. We have also developed utilities to load data into this format. We show that retrieving data from this format is more than 2900 times faster than a naive approach using wiggle files.

    Exploratory analysis of genomic segmentations with Segtools

    Background: As genome-wide experiments and annotations become more prevalent, researchers increasingly require tools to help interpret data at this scale. Many functional genomics experiments involve partitioning the genome into labeled segments, such that segments sharing the same label exhibit one or more biochemical or functional traits. For example, a collection of ChIP-seq experiments yields a compendium of peaks, each labeled with one or more associated DNA-binding proteins. Similarly, manually or automatically generated annotations of functional genomic elements, including cis-regulatory modules and protein-coding or RNA genes, can also be summarized as genomic segmentations. Results: We present a software toolkit called Segtools that simplifies and automates the exploration of genomic segmentations. The software operates as a series of interacting tools, each of which provides one mode of summarization. These various tools can be pipelined and summarized in a single HTML page. We describe the Segtools toolkit and demonstrate its use in interpreting a collection of human histone modification data sets and Plasmodium falciparum local chromatin structure data sets. Conclusions: Segtools provides a convenient, powerful means of interpreting a genomic segmentation.

    Identifying elemental genomic track types and representing them uniformly

    Background: With the recent advances and availability of various high-throughput sequencing technologies, data on many molecular aspects, such as gene regulation, chromatin dynamics, and the three-dimensional organization of DNA, are rapidly being generated in an increasing number of laboratories. The variation in biological context, and the increasingly dispersed mode of data generation, imply a need for precise, interoperable and flexible representations of genomic features through formats that are easy to parse. A host of alternative formats are currently available and in use, complicating analysis and tool development. The issue of whether and how the multitude of formats reflects varying underlying characteristics of data has to our knowledge not previously been systematically treated. Results: We here identify intrinsic distinctions between genomic features, and argue that the distinctions imply that a certain variation in the representation of features as genomic tracks is warranted. Four core informational properties of tracks are discussed: gaps, lengths, values and interconnections. From this we delineate fifteen generic track types. Based on the track type distinctions, we characterize major existing representational formats and find that the track types are not adequately supported by any single format. We also find, in contrast to the XML formats, that none of the existing tabular formats are conveniently extendable to support all track types. We thus propose two unified formats for track data, an improved XML format, BioXSD 1.1, and a new tabular format, GTrack 1.0. Conclusions: The defined track types are shown to capture relevant distinctions between genomic annotation tracks, resulting in varying representational needs and analysis possibilities. The proposed formats, GTrack 1.0 and BioXSD 1.1, cater to the identified track distinctions and emphasize preciseness, flexibility and parsing convenience.

    SigTools: An exploratory visualization tool for genomic signals

    With the advancement of sequencing technologies, genomic data sets are constantly being expanded by high volumes of different data types. One recently introduced data type in genomic science is genomic signals, in which genomic coordinates are associated with a score or probability indicating some form of biological activity. An example of genomic signals is epigenomic marks, which represent short-read coverage measurements over the genome and are utilized to locate functional and nonfunctional elements in genome annotation studies. To understand and evaluate the results of such studies, one needs to explore and analyze the characteristics of the input data. Information visualization is an effective approach that leverages human visual ability in data analysis. Several visualization applications have been deployed for this purpose, such as the UCSC genome browser, Deeptools, and Segtools. However, we believe there is room for improvement in terms of programming skills requirements and proposed visualizations. Sigtools is an R-based exploratory visualization package, designed to enable users with limited programming experience to produce statistical plots of continuous genomic data. It consists of several statistical visualizations, such as value distribution, correlation, and autocorrelation, that provide insights regarding the behavior of a group of signals in larger regions (such as a chromosome or the whole genome), as well as visualizing them around a specific point or short region. To demonstrate the use of Sigtools, we first visualize five histone modifications downloaded from the Roadmap Epigenomics data portal and show that Sigtools accurately captures their characteristics. Then, we visualize five chromatin state features, which are probabilistically generated genome annotations, to display how Sigtools can assist in the interpretation of new and unknown signals.

    Visualizing genome and systems biology: technologies, tools, implementation techniques and trends, past, present and future.

    "A picture is worth a thousand words." This widely used adage sums up in a few words the notion that a successful visual representation of a concept should enable easy and rapid absorption of large amounts of information. Although, in general, the notion of capturing complex ideas using images is very appealing, would 1000 words be enough to describe the unknown in a research field such as the life sciences? Life sciences is one of the biggest generators of enormous datasets, mainly as a result of recent and rapid technological advances; their complexity can make these datasets incomprehensible without effective visualization methods. Here we discuss the past, present and future of genomic and systems biology visualization. We briefly comment on many visualization and analysis tools and the purposes that they serve. We focus on the latest libraries and programming languages that enable more effective, efficient and faster approaches for visualizing biological concepts, and also comment on the future human-computer interaction trends that would enable further enhancement of visualization.

    Computational pan-genomics: status, promises and challenges

