10 research outputs found

    Genome modeling system: A knowledge management platform for genomics

    In this work, we present the Genome Modeling System (GMS), an analysis information management system capable of executing automated genome analysis pipelines at a massive scale. The GMS framework provides detailed tracking of samples and data coupled with reliable and repeatable analysis pipelines. The GMS also serves as a platform for bioinformatics development, allowing a large team to collaborate on data analysis, or an individual researcher to leverage the work of others effectively within its data management system. Rather than separating ad-hoc analysis from rigorous, reproducible pipelines, the GMS promotes systematic integration between the two. As a demonstration of the GMS, we performed an integrated analysis of whole genome, exome and transcriptome sequencing data from a breast cancer cell line (HCC1395) and matched lymphoblastoid line (HCC1395BL). These data are available for users to test the software, complete tutorials and develop novel GMS pipeline configurations. The GMS is available at https://github.com/genome/gms

    Genome remodelling in a basal-like breast cancer metastasis and xenograft

    Massively parallel DNA sequencing technologies provide an unprecedented ability to screen entire genomes for genetic changes associated with tumour progression. Here we describe the genomic analyses of four DNA samples from an African-American patient with basal-like breast cancer: peripheral blood, the primary tumour, a brain metastasis and a xenograft derived from the primary tumour. The metastasis contained two de novo mutations and a large deletion not present in the primary tumour, and was significantly enriched for 20 shared mutations. The xenograft retained all primary tumour mutations and displayed a mutation enrichment pattern that resembled the metastasis. Two overlapping large deletions, encompassing CTNNA1, were present in all three tumour samples. The differential mutation frequencies and structural variation patterns in metastasis and xenograft compared with the primary tumour indicate that secondary tumours may arise from a minority of cells within the primary tumour.

    HCC1395 (“TST1”) example input, models, and outputs.

    A test dataset for the HCC1395 cell line is provided with the GMS software to allow testing of software installation, and facilitate further development. It is also used to illustrate much of the current functionality of the GMS. HCC1395 tumor and the corresponding HCC1395BL ‘normal’ cell line DNA and RNA samples were sequenced by whole genome, exome, and RNA-seq methods producing six sets of instrument data for input to various GMS pipelines. Additional required inputs for the pipelines include a reference genome (e.g., GRCh37), gene annotations (e.g., Ensembl 67_37l), and variant databases (e.g., dbSNP37). Different versions (processing profiles) of the reference alignment were used to align WGS and exome DNA reads to the reference genome. A separate RNA-seq pipeline similarly aligns RNA reads. Alternate versions of the somatic variation pipeline are used to call various types of variants from exome and WGS data by comparing tumor and normal reference alignments. A differential expression pipeline identifies significantly altered transcript expression levels by comparing the tumor and normal RNA-seq alignments. Finally, the MedSeq pipeline summarizes all upstream pipelines into a single convenient result set. This includes a multitude of reports and visualizations for single nucleotide variants (SNVs), Indels (insertions and deletions), SVs (structural variants), CNVs (copy number variations), transcript fusions, differentially expressed genes, alternatively expressed isoforms, and much more. Data types are further integrated to, for example, identify which variants at the DNA level are expressed at the RNA level and which events affect known cancer driver genes or druggable targets.
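The last integration step mentioned above, checking which DNA-level variants are also seen in RNA, can be sketched as follows. This is a minimal illustration and not the MedSeq implementation: the `Variant` type, the `expressed_variants` function, the per-position count layout, and both thresholds (`min_alt_reads`, `min_vaf`) are assumptions made for the example, and it only handles SNVs.

```python
from typing import Dict, List, NamedTuple, Tuple


class Variant(NamedTuple):
    """A single nucleotide variant called from DNA (hypothetical record type)."""
    chrom: str
    pos: int
    ref: str
    alt: str


# Key: (chrom, pos). Value: RNA-seq read count for each base observed there.
RnaCounts = Dict[Tuple[str, int], Dict[str, int]]


def expressed_variants(
    variants: List[Variant],
    rna_counts: RnaCounts,
    min_alt_reads: int = 3,   # assumed threshold, not taken from the GMS
    min_vaf: float = 0.05,    # assumed threshold, not taken from the GMS
) -> List[Variant]:
    """Return the variants whose alternate allele is supported by RNA reads."""
    expressed = []
    for variant in variants:
        counts = rna_counts.get((variant.chrom, variant.pos), {})
        depth = sum(counts.values())
        alt_reads = counts.get(variant.alt, 0)
        # Require RNA coverage, a minimum number of alt reads, and a minimum
        # RNA variant allele frequency (alt reads / total reads).
        if depth > 0 and alt_reads >= min_alt_reads and alt_reads / depth >= min_vaf:
            expressed.append(variant)
    return expressed


if __name__ == "__main__":
    # Toy values for illustration only; these are not HCC1395 results.
    calls = [Variant("1", 100, "A", "T"), Variant("2", 200, "G", "C")]
    rna = {("1", 100): {"A": 10, "T": 8}, ("2", 200): {"G": 30}}
    print(expressed_variants(calls, rna))
```

A variant with no RNA coverage at its position is simply not reported as expressed; the sketch cannot distinguish "gene not expressed" from "variant allele not expressed", which a fuller analysis would need to separate.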

    Key concepts of the GMS.

    The genome modeling system is architected around the idea of a ‘genome model’. The following vignettes illustrate key concepts integral to these models: (A) A subject can be modeled multiple times, possibly each with distinct ‘processing profiles’. For example, two different models can be defined for the HCC1395 genome using the ‘reference alignment’ pipeline. In Model 1, the processing profile specifies the use of BWA for alignment and Samtools for variant detection. In Model 2, Bowtie2 and GATK are used for these steps instead. (B) A given processing profile can be used across a group of models, ensuring, for instance, that all subjects in a cohort are processed in similar ways. In this example, two different cell line genomes (HCC1395 and XY2123) have models defined of the exact same type, using the processing profile with BWA/Samtools specified. (C) A model has no results until a build is generated. If the model is updated to have new inputs, a new build is required. Builds are immutable snapshots of modeling pipeline results. In this example, the HCC1395 genome has a reference alignment model again making use of the BWA/Samtools profile. However, as new instrument data becomes available, new builds are constructed to reflect the most complete data. (D) When models are used as inputs for other models, the last complete build for the input model is used as an input for the downstream build. In this example, both tumor and normal genomes are available for an individual (in this case HCC1395). Reference alignment models are built for each sample and then both are used as inputs for a third ‘somatic variation’ model. In reality, it is the underlying data in the reference alignment builds that are used to create a somatic variation build, identifying all variants that are thought to be tumor specific.
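The relationships in vignettes A to D can be sketched as a small data model. This is an illustration of the concepts only, not the GMS codebase or its API: the class names, fields, the `start_build` and `last_complete_build` methods, and the model and instrument data names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ProcessingProfile:
    """A named, reusable set of pipeline choices (vignettes A and B)."""
    name: str
    tools: Tuple[str, ...]


@dataclass(frozen=True)
class Build:
    """An immutable snapshot of one pipeline run for a model (vignette C)."""
    model_name: str
    number: int
    instrument_data: Tuple[str, ...]
    input_builds: Tuple["Build", ...] = ()
    complete: bool = True


@dataclass
class Model:
    """A subject paired with a processing profile and its inputs."""
    name: str
    subject: str
    profile: ProcessingProfile
    instrument_data: List[str] = field(default_factory=list)
    input_models: List["Model"] = field(default_factory=list)
    builds: List[Build] = field(default_factory=list)

    def last_complete_build(self) -> Optional[Build]:
        """Return the most recent complete build, or None if there is none."""
        for build in reversed(self.builds):
            if build.complete:
                return build
        return None

    def start_build(self) -> Build:
        """Snapshot the current inputs into a new build (vignettes C and D)."""
        upstream = []
        for model in self.input_models:
            build = model.last_complete_build()
            if build is None:
                raise ValueError(f"input model {model.name!r} has no complete build")
            upstream.append(build)
        new_build = Build(
            model_name=self.name,
            number=len(self.builds) + 1,
            instrument_data=tuple(self.instrument_data),
            input_builds=tuple(upstream),
        )
        self.builds.append(new_build)
        return new_build


if __name__ == "__main__":
    bwa_samtools = ProcessingProfile("refalign-bwa-samtools", ("BWA", "Samtools"))
    somatic = ProcessingProfile("somatic-variation", ("hypothetical-caller",))

    # Vignette B: one profile shared by two reference alignment models.
    tumor = Model("tumor-refalign", "HCC1395", bwa_samtools, ["lane-1"])
    normal = Model("normal-refalign", "HCC1395BL", bwa_samtools, ["lane-2"])
    tumor.start_build()
    normal.start_build()

    # Vignette C: new instrument data calls for a new build; build 1 is unchanged.
    tumor.instrument_data.append("lane-3")
    tumor.start_build()

    # Vignette D: the downstream model consumes the last complete upstream builds.
    tumor_vs_normal = Model("somatic", "HCC1395", somatic, input_models=[tumor, normal])
    downstream = tumor_vs_normal.start_build()
    print([b.number for b in downstream.input_builds])
```

Freezing `Build` mirrors the statement that builds are immutable snapshots: adding instrument data to a model changes what the next build will see, while earlier builds keep the inputs they were created with.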

    Circos plot of HCC1395 tumor/normal comparison.

    Circos is a popular tool for summarizing genomic events in a tumor genome. This is just one of many automatically generated visualizations made possible by the GMS. In this example, the WGS, exome and RNA-seq data for HCC1395 are displayed in several tracks along with additional visualizations illustrating individual events. Moving inwards, SNVs and Indels are plotted on the outermost track, then highly expressed genes, CNVs, and finally chromosomal translocations at the center. For events predicted to affect protein coding genes, additional plots are auto-generated to display the mutation position relative to protein domains and previously reported mutations from the Cosmic database, as illustrated in the topmost plot. Moving clockwise, a screenshot of IGV demonstrates one of the somatic deletions identified. IGV XML sessions are automatically generated to allow rapid manual review of all predicted events. Next, a histogram illustrates the expression of a single highly expressed gene relative to the distribution of expression for all genes. Then, a CNV plot is shown for an amplified portion of one chromosome. Finally, the coverage and supporting reads for a chromosomal translocation are depicted.

    Overview of the GMS.

    The genome modeling system (GMS) is implemented to use a federated disk SAN, with meta-data stored in a PostgreSQL relational database. Sample management tools allow the import of new samples and instrument data. Data are then processed through various analysis pipelines (e.g., reference alignment, somatic variation detection, etc.) that in turn are managed and monitored by a workflow system (Box 1). Stand-alone GMS tools, not part of automated pipelines, are available through a common tool tree. Most components of the system can be accessed through an Ubuntu Linux command-line interface or Ruby-on-Rails web interface.

    Somatic variation processing profile and workflow.

    To illustrate key GMS concepts, the processing profiles and workflow for the somatic variation pipeline are shown. Abbreviations: copy number variant (CNV), copy number amplification (CNA), genome analysis tool kit (GATK), insertion/deletion (Indel), loss of heterozygosity (LOH), mapping quality (MQ), single nucleotide variant (SNV), structural variant (SV), variant allele frequency (VAF).

    Major GMS pipelines.

    A brief description of each analysis pipeline tested for initial release of the GMS.

    Terminology for the Genome Modeling System.

    Brief descriptions of critical objects in the Genome Modeling System.