From Cancer Sequencing Data to Neoantigen Prediction: A Reusable Pipeline using Snakemake
Neoantigens are peptides newly formed by somatic mutations that are capable of inducing tumor-specific T-cell recognition. Because neoantigens are expressed specifically in tumor cells, predicting them can lead to personalized immunotherapies for the treatment of cancers. This process involves many steps, the most crucial of which is identification of expressed somatic mutations (or variants) using next-generation sequencing data. After evaluating multiple bioinformatics tools for somatic mutation calling, we selected GATK (Genome Analysis Toolkit) for its ability to accurately call expected mutations. Other steps must be performed before and after identification of somatic mutations, including mapping, duplicate marking, annotation of mutation calls, and filtering of mutation calls. We developed a pipeline using the workflow management system Snakemake to perform these steps in order to identify somatic mutations from whole exome and RNA-Seq data. Because the pipeline is a Snakemake workflow, we can easily extend it and add more steps, as was done for neoantigen prediction. Furthermore, Snakemake submits Slurm jobs for each individual step and can intelligently adjust the runtime and processing load for those jobs. This makes it simple to run even very large samples through the pipeline. We have evaluated this pipeline using RNA sequencing and whole exome sequencing data from 46 Multiple Myeloma cell lines and have identified hundreds of expressed mutations per cell line. This reusable and expandable pipeline can serve as a useful resource for other researchers looking to identify expressed mutations and make neoantigen predictions from cancer sequencing data.
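The step ordering described in this abstract can be sketched as a small dependency graph. This is plain Python, not Snakemake syntax, and it is not the authors' workflow: the step names and the assumption that annotation runs before filtering are illustrative only. It shows how a workflow manager can derive an execution order from declared dependencies, and how adding a step (here, neoantigen prediction) only requires declaring what it depends on.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each step maps to the set of steps that must finish before it starts.
# Step names and the annotation-before-filtering order are assumptions.
DEPENDENCIES = {
    "mapping": set(),
    "duplicate_marking": {"mapping"},
    "somatic_calling": {"duplicate_marking"},
    "annotation": {"somatic_calling"},
    "filtering": {"annotation"},
}

# Extending the workflow: declare the new step and its prerequisite.
EXTENDED = {**DEPENDENCIES, "neoantigen_prediction": {"filtering"}}


def execution_order(dependencies):
    """Return the steps in an order that respects every dependency."""
    return list(TopologicalSorter(dependencies).static_order())


for step in execution_order(EXTENDED):
    print(step)
```

A real workflow manager infers these dependencies from the input and output files of each rule instead of from an explicit table, but the ordering principle is the same.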
Structural Variant Detection Tools Struggle with Whole Exome Sequencing (WES) Data
Whole exome sequencing (WES) is a targeted sequencing technique that sequences only the protein-coding regions of the genome. As WES is significantly cheaper than whole genome sequencing (WGS) while still providing meaningful information, WES has become a respected tool in identifying small genetic variants underlying diseases. It is also used, but less commonly, to identify large-scale structural variants (SVs), which, because of their size and complexity, are more difficult to detect using short-read sequencing data. SVs are genome alterations spanning fifty or more base pairs and have been linked to the onset or progression of certain diseases, such as Multiple Myeloma (MM). Multiple bioinformatics tools are available for the identification of structural variants from genomic data; however, it is important to benchmark their accuracies and efficiencies, particularly when dealing with exome data. Using exome sequencing data from 71 Multiple Myeloma cell lines, we benchmarked six established SV identification tools by comparing their results to each cell line's known SVs. We utilized the Texas Advanced Computing Center (TACC) to run our workflows on these samples in parallel. When comparing the SVs detected by each tool to the SVs expected in these cell lines, the results brought to light the challenges of detecting SVs using short-read WES data. At the chromosomal level of these known SVs, only two of six tools had a recall greater than 25%. At the coordinate level, no tool had a recall greater than 20%. These tools have been used in published studies to identify SVs from WES data; their poor recall in these MM cell lines may indicate the need for WES-specific SV detection tools in the future.
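The two recall figures reported above can be illustrated with a minimal sketch. The abstract does not say how a called SV was matched to a known one, so the matching rules here are assumptions: a chromosome-level match requires the same pair of chromosomes, and a coordinate-level match additionally requires both breakpoints to fall within a hypothetical tolerance of 1,000 bp. All coordinates are invented for illustration.

```python
from typing import NamedTuple


class SV(NamedTuple):
    chrom_a: str
    pos_a: int
    chrom_b: str
    pos_b: int


def chrom_match(known, called):
    """Chromosome-level match: same pair of chromosomes, in either order."""
    return {known.chrom_a, known.chrom_b} == {called.chrom_a, called.chrom_b}


def coord_match(known, called, tol=1000):
    """Coordinate-level match: both breakpoints within tol bp (assumed rule)."""
    same = (known.chrom_a == called.chrom_a and known.chrom_b == called.chrom_b
            and abs(known.pos_a - called.pos_a) <= tol
            and abs(known.pos_b - called.pos_b) <= tol)
    swapped = (known.chrom_a == called.chrom_b and known.chrom_b == called.chrom_a
               and abs(known.pos_a - called.pos_b) <= tol
               and abs(known.pos_b - called.pos_a) <= tol)
    return same or swapped


def recall(known_svs, called_svs, match):
    """Fraction of known SVs matched by at least one called SV."""
    if not known_svs:
        return 0.0
    found = sum(any(match(k, c) for c in called_svs) for k in known_svs)
    return found / len(known_svs)


# Invented example data, not taken from the study.
known = [SV("chr4", 1_900_000, "chr14", 105_600_000),
         SV("chr11", 69_600_000, "chr14", 105_700_000)]
called = [SV("chr14", 105_600_400, "chr4", 1_900_300),
          SV("chr11", 10_000_000, "chr14", 50_000_000)]

chrom_recall = recall(known, called, chrom_match)
coord_recall = recall(known, called, coord_match)
print(chrom_recall, coord_recall)
```

In this toy example both known SVs are recovered at the chromosome level but only one at the coordinate level, which mirrors the direction of the gap reported in the abstract.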
Meta-analysis of microarray data using a pathway-based approach identifies a 37-gene expression signature for systemic lupus erythematosus in human peripheral blood mononuclear cells
Background: A number of publications have reported the use of microarray technology to identify gene expression signatures to infer mechanisms and pathways associated with systemic lupus erythematosus (SLE) in human peripheral blood mononuclear cells. However, meta-analysis approaches with microarray data have not been well-explored in SLE. Methods: In this study, a pathway-based meta-analysis was applied to four independent gene expression oligonucleotide microarray data sets to identify gene expression signatures for SLE, and these data sets were confirmed by a fifth independent data set. Results: Differentially expressed genes (DEGs) were identified in each data set by comparing expression microarray data from control samples and SLE samples. Using Ingenuity Pathway Analysis software, pathways associated with the DEGs were identified in each of the four data sets. Using the leave-one-data-set-out pathway-based meta-analysis approach, a 37-gene metasignature was identified. This SLE metasignature clearly distinguished SLE patients from controls as observed by unsupervised learning methods. The final confirmation of the metasignature was achieved by applying the metasignature to a fifth independent data set. Conclusions: The novel pathway-based meta-analysis approach proved to be a useful technique for grouping disparate microarray data sets. This technique allowed for validated conclusions to be drawn across four different data sets and confirmed by an independent fifth data set. The metasignature and pathways identified by using this approach may serve as a source for identifying therapeutic targets for SLE and may possibly be used for diagnostic and monitoring purposes. Moreover, the meta-analysis approach provides a simple, intuitive solution for combining disparate microarray data sets to identify a strong metasignature. Please see Research Highlight: http://genomemedicine.com/content/3/5/30
Finding Expressed Mutations in Multiple Myeloma Cell Lines
Neoantigens are newly formed peptides created from somatic mutations that are capable of inducing tumor-specific T-cell recognition. Prediction of these neoantigens can lead to personalized immunotherapies for the treatment of cancers. Identification of expressed somatic mutations using next-generation sequencing data is a crucial first step in neoantigen prediction. Because of the expansion of next-generation sequencing data, there exists a plethora of tools designed to sift through this data and return high-quality Single Nucleotide Variants (SNVs) and small insertions and deletions (indels); however, it is essential to select tools that are flexible, efficient, and above all, accurate at detecting these mutations. Using RNA sequencing combined with whole exome sequencing data from 71 Human Multiple Myeloma cell lines (HMCLs), we compared different variant calling tools to develop a workflow for identifying expressed mutations. The use of well-characterized HMCLs with known SNVs and indels enables us to compare the accuracy of each variant calling tool. Thus far, we have compared the accuracy and efficiency of VarScan's simple variant calling pipeline to GATK's fully encompassing pipelines for exome and RNA-Seq data and have incorporated post-filtering, annotation, and visualization of found variants into our workflow. Because of the large number of HMCLs and the several steps required by and specific to each pipeline, we used Lonestar5 to parallelize our processing of the data. Our completed workflow will provide a standardized means for identifying expressed mutations in tumors.
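The comparison described above can be sketched in a few lines. The abstract does not define "expressed mutation" operationally, so this sketch assumes a variant counts as expressed when it is called in both the exome and the RNA-Seq data, and it scores a call set against known variants by precision and recall. The variants are invented placeholders, not data from the study.

```python
# A variant is identified by (chromosome, position, reference allele, alternate allele).
Variant = tuple[str, int, str, str]


def expressed_variants(exome_calls, rna_calls):
    """Variants called in both data types (assumed definition of 'expressed')."""
    return set(exome_calls) & set(rna_calls)


def precision_recall(called, known):
    """Score a call set against the known variants of a cell line."""
    true_positives = len(called & known)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Invented placeholder variants.
v1 = ("chr1", 100, "A", "T")
v2 = ("chr2", 200, "G", "C")
v3 = ("chr3", 300, "C", "CA")
v4 = ("chr4", 400, "T", "G")
v5 = ("chr5", 500, "G", "A")

exome_calls = {v1, v2, v3}
rna_calls = {v2, v3, v4}
known = {v2, v5}

expressed = expressed_variants(exome_calls, rna_calls)
precision, recall = precision_recall(expressed, known)
print(sorted(expressed), precision, recall)
```

Running the same scoring on the output of each variant caller gives a like-for-like accuracy comparison across tools.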
Transcription factor motifs associated with anterior insula gene-expression underlying mood disorder phenotypes
Background: Mood disorders represent a major cause of morbidity and mortality worldwide, but the brain-related molecular pathophysiology in mood disorders remains largely undefined. Methods: Because the anterior insula is reduced in volume in patients with mood disorders, RNA was extracted from postmortem mood disorder samples and compared with unaffected control samples for RNA-sequencing identification of differentially expressed genes (DEGs) in a) bipolar disorder (BD; n=37) versus (vs.) controls (n=33), b) major depressive disorder (MDD; n=30) vs. controls, and c) low vs. high Axis-I comorbidity (a measure of cumulative psychiatric disease burden). Given the regulatory role of transcription factors (TFs) in gene expression via specific DNA-binding domains (motifs), we used the JASPAR TF binding database to identify TF-motifs. Results: We found that DEGs in BD vs. controls, MDD vs. controls, and high vs. low Axis-I comorbidity were associated with TF-motifs that are known to regulate expression of toll-like receptor genes, cellular homeostatic-control genes, and genes involved in embryonic, cellular/organ, and brain development. Discussion: Robust imaging-guided transcriptomics (i.e., using meta-analytic imaging results to guide independent postmortem dissection for RNA-sequencing) was applied by targeting the gray matter volume reduction in the anterior insula in mood disorders, to guide independent postmortem identification of TF motifs regulating DEGs. TF motifs were identified for immune, cellular, embryonic, and neurodevelopmental processes. Conclusion: Our findings of TF-motifs that regulate the expression of immune, cellular homeostatic-control, and developmental genes provide novel information about the hierarchical relationship between gene regulatory networks, the TFs that control them, and proximate underlying neuroanatomical phenotypes in mood disorders.
bioRxiv preprint, not certified by peer review; this version posted September 28, 2020; https://doi.org/10.1101/864900; made available under a CC-BY-NC-ND 4.0 International license.
Transcription Factor Motifs Associated with Anterior Insula Gene Expression Underlying Mood Disorder Phenotypes
Mood disorders represent a major cause of morbidity and mortality worldwide, but the brain-related molecular pathophysiology in mood disorders remains largely undefined. Because the anterior insula is reduced in volume in patients with mood disorders, RNA was extracted from the postmortem anterior insula of mood disorder samples and compared with unaffected controls for RNA-sequencing identification of differentially expressed genes (DEGs) in (a) bipolar disorder (BD; n = 37) versus (vs.) controls (n = 33), (b) major depressive disorder (MDD; n = 30) vs. controls, and (c) low vs. high axis I comorbidity (a measure of cumulative psychiatric disease burden). Given the regulatory role of transcription factors (TFs) in gene expression via specific DNA-binding domains (motifs), we used the JASPAR TF binding database to identify TF-motifs. We found that DEGs in BD vs. controls, MDD vs. controls, and high vs. low axis I comorbidity were associated with TF-motifs that are known to regulate expression of toll-like receptor genes, cellular homeostatic-control genes, and genes involved in embryonic, cellular/organ, and brain development. Robust imaging-guided transcriptomics, using meta-analytic imaging results to guide independent postmortem dissection for RNA-sequencing, was applied by targeting the gray matter volume reduction in the anterior insula in mood disorders, to guide independent postmortem identification of TF motifs regulating DEGs. Our findings of TF-motifs that regulate the expression of immune, cellular homeostatic-control, and developmental genes provide novel information about the hierarchical relationship between gene regulatory networks, the TFs that control them, and proximate underlying neuroanatomical phenotypes in mood disorders.
Detecting Structural Variants in Multiple Myeloma Cell Lines using Whole Exome Sequencing
Whole exome sequencing (WES) is a targeted sequencing technique that sequences only the protein-coding regions of the genome. As WES is more cost-effective than whole genome sequencing (WGS), it has become a respected tool in identifying small genetic variants underlying diseases. However, it is less commonly used to identify large-scale structural variants (SVs), which, because of their size and complexity, are more difficult to detect using short-read sequencing data. SVs are genome alterations spanning 50 or more base pairs and have been linked to the onset or progression of certain diseases, such as Multiple Myeloma (MM). Multiple bioinformatics tools are available for the identification of structural variants from genomic data; however, it is important to benchmark their accuracies and efficiencies, particularly in the context of WES data. Using WES data from 71 Human Multiple Myeloma Cell Lines (HMCLs), we benchmarked three established SV identification tools (Delly, Pindel, and Smoove) by comparing their results to the known structural variants in each cell line. We used an SV visualization tool, svviz, and developed our own visualization scripts to examine output features, such as the distribution of base pair length and the types of structural variants detected, and performance metrics, such as run-time. We utilized the Texas Advanced Computing Center (TACC) to run our workflow on all HMCLs in parallel. These SV identification tools each possess unique strengths and weaknesses, so they will be combined (along with filtering and visualization of SVs) to create a robust workflow that will be utilized to identify novel structural variants in HMCLs, which can then be extended to patient tumors.
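The output features mentioned above (SV types detected and length distribution per tool) can be summarized with a short sketch. This is not the authors' visualization code: the record layout is hypothetical, the calls are invented, and only the 50 bp minimum comes from the abstract. Events without a meaningful length, such as interchromosomal breakends, are kept and simply excluded from the length statistics.

```python
from collections import Counter, defaultdict
from statistics import median

MIN_SV_LENGTH = 50  # SVs span 50 or more base pairs, per the abstract


def summarize(calls):
    """Count SV types per tool and compute the median SV length per tool."""
    # Keep calls with no length (e.g. breakends) or with length >= 50 bp.
    kept = [c for c in calls
            if c["length"] is None or abs(c["length"]) >= MIN_SV_LENGTH]
    type_counts = Counter((c["tool"], c["svtype"]) for c in kept)
    lengths = defaultdict(list)
    for c in kept:
        if c["length"] is not None:
            lengths[c["tool"]].append(abs(c["length"]))
    median_length = {tool: median(values) for tool, values in lengths.items()}
    return type_counts, median_length


# Invented example calls in a hypothetical record layout.
calls = [
    {"tool": "Delly", "svtype": "DEL", "length": -1200},
    {"tool": "Delly", "svtype": "DEL", "length": -30},   # below 50 bp, dropped
    {"tool": "Delly", "svtype": "BND", "length": None},  # no length defined
    {"tool": "Pindel", "svtype": "INS", "length": 75},
    {"tool": "Pindel", "svtype": "DEL", "length": -300},
]

type_counts, median_length = summarize(calls)
print(dict(type_counts))
print(median_length)
```

Tables like these make it easier to see where tools differ, for example one tool reporting mostly short deletions while another reports breakends, before deciding how to combine them.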