Accurate Profiling of Microbial Communities from Massively Parallel Sequencing using Convex Optimization
We describe the Microbial Community Reconstruction (MCR) Problem, which
is fundamental for microbiome analysis. In this problem, the goal is to
reconstruct the identity and frequency of species comprising a microbial
community, using short sequence reads from Massively Parallel Sequencing (MPS)
data obtained for specified genomic regions. We formulate the problem
mathematically as a convex optimization problem and provide sufficient
conditions for identifiability, namely the ability to reconstruct species
identity and frequency correctly when the data size (number of reads) grows to
infinity. We discuss different metrics for assessing the quality of the
reconstructed solution, including a novel phylogenetically-aware metric based
on the Mahalanobis distance, and give upper-bounds on the reconstruction error
for a finite number of reads under different metrics. We propose a scalable
divide-and-conquer algorithm for the problem using convex optimization, which
enables us to handle large problems (with species). We show using
numerical simulations that for realistic scenarios, where the microbial
communities are sparse, our algorithm gives solutions with high accuracy, both
in terms of obtaining accurate frequency, and in terms of species phylogenetic
resolution. Comment: To appear in SPIRE 1
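The abstract above does not state its convex program, so the following is only a minimal sketch of one plausible form, not the authors' algorithm. It assumes a hypothetical matrix `A` whose column j holds the expected read-type frequencies for species j, and an observed read-frequency vector `y`. Species frequencies are then estimated by least squares constrained to the probability simplex, solved here with plain projected gradient descent. All names (`project_to_simplex`, `reconstruct_frequencies`, `A`, `y`) are illustrative.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {x : x >= 0, sum(x) = 1}."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        t = (cumulative - 1.0) / j
        if u_j - t > 0:
            theta = t  # keep the threshold from the last index that qualifies
    return [max(x - theta, 0.0) for x in v]


def reconstruct_frequencies(A, y, steps=5000):
    """Minimize ||A x - y||^2 over the simplex by projected gradient descent."""
    n_rows, n_species = len(A), len(A[0])
    # Safe step size: the squared Frobenius norm of A bounds the largest
    # eigenvalue of A^T A, so 2 * ||A||_F^2 bounds the gradient's Lipschitz constant.
    lipschitz = 2.0 * sum(a * a for row in A for a in row)
    step = 1.0 / lipschitz
    x = [1.0 / n_species] * n_species  # start from the uniform community
    for _ in range(steps):
        residual = [
            sum(A[i][j] * x[j] for j in range(n_species)) - y[i]
            for i in range(n_rows)
        ]
        grad = [
            2.0 * sum(A[i][j] * residual[i] for i in range(n_rows))
            for j in range(n_species)
        ]
        x = project_to_simplex([x[j] - step * grad[j] for j in range(n_species)])
    return x


# Toy example (made-up numbers): 4 read types, 3 candidate species.
# Each column of A sums to 1; the true community is 70% species 0, 30% species 1.
A = [
    [0.5, 0.1, 0.2],
    [0.3, 0.1, 0.2],
    [0.1, 0.4, 0.2],
    [0.1, 0.4, 0.4],
]
y = [0.38, 0.24, 0.19, 0.19]  # equals 0.7 * column 0 + 0.3 * column 1
estimate = reconstruct_frequencies(A, y)
print([round(f, 4) for f in estimate])  # should be close to [0.7, 0.3, 0.0]
```

Because the columns of this toy `A` are linearly independent, the noiseless problem has a unique minimizer, which loosely mirrors the identifiability condition the abstract mentions. A real instance would need the divide-and-conquer scheme the authors describe in order to scale.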
An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics
For a decade, The Cancer Genome Atlas (TCGA) program collected clinicopathologic annotation data along with multi-platform molecular profiles of more than 11,000 human tumors across 33 different cancer types. TCGA clinical data contain key features representing the democratized nature of the data collection process. To ensure proper use of this large clinical dataset associated with genomic features, we developed a standardized dataset named the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR), which includes four major clinical outcome endpoints. In addition to detailing major challenges and statistical limitations encountered during the effort of integrating the acquired clinical data, we present a summary that includes endpoint usage recommendations for each cancer type. These TCGA-CDR findings appear to be consistent with cancer genomics studies independent of the TCGA effort and provide opportunities for investigating cancer biology using clinical correlates at an unprecedented scale. Analysis of clinicopathologic annotations for over 11,000 cancer patients in the TCGA program leads to the generation of TCGA Clinical Data Resource, which provides recommendations of clinical outcome endpoint usage for 33 cancer types
CTCF genetic alterations in endometrial carcinoma are pro-tumorigenic
CTCF is a haploinsufficient tumour suppressor gene with diverse normal functions in genome structure and gene regulation. However, the mechanism by which CTCF haploinsufficiency contributes to cancer development is not well understood. CTCF is frequently mutated in endometrial cancer. Here we show that most CTCF mutations effectively result in CTCF haploinsufficiency through nonsense-mediated decay of mutant transcripts, or loss-of-function missense mutation. Conversely, we identified a recurrent CTCF mutation K365T, which alters a DNA binding residue, and acts as a gain-of-function mutation enhancing cell survival. CTCF genetic deletion occurs predominantly in poor prognosis serous subtype tumours, and this genetic deletion is associated with poor overall survival. In addition, we have shown that CTCF haploinsufficiency also occurs in poor prognosis endometrial clear cell carcinomas and has some association with endometrial cancer relapse and metastasis. Using shRNA targeting CTCF to recapitulate CTCF haploinsufficiency, we have identified a novel role for CTCF in the regulation of cellular polarity of endometrial glandular epithelium. Overall, we have identified two novel pro-tumorigenic roles (promoting cell survival and altering cell polarity) for genetic alterations of CTCF in endometrial cancer
Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel
A major use of the 1000 Genomes Project (1000GP) data is genotype imputation in genome-wide association studies (GWAS). Here we develop a method to estimate haplotypes from low-coverage sequencing data that can take advantage of single-nucleotide polymorphism (SNP) microarray genotypes on the same samples. First the SNP array data are phased to build a backbone (or 'scaffold') of haplotypes across each chromosome. We then phase the sequence data 'onto' this haplotype scaffold. This approach can take advantage of relatedness between sequenced and non-sequenced samples to improve accuracy. We use this method to create a new 1000GP haplotype reference set for use by the human genetic community. Using a set of validation genotypes at SNP and bi-allelic indels we show that these haplotypes have lower genotype discordance and improved imputation performance into downstream GWAS samples, especially at low-frequency variants. © 2014 Macmillan Publishers Limited. All rights reserved
Integrated analysis of RNA and DNA from the phase III trial CALGB 40601 identifies predictors of response to trastuzumab-based neoadjuvant chemotherapy in HER2-positive breast cancer
Purpose: Response to a complex trastuzumab-based regimen is affected by multiple features of the tumor and its microenvironment. Developing a predictive algorithm is key to optimizing HER2-targeting therapy. Experimental Design: We analyzed 137 pretreatment tumors with mRNA-seq and DNA exome sequencing from CALGB 40601, a neoadjuvant phase III trial of paclitaxel plus trastuzumab with or without lapatinib in stage II to III HER2-positive breast cancer. We adopted an Elastic Net regularized regression approach that controls for covarying features within high-dimensional data. First, we applied 517 known gene expression signatures to develop an Elastic Net model to predict pCR, which we validated on 143 samples from four independent trials. Next, we performed integrative analyses incorporating clinicopathologic information with somatic mutation status, DNA copy number alterations (CNA), and gene signatures. Results: The Elastic Net model using only gene signatures predicted pCR in the validation sets (AUC = 0.76). Integrative analyses showed that models containing gene signatures, clinical features, and DNA information were better pCR predictors than models containing a single data type. Frequently selected variables from the multiplatform models included amplifications of chromosome 6p, TP53 mutation, HER2-enriched subtype, and immune signatures. Variables predicting resistance included Luminal/ER+ features. Conclusions: Models using RNA only, as well as integrated RNA and DNA models, can predict pCR with improved accuracy over clinical variables. Somatic DNA alterations (mutation, CNAs), tumor molecular subtype (HER2E, Luminal), and the microenvironment (immune cells) were independent predictors of response to trastuzumab and paclitaxel-based regimens. This highlights the complexity of predicting response in HER2-positive breast cancer
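To illustrate how an Elastic Net penalty handles covarying features, here is a minimal sketch, not the study's pipeline. It swaps in a squared-loss (linear) model fitted by coordinate descent, whereas predicting pCR would normally use a logistic loss. The objective is assumed to be (1/(2n))·||y − Xw||² + alpha·(l1_ratio·||w||₁ + ½·(1 − l1_ratio)·||w||²). The function names and toy data are hypothetical.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net_fit(X, y, alpha=0.1, l1_ratio=0.5, sweeps=1000):
    """Coordinate descent for squared-loss Elastic Net (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    l1 = alpha * l1_ratio            # weight of the lasso (L1) part
    l2 = alpha * (1.0 - l1_ratio)    # weight of the ridge (L2) part
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = 0.0
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, l1) / (z + l2)
    return w


# Toy data (made up): features 0 and 1 are identical, feature 2 is unrelated.
X = [
    [1.0, 1.0, 1.0],
    [-1.0, -1.0, 1.0],
    [1.0, 1.0, -1.0],
    [-1.0, -1.0, -1.0],
]
y = [2.0, -2.0, 2.0, -2.0]
weights = elastic_net_fit(X, y)
print([round(v, 4) for v in weights])
```

In this toy case the ridge part makes the two identical features share the signal equally, while the L1 part sets the unrelated feature's weight to zero. That is the grouping behaviour that makes the penalty attractive for correlated gene signatures.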
Single‐Nucleotide Polymorphism Genotyping in Mapping Populations via Genomic Reduction and Next‐Generation Sequencing: Proof of Concept
Driver Fusions and Their Implications in the Development and Treatment of Human Cancers
Gene fusions represent an important class of somatic alterations in cancer. We systematically investigated fusions in 9,624 tumors across 33 cancer types using multiple fusion calling tools. We identified a total of 25,664 fusions, with a 63% validation rate. Integration of gene expression, copy number, and fusion annotation data revealed that fusions involving oncogenes tend to exhibit increased expression, whereas fusions involving tumor suppressors have the opposite effect. For fusions involving kinases, we found 1,275 with an intact kinase domain, the proportion of which varied significantly across cancer types. Our study suggests that fusions drive the development of 16.5% of cancer cases and function as the sole driver in more than 1% of them. Finally, we identified druggable fusions involving genes such as TMPRSS2, RET, FGFR3, ALK, and ESR1 in 6.0% of cases, and we predicted immunogenic peptides, suggesting that fusions may provide leads for targeted drug and immune therapy
Multiomics in primary and metastatic breast tumors from the AURORA US network finds microenvironment and epigenetic drivers of metastasis
The AURORA US Metastasis Project was established with the goal to identify molecular features associated with metastasis. We assayed 55 females with metastatic breast cancer (51 primary cancers and 102 metastases) by RNA sequencing, tumor/germline DNA exome and low-pass whole-genome sequencing and global DNA methylation microarrays. Expression subtype changes were observed in ~30% of samples and were coincident with DNA clonality shifts, especially involving HER2. Downregulation of estrogen receptor (ER)-mediated cell–cell adhesion genes through DNA methylation mechanisms was observed in metastases. Microenvironment differences varied according to tumor subtype; the ER+/luminal subtype had lower fibroblast and endothelial content, while triple-negative breast cancer/basal metastases showed a decrease in B and T cells. In 17% of metastases, DNA hypermethylation and/or focal deletions were identified near HLA-A and were associated with reduced expression and lower immune cell infiltrates, especially in brain and liver metastases. These findings could have implications for treating individuals with metastatic breast cancer with immune- and HER2-targeting therapies
