36 research outputs found

    Quality Control and Analysis of RNA-seq Data from Breast Cancer Tumor Samples

    Abstract:
    Background: Breast cancer is the most common kind of cancer among women in Sweden. While the short- to mid-term survival chances are good, the long-term survival chances are poor, and a large number of women are also likely being overtreated and thus suffer from unnecessary side effects. The South Sweden Cancerome Analysis Network - Breast (SCAN-B) Initiative aims at improving breast cancer outcome by developing new diagnostics and predictive tests based on RNA sequencing (RNA-seq) technology. Because RNA-seq is a complicated technology with many error sources, quality control is needed to gain confidence in the obtained data.
    Results: During this project an RNA-seq quality control pipeline was built and integrated into the existing SCAN-B RNA-seq analysis pipeline. The quality control pipeline was used to evaluate the quality of 2547 RNA-seq libraries. The evaluation showed good overall quality of the data. While the quality of the first sequenced libraries is not optimal, quality has increased steadily and settled at a high level.
    Conclusions: Quality control is essential for the RNA-seq analysis process. The metrics used in this project provide good insight into the quality of the evaluated datasets. However, cancer cells feature a distinct genomic landscape which can make the interpretation of metrics difficult. Thus, care has to be taken when drawing conclusions about the quality of RNA-seq data from cancer-derived samples.

    Popular science summary: Quality Control of Data from Breast Cancer Tumor Samples
    Breast cancer is the most common kind of cancer among women in Sweden. Challenges remain in improving long-term survival and personalizing the most effective treatments with the fewest side effects. To improve this situation, new techniques based on profiling the gene expression of individual tumors are being developed, one example being RNA sequencing. This project was about quality control for the data produced by RNA sequencing.

    Breast cancer is the most common kind of cancer among women in Sweden. While the short- to mid-term survival chances are good, the long-term survival chances are much worse, and thus certain patients require more personalized and effective therapies. Furthermore, a large number of women whose disease has a very good prognosis are likely being overtreated and thus suffer from unnecessary side effects. The South Sweden Cancerome Analysis Network - Breast Initiative (SCAN-B; http://scan.bmc.lu.se) aims at improving breast cancer outcomes by developing new diagnostics and treatment-predictive tests based on RNA sequencing (RNA-seq) of patient tumors. RNA-seq is a tool to determine the RNA sequences and their abundance in a sample. It can be used to analyze the specific characteristics of different tumors, such as gene expression levels and gene mutations. However, the technology is complex and includes many potential sources of noise. Quality control is needed to gain confidence in the obtained data.

    During this project a computational RNA-seq quality control pipeline was built and integrated into the existing SCAN-B RNA-seq analysis pipeline. This quality control pipeline was used to evaluate the quality of RNA-seq datasets from breast cancer tumor samples of ~600 patients according to a variety of metrics. These measure the quality of the sample prepared for sequencing, the sequencing process, and the basic analysis steps. The evaluation showed good overall quality of the data. While the quality of the data from the first sequenced samples is not optimal, quality has increased steadily and settled at a high level. Different problems that occurred during the sequencing process could be correlated with specific low metrics.

    However, care has to be taken when interpreting quality metrics from cancer-derived data. Cancer is a disease that arises from accumulated changes at the genome level. In principle, cancer-associated genomic changes can contribute to some poorer-appearing quality metrics, even when all steps involved in the RNA-seq worked correctly. Quality control is essential for the RNA-seq analysis process. The metrics used in this project provide good insight into the quality of the evaluated datasets. However, cancer cells feature a distinct genomic landscape which can make the interpretation of metrics difficult. Thus, care has to be taken when drawing conclusions about the quality of RNA-seq data from cancer-derived samples.

    Advisor: Lao Saal
    Master's Degree Project, 30 credits in Bioinformatics, 2013
    Department of Oncology, Lund University

    Remarkable similarities of chromosomal rearrangements between primary human breast cancers and matched distant metastases as revealed by whole-genome sequencing.

    To better understand and characterize chromosomal structural variation during breast cancer progression, we enumerated chromosomal rearrangements for 11 patients by performing low-coverage whole-genome sequencing of 11 primary breast tumors and their 13 matched distant metastases. The tumor genomes harbored a median of 85 (range 18-404) rearrangements per tumor, with a median of 82 (26-310) in primaries compared to 87 (18-404) in distant metastases. Concordance between paired tumors from the same patient was high, with a median of 89% of rearrangements shared (range 61-100%), whereas little overlap was found when comparing all possible pairings of tumors from different patients (median 3%). The tumors exhibited diverse genomic patterns of rearrangements: some carried events distributed throughout the genome, while others had events mostly within densely clustered chromothripsis-like foci at a few chromosomal locations. Regardless of the pattern, it was highly conserved between the primary tumor and metastases from the same patient. Rearrangements occurred more frequently in genic areas than expected by chance, and among the genes affected there was significant enrichment for cancer-associated genes, including disruption of TP53, RB1, PTEN, and ESR1, likely contributing to tumor development. Our findings are most consistent with chromosomal rearrangements being early events in breast cancer progression that remain stable during the development from primary tumor to distant metastasis.
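
The "percentage of rearrangements shared" between a primary tumor and a metastasis can be illustrated with a small sketch. The abstract does not define how the percentage was computed, so the version below is only one plausible reading: it assumes each rearrangement is represented by a hypothetical tuple of two breakpoints, that two events match only if their tuples are identical (real breakpoint calls would need a position tolerance), and that the denominator is the union of both sets. The function name `shared_fraction` is made up for this example.

```python
def shared_fraction(primary, metastasis):
    """Fraction of all distinct rearrangements that occur in both tumors.

    Assumption: exact tuple equality stands in for breakpoint matching,
    and the denominator is the union of the two sets.
    """
    primary, metastasis = set(primary), set(metastasis)
    union = primary | metastasis
    if not union:
        return 0.0  # no events in either tumor, nothing to share
    return len(primary & metastasis) / len(union)


if __name__ == "__main__":
    # Toy events (chrom_a, pos_a, chrom_b, pos_b); values are invented.
    a = ("chr1", 1_000, "chr5", 20_000)
    b = ("chr8", 5_500, "chr8", 9_900)
    c = ("chr17", 7_000, "chr20", 1_200)
    d = ("chr3", 4_400, "chrX", 8_800)
    print(f"{shared_fraction({a, b, c}, {b, c, d}):.0%} shared")
```

Using the union as the denominator treats both tumors symmetrically; dividing by the size of one tumor's set instead would answer the different question of how much of that tumor is retained in the other.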

    A crowdsourced set of curated structural variants for the human genome.

    Funder: U.S. Food and Drug Administration; funder-id: http://dx.doi.org/10.13039/100000038
    A high quality benchmark for small variants encompassing 88 to 90% of the reference genome has been developed for seven Genome in a Bottle (GIAB) reference samples. However, developing a reliable benchmark for large indels and structural variants (SVs) is more challenging. In this study, we manually curated 1235 SVs, which can ultimately be used to evaluate SV callers or train machine learning models. We developed a crowdsourcing app, SVCurator, to help GIAB curators manually review large indels and SVs within the human genome, and report their genotype and size accuracy. SVCurator displays images from short, long, and linked read sequencing data from the GIAB Ashkenazi Jewish Trio son [NIST RM 8391/HG002]. We asked curators to assign labels describing SV type (deletion or insertion), size accuracy, and genotype for 1235 putative insertions and deletions sampled from different size bins between 20 and 892,149 bp. 'Expert' curators were 93% concordant with each other, and 37 of the 61 curators had at least 78% concordance with a set of 'expert' curators. The curators were least concordant for complex SVs and SVs that had inaccurate breakpoints or size predictions. After filtering events with low concordance among curators, we produced high confidence labels for 935 events. The SVCurator crowdsourced labels were 94.5% concordant with the heuristic-based draft benchmark SV callset from GIAB. We found that curators can successfully evaluate putative SVs when given evidence from multiple sequencing technologies.
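
The two ideas in this abstract, measuring how often a curator agrees with a reference and keeping only events on which curators largely agree, can be sketched as follows. This is not SVCurator's code: the function names, the label strings, and the `min_agreement` threshold of 0.8 are all assumptions chosen for illustration, and the abstract does not state which cutoff the authors used.

```python
from collections import Counter


def concordance(curator_labels, reference_labels):
    """Share of events labelled by both parties on which the labels agree."""
    shared = [e for e in curator_labels if e in reference_labels]
    if not shared:
        return 0.0
    agree = sum(curator_labels[e] == reference_labels[e] for e in shared)
    return agree / len(shared)


def high_confidence_labels(votes, min_agreement=0.8):
    """Keep events whose most common label reaches min_agreement.

    votes maps an event id to the list of labels given by all curators.
    min_agreement is a hypothetical cutoff, not a value from the study.
    """
    kept = {}
    for event, labels in votes.items():
        if not labels:
            continue
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= min_agreement:
            kept[event] = label
    return kept


if __name__ == "__main__":
    # Invented toy data: event ids and labels are placeholders.
    curator = {"sv1": "DEL", "sv2": "INS", "sv3": "DEL"}
    expert = {"sv1": "DEL", "sv2": "DEL", "sv3": "DEL", "sv4": "INS"}
    print(round(concordance(curator, expert), 3))
    votes = {"sv1": ["DEL"] * 5, "sv2": ["DEL", "DEL", "INS", "INS", "DEL"]}
    print(high_confidence_labels(votes))
```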

    RNA Sequencing for Molecular Diagnostics in Breast Cancer

    Breast cancer is the most common type of cancer in women and, in Sweden, the second most deadly after lung cancer. While treatment and diagnostic options have improved in the past decades and short- to mid-term survival is good, long-term survival is much poorer. On the other hand, many women are likely cured by surgery and radiotherapy alone, but receive unnecessary adjuvant treatment leading to undesirable health-related and economic side-effects. Reliably differentiating high-risk from low-risk patients to provide optimal treatment remains a challenge.

    The Sweden Cancerome Analysis Network–Breast (SCAN-B) project was initiated in 2009 and aims to improve breast cancer outcomes by developing new diagnostics and treatment-predictive tests. Within SCAN-B, tumor material and blood are being biobanked and the transcriptomes of many thousands of breast tumors are being analyzed using RNA sequencing (RNA-seq). The resulting sample collection and dataset provide an unprecedented resource for research, and the information therein may harbor ways to improve prognosis and to predict tumor susceptibility or resistance to therapies.

    In the four original studies included in this thesis we explored the use of RNA-seq as a diagnostic tool within breast cancer. In study I we described the SCAN-B processes and protocols, and analyzed early data to show the feasibility of using RNA-seq as a diagnostic platform. We showed that the patient population enrolled in SCAN-B largely reflects the characteristics of the total breast cancer patient population and benchmarked RNA-seq against prior techniques. In study II we diagnosed problems in commonly used RNA-seq alignment software and described the development of a software tool to correct the problems and improve data usability. Study III focused on diagnostics for determining the status of the important breast cancer biomarkers ER, PgR, HER2, Ki67, and Nottingham histological grade. We assessed the reproducibility of histopathology in measuring these biomarkers, and developed new ways of predicting their status using RNA-seq-based gene expression. We showed that expression-based biomarkers add value to histopathology by improving prognostic possibilities. In study IV we focused on the prospects of using RNA-seq to detect mutations. We developed a new computational method to profile mutations and used it to describe the mutational landscape of thousands of patient tumors and its impact on patient survival. In particular, we identified mutations in a subset of patients that are known to confer resistance to standard treatments.

    The hope is that, together, the diagnostic results made possible by the studies herein may one day enable oncologists to adapt treatment plans accordingly and improve patient quality of life and outcome.

    TopHat-Recondition: A post-processor for TopHat unmapped reads

    Background: TopHat is a popular spliced junction mapper for RNA sequencing data, and writes files in the BAM format - the binary version of the Sequence Alignment/Map (SAM) format. BAM is the standard exchange format for aligned sequencing reads, thus correct format implementation is paramount for software interoperability and correct analysis. However, TopHat writes its unmapped reads in a way that is not compatible with other software that implements the SAM/BAM format. Results: We have developed TopHat-Recondition, a post-processor for TopHat unmapped reads that restores read information in the proper format. TopHat-Recondition thus enables downstream software to process the plethora of BAM files written by TopHat. Conclusions: TopHat-Recondition can repair unmapped read files written by TopHat and is freely available under a 2-clause BSD license on GitHub: https://github.com/cbrueffer/tophat-recondition

    Additional file 1 of TopHat-Recondition: a post-processor for TopHat unmapped reads

    Usage information and walk-through example. (PDF 166 kb)

    Biopython Project Update 2017

    The Biopython Project is a long-running distributed collaborative effort, supported by the Open Bioinformatics Foundation, which develops a freely available Python library for biological computation [1]. We present here details of the Biopython releases since BOSC 2016, namely Biopython 1.68, 1.69 and 1.70. Together these had 82 named contributors including 51 newcomers, which reflects our policy of trying to encourage even small contributions.

    Biopython 1.68 (August 2016) was a relatively small release, with the main new feature being support for RCSB’s new binary Macromolecular Transmission Format (MMTF) for structural data.

    Biopython 1.69 (April 2017) represents the start of our re-licensing plan, to transition away from our liberal but unique Biopython License Agreement to the similar but very widely used 3-Clause BSD License. We are reviewing the code base authorship file-by-file, in order to gradually dual license the entire project. Major new features include: a new parser for the ExPASy Cellosaurus cell line database, catalogue and ontology; support for the UCSC Multiple Alignment Format (MAF), FSA sequencing files, and version 4 of the Affymetrix CEL format; updates to the REBASE February 2017 restriction enzyme list; Bio.PDB.PDBList can now download more formats including MMTF; and enhanced PyPy support by taking advantage of NumPy and compiling most of the Biopython C code modules.

    Biopython 1.70 (July 2017) has internal changes to better support the now standard pip tool for Python package installation. Major new features include: support for Mauve’s eXtended Multi-FastA (XMFA) file format, updates to our BLAST XML and MEME parsers, ExPASy support, and phylogenetic distance matrices. This release is noteworthy for our new logo, contributed by Patrick Kunzmann. This draws on our original double helix logo, and the blue and yellow colors of the current Python logo.

    All releases fixed miscellaneous bugs, enhanced the test suite, and continued efforts to follow the PEP8 and PEP257 coding style guidelines, which are now checked automatically with GitHub-integrated continuous integration testing using TravisCI. We now also use AppVeyor for continuous integration testing under Windows. Current efforts include improving the unit test coverage, which is easily viewed online at CodeCov.io.

    Clinical associations of ESR2 (estrogen receptor beta) expression across thousands of primary breast tumors

    Estrogen receptor alpha (ERα, encoded by ESR1) is a well-characterized transcription factor expressed in more than 75% of breast tumors and is the key biomarker to direct endocrine therapies. On the other hand, much less is known about estrogen receptor beta (ERβ, encoded by ESR2) and its importance in cancer. Previous studies disagreed to some extent; however, most reports suggested a more favorable prognosis for patients with high ESR2 expression. To add further clarity to ESR2 in breast cancer, we interrogated a large population-based cohort of primary breast tumors (n = 3207) from the SCAN-B study. RNA-seq shows ESR2 is expressed at low levels overall with a slight inverse correlation to ESR1 expression (Spearman R = -0.18, p = 2.2e-16), and highest ESR2 expression in the basal- and normal-like PAM50 subtypes. ESR2-high tumors had favorable overall survival (p = 0.006), particularly in subgroups receiving endocrine therapy (p = 0.03) and in triple-negative breast cancer (p = 0.01). These results were generally robust in multivariable analyses accounting for patient age, tumor size, node status, and grade. Gene modules consistent with immune response were associated with ESR2-high tumors. Taken together, our results indicate that ESR2 is generally expressed at low levels in breast cancer but associated with improved overall survival and may be related to immune response modulation.
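
For readers unfamiliar with the Spearman coefficient quoted above, the following is a minimal standard-library sketch of how it is defined: both variables are converted to ranks (ties receive the average of the ranks they span) and the Pearson correlation of the ranks is taken. It is a generic illustration, not the authors' analysis code; the function names are chosen for this example, and the expression values in the usage section are invented and unrelated to the SCAN-B data.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


if __name__ == "__main__":
    # Invented expression values for five hypothetical tumors.
    esr1 = [120.0, 85.5, 3.2, 240.1, 60.0]
    esr2 = [0.4, 0.9, 1.5, 0.2, 0.8]
    print(spearman(esr1, esr2))
```

Because only ranks enter the calculation, the coefficient is insensitive to monotone transformations such as taking logarithms of expression values, which is one reason it is a common choice for skewed expression data.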

    Features of increased malignancy in eosinophilic clear cell renal cell carcinoma

    Clear cell renal cell carcinoma (ccRCC) is the most common form of renal cancer. Due to inactivation of the von Hippel Lindau tumour suppressor, the hypoxia inducible transcription factors (HIFs) are constitutively activated in these tumours, resulting in a pseudo‐hypoxic phenotype. The HIFs induce the expression of genes involved in angiogenesis and cell survival, but they also reset the cellular metabolism to protect cells from oxygen and nutrient deprivation. ccRCC tumours are highly vascularized and the cytoplasm of the cancer cells is filled with lipid droplets and glycogen resulting in the histologically distinctive pale (clear) cytoplasm. Intratumoural heterogeneity may occur, and in some tumours, areas with granular, eosinophilic cytoplasm are found. Little is known regarding these traits and how they relate to the coexistent clear cell component, yet eosinophilic ccRCC is associated with higher grade and clinically more aggressive tumours. In this study we have for the first time performed RNA sequencing comparing histologically verified clear cell and eosinophilic areas from ccRCC tissue, aiming to analyse the characteristics of these cell types. Findings from RNA sequencing were confirmed by immunohistochemical staining of biphasic ccRCC. We found that the eosinophilic phenotype displayed a higher proliferative drive and lower differentiation, and we confirmed a correlation to tumours of higher stage. We further identified mutations of the tumour suppressor p53 (TP53) exclusively in the eosinophilic ccRCC component, where mTORC1 activity was also elevated. Also, eosinophilic areas were less vascularized, yet harboured more abundant infiltrating immune cells. The cytoplasm of clear cell ccRCC cells was filled with lipids but had very low mitochondrial content while the reverse was found in eosinophilic tissue. We herein suggest possible transcriptional mechanisms behind these phenomena.