110 research outputs found

    Optimized pipeline of MuTect and GATK tools to improve the detection of somatic single nucleotide polymorphisms in whole-exome sequencing data

    Background: Detecting somatic mutations in whole exome sequencing data of cancer samples has become a popular approach for profiling cancer development, progression and chemotherapy resistance. Several studies have proposed software packages, filters and parametrizations. However, many research groups reported low concordance among different methods. We aimed to develop a pipeline which detects a wide range of single nucleotide mutations with high validation rates. We combined two standard tools, Genome Analysis Toolkit (GATK) and MuTect, to create the GATK-LODN method. As proof of principle, we applied our pipeline to exome sequencing data of hematological (Acute Myeloid and Acute Lymphoblastic Leukemias) and solid (Gastrointestinal Stromal Tumor and Lung Adenocarcinoma) tumors. We performed experiments on simulated data to test the sensitivity and specificity of our pipeline. Results: The software MuTect presented the highest validation rate (90%) for mutation detection, but detected a limited number of somatic mutations. The GATK detected a high number of mutations but with low specificity. The GATK-LODN increased the performance of the GATK variant detection (from 5 of 14 to 3 of 4 confirmed variants), while preserving mutations not detected by MuTect. However, GATK-LODN filtered more variants in the hematological samples than in the solid tumors. Experiments on simulated data demonstrated that GATK-LODN increased both specificity and sensitivity of GATK results. Conclusion: We presented a pipeline that detects a wide range of somatic single nucleotide variants, with good validation rates, from exome sequencing data of cancer samples. We also showed the advantage of combining standard algorithms to create the GATK-LODN method, which increased specificity and sensitivity of GATK results. This pipeline can be helpful in discovery studies aimed at profiling the somatic mutational landscape of cancer genomes.
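
The validation figures quoted in this abstract can be turned into rates with simple arithmetic. The following is a minimal sketch, assuming "5 of 14" and "3 of 4" are counts of confirmed variants over tested variants; the function name `validation_rate` is hypothetical and not part of the published pipeline.

```python
def validation_rate(confirmed: int, tested: int) -> float:
    """Return the fraction of tested variant calls that were confirmed."""
    if tested <= 0:
        raise ValueError("tested must be a positive count")
    if not 0 <= confirmed <= tested:
        raise ValueError("confirmed must lie between 0 and tested")
    return confirmed / tested


# Counts taken from the abstract: GATK alone versus GATK-LODN.
gatk_rate = validation_rate(5, 14)
gatk_lodn_rate = validation_rate(3, 4)

print(f"GATK: {gatk_rate:.1%}, GATK-LODN: {gatk_lodn_rate:.1%}")
```

Both rates rest on very few validated calls (14 and 4), so the comparison is indicative rather than precise.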

    BITS 2015: The annual meeting of the Italian Society of Bioinformatics

    This preface introduces the content of the BioMed Central journal Supplements related to the BITS 2015 meeting, held in Milan, Italy, from the 3rd to the 5th of June, 2015.

    Convergent Evolution of Copy Number Alterations in Multi-Centric Hepatocellular Carcinoma

    In recent years, new molecular methods have been proposed to discriminate multicentric hepatocellular carcinomas (HCC) from intrahepatic metastases. Some of these methods utilize sequencing data to assess similarities between cancer genomes, whilst others achieved the same results with transcriptome and methylome data. Here, we attempt to classify two HCC patients with multi-centric disease using the recall-rates of somatic mutations but find that difficult because their tumors share some chromosome-scale copy-number alterations (CNAs) but little-to-no single-nucleotide variants. To resolve the apparent conundrum, we apply a phasing strategy to test if those shared CNAs are identical by descent. Our findings suggest that the conflicting alterations occur on different homologous chromosomes, which argues for a multi-centric origin of the respective HCCs.

    New Approaches for the Molecular Profiling of Human Cancers through Omics Data Analysis

    In this thesis, we present three studies in which we applied ad hoc computational methods for the molecular profiling of human cancers using omics data. In the first study our main goal was to develop an analysis pipeline able to detect a wide range of single nucleotide mutations with high validation rates. We combined two standard tools to create the GATK-LODN method, and we applied our pipeline to exome sequencing data of hematological and solid tumors. We created simulated datasets and performed experimental validation to test the pipeline sensitivity and specificity. In the second study we characterized the gene expression profiles of 11 tumor types, aiming to discover multi-tumor drug targets and new strategies of drug combination and repurposing. We clustered tumors and applied a network-based analysis to integrate gene expression and protein interaction information. We defined three multi-tumor gene signatures, characterized by the following categories: NF-KB signaling, chromosomal instability, ubiquitin-proteasome system, DNA metabolism, and apoptosis. We evaluated the gene signatures based on mutational, pharmacological and clinical evidence. Moreover, we defined new pharmacological strategies validated by in vitro experiments that showed inhibition of cell growth in two tumor cell lines. In the third study we evaluated thyroid gene expression profiles of normal, Papillary Thyroid Carcinoma (PTC) and Anaplastic Thyroid Carcinoma (ATC) samples. The samples grouped in a progressive trend according to tissue type; the main biological processes affected in the normal to PTC transition were related to extracellular matrix and cell morphology, and those affected in the PTC to ATC transition were related to the control of the cell cycle. We defined signatures related to each step of tumor progression and mapped the signatures onto protein-protein interaction and transcriptional regulatory networks to prioritize genes for subsequent experimental validation.

    Review of state-of-the-art algorithms for genomics data analysis pipelines

    The advent of big data and advanced genomic sequencing technologies has presented challenges in terms of data processing for clinical use. The complexity of detecting and interpreting genetic variants, coupled with the vast array of tools and algorithms and the heavy computational workload, has made the development of comprehensive genomic analysis platforms crucial to enabling clinicians to quickly provide patients with genetic results. This chapter reviews and describes the pipeline for analyzing massive genomic data using both short-read and long-read technologies, discussing the current state of the main tools used at each stage and the role of artificial intelligence in their development. It also introduces DeepNGS (deepngs.eu), an end-to-end genomic analysis web platform, including its key features and applications.

    Mutational landscape of EGFR-, MYC-, and Kras-driven genetically engineered mouse models of lung adenocarcinoma

    Genetically engineered mouse models (GEMMs) of cancer are increasingly being used to assess putative driver mutations identified by large-scale sequencing of human cancer genomes. To accurately interpret experiments that introduce additional mutations, an understanding of the somatic genetic profile and evolution of GEMM tumors is necessary. Here, we performed whole-exome sequencing of tumors from three GEMMs of lung adenocarcinoma driven by mutant epidermal growth factor receptor (EGFR), mutant Kirsten rat sarcoma viral oncogene homolog (Kras), or overexpression of MYC proto-oncogene. Tumors from EGFR- and Kras-driven models exhibited, respectively, 0.02 and 0.07 nonsynonymous mutations per megabase, a dramatically lower average mutational frequency than observed in human lung adenocarcinomas. Tumors from models driven by strong cancer drivers (mutant EGFR and Kras) harbored few mutations in known cancer genes, whereas tumors driven by MYC, a weaker initiating oncogene in the murine lung, acquired recurrent clonal oncogenic Kras mutations. In addition, although EGFR- and Kras-driven models both exhibited recurrent whole-chromosome DNA copy number alterations, the specific chromosomes altered by gain or loss were different in each model. These data demonstrate that GEMM tumors exhibit relatively simple somatic genotypes compared with human cancers of a similar type, making these autochthonous model systems useful for additive engineering approaches to assess the potential of novel mutations on tumorigenesis, cancer progression, and drug sensitivity
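
The per-megabase mutation frequencies reported in this abstract are normalized counts. The following is a minimal sketch of that normalization; the function name and the example counts (1 mutation over 50 Mb of covered sequence) are hypothetical illustrations and are not taken from the study.

```python
def mutations_per_megabase(n_mutations: int, covered_bases: int) -> float:
    """Normalize a mutation count by the number of sequenced bases, in Mb."""
    if covered_bases <= 0:
        raise ValueError("covered_bases must be positive")
    if n_mutations < 0:
        raise ValueError("n_mutations must be non-negative")
    return n_mutations / (covered_bases / 1_000_000)


# Hypothetical example: one nonsynonymous mutation across 50 Mb of exome.
rate = mutations_per_megabase(1, 50_000_000)
print(f"{rate:.2f} mutations per Mb")
```

The denominator should be the number of bases actually covered well enough to call variants, which may be smaller than the nominal capture target.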

    Advancing Personalized Medicine Through the Application of Whole Exome Sequencing and Big Data Analytics

    There is growing attention toward personalized medicine. This is led by a fundamental shift from the ‘one size fits all’ paradigm for treatment of patients with conditions or predisposition to diseases, to one that embraces novel approaches, such as tailored target therapies, to achieve the best possible outcomes. Driven by this shift, several national and international genome projects have been initiated to reap the benefits of personalized medicine. Exome and targeted sequencing provide a balance between cost and benefit, in contrast to whole genome sequencing (WGS). Whole exome sequencing (WES) targets approximately 3% of the whole genome, which is the basis for protein-coding genes. Nonetheless, it has the characteristics of big data in large deployment. Herein, the application of WES and its relevance in advancing personalized medicine is reviewed. WES is mapped to the Big Data “10 Vs” and the resulting challenges are discussed. Applications of existing biological databases and bioinformatics tools to address the bottleneck in data processing and analysis are presented, including the need for new generation big data analytics for the multi-omics challenges of personalized medicine. This includes the incorporation of artificial intelligence (AI) in the clinical utility landscape of genomic information, and future considerations to create a new frontier toward advancing the field of personalized medicine.

    ISOWN: accurate somatic mutation identification in the absence of normal tissue controls.

    Background: A key step in cancer genome analysis is the identification of somatic mutations in the tumor. This is typically done by comparing the genome of the tumor to the reference genome sequence derived from a normal tissue taken from the same donor. However, there are a variety of common scenarios in which matched normal tissue is not available for comparison. Results: In this work, we describe an algorithm to distinguish somatic single nucleotide variants (SNVs) in next-generation sequencing data from germline polymorphisms in the absence of normal samples using a machine learning approach. Our algorithm was evaluated using a family of supervised learning classifications across six different cancer types and ~1600 samples, including cell lines, fresh frozen tissues, and formalin-fixed paraffin-embedded tissues; we tested our algorithm with both deep targeted and whole-exome sequencing data. Our algorithm correctly classified between 95 and 98% of somatic mutations, with F1-measures ranging from 75.9 to 98.6% depending on the tumor type. We have released the algorithm as a software package called ISOWN (Identification of SOmatic mutations Without matching Normal tissues). Conclusions: In this work, we describe the development, implementation, and validation of ISOWN, an accurate algorithm for predicting somatic mutations in cancer tissues in the absence of matching normal tissues. ISOWN is available as Open Source under Apache License 2.0 from https://github.com/ikalatskaya/ISOWN
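
The F1-measure cited in this abstract is the harmonic mean of precision and recall. The following is a minimal sketch of the standard definition, not ISOWN's own code; the function name and the example counts are hypothetical.

```python
def f1_measure(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Harmonic mean of precision and recall, computed from raw counts."""
    if min(true_pos, false_pos, false_neg) < 0:
        raise ValueError("counts must be non-negative")
    denominator = 2 * true_pos + false_pos + false_neg
    if denominator == 0:
        return 0.0  # no positives predicted or present; F1 is taken as 0
    return 2 * true_pos / denominator


# Hypothetical counts: 90 somatic SNVs called correctly, 10 germline
# variants mislabeled as somatic, 30 somatic SNVs missed.
f1 = f1_measure(true_pos=90, false_pos=10, false_neg=30)
print(f"F1 = {f1:.3f}")
```

Because F1 penalizes both false positives and false negatives, a classifier can label most somatic mutations correctly and still score a lower F1 when many germline variants are misclassified as somatic.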

    RNA-seq based SNP discovery in gluteus medius muscle of Polish Landrace pigs

    Background: Single nucleotide polymorphisms (SNPs) are well-known molecular markers in genetics and breeding studies applied to veterinary sciences and livestock production. Advances in next generation sequencing (NGS) provide a high-throughput means of putative SNP discovery. The aim of the study was to identify putative genetic variants in the gluteus medius muscle transcriptome of Polish Landrace pigs. Methods: An RNA-seq based NGS experiment was performed on Polish Landrace pigs fed with omega-6 and omega-3 polyunsaturated fatty acids (PUFAs) and normal diets. Total RNA was isolated from gluteus medius muscle of the low PUFAs (n=6) and high PUFAs dietary groups of Polish Landrace pigs. The RNA-seq libraries were constructed by mRNA enrichment, mRNA fragmentation, second strand cDNA synthesis, adaptor ligation, size selection and PCR amplification using the Illumina TruSeq RNA Sample Prep Kit v2 (Illumina, San Diego CA, USA), followed by NGS sequencing on the Illumina MiSeq platform. Quality control of the raw RNA-seq data was performed using the Trimmomatic and FastQC tools. High-quality paired-end RNA-seq data of the gluteus medius muscle transcriptome were mapped to the reference genome Sus scrofa v.10.2. Finally, SNP discovery was performed using the GATK and SAMtools SNP caller tools. Results: The FASTQ RNA-seq data generated from two pooled paired-end libraries (151bp) of gluteus medius muscle tissue of Polish Landrace pigs were submitted to the NCBI SRA database (https://www.ncbi.nlm.nih.gov/sra). The study identified a total of 50.5 million paired-end reads (32.5 million in the low PUFAs dietary group and 18 million reads in the high PUFAs dietary group) of the gluteus medius muscle transcriptome of Polish Landrace pigs. SNP discovery identified a total of 35436 homozygous and 28644 heterozygous cSNPs in gluteus medius muscle transcriptomes representing both dietary groups of Polish Landrace pigs. Moreover, a total of 25187 and 5488 cSNPs were identified as synonymous SNPs, and 18005 and 4780 cSNPs were identified as nonsynonymous SNPs. Finally, single nucleotide variations (SNVs) representing substitutions of all four possibilities (A,T,G,C) were identified, ranging from 2935 to 3227 SNVs (high PUFAs) and 3528 to 3882 SNVs (low PUFAs) for the heterozygous cSNPs and 2712 to 4058 (high PUFAs) and 4169 to 5692 SNVs (low PUFAs) for the heterozygous SNPs in gluteus medius muscle transcriptomes of Polish Landrace pigs. Conclusions: The study concluded that identification of a cSNP dataset representing the gluteus medius muscle transcriptome of Polish Landrace pigs fed with a control diet (low) and pigs fed with a PUFAs diet (high) may be helpful to develop a new set of genetic markers specific to the Polish Landrace pig breed. Such cSNP markers can eventually be utilized in genome-wide association studies (GWAS) and implemented in marker assisted selection (MAS) and genomic selection (GS) programs in the active breeding population of Polish Landrace pigs in Poland.

    Virmid: accurate detection of somatic mutations with sample impurity inference

    Detection of somatic variation using sequence from disease-control matched data sets is a critical first step. In many cases including cancer, however, it is hard to isolate pure disease tissue, and the impurity hinders accurate mutation analysis by disrupting overall allele frequencies. Here, we propose a new method, Virmid, that explicitly determines the level of impurity in the sample, and uses it for improved detection of somatic variation. Extensive tests on simulated and real sequencing data from breast cancer and hemimegalencephaly demonstrate the power of our model. A software implementation of our method is available at http://sourceforge.net/projects/virmid/
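
The effect of sample impurity on allele frequencies described in this abstract can be illustrated with a simple mixture calculation. The following is a minimal sketch, assuming a diploid region, a heterozygous somatic mutation present in every disease cell, and an impurity `alpha` defined as the fraction of control cells in the disease sample. It is a simplified illustration, not Virmid's actual statistical model, and both function names are hypothetical.

```python
def expected_vaf(alpha: float, mutant_copies: int = 1, ploidy: int = 2) -> float:
    """Expected variant allele frequency when a fraction alpha of the
    sample consists of control cells that lack the mutation."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * mutant_copies / ploidy


def impurity_from_vaf(vaf: float, mutant_copies: int = 1, ploidy: int = 2) -> float:
    """Invert expected_vaf: estimate alpha from an observed frequency."""
    return 1.0 - vaf * ploidy / mutant_copies


# Hypothetical sample with 40% control-cell contamination.
vaf = expected_vaf(alpha=0.4)
print(f"Expected VAF: {vaf:.2f}")
print(f"Recovered impurity: {impurity_from_vaf(vaf):.2f}")
```

Under these assumptions a heterozygous somatic variant appears at a frequency well below the 0.5 expected in a pure sample, which is why callers that ignore impurity can miss it.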