    Discovery of potential causative mutations in human coding and noncoding genome with the interactive software BasePlayer

    Next-generation sequencing (NGS) is routinely applied in life sciences and clinical practice, but interpretation of the massive quantities of genomic data produced has become a critical challenge. The genome-wide mutation analyses enabled by NGS have had a revolutionary impact in revealing the predisposing and driving DNA alterations behind a multitude of disorders. The workflow to identify causative mutations from NGS data, for example in cancer and rare diseases, commonly involves phases such as quality filtering, case-control comparison, genome annotation, and visual validation, which require multiple processing steps and the use of various tools and scripts. To this end, we have introduced an interactive, user-friendly, multi-platform software, BasePlayer, which allows scientists, regardless of bioinformatics training, to carry out variant analysis in disease genetics settings. A genome-wide scan of regulatory regions for mutation clusters can be carried out on a desktop computer in about 10 min with a dataset of 3 million somatic variants in 200 whole-genome-sequenced (WGS) cancers. Peer reviewed
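
The abstract does not describe how BasePlayer detects mutation clusters, so the following is only a minimal sketch of one common approach: a sliding window over sorted variant positions on a single chromosome. The function name `find_clusters` and the parameters `window` and `min_count` are assumptions made for illustration and are not taken from BasePlayer.

```python
def find_clusters(positions, window, min_count):
    """Return merged (start, end) intervals in which at least `min_count`
    variants fall within a span shorter than `window` base pairs.

    Illustrative only: `window` must be positive, and positions are
    assumed to come from one chromosome.
    """
    pos = sorted(positions)
    clusters = []  # list of [start, end], kept sorted and non-overlapping
    left = 0
    for right in range(len(pos)):
        # Shrink the window from the left until it spans < `window` bp.
        while pos[right] - pos[left] >= window:
            left += 1
        if right - left + 1 >= min_count:
            start, end = pos[left], pos[right]
            if clusters and start <= clusters[-1][1]:
                # Overlaps the previous cluster: extend it.
                clusters[-1][1] = end
            else:
                clusters.append([start, end])
    return [tuple(c) for c in clusters]


# Hypothetical somatic variant positions on one chromosome.
example_positions = [100, 120, 130, 5000, 9000, 9010, 9050, 9060]
example_clusters = find_clusters(example_positions, window=100, min_count=3)
```

Because the positions are sorted once and each pointer only moves forward, the scan is linear after sorting, which suits interactive use on millions of variants.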

    Bioinformatics for personal genomics: development and application of bioinformatic procedures for the analysis of genomic data

    In the last decade, the sharp decrease in sequencing costs brought about by high-throughput technologies has completely changed the approach to genetic problems. In particular, whole-exome and whole-genome sequencing are contributing to extraordinary progress in the study of human variants, opening up new perspectives in personalized medicine. Because the field is relatively new and fast-developing, appropriate tools and specialized knowledge are required for efficient data production and analysis. Accordingly, in 2014 the University of Padua funded the BioInfoGen Strategic Project with the goal of developing technology and expertise in bioinformatics and molecular biology applied to personal genomics. The aim of my PhD was to contribute to this challenge by implementing a series of innovative tools and applying them to investigate, and possibly solve, the case studies included in the project. I first developed an automated pipeline for Illumina data that sequentially performs each step needed to go from raw reads to somatic or germline variant detection. The system's performance was tested with internal controls and by applying it to a cohort of patients affected by gastric cancer, obtaining interesting results. Once variants are called, they must be annotated to define properties such as their position at the transcript and protein level, their impact on the protein sequence, their pathogenicity, and more. As most publicly available annotators were affected by systematic errors that caused low consistency in the final annotation, I implemented VarPred, a new variant annotation tool that achieves the best accuracy (>99%) among the state-of-the-art programs compared, with good processing times. To make VarPred easy to use, I equipped it with an intuitive web interface that allows not only graphical evaluation of results but also a simple filtering strategy.
Furthermore, for user-driven prioritization of human genetic variants, I developed QueryOR, a web platform suitable for searching among known candidate genes as well as for finding novel gene-disease associations. QueryOR combines several innovative features that make it comprehensive, flexible and easy to use. Prioritization is achieved by a global positive selection process that promotes the emergence of the most reliable variants, rather than filtering out those that do not satisfy the applied criteria. QueryOR has been used to analyze the two case studies framed within the BioInfoGen project. In particular, it made it possible to detect causative variants in patients affected by lysosomal storage diseases, also highlighting the efficacy of the designed sequencing panel. In addition, QueryOR simplified the recognition of the LRP2 gene as a possible candidate to explain subjects with a Dent disease-like phenotype but no mutation in the previously identified disease-associated genes, CLCN5 and OCRL. As a final corollary, an extensive analysis of recurrent exome variants was performed, showing that their origin can mainly be explained by inaccuracies in the reference genome, including misassembled regions and uncorrected bases, rather than by platform-specific errors.

    CellPhy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data

    We introduce CellPhy, a maximum likelihood framework for inferring phylogenetic trees from somatic single-cell single-nucleotide variants. CellPhy leverages a finite-site Markov genotype model with 16 diploid states and considers amplification error and allelic dropout. We implement CellPhy into RAxML-NG, a widely used phylogenetic inference package that provides statistical confidence measurements and scales well on large datasets with hundreds or thousands of cells. Comprehensive simulations suggest that CellPhy is more robust to single-cell genomics errors and outperforms state-of-the-art methods under realistic scenarios, both in accuracy and speed. European Research Council | Ref. ERC-617457- PHYLOCANCER; Agencia Estatal de Investigación | Ref. PID2019-106247GB-I00; Fundação para a Ciência e a Tecnologia | Ref. PTDC/BIA-EVL/32030/2017; Xunta de Galicia
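
To make the "16 diploid states" concrete, the sketch below enumerates the ordered pairs of the four nucleotides and applies a deliberately simplified allelic dropout model. The dropout model here is an assumption for illustration (with probability `ado_rate` exactly one allele is lost and the surviving allele is read as homozygous); it is not CellPhy's actual error model, and the function names are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def diploid_states():
    """All ordered pairs of nucleotides: 4 x 4 = 16 diploid genotypes."""
    return [a + b for a, b in product(NUCLEOTIDES, repeat=2)]


def dropout_distribution(genotype, ado_rate):
    """Toy observation model for allelic dropout (ADO).

    Assumption: with probability `ado_rate` one of the two alleles
    (chosen uniformly) drops out, and the survivor is observed as a
    homozygous genotype. Otherwise the true genotype is observed.
    """
    first, second = genotype
    dist = {genotype: 1.0 - ado_rate}
    for survivor in (first, second):
        observed = survivor * 2  # e.g. "A" -> "AA"
        dist[observed] = dist.get(observed, 0.0) + ado_rate / 2
    return dist


states = diploid_states()
het_example = dropout_distribution("AC", ado_rate=0.2)
hom_example = dropout_distribution("GG", ado_rate=0.2)
```

Under this toy model a heterozygous site can be misread as either homozygote, while a homozygous site is unaffected, which is why dropout has to be modelled explicitly when inferring trees from single-cell genotypes.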

    CellPhy: accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data

    We introduce CellPhy, a maximum likelihood framework for inferring phylogenetic trees from somatic single-cell single-nucleotide variants. CellPhy leverages a finite-site Markov genotype model with 16 diploid states and considers amplification error and allelic dropout. We implement CellPhy into RAxML-NG, a widely used phylogenetic inference package that provides statistical confidence measurements and scales well on large datasets with hundreds or thousands of cells. Comprehensive simulations suggest that CellPhy is more robust to single-cell genomics errors and outperforms state-of-the-art methods under realistic scenarios, both in accuracy and speed. CellPhy is freely available a

    Client applications and Server Side docker for management of RNASeq and/or VariantSeq workflows and pipelines of the GPRO Suite

    The GPRO suite is an in-progress bioinformatic project for -omic data analyses. As part of the continued growth of this project, we introduce a client-side and server-side solution for comparative transcriptomics and analysis of variants. The client side consists of two Java applications called "RNASeq" and "VariantSeq" to manage workflows for RNA-seq and Variant-seq analysis, respectively, based on the most common command line interface tools for each topic. Both applications are coupled with a Linux server infrastructure (named GPRO Server Side) that hosts all dependencies of each application (scripts, databases, and command line interface tools). Implementation of the server side requires a Linux operating system, PHP, SQL, Python, bash scripting, and third-party software. The GPRO Server Side can be deployed via a Docker container that can be installed on the user's PC under any operating system or on remote servers as a cloud solution. The two applications are available as desktop and cloud applications and provide two execution modes: a Step-by-Step mode enables each step of a workflow to be executed independently, and a Pipeline mode allows all steps to be run sequentially. The two applications also feature an experimental support system called GENIE that consists of a virtual chatbot/assistant and a pipeline jobs panel coupled with an expert system. The chatbot can troubleshoot issues with the usage of each tool, the pipeline jobs panel provides information about the status of each task executed in the GPRO Server Side, and the expert system provides the user with a potential recommendation to identify or fix failed analyses. The two applications and the GPRO Server Side combine the user-friendliness and security of client software with the efficiency of front-end and back-end solutions to manage command line interface software for RNA-seq and variant-seq analysis via interface environments.
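
The two execution modes described above can be sketched in a few lines. This is a generic illustration and not GPRO code: the `Workflow` class, the step names (`trim`, `align`, `call_variants`) and the dictionary passed between steps are all hypothetical stand-ins for real command line tools.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, str]
Step = Callable[[State], State]


# Hypothetical steps: each one derives a new file name from the previous one.
def trim(state: State) -> State:
    return {**state, "trimmed": state["reads"] + ".trimmed"}


def align(state: State) -> State:
    return {**state, "bam": state["trimmed"] + ".bam"}


def call_variants(state: State) -> State:
    return {**state, "vcf": state["bam"] + ".vcf"}


class Workflow:
    def __init__(self, steps: List[Tuple[str, Step]]):
        self.steps = list(steps)

    def run_step(self, name: str, state: State) -> State:
        """Step-by-step mode: execute one named step on its own."""
        for step_name, func in self.steps:
            if step_name == name:
                return func(state)
        raise KeyError(name)

    def run_pipeline(self, state: State) -> State:
        """Pipeline mode: execute every step in order."""
        for _, func in self.steps:
            state = func(state)
        return state


workflow = Workflow([("trim", trim), ("align", align),
                     ("call_variants", call_variants)])
single = workflow.run_step("trim", {"reads": "sample"})
full = workflow.run_pipeline({"reads": "sample"})
```

Keeping each step a plain function of the shared state is what lets the same definitions serve both modes: a user can rerun one failed step, or chain them all.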

    Client Applications and Server-Side Docker for Management of RNASeq and/or VariantSeq Workflows and Pipelines of the GPRO Suite

    The GPRO suite is an in-progress bioinformatic project for -omics data analysis. As part of the continued growth of this project, we introduce a client- and server-side solution for comparative transcriptomics and analysis of variants. The client-side consists of two Java applications called 'RNASeq' and 'VariantSeq' to manage pipelines and workflows based on the most common command line interface tools for RNA-seq and Variant-seq analysis, respectively. As such, 'RNASeq' and 'VariantSeq' are coupled with a Linux server infrastructure (named GPRO Server-Side) that hosts all dependencies of each application (scripts, databases, and command line interface software). Implementation of the Server-Side requires a Linux operating system, PHP, SQL, Python, bash scripting, and third-party software. The GPRO Server-Side can be installed, via a Docker container, in the user's PC under any operating system or on remote servers, as a cloud solution. 'RNASeq' and 'VariantSeq' are both available as desktop (RCP compilation) and web (RAP compilation) applications. Each application has two execution modes: a step-by-step mode enables each step of the workflow to be executed independently, and a pipeline mode allows all steps to be run sequentially. 'RNASeq' and 'VariantSeq' also feature an experimental, online support system called GENIE that consists of a virtual (chatbot) assistant and a pipeline jobs panel coupled with an expert system. The chatbot can troubleshoot issues with the usage of each tool, the pipeline jobs panel provides information about the status of each computational job executed in the GPRO Server-Side, while the expert system provides the user with a potential recommendation to identify or fix failed analyses. 
Our solution is a ready-to-use topic specific platform that combines the user-friendliness, robustness, and security of desktop software, with the efficiency of cloud/web applications to manage pipelines and workflows based on command line interface software

    Client applications and server-side docker for management of RNASeq and/or VariantSeq workflows and pipelines of the GPRO suite

    The GPRO suite is an in-progress bioinformatic project for -omics data analysis. As part of the continued growth of this project, we introduce a client- and server-side solution for comparative transcriptomics and analysis of variants. The client-side consists of two Java applications called “RNASeq” and “VariantSeq” to manage pipelines and workflows based on the most common command line interface tools for RNA-seq and Variant-seq analysis, respectively. As such, “RNASeq” and “VariantSeq” are coupled with a Linux server infrastructure (named GPRO Server-Side) that hosts all dependencies of each application (scripts, databases, and command line interface software). Implementation of the Server-Side requires a Linux operating system, PHP, SQL, Python, bash scripting, and third-party software. The GPRO Server-Side can be installed, via a Docker container, in the user’s PC under any operating system or on remote servers, as a cloud solution. “RNASeq” and “VariantSeq” are both available as desktop (RCP compilation) and web (RAP compilation) applications. Each application has two execution modes: a step-by-step mode enables each step of the workflow to be executed independently, and a pipeline mode allows all steps to be run sequentially. “RNASeq” and “VariantSeq” also feature an experimental, online support system called GENIE that consists of a virtual (chatbot) assistant and a pipeline jobs panel coupled with an expert system. The chatbot can troubleshoot issues with the usage of each tool, the pipeline jobs panel provides information about the status of each computational job executed in the GPRO Server-Side, while the expert system provides the user with a potential recommendation to identify or fix failed analyses. 
Our solution is a ready-to-use topic specific platform that combines the user-friendliness, robustness, and security of desktop software, with the efficiency of cloud/web applications to manage pipelines and workflows based on command line interface software. This work was supported by the Marie Sklodowska-Curie OPATHY project grant agreement 642095, the pre-doctoral research fellowship from MINECO Industrial Doctorates (Grant 659 DI-17-09134); Grant TSI-100903-2019-11 from the Secretary of State for Digital Advancement from Ministry of Economic Affairs and Digital Transformation, Spain; the Expedient IDI-2021-158274-a from the Ministry of Science and Innovation, Spain; and the ThinkInAzul program supported by MCIN with funding from European Union NextGenerationEU (PRTR-C17.I1) and Generalitat Valenciana (THINKINAZUL/2021/024). Peer reviewed. Article signed by 18 authors: Ahmed Ibrahem Hafez, Beatriz Soriano, Aya Allah Elsayed, Ricardo Futami, Raquel Ceprian, Ricardo Ramos-Ruiz, Genis Martinez, Francisco Jose Roig, Miguel Angel Torres-Font, Fernando Naya-Catala, Josep Alvar Calduch-Giner, Lucia Trilla-Fuertes, Angelo Gamez Pozo, Vicente Arnau, Jose Maria Sempere-Luna, Jaume Perez-Sanchez, Toni Gabaldon and Carlos Llorens. Postprint (published version)

    Best practices for bioinformatic characterization of neoantigens for clinical utility

    Neoantigens are newly formed peptides created from somatic mutations that are capable of inducing tumor-specific T cell recognition. Recently, researchers and clinicians have leveraged next generation sequencing technologies to identify neoantigens and to create personalized immunotherapies for cancer treatment. To create a personalized cancer vaccine, neoantigens must be computationally predicted from matched tumor-normal sequencing data, and then ranked according to their predicted capability in stimulating a T cell response. This candidate neoantigen prediction process involves multiple steps, including somatic mutation identification, HLA typing, peptide processing, and peptide-MHC binding prediction. The general workflow has been utilized for many preclinical and clinical trials, but there is no current consensus approach and few established best practices. In this article, we review recent discoveries, summarize the available computational tools, and provide analysis considerations for each step, including neoantigen prediction, prioritization, delivery, and validation methods. In addition to reviewing the current state of neoantigen analysis, we provide practical guidance, specific recommendations, and extensive discussion of critical concepts and points of confusion in the practice of neoantigen characterization for clinical use. Finally, we outline necessary areas of development, including the need to improve HLA class II typing accuracy, to expand software support for diverse neoantigen sources, and to incorporate clinical response data to improve neoantigen prediction algorithms. The ultimate goal of neoantigen characterization workflows is to create personalized vaccines that improve patient outcomes in diverse cancer types

    Computational and statistical methods for analysing DNA sequencing data, with applications to data from the Estonian Genome Center of the University of Tartu

    The electronic version of the dissertation does not include the publications. Today, methods based on next-generation sequencing (NGS) make it possible to determine human genome sequences in large cohorts. This produces very large volumes of data, which pose several challenges in both informatics and statistics. Between 2002 and 2011, the Estonian Genome Center of the University of Tartu (TÜ EGV) collected gene samples from more than 50,000 people, and a further 100,000 will be added this year. To date, the DNA of more than 5,500 gene donors has been analysed with various NGS methods. This doctoral thesis proposes a general framework for processing the NGS data produced at TÜ EGV and, in addition, investigates how best to account for the genetic distinctiveness of individuals of Estonian origin. One widely used NGS method is sequencing of the exome, that is, all protein-coding gene regions, which makes it possible to find rare and de novo gene variants efficiently and is therefore applied in medical genetics to identify the gene mutations behind Mendelian diseases. The first part of the thesis analyses data from three Estonian families; in all three cases a potentially pathogenic mutation was identified, which may allow better treatments to be developed in the future. An analysis of genome sequencing data against clinical blood measurements was also carried out. This analysis highlighted the advantages of a population-based biobank, which in addition to rich genomic data also contains valuable information about various diseases and traits. The study identified significant associations between CEBPA gene variants and basophil count, the latter playing a role in the symptoms of several autoimmune diseases. To increase the power of genome-wide association studies, missing gene variants are predicted, that is, imputed. To make data analysis more efficient specifically for individuals of Estonian origin, genome sequencing data were used to create an Estonian-specific imputation panel.
Missing gene variants were then imputed in three ways, using the Estonian-specific panel as well as two multi-ethnic panels. The comparison showed that using the Estonian-specific panel allows more gene variants to be determined at higher quality, and that the advantage of the new panel is especially evident for rare variants.
Next-generation sequencing (NGS) technology enables large-scale, routine sequencing in large cohorts. This thesis demonstrated that the analysis of NGS data has huge potential in several fields, but also requires massive computational power. Also, with the increase of data volumes, there is a constant need for the development of computational and statistical methods. Covering the whole spectrum of protein-coding regions in a cost-effective way, exome sequencing opens new opportunities for quick and exact large-scale screenings. In the first part of the thesis we analysed three Estonian families with Mendelian diseases and detected potentially causative gene variants for each case. These projects highlighted that a tight collaboration between data scientists and medical geneticists can lead to findings with considerable impact in the research of rare genetic disorders and have the potential to lead to successful therapies in the future. Population-based biobanks provide numerous opportunities for expanding phenotypic datasets. We used additional blood cell measurements from the electronic medical records, and our genome-wide scan detected a previously undiscovered association with basophil counts near the CEBPA gene and highlighted their role in autoimmune regulation. This example opens new dimensions for scanning the underlying genetic basis of a variety of traits and diseases. To increase the resolution of genome-wide scans, imputation is routinely implemented to incorporate variants that are not directly genotyped.
We had an opportunity to construct an imputation reference panel for Estonians based on genome sequencing data. We showed that the utilization of a population-specific reference panel provided significantly higher imputation confidence for rare variants compared to larger, multi-ethnic panels. In the downstream analysis, we observed a huge gain in gene-based rare variant testing. As one of the main results of this thesis, the Estonian-specific imputation reference panel has been created and tested and is ready to serve for a long time. This includes data processing in the framework of the ongoing initiative to invite 100,000 Estonians to join the Biobank cohort, with the purpose of developing efficient disease prevention and treatment guides for the implementation of personalized medicine.

    Resolving complex structural variants via nanopore sequencing

    The recent development of high-throughput sequencing platforms has provided impressive insights into the field of human genetics and has helped establish structural variants (SVs) as a hallmark of genome instability that leads to several pathologic conditions, including neoplasia and neurodegenerative and cognitive disorders. While SV detection is addressed by next-generation sequencing (NGS) technologies, more recent long-read sequencing technologies have already proven invaluable in overcoming the inaccuracy and limitations of NGS when it is applied to resolve large and structurally complex SVs, limitations that stem from the short length (100–500 bp) of the sequencing reads used. Among the long-read sequencing technologies, Oxford Nanopore Technologies developed a sequencing platform based on a protein nanopore that allows the sequencing of “native” long DNA molecules of virtually unlimited length (typical range 1–100 Kb). In this review, we focus on the bioinformatics methods that improve the identification and genotyping of known and novel SVs to investigate human pathological conditions, discussing the possibility of introducing nanopore sequencing technology into routine diagnostics.