
    SeqMule: automated pipeline for analysis of human exome/genome sequencing data

    Next-generation sequencing (NGS) technology has greatly helped us identify disease-contributory variants for Mendelian diseases. However, users often face issues such as software incompatibility, complicated configuration, and lack of access to high-performance computing facilities, and discrepancies exist among aligners and variant callers. We developed a computational pipeline, SeqMule, to perform automated variant calling from NGS data on human genomes and exomes. SeqMule integrates a computing-cluster-free parallelization capability built on top of the variant callers, and facilitates normalization and intersection of variant calls to generate a high-confidence consensus set. SeqMule integrates 5 alignment tools and 5 variant calling algorithms and accepts various combinations of them through a single one-line command, allowing highly flexible yet fully automated variant calling. On a modern machine (2 Intel Xeon X5650 CPUs, 48 GB memory), when fast turnaround is needed, SeqMule generates annotated VCF files in a day from a 30X whole-genome sequencing data set; when more accurate calling is needed, SeqMule generates a consensus call set that improves on single callers, as measured by both Mendelian error rate and consistency. SeqMule supports Sun Grid Engine for parallel processing, offers a turn-key solution for deployment on Amazon Web Services, and provides quality checks, Mendelian error checks, consistency evaluation, and HTML-based reports. SeqMule is available at http://seqmule.openbioinformatics.org.
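
    The consensus step described above boils down to keeping variants that several callers agree on. Below is a minimal Python sketch of that intersection idea, not SeqMule's actual implementation; it assumes calls have already been normalized to (chrom, pos, ref, alt) tuples, and the two-caller threshold is illustrative.

        # Hypothetical sketch of consensus calling: keep only variants reported
        # by at least `min_callers` of the individual callers. Keying variants
        # by (chrom, pos, ref, alt) assumes calls were normalized beforehand.
        from collections import Counter
        from typing import Iterable, Set, Tuple

        Variant = Tuple[str, int, str, str]  # (chrom, pos, ref, alt)

        def consensus_calls(call_sets: Iterable[Set[Variant]],
                            min_callers: int = 2) -> Set[Variant]:
            """Return variants supported by at least `min_callers` call sets."""
            counts: Counter = Counter()
            for calls in call_sets:
                counts.update(calls)  # each caller contributes a variant at most once
            return {v for v, n in counts.items() if n >= min_callers}

        # Example: three callers, two-way consensus keeps only the shared variant.
        gatk = {("chr1", 12345, "A", "G"), ("chr1", 22222, "C", "T")}
        freebayes = {("chr1", 12345, "A", "G")}
        samtools = {("chr1", 12345, "A", "G"), ("chr2", 555, "G", "A")}
        print(consensus_calls([gatk, freebayes, samtools]))
        # {('chr1', 12345, 'A', 'G')}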

    SVenX: A highly parallelized pipeline for structural variation detection using linked read whole genome sequencing data

    Genomic rearrangements larger than 50 bp are called structural variants. As a group, they affect phenotypic diversity among humans and have been associated with many human disorders, including neurodevelopmental disorders and cancer. Recent advances in whole genome sequencing (WGS) technologies have made it possible to identify many more disease-causing genetic variants relevant to clinical diagnostics, sometimes affecting treatment. Numerous approaches have been proposed to detect structural variants, but acquiring and filtering out the most significant information from the multitude of called variants in the sequencing data has proven to be a challenge. Another obstacle is the high computational cost of data analysis and the difficulty of configuring and operating the software and databases. Here, we present SVenX, a highly automated and parallelized pipeline that analyzes and calls structural variants using linked read WGS data. It performs variant calling using three different approaches, as well as annotation and filtering of variants. We also introduce a new tool, SVGenT, that reanalyzes the called structural variants by performing de novo assembly using the aligned reads at the identified breakpoint junctions. By comparing assembled contigs and analyzing the read coverage between the breakpoint junctions, SVGenT improves both variant and genotype classification and the breakpoint localization.

    Tool for detection of genomic rearrangements in humans
    Genomic rearrangements larger than 50 base pairs are referred to as structural variants (SVs), and they contribute to phenotypic differences between humans. Some of these variants have been associated with human diseases such as cancer and neurodevelopmental disorders. Recent advances in whole genome sequencing (WGS) technologies have made it possible to analyze and identify many structural variants. Yet the existing tools for analyzing these data are not perfect, and they require a fair amount of bioinformatics knowledge to operate. SVenX is a highly parallelized and automated pipeline that executes all steps from whole genome sequencing data to filtered SVs. This includes 1) verifying that all required data exist, 2) making sure no data duplications exist, 3) finding variants using different methods, and 4) annotating and filtering the detected SVs. SVenX performs 10 separate steps, including 3 different variant detection tools (also known as variant callers). Normally these steps are performed one by one, waiting for the output before running the next. Not only does that approach take longer to run, it also requires an analyst to execute each step. Apart from installation, SVenX takes at most a few minutes to set up and launch, and it can analyze multiple samples of WGS data at the same time. The whole pipeline takes about 4 to 5 days to complete, requiring minimal work effort and bioinformatics knowledge. Another challenge in SV research is not only detecting the variants, but also being confident that the detected SVs are true calls. The performance of existing variant callers differs significantly: one tool can perform well on one dataset and fail entirely on another, while another tool may be good at detecting only a single type of SV. Using multiple bioinformatics methods to detect SVs has been shown to result in a higher detection rate. We have created a novel tool, SVGenT, that reanalyzes already detected SVs through de novo assembly. SVGenT classifies the SV type (deletion, duplication, inversion or break-end), classifies the genotype (homozygous or heterozygous), and updates the genomic position of the SV breakpoints. SVGenT has been tested on two datasets: one public large-scale WGS dataset and one simulated dataset with 4000 SVs. Three different variant callers were used to detect the variants before SVGenT was run on the output files, and the detection rate was calculated before and after SVGenT was applied. In most cases, SVGenT improved the classification of both SV type and SV genotype.

    Master's Degree Project in Biology/Molecular Biology/Bioinformatics, 60 credits, 2017. Department of Biology, Lund University. Advisor: Anna Lindstrand, M.D., Ph.D., Karolinska Institutet.
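
    As a rough illustration of the coverage-based genotyping idea described above (not SVGenT's actual code), the sketch below genotypes a candidate deletion from the ratio of read depth inside the breakpoints to depth in the flanking sequence; the cutoff values are assumptions chosen for the demo.

        # Illustrative sketch: genotype a candidate deletion from the depth
        # ratio between the breakpoints vs. the flanks. ~0x inside suggests a
        # homozygous deletion, ~0.5x heterozygous, ~1x an unsupported call.
        def genotype_deletion(depth_inside: float, depth_flank: float,
                              hom_cutoff: float = 0.25,
                              het_cutoff: float = 0.75) -> str:
            """Classify a deletion call from a coverage ratio."""
            if depth_flank == 0:
                return "no-call"          # no flanking evidence to compare against
            ratio = depth_inside / depth_flank
            if ratio < hom_cutoff:
                return "homozygous"
            if ratio < het_cutoff:
                return "heterozygous"
            return "reference"            # coverage not reduced: call unsupported

        print(genotype_deletion(3.0, 30.0))   # homozygous
        print(genotype_deletion(16.0, 30.0))  # heterozygous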

    Bioinformatics and computational tools for next-generation sequencing analysis in clinical genetics

    Clinical genetics plays an important role in the healthcare system by providing a definitive diagnosis for many rare syndromes. It can also influence genetic prevention, disease prognosis, and the selection of the best care and treatment options for patients. Next-generation sequencing (NGS) has transformed clinical genetics, making it possible to analyze hundreds of genes at an unprecedented speed and at a lower price compared to conventional Sanger sequencing. Despite the growing literature concerning NGS in a clinical setting, this review aims to fill the gap that exists among (bio)informaticians, molecular geneticists, and clinicians by presenting a general overview of the NGS technology and workflow. First, we review the current NGS platforms, focusing on the two main platforms, Illumina and Ion Torrent, and discussing the major strengths and weaknesses intrinsic to each. Next, the NGS analytical bioinformatic pipelines are dissected, with some emphasis on the algorithms commonly used to generate and process data and to analyze sequence variants. Finally, the main challenges around NGS bioinformatics are placed in perspective for future developments. Even with the huge achievements made in NGS technology and bioinformatics, further improvements in bioinformatic algorithms are still required to deal with complex and genetically heterogeneous disorders.
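
    To make the dissected pipeline concrete, here is a minimal sketch of the alignment, sorting, and variant-calling core of a germline workflow, shelling out from Python to BWA, samtools, and GATK. The file names are placeholders and the tools are assumed to be installed with the reference indexed; a production pipeline would add duplicate marking, recalibration, and QC steps.

        # Minimal sketch (assumed file names; bwa, samtools, and GATK on PATH,
        # reference indexed) of the core stages of a germline NGS pipeline.
        import subprocess

        REF = "reference.fa"          # bwa index + samtools faidx already done
        R1, R2 = "sample_R1.fastq.gz", "sample_R2.fastq.gz"

        def run(cmd: str) -> None:
            """Run one pipeline stage, aborting the workflow on failure."""
            subprocess.run(cmd, shell=True, check=True)

        # 1) Align paired-end reads (with a read group, which GATK requires)
        #    and coordinate-sort the alignments in one stream.
        run(f"bwa mem -R '@RG\\tID:s1\\tSM:sample1' {REF} {R1} {R2}"
            " | samtools sort -o sample.sorted.bam -")
        # 2) Index the BAM so the caller can random-access regions.
        run("samtools index sample.sorted.bam")
        # 3) Call germline variants into a compressed VCF.
        run(f"gatk HaplotypeCaller -R {REF} -I sample.sorted.bam -O sample.vcf.gz")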

    Ambiguous representation of genetic variants in the VCF format

    The variant call format (VCF) is a file format used to represent and store information about DNA variation. Genetic variants can be represented in VCF in multiple ways because the VCF specification allows for ambiguity, which can arise from different variant calling pipelines or from differences in sequence alignment. These ambiguities interfere with the comparison of VCF files and the variants therein, complicating further analysis of the variants. This thesis explores the differences in representation of genetic variants that can occur, as well as their causes and impacts on downstream analysis. Furthermore, the normalization of VCF files is addressed, and an algorithm for the atomization and deatomization of VCF files is presented. Keywords: VCF, variant call format, ambiguous variant representation, variant comparison, variant atomization, variant deatomization. Department of Cell Biology, Faculty of Science.
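
    The normalization discussed here has a well-known core: trim bases shared by the alleles and left-align indels against the reference. The sketch below is a minimal Python rendition of that idea, not the thesis's algorithm; it handles one biallelic variant and, for simplicity, takes the chromosome as an in-memory string rather than an indexed FASTA.

        # Minimal variant normalization sketch: right-trim shared trailing
        # bases (extending left from the reference when an allele empties),
        # then left-trim shared leading bases. Positions are 1-based, as in VCF.
        def normalize(pos: int, ref: str, alt: str, chrom_seq: str):
            while ref and alt and ref[-1] == alt[-1]:
                ref, alt = ref[:-1], alt[:-1]
                if not ref or not alt:
                    pos -= 1
                    base = chrom_seq[pos - 1]      # 1-based -> 0-based index
                    ref, alt = base + ref, base + alt
            # Keep at least one base per allele while trimming from the left.
            while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
                ref, alt = ref[1:], alt[1:]
                pos += 1
            return pos, ref, alt

        SEQ = "GGGCACACAGGG"                   # toy chromosome, 1-based coordinates
        # The same CA deletion written right-most vs. its left-aligned form:
        print(normalize(7, "ACA", "A", SEQ))   # -> (3, 'GCA', 'G')
        # An SNP padded with a shared leading base reduces to its atomic form:
        print(normalize(4, "CA", "CG", SEQ))   # -> (5, 'A', 'G')

    Two VCF files that disagree only in such representational choices describe the same variation; normalizing both before comparison is what makes their variants directly comparable.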

    Comparison of CPU and Parabricks GPU Enabled Bioinformatics Software for High Throughput Clinical Genomic Applications

    In recent years, high performance computing (HPC) has begun to revolutionize the architecture of software and servers to meet the ever-increasing demand for speed and efficiency. One of the ways this change is manifesting is the adoption of graphics processing units (GPUs). Used correctly, GPUs can increase throughput and decrease compute time for certain computational problems. Bioinformatics, an HPC-dependent discipline, is no exception. As bioinformatics continues to advance clinical care by sequencing patients' DNA and RNA for disease diagnosis, there is an ever-increasing demand for faster data processing to improve clinical sequencing turnaround time. Parabricks, a GPU-enabled bioinformatics software suite, is one of the leaders in 'lifting over' common CPU bioinformatics tools to GPU architectures. In the present study, bioinformatics pipelines built with Parabricks GPU-enabled software are compared with standard CPU bioinformatics software. Pipeline results and run performance comparisons are presented to show the impact this technology change can have for a medium-sized computational cluster. The present study finds that Parabricks' GPU workflows show a massive increase in overall efficiency, cutting overall run time by roughly 21x and the overall computational hours needed by 650x. Parabricks GPU workflows show a 99.5% variant call concordance rate when compared to clinically validated CPU workflows. Substituting Parabricks GPU alignment into a clinically validated CPU-based pipeline reduces the number of compute hours from 836 to 727 and returns the same results, showing that CPUs and GPUs can be used together to reduce pipeline turnaround time and compute resource burden. Overall, integration of GPUs into bioinformatic pipelines leads to a massive reduction in turnaround time, reduced computation times, and increased throughput, with little to no sacrifice in overall output quality. The findings of this study show that GPU-based bioinformatic workflows like Parabricks could greatly improve the accessibility of whole genome sequencing for clinical use by reducing testing turnaround time.
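
    The concordance figure reported above amounts to asking what fraction of the validated CPU call set the GPU pipeline reproduces. A back-of-the-envelope sketch of that measurement, assuming both VCFs have already been reduced to normalized (chrom, pos, ref, alt) tuples:

        def concordance(gpu_calls: set, cpu_calls: set) -> float:
            """Fraction of the baseline (CPU) calls also made by the GPU pipeline."""
            if not cpu_calls:
                raise ValueError("baseline call set is empty")
            return len(gpu_calls & cpu_calls) / len(cpu_calls)

        cpu = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C"), ("chr2", 50, "T", "TA")}
        gpu = {("chr1", 100, "A", "T"), ("chr1", 200, "G", "C")}
        print(f"{concordance(gpu, cpu):.1%}")   # 66.7% on this toy data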

    Open issues in extracting nuclear structure information from the breakup of exotic nuclei

    The open issues in the development of models for the breakup of exotic nuclei, and their link with the extraction of structure information from experimental data, are reviewed. The question of improving the description of exotic nuclei within reaction models is approached from the perspective of previous analyses of the sensitivity of these models to that description. Future developments of reaction models are suggested, such as the inclusion of various channels within one model. The search for new reaction observables that can emphasise more details of exotic nuclear structure is also proposed.
    Comment: 18 pages, 4 figures, submitted as a contribution to the Special Issue on "Nuclear reaction theory" of the Journal of Physics G, guest edited by R.C. Johnson and F.M. Nunes.

    Bioinformatics for personal genomics: development and application of bioinformatic procedures for the analysis of genomic data

    In the last decade, the huge decrease in sequencing costs brought by the development of high-throughput technologies has completely changed the way genetic problems are approached. In particular, whole exome and whole genome sequencing are contributing to extraordinary progress in the study of human variants, opening up new perspectives in personalized medicine. Being a relatively new and fast-developing field, it requires appropriate tools and specialized knowledge for efficient data production and analysis. In line with the times, in 2014 the University of Padua funded the BioInfoGen Strategic Project with the goal of developing technology and expertise in bioinformatics and molecular biology applied to personal genomics. The aim of my PhD was to contribute to this challenge by implementing a series of innovative tools and applying them to investigate and possibly solve the case studies included in the project. I first developed an automated pipeline for dealing with Illumina data, able to sequentially perform each step needed to pass from raw reads to somatic or germline variant detection. The system's performance was tested by means of internal controls and by its application to a cohort of patients affected by gastric cancer, obtaining interesting results. Once variants are called, they have to be annotated in order to define their properties, such as position at the transcript and protein level, impact on the protein sequence, and pathogenicity. As most publicly available annotators were affected by systematic errors causing low consistency in the final annotation, I implemented VarPred, a new tool for variant annotation, which guarantees the best accuracy (>99%) compared to state-of-the-art programs while also showing good processing times. To make VarPred easy to use, I equipped it with an intuitive web interface that allows not only graphical result evaluation but also a simple filtration strategy. Furthermore, for valuable user-driven prioritization of human genetic variations, I developed QueryOR, a web platform suitable for searching among known candidate genes as well as for finding novel gene-disease associations. QueryOR combines several innovative features that make it comprehensive, flexible, and easy to use. The prioritization is achieved by a global positive selection process that promotes the emergence of the most reliable variants, rather than filtering out those not satisfying the applied criteria. QueryOR was used to analyze the two case studies framed within the BioInfoGen project. In particular, it allowed the detection of causative variants in patients affected by lysosomal storage diseases, also highlighting the efficacy of the designed sequencing panel. On the other hand, QueryOR simplified the recognition of the LRP2 gene as a possible candidate to explain subjects with a Dent disease-like phenotype but with no mutation in the previously identified disease-associated genes, CLCN5 and OCRL. As a final corollary, an extensive analysis of recurrent exome variants was performed, showing that their origin is mainly explained by inaccuracies in the reference genome, including misassembled regions and uncorrected bases, rather than by platform-specific errors.
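
    QueryOR's positive-selection strategy can be pictured as scoring rather than filtering: each satisfied criterion adds a point, and variants are ranked by total score, so failing one criterion demotes a variant instead of discarding it. The sketch below is a hypothetical illustration of that idea, not QueryOR's code; the criteria and thresholds are invented for the demo.

        # Hypothetical positive-selection prioritization: rank variants by the
        # number of satisfied criteria instead of hard-filtering on each one.
        from typing import Callable, Dict, List

        Criterion = Callable[[Dict], bool]

        criteria: List[Criterion] = [
            lambda v: v["allele_freq"] < 0.01,             # rare in the population
            lambda v: v["impact"] in {"HIGH", "MODERATE"}, # predicted effect
            lambda v: v["in_candidate_gene"],              # gene-panel membership
            lambda v: v["quality"] >= 30,                  # call quality
        ]

        def prioritize(variants: List[Dict]) -> List[Dict]:
            """Rank variants by how many criteria they satisfy (best first)."""
            return sorted(variants,
                          key=lambda v: sum(c(v) for c in criteria),
                          reverse=True)

        variants = [
            {"id": "var1", "allele_freq": 0.001, "impact": "HIGH",
             "in_candidate_gene": True, "quality": 50},
            {"id": "var2", "allele_freq": 0.20, "impact": "HIGH",
             "in_candidate_gene": True, "quality": 45},  # common, yet kept and ranked
        ]
        for v in prioritize(variants):
            print(v["id"], sum(c(v) for c in criteria))  # var1 4, then var2 3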