722 research outputs found

    Computational pan-genomics: status, promises and challenges

    Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example of a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

    Computational Methods for Gene Expression and Genomic Sequence Analysis

    Advances in technology now produce increasingly cost-effective, high-throughput, large-scale biological data. As a result, there is an urgent need for efficient computational methods to analyze these massive data sets. In this dissertation, we introduce methods to address several important issues in gene expression and genomic sequence analysis, two of the most important areas in bioinformatics. First, we introduce a novel approach to predicting patterns of gene response to multiple treatments when sample sizes are small. Researchers are increasingly interested in experiments with many treatments, such as chemical compounds or drug doses. However, due to cost, many experiments do not have large enough samples, making it difficult for conventional methods to predict patterns of gene response. Here we introduce an approach that exploits dependencies among pairwise comparison outcomes, together with resampling techniques, to predict true patterns of gene response when samples are insufficient. This approach deduced more and better functionally enriched gene clusters than conventional methods. Our approach is therefore useful for multiple-treatment studies that have small sample sizes or contain genes with highly variable expression. Second, we introduce a novel method for aligning short reads, which are DNA fragments extracted from the genomes of individuals, to reference genomes. Results from short read alignment can be used for many studies, such as measuring gene expression or detecting genetic variants. Here we introduce a method that employs an iterated randomized algorithm based on the FM-index, an efficient data structure for full-text indexing, to align reads to the reference. This method improved alignment performance across a wide range of read lengths and error rates compared to several popular methods, making it a good choice for the community to perform short read alignment. Finally, we introduce a novel approach to detecting genetic variants such as SNPs (single nucleotide polymorphisms) or INDELs (insertions/deletions). This problem has great significance in a wide range of areas, from bioinformatics and genetic research to the medical field. For example, one can predict how genomic changes are related to phenotype in an organism of interest, or associate genetic changes with disease risk or medical treatment efficacy. Here we introduce a method that leverages known genetic variants from well-established databases to improve the accuracy of variant detection. This method had higher accuracy than several state-of-the-art methods in many cases, especially for detecting INDELs. Our method therefore has the potential to be useful in research and clinical applications that rely on identifying genetic variants accurately.

    Enhanced mitochondrial genome analysis: bioinformatic and long-read sequencing advances and their diagnostic implications

    Introduction: Primary mitochondrial diseases (PMDs) comprise a large and heterogeneous group of genetic diseases that result from pathogenic variants in either nuclear DNA (nDNA) or mitochondrial DNA (mtDNA). Widespread adoption of next-generation sequencing (NGS) has improved the efficiency and accuracy of mtDNA diagnoses; however, several challenges remain. Areas covered: In this review, we briefly summarize the current state of the art in molecular diagnostics for mtDNA and consider the implications of improved whole genome sequencing (WGS), bioinformatic techniques, and the adoption of long-read sequencing, for PMD diagnostics. Expert opinion: We anticipate that the application of PCR-free WGS from blood DNA will increase in diagnostic laboratories, while for adults with myopathic presentations, WGS from muscle DNA may become more widespread. Improved bioinformatic strategies will enhance WGS data interrogation, with more accurate delineation of mtDNA and NUMTs (nuclear mitochondrial DNA segments) in WGS data, superior coverage uniformity, indirect measurement of mtDNA copy number, and more accurate interpretation of heteroplasmic large-scale rearrangements (LSRs). Separately, the adoption of diagnostic long-read sequencing could offer greater resolution of complex LSRs and the opportunity to phase heteroplasmic variants