    SVIM: Structural Variant Identification using Mapped Long Reads

    Motivation: Structural variants are defined as genomic variants larger than 50bp. They have been shown to affect more bases in any given genome than SNPs or small indels. Additionally, they have great impact on human phenotype and diversity and have been linked to numerous diseases. Due to their size and association with repeats, they are difficult to detect by shotgun sequencing, especially when based on short reads. Long-read, single-molecule sequencing technologies like those offered by Pacific Biosciences or Oxford Nanopore Technologies produce reads with a length of several thousand base pairs. Despite the higher error rate and sequencing cost, long-read sequencing offers many advantages for the detection of structural variants. Yet, available software tools still do not fully exploit the possibilities. Results: We present SVIM, a tool for the sensitive detection and precise characterization of structural variants from long-read data. SVIM consists of three components for the collection, clustering and combination of structural variant signatures from read alignments. It discriminates five different variant classes including similar types, such as tandem and interspersed duplications and novel element insertions. SVIM is unique in its capability of extracting both the genomic origin and destination of duplications. It compares favorably with existing tools in evaluations on simulated data and real datasets from PacBio and Nanopore sequencing machines. Availability and implementation: The source code and executables of SVIM are available on GitHub: github.com/eldariont/svim. SVIM has been implemented in Python 3 and published on bioconda and the Python Package Index. Supplementary information: Supplementary data are available at Bioinformatics online.
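
To make the idea of collecting variant signatures from read alignments concrete, the following is a minimal sketch and not SVIM's actual implementation. It scans the CIGAR string of a single alignment and reports insertions and deletions longer than 50bp. The function name `indel_signatures`, its parameters, and the output tuple layout are hypothetical, chosen only for illustration.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indel_signatures(cigar, ref_start, min_size=50):
    """Return (type, reference_position, length) for large indels in one alignment.

    Hypothetical helper for illustration; only intra-alignment insertions (I)
    and deletions (D) longer than min_size bp are reported.
    """
    signatures = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op in "ID" and length > min_size:
            kind = "INS" if op == "I" else "DEL"
            signatures.append((kind, ref_pos, length))
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length
    return signatures


if __name__ == "__main__":
    # Toy alignment starting at reference position 1000.
    print(indel_signatures("100M60D20M5I30M75I10M", 1000))
```

The 5bp insertion in the toy CIGAR string is skipped because it falls below the size threshold. Signatures that span several alignments of the same read, such as those produced by duplications, would need additional logic not shown here.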

    ModHMM: A Modular Supra-Bayesian Genome Segmentation Method

    Genome segmentation methods are powerful tools to obtain cell type or tissue-specific genome-wide annotations and are frequently used to discover regulatory elements. However, traditional segmentation methods show low predictive accuracy and their data-driven annotations have some undesirable properties. As an alternative, we developed ModHMM, a highly modular genome segmentation method. Inspired by the supra-Bayesian approach, it incorporates predictions from a set of classifiers. This allows genome segmentations to be computed with state-of-the-art methodology. We demonstrate the method on ENCODE data and show that it outperforms traditional segmentation methods not only in terms of predictive performance, but also in qualitative aspects. Therefore, ModHMM is a valuable alternative for studying the epigenetic and regulatory landscape across and within cell types or tissues.

    A biophysical approach to large-scale protein-DNA binding data

    Cutting-edge genome analysis methods from leading bioinformaticians. Research from the BioSapiens Network of Excellence presents an accurate description of current scientific developments in bioinformatics and their computational implementation. Bioinformatics is essential for annotating the structure and function of genes and proteins, for analyzing complete genomes, and for molecular biology and biochemistry. Included is an overview of bioinformatics and the full spectrum of genome annotation approaches: genome analysis and gene prediction, gene regulation analysis and expression, genome variation and QTL analysis, large-scale protein annotation of function and structure, annotation and prediction of protein interactions, and the organization and annotation of molecular networks and biochemical pathways. Also covered are a technical framework to organize and represent genome data using the DAS technology and work on the annotation of two large genomic sets: HIV/HCV viral genomes and splicing alternatives potentially encoded in 1% of the human genome.

    Predicting the outcome of renal transplantation

    Objective: Renal transplantation has dramatically improved the survival rate of hemodialysis patients. However, with a growing proportion of marginal organs and improved immunosuppression, it is necessary to verify that the established allocation system, mostly based on human leukocyte antigen matching, still meets today's needs. The authors turn to machine-learning techniques to predict, from donor-recipient data, the estimated glomerular filtration rate (eGFR) of the recipient 1 year after transplantation. Design: The patient's eGFR was predicted using donor-recipient characteristics available at the time of transplantation. Donors' data were obtained from Eurotransplant's database, while recipients' details were retrieved from Charité Campus Virchow-Klinikum's database. A total of 707 renal transplantations from cadaveric donors were included. Measurements: Two separate datasets were created, taking features with <10% missing values for one and <50% missing values for the other. Four established regressors were run on both datasets, with and without feature selection. Results: The authors obtained a Pearson correlation coefficient between predicted and real eGFR (COR) of 0.48. The best model for the dataset was a Gaussian support vector machine with recursive feature elimination on the more inclusive dataset. All results are available at http://transplant.molgen.mpg.de/. Limitations: For now, missing values in the data must be predicted and filled in. The performance is not as high as hoped, but the dataset seems to be the main cause. Conclusions: Predicting the outcome is possible with the dataset at hand (COR=0.48). Valuable features include age and creatinine levels of the donor, as well as sex and weight of the recipient.
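
The two steps named in the abstract, building datasets by a missing-value threshold and scoring predictions with a Pearson correlation coefficient, can be illustrated with a short sketch. This is not the authors' pipeline: the helper names `keep_features` and `pearson`, the record layout, and the toy values are all assumptions made for illustration.

```python
import math


def keep_features(rows, max_missing):
    """Return feature names whose fraction of missing (None) values is below max_missing.

    Hypothetical helper: rows is a list of dicts sharing the same keys.
    """
    kept = []
    for name in rows[0]:
        missing = sum(1 for row in rows if row[name] is None)
        if missing / len(rows) < max_missing:
            kept.append(name)
    return kept


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


if __name__ == "__main__":
    # Made-up toy records, not data from the study.
    rows = [
        {"donor_age": 54, "donor_creatinine": 0.9, "recipient_weight": None},
        {"donor_age": 61, "donor_creatinine": None, "recipient_weight": None},
        {"donor_age": 47, "donor_creatinine": 1.1, "recipient_weight": 72.0},
        {"donor_age": 39, "donor_creatinine": 0.8, "recipient_weight": None},
    ]
    print(keep_features(rows, 0.10))  # strict dataset
    print(keep_features(rows, 0.50))  # more inclusive dataset
```

The same `pearson` function can then be applied to a list of predicted eGFR values and the corresponding measured values to obtain a score comparable in kind to the reported COR.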

    Quantifying the tissue-specific regulatory information within enhancer DNA sequences

    Recent efforts to measure epigenetic marks across a wide variety of different cell types and tissues provide insights into the cell type-specific regulatory landscape. We use these data to study whether there exists a correlate of epigenetic signals in the DNA sequence of enhancers and explore with computational methods to what degree such sequence patterns can be used to predict cell type-specific regulatory activity. By constructing classifiers that predict in which tissues enhancers are active, we are able to identify sequence features that might be recognized by the cell in order to regulate gene expression. While classification performances vary greatly between tissues, we show examples where our classifiers correctly predict tissue-specific regulation from sequence alone. We also show that many of the informative patterns indeed harbor transcription factor footprints

    Horizontal Gene Transfers in prokaryotes show differential preferences for metabolic and translational genes

    Background: Horizontal gene transfer (HGT) is an important process that contributes to bacterial pathogenesis and drug resistance. A number of methods have been proposed for the detection of horizontal gene transfer. One successful approach to the detection of HGT events is due to Novichkov et al. (J. Bacteriology 186, 6575-85), who rely on comparing phylogenetic distances within a gene family with genomic distances of the source organisms. Building on their approach, we introduce outlier detection in the correlation between those two sets of distances. This approach is designed to detect horizontal transfers of a core set of genes present in many bacteria. The principle behind the method allows detection of xenologous gene displacements as well as acquisition of novel genes. Results: Simulations indicated that our method performs better than the original approach of Novichkov et al. The approach very efficiently identified HGT between distantly related bacteria and also a limited number of gene transfers between closely related bacteria. In combination with sequence similarity and likelihood tests, it yields a measure robust enough to derive a set of 171 genes deemed likely to have been horizontally transferred. Further analysis of these 171 established horizontal transfer events gave interesting insights into the direction of transfer. Conclusion: The majority of transfers between archaea and bacteria have occurred in the direction from bacteria to archaea rather than the other way round. Genes transferred between the archaea and bacteria are mostly metabolic genes. On the other hand, genes transferred within the bacterial phyla are mainly involved in translation.
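
The abstract does not state which outlier criterion is used, so the following is only one plausible reading, sketched under stated assumptions: fit a least-squares line relating genomic distance to gene-family distance across organism pairs, then flag pairs whose residual exceeds a cutoff in units of the residual standard deviation. The function names, the `z_cutoff` parameter, and the toy distances are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual_outliers(genome_dist, gene_dist, z_cutoff=2.0):
    """Return indices of organism pairs whose gene distance deviates from the fitted line.

    Assumed criterion for illustration: |residual| / residual SD > z_cutoff.
    """
    slope, intercept = fit_line(genome_dist, gene_dist)
    residuals = [y - (slope * x + intercept) for x, y in zip(genome_dist, gene_dist)]
    sd = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    if sd == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r) / sd > z_cutoff]


if __name__ == "__main__":
    # Made-up distances for ten organism pairs; pair 4 has a gene copy that is
    # far more similar than the genomic distance would suggest.
    genome = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
    gene = [0.1, 0.2, 0.3, 0.4, 0.05, 0.6, 0.7, 0.8, 0.9, 1.0]
    print(residual_outliers(genome, gene))
```

A pair flagged this way is only a candidate; as the abstract notes, the published measure is combined with sequence similarity and likelihood tests before a gene is considered horizontally transferred.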

    Association Plots: Visualizing associations in high-dimensional correspondence analysis biplots

    In molecular biology, just as in many other fields of science, data often come in the form of matrices or contingency tables with many measurements (rows) for a set of variables (columns). While projection methods like Principal Component Analysis or Correspondence Analysis can be applied to obtain an overview of such data, in cases where the matrix is very large the associated loss of information upon projection into two or three dimensions may be dramatic. However, when the set of variables can be grouped into clusters, this opens up a new angle on the data. We focus on the question of which measurements are associated with a cluster and distinguish it from other clusters. Correspondence Analysis employs a geometry geared towards answering this question. We exploit this feature in order to introduce Association Plots for visualizing cluster-specific measurements in complex data. Association Plots are two-dimensional, independent of the size of the data matrix or cluster, and depict the measurements associated with a cluster of variables. We demonstrate our method first on a small data set and then on a genomic example comprising more than 10,000 conditions. We show that Association Plots can clearly highlight those measurements which characterize a cluster of variables.