195 research outputs found

    Reliability of panel-based mutational signatures for immune-checkpoint-inhibition efficacy prediction in non-small cell lung cancer

    Get PDF
    OBJECTIVES: Mutational signatures (MS) are gaining traction for deriving therapeutic insights for immune checkpoint inhibition (ICI). We asked if MS attributions from comprehensive targeted sequencing assays are reliable enough for predicting ICI efficacy in non-small cell lung cancer (NSCLC).METHODS: Somatic mutations of m = 126 patients were assayed using panel-based sequencing of 523 cancer-related genes. In silico simulations of MS attributions for various panels were performed on a separate dataset of m = 101 whole genome sequenced patients. Non-synonymous mutations were deconvoluted using COSMIC v3.3 signatures and used to test a previously published machine learning classifier.RESULTS: The ICI efficacy predictor performed poorly with an accuracy of 0.51 -0.09 +0.09, average precision of 0.52 -0.11 +0.11, and an area under the receiver operating characteristic curve of 0.50 -0.09 +0.10. Theoretical arguments, experimental data, and in silico simulations pointed to false negative rates (FNR) related to panel size. A secondary effect was observed, where deconvolution of small ensembles of point mutations lead to reconstruction errors and misattributions. CONCLUSION: MS attributions from current targeted panel sequencing are not reliable enough to predict ICI efficacy. We suggest that, for downstream classification tasks in NSCLC, signature attributions be based on whole exome or genome sequencing instead.</p

    Accurate reconstruction of insertion-deletion histories by statistical phylogenetics

    Get PDF
    The Multiple Sequence Alignment (MSA) is a computational abstraction that represents a partial summary either of indel history, or of structural similarity. Taking the former view (indel history), it is possible to use formal automata theory to generalize the phylogenetic likelihood framework for finite substitution models (Dayhoff's probability matrices and Felsenstein's pruning algorithm) to arbitrary-length sequences. In this paper, we report results of a simulation-based benchmark of several methods for reconstruction of indel history. The methods tested include a relatively new algorithm for statistical marginalization of MSAs that sums over a stochastically-sampled ensemble of the most probable evolutionary histories. For mammalian evolutionary parameters on several different trees, the single most likely history sampled by our algorithm appears less biased than histories reconstructed by other MSA methods. The algorithm can also be used for alignment-free inference, where the MSA is explicitly summed out of the analysis. As an illustration of our method, we discuss reconstruction of the evolutionary histories of human protein-coding genes.Comment: 28 pages, 15 figures. arXiv admin note: text overlap with arXiv:1103.434

    The variant call format and VCFtools

    Get PDF
    Summary: The variant call format (VCF) is a generic format for storing DNA polymorphism data such as SNPs, insertions, deletions and structural variants, together with rich annotations. VCF is usually stored in a compressed manner and can be indexed for fast data retrieval of variants from a range of positions on the reference genome. The format was developed for the 1000 Genomes Project, and has also been adopted by other projects such as UK10K, dbSNP and the NHLBI Exome Project. VCFtools is a software suite that implements various utilities for processing VCF files, including validation, merging, comparing and also provides a general Perl API

    Whole-genome sequencing of bladder cancers reveals somatic CDKN1A mutations and clinicopathological associations with mutation burden

    Get PDF
    Bladder cancers are a leading cause of death from malignancy. Molecular markers might predict disease progression and behaviour more accurately than the available prognostic factors. Here we use whole-genome sequencing to identify somatic mutations and chromosomal changes in 14 bladder cancers of different grades and stages. As well as detecting the known bladder cancer driver mutations, we report the identification of recurrent protein-inactivating mutations in CDKN1A and FAT1. The former are not mutually exclusive with TP53 mutations or MDM2 amplification, showing that CDKN1A dysfunction is not simply an alternative mechanism for p53 pathway inactivation. We find strong positive associations between higher tumour stage/grade and greater clonal diversity, the number of somatic mutations and the burden of copy number changes. In principle, the identification of sub-clones with greater diversity and/or mutation burden within early-stage or low-grade tumours could identify lesions with a high risk of invasive progression

    Multi-level evidence of an allelic hierarchy of USH2A variants in hearing, auditory processing and speech/language outcomes.

    Get PDF
    Language development builds upon a complex network of interacting subservient systems. It therefore follows that variations in, and subclinical disruptions of, these systems may have secondary effects on emergent language. In this paper, we consider the relationship between genetic variants, hearing, auditory processing and language development. We employ whole genome sequencing in a discovery family to target association and gene x environment interaction analyses in two large population cohorts; the Avon Longitudinal Study of Parents and Children (ALSPAC) and UK10K. These investigations indicate that USH2A variants are associated with altered low-frequency sound perception which, in turn, increases the risk of developmental language disorder. We further show that Ush2a heterozygote mice have low-level hearing impairments, persistent higher-order acoustic processing deficits and altered vocalizations. These findings provide new insights into the complexity of genetic mechanisms serving language development and disorders and the relationships between developmental auditory and neural systems

    Alignment and Prediction of cis-Regulatory Modules Based on a Probabilistic Model of Evolution

    Get PDF
    Cross-species comparison has emerged as a powerful paradigm for predicting cis-regulatory modules (CRMs) and understanding their evolution. The comparison requires reliable sequence alignment, which remains a challenging task for less conserved noncoding sequences. Furthermore, the existing models of DNA sequence evolution generally do not explicitly treat the special properties of CRM sequences. To address these limitations, we propose a model of CRM evolution that captures different modes of evolution of functional transcription factor binding sites (TFBSs) and the background sequences. A particularly novel aspect of our work is a probabilistic model of gains and losses of TFBSs, a process being recognized as an important part of regulatory sequence evolution. We present a computational framework that uses this model to solve the problems of CRM alignment and prediction. Our alignment method is similar to existing methods of statistical alignment but uses the conserved binding sites to improve alignment. Our CRM prediction method deals with the inherent uncertainties of binding site annotations and sequence alignment in a probabilistic framework. In simulated as well as real data, we demonstrate that our program is able to improve both alignment and prediction of CRM sequences over several state-of-the-art methods. Finally, we used alignments produced by our program to study binding site conservation in genome-wide binding data of key transcription factors in the Drosophila blastoderm, with two intriguing results: (i) the factor-bound sequences are under strong evolutionary constraints even if their neighboring genes are not expressed in the blastoderm and (ii) binding sites in distal bound sequences (relative to transcription start sites) tend to be more conserved than those in proximal regions. Our approach is implemented as software, EMMA (Evolutionary Model-based cis-regulatory Module Analysis), ready to be applied in a broad biological context

    Gentle Masking of Low-Complexity Sequences Improves Homology Search

    Get PDF
    Detection of sequences that are homologous, i.e. descended from a common ancestor, is a fundamental task in computational biology. This task is confounded by low-complexity tracts (such as atatatatatat), which arise frequently and independently, causing strong similarities that are not homologies. There has been much research on identifying low-complexity tracts, but little research on how to treat them during homology search. We propose to find homologies by aligning sequences with “gentle” masking of low-complexity tracts. Gentle masking means that the match score involving a masked letter is , where is the unmasked score. Gentle masking slightly but noticeably improves the sensitivity of homology search (compared to “harsh” masking), without harming specificity. We show examples in three useful homology search problems: detection of NUMTs (nuclear copies of mitochondrial DNA), recruitment of metagenomic DNA reads to reference genomes, and pseudogene detection. Gentle masking is currently the best way to treat low-complexity tracts during homology search

    Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

    Get PDF
    Background A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. Results In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. Conclusions The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign webcite

    NOX1 loss-of-function genetic variants in patients with inflammatory bowel disease.

    Get PDF
    Genetic defects that affect intestinal epithelial barrier function can present with very early-onset inflammatory bowel disease (VEOIBD). Using whole-genome sequencing, a novel hemizygous defect in NOX1 encoding NAPDH oxidase 1 was identified in a patient with ulcerative colitis-like VEOIBD. Exome screening of 1,878 pediatric patients identified further seven male inflammatory bowel disease (IBD) patients with rare NOX1 mutations. Loss-of-function was validated in p.N122H and p.T497A, and to a lesser degree in p.Y470H, p.R287Q, p.I67M, p.Q293R as well as the previously described p.P330S, and the common NOX1 SNP p.D360N (rs34688635) variant. The missense mutation p.N122H abrogated reactive oxygen species (ROS) production in cell lines, ex vivo colonic explants, and patient-derived colonic organoid cultures. Within colonic crypts, NOX1 constitutively generates a high level of ROS in the crypt lumen. Analysis of 9,513 controls and 11,140 IBD patients of non-Jewish European ancestry did not reveal an association between p.D360N and IBD. Our data suggest that loss-of-function variants in NOX1 do not cause a Mendelian disorder of high penetrance but are a context-specific modifier. Our results implicate that variants in NOX1 change brush border ROS within colonic crypts at the interface between the epithelium and luminal microbes

    Probabilistic Phylogenetic Inference with Insertions and Deletions

    Get PDF
    A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth–death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new “concordance test” benchmark on real ribosomal RNA alignments, we show that the extended program dnamlε improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm
    corecore