25 research outputs found

    QQ-SNV: single nucleotide variant detection at low frequency by comparing the quality quantiles

    Background: Next generation sequencing enables studying heterogeneous populations of viral infections. When the sequencing is done at high coverage depth ("deep sequencing"), low frequency variants can be detected. Here we present QQ-SNV (http://sourceforge.net/projects/qqsnv), a logistic regression classifier model developed for the Illumina sequencing platforms that uses the quantiles of the quality scores to distinguish true single nucleotide variants (SNVs) from sequencing errors based on the estimated SNV probability. To train the model, we created a dataset of an in silico mixture of five HIV-1 plasmids. Testing of our method in comparison to the existing methods LoFreq, ShoRAH, and V-Phaser 2 was performed on two HIV and four HCV plasmid mixture datasets and one influenza H1N1 clinical dataset. Results: For default application of QQ-SNV, variants were called using an SNV probability cutoff of 0.5 (QQ-SNVD). To improve the sensitivity we used an SNV probability cutoff of 0.0001 (QQ-SNVHS). To also increase specificity, called SNVs were overruled when their frequency was below the 80th percentile calculated on the distribution of error frequencies (QQ-SNVHS-P80). When comparing QQ-SNV versus the other methods on the plasmid mixture test sets, QQ-SNVD performed similarly to the existing approaches. QQ-SNVHS was more sensitive on all test sets but with more false positives. QQ-SNVHS-P80 was found to be the most accurate method over all test sets by balancing sensitivity and specificity. When applied to a paired-end HCV sequencing study, with a lowest spiked-in true frequency of 0.5%, QQ-SNVHS-P80 showed a sensitivity of 100% (vs. 40-60% for the existing methods) and a specificity of 100% (vs. 98.0-99.7% for the existing methods). In addition, QQ-SNV required the least overall computation time to process the test sets. Finally, when testing on a clinical sample, four putative true variants with frequency below 0.5% were consistently detected by QQ-SNVHS-P80 from different generations of Illumina sequencers. Conclusions: We developed and successfully evaluated a novel method, called QQ-SNV, for highly efficient single nucleotide variant calling on Illumina deep sequencing virology data.
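
The three calling modes described in this abstract (QQ-SNVD, QQ-SNVHS, QQ-SNVHS-P80) can be illustrated with a minimal sketch. It is not the QQ-SNV implementation: the function names, the tuple layout, the inclusive comparisons, and the nearest-rank percentile convention are all assumptions made for illustration, and the SNV probabilities are taken as given rather than estimated by a logistic regression model.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile (assumed convention; the abstract does not specify one)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def call_snvs(candidates, prob_cutoff, error_freqs=None, pct=80):
    """Return positions of called SNVs.

    candidates  -- list of (position, frequency, snv_probability) tuples (hypothetical layout)
    prob_cutoff -- minimum SNV probability for a call (0.5 or 0.0001 in the abstract)
    error_freqs -- optional error-frequency distribution; when given, calls whose
                   frequency lies below its pct-th percentile are overruled
    """
    called = [c for c in candidates if c[2] >= prob_cutoff]
    if error_freqs is not None:
        floor = percentile(error_freqs, pct)
        called = [c for c in called if c[1] >= floor]
    return [c[0] for c in called]


# Toy values, invented for illustration only.
candidates = [(10, 0.02, 0.9), (20, 0.004, 0.3), (30, 0.0005, 0.01), (40, 0.001, 0.00001)]
error_freqs = [0.0002, 0.0004, 0.0006, 0.0008, 0.001, 0.0012, 0.0014, 0.0016, 0.0018, 0.002]

default_calls = call_snvs(candidates, 0.5)                  # QQ-SNVD-like
sensitive_calls = call_snvs(candidates, 0.0001)             # QQ-SNVHS-like
balanced_calls = call_snvs(candidates, 0.0001, error_freqs) # QQ-SNVHS-P80-like
```

Lowering the probability cutoff admits more candidates, and the percentile floor then removes those whose frequency is indistinguishable from the bulk of the error distribution, which is the sensitivity/specificity trade-off the abstract reports.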

    BIGL : Biochemically Intuitive Generalized Loewe null model for prediction of the expected combined effect compatible with partial agonism and antagonism

    Clinical efficacy regularly requires the combination of drugs. For an early estimation of the clinical value of (potentially many) combinations of pharmacologic compounds during discovery, the observed combination effect is typically compared to that expected under a null model. Mechanistic accuracy of that null model is not aspired to; on the contrary, combinations that deviate favorably from the model (and thereby disprove its accuracy) are prioritized. Arguably the most popular null model is the Loewe Additivity model, which conceptually maps any assay under study to a (virtual) single-step enzymatic reaction. It is easy to interpret and requires no information other than the concentration-response curves of the individual compounds. However, the original Loewe model cannot accommodate concentration-response curves with different maximal responses and, as a consequence, combinations of an agonist with a partial or inverse agonist. We propose an extension, named Biochemically Intuitive Generalized Loewe (BIGL), that can address different maximal responses while preserving the biochemical underpinning and interpretability of the original Loewe model. In addition, we formulate statistical tests for detecting synergy and antagonism, which identify observed combined effects that are statistically significantly greater or lesser than expected from the null model. Finally, we demonstrate the novel method through application to several publicly available datasets.
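
As a point of reference for the limitation mentioned in this abstract, the sketch below computes the expected combined effect under classical Loewe additivity, not under BIGL. It assumes increasing Hill-type concentration-response curves that share a baseline `e0` and a single maximal response `emax`; the function names and the `(ec50, h)` curve tuples are hypothetical. The expected effect `y` solves d1/D1(y) + d2/D2(y) = 1, where Di(y) is the dose of compound i alone that yields `y`.

```python
def hill(d, e0, emax, ec50, h):
    """Effect of a single compound at dose d on a Hill curve."""
    return e0 + (emax - e0) * d**h / (ec50**h + d**h)


def hill_inverse(y, e0, emax, ec50, h):
    """Dose of a single compound that produces effect y (requires e0 < y < emax)."""
    return ec50 * ((y - e0) / (emax - y)) ** (1.0 / h)


def loewe_expected(d1, d2, curve1, curve2, e0=0.0, emax=1.0, tol=1e-12):
    """Expected effect of doses (d1, d2) under classical Loewe additivity.

    curve1, curve2 -- (ec50, h) tuples; both curves share e0 and emax, which is
    exactly the restriction of the classical model.
    """
    if d1 == 0 and d2 == 0:
        return e0
    lo, hi = e0, emax
    # The left-hand side of the Loewe equation decreases as y increases, so bisect on y.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        total = (d1 / hill_inverse(mid, e0, emax, *curve1)
                 + d2 / hill_inverse(mid, e0, emax, *curve2))
        if total > 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

A useful sanity check is the sham combination: combining a compound with itself should reproduce its own curve at the summed dose, so `loewe_expected(0.5, 0.5, (1.0, 1.0), (1.0, 1.0))` should agree with `hill(1.0, 0.0, 1.0, 1.0, 1.0)`. Because `hill_inverse` is undefined for effects at or beyond `emax`, the sketch cannot be applied to two curves with different maximal responses.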

    A comparative analysis of HIV drug resistance interpretation based on short reverse transcriptase sequences versus full sequences

    Background: As second-line antiretroviral treatment (ART) becomes more accessible in resource-limited settings (RLS), the need for more affordable monitoring tools such as point-of-care viral load assays and simplified genotypic HIV drug resistance (HIVDR) tests increases substantially. The prohibitive expenses of genotypic HIVDR assays could partly be addressed by focusing on a smaller region of the HIV reverse transcriptase gene (RT) that encompasses the majority of HIVDR mutations for people on ART in RLS. In this study, an in silico analysis of 125,329 RT sequences was performed to investigate the effect of submitting short RT sequences (codon 41 to 238) to the commonly used virco®TYPE and Stanford genotype interpretation tools. Results: Pair-wise comparisons between full-length and short RT sequences were performed. Additionally, a non-inferiority approach with a concordance limit of 95% and two-sided 95% confidence intervals was used to demonstrate concordance between HIVDR calls based on full-length and short RT sequences. The results of this analysis showed that HIVDR interpretations based on full-length versus short RT sequences, using the Stanford algorithms, had concordance significantly above 95%. When using the virco®TYPE algorithm, similar concordance was demonstrated (>95%), but some differences were observed for d4T, AZT and TDF, where predictions were affected in more than 5% of the sequences. Most differences in interpretation, however, were due to shifts from fully susceptible to reduced susceptibility (d4T) or from reduced response to minimal response (AZT, TDF) or vice versa, as compared to the predicted full RT sequence. The virco®TYPE prediction uses many more mutations outside the RT 41-238 amino acid domain, which significantly contribute to the HIVDR prediction for these 3 antiretroviral agents. Conclusions: This study illustrates the acceptability of using shortened RT sequences (codon 41-238) to obtain reliable genotype interpretations by the virco®TYPE and Stanford algorithms. Implementation of this simplified protocol could significantly reduce the cost of both resistance testing and ARV treatment monitoring in RLS.
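
The non-inferiority check described in this abstract (concordance limit of 95%, two-sided 95% confidence interval) can be sketched as follows. The abstract does not state which interval was used; the Wilson score interval here is an assumption, as are the function names and the example counts.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Two-sided Wilson score interval for a proportion (z defaults to the 95% level)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def concordance_shown(concordant, n, limit=0.95):
    """Concordance is demonstrated when the lower confidence bound exceeds the limit."""
    lower, _ = wilson_interval(concordant, n)
    return lower > limit


# Hypothetical counts: 980 of 1000 paired interpretations agree.
lower_bound, upper_bound = wilson_interval(980, 1000)
```

With these invented counts the lower bound lies above 0.95, so concordance would be declared; with 96 agreements out of 100 the point estimate is still above 95% but the interval is too wide to support the claim.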

    Cross-validated stepwise regression for identification of novel non-nucleoside reverse transcriptase inhibitor resistance associated mutations

    Background: Linear regression models are used to quantitatively predict drug resistance, the phenotype, from the HIV-1 viral genotype. As new antiretroviral drugs become available, new resistance pathways emerge and the number of resistance associated mutations continues to increase. To accurately identify which drug options are left, the main goal of the modeling has been to maximize predictivity and not interpretability. However, we originally selected linear regression as the preferred method for its transparency as opposed to other techniques such as neural networks. Here, we apply a method to lower the complexity of these phenotype prediction models using a 3-fold cross-validated selection of mutations. Results: Compared to standard stepwise regression we were able to reduce the number of mutations in the reverse transcriptase (RT) inhibitor models as well as the number of interaction terms accounting for synergistic and antagonistic effects. This reduction in complexity was most significant for the non-nucleoside reverse transcriptase inhibitor (NNRTI) models, while maintaining prediction accuracy and retaining virtually all known resistance associated mutations as first order terms in the models. Furthermore, for etravirine (ETR) a better performance was seen on two years of unseen data. By analyzing the phenotype prediction models we identified a list of forty novel NNRTI mutations, putatively associated with resistance. The resistance association of novel variants at known NNRTI resistance positions (100, 101, 181, 190 and 221) and of mutations at positions not previously linked with NNRTI resistance (102, 139, 219, 241, 376 and 382) was confirmed by phenotyping site-directed mutants. Conclusions: We successfully identified and validated novel NNRTI resistance associated mutations by developing parsimonious resistance prediction models in which repeated cross-validation within the stepwise regression was applied. Our model selection technique is computationally feasible for large data sets and provides an approach to the continued identification of resistance-causing mutations.

    Prediction of drug response from genetic sequence data using regression techniques

    Regression techniques are increasingly important as automatic methods to study complex high-dimensional biological systems and to separate true signal from experimental noise. In this thesis, we developed novel methodologies to build linear regression models with low complexity that are at the same time accurate to predict drug response (phenotype) from HIV-1 genetic sequence mutations (genotype), where the choice of methodology depended on the size of the genotype-phenotype data sets. For large data sets we developed a novel cross-validated stepwise linear regression procedure to improve the selection of the model variables, i.e. mutations or interaction terms. The best results with our new methodology were obtained when building models for the non-nucleoside reverse transcriptase inhibitors (NNRTIs), leading to a reduced list of forty novel mutations putatively associated with NNRTI resistance. The effect on resistance of several of these mutations was confirmed experimentally by in vitro phenotyping of site-directed mutants, such as for mutations at positions not previously linked with NNRTI resistance (e.g. 102 and 139). Applying our novel method for large data sets to small data sets would not provide an effective solution against overfitting. Therefore, for small data sets we developed a novel methodology where variable selection occurred by inference from multiple genetic algorithm (GA) derived linear regression models. Moreover, we could extend this GA methodology to account for clustering in the data, which led to a more interpretable linear regression model for the integrase inhibitor raltegravir on a clonal genotype-phenotype dataset containing multiple clones derived from the same clinical isolate. Finally, we developed a logistic regression method for the accurate detection of true minor single nucleotide mutations in the presence of experimental noise, in large amounts of clonal data obtained for an individual patient with the Illumina next generation sequencing technology.

    Quantitative prediction of integrase inhibitor resistance from genotype through consensus linear regression modeling

    Background: Integrase inhibitors (INI) form a new drug class in the treatment of HIV-1 patients. We developed a linear regression modeling approach to make a quantitative raltegravir (RAL) resistance phenotype prediction, as Fold Change in IC50 against a wild type virus, from mutations in the integrase genotype. Methods: We developed a clonal genotype-phenotype database with 991 clones from 153 clinical isolates of INI naïve and RAL treated patients, and 28 site-directed mutants. We developed the RAL linear regression model in two stages, employing a genetic algorithm (GA) to select integrase mutations by consensus. First, we ran multiple GAs to generate first order linear regression models (GA models) that were stochastically optimized to reach a goal R² accuracy, and consisted of a fixed-length subset of integrase mutations to estimate INI resistance. Second, we derived a consensus linear regression model in a forward stepwise regression procedure, considering integrase mutations or mutation pairs by descending prevalence in the GA models. Results: The most frequently occurring mutations in the GA models were 92Q, 97A, 143R and 155H (all 100%), 143G (90%), 148H/R (89%), 148K (88%), 151I (81%), 121Y (75%), 143C (72%), and 74M (69%). The RAL second order model contained 30 single mutations and five mutation pairs (p < […]). The R² performance of this model on the clonal training data was 0.97, and 0.78 on an unseen population genotype-phenotype dataset of 171 clinical isolates from RAL treated and INI naïve patients. Conclusions: We describe a systematic approach to derive a model for predicting INI resistance from a limited amount of clonal samples. Our RAL second order model is made available as an Additional file for calculating a resistance phenotype as the sum of integrase mutations and mutation pairs.
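
The ranking step in the second stage, ordering mutations by descending prevalence across the GA models, can be sketched as below. This is an illustration under assumptions, not the authors' code: each GA model is represented simply as a set of selected mutation names, the function name is hypothetical, and the toy models are invented (they reuse mutation names from the abstract but not its reported prevalences).

```python
from collections import Counter


def mutation_prevalence(ga_models):
    """Rank mutations by the fraction of GA models that selected them.

    ga_models -- iterable of collections of mutation names, one per GA model.
    Ties are broken alphabetically so the ordering is deterministic.
    """
    models = [set(model) for model in ga_models]
    counts = Counter(m for model in models for m in model)
    n = len(models)
    return sorted(((m, c / n) for m, c in counts.items()),
                  key=lambda item: (-item[1], item[0]))


# Toy GA models, invented for illustration only.
toy_models = [
    {"155H", "143R", "148K"},
    {"155H", "143R", "74M"},
    {"155H", "148K", "143R"},
    {"155H", "143R", "121Y"},
]
ranking = mutation_prevalence(toy_models)
```

A forward stepwise procedure would then consider candidate terms in the order given by `ranking`, so mutations selected by most GA models are evaluated first.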

    VirVarSeq: a low-frequency virus variant detection pipeline for Illumina sequencing using adaptive base-calling accuracy filtering

    Motivation: In virology, massively parallel sequencing (MPS) opens many opportunities for studying viral quasi-species, e.g. in HIV-1- and HCV-infected patients. This is essential for understanding pathways to resistance, which can substantially improve treatment. Although MPS platforms allow in-depth characterization of sequence variation, their measurements still involve substantial technical noise. For Illumina sequencing, single base substitutions are the main error source and impede powerful assessment of low-frequency mutations. Fortunately, base calls are complemented with quality scores (Qs) that are useful for differentiating errors from real low-frequency mutations. Results: A variant calling tool, Q-cpileup, is proposed, which exploits the Qs of nucleotides in a filtering strategy to increase specificity. The tool is embedded in an open-source pipeline, VirVarSeq, which allows variant calling starting from fastq files. Using both plasmid mixtures and clinical samples, we show that Q-cpileup is able to reduce the number of false-positive findings. The filtering strategy is adaptive and provides an optimized threshold for individual samples in each sequencing run. Additionally, linkage information is kept between single-nucleotide polymorphisms as variants are called at the codon level. This enables virologists to have an immediate biological interpretation of the reported variants with respect to their antiviral drug responses. A comparison with existing SNP caller tools reveals that calling variants at the codon level with Q-cpileup results in an outstanding sensitivity while maintaining a good specificity for variants with frequencies down to 0.5%.
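
The idea of quality-filtered, codon-level counting can be sketched as follows. This is not Q-cpileup: the abstract describes an adaptive, per-sample threshold, whereas this sketch takes a fixed `min_q` as an input, and the function name, the read representation, and the toy reads are assumptions made for illustration. Reads are assumed to be already aligned so that index 0 corresponds to the same reference position in every read.

```python
from collections import Counter


def codon_frequencies(reads, codon_start, min_q):
    """Codon frequencies at one codon position after quality filtering.

    reads       -- list of (sequence, qualities) pairs with equal-length members
    codon_start -- 0-based index of the first base of the codon
    min_q       -- a codon is counted only if all three of its bases reach this Q
    """
    counts = Counter()
    for seq, quals in reads:
        codon = seq[codon_start:codon_start + 3]
        codon_q = quals[codon_start:codon_start + 3]
        # Skip reads that do not cover the full codon or contain a low-quality base.
        if len(codon) == 3 and len(codon_q) == 3 and min(codon_q) >= min_q:
            counts[codon] += 1
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


# Toy reads, invented for illustration only.
toy_reads = [
    ("ATGAAA", [30, 30, 30, 30, 30, 30]),
    ("ATGAGA", [30, 30, 30, 30, 10, 30]),  # low-quality middle base in the second codon
    ("ATGAAA", [35, 35, 35, 35, 35, 35]),
    ("ATGAAG", [30, 30, 30, 30, 30, 30]),
]
filtered = codon_frequencies(toy_reads, codon_start=3, min_q=20)
unfiltered = codon_frequencies(toy_reads, codon_start=3, min_q=0)
```

In the toy data the codon AGA is supported only by a low-quality base, so it appears in `unfiltered` but is absent from `filtered`. Counting whole codons rather than single bases keeps the linkage between neighbouring substitutions, as the abstract notes.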