239 research outputs found

    Accurate Detection of Recombinant Breakpoints in Whole-Genome Alignments

    We propose a novel method for detecting sites of molecular recombination in multiple alignments. Our approach is a compromise between the previous extremes of mathematically rigorous but computationally prohibitive methods and imprecise heuristic methods. Using a combined algorithm for estimating tree structure and hidden Markov model parameters, our program detects changes in phylogenetic tree topology over a multiple sequence alignment. We evaluate our method on benchmark datasets from previous studies on two recombinant pathogens, Neisseria and HIV-1, as well as on simulated data. We show that we can detect not only recombinant regions of vastly different sizes but also the location of breakpoints with great accuracy. Our method infers recombination breakpoints well while remaining practical for larger datasets. In all cases, we confirm the breakpoint predictions of previous studies, and in many cases we offer novel predictions.
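
As a rough illustration of the idea in this abstract (not the authors' program), the sketch below treats each candidate tree topology as a hidden state and runs the Viterbi algorithm over per-column log-likelihoods; a change of state along the alignment marks a putative breakpoint. The function names, the precomputed per-column scores, and the fixed switch probability are all assumptions made for illustration. The method described above estimates tree structure and HMM parameters jointly rather than taking them as given.

```python
import math


def viterbi_topologies(site_loglik, switch_prob=0.01):
    """Return the most probable topology index for each alignment column.

    site_loglik[t][k] is the (assumed precomputed) log-likelihood of
    column t under candidate topology k. switch_prob is a hypothetical
    per-column probability of changing topology.
    """
    n_states = len(site_loglik[0])
    if n_states < 2:
        return [0] * len(site_loglik)
    log_stay = math.log(1.0 - switch_prob)
    log_switch = math.log(switch_prob / (n_states - 1))

    # Uniform prior over topologies at the first column.
    score = [math.log(1.0 / n_states) + ll for ll in site_loglik[0]]
    backptr = []
    for column in site_loglik[1:]:
        new_score, pointers = [], []
        for k in range(n_states):
            # Best predecessor state for ending in topology k at this column.
            best_prev = max(
                range(n_states),
                key=lambda j: score[j] + (log_stay if j == k else log_switch),
            )
            trans = log_stay if best_prev == k else log_switch
            new_score.append(score[best_prev] + trans + column[k])
            pointers.append(best_prev)
        score = new_score
        backptr.append(pointers)

    # Trace back from the best final state.
    state = max(range(n_states), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(backptr):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def breakpoints(path):
    """Columns (0-based) at which the inferred topology changes."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]


# Toy input: five columns favouring topology 0, then five favouring topology 1.
toy_columns = [[-1.0, -5.0]] * 5 + [[-5.0, -1.0]] * 5
toy_path = viterbi_topologies(toy_columns)
print(toy_path, breakpoints(toy_path))
```

A small switch probability penalises frequent topology changes, so isolated columns that happen to favour another tree do not produce spurious breakpoints.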

    Pervasive and non-random recombination in near full-length HIV genomes from Uganda

    Recombination is an important feature of HIV evolution, occurring both within and between the major branches of diversity (subtypes). The Ugandan epidemic is primarily composed of two subtypes, A1 and D, that have been co-circulating for 50 years, frequently recombining in dually infected patients. Here, we investigate the frequency of recombinants in this population and the location of breakpoints along the genome. As part of the PANGEA-HIV consortium, 1,472 consensus genome sequences over 5 kb have been obtained from 1,857 samples collected by the MRC/UVRI & LSHTM Research Unit in Uganda, 465 (31.6 per cent) of which were near full-length sequences (>8 kb). Using the subtyping tool SCUEAL, we find that of the near full-length dataset, 233 (50.1 per cent) genomes contained only one subtype, 30.8 per cent A1 (n = 143), 17.6 per cent D (n = 82), and 1.7 per cent C (n = 8), while 49.9 per cent (n = 232) contained more than one subtype (including A1/D (n = 164), A1/C (n = 13), C/D (n = 9), A1/C/D (n = 13), and 33 complex types). K-means clustering of the recombinant A1/D genomes revealed that a section of envelope (C2gp120-TMgp41) is often inherited intact, whilst a generalized linear model was used to demonstrate significantly fewer breakpoints in the gag-pol and envelope C2-TM regions compared with accessory gene regions. Despite similar recombination patterns in many recombinants, no clearly supported circulating recombinant form (CRF) was found, there was limited evidence of the transmission of breakpoints, and the vast majority (153/164; 93 per cent) of the A1/D recombinants appear to be unique recombinant forms. Thus, recombination is pervasive with clear biases in breakpoint location, but CRFs are not a significant feature, which is characteristic of a complex and diverse epidemic.
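
The proportions quoted in this abstract can be re-derived from the counts it reports. The short check below does only that arithmetic; the counts are copied from the text above, and the variable and function names are illustrative rather than taken from the study.

```python
# Counts copied from the abstract above; variable names are illustrative.
consensus_over_5kb = 1472
near_full_length = 465
pure = {"A1": 143, "D": 82, "C": 8}
recombinant = {"A1/D": 164, "A1/C": 13, "C/D": 9, "A1/C/D": 13, "complex": 33}


def pct(part, whole, digits=1):
    """Percentage of `part` in `whole`, rounded to `digits` decimals."""
    return round(100.0 * part / whole, digits)


n_pure = sum(pure.values())
n_recombinant = sum(recombinant.values())

# The pure and recombinant groups should partition the near full-length set.
assert n_pure + n_recombinant == near_full_length

summary = {
    "near_full_length": pct(near_full_length, consensus_over_5kb),
    "pure": pct(n_pure, near_full_length),
    "recombinant": pct(n_recombinant, near_full_length),
    **{name: pct(n, near_full_length) for name, n in pure.items()},
    "unique_A1/D": pct(153, recombinant["A1/D"], digits=0),
}
print(summary)
```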

    Characterization and frequency of a newly identified HIV-1 BF1 intersubtype circulating recombinant form in São Paulo, Brazil

    Background: HIV circulating recombinant forms (CRFs) play an important role in the global and regional HIV epidemics, particularly in regions where multiple subtypes are circulating. To date, more than 40 CRFs are recognized worldwide, with five currently circulating in Brazil. Here, we report the characterization of near full-length genome (NFLG) sequences of six phylogenetically related HIV-1 BF1 intersubtype recombinants (five from this study and one from other published sequences) representing CRF46_BF1. Methods: Initially, we selected 36 samples from 888 adult patients residing in São Paulo who had previously been diagnosed as being infected with subclade F1 based on pol subgenomic fragment sequencing. Proviral DNA integrated in peripheral blood mononuclear cells (PBMC) was amplified from the purified genomic DNA of all 36 blood samples as five overlapping PCR fragments, followed by direct sequencing. Sequence data were obtained from the five fragments that showed identical genomic structure, and phylogenetic trees were constructed and compared with previously published sequences. Genuine subclade F1 sequences and any other sequences that exhibited unique mosaic structures were omitted from further analysis. Results: Of the 36 samples analyzed, only six sequences, inferred from the pol region as subclade F1, displayed identical BF1 mosaic genomes with a single intersubtype breakpoint identified at the nef-U3 overlap (HXB2 position 9347-9365; LTR region). Five of these isolates formed a rigid cluster in phylogenetic trees from different subclade F1 fragment regions, which we can now designate as CRF46_BF1. According to our estimate, the new CRF accounts for 0.56% of the HIV-1 circulating strains in São Paulo. Comparison with previously published sequences revealed an additional five isolates that share an identical mosaic structure with those reported in our study. Despite sharing a similar recombinant structure, only one sequence appeared to originate from the same CRF46_BF1 ancestor. Conclusion: We identified a new circulating recombinant form with a single intersubtype breakpoint at the nef-LTR U3 overlap and designated it CRF46_BF1. Given the biological importance of the LTR U3 region, intersubtype recombination in this region could play an important role in HIV evolution with critical consequences for the development of efficient genetic vaccines.
    Funding: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP): 06/50096-0, 2004/15856-9, 2007/04890-0. Affiliations: Hemoctr, Fundacao Prosangue, São Paulo, Brazil; Universidade Federal de São Paulo, Retrovirol Lab, São Paulo, Brazil.

    Detecting recombination and its mechanistic association with genomic features via statistical models

    Recombination is a powerful weapon in the evolutionary arsenal of retroviruses such as HIV. It enables the production of chimeric variants or recombinants that may confer a selective advantage to the pathogen over the host immune response. Recombinants further accentuate differences in virulence, disease progression and drug resistance mutation patterns already observed in non-recombinant variants of HIV. This thesis describes the development of a rapid genotyper for HIV sequences employing supervised learning algorithms and its application to complex HIV recombinant data, the application of a hierarchical model for detection of recombination hotspots in the HIV-1 genome, and the extension of this model to enable estimation of the association between recombination probabilities and covariates of interest. The rapid genotyper for HIV-1 explores a solution to the genotyping problem in the machine learning paradigm. Of the algorithms tested, the genotyper built using Bayesian additive regression trees (BART) was most successful in efficiently classifying complex recombinants that pose a challenge to other currently available genotyping methods. We also developed a novel method, bootSMOTE, for generating synthetic data in order to supplement insufficient training data. We found that supplementation with synthetic recombinants especially boosts identification of complex recombinants. We describe the genotyper software available for download as well as a web interface enabling rapid classification of HIV-1 sequences. Hotspots for recombination in the HIV-1 genome are modeled using spatially smoothed changepoint processes. This hierarchical model uses a phylogenetic recombination detection model of dual changepoint processes at the lower level. The upper level applies a Gaussian Markov random field (GMRF) hyperprior to population-level recombination probabilities in order to efficiently combine the information from many individual recombination events as inferred at the lower level. Focusing on 544 unique recombinant sequences, we found a novel hotspot in the pol gene of HIV-1 while confirming the presence of high recombination activity in the env gene. Valuable insights into the molecular mechanism of recombination may be gained by extending the GMRF model to include covariates of interest. We add a level to the hierarchical model and allow for the simultaneous inference of recombination probabilities as well as their association with genomic covariates of interest. Using a set of 527 unique recombinants, we confirmed the presence of the pol hotspot. Interestingly, we found significant positive associations of spatial fluctuations in recombination probabilities with genomic regions prone to forming secondary structure, as well as significant negative associations with regions that support tight RNA-DNA hybrid formation. Overall, our results support the theory that pause sites along the genome promote recombination.

    The HIV-1 Subtype C Epidemic in South America Is Linked to the United Kingdom

    Background: The global spread of HIV-1 has been accompanied by the emergence of genetically distinct viral strains. Over the past two decades subtype C viruses, which predominate in Southern and Eastern Africa, have spread rapidly throughout parts of South America. Phylogenetic studies indicate that subtype C viruses were introduced to South America through a single founder event that occurred in Southern Brazil. However, the external route via which subtype C viruses spread to the South American continent has remained unclear. Methodology/Principal Findings: We used automated genotyping to screen 8,309 HIV-1 subtype C pol gene sequences sampled within the UK for isolates genetically linked to the subtype C epidemic in South America. Maximum likelihood and Bayesian approaches were used to explore the phylogenetic relationships between 54 sequences identified in this screen and a set of globally sampled subtype C reference sequences. Phylogenetic trees disclosed a robustly supported relationship between sequences from Brazil, the UK and East Africa. A monophyletic cluster comprised exclusively of sequences from the UK and Brazil was identified and dated to approximately the early 1980s using a Bayesian coalescent-based method. A sub-cluster of 27 sequences isolated from homosexual men of UK origin was also identified and dated to the early 1990s. Conclusions: Phylogenetic, demographic and temporal data support the conclusion that the UK was a crucial staging post in the spread of subtype C from East Africa to South America. This unexpected finding demonstrates the role of diffuse international networks in the global spread of HIV-1 infection, and the utility of globally sampled viral sequence data in revealing these networks. Additionally, we show that subtype C viruses are spreading within the UK amongst men who have sex with men.

    Identification of broadly neutralizing antibody epitopes in the HIV-1 envelope glycoprotein using evolutionary models

    Background: Identification of the epitopes targeted by antibodies that can neutralize diverse HIV-1 strains can provide important clues for the design of a preventative vaccine. Methods: We have developed a computational approach that can identify key amino acids within the HIV-1 envelope glycoprotein that influence sensitivity to broadly cross-neutralizing antibodies. Given a sequence alignment and neutralization titers for a panel of viruses, the method works by fitting a phylogenetic model that allows the amino acid frequencies at each site to depend on neutralization sensitivities. Sites at which viral evolution influences neutralization sensitivity were identified using Bayes factors (BFs) to compare the fit of this model to that of a null model in which sequences evolved independently of antibody sensitivity. Conformational epitopes were identified with a Metropolis algorithm that searched for a cluster of sites with large Bayes factors on the tertiary structure of the viral envelope. Results: We applied our method to ID50 neutralization data generated from seven HIV-1 subtype C serum samples with neutralization breadth that had been tested against a multi-clade panel of 225 pseudoviruses for which envelope sequences were also available. For each sample, between two and four sites were identified that were strongly associated with neutralization sensitivity (2ln(BF) > 6), a subset of which were experimentally confirmed using site-directed mutagenesis. Conclusions: Our results provide strong support for the use of evolutionary models applied to cross-sectional viral neutralization data to identify the epitopes of serum antibodies that confer neutralization breadth.
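
The threshold 2ln(BF) > 6 quoted above applies to the ratio of the marginal likelihoods of the two models being compared. A minimal sketch, assuming per-site log marginal likelihoods are already available; the function names and the example values are hypothetical and not taken from the study:

```python
def two_ln_bf(log_ml_alt, log_ml_null):
    """2 ln(BF), where BF = P(data | alternative) / P(data | null)."""
    return 2.0 * (log_ml_alt - log_ml_null)


def associated_sites(log_ml_alt, log_ml_null, threshold=6.0):
    """Indices of sites whose 2 ln(BF) exceeds the threshold."""
    return [
        i
        for i, (alt, null) in enumerate(zip(log_ml_alt, log_ml_null))
        if two_ln_bf(alt, null) > threshold
    ]


# Made-up log marginal likelihoods for three sites under each model.
alt_model = [-100.0, -95.0, -120.0]
null_model = [-101.0, -99.5, -120.0]
print(associated_sites(alt_model, null_model))
```

Working with log marginal likelihoods avoids underflow, since the likelihoods themselves are usually far too small to represent directly.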

    Detection of viral sequence fragments of HIV-1 subfamilies yet unknown

    Background: Methods of determining whether any particular HIV-1 sequence stems, completely or in part, from some unknown HIV-1 subtype are important for the design of vaccines and molecular detection systems, as well as for epidemiological monitoring. Nevertheless, only a single algorithm, the Branching Index (BI), has been developed for this task so far. Moving along the genome of a query sequence in a sliding window, the BI computes a ratio quantifying how closely the query sequence clusters with a subtype clade. In its current version, however, the BI does not provide predicted boundaries of unknown fragments. Results: We have developed Unknown Subtype Finder (USF), an algorithm based on a probabilistic model, which automatically determines which parts of an input sequence originate from a subtype yet unknown. The underlying model is based on a simple profile hidden Markov model (pHMM) for each known subtype and an additional pHMM for an unknown subtype. The emission probabilities of the latter are estimated using the emission frequencies of the known subtypes by means of a (position-wise) probabilistic model for the emergence of new subtypes. We have applied USF to SIV and HIV-1 sequences formerly classified as having emerged from an unknown subtype. Moreover, we have evaluated its performance on artificial HIV-1 recombinants and non-recombinant HIV-1 sequences. The results have been compared with the corresponding results of the BI. Conclusions: Our results demonstrate that USF is suitable for detecting segments in HIV-1 sequences stemming from yet unknown subtypes. Comparing USF with the BI shows that our algorithm performs as well as the BI or better.

    Molecular phylodynamics and protein modeling of infectious salmon anemia virus (ISAV)

    Background: ISAV is a member of the Orthomyxoviridae family that affects salmonids with disastrous results. It was first detected in 1984 in Norway and has since been reported in Canada, the United States, Scotland and the Faroe Islands. Recently, an outbreak was recorded in Chile with negative consequences for the local fishing industry. However, few studies have examined available data to test hypotheses associated with the phylogeographic partitioning of the infecting viral population, the population dynamics, or the evolutionary rates and demographic history of ISAV. To explore these issues, we collected relevant sequences of genes coding for both surface proteins from Chile, Canada, and Norway. We addressed questions regarding their phylogenetic relationships, evolutionary rates, and demographic history using modern phylogenetic methods. Results: A recombination breakpoint was consistently detected in the Hemagglutinin-Esterase (he) gene at either side of the Highly Polymorphic Region (HPR), whereas no recombination breakpoints were detected in the Fusion protein (f) gene. Evolutionary relationships of ISAV revealed the 2007 Chilean outbreak group as a monophyletic clade for f that has a sister relationship to the Norwegian isolates. Their tMRCA is consistent with epidemiological data, and demographic history was successfully recovered, showing a profound bottleneck with further population expansion. Finally, selection analyses detected ongoing diversifying selection in f and he codons associated with protease processing and the HPR region, respectively. Conclusions: Our results are consistent with the Norwegian origin hypothesis for the Chilean outbreak clade. In particular, ISAV HPR0 genotype is not the ancestor of all ISAV strains, although SK779/06 (HPR0) shares a common ancestor with the Chilean outbreak clade. Our analyses suggest that ISAV shows hallmarks typical of RNA viruses that can be exploited in epidemiological and surveillance settings. In addition, we hypothesized that genetic diversity of the HPR region is governed by recombination, probably due to template switching, and that novel fusion gene proteolytic sites confer a selective advantage for the isolates that carry them. Additionally, protein modeling allowed us to relate the results of phylogenetic studies to the predicted structures. This study demonstrates that phylogenetic methods are important tools to predict future outbreaks of ISAV and other salmon pathogens.

    Algorithms for efficient alignment-free sequence comparison (Algoritmi za učinkovitu usporedbu sekvenci bez korištenja sravnjivanja)

    Sequence comparison is an essential tool in modern biology. It is used to identify homologous regions between sequences, and to detect evolutionary relationships between organisms. Sequence comparison is usually based on alignments. However, aligning whole genomes is computationally difficult. As an alternative approach, alignment-free sequence comparison can be used. In my thesis, I concentrate on two problems that can be solved without alignment: (i) estimation of substitution rates between nucleotide sequences, and (ii) detection of local sequence homology. In the first part of my thesis, I developed and implemented a new algorithm for the efficient alignment-free computation of the number of nucleotide substitutions per site, and applied it to the analysis of large data sets of complete genomes. In the second part of my thesis, I developed and implemented a new algorithm for detecting matching regions between nucleotide sequences. I applied this solution to the classification of circulating recombinant forms of HIV, and to the analysis of bacterial genomes subject to horizontal gene transfer.

    Table of Contents
    1. GENERAL INTRODUCTION
        1.1. Suffix trees and other index data structures used in biological sequence analysis
            1.1.1. Suffix Tree
            1.1.2. The space and the time complexity of the algorithms for the suffix tree construction
            1.1.3. Suffix Array
            1.1.4. The space and the time complexity of the algorithms for suffix array construction
            1.1.5. Enhanced Suffix Array
            1.1.6. The 64-bit implementation of the lightweight suffix array construction algorithm
            1.1.7. Self-index
            1.1.8. Burrows-Wheeler transform
            1.1.9. The FM-Index and the backward search algorithm
            1.1.10. The space and the time-complexity of the FM-index
    2. EFFICIENT ESTIMATION OF PAIRWISE DISTANCES BETWEEN GENOMES
        2.1. Introduction
        2.2. Methods
            2.2.1. Definition of an alignment-free estimator of the rate of substitution, Kr
            2.2.2. Problem statement
            2.2.3. Time complexity analysis of the previous approach (kr 1)
            2.2.4. Time complexity analysis of the new approach (kr 2)
            2.2.5. Algorithm 1: Computation of all Kr values during the traversal of a generalized suffix tree of n sequences
            2.2.6. The implementation of kr version 2
        2.3. Analysis of Kr on simulated data sets
            2.3.1. Auxiliary programs
            2.3.2. Consistency of Kr
            2.3.3. The effect of horizontal gene transfer on the accuracy of Kr
            2.3.4. The effect of genome duplication on the accuracy of Kr
            2.3.5. Run time comparison of kr 1 and kr 2
        2.4. Application of kr version 2
            2.4.1. Auxiliary software used for the analysis of real data sets
            2.4.2. The analysis of 12 Drosophila genomes
            2.4.3. The analysis of 13 Escherichia coli and Shigella genomes
            2.4.4. The analysis of 825 HIV-1 pure subtype genomes
        2.5. Discussion
    3. EFFICIENT ALIGNMENT-FREE DETECTION OF LOCAL SEQUENCE HOMOLOGY
        3.1. Introduction
        3.2. Methods
            3.2.1. Problem statement: determining subtype(s) of a query sequence
            3.2.2. Construction of locally homologous segments
            3.2.3. Time complexity of computing a list of intervals Ii
            3.2.4. Algorithm 2: Construction of an interval tree
            3.2.5. Computing a list of segments Gi
        3.3. Analysis of st on simulated data sets
            3.3.1. Run-time and memory usage analysis of st
            3.3.2. Consistency of st
            3.3.3. Comparison to SCUEAL on simulated data sets
        3.4. Application of st
            3.4.1. The analysis of Neisseria meningitidis
            3.4.2. The analysis of a recombinant form of HIV-1
            3.4.3. The analysis of circulating recombinant forms of HIV-1
            3.4.4. The analysis of an avian pathogenic Escherichia coli strain
        3.5. Discussion
    4. CONCLUSION
    5. REFERENCES
    6. ELECTRONIC SOURCES
    7. LIST OF ABBREVIATIONS AND SYMBOLS
    ABSTRACT
    SAŽETAK (abstract in Croatian)
    CURRICULUM VITAE
    ŽIVOTOPIS (curriculum vitae in Croatian)