209 research outputs found

    Phylogenetic estimates of HIV-1 gp120 indel rates across the group M subtypes

    Insertions and deletions (indels) in the HIV-1 envelope glycoprotein gp120 play a significant role in the evolution of HIV pathogenesis and transmission fitness. While substitution rates in HIV-1 are well characterized by phylogenetic models, there is a lack of quantitative measures of indel rates in HIV-1. Here we use a dated-tip phylogenetic analysis of gp120 sequences to estimate indel rates for 7 subtypes and CRFs of HIV-1 group M. We obtained and processed 26,359 HIV-1 gp120 sequences from the Los Alamos National Laboratory HIV Sequence database. After filtering these sequences, we extracted the conserved and variable regions from the remaining 6,605 sequences by pairwise alignment. We used FastTree2 to reconstruct phylogenies from the alignment of concatenated conserved regions, and used least-squares dating (LSD) to rescale these trees in time. We estimated variable region indel rates by fitting a binomial-Poisson model to length discordance in sequences related by cherries. Indel rate estimates ranged from 3e-5 to 1.5e-3/nt/year and varied significantly among variable regions and subtypes; e.g., rates were significantly lower for subtype B. Variable regions V1, V2 and V4 accumulated significantly longer indels irrespective of subtype, and we found evidence of positive selection for indels affecting N-linked glycosylation sites in V1/V2. Further, we observed that indel sequences were enriched for G and depleted for T relative to the flanking sequences. Our results comprise the first phylogenetic measures of indel rates in HIV-1 gp120 across subtypes and variable regions, and identify novel and unexpected patterns for further investigation into HIV-1 evolution

    An Application of the Modifiable Areal Unit Problem: Optimizing Cluster Method Parameters to Produce Predictive Data for HIV Outbreaks

    Background: A popular approach to studying HIV outbreaks is to cluster cases based on genetic similarity. However, there is no widely used statistical criterion that optimizes the parameters for sequence-based clustering methods. The relationship between a cluster-defining similarity threshold and its associated set of clusters can be likened to the aggregation level in the Modifiable Areal Unit Problem (MAUP). Hypothesis: Based on the selection of aggregation level for study partitions in MAUP, we present a statistical framework to optimize the similarity threshold for the pairwise distance algorithm TN93 (http://github.com/veg/tn93). We hypothesize that optimizing this threshold includes case connections such that the most predictive clusters are defined for the purposes of public health. Methods: We obtained 1,653 published HIV-1 pol sequences from Seattle, USA. The sequences were aligned using MAFFT and coupled with sampling dates from GenBank. Years ranged from 2000 to 2013, with 2013 cases reflecting cluster growth. TN93 obtained pairwise distances between sequences, and an R script interpreted these distances as an annotated, undirected network. Edges between cases were included in this network based on a cutoff d, which was varied from 0 to 0.06 in steps of 0.001. Based on a Poisson-linked linear model with the cluster growth outcome predicted by cluster size, we calculated the Generalized Akaike Information Criterion (GAIC) for networks at each value of d. Results: GAIC was minimized at d = 0.036, notably larger than values often used in the literature. Common values in the literature fall within peaks of maximum deviance.
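The thresholding step in this abstract, where an edge joins two cases whenever their pairwise distance falls within the cutoff d and clusters are the resulting connected components, can be sketched as follows. This is a minimal illustration and not the authors' R script: the function name `cluster_by_threshold`, the dictionary input format, and the choice of "less than or equal to" for the cutoff comparison are all assumptions made for the example. The GAIC model fit is not reproduced here.

```python
from collections import defaultdict


def cluster_by_threshold(distances, cutoff):
    """Return clusters as connected components of the threshold graph.

    distances: dict mapping (case_a, case_b) -> pairwise genetic distance
               (hypothetical input format, e.g. parsed from a distance table).
    cutoff:    edges are kept when distance <= cutoff (assumed convention).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving; registers unseen nodes.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), dist in distances.items():
        root_a, root_b = find(a), find(b)  # every case becomes a node
        if dist <= cutoff and root_a != root_b:
            parent[root_a] = root_b  # merge the two clusters

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted((sorted(members) for members in clusters.values()),
                  key=lambda members: members[0])


if __name__ == "__main__":
    # Toy distances, invented for illustration only.
    toy = {("A", "B"): 0.010, ("A", "C"): 0.040,
           ("B", "C"): 0.030, ("C", "D"): 0.050}
    # Scan the cutoff from 0 to 0.06 in steps of 0.001, as in the abstract.
    for step in range(0, 61, 12):
        d = step / 1000
        print(d, cluster_by_threshold(toy, d))
```

In a full analysis, each set of clusters produced by the scan would feed a model of cluster growth against cluster size, and the cutoff with the lowest information criterion would be selected.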

    Using Amino Acid Correlation and Community Detection Algorithms to Identify Functional Determinants in Protein Families

    Correlated mutation analysis has a long history of interesting applications, mostly in the detection of contact pairs in protein structures. Based on previous observations that, if properly assessed, amino acid correlation data can also provide insights about functional sub-classes in a protein family, we provide a complete framework devoted to this purpose. An amino acid specific correlation measure is proposed, which can be used to build networks summarizing all correlation and anti-correlation patterns in a protein family. These networks can be submitted to community structure detection algorithms, resulting in subsets of correlated amino acids which can be further assessed by specific parameters and procedures that provide insight into the relationship between different communities, the individual importance of community members and the adherence of a given amino acid sequence to a given community. By applying this framework to three protein families with contrasting characteristics (the Fe/Mn-superoxide dismutases, the peroxidase-catalase family and the C-type lysozyme/α-lactalbumin family), we show how our method and the proposed parameters and procedures are related to biological characteristics observed in these protein families, highlighting their potential use in protein characterization and gene annotation

    Atlantic Cod Piscidin and Its Diversification through Positive Selection

    Piscidins constitute a family of cationic antimicrobial peptides that are thought to play an important role in the innate immune response of teleosts. On the one hand they show a remarkable diversity, which indicates that they are shaped by positive selection, but on the other hand they are ancient and have specific targets, suggesting that they are constrained by purifying selection. Until now piscidins had only been found in fish species from the superorder Acanthopterygii but we have recently identified a piscidin gene in Atlantic cod (Gadus morhua), thus showing that these antimicrobial peptides are not restricted to evolutionarily modern teleosts. Nucleotide diversity was much higher in the regions of the piscidin gene that code for the mature peptide and its pro domain than in the signal peptide. Maximum likelihood analyses with different evolution models revealed that the piscidin gene is under positive selection. Charge or hydrophobicity-changing amino acid substitutions observed in positively selected sites within the mature peptide influence its amphipathic structure and can have a marked effect on its function. This diversification might be associated with adaptation to new habitats or rapidly evolving pathogens

    Adaptive GDDA-BLAST: Fast and Efficient Algorithm for Protein Sequence Embedding

    A major computational challenge in the genomic era is annotating structure and function for the vast quantities of sequence information now available. This problem is illustrated by the fact that most proteins lack comprehensive annotations, even when experimental evidence exists. We previously theorized that embedded-alignment profiles (simply “alignment profiles” hereafter) provide a quantitative method capable of relating the structural and functional properties of proteins, as well as their evolutionary relationships. A key feature of alignment profiles lies in the interoperability of data format (e.g., alignment information, physicochemical information, genomic information, etc.). Indeed, we have demonstrated that the Position Specific Scoring Matrices (PSSMs) are an informative M-dimension that is scored by quantitatively measuring the embedded or unmodified sequence alignments. Moreover, the information obtained from these alignments remains informative even in the “twilight zone” of sequence similarity (<25% identity) [1]–[5]. Although our previous embedding strategy was powerful, it suffered from contaminating alignments (embedded AND unmodified) and high computational costs. Herein, we describe the logic and algorithmic process for a heuristic embedding strategy named “Adaptive GDDA-BLAST.” Adaptive GDDA-BLAST is, on average, up to 19 times faster than our previous method while retaining similar sensitivity. Further, data are provided to demonstrate the benefits of embedded-alignment measurements in terms of detecting structural homology in highly divergent protein sequences and isolating secondary structural elements of transmembrane and ankyrin-repeat domains. Together, these advances allow further exploration of the embedded alignment data space within sufficiently large data sets to eventually induce relevant statistical inferences.
    We show that sequence embedding could serve as one of the vehicles for measurement of low-identity alignments and for incorporation thereof into high-performance PSSM-based alignment profiles.

    A novel codon insert in protease of clade B HIV type 1.

    A novel combination of three codon inserts in the pol coding region of HIV-1 RNA was identified in a highly antiretroviral-experienced study subject with HIV-1 infection. A one-codon insert was observed in the protease region between codons 40 and 41, simultaneously with a two-codon insert in the reverse transcriptase region at codon 69.

    Evolutionary Interactions between N-Linked Glycosylation Sites in the HIV-1 Envelope

    The addition of asparagine (N)-linked polysaccharide chains (i.e., glycans) to the gp120 and gp41 glycoproteins of human immunodeficiency virus type 1 (HIV-1) envelope is not only required for correct protein folding, but also may provide protection against neutralizing antibodies as a “glycan shield.” As a result, strong host-specific selection is frequently associated with codon positions where nonsynonymous substitutions can create or disrupt potential N-linked glycosylation sites (PNGSs). Moreover, empirical data suggest that the individual contribution of PNGSs to the neutralization sensitivity or infectivity of HIV-1 may be critically dependent on the presence or absence of other PNGSs in the envelope sequence. Here we evaluate how glycan–glycan interactions have shaped the evolution of HIV-1 envelope sequences by analyzing the distribution of PNGSs in a large-sequence alignment. Using a “covarion”-type phylogenetic model, we find that the rates at which individual PNGSs are gained or lost vary significantly over time, suggesting that the selective advantage of having a PNGS may depend on the presence or absence of other PNGSs in the sequence. Consequently, we identify specific interactions between PNGSs in the alignment using a new paired-character phylogenetic model of evolution, and a Bayesian graphical model. Despite the fundamental differences between these two methods, several interactions are jointly identified by both. Mapping these interactions onto a structural model of HIV-1 gp120 reveals that negative (exclusive) interactions occur significantly more often between colocalized glycans, while positive (inclusive) interactions are restricted to more distant glycans. Our results imply that the adaptive repertoire of alternative configurations in the HIV-1 glycan shield is limited by functional interactions between the N-linked glycans. 
    This represents a potential vulnerability of rapidly evolving HIV-1 populations that may provide useful glycan-based targets for neutralizing antibodies.
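Analyses of this kind start by locating potential N-linked glycosylation sites (PNGSs) in each sequence. The sketch below uses the commonly cited sequon definition N-X-[S/T], where X is any residue except proline. The abstract does not state which definition the authors used, so this pattern, the function name `find_pngs`, and the requirement that the input be an ungapped amino acid sequence are assumptions for illustration.

```python
import re

# N, then any residue except proline, then S or T.
# The lookahead lets overlapping sequons (e.g. "NNTS") all be reported.
# Assumes an ungapped sequence: a gap character would match [^P].
SEQUON = re.compile(r"(?=N[^P][ST])")


def find_pngs(protein):
    """Return 0-based start positions of N-X-[S/T] sequons (X != P)."""
    return [match.start() for match in SEQUON.finditer(protein.upper())]


if __name__ == "__main__":
    # Invented peptide, for illustration only.
    print(find_pngs("ANNTSQNPTNGS"))
```

Running this over every sequence in an alignment (after removing gaps and mapping positions back to alignment columns) yields the presence or absence pattern of PNGSs that a phylogenetic model of gain and loss would take as input.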

    An Evolutionary-Network Model Reveals Stratified Interactions in the V3 Loop of the HIV-1 Envelope

    The third variable loop (V3) of the human immunodeficiency virus type 1 (HIV-1) envelope is a principal determinant of antibody neutralization and progression to AIDS. Although it is undoubtedly an important target for vaccine research, extensive genetic variation in V3 remains an obstacle to the development of an effective vaccine. Comparative methods that exploit the abundance of sequence data can detect interactions between residues of rapidly evolving proteins such as the HIV-1 envelope, revealing biological constraints on their variability. However, previous studies have relied implicitly on two biologically unrealistic assumptions: (1) that founder effects in the evolutionary history of the sequences can be ignored; and (2) that statistical associations between residues occur exclusively in pairs. We show that comparative methods that neglect the evolutionary history of extant sequences are susceptible to a high rate of false positives (20%–40%). Therefore, we propose a new method to detect interactions that relaxes both of these assumptions. First, we reconstruct the evolutionary history of extant sequences by maximum likelihood, shifting focus from extant sequence variation to the underlying substitution events. Second, we analyze the joint distribution of substitution events among positions in the sequence as a Bayesian graphical model, in which each branch in the phylogeny is a unit of observation. We perform extensive validation of our models using both simulations and a control case of known interactions in HIV-1 protease, and apply this method to detect interactions within V3 from a sample of 1,154 HIV-1 envelope sequences. Our method greatly reduces the number of false positives due to founder effects, while capturing several higher-order interactions among V3 residues. By mapping these interactions to a structural model of the V3 loop, we find that the loop is stratified into distinct evolutionary clusters.
    We extend our model to detect interactions between the V3 and C4 domains of the HIV-1 envelope, and account for the uncertainty in mapping substitutions to the tree with a parametric bootstrap.

    Complete-Proteome Mapping of Human Influenza A Adaptive Mutations: Implications for Human Transmissibility of Zoonotic Strains

    BACKGROUND: There is widespread concern that H5N1 avian influenza A viruses will emerge as a pandemic threat if they become capable of human-to-human (H2H) transmission. Avian strains lack this capability, which suggests that it requires important adaptive mutations. We performed a large-scale comparative analysis of proteins from avian and human strains, to produce a catalogue of mutations associated with H2H transmissibility, and to detect their presence in avian isolates. METHODOLOGY/PRINCIPAL FINDINGS: We constructed a dataset of influenza A protein sequences from 92,343 public database records. Human and avian sequence subsets were compared, using a method based on mutual information, to identify characteristic sites where human isolates present conserved mutations. The resulting catalogue comprises 68 characteristic sites in eight internal proteins. Subtype variability prevented the identification of adaptive mutations in the hemagglutinin and neuraminidase proteins. The high number of sites in the ribonucleoprotein complex suggests interdependence between mutations in multiple proteins. Characteristic sites are often clustered within known functional regions, suggesting their functional roles in cellular processes. By isolating and concatenating characteristic site residues, we defined adaptation signatures, which summarize the adaptive potential of specific isolates. Most adaptive mutations emerged within three decades after the 1918 pandemic, and have remained remarkably stable thereafter. Two lineages with stable internal protein constellations have circulated among humans without reassorting. In contrast, H5N1 avian and swine viruses reassort frequently, causing both gains and losses of adaptive mutations. CONCLUSIONS: Human host adaptation appears to be complex and systemic, involving nearly all influenza proteins.
    Adaptation signatures suggest that the ability of H5N1 strains to infect humans is related to the presence of an unusually high number of adaptive mutations. However, these mutations appear unstable, suggesting low pandemic potential of H5N1 in its current form. In addition, adaptation signatures indicate that the pandemic H1N1/09 strain possesses multiple human-transmissibility mutations, though not an unusually high number with respect to swine strains that infected humans in the past. Adaptation signatures provide a novel tool for identifying zoonotic strains with the potential to infect humans.
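The abstract describes comparing human and avian sequence subsets with "a method based on mutual information" but gives no further detail. As a rough illustration of the underlying quantity, the sketch below computes the mutual information, in bits, between the residue observed at one alignment column and the host label of each sequence. The function name `mutual_information` and the toy data are assumptions, and the authors' actual scoring and significance thresholds may differ.

```python
from collections import Counter
from math import log2


def mutual_information(residues, hosts):
    """Mutual information (bits) between residue and host at one site.

    residues: residue observed at the site, one entry per sequence.
    hosts:    host label (e.g. "human" or "avian") for the same sequences.
    """
    if len(residues) != len(hosts):
        raise ValueError("residues and hosts must have the same length")
    n = len(residues)
    joint = Counter(zip(residues, hosts))
    residue_counts = Counter(residues)
    host_counts = Counter(hosts)
    mi = 0.0
    for (residue, host), count in joint.items():
        # p(r,h) * log2( p(r,h) / (p(r) * p(h)) ), written with raw counts.
        mi += (count / n) * log2(
            count * n / (residue_counts[residue] * host_counts[host]))
    return mi


if __name__ == "__main__":
    hosts = ["human"] * 4 + ["avian"] * 4
    # Invented columns: one perfectly host-specific, one uninformative.
    print(mutual_information("KKKKEEEE", hosts))
    print(mutual_information("KEKEKEKE", hosts))
```

A site where the residue fully separates the two host groups scores the maximum for a balanced two-class problem, while a site whose residues are distributed identically in both groups scores zero.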