Summary of patients included in the brain training, brain validation, and CSF validation HIV env sequence datasets.
<p>Patient annotations and publication references available in <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0049538#pone.0049538.s003" target="_blank">Table S1</a>.</p><p>a: Cells per microliter.</p><p>b: ART, antiretroviral therapy.</p>
Amino acid distributions at individual positions are not correlated with HAD.
<p>A. Amino acid frequencies in the brain dataset plotted as distributions totaling 100% for each class (HAD, non-HAD). The weights of individual sequences are normalized by patient sequencing depth. B. Percentage of sequences of each class (HAD, non-HAD) matching the amino acid requirements of signature 1_04 at each position individually, and for the complete signature. Bars represent only matching sequences and thus do not sum to 100%.</p>
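The depth normalization described for panel A can be illustrated with a minimal sketch. It assumes each sequence is weighted by 1/n, where n is the number of sequences sampled from its patient, so that every patient contributes equally to the class distribution; the caption does not state the exact weighting formula, and the record layout and the function name `weighted_frequencies` are hypothetical.

```python
from collections import Counter, defaultdict


def weighted_frequencies(records, position):
    """Per-class amino acid percentages at one alignment position.

    records: list of (patient_id, diagnosis, aligned_sequence) tuples
             (hypothetical layout, not taken from the paper).
    Assumption: each sequence gets weight 1 / (sequences from its patient).
    """
    # Sequencing depth = number of sequences contributed by each patient.
    depth = Counter(patient for patient, _, _ in records)

    # Accumulate the weighted count of each residue within each class.
    totals = defaultdict(lambda: defaultdict(float))
    for patient, diagnosis, seq in records:
        totals[diagnosis][seq[position]] += 1.0 / depth[patient]

    # Rescale so each class distribution sums to 100%.
    result = {}
    for diagnosis, counts in totals.items():
        class_total = sum(counts.values())
        result[diagnosis] = {
            aa: 100.0 * weight / class_total for aa, weight in counts.items()
        }
    return result


# Toy data: patient P1 was sampled three times, P2 and P3 once each.
example_records = [
    ("P1", "HAD", "A"),
    ("P1", "HAD", "A"),
    ("P1", "HAD", "S"),
    ("P2", "HAD", "S"),
    ("P3", "non-HAD", "A"),
]
example_freqs = weighted_frequencies(example_records, 0)
```

Without the weighting, the heavily sampled patient P1 would dominate the HAD distribution; with it, P1 and P2 each contribute half of the HAD total.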
Analysis pipeline for identification and validation of genetic signatures associated with HAD.
<p>After initial assembly, alignment, and weighting of the sequence dataset, for each amino acid position in each sequence, four numeric factors describing the biochemical properties of the amino acid at that position are added to the alignment. This factor alignment enters the machine-learning phase, where preliminary feature selection is used to select the attributes (amino acid identities or biochemical factors) that best differentiate between classes. Using the PART algorithm, this reduced set of attributes is used to train decision rules describing amino acid signatures correlated with disease outcome. Amino acid positions included in these signatures are removed from the main factor alignment and the process is iterated until no additional discriminatory signatures can be generated. Signatures are then validated by leave-one-out cross-validation, Fisher’s exact test, and assessment in brain and CSF-derived virus from independent cohorts.</p>
A Machine Learning Approach for Identifying Amino Acid Signatures in the HIV <em>Env</em> Gene Predictive of Dementia
<div><p>The identification of nucleotide sequence variations in viral pathogens linked to disease and clinical outcomes is important for developing vaccines and therapies. However, identifying these genetic variations in rapidly evolving pathogens adapting to selection pressures unique to each host presents several challenges. Machine learning tools provide new opportunities to address these challenges. In HIV infection, virus replicating within the brain causes HIV-associated dementia (HAD) and milder forms of neurocognitive impairment in 20–30% of patients with unsuppressed viremia. HIV neurotropism is primarily determined by the viral envelope (<em>env</em>) gene. To identify amino acid signatures in the HIV <em>env</em> gene predictive of HAD, we developed a machine learning pipeline using the PART rule-learning algorithm and C4.5 decision tree inducer to train a classifier on a meta-dataset (n = 860 <em>env</em> sequences from 78 patients: 40 HAD, 38 non-HAD). To increase the flexibility and biological relevance of our analysis, we included 4 numeric factors describing amino acid hydrophobicity, polarity, bulkiness, and charge, in addition to amino acid identities. The classifier had 75% predictive accuracy in leave-one-out cross-validation, and identified 5 signatures associated with HAD diagnosis (p<0.05, Fisher’s exact test). These HAD signatures were found in the majority of brain sequences from 8 of 10 HAD patients from an independent cohort. Additionally, 2 HAD signatures were validated against <em>env</em> sequences from CSF of a second independent cohort. This analysis provides insight into viral genetic determinants associated with HAD, and develops novel methods for applying machine learning tools to analyze the genetics of rapidly evolving pathogens.</p> </div>
Validation of HAD signatures against brain-derived <i>env</i> sequences from an independent cohort.
<p>A total of 75 brain-derived sequences from 10 independent patients (x-axis) are visualized as matching or not matching each HAD signature (y-axis). All patients in the independent cohort were diagnosed with HAD. One sequence has been omitted from patient E21 because phylogenetic mapping from the original publication indicated it might be a blood-derived contaminant. A second sequence in E21, matching no HAD signatures, was of indeterminate compartment of origin, and was retained.</p>
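A per-patient summary of such a match matrix might be computed as in the sketch below. It assumes that a sequence counts as matching when it satisfies at least one HAD signature, and that "majority" means strictly more than half of a patient's sequences; neither definition is spelled out in the caption. The input layout, the patient labels, and the function name `patients_with_majority_match` are hypothetical.

```python
from collections import defaultdict


def patients_with_majority_match(matches):
    """Return patients for whom most sequences match any signature.

    matches: list of (patient_id, set_of_matched_signature_names),
             one entry per sequence (hypothetical layout).
    Assumption: "majority" is strictly more than 50% of sequences.
    """
    total = defaultdict(int)  # sequences per patient
    hit = defaultdict(int)    # sequences matching >= 1 signature
    for patient, signatures in matches:
        total[patient] += 1
        if signatures:
            hit[patient] += 1
    # Integer comparison avoids floating-point issues at exactly 50%.
    return sorted(p for p in total if hit[p] * 2 > total[p])


# Toy data with made-up patient labels and signature names.
example_matches = [
    ("E1", {"1_01"}),
    ("E1", set()),
    ("E1", {"2_03"}),
    ("E2", set()),
    ("E2", {"1_04"}),
]
example_majority = patients_with_majority_match(example_matches)
```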
Amino acid positions identified in each HAD signature.
<p>Amino acid positions are plotted for each HAD signature against a schematic of the HIV C2-V3-C3 region examined. Shannon entropy values of all positions in the alignment are plotted as a bar graph, with colored bars marking positions included in HAD signatures.</p>
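For readers unfamiliar with the measure, per-position Shannon entropy of an alignment can be sketched as follows. This version is unweighted, uses log base 2, and treats a gap character as just another symbol; all three are assumptions, since the caption does not say how the plotted values were computed.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(alignment):
    """Entropy of every column in a list of equal-length aligned sequences."""
    # zip(*alignment) transposes rows (sequences) into columns (positions).
    return [shannon_entropy(column) for column in zip(*alignment)]


# Toy alignment: position 0 is fully conserved, position 1 is split evenly.
example_entropy = alignment_entropy(["AC", "AG", "AC", "AG"])
```

A fully conserved position has entropy 0, and a position split evenly between two residues has entropy of 1 bit, so higher bars correspond to more variable positions.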
Distribution of matching sequences across HAD signatures.
<p>Visualization of sequences (x-axis) matching HAD signatures (y-axis). Colored bars on top of the x-axis indicate HAD (red) or non-HAD (blue) diagnosis of the patient from which the sequence was sampled. Sequences are clustered by their pattern of signature matches.</p>
Proportion of sequences per patient from the brain dataset matching HAD signatures.
<p>For each HAD signature, HAD (red) and non-HAD (blue) patients are plotted according to their total number of sequences (x-axis) and the number of sequences matching the signature (y-axis). Patients with no matching sequences are omitted from the plot for clarity, but are included in statistical calculations. The dashed line indicates slope = 1, at which all sequences in a patient match the signature. Jitter has been added to visualize overlapping points. Text indicates the p-value by Fisher’s exact test and the number of patients from each class with matching sequences.</p>
Statistical validation against patients in the brain HIV env sequence dataset of all HAD and non-HAD signatures generated by the PART algorithm.
<p>The statistical significance of all HAD and non-HAD signatures was determined using Fisher’s exact test to evaluate the distribution of patients in the brain dataset with matching sequences. Diagnosis indicates whether the signature was predictive of HAD or non-HAD. Patient count reflects the total number of patients with sequences spanning the amino acid positions in the relevant signature (e.g. signature 1_01 was tested in 77 patients because 1 patient does not contain sequences spanning positions 304 through 343, which are included in signature 1_01). The numbers of HAD and non-HAD patients from the brain dataset containing sequences matching each signature are given, followed by the p-value of that patient distribution, calculated by Fisher’s exact test.</p><p>* = p-value <0.05.</p>
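The patient-level test described here can be sketched with a small standard-library implementation of Fisher's exact test on a 2x2 table (rows: HAD and non-HAD patients; columns: patients with and without a matching sequence). The two-sided form is an assumption, since the caption does not say which alternative was used, and the counts in the example are invented for illustration, not taken from the table.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    a, b: HAD patients with / without a matching sequence.
    c, d: non-HAD patients with / without a matching sequence.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x matching patients in row 1,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the
    # observed one; the small tolerance guards against rounding ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )


# Invented counts: 12 of 40 HAD and 2 of 38 non-HAD patients match.
example_p = fisher_exact_two_sided(12, 28, 2, 36)
```

Testing at the patient level rather than the sequence level keeps heavily sequenced patients from inflating the apparent significance of a signature.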
Amino acid identity and biochemical factor requirements for HAD signatures.
<p>Amino acid requirements at each position are plotted. For each “position: factor” pair, all amino acids are plotted at their value for that factor. Amino acids observed at that position within the brain-derived dataset are plotted in black, while those not observed are gray. The B-clade consensus amino acid is plotted in large font. The colored bar indicates the range of acceptable values in that signature. Lower range ends are open, indicated by a dotted line (signature 1_01, position 328 excludes Q). Upper range ends are closed, indicated by a solid line (signature 2_03, position 321 includes S).</p>
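The open-lower, closed-upper range convention can be made concrete with a minimal sketch of a signature check. A residue whose factor value sits exactly on the lower bound is rejected, while one exactly on the upper bound is accepted. The data layout, the factor name, and the numeric values below are hypothetical placeholders, not the factor values used in the paper.

```python
def matches_signature(sequence_factors, signature):
    """Check one sequence against one signature.

    sequence_factors: {(position, factor_name): value} for the sequence.
    signature: {(position, factor_name): (lower, upper)} requirements.
    Ranges are half-open: lower < value <= upper.
    Assumption: a sequence lacking a required position does not match.
    """
    for key, (lower, upper) in signature.items():
        value = sequence_factors.get(key)
        if value is None or not (lower < value <= upper):
            return False
    return True


# Hypothetical requirement at position 328 on a made-up "polarity" scale.
example_signature = {(328, "polarity"): (0.5, 1.0)}

on_lower_bound = matches_signature({(328, "polarity"): 0.5}, example_signature)
on_upper_bound = matches_signature({(328, "polarity"): 1.0}, example_signature)
```

Here `on_lower_bound` is false and `on_upper_bound` is true, mirroring how the caption describes a boundary residue being excluded at the dotted end and included at the solid end.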