A visualization of RNA virus phylogenies in the tree shape kernel space (, ) using t-distributed stochastic neighbor embedding (t-SNE).
<p>The t-SNE algorithm attempts to find the optimal map of high-dimensional data into a low-dimensional space while preserving the distances among points as much as possible. Thus, the distance between a pair of viruses or virus clades (labelled by the same abbreviations as <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0078122#pone-0078122-g004" target="_blank">Figure 4</a>) is approximately proportional to their mean kernel distance. Groups of virus clades of particular interest are highlighted with the corresponding colours: HIV, red; HCV, yellow; dengue (DEN), green; IAV-H3, IAV-H1, and IBV, blue.</p>
Mapping the Shapes of Phylogenetic Trees from Human and Zoonotic RNA Viruses
<div><p>A phylogeny is a tree-based model of common ancestry that is an indispensable tool for studying biological variation. Phylogenies play a special role in the study of rapidly evolving populations such as viruses, where the proliferation of lineages is constantly being shaped by the mode of virus transmission, by adaptation to immune systems, and by patterns of human migration and contact. These processes may leave an imprint on the shapes of virus phylogenies that can be extracted for comparative study; however, tree shapes are intrinsically difficult to quantify. Here we present a comprehensive study of phylogenies reconstructed from 38 different RNA viruses from 12 taxonomic families that are associated with human pathologies. To accomplish this, we have developed a new procedure for studying phylogenetic tree shapes based on the ‘kernel trick’, a technique that maps complex objects into a statistically convenient space. We show that our kernel method outperforms nine different tree balance statistics at correctly classifying phylogenies that were simulated under different evolutionary scenarios. Using the kernel method, we observe patterns in the distribution of RNA virus phylogenies in this space that reflect modes of transmission and pathogenesis. For example, viruses that can establish persistent chronic infections (such as HIV and hepatitis C virus) form a distinct cluster. Although the visibly ‘star-like’ shape characteristic of trees from these viruses has been well-documented, we show that established methods for quantifying tree shape fail to distinguish these trees from those of other viruses. The kernel approach presented here potentially represents an important new tool for characterizing the evolution and epidemiology of RNA viruses.</p></div>
Distribution of mean normalized Colless’ indices.
<p>Each label represents the mean index of a virus or virus clade. The vertical axis is used to elucidate the clustering of points by forcing overlapping labels (phylogenies with similar indices) to ‘pile up’ like a histogram. A higher Colless’ index corresponds to a less ‘balanced’ tree in which branching events tend to occur along the same lineage. A conventional histogram is displayed in the background. Labels are defined as follows: AstV = <i>astrovirus</i>; CCHF = <i>Crimean-Congo hemorrhagic fever virus</i>; ChikV = <i>chikungunya virus</i>; CA24v = <i>coxsackievirus A24</i>; DEN = <i>dengue virus</i>; E30 = <i>echovirus 30</i>; EMCV = <i>encephalomyocarditis virus</i>; EV71 = <i>enterovirus 71</i>; GBVC = <i>GB virus C</i>; HTNV = <i>Hantaan virus</i>; H[A-E]V = <i>hepatitis [A-E] virus</i>; HIV = <i>human immunodeficiency virus type 1</i>; I[A-C]V = <i>influenza [A-C] virus</i>; JEV = <i>Japanese encephalitis virus</i>; MeV = <i>measles virus</i>; MuV = <i>mumps virus</i>; MVEV = <i>Murray Valley encephalitis virus</i>; NV = <i>Norwalk virus</i>; OROV = <i>Oropouche virus</i>; hPIV-1 = <i>human parainfluenza virus</i>; PV = <i>poliovirus</i>; Rab = <i>rabies virus</i>; Rot = <i>human rotavirus</i>; RhiV = <i>human rhinovirus</i>; RSV = <i>human respiratory syncytial virus</i>; Rub = <i>rubella virus</i>; RVF = <i>Rift Valley fever virus</i>; SapV = <i>sapovirus</i>; SeoV = <i>Seoul virus</i>; TBEV = <i>tick-borne encephalitis virus</i>; WNV = <i>West Nile virus</i>; YFV = <i>yellow fever virus</i>.</p>
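To make the balance measure concrete, here is a minimal sketch of Colless' index for a strictly binary tree. The nested-tuple tree encoding and the function names are assumptions made for illustration. The caption does not say how the index was normalized, so dividing by the maximum value for a tree with the same number of tips is also an assumption.

```python
def colless(tree):
    """Return (number of tips, unnormalized Colless index).

    A tree is a nested 2-tuple (left, right); anything that is not a
    tuple is treated as a tip. This encoding is an illustrative assumption.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, c_left = colless(left)
    n_right, c_right = colless(right)
    # Each internal node adds the absolute difference in tip counts
    # between its two subtrees.
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


def normalized_colless(tree):
    """Scale the index by its maximum, (n - 1)(n - 2) / 2, for n tips.

    Assumed normalization; the maximum is reached by a fully
    ladder-like ('caterpillar') tree.
    """
    n, c = colless(tree)
    if n < 3:
        return 0.0
    return c / ((n - 1) * (n - 2) / 2)


balanced = (("a", "b"), ("c", "d"))
caterpillar = ((("a", "b"), "c"), "d")
print(normalized_colless(balanced), normalized_colless(caterpillar))
```

The fully balanced four-tip tree scores 0 and the caterpillar tree scores 1, matching the statement that a higher index corresponds to a less balanced tree.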
Kernel-assisted comparison of two tree shapes.
<p>For two trees, every pair of nodes (one node from each tree) is evaluated, so the number of pairs is the product of the two node counts. (A) Starting from a given pair of nodes (indicated in the figure by circles with double outlines), the algorithm finds the largest common subset tree rooted at these nodes. First, we find that for both nodes, neither of the branches terminates at a ‘leaf node’ (marked with ‘‘). This match contributes a relatively small amount to our kernel score, not only because the matching subset trees (highlighted in thick blue lines) comprise only one node each, but also because their discordant branch lengths lead to a substantial penalty. (B) Next, we descend the left branch in both trees. The current nodes (open circles) in both trees spawn one leaf node and one internal node; therefore, the subset trees continue to match. In addition, their branch lengths are similar, so their contribution to the cumulative kernel score is given greater weight. (C) Finally, we descend the right branch in both trees and find that the subset trees no longer match beyond this point. We also proceed down the right branch of the reference nodes and find no match, so our traversal of the two trees from these nodes is complete and we restart our search at the next pair of nodes.</p>
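The traversal described in this caption can be sketched as a recursive subset-tree comparison. This is a simplified illustration, not the authors' kernel. The tree encoding, the Gaussian branch-length penalty, and the parameter names `decay` and `sigma` are all assumptions. Leaf nodes are skipped in the outer loop because, in this simplification, a pair involving a leaf contributes nothing.

```python
import math

LEAF = ()  # assumed encoding: a leaf is a node with no children


def internal_nodes(node):
    """Yield every internal node; a node is a tuple of (branch_length, child)."""
    if node != LEAF:
        yield node
        for _, child in node:
            yield from internal_nodes(child)


def production(node):
    """Pattern of children: True where a child is a leaf, False otherwise."""
    return tuple(child == LEAF for _, child in node)


def delta(n1, n2, decay, sigma):
    """Score the common subset tree rooted at the node pair (n1, n2)."""
    if n1 == LEAF or n2 == LEAF or production(n1) != production(n2):
        return 0.0  # the subset trees stop matching here
    # Discordant branch lengths shrink the contribution (assumed Gaussian form).
    sq_diff = sum((l1 - l2) ** 2 for (l1, _), (l2, _) in zip(n1, n2))
    score = decay * math.exp(-sq_diff / (2 * sigma ** 2))
    # Descend into corresponding children and extend the match where possible.
    for (_, c1), (_, c2) in zip(n1, n2):
        score *= 1.0 + delta(c1, c2, decay, sigma)
    return score


def tree_kernel(t1, t2, decay=0.5, sigma=1.0):
    """Sum the match score over every pair of internal nodes."""
    return sum(delta(a, b, decay, sigma)
               for a in internal_nodes(t1) for b in internal_nodes(t2))


cherry = ((1.0, LEAF), (1.0, LEAF))
tree_a = ((1.0, LEAF), (1.0, cherry))
tree_b = ((3.0, LEAF), (1.0, cherry))  # same shape, one discordant branch
print(tree_kernel(tree_a, tree_a), tree_kernel(tree_a, tree_b))
```

Comparing `tree_a` with itself gives a higher score than comparing it with `tree_b`, which has the same topology but a discordant branch length at the root.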
Diversity in phylogenetic tree shapes for animal RNA viruses.
<p>These phylogenies were generated from samples of genetic sequences from HIV-1 subtype B (HIV1-B), dengue virus serotype 1II (DEN-1II), influenza A virus serotype H3N2 (IAV-H3), and coxsackievirus A24 variant (CA24v).</p>
Classification of simulated phylogenies using nine balance statistics and the kernel function.
<p>We simulated the growth of two sets of 100 phylogenies, each relating 100 taxa, under different scenarios in which the rate of speciation (branching) itself evolved at different rates. Greater variation in speciation rates tended to produce more imbalanced trees. Nine different balance statistics, including eight from <a href="http://www.plosone.org/article/info:doi/10.1371/journal.pone.0078122#pone.0078122-Agapow1" target="_blank">[12]</a>, were computed for all phylogenies: Colless’ index, Sackin’s index, the mean and variance in path lengths from tips to the root, Shao and Sokal’s <i>B</i><sub>1</sub> and <i>B</i><sub>2</sub> statistics, and the imbalance value () for the sum, total mean, and the mean of the earliest 10 internal nodes of the tree. This plot illustrates the trade-off between sensitivity and specificity when classifying phylogenies by applying a cutoff value to each of these balance statistics. A single point (star) indicates the sensitivity and specificity attained by applying the phylogenetic kernel function (with and ) to train a support vector machine (SVM) on a random subset (50%) of the phylogenies, and classifying the remaining half.</p>
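The cutoff-based classification behind each curve can be sketched as follows. The function names are hypothetical and the values are made-up toy numbers, not results from the study. The sketch assumes a phylogeny is assigned to the more variable scenario when its statistic exceeds the cutoff.

```python
def sensitivity_specificity(positives, negatives, cutoff):
    """Classify a value as positive when it exceeds the cutoff.

    positives: statistic values for trees from the scenario treated as 'positive'
    negatives: statistic values for trees from the other scenario
    """
    true_pos = sum(v > cutoff for v in positives)
    true_neg = sum(v <= cutoff for v in negatives)
    return true_pos / len(positives), true_neg / len(negatives)


def tradeoff_curve(positives, negatives):
    """Sweep the cutoff over every observed value to trace the trade-off."""
    cutoffs = sorted(set(positives) | set(negatives))
    return [(c, *sensitivity_specificity(positives, negatives, c)) for c in cutoffs]


# Toy balance-statistic values, invented purely for illustration.
imbalanced = [0.6, 0.8, 0.9, 0.4]
balanced = [0.1, 0.3, 0.5, 0.2]
print(sensitivity_specificity(imbalanced, balanced, 0.45))
for cutoff, sens, spec in tradeoff_curve(imbalanced, balanced):
    print(cutoff, sens, spec)
```

Raising the cutoff trades sensitivity for specificity, which is the relationship the plot traces for each statistic. The SVM result appears as a single point because a trained classifier yields one sensitivity and specificity pair and not a sweep over cutoffs.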
