Inferring phylogenetic trees under the general Markov model via a minimum spanning tree backbone
Phylogenetic trees are models of the evolutionary relationships among species, with species typically placed at the leaves of trees. We address the following problems regarding the calculation of phylogenetic trees. (1) Leaf-labeled phylogenetic trees may not be appropriate models of evolutionary relationships among rapidly evolving pathogens, which may contain ancestor-descendant pairs. (2) The widely used models of gene evolution unrealistically assume that the base composition of DNA sequences does not evolve. Regarding problem (1), we present a method for inferring generally labeled phylogenetic trees that allow sampled species to be placed at non-leaf nodes of the tree. Regarding problem (2), we present a structural expectation maximization method (SEM-GM) for inferring leaf-labeled phylogenetic trees under the general Markov model (GM), which is the most complex model of DNA substitution that allows the evolution of base composition. To improve the scalability of SEM-GM, we present a minimum spanning tree (MST) framework called MST-backbone. MST-backbone scales linearly with the number of leaves. However, the unrealistic location of the root as inferred on empirical data suggests that the GM model may be overtrained. MST-backbone was inspired by the topological relationship between MSTs and phylogenetic trees that was introduced by Choi et al. (2011). We discovered that the topological relationship does not necessarily hold if there is no unique MST. We propose so-called vertex-order based MSTs (VMSTs) that guarantee a topological relationship with phylogenetic trees.
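As a rough illustration of the MST idea mentioned in the abstract, the following base R sketch builds a minimum spanning tree over pairwise Hamming distances using Prim's algorithm. The toy sequences, the function names `hamming_matrix` and `prim_mst`, and the choice of Hamming distance are assumptions made for this example. The tie-breaking rule shown here (first pair encountered in vertex order) only hints at why vertex order matters when the MST is not unique; it is not the authors' VMST definition.

```r
# Toy aligned sequences of equal length (hypothetical data).
seqs <- c(s1 = "ACGTACGT", s2 = "ACGTACGA", s3 = "ACGAACGA", s4 = "TCGAACGA")

# Pairwise Hamming distances between aligned sequences.
hamming_matrix <- function(seqs) {
  chars <- strsplit(seqs, "")
  n <- length(chars)
  d <- matrix(0, n, n, dimnames = list(names(seqs), names(seqs)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sum(chars[[i]] != chars[[j]])
    }
  }
  d
}

# Prim's algorithm starting from vertex 1. Because the comparison is strict,
# ties are resolved in favour of the first pair encountered while scanning
# tree vertices and then candidate vertices in increasing index order.
prim_mst <- function(d) {
  n <- nrow(d)
  in_tree <- c(TRUE, rep(FALSE, n - 1))
  edges <- matrix(NA_integer_, nrow = n - 1, ncol = 2)
  for (k in seq_len(n - 1)) {
    best <- c(NA_integer_, NA_integer_)
    best_w <- Inf
    for (i in which(in_tree)) {
      for (j in which(!in_tree)) {
        if (d[i, j] < best_w) {
          best_w <- d[i, j]
          best <- c(i, j)
        }
      }
    }
    edges[k, ] <- best
    in_tree[best[2]] <- TRUE
  }
  edges
}

d <- hamming_matrix(seqs)
mst_edges <- prim_mst(d)        # one row per MST edge (vertex indices)
mst_weight <- sum(d[mst_edges]) # total Hamming distance along the MST
```

With a different vertex order and tied distances, the same procedure can return a different tree of equal weight, which is the situation the abstract describes when no unique MST exists.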
A pan-European phylodynamic study of HIV-1 transmission networks
<p>The ECDC <a href="http://www.ecdc.europa.eu/en/publications/Publications/Forms/ECDC_DispForm.aspx?ID=785">reports</a> around 100,000 new infections in Europe and Central Asia in 2010, with large variance in incidence and dominant mode of transmission across countries. This study reconstructed HIV transmission networks and investigated patterns of epidemic growth and spread from a Europe-wide dataset of 30,000 patients.</p>
<p>We find that the reconstructed transmission graph, in which vertices represent patients and edges represent a transmission between them, has an assortativity coefficient of about 0.6, which suggests endogenous growth of the HIV epidemic in Europe. The temporal evolution of the epidemic, estimated from reconstructed phylogenetic trees, indicates episodic growth, which probably reflects the community structure of the contact network.</p>
<p>This poster will be presented at ECCB 2012. </p>
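The abstract does not say which vertex attribute the assortativity coefficient refers to. As a minimal sketch, assuming a categorical attribute such as country, Newman's nominal assortativity can be computed from an edge list in base R as follows. The function name `nominal_assortativity` and the toy data are hypothetical.

```r
# Newman's assortativity coefficient for a categorical vertex attribute.
# edges: two-column matrix of vertex indices (undirected edges);
# types: one label per vertex (assumed here to be a country code).
nominal_assortativity <- function(edges, types) {
  lv <- sort(unique(types))
  # Count each undirected edge in both directions so the mixing matrix is symmetric.
  end_a <- factor(c(types[edges[, 1]], types[edges[, 2]]), levels = lv)
  end_b <- factor(c(types[edges[, 2]], types[edges[, 1]]), levels = lv)
  e <- unclass(table(end_a, end_b)) / (2 * nrow(edges))
  a <- rowSums(e)
  b <- colSums(e)
  (sum(diag(e)) - sum(a * b)) / (1 - sum(a * b))
}

# Toy example: four patients from two countries (hypothetical data).
types <- c("DE", "DE", "IT", "IT")
within_country <- rbind(c(1, 2), c(3, 4))   # every edge joins the same country
between_country <- rbind(c(1, 3), c(2, 4))  # every edge joins different countries

r_within <- nominal_assortativity(within_country, types)
r_between <- nominal_assortativity(between_country, types)
```

Values near 1 mean that edges mostly connect vertices with the same label, and values near -1 mean that they mostly connect vertices with different labels.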
Constructing petridish plots in R
<p>A sample script for constructing petridish-like plots in R</p>
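<p>library(igraph)</p>
<p># create an empty undirected graph</p>
<p>g <- make_empty_graph(n = 0, directed = FALSE)</p>
<p># add 1000 vertices, each with a randomly chosen colour attribute</p>
<p>n_vertices <- 1000</p>
<p>g <- add_vertices(g, nv = n_vertices, vert_colour = sample(c('blue', 'red', 'yellow', 'green', 'purple'), n_vertices, replace = TRUE))</p>
<p># random edge list: consecutive pairs of vertex ids form 1000 edges (igraph vertex ids start at 1)</p>
<p>E <- sample(1:n_vertices, 2000, replace = TRUE)</p>
<p>g <- add_edges(g, E)</p>
<p># optimize layout (Fruchterman-Reingold)</p>
<p>lay <- layout_with_fr(g)</p>
<p># plot graph</p>
<p>plot(g, layout = lay, vertex.label = '', vertex.size = 2, vertex.color = V(g)$vert_colour)</p>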
HIV-1 transmission graph
<p>A graph representing the transmission of HIV among patients from Europe. Vertices represent patients and are coloured by country of origin. Edges indicate transmission between connected patients.</p>
<p>This graph is a result from a phylodynamic study of HIV-1 strains collected as part of a Europe-wide project. The results from this study have been presented as a poster at ECCB 2012.</p>
<p>The graphic was generated using the igraph package in R.</p>
Epistatic Interactions in NS5A of Hepatitis C Virus Suggest Drug Resistance Mechanisms
Hepatitis C virus (HCV) causes a major health burden and can be effectively treated by direct-acting antivirals (DAAs). The non-structural protein 5A (NS5A), which plays a role in viral genome replication, is one of the targets of DAAs. Resistance-associated viruses (RAVs) harbouring NS5A resistance-associated mutations (RAMs) have been described at baseline and after therapy failure. A mutation from glutamine to arginine at position 30 (Q30R) is a characteristic RAM for the HCV sub/genotype (GT) 1a, but arginine corresponds to the wild type in GT-1b; still, GT-1b strains are susceptible to NS5A inhibitors. In this study, we show that GT-1b strains with R30Q often display other specific NS5A substitutions, particularly at positions 24 and 34. We demonstrate that in GT-1b, secondary substitutions usually arise after the initial R30Q development in the phylogeny, and that the chemical properties of the corresponding amino acids serve to restore the positive charge in this region, acting as compensatory mutations. These findings may have implications for the treatment of RAVs.
Density parameter estimation for finding clusters of homologous proteins-tracing actinobacterial pathogenicity lifestyles
Abstract
Motivation: Homology detection is a long-standing challenge in computational biology. To tackle this problem, typically all-versus-all BLAST results are coupled with data partitioning approaches resulting in clusters of putative homologous proteins. One of the main problems, however, has been widely neglected: all clustering tools need a density parameter that adjusts the number and size of the clusters. This parameter is crucial but hard to estimate without gold standard data at hand. Developing a gold standard, however, is a difficult and time consuming task. Having a reliable method for detecting clusters of homologous proteins between a huge set of species would open opportunities for better understanding the genetic repertoire of bacteria with different lifestyles.
Results: Our main contribution is a method for identifying a suitable and robust density parameter for protein homology detection without a given gold standard. Therefore, we study the core genome of 89 actinobacteria. This allows us to incorporate background knowledge, i.e. the assumption that a set of evolutionarily closely related species should share a comparably high number of evolutionarily conserved proteins (emerging from phylum-specific housekeeping genes). We apply our strategy to find genes/proteins that are specific for certain actinobacterial lifestyles, i.e. different types of pathogenicity. The whole study was performed with transitivity clustering, as it only requires a single intuitive density parameter and has been shown to be well applicable for the task of protein sequence clustering. Note, however, that the presented strategy generally does not depend on our clustering method but can easily be adapted to other clustering approaches.
Availability: All results are publicly available at http://transclust.mmci.uni-saarland.de/actino_core/ or as Supplementary Material of this article.
Contact: [email protected]
Supplementary information: Supplementary data are available at Bioinformatics online.
Maximum likelihood pandemic-scale phylogenetics
Phylogenetics has a crucial role in genomic epidemiology. Enabled by unparalleled volumes of genome sequence data generated to study and help contain the COVID-19 pandemic, phylogenetic analyses of SARS-CoV-2 genomes have shed light on the virus's origins, spread, and the emergence and reproductive success of new variants. However, most phylogenetic approaches, including maximum likelihood and Bayesian methods, cannot scale to the size of the datasets from the current pandemic. We present 'MAximum Parsimonious Likelihood Estimation' (MAPLE), an approach for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. MAPLE infers SARS-CoV-2 phylogenies more accurately than existing maximum likelihood approaches while running up to thousands of times faster and requiring at least 100 times less memory on large datasets. This extends the reach of genomic epidemiology, allowing the continued use of accurate phylogenetic, phylogeographic and phylodynamic analyses on datasets of millions of genomes.
geno2pheno[ngs-freq]: a genotypic interpretation system for identifying viral drug resistance using next-generation sequencing data
Identifying resistance to antiretroviral drugs is crucial for ensuring the successful treatment of patients infected with viruses such as human immunodeficiency virus (HIV) or hepatitis C virus (HCV). In contrast to Sanger sequencing, next-generation sequencing (NGS) can detect resistance mutations in minority populations. Thus, genotypic resistance testing based on NGS data can offer novel, treatment-relevant insights. Since existing web services for analyzing resistance in NGS samples are subject to long processing times and follow strictly rule-based approaches, we developed geno2pheno[ngs-freq], a web service for rapidly identifying drug resistance in HIV-1 and HCV samples. By relying on frequency files that provide the read counts of nucleotides or codons along a viral genome, the time-intensive step of processing raw NGS data is eliminated. Once a frequency file has been uploaded, consensus sequences are generated for a set of user-defined prevalence cutoffs, such that the constructed sequences contain only those nucleotides whose codon prevalence exceeds a given cutoff. After locally aligning the sequences to a set of references, resistance is predicted using the well-established approaches of geno2pheno[resistance] and geno2pheno[hcv]. geno2pheno[ngs-freq] can assist clinical decision making by enabling users to explore resistance in viral populations with different abundances and is freely available at http://ngs.geno2pheno.org.
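The cutoff-based consensus step described above can be sketched in base R. This is a simplified illustration, not the geno2pheno[ngs-freq] implementation: it works on per-position nucleotide counts rather than codon prevalence, and the count matrix, the function name `consensus_at_cutoff`, and the fallback to "N" when nothing passes the cutoff are all assumptions made for the example. Positions where several nucleotides exceed the cutoff are written with the standard IUPAC ambiguity codes.

```r
# Hypothetical per-position read counts (rows: genome positions).
counts <- matrix(
  c(98,  2,  0,   0,
     0, 70,  0,  30,
    10,  0, 85,   5,
     0,  0,  0, 100),
  ncol = 4, byrow = TRUE,
  dimnames = list(NULL, c("A", "C", "G", "T"))
)

# IUPAC ambiguity codes keyed by the sorted set of nucleotides present.
iupac <- c(A = "A", C = "C", G = "G", T = "T",
           AC = "M", AG = "R", AT = "W", CG = "S", CT = "Y", GT = "K",
           ACG = "V", ACT = "H", AGT = "D", CGT = "B", ACGT = "N")

# Build one consensus sequence that keeps, at each position, only the
# nucleotides whose relative frequency exceeds the given cutoff.
consensus_at_cutoff <- function(counts, cutoff) {
  freqs <- counts / rowSums(counts)
  codes <- apply(freqs, 1, function(f) {
    kept <- colnames(counts)[f > cutoff]
    if (length(kept) == 0) return("N")  # assumed fallback, not from the source
    iupac[[paste(sort(kept), collapse = "")]]
  })
  paste(codes, collapse = "")
}

# One consensus sequence per user-defined prevalence cutoff.
cutoffs <- c(0.01, 0.15, 0.5)
consensus <- vapply(cutoffs, function(x) consensus_at_cutoff(counts, x), character(1))
names(consensus) <- cutoffs
```

Lower cutoffs retain minority variants and therefore yield more ambiguity codes, while higher cutoffs converge on the majority sequence.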
Geno2pheno([HCV]) - A Web-based Interpretation System to Support Hepatitis C Treatment Decisions in the Era of Direct-Acting Antiviral Agents
The face of hepatitis C virus (HCV) therapy is changing dramatically. Direct-acting antiviral agents (DAAs) specifically targeting HCV proteins have been developed and entered clinical practice in 2011. However, despite high sustained viral response (SVR) rates of more than 90%, a fraction of patients do not eliminate the virus and in these cases treatment failure has been associated with the selection of drug resistance mutations (RAMs). RAMs may be prevalent prior to the start of treatment, or can be selected under therapy, and furthermore they can persist after cessation of treatment. Additionally, certain DAAs have been approved only for distinct HCV genotypes and may even have subtype specificity. Thus, sequence analysis before start of therapy is instrumental for managing DAA-based treatment strategies. We have created the interpretation system geno2pheno([HCV]) (g2p([HCV])) to analyse HCV sequence data with respect to viral subtype and to predict drug resistance. Extensive reviewing and weighting of literature related to HCV drug resistance was performed to create a comprehensive list of drug resistance rules for inhibitors of the HCV protease in non-structural protein 3 (NS3-protease: Boceprevir, Paritaprevir, Simeprevir, Asunaprevir, Grazoprevir and Telaprevir), the NS5A replicase factor (Daclatasvir, Ledipasvir, Elbasvir and Ombitasvir), and the NS5B RNA-dependent RNA polymerase (Dasabuvir and Sofosbuvir). Upon submission of up to eight sequences, g2p([HCV]) aligns the input sequences, identifies the genomic region(s), predicts the HCV geno- and subtypes, and generates for each DAA a drug resistance prediction report. g2p([HCV]) offers easy-to-use and fast subtype and resistance analysis of HCV sequences, is continuously updated and freely accessible under http://hcv.geno2pheno.org/index.php. 
The system was partially validated with respect to the NS3-protease inhibitors Boceprevir, Telaprevir and Simeprevir using data generated with recombinant, phenotypic cell culture assays obtained from patients' virus variants.