97 research outputs found

    Fast NJ-like algorithms to deal with incomplete distance matrices

    RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license, which is similar to the 'Creative Commons Attribution Licence'. In brief, you may copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.
    Abstract. Background: Distance-based phylogeny inference methods first estimate evolutionary distances between every pair of taxa, then build a tree from the resulting distance matrix. These methods are fast and fairly accurate. However, they cope poorly with incomplete distance matrices. Such matrices are frequent in recent multi-gene studies, where two species may share no gene in the analyzed data. The few existing algorithms that infer trees with satisfactory accuracy from incomplete distance matrices have time complexity of O(n^4) or more, where n is the number of taxa, which precludes large-scale studies. Agglomerative distance algorithms (e.g. NJ [1,2]) are much faster, with O(n^3) time complexity, which makes huge datasets and heavy bootstrap analyses tractable. These algorithms proceed in three steps: (a) search for the taxon pair to be agglomerated, (b) estimate the lengths of the two newly created branches, (c) reduce the distance matrix and return to (a) until the tree is fully resolved. But available agglomerative algorithms cannot deal with incomplete matrices. Results: We propose an adaptation to incomplete matrices of three agglomerative algorithms, namely NJ, BIONJ [3] and MVR [4]. Our adaptation generalizes to incomplete matrices the taxon pair selection criterion of NJ (also used by BIONJ and MVR), and combines this generalized criterion with that of ADDTREE [5]. Steps (b) and (c) are also modified, but the O(n^3) time complexity is kept. The performance of these new algorithms is studied with large-scale simulations that mimic multi-gene phylogenomic datasets. Our new algorithms, named NJ*, BIONJ* and MVR*, infer phylogenetic trees that are at least as accurate as those inferred by other available methods, but with much faster running times. MVR* presents the best overall performance. This algorithm accounts for the variance of the pairwise evolutionary distance estimates, and is well suited to multi-gene studies where some distances are accurately estimated from numerous genes, whereas others are poorly estimated (or not estimated at all) because few (or no) sequenced genes are shared by both species. Conclusion: Our distance-based agglomerative algorithms NJ*, BIONJ* and MVR* are fast and accurate, and should be quite useful for large-scale phylogenomic studies. When combined with the SDM method [6] to estimate a distance matrix from multiple genes, they offer a relevant alternative to the usual supertree techniques [7]. Binaries and all simulated data are downloadable from [8].
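The agglomerative steps (a) and (b) named in this abstract can be illustrated for the classic case of a complete distance matrix. The sketch below uses the textbook NJ pair-selection criterion and branch-length formulas; it is not the generalized criterion of NJ*, BIONJ* or MVR*, and the function name `nj_step` is a hypothetical one chosen for this illustration.

```python
def nj_step(dist):
    """One selection step of classic NJ on a complete, symmetric matrix.

    Returns the selected pair (i, j) with i < j, followed by the lengths
    of the two branches that join taxa i and j to their new parent node.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("need at least three taxa")
    row_sums = [sum(row) for row in dist]

    # Step (a): minimise Q(i, j) = (n - 2) * d(i, j) - r(i) - r(j).
    best_pair, best_q = None, float("inf")
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if q < best_q:
                best_pair, best_q = (i, j), q

    # Step (b): classic NJ branch-length estimates for the selected pair.
    i, j = best_pair
    len_i = dist[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    len_j = dist[i][j] - len_i
    return best_pair, len_i, len_j


# Additive distances from the tree ((A:1,B:2):5,(C:3,D:4)).
example = [
    [0, 3, 9, 10],
    [3, 0, 10, 11],
    [9, 10, 0, 7],
    [10, 11, 7, 0],
]
pair, len_a, len_b = nj_step(example)
```

Each call scans all pairs in O(n^2) time; repeating it once per agglomeration gives the O(n^3) total mentioned in the abstract. Ties (here the two cherries score equally) are broken by taking the first pair found, which is a choice of this sketch.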

    BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments

    Abstract. Background: The quality of multiple sequence alignments plays an important role in the accuracy of phylogenetic inference. It has been shown that removing ambiguously aligned regions, as well as other sources of bias such as highly variable (saturated) characters, can improve the overall performance of many phylogenetic reconstruction methods. A current scientific trend is to build phylogenetic trees from a large number of sequence datasets (semi-)automatically extracted from numerous complete genomes. Because these approaches do not allow precise manual curation of each dataset, there is a real need for efficient bioinformatic tools dedicated to this alignment character trimming step. Results: We present new software, named BMGE (Block Mapping and Gathering with Entropy), designed to select regions in a multiple sequence alignment that are suited for phylogenetic inference. For each character, BMGE computes a score closely related to an entropy value. The calculation of these entropy-like scores is weighted with BLOSUM or PAM similarity matrices in order to distinguish between biologically expected and unexpected variability for each aligned character. Sets of contiguous characters with a score above a given threshold are considered unsuited for phylogenetic inference and are removed. Simulation analyses show that the character trimming performed by BMGE produces datasets leading to accurate trees, especially with alignments that include distantly related sequences. BMGE also implements trimming and recoding methods aimed at minimizing phylogeny reconstruction artefacts due to compositional heterogeneity. Conclusions: BMGE is able to perform biologically relevant trimming on a multiple alignment of DNA, codon or amino acid sequences. Java source code and executable are freely available at ftp://ftp.pasteur.fr/pub/GenSoft/projects/BMGE/.
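The idea of scoring each alignment character and discarding those above a threshold can be sketched with plain Shannon entropy. This is a simplified stand-in: it does not apply the BLOSUM/PAM weighting the abstract describes, it filters columns one by one rather than as contiguous blocks, and the names `column_entropy`, `trim_alignment` and the default threshold are assumptions made for illustration only.

```python
import math
from collections import Counter


def column_entropy(column):
    """Unweighted Shannon entropy (in bits) of one column, ignoring gaps."""
    counts = Counter(c for c in column if c != "-")
    total = sum(counts.values())
    if total == 0:
        return 0.0  # gap-only column: treated as zero entropy in this sketch
    return -sum((k / total) * math.log2(k / total) for k in counts.values())


def trim_alignment(seqs, threshold=1.0):
    """Keep only the columns whose entropy does not exceed the threshold."""
    columns = list(zip(*seqs))
    keep = [i for i, col in enumerate(columns)
            if column_entropy(col) <= threshold]
    return ["".join(s[i] for i in keep) for s in seqs]


# Four aligned sequences: two conserved columns, two variable ones.
alignment = ["ACGT", "ACGA", "ACTC", "ACAG"]
trimmed = trim_alignment(alignment, threshold=1.0)
```

With a similarity-weighted score, a column mixing biochemically similar residues would be penalised less than one mixing dissimilar residues; the unweighted version above treats all substitutions alike.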

    topIB, a phylogenetic hallmark gene of Thaumarchaeota encodes a functional eukaryote-like topoisomerase IB

    Type IB DNA topoisomerases can eliminate torsional stresses produced during replication and transcription. These enzymes are found in all eukaryotes, and a short version is present in some bacteria and viruses. Among prokaryotes, the long eukaryotic version is only observed in archaea of the phylum Thaumarchaeota. However, the activities and the roles of these topoisomerases have remained an open question. Here, we demonstrate that all available thaumarchaeal genomes contain a topoisomerase IB gene that defines a monophyletic group closely related to the eukaryotic enzymes. We show that the topIB gene is expressed in the model thaumarchaeon Nitrososphaera viennensis and we purified the recombinant enzyme from the uncultivated thaumarchaeon Candidatus Caldiarchaeum subterraneum. This enzyme is active in vitro at high temperature, making it the first thermophilic topoisomerase IB characterized so far. We have compared this archaeal type IB enzyme to its human mitochondrial and nuclear counterparts. The archaeal enzyme relaxes both negatively and positively supercoiled DNA, like the eukaryotic enzymes. However, its pattern of DNA cleavage specificity is different and it is resistant to camptothecins (CPTs) and to the non-CPT Top1 inhibitors LMP744 and lamellarin D. This newly described thermostable topoisomerase IB should be a promising new model for evolutionary, mechanistic and structural studies.

    The speciation and hybridization history of the genus Salmonella.

    Bacteria and archaea make up most of natural diversity, but the mechanisms that underlie the origin and maintenance of prokaryotic species are poorly understood. We investigated the speciation history of the genus Salmonella, an ecologically diverse bacterial lineage, within which S. enterica subsp. enterica is responsible for important human food-borne infections. We performed a survey of diversity across a large reference collection using multilocus sequence typing, followed by genome sequencing of distinct lineages. We identified 11 distinct phylogroups, 3 of which were previously undescribed. Strains assigned to S. enterica subsp. salamae are polyphyletic, with two distinct lineages that we designate Salamae A and B. Strains of the subspecies houtenae are subdivided into two groups, Houtenae A and B, and are both related to Selander's group VII. A phylogroup we designate VIII was previously unknown. A simple binary fission model of speciation cannot explain observed patterns of sequence diversity. In the recent past, there have been large-scale hybridization events involving an unsampled ancestral lineage and three distantly related lineages of the genus that have given rise to Houtenae A, Houtenae B and VII. We found no evidence for ongoing hybridization in the other eight lineages, but detected subtler signals of ancient recombination events. We are unable to fully resolve the speciation history of the genus, which might have involved additional speciation-by-hybridization or multi-way speciation events. Our results imply that traditional models of speciation by binary fission and divergence are not sufficient to account for Salmonella evolution

    Ongoing diphtheria outbreak in Yemen: a cross-sectional and genomic epidemiology study.

    BACKGROUND: An outbreak of diphtheria, declared in Yemen in October, 2017, is ongoing. We did a cross-sectional study to investigate the epidemiological, clinical, and microbiological features of the outbreak. METHODS: Probable cases of diphtheria that were defined clinically and recorded through a weekly electronic diseases early warning system (from 2017, week 22, to 2020, week 17) were used to identify trends of the outbreak (we divided the epidemic into three time periods: May 29, 2017, to June 10, 2018; June 11, 2018, to June 3, 2019; and June 4, 2019, to April 26, 2020). We used the line list of diphtheria reports for governorate-level descriptions. Vaccination coverage was estimated using the 2017 and 2018 annual reports by the national Expanded Programme on Immunization. To confirm cases biologically, Corynebacterium diphtheriae was isolated and identified from throat swabs using standard microbiological culture and identification procedures. We assessed differences in the temporal and geographical distributions of cases, including between different age groups. For in-depth microbiological analysis, tox gene and species-specific rpoB real-time PCR, Illumina genomic sequencing, antimicrobial susceptibility analysis (disk diffusion, E-test), and the Elek diphtheria toxin production test were done on confirmed cases. We used genomic data for phylogenetic analyses and to estimate the nucleotide substitution rate. FINDINGS: The Yemen diphtheria outbreak affected almost all governorates (provinces), with 5701 probable cases and 330 deaths recorded up to April 26, 2020. We collected clinical data for 888 probable cases with throat swab samples referred for biological confirmation, and genomic data for 42 positive cases, corresponding to 43 isolates (two isolates from one culture were included due to distinct colony morphologies). The median age of patients was 12 years (range 0·2-80). 
The proportion of cases in children aged 0-4 years was reduced during the second time period, after a vaccination campaign, compared with the first period (19% [95% CI 18-21] in the first period vs 14% [12-15] in the second period, p<0·0001). Among 43 tested isolates, 39 (91%) produced the diphtheria toxin and two had low level (0·25 mg/L) antimicrobial resistance to penicillin. We identified six C diphtheriae phylogenetic sublineages, four of which are genetically related to isolates from Saudi Arabia, Eritrea, and Somalia. Inter-sublineage genomic variations in genes associated with antimicrobial resistance, iron acquisition, and adhesion were observed. The predominant sublineage (30 [70%] of 43 isolates) was resistant to trimethoprim and was associated with unique genomic features, more frequent neck swelling (p=0·0029) and a younger age of patients (p=0·060) compared with the other sublineages. Its evolutionary rate was estimated at 1·67 × 10⁻⁶ substitutions per site per year, placing its most recent common ancestor in 2015, and indicating silent circulation of C diphtheriae in Yemen before the outbreak was declared. INTERPRETATION: In the Yemen outbreak, C diphtheriae shows high phylogenetic, genomic, and phenotypic variation. Laboratory capacity and real-time microbiological monitoring of diphtheria outbreaks need to be scaled up to inform case management and transmission control of diphtheria. Catch-up vaccination might have provided some protection to the targeted population (children aged 0-4 years). FUNDING: National Centre of the Public Health Laboratories (Yemen), Institut Pasteur, and the French Government Investissement d'Avenir Programme. TRANSLATION: For the Arabic translation of the abstract see Supplementary Materials section.

    PTPA variants and impaired PP2A activity in early-onset parkinsonism with intellectual disability

    The protein phosphatase 2A complex (PP2A), the major Ser/Thr phosphatase in the brain, is involved in a number of signalling pathways and functions, including the regulation of crucial proteins for neurodegeneration, such as alpha-synuclein, tau and LRRK2. Here, we report the identification of variants in the PTPA/PPP2R4 gene, encoding a major PP2A activator, in two families with early-onset parkinsonism and intellectual disability. We carried out clinical studies and genetic analyses, including genome-wide linkage analysis, whole-exome sequencing, and Sanger sequencing of candidate variants. We next performed functional studies on the disease-associated variants in cultured cells and knock-down of ptpa in Drosophila melanogaster. We first identified a homozygous PTPA variant, c.893T>G (p.Met298Arg), in patients from a South African family with early-onset parkinsonism and intellectual disability. Screening of a large series of additional families yielded a second homozygous variant, c.512C>A (p.Ala171Asp), in a Libyan family with a similar phenotype. Both variants co-segregate with disease in the respective families. The affected subjects display juvenile-onset parkinsonism and intellectual disability. The motor symptoms were responsive to treatment with levodopa and deep brain stimulation of the subthalamic nucleus. In overexpression studies, both the PTPA p.Ala171Asp and p.Met298Arg variants were associated with decreased PTPA RNA stability and decreased PTPA protein levels; the p.Ala171Asp variant additionally displayed decreased PTPA protein stability. Crucially, expression of both variants was associated with decreased PP2A complex levels and impaired PP2A phosphatase activation. PTPA orthologue knock-down in Drosophila neurons induced a significant impairment of locomotion in the climbing test. This defect was age-dependent and fully reversed by L-DOPA treatment. 
We conclude that bi-allelic missense PTPA variants associated with impaired activation of the PP2A phosphatase cause autosomal recessive early-onset parkinsonism with intellectual disability. Our findings might also provide new insights for understanding the role of the PP2A complex in the pathogenesis of more common forms of neurodegeneration.

    A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies

    This paper describes a novel alignment-free distance-based procedure for inferring phylogenetic trees from genome contig sequences using publicly available bioinformatics tools. For each pair of genomes, a dissimilarity measure is first computed and next transformed to obtain an estimation of the number of substitution events that have occurred during their evolution. These pairwise evolutionary distances are then used to infer a phylogenetic tree and assess a confidence support for each internal branch. Analyses of both simulated and real genome datasets show that this bioinformatics procedure allows accurate phylogenetic trees to be reconstructed with fast running times, especially when launched on multiple threads. Implemented in a publicly available script, named JolyTree, this procedure is a useful approach for quickly inferring species trees without the burden and potential biases of multiple sequence alignments.
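The step of transforming a raw dissimilarity into an estimated number of substitutions per site can be illustrated with a textbook correction. The sketch below uses the equal-input form d = -b ln(1 - p/b), which reduces to the Jukes-Cantor formula for b = 3/4; the abstract does not state which transformation JolyTree applies, so this is an assumed example, and the function name `corrected_distance` is hypothetical.

```python
import math


def corrected_distance(p, b=0.75):
    """Turn an uncorrected dissimilarity p into substitutions per site.

    Uses d = -b * ln(1 - p / b). With b = 3/4 this is the Jukes-Cantor
    correction. Returns infinity when p >= b (saturation), which is a
    convention of this sketch.
    """
    if p < 0:
        raise ValueError("dissimilarity must be non-negative")
    if p >= b:
        return math.inf
    return -b * math.log(1.0 - p / b)


# Example: 10% observed differences correspond to slightly more
# than 0.1 substitutions per site once multiple hits are accounted for.
d = corrected_distance(0.10)
```

The corrected value always exceeds the raw dissimilarity, and the gap widens as p approaches saturation, which is why uncorrected distances underestimate long branches.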

    Méthodes de distance pour l'inférence phylogénomique [Distance methods for phylogenomic inference]

    Phylogenomic inference seeks to combine the evolutionary signal carried by a set of genes in order to build a single phylogenetic tree. It can be divided into three broad methodological families: low-level combination, which relies on concatenating the genes; high-level combination, which considers the set of trees inferred from each gene; and medium-level combination, which encodes the individual phylogenetic signals and then combines these encodings. A tree inference method is then applied to the result of the combination. This thesis develops new phylogenomic inference scenarios, mainly based on estimating evolutionary distances between every pair of taxa. It proposes a new medium-level combination method, named SDM, which takes the distance matrices estimated from each gene and combines them into a single distance supermatrix. Because this supermatrix may contain missing distances, the thesis also describes new algorithms, named NJ*, UNJ*, BioNJ* and MVR*, that very quickly infer a tree from a complete or incomplete distance matrix. Numerous simulations showed the good performance of these new distance methods. Although initially developed for medium-level combination, they significantly improve the results of some standard low-level combination approaches, and offer an efficient alternative to MRP, the most widely used high-level combination technique, in terms of both reliability and speed. As phylogenomic datasets keep growing in size, the methods developed in this thesis are thus tools of choice for building the Tree of Life.

    On the transformation of MinHash-based uncorrected distances into proper evolutionary distances for phylogenetic inference

    Recently developed MinHash-based techniques have proven successful at quickly estimating the level of similarity between large nucleotide sequences. This article discusses their practical use and limitations for approximating uncorrected distances between genomes, and for transforming these pairwise dissimilarities into proper evolutionary distances. It is notably shown that complex distance measures can be easily approximated using simple transformation formulae based on a few parameters. MinHash-based techniques can therefore be very useful for implementing fast yet accurate alignment-free phylogenetic reconstruction procedures from large sets of genomes. This last point is assessed with a simulation study using a dedicated bioinformatics tool.
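How a k-mer similarity becomes a pairwise distance can be sketched with the distance formula used by the Mash tool, D = -(1/k) ln(2j / (1 + j)), where j is the Jaccard index of the two k-mer sets. For brevity the sketch computes the exact Jaccard index instead of a MinHash estimate from sketches, and the helper names `kmers`, `jaccard` and `mash_distance` are assumptions for this illustration, not the tool's API. The further transformation formulae discussed in the article are not reproduced here.

```python
import math


def kmers(seq, k):
    """Set of all k-mers occurring in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard index of two sets (MinHash would estimate this)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def mash_distance(j, k):
    """Mash-style distance from a Jaccard index j and k-mer size k.

    Returns infinity when j == 0; that convention is a choice of
    this sketch.
    """
    if j <= 0:
        return math.inf
    return -math.log(2.0 * j / (1.0 + j)) / k


# Two short toy sequences differing at one position.
s1 = "ACGTACGTGACCTGA"
s2 = "ACGTACGAGACCTGA"
j = jaccard(kmers(s1, 4), kmers(s2, 4))
dist = mash_distance(j, 4)
```

A single substitution alters up to k overlapping k-mers, so the Jaccard index drops quickly with divergence; dividing the log term by k rescales it to a per-site quantity.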