Who Watches the Watchmen? An Appraisal of Benchmarks for Multiple Sequence Alignment
Multiple sequence alignment (MSA) is a fundamental and ubiquitous technique
in bioinformatics used to infer related residues among biological sequences.
Thus alignment accuracy is crucial to a vast range of analyses, often in ways
difficult to assess in those analyses. To compare the performance of different
aligners and help detect systematic errors in alignments, a number of
benchmarking strategies have been pursued. Here we present an overview of the
main strategies--based on simulation, consistency, protein structure, and
phylogeny--and discuss their different advantages and associated risks. We
outline a set of desirable characteristics for effective benchmarking, and
evaluate each strategy in light of them. We conclude that there is currently no
universally applicable means of benchmarking MSA, and that developers and users
of alignment tools should choose a benchmark according to the
context of application--with a keen awareness of the assumptions underlying
each benchmarking strategy.
High-Performance approaches for Phylogenetic Placement, and its application to species and diversity quantification
In recent years, advances in high-throughput genetic sequencing, together with the continuing exponential growth and availability of computational resources, have led to fundamentally new analytical approaches in biology.
It is now possible to comprehensively sequence the genetic content of entire communities of organisms from individual environmental samples.
Such methods are particularly relevant to microbiology.
Microbiology was previously largely limited to the study of those microbes that could be cultivated in the laboratory (i.e., in vitro), which covers only a small fraction of the diversity found in nature.
In contrast, high-throughput sequencing now enables the direct capture of the genetic sequences of a microbiome as it occurs in its natural environment (i.e., in situ).
A typical goal of microbiome studies is the taxonomic classification of the sequences contained in a sample (query sequences).
Phylogenetic methods are commonly used to determine detailed taxonomic relationships between query sequences and trusted reference sequences that originate from already classified organisms.
Because of the high volume ( to ) of query sequences that high-throughput sequencing can generate from a single microbiome sample, accurate phylogenetic tree reconstruction is no longer computationally feasible.
Moreover, the sequencing technologies in common use today produce comparatively short sequences with limited phylogenetic signal, which makes the inference of phylogenies from these sequences unstable.
Another typical goal of microbiome studies is to quantify the diversity within a sample or between several samples.
Phylogenetic methods are commonly used for this purpose as well.
These methods often require the inference of a phylogenetic tree that comprises either all sequences or a clustered subset of them.
As with taxonomic identification, analyses based on this kind of tree inference can yield inaccurate results and/or be computationally infeasible.
In contrast to comprehensive phylogenetic inference, phylogenetic placement is a method that determines the phylogenetic context of a query sequence within an established reference tree.
This procedure typically treats the reference tree as immutable, i.e., the reference tree is not changed before, during, or after the placement of a sequence.
This allows the phylogenetic placement of a sequence to be carried out in time linear in the size of the reference tree.
Combined with taxonomic information about the reference sequences, phylogenetic placement thus enables the taxonomic identification of a sequence.
In addition, phylogenetic placement permits a variety of further analyses, for example associating the composition of human microbiomes with clinical diagnostic properties.
In this dissertation I present my work on the design, implementation, and improvement of EPA-ng, a high-performance implementation of phylogenetic placement under the maximum-likelihood model.
EPA-ng was designed to scale to billions of query sequences and to run on thousands of cores in shared-memory and distributed-memory systems.
EPA-ng also increases processing speed on single cores by up to -fold compared with its direct competitors.
We recently introduced an additional method for EPA-ng that enables placement in substantially larger reference trees.
To this end we use an active memory management approach that trades reduced memory consumption for longer execution times.
In addition, I present a massively parallel approach to quantifying the diversity of a sample, based on the results of phylogenetic placements.
This software, called SCRAPP, combines current methods for maximum-likelihood-based phylogenetic inference with methods for molecular species delimitation.
The result is a distribution of species counts over the edges of a reference tree for a given sample.
Furthermore, I describe a novel approach to clustering placement results, which allows the user to reduce the computational effort
Shallow Genome Sequencing for Phylogenomics of Mycorrhizal Fungi from Endangered Orchids
Most plant species form symbioses with mycorrhizal fungi, and this relationship is especially important for orchids. Fungi in the genera Tulasnella, Ceratobasidium, and Serendipita are critically important for orchid germination, growth, and development. The goals of this study are to understand the phylogenetic relationships of mycorrhizal fungi and to improve the taxonomic resources for these groups. We identified 32 fungal isolates with the internal transcribed spacer region and used shallow genome sequencing to functionally annotate these isolates. We constructed phylogenetic trees from 408 orthologous nuclear genes for 50 taxa representing 14 genera, 11 families, and five orders in Agaricomycotina. While confirming relationships among the orders Cantharellales, Sebacinales, and Auriculariales, our results suggest novel relationships between families in the Cantharellales. Consistent with previous studies, we found the genera Ceratobasidium and Thanatephorus of Ceratobasidiaceae not to be monophyletic. Within the monophyletic genus Tulasnella, we found strong phylogenetic signals that suggest a potentially new species and a revision of current species boundaries (e.g. Tulasnella calospora); however, it is premature to make taxonomic revisions without further sampling and morphological descriptions. Resolution among the Serendipita isolates collected is low. More sampling is needed from areas around the world before making evolutionarily informed changes in taxonomy. Our study adds value to an important living collection of fungi isolated from endangered orchid species, and also informs future investigations of the evolution of orchid mycorrhizal fungi
Accidental Father-to-Son HIV-1 Transmission During the Seroconversion Period
A 4-year-old child born to an HIV-1 seronegative mother was diagnosed with HIV-1, the main risk factor being transmission from the child's father, who was seroconverting at the time of the child's birth. In the context of a forensic investigation, we aimed to identify the source of the child's infection and the date of the transmission event. Samples were collected from the father and child at two time points about 4 years after the child's birth. Partial segments of three HIV-1 genes (gag, pol, and env) were sequenced, and maximum likelihood (ML) and Bayesian methods were used to determine the direction and estimate the date of transmission. Neutralizing antibodies were determined using a single-cycle assay. Bayesian trees displayed a paraphyletic-monophyletic topology in all three genomic regions, with the father's host label at the root, which is consistent with father-to-son transmission. ML trees found similar topologies in gag and pol and a monophyletic-monophyletic topology in env. Analysis of the time of the most recent common ancestor of each HIV-1 gene population indicated that the child was infected shortly after the father. Consistent with the infection history, both father and son developed broad and potent HIV-specific neutralizing antibody responses. In conclusion, the direction of transmission implicated the father as the source of transmission. Transmission occurred during the seroconversion period, when the father was unaware of the infection, and was likely accidental. This case shows how genetic, phylogenetic, and serological data can contribute to the forensic investigation of HIV transmission.
The effect of primer choice and short read sequences on the outcome of 16S rRNA gene based diversity studies
Different regions of the bacterial 16S rRNA gene evolve at different rates. The scientific outcome of short read sequencing studies therefore varies with the gene region sequenced. We wanted to gain insight into the impact of primer choice on the outcome of short read sequencing efforts. All the unknowns associated with sequencing data, i.e. primer coverage rate, phylogeny, OTU-richness and taxonomic assignment, were therefore assessed in one study for ten well-established universal primers (338f/r, 518f/r, 799f/r, 926f/r and 1062f/r) targeting dispersed regions of the bacterial 16S rRNA gene. All analyses were performed on nearly full-length and in silico generated short read sequence libraries containing 1175 sequences that were carefully chosen to be a representative substitute for the SILVA SSU database. The 518f and 799r primers, targeting the V4 region of the 16S rRNA gene, were found to be particularly suited for short read sequencing studies, while the primer 1062r, targeting V6, seemed to be least reliable. Our results will assist scientists in considering whether the best option for their study is to select the most informative primer, or the primer that excludes interference by host-organelle DNA. The methodology followed can be extrapolated to other primers, allowing their evaluation prior to the experiment
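As a rough illustration of what an in silico short read library and a primer coverage rate could look like, the following minimal Python sketch cuts a fixed-length read downstream of a forward primer and counts how many sequences the primer matches. It is not the study's pipeline: the function names, the exact-match rule (no degenerate bases, no mismatches, forward primers only) and the toy sequences are all assumptions made for this example.

```python
from typing import Dict, Optional


def in_silico_read(sequence: str, forward_primer: str, read_length: int) -> Optional[str]:
    """Return the read_length bases that follow the first exact primer match.

    Assumption for this sketch: the primer must match exactly; real primers
    often contain degenerate bases and tolerate mismatches.
    """
    start = sequence.find(forward_primer)
    if start == -1:
        return None  # primer does not bind this sequence
    begin = start + len(forward_primer)
    return sequence[begin:begin + read_length]


def primer_coverage(sequences: Dict[str, str], forward_primer: str) -> float:
    """Fraction of sequences that contain an exact match of the primer."""
    if not sequences:
        return 0.0
    hits = sum(1 for seq in sequences.values() if forward_primer in seq)
    return hits / len(sequences)


# Hypothetical toy data, not taken from the SILVA SSU database.
toy_sequences = {
    "s1": "AAACCGGTTTACGTACGT",
    "s2": "GGGGGGGG",
}
toy_primer = "CCGG"

reads = {name: in_silico_read(seq, toy_primer, 5) for name, seq in toy_sequences.items()}
coverage = primer_coverage(toy_sequences, toy_primer)
```

Handling reverse primers would additionally require matching against the reverse complement, which this sketch leaves out.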
Inference of single-cell phylogenies from lineage tracing data using Cassiopeia.
The pairing of CRISPR/Cas9-based gene editing with massively parallel single-cell readouts now enables large-scale lineage tracing. However, the rapid growth in complexity of data from these assays has outpaced our ability to accurately infer phylogenetic relationships. First, we introduce Cassiopeia, a suite of scalable maximum parsimony approaches for tree reconstruction. Second, we provide a simulation framework for evaluating algorithms and exploring lineage tracer design principles. Finally, we generate the most complex experimental lineage tracing dataset to date, 34,557 human cells continuously traced over 15 generations, and use it for benchmarking phylogenetic inference approaches. We show that Cassiopeia outperforms traditional methods by several metrics and under a wide variety of parameter regimes, and provide insight into the principles for the design of improved Cas9-enabled recorders. Together, these should broadly enable large-scale mammalian lineage tracing efforts. Cassiopeia and its benchmarking resources are publicly available at www.github.com/YosefLab/Cassiopeia
Performance, Accuracy, and Web Server for Evolutionary Placement of Short Sequence Reads under Maximum Likelihood
We present an evolutionary placement algorithm (EPA) and a Web server for the rapid assignment of sequence fragments (short reads) to edges of a given phylogenetic tree under the maximum-likelihood model. The accuracy of the algorithm is evaluated on several real-world data sets and compared with placement by pair-wise sequence comparison, using edit distances and BLAST. We introduce a slow and accurate as well as a fast and less accurate placement algorithm. For the slow algorithm, we develop additional heuristic techniques that yield almost the same run times as the fast version with only a small loss of accuracy. When those additional heuristics are employed, the run time of the more accurate algorithm is comparable with that of a simple BLAST search for data sets with a high number of short query sequences. Moreover, the accuracy of the EPA is significantly higher, in particular when the sample of taxa in the reference topology is sparse or inadequate. Our algorithm, which has been integrated into RAxML, therefore provides an equally fast but more accurate alternative to BLAST for tree-based inference of the evolutionary origin and composition of short sequence reads. We are also actively developing a Web server that offers a freely available service for computing read placements on trees using the EPA
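The abstract compares the EPA against placement by pairwise sequence comparison using edit distances. A minimal Python sketch of that kind of baseline (not of the EPA itself, and not the authors' implementation) might assign each short read to the reference sequence it matches with the fewest edits. The choice of a "fitting" edit distance, in which the read may align to any substring of the reference at unit cost per edit, as well as all names and toy sequences, are assumptions made for this example.

```python
from typing import Dict, Tuple


def fitting_edit_distance(read: str, reference: str) -> int:
    """Minimum edit distance between `read` and any substring of `reference`.

    Row 0 is all zeros, so the alignment may start anywhere in the reference;
    taking the minimum of the last row lets it end anywhere as well.
    """
    prev = [0] * (len(reference) + 1)
    for i, r in enumerate(read, start=1):
        curr = [i] + [0] * len(reference)
        for j, c in enumerate(reference, start=1):
            curr[j] = min(
                prev[j] + 1,             # read base left unmatched
                curr[j - 1] + 1,         # reference base left unmatched
                prev[j - 1] + (r != c),  # match (0) or substitution (1)
            )
        prev = curr
    return min(prev)


def assign_reads(reads: Dict[str, str],
                 references: Dict[str, str]) -> Dict[str, Tuple[str, int]]:
    """Map each read name to (closest reference name, edit distance)."""
    assignment = {}
    for name, read in reads.items():
        distances = {ref: fitting_edit_distance(read, seq)
                     for ref, seq in references.items()}
        best = min(distances, key=distances.get)  # first minimum wins on ties
        assignment[name] = (best, distances[best])
    return assignment


# Hypothetical toy data for illustration only.
toy_references = {"A": "ACGTACGTAC", "B": "TTGGCCAATT"}
toy_reads = {"r1": "GTACG", "r2": "GGCGAA"}

placements = assign_reads(toy_reads, toy_references)
```

Such a baseline only reports the nearest reference sequence; it does not place the read on an edge of the tree, which is the distinction the abstract draws when it reports higher accuracy for the EPA on sparsely sampled reference topologies.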