13 research outputs found
Split Analysis Methods and Parametric Bootstrapping in Molecular Phylogenetics : Taking a closer look at model adequacy
Even though the size of datasets in molecular analyses has increased rapidly in recent years, undetected systematic errors as well as unsolved problems concerning the evaluation of data quality and adequate substitution model selection still persist. This not only hampers the correct analysis of these datasets but also leads to undetectable effects in phylogenetic tree reconstruction. Model-based tree reconstruction methods like maximum likelihood estimation and Bayesian inference have become the methods of choice for reconstruction of phylogenetic trees. Although maximum likelihood methods are known to be consistent if all necessary conditions are met, their performance depends strongly on the quality of the multiple sequence alignment and the ability of the chosen evolutionary model to reflect the underlying historical processes. This thesis addresses the assessment of model adequacy of estimated evolutionary models to multiple sequence alignments in the light of parametric bootstrapping and aims to find new methods for detection of model misspecifications with the help of split analyses. The second chapter focuses on the influence of the number of gamma rate categories used in modelling among-site rate variation when trying to assess model adequacy using an absolute goodness-of-fit test. The analyses of simulated alignments show that, for various tree shapes and branch length setups, the Goldman-Cox test rejects models approximated by only four discrete gamma rate categories if the data were simulated with a continuous gamma distribution. Increasing the number of discrete rate categories leads to an acceptance of model adequacy for stationary datasets and a correct detection of non-stationarity and inhomogeneity in simulated data. The results illustrate that the application of the proposed Goldman-Cox test to evaluate model adequacy might be too strict with empirical data, in particular for large phylogenomic datasets.
Approaches such as the Goldman-Cox test evaluate the absolute fit of data and model but do not deliver a deeper insight into the structure of the misfit. The third chapter presents the visualisation of overrepresented splits within splits graphs, which provides a good tool for gaining an overview of possible patterns and contradictory signal or noise within datasets. The analysis of these split residuals, obtained by comparison to parametric bootstrap datasets based on the estimated models, can help to gain a deeper insight into model adequacy. Highly overrepresented splits can give hints as to whether heterotachy or non-symmetric substitution processes apply. The fourth chapter aims to define a new split weighting scheme by formalising aspects like 'contrast of character states' or 'character state homogeneity' within split subsets. Splits which are detected by the proposed SAMS (Splits Analysis MethodS) algorithm are re-evaluated for a more objective and formal split weighting. A comparison of the published and the new approach showed that the developed weighting scheme delivers reasonable results but needs further improvement. The development of a new GUI offers a much more capable tool to perform a split analysis and visualise the results. The shape of a visualised split spectrum can indicate whether a dataset delivers a clear split signal or whether a lot of noise is present.
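The absolute goodness-of-fit test discussed in this abstract can be illustrated with a minimal sketch. It assumes the usual formulation of the Goldman-Cox statistic as the difference between the unconstrained (multinomial) log-likelihood of the site patterns and the log-likelihood under the fitted model, with a p-value taken from statistics computed on parametric bootstrap replicates. The function names are hypothetical, and the model log-likelihoods are assumed to come from an external tree inference program; nothing here reproduces the thesis's own implementation.

```python
import math
from collections import Counter


def unconstrained_log_likelihood(alignment):
    """Multinomial log-likelihood: sum over site patterns of n_i * ln(n_i / N).

    `alignment` is a list of equal-length sequences; each column is one site.
    """
    n_sites = len(alignment[0])
    # zip(*alignment) yields one tuple of character states per column.
    pattern_counts = Counter(zip(*alignment))
    return sum(n * math.log(n / n_sites) for n in pattern_counts.values())


def goldman_cox_statistic(alignment, model_log_likelihood):
    """Test statistic: unconstrained lnL minus lnL under the fitted model."""
    return unconstrained_log_likelihood(alignment) - model_log_likelihood


def bootstrap_p_value(observed, simulated):
    """Fraction of bootstrap statistics at least as large as the observed one."""
    exceeding = sum(1 for value in simulated if value >= observed)
    return exceeding / len(simulated)
```

In use, the statistic would be computed once for the empirical alignment and once for each alignment simulated under the estimated model; a small p-value suggests that the model fits the empirical data worse than it fits data generated by the model itself.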
The ASV registry: a place for ASVs to be
Despite the effectiveness of DNA metabarcoding for gaining insights into biodiversity and environmental species composition, a centralized management and storage option including easy accessibility of already published data is lacking. Since most data is published as supplementary material or in private repositories, DNA metabarcoding has a huge untapped potential to be used for analysis across multiple taxa, sample locations or multiple research projects. We developed a platform to register, manage and identify amplicon sequence variants (ASVs) or zero-radius OTUs (ZOTUs), respectively, against several barcode reference datasets. Moreover, ASV tables can be uploaded, managed, versioned, and published with DOIs, thus contributing to the full research Data Life Cycle.
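The abstract does not say how the platform identifies a registered ASV. As a purely illustrative sketch, one common way to make sequence variants comparable across projects is to derive the identifier from the sequence itself, so that the same sequence always maps to the same entry. The class `AsvRegistry`, the function `asv_id`, and the choice of SHA-256 are assumptions for illustration, not the registry's actual design.

```python
import hashlib


def normalize(sequence):
    """Strip whitespace and upper-case the sequence before hashing."""
    return "".join(sequence.split()).upper()


def asv_id(sequence):
    """Hypothetical content-derived identifier: SHA-256 of the normalized sequence."""
    return hashlib.sha256(normalize(sequence).encode("ascii")).hexdigest()


class AsvRegistry:
    """In-memory stand-in for a registry keyed by content-derived identifiers."""

    def __init__(self):
        self._sequences = {}

    def register(self, sequence):
        identifier = asv_id(sequence)
        # Registering the same sequence again leaves the existing entry intact.
        self._sequences.setdefault(identifier, normalize(sequence))
        return identifier

    def lookup(self, identifier):
        return self._sequences.get(identifier)

    def __len__(self):
        return len(self._sequences)
```

With this design, two projects that report the same variant obtain the same identifier without coordinating beforehand.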
Using Semantics for morphological Descriptions in Morph•D•Base
Providing data in a semantically structured format has become the gold standard in data science. However, a significant amount of data is still provided as unstructured text - either because it is legacy data or because adequate tools for storing and disseminating data in a semantically structured format are still missing. We have developed a description module for Morph∙D∙Base, a semantic knowledge base for taxonomic and morphologic data, that enables users to generate highly standardized and formalized descriptions of anatomical entities using free text and ontology-based descriptions. The main organizational backbone of a description in Morph∙D∙Base is a partonomy, to which the user adds all the anatomical entities of the specimen that they want to describe. Each element of this partonomy is an instance of an ontology class and can be further described in two different ways:
as semantically enriched free-text description that is annotated with terms from ontologies, and
semantically through defined input forms with a wide range of ontology-terms to choose from.
To facilitate the integration of the free text into a semantic context, text can be automatically annotated using jAnnotator, a JavaScript library that uses about 700 ontologies with more than 8.5 million classes from the National Center for Biomedical Ontology (NCBO) BioPortal. Users choose from suggested class definitions and link them to terms in the text, resulting in a semantic markup of the text. This markup may also include labels of elements that the user already added to the partonomy. Anatomical entities marked in the text can be added to the partonomy as new elements that can subsequently be described semantically using the input forms. Each free text together with its semantic annotations is stored following the W3C Web Annotation Data Model standard (https://www.w3.org/TR/annotation-model). The whole description, with the annotated free text and the formalized semantic descriptions for each element of the partonomy, is saved in the tuplestore of Morph∙D∙Base.
The demonstration is targeted at developers and users of data portals and will give an insight into the semantic Morph∙D∙Base knowledge base (https://proto.morphdbase.de) and jAnnotator (http://git.morphdbase.de/christian/jAnnotator).
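To make the storage format mentioned in this abstract more concrete, the following sketch builds a single annotation in the shape described by the W3C Web Annotation Data Model: an annotation whose body is an ontology class IRI and whose target is a character range in a free-text description. The function name, the example IRIs under example.org, and the choice of a TextPositionSelector are assumptions for illustration; the abstract does not specify which selectors or motivations Morph∙D∙Base or jAnnotator actually use.

```python
import json


def make_annotation(annotation_iri, source_iri, text, exact, class_iri):
    """Link the first occurrence of `exact` in `text` to an ontology class IRI."""
    start = text.index(exact)  # raises ValueError if the term is absent
    end = start + len(exact)   # end offset is exclusive, as in the data model
    return {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": annotation_iri,
        "type": "Annotation",
        "motivation": "tagging",
        "body": class_iri,
        "target": {
            "source": source_iri,
            "selector": {
                "type": "TextPositionSelector",
                "start": start,
                "end": end,
            },
        },
    }


description = "The antenna consists of eleven segments."
annotation = make_annotation(
    "http://example.org/annotation/1",   # hypothetical annotation IRI
    "http://example.org/description/1",  # hypothetical free-text resource
    description,
    "antenna",
    "http://example.org/ontology/antenna",  # placeholder ontology class
)
print(json.dumps(annotation, indent=2))
```

Storing offsets rather than inline markup keeps the free text unchanged while still allowing several annotations to point at the same passage.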
AliGROOVE – visualization of heterogeneous sequence divergence within multiple sequence alignments and detection of inflated branch support
BACKGROUND: Masking of multiple sequence alignment blocks has become a powerful method to enhance the tree-likeness of the underlying data. However, existing masking approaches are insensitive to heterogeneous sequence divergence, which can mislead tree reconstructions. We present AliGROOVE, a new method based on a sliding window and a Monte Carlo resampling approach that visualizes heterogeneous sequence divergence or alignment ambiguity related to single taxa or subsets of taxa within a multiple sequence alignment and tags suspicious branches on a given tree. RESULTS: We used simulated multiple sequence alignments to show that the extent of alignment ambiguity in pairwise sequence comparison is correlated with the frequency of misplaced taxa in tree reconstructions. The approach implemented in AliGROOVE makes it possible to detect nodes within a tree that are supported despite the absence of phylogenetic signal in the underlying multiple sequence alignment. We show that AliGROOVE detects heterogeneous sequence divergence equally well in a case study based on an empirical data set of mitochondrial DNA sequences of chelicerates. CONCLUSIONS: The AliGROOVE approach has the potential to identify single taxa or subsets of taxa which show predominantly randomized sequence similarity in comparison with other taxa in a multiple sequence alignment. It further makes it possible to evaluate the reliability of node support in a novel way.
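The general idea of combining a sliding window with Monte Carlo resampling can be sketched as follows. This is a deliberately simplified illustration and not AliGROOVE's scoring scheme: it uses plain identity between two aligned sequences as the similarity measure and, as a null comparison, redraws the second sequence's characters at random. The function names, the window size, and the number of resamples are all assumptions.

```python
import random


def identity(chars_a, chars_b):
    """Fraction of positions at which two equally long character runs agree."""
    return sum(a == b for a, b in zip(chars_a, chars_b)) / len(chars_a)


def sliding_window_profile(seq_a, seq_b, size, n_resamples=200, seed=0):
    """For each window, return (start, observed identity, random-exceedance rate).

    The exceedance rate is the share of random draws whose identity is at
    least as high as the observed one; values near 1 suggest that the
    observed similarity is no better than random.
    """
    rng = random.Random(seed)
    profile = []
    for start in range(len(seq_a) - size + 1):
        window_a = seq_a[start:start + size]
        window_b = seq_b[start:start + size]
        observed = identity(window_a, window_b)
        hits = 0
        for _ in range(n_resamples):
            # Null draw: characters sampled with replacement from all of seq_b.
            sample = [rng.choice(seq_b) for _ in range(size)]
            if identity(window_a, sample) >= observed:
                hits += 1
        profile.append((start, observed, hits / n_resamples))
    return profile
```

Windows with a high exceedance rate would be the candidates for flagging a sequence pair as showing predominantly random similarity.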
Entry Life-Cycle with automatic Change-History & Provenance Tracking in collaborative Semantic Web Content Management Systems as implemented in SOCCOMAS
SOCCOMAS is a ready-to-use Semantic Ontology-Controlled Content Management System (http://escience.biowikifarm.net/wiki/SOCCOMAS). Each web content management system (WCMS) run by SOCCOMAS is controlled by a set of ontologies and an accompanying Java-based middleware with the data housed in a Jena tuple store. The ontologies describe the behavior of the WCMS, including all of its input forms, input controls, data schemes and workflow processes (Fig. 1).
Data is organized into different types of data entries, which represent collections of data referring to a particular material entity, for instance an individual specimen. SOCCOMAS implements a suite of general processes, which can be used to manage and organize all data entry types. One category of processes manages the life-cycle of a data entry, including all processes required for changing between the following possible entry states:
current draft version;
backup draft version;
recycle bin draft version;
deleted draft version;
current published version;
previously published version.
The processes also allow a user to create a revised draft based on the current published version. Another category of processes automatically tracks the overall provenance (i.e. creator, authors, creation and publication date, contributors, relation between different versions, etc.) for each particular data entry. Additionally, on a significantly finer level of granularity, SOCCOMAS also tracks in a detailed change-history log all changes made to a particular data record at the level of individual input fields. All information (data, provenance metadata, change-history metadata) is stored based on Resource Description Framework (RDF) compliant data schemes in different named graphs (i.e. a URI under which triple statements are stored in the tuple store). All recorded information can be accessed through a SPARQL endpoint. All data entries are Linked Open Data and thus provide access to an HTML representation of the data for visualization in a web-browser or as a machine-readable RDF file. The ontology-controlled design of SOCCOMAS allows administrators to easily customize already existing templates for input forms of data entries, define new templates for new types of data entries, and define underlying RDF-compliant data schemes and apply them to each relevant input field. SOCCOMAS provides an engine for running and developing semantic WCMSs, where only ontology editing, but no middleware and front end programming, is required for adapting the WCMS to one's own specific requirements.
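The entry life-cycle and the field-level change history described above can be pictured with a small in-memory sketch. The six state names are taken from the abstract; the transition table, the class `Entry`, and the layout of the log records are hypothetical, since the abstract does not list which state changes SOCCOMAS permits, and SOCCOMAS itself stores this information as RDF in named graphs rather than in Python objects.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EntryState(Enum):
    CURRENT_DRAFT = "current draft version"
    BACKUP_DRAFT = "backup draft version"
    RECYCLE_BIN_DRAFT = "recycle bin draft version"
    DELETED_DRAFT = "deleted draft version"
    CURRENT_PUBLISHED = "current published version"
    PREVIOUSLY_PUBLISHED = "previously published version"


S = EntryState
# Hypothetical transition table, for illustration only.
ALLOWED = {
    S.CURRENT_DRAFT: {S.BACKUP_DRAFT, S.RECYCLE_BIN_DRAFT, S.CURRENT_PUBLISHED},
    S.BACKUP_DRAFT: {S.CURRENT_DRAFT},
    S.RECYCLE_BIN_DRAFT: {S.CURRENT_DRAFT, S.DELETED_DRAFT},
    S.DELETED_DRAFT: set(),
    S.CURRENT_PUBLISHED: {S.PREVIOUSLY_PUBLISHED},
    S.PREVIOUSLY_PUBLISHED: set(),
}


@dataclass
class Entry:
    state: EntryState = EntryState.CURRENT_DRAFT
    fields: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def set_field(self, name, value, user):
        """Change one input field and log the change at field level."""
        self.history.append({
            "field": name,
            "old": self.fields.get(name),
            "new": value,
            "user": user,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.fields[name] = value

    def move_to(self, new_state):
        """Change the entry state if the transition table allows it."""
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {new_state.value} not allowed")
        self.state = new_state
```

Keeping the transition rules in a data table rather than in code mirrors the ontology-controlled idea in spirit: the rules can be changed without touching the logic that applies them.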
Developing a Module for Generating Formalized Semantic Morphological Descriptions for Morph∙D∙Base
We demonstrate the early prototype of a new module for Morph∙D∙Base that allows the generation of highly formalized semantic morphological descriptions (http://escience.biowikifarm.net/wiki/EScience-Compliant_Standards_for_Morphology). The resulting morphological descriptions follow the individuals-based Instance Anatomy data scheme (as opposed to the class-based Semantic Phenotypes data scheme). The module allows the description of a specimen's anatomy by generating a granular representation of the parts of the specimen to be described, using ontology-terms from known ontologies. This results in a hierarchy of parts and subparts (partonomy), which serves as the organizational backbone of the entire description, with each part representing a section of the description to which the user can navigate using the partonomy. The module allows the description of each part from the partonomy using (1) a set of formalized input forms, which also allow the specification of metadata for each input field, (2) a text-widget for providing conventional free-text descriptions, which can be semantically enriched by annotating them with ontology-terms of (user-)selected ontologies, and (3) an image-widget for linking images, which allows semantically enriching each image by specifying regions of interest and annotating them with ontology-terms of (user-)selected ontologies. This new module is based on SOCCOMAS, an application for semantic ontology-controlled Web-Content-Management-Systems that we are currently developing (http://escience.biowikifarm.net/wiki/SOCCOMAS:_an_application_for_semantic_ontology-controlled_Web-Content-Management-Systems).
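The partonomy described in this abstract is, structurally, a tree in which every node is an instance of an ontology class. A minimal sketch of such a structure is given below; the class `Part`, its attribute names, and the placeholder IRIs under example.org are assumptions for illustration and do not reflect the Instance Anatomy data scheme or the module's internal representation.

```python
from dataclasses import dataclass, field


@dataclass
class Part:
    """One anatomical entity in the partonomy, typed by an ontology class IRI."""
    label: str
    class_iri: str
    subparts: list = field(default_factory=list)

    def add(self, part):
        """Attach a subpart and return it so that calls can be chained."""
        self.subparts.append(part)
        return part

    def walk(self, depth=0):
        """Yield (depth, part) pairs in depth-first order."""
        yield depth, self
        for sub in self.subparts:
            yield from sub.walk(depth + 1)


# Placeholder IRIs; a real description would use terms from known ontologies.
specimen = Part("specimen", "http://example.org/ontology/organism")
head = specimen.add(Part("head", "http://example.org/ontology/head"))
head.add(Part("left antenna", "http://example.org/ontology/antenna"))
specimen.add(Part("thorax", "http://example.org/ontology/thorax"))

for depth, part in specimen.walk():
    print("  " * depth + part.label)
```

Walking the tree in this way corresponds to the navigation role the abstract assigns to the partonomy: each node marks one section of the overall description.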
Semantic Annotations of Text and Images in Morph∙D∙Base