
    Using MIxS: An Implementation Report from Two Metagenomic Information Systems

    MIxS (Minimum Information about any Sequence) (Yilmaz et al. 2011) is a metadata standard of the Genomics Standards Consortium (GSC), designed to make sequence data findable, accessible, and interoperable. It contains fields for recording physical and chemical characteristics of the sampling environment, geographical and habitat information, and other metadata about the sample and its provenance, which are critical for downstream interpretation of data derived from the sample. We will present our experience implementing MIxS in two metagenomic information systems – the Earth Microbiome Project (EMP) and the Government of Canada (GoC) Ecobiomics project. The EMP (Gilbert et al. 2014) is an ongoing effort to crowdsource environmental microbiome samples from around Earth, then sequence and analyze them using a standardized workflow. The EMP has aggregated and sequenced over 50,000 samples, which are queryable using a publicly available catalogue. A meta-analysis of the first 25,000 samples is currently in review. MIxS and the Environment Ontology (ENVO) (Buttigieg et al. 2016) have been useful in structuring environmental metadata from EMP studies. For the particular application of the EMP meta-analysis, however, several issues were encountered: often there are multiple possible 'correct' assignments to the biome, feature, and material fields; the fields are not hierarchical, limiting logical organization; and the primary ecological factors differentiating microbial communities are not captured. In response to these challenges, the EMP team worked with the ENVO team to devise a new hierarchical structure, the EMP ontology (EMPO), that captures the primary axes along which microbial communities tend to be structured (host-associated or not, saline or not). EMPO is an application ontology, with a formally defined W3C Web Ontology Language (OWL) document mapping to existing ontologies, enabling reuse by the microbial ecology community.
Ecobiomics is a joint project of multiple GoC departments and involves the complete workflow, from sampling in a variety of aquatic, soil, and benthic environments, through sample prep, DNA extraction, library prep, sequencing, and analysis. In contrast to the EMP—where some of the samples and metadata had been collected before the establishment of the MIxS standards—the Ecobiomics project has been able to create metadata profiles for each sub-project to conform to, extend, and build upon the existing MIxS standards. Despite these two different contexts, EMP and Ecobiomics encountered a number of common issues that prevented a complete implementation of MIxS. These issues include ambiguous term names and definitions; inconsistencies amongst the environmental packages; non-standard ways of dealing with units; and a number of issues surrounding ENVO (the Environment Ontology), which is required for filling out the mandatory MIxS fields "Environmental material", "Biome", and "Environmental feature". We will describe these issues, and, more generally, the successes and challenges of our implementations.
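The kind of completeness check both projects needed for the mandatory ENVO-backed fields can be sketched as follows. This is a minimal illustration, not the normative MIxS checklist: the field keys and ENVO term labels below are assumptions chosen to mirror the three mandatory fields named in the abstract.

```python
# Illustrative MIxS-style record validation. Field names and term labels
# are hypothetical placeholders, not the official MIxS/ENVO identifiers.
REQUIRED_ENVO_FIELDS = {"biome", "environmental_feature", "environmental_material"}

def validate_mixs_record(record: dict) -> list:
    """Return the sorted list of mandatory environment fields missing from a record."""
    return sorted(REQUIRED_ENVO_FIELDS - record.keys())

sample = {
    "biome": "freshwater lake biome",          # ENVO biome term (label only, illustrative)
    "environmental_feature": "lake",           # ENVO feature term
    "environmental_material": "surface water", # ENVO material term
    "geo_loc_name": "Canada: Ontario",
    "lat_lon": "45.42 N 75.70 W",
}

missing = validate_mixs_record(sample)  # -> [] when all mandatory fields are present
```

A record lacking, say, the feature and material fields would fail such a check, which is one place where the ambiguity issues described above surface in practice.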

    DINA—Development of open source and open services for natural history collections & research

    The DINA Consortium (DINA = "DIgital information system for NAtural history data", https://dina-project.net) is a framework for like-minded practitioners of natural history collections to collaborate on the development of distributed, open source software that empowers and sustains collections management. Target collections include zoology, botany, mycology, geology, paleontology, and living collections. The DINA software will also permit the compilation of biodiversity inventories and will robustly support both observation and molecular data.

The DINA Consortium focuses on an open source software philosophy and on community-driven open development. Contributors share their development resources and expertise for the benefit of all participants. The DINA System is explicitly designed as a loosely coupled set of web-enabled modules. At its core, this modular ecosystem includes strict guidelines for the structure of Web application programming interfaces (APIs), which guarantees the interoperability of all components (https://github.com/DINA-Web). Important to the DINA philosophy is that users (e.g., collection managers, curators) be actively engaged in an agile development process. This ensures that the product is pleasing for everyday use, includes efficient yet flexible workflows, and implements best practices in specimen data capture and management.

There are three options for developing a DINA module:

1. create a new module compliant with the specifications (Fig. 1),
2. modify an existing code-base to attain compliance (Fig. 2), or
3. wrap a compliant API around existing code that cannot be or may not be modified (e.g., infeasible, dependencies on other systems, closed code) (Fig. 3).

All three of these scenarios have been applied in the modules recently developed: a module for molecular data (SeqDB), modules for multimedia, documents, and agents data, and a service module for printing labels and reports.

The SeqDB collection management and molecular tracking system (Bilkhu et al. 2017) has evolved through two of these scenarios. Originally, the required architectural changes were going to be added into the codebase, but after some time, the development team recognised that the technical debt inherent in the project wasn't worth the effort of modification and refactoring. Instead, a new codebase was created, bringing forward the best parts of the system, oriented around the molecular data model for Sanger sequencing and Next Generation Sequencing (NGS) workflows.

In the case of the Multimedia and Document Store module and the Agents module, a brand new codebase was established whose technology choices were aligned with the DINA vision. These two modules have been created from fundamental use cases for collection management and digitization workflows and will continue to evolve as more modules come online and broaden their scope.

The DINA Labels & Reporting module is a generic service for transforming data into arbitrary printable layouts based on customizable templates. In order to use the module in combination with data managed in the collection management software Specify (http://specifysoftware.org) for printing labels of collection objects, we wrapped the Specify 7 API with a DINA-compliant API layer called the "DINA Specify Broker". This allows for using the easy-to-use web-based template engine within the DINA Labels & Reports module without changing Specify's codebase.

In our presentation we will explain the DINA development philosophy and outline benefits for different stakeholders who directly or indirectly use collections data and related research data in their daily workflows. We will also highlight opportunities for joining the DINA Consortium and how best to engage with members of DINA who share their expertise in natural science, biodiversity informatics, and geoinformatics.
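The third development option, wrapping a compliant API around an unmodifiable codebase, can be sketched as a small adapter. This is an illustrative sketch only, not the DINA Specify Broker's actual code; the function name, resource type, and field mapping are assumptions.

```python
# Hypothetical adapter: re-shape a legacy record into a JSON:API-style
# resource object, the document structure DINA-compliant APIs exchange.
def to_jsonapi(resource_type: str, legacy_record: dict) -> dict:
    """Wrap a legacy record in a JSON:API resource document."""
    record = dict(legacy_record)           # avoid mutating the caller's dict
    resource_id = str(record.pop("id"))    # JSON:API ids are strings
    return {
        "data": {
            "type": resource_type,
            "id": resource_id,
            "attributes": record,          # remaining fields become attributes
        }
    }

doc = to_jsonapi("collection-object", {"id": 17, "catalogNumber": "AAFC-0001"})
```

The point of such a layer is that consuming modules see only the compliant document shape, regardless of how the wrapped system stores its data.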

    Representation of Object Provenance for Research on Natural Science Objects: Samples, parts and derivatives in DINA-compliant collection data management

    Collection objects in natural science collections span a diverse set of object types of substantially different origin, physical composition, and relevance for different fields and methodologies of research and application. Object provenance is often characterized by elaborate series of interventions, from collecting or observing originals in a natural state to generating derived objects that can be physically persistent or are suitable for a given use. This sequence of events gives rise to intermediate objects or object states that can be of a persistent or ephemeral nature in their own right. Detailed metadata on object provenance is vital to enable informed use of collection objects for research and other application areas. Providing the ability to generate, maintain, update, and access such accounts is an important requirement for Collection Management Software (CMS).

DINA (Digital Information System for Natural History Data, Glöckler et al. 2020)-compliant collection management software meets this challenge by using process- and state-based representation of object histories and modular application architecture as the main conceptual and architectural principles, respectively (Bölling et al. 2021).

In applying these principles, we showcase how object provenance can be represented in the DINA system in cases where:

- multiple objects, possibly of varying types, are derived from a single object,
- objects consist of parts of different biological individuals,
- object histories involve different types of objects, such as living biological individuals, samples, and preserved specimens.

We highlight how the abstractions and categories used in the DINA model can be used to meet a variety of challenging use cases for representing collection object provenance. For instance, while the connections and relationships between living, preserved, and even destructively processed samples can be documented in DINA, these are ordinarily difficult to accommodate in a single information system.
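The derivation relationships described above can be pictured as a graph in which each object records the object(s) it was derived from. The sketch below is an illustration of that idea under simplified assumptions, not the DINA data model; all class and field names are hypothetical.

```python
# Illustrative derivation graph: each object links back to its parent(s),
# so provenance can be recovered by walking the chain.
from dataclasses import dataclass, field

@dataclass
class CollectionObject:
    identifier: str
    object_type: str                                   # e.g. "preserved specimen"
    derived_from: list = field(default_factory=list)   # parent objects, possibly several

specimen = CollectionObject("S-1", "preserved specimen")
tissue = CollectionObject("T-1", "tissue sample", derived_from=[specimen])
extract = CollectionObject("D-1", "DNA extract", derived_from=[tissue])

def provenance_chain(obj: CollectionObject) -> list:
    """Return identifiers from an object back to its ultimate source(s)."""
    chain = [obj.identifier]
    for parent in obj.derived_from:
        chain.extend(provenance_chain(parent))
    return chain

# provenance_chain(extract) walks D-1 -> T-1 -> S-1
```

Because `derived_from` is a list, the same structure also covers objects assembled from parts of different biological individuals, the second case listed above.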

    Robust Integration of Biodiversity Data by Process- and State-based Representation of Object Histories and Modular Application Architecture

    Biodiversity data is obtained by a variety of methodological approaches—including observation surveys, environmental sampling, and biological object collection—employing diverse sample processing protocols and data transformations. While complete and accurate accounts of these data-generating processes are important to enable integration and informed reuse of data, the structure and content of published biodiversity data currently are often shaped by specific application goals. For example, data publishers that export specimen-based data from collection management systems for inclusion in aggregations like those in the Global Biodiversity Information Facility (GBIF) must frequently relax their internal models and produce unnatural joins to fit GBIF's occurrences-based data structure. Third-party assertions over these aggregated data therefore assume the risk of irreproducibility or concept drift.

Here we introduce process- and state-based representation of object histories as the main organizing principle for data about specimens and samples in Digital Information System for Natural History Data (DINA, Glöckler et al. 2020)-compliant collection management software (Fig. 1). Specimens, samples, and objects in general are subjected to a variety of processes, including planned actions involving the object, e.g., collecting, preparing, subsampling, loaning. Object states are any particular mode of being of an object at a certain point in time. For example, any one intermediate step in preparing a collected specimen for long-term conservation in a collection would constitute an individual object state. An object's history is the entire chain of these interrelated processes and states.

We argue that using object histories as the main conceptual modeling paradigm in DINA offers the generality required to accommodate a diverse, open set of use cases in biodiversity data representation, yet also offers the versatility to serve as a basis for use-case specific data aggregation and presentation. Specifically, a representation based on object histories provides:

- a coherent structure for documenting individual processes and states for any given object and for linking this documentation (e.g., textual descriptions or images pertaining to a given process or state),
- a natural representational structure for the real-world sequence of processes an object participates in and for the data generated in these processes (e.g., a DNA-extraction procedure and sequence information generated on its basis),
- a straightforward structure to link data about related objects (e.g., tissue samples, the biological specimen a bone is derived from) in a network of connected object histories.

The approach is designed to be embedded in DINA's modular application architecture, so that information on object histories can be accessed via corresponding APIs, either through its own interfaces (Fig. 2) or by integration with external web services (Fig. 3). Viewing collection management tasks as part of object histories also informs the delineation of modules to support these tasks with specialized functions and interfaces. It also admits the use of persistent, dereferenceable identifiers for individual processes and states in object histories and for linking their representations to elements in ontologies and controlled vocabularies.

In this contribution to the symposium, DINA's object histories as a main organizing principle for collection object data will be discussed, and the utility of using it in the context of modular application architecture, data federation, and data integration in projects like BiCIKL will be illustrated.
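The process/state alternation described above can be sketched in a few lines. This is a deliberately simplified illustration of the concept, not the DINA schema; the class names and example history are assumptions.

```python
# Illustrative object history: an ordered alternation of processes
# (planned actions on an object) and the states those processes produce.
from dataclasses import dataclass

@dataclass
class Process:
    name: str          # e.g. "collecting", "preparing", "subsampling"

@dataclass
class State:
    description: str   # the object's mode of being after the preceding process

history = [
    Process("collecting"),  State("freshly collected specimen"),
    Process("preparing"),   State("pinned and dried specimen"),
    Process("subsampling"), State("tissue sample removed for DNA work"),
]

def states_of(history: list) -> list:
    """Extract the sequence of object states from a history."""
    return [step.description for step in history if isinstance(step, State)]
```

Because each `Process` and `State` is a distinct element of the chain, each could in principle carry its own persistent identifier and links to documentation, which is the property the abstract highlights.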

    Management of Molecular Data in DINA with SeqDB

    Agriculture and Agri-Food Canada (AAFC) is home to numerous specimen and environmental collections generating highly relational data sets that are analyzed using molecular methods (Sanger and NGS). The need to have a system to properly manage these data sets and to capture accurate, standardized metadata over entire laboratory workflows has been a long-term strategic vision of the Biodiversity group at AAFC. Without robust tracking, many difficulties arise when trying to publish or submit data to external repositories. Even knowing what work has been carried out on individual collection records over a researcher's career becomes a demanding task, if the information is retrievable at all. SeqDB was built to resolve these issues by centralizing, standardizing, and improving the availability and data quality of source specimen collection data that is being studied using molecular methods. SeqDB also facilitates integration with tools and external repositories in order to take the burden off researchers and technicians having to create adequate systems to track and mobilize their data sets, allowing them to focus on research and collection management. The development of SeqDB aligns with agile development methodologies and attempts to fulfill rapidly emerging needs from genetics and genomics research, which can evolve and fade quickly at times or be without clear requirements. The success of SeqDB as an application supporting DNA sequencing workflows has put it in the same space as other monolithic architectures before it. As the feature set to support the application continues to increase, the balance of software developers versus operations and maintenance staff is difficult to maintain in our organisation.
In an effort to manage the scope of the project and ensure we are able to continue to deliver on our mandate, the sequence tracking workflows of the application will become part of the DINA ecosystem ("DIgital information system for NAtural history data", https://dina-project.net). Other functions of SeqDB, such as collections management and taxonomy tree curation, will be replaced with the DINA modules implementing these functions. In order to allow SeqDB to become a module of DINA, it has been decided to refactor the application to base it on a Service Oriented Architecture. By doing so, all molecular data of SeqDB will be exposed as JSON API web services (JavaScript Object Notation application programming interface), allowing other modules, user interfaces, and the current SeqDB application to communicate in a standardised way. The new architecture will also bring an important technology upgrade for SeqDB, where the front end will eventually become a project in itself.
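As a hedged illustration of the standardised communication described above, the snippet below parses a JSON:API-style document of the kind a sibling module might receive from such a service. The resource type, endpoint payload, and attribute names are hypothetical, not SeqDB's actual API.

```python
# Consuming a JSON:API-style document; only the standard library is used.
# The "pcr-batch" resource type and its attributes are illustrative.
import json

response_body = json.dumps({
    "data": [
        {"type": "pcr-batch", "id": "42",
         "attributes": {"primerForward": "ITS1F", "primerReverse": "ITS4"}},
    ]
})

def extract_attributes(body: str, resource_type: str) -> list:
    """Pull attribute dicts for one resource type out of a JSON:API document."""
    doc = json.loads(body)
    return [r["attributes"] for r in doc.get("data", []) if r["type"] == resource_type]

batches = extract_attributes(response_body, "pcr-batch")
```

Because every module exchanges the same document shape, a consumer like this needs no knowledge of the producing module's internal data model.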

    SeqDB: Biological Collection Management with Integrated DNA Sequence Tracking 

    Agriculture and Agri-Food Canada (AAFC) is home to a world-class taxonomy program based on Canada's national agricultural collections for Botany, Mycology and Entomology. These collections contain valuable resources, such as type specimens for authoritative identification using approaches that include phenotyping, DNA barcoding, and whole genome sequencing. These authoritative references allow for accurate identification of the taxonomic biodiversity found in environmental samples in fields such as metagenomics. AAFC's internally developed web application, termed SeqDB, tracks the complete workflow and provenance chain from source specimen information through DNA extractions, PCR reactions, and sequencing leading to binary DNA sequence files. In the context of Next Generation Sequencing (NGS) of environmental samples, SeqDB tracks sampling metadata, DNA extractions, and the library preparation workflow leading to demultiplexed sequence files. SeqDB implements the Taxonomic Databases Working Group (TDWG) Darwin Core standard (Wieczorek et al. 2012) for biodiversity occurrence data, as well as the Genome Standards Consortium (GSC) Minimum Information about any (X) Sequences (MIxS) specification (Yilmaz et al. 2011). When coupled with the built-in data standards validation system, this has led to the ability to search consistent metadata across multiple studies. Furthermore, the application enables tracking the physical storage of the aforementioned specimens and their derivative molecular extracts using an integrated barcode printing and reading system. All the information is presented using a graphical user interface that features intuitive molecular workflows, as well as a RESTful API that facilitates integration with external applications and programmatic access to the data.
The success of SeqDB has been due to the close collaboration with scientists and technicians undertaking molecular research involving the national collection, and the centralization of their data sets in an access-controlled relational database implementing internationally recognized standards. We will describe the overall system, and some of our lessons learned in building it.
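The provenance chain described above (specimen, DNA extraction, PCR reaction, sequence file) can be sketched as an ordered list of linked records. This is an illustrative model under simplified assumptions, not SeqDB's schema; the step names, identifiers, and record shape are hypothetical.

```python
# Illustrative workflow-chain check: each record must follow the expected
# step order and reference the record that precedes it.
WORKFLOW_ORDER = ["specimen", "dna_extraction", "pcr_reaction", "sequence_file"]

def is_valid_chain(records: list) -> bool:
    """True if records follow the workflow order and each links to its predecessor."""
    if [r["step"] for r in records] != WORKFLOW_ORDER[:len(records)]:
        return False
    return all(records[i]["parent"] == records[i - 1]["id"]
               for i in range(1, len(records)))

chain = [
    {"id": "SP-9",  "step": "specimen",       "parent": None},
    {"id": "EX-3",  "step": "dna_extraction", "parent": "SP-9"},
    {"id": "PCR-7", "step": "pcr_reaction",   "parent": "EX-3"},
    {"id": "AB1-1", "step": "sequence_file",  "parent": "PCR-7"},
]

# A record whose parent link does not match the preceding id breaks the chain.
broken = [dict(chain[0]),
          {"id": "EX-3", "step": "dna_extraction", "parent": "WRONG"}]
ok, bad = is_valid_chain(chain), is_valid_chain(broken)
```

A check of this kind is what makes it possible to answer, years later, exactly which specimen a given sequence file traces back to.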


    The Ecobiomics project: Advancing metagenomics assessment of soil health and freshwater quality in Canada

    Transformative advances in metagenomics are providing an unprecedented ability to characterize the enormous diversity of microorganisms and invertebrates sustaining soil health and water quality. These advances are enabling a better recognition of the ecological linkages between soil and water, and the biodiversity exchanges between these two reservoirs. They are also providing new perspectives for understanding microorganisms and invertebrates as part of interacting communities (i.e., microbiomes and zoobiomes), and for considering plants, animals, and humans as holobionts composed of their own cells as well as diverse microorganisms and invertebrates often acquired from soil and water. The Government of Canada's Genomics Research and Development Initiative (GRDI) launched the Ecobiomics Project to coordinate metagenomics capacity building across federal departments, and to apply metagenomics to better characterize microbial and invertebrate biodiversity for advancing environmental assessment, monitoring, and remediation activities. The Project has adopted standard methods for soil, water, and invertebrate sampling; for the collection and provenance of metadata; and for nucleic acid extraction. High-throughput sequencing is located at a centralized sequencing facility. A centralized Bioinformatics Platform was established to enable a novel government-wide approach to harmonize metagenomics data collection, storage, and bioinformatics analyses. Sixteen research projects were initiated under the Soil Microbiome, Aquatic Microbiome, and Invertebrate Zoobiome themes. Genomic observatories were established at long-term environmental monitoring sites to provide more comprehensive biodiversity reference points for assessing environmental change.