18 research outputs found
MapOptics: A light-weight, cross-platform visualisation tool for optical mapping alignment
Availability and implementation:
MapOptics is implemented in Java 1.8 and released under an MIT licence. MapOptics can be downloaded from https://github.com/FadyMohareb/mapoptics and run on any standard desktop computer equipped with a Java Virtual Machine (JVM).
Supplementary data are available at Bioinformatics online. Bionano optical mapping is a technology that can assist in the final stages of genome assembly, lengthening and ordering scaffolds in a draft assembly by aligning the assembly to a genomic map. However, tools for visualisation are currently limited to the Windows operating system or were initially developed for visualising large-scale structural variation. MapOptics is a lightweight cross-platform tool that enables the user to visualise and interact with the alignment of Bionano optical mapping data and can be used for in-depth exploration of hybrid scaffolding alignments. It provides a fast, simple alternative to the large optical mapping analysis programs currently available for this area of research.
CRAMER: A lightweight, highly customisable web-based genome browser supporting multiple visualisation instances
In recent years the ability to generate genomic data has increased dramatically, along with the demand for easily personalised and customisable genome browsers for effective visualisation of diverse types of data. Despite the large number of web-based genome browsers now available, none of the existing tools provides a means of creating multiple visualisation instances without manual set-up on the deployment server side. The Cranfield Genome Browser (CRAMER) is an open-source, lightweight and highly customisable web application for interactive visualisation of genomic data. Once deployed, CRAMER supports seamless creation of multiple visualisation instances in parallel while allowing users to control and customise multiple tracks. The application is deployed on a Node.js server and is supported by a MongoDB database which stores all customisations made by the users, allowing quick navigation between instances. Currently, the browser supports visualising a large number of file formats for genome annotation, variant calling, read coverage and gene expression. Additionally, the browser supports direct JavaScript coding for personalised tracks, providing a whole new level of customisation both functionally and visually. Tracks can be added via direct file upload or processed in real time via links to files stored remotely on an FTP repository. Furthermore, additional tracks can be added by users via simple drag and drop to an existing visualisation instance.
The European Reference Genome Atlas: piloting a decentralised approach to equitable biodiversity genomics.
A global genome database of all of Earth’s species diversity could be a treasure trove of scientific discoveries. However, despite major advances in genome sequencing technologies, only a tiny fraction of species have genomic information available. To contribute to a more complete planetary genomic database, scientists and institutions across the world have united under the Earth BioGenome Project (EBP), which plans to sequence and assemble high-quality reference genomes for all ∼1.5 million recognized eukaryotic species through a stepwise phased approach. As the initiative transitions into Phase II, where 150,000 species are to be sequenced in just four years, worldwide participation in the project will be fundamental to success. As the European node of the EBP, the European Reference Genome Atlas (ERGA) seeks to implement a new decentralised, accessible, equitable and inclusive model for producing high-quality reference genomes, which will inform EBP as it scales. To embark on this mission, ERGA launched a Pilot Project to establish a network across Europe to develop and test the first infrastructure of its kind for the coordinated and distributed reference genome production on 98 European eukaryotic species from sample providers across 33 European countries. Here we outline the process and challenges faced during the development of a pilot infrastructure for the production of reference genome resources, and explore the effectiveness of this approach in terms of high-quality reference genome production, considering also equity and inclusion. The outcomes and lessons learned during this pilot provide a solid foundation for ERGA while offering key learnings to other transnational and national genomic resource projects.
The ENA Source Attribute Helper: An API for improved biological source data
Metadata management for sequence data is essential for the accurate description of Earth’s biodiversity. Within metadata attributes, those that reference the biological sources of sequences and samples and allow linking to the specimen or sample of origin are fundamental for facilitating connections between molecular biology, taxonomy, systematic biology and biodiversity research, increasing the discoverability and usability of data by researchers worldwide. Sequence data is publicly archived at the International Nucleotide Sequence Database Collaboration (INSDC), which includes the National Center for Biotechnology Information (NCBI), the DNA Data Bank of Japan (DDBJ) and the European Nucleotide Archive (ENA). Sequences stored at INSDC have a considerable range of associated metadata, including attributes related to their biological source, such as references to natural history collections or culture collections. However, these source attributes are not always submitted or may be incomplete, limiting the association of the sequence records with the original source material and hampering further data connections (e.g., biological data associated with the voucher or species distribution data). Therefore, we have developed the ENA Source Attribute Helper API, a tool that aims to assist users in the submission of accurate attributes referring to the biological source of samples and sequence data. This tool was developed within the scope of BiCIKL (Biodiversity Community Integrated Knowledge Library) (Penev et al. 2022), a Horizon 2020 project which targets building a wide, biodiversity-related community for connecting data along the different axes of biodiversity research. The first version of the tool was designed to support correct annotation of the attributes that identify the source material from which the sample or sequence was obtained, namely /specimen_voucher, /culture_collection, and /biomaterial (INSDC 2021).
These attributes follow a Darwin Core Triplet format (Wieczorek et al. 2012), composed of institution code, collection code and the specimen, culture, or material identifier, respectively. Since the submission of the biological source attributes to the INSDC may be performed either when data is initially uploaded or in subsequent updates, using a variety of tools, we developed the API as an open source tool that is publicly accessible and may be used as a free-standing service. The API is built using Representational State Transfer (REST) API architecture and is designed to use the data available in NCBI BioCollections (Sharma et al. 2018). NCBI BioCollections is a curated database of metadata for natural history collections, associated with records in INSDC, that includes the institution and collection codes. The API's main functions include querying the metadata for institutions and collections based on the user input (the API presents both exact matches and similar matches), validating institution and collection codes in the attribute strings provided by the user, and constructing the attribute string based on the user-provided information. The API does not include the search or validation of the voucher specimen codes. The API is designed so that it can be extended easily for any future enhancements, and it is initially expected to promote and support the submission and any subsequent curation of better structured and more richly described source data. We expect this tool to contribute to better connected biodiversity data and hence provide a stronger foundation to strengthen the value of natural history collections, taxonomic expertise, and biodiversity knowledge.
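As an illustration of the triplet format described in this abstract, the following is a minimal Python sketch of how a source attribute string might be constructed and split. It is not the ENA Source Attribute Helper API and does not call it. The function names, the colon separator, the treatment of institution and collection codes as optional, and the placeholder codes INST and COLL are all assumptions made for this example. Like the API, the sketch does not validate the specimen identifier itself; unlike the API, it does not check codes against NCBI BioCollections.

```python
from typing import NamedTuple, Optional


class SourceAttribute(NamedTuple):
    """Parts of a triplet-style source attribute (hypothetical helper type)."""
    institution_code: Optional[str]
    collection_code: Optional[str]
    identifier: str


def build_attribute(identifier: str,
                    institution_code: Optional[str] = None,
                    collection_code: Optional[str] = None) -> str:
    """Join the parts into one string, assuming ':' as the separator."""
    if collection_code and not institution_code:
        # Assumption: a collection code is only meaningful with an institution code.
        raise ValueError("collection_code requires institution_code")
    parts = [p for p in (institution_code, collection_code) if p]
    parts.append(identifier)
    return ":".join(parts)


def parse_attribute(value: str) -> SourceAttribute:
    """Split an attribute string into at most three parts.

    Any further ':' characters are kept inside the identifier.
    """
    parts = value.split(":", 2)
    if len(parts) == 3:
        return SourceAttribute(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        return SourceAttribute(parts[0], None, parts[1])
    return SourceAttribute(None, None, parts[0])


# INST and COLL are placeholder codes, not real BioCollections entries.
attribute = build_attribute("A-001", "INST", "COLL")
print(attribute)
print(parse_attribute(attribute))
```

A real client would replace the placeholder codes with values confirmed against the curated institution and collection records before building the string.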
Towards Connecting Molecular Data and the Biodiversity Research Community: An ENA and ELIXIR biodiversity community perspective
Global and regional efforts for generating molecular sequencing data are fundamental to characterise and monitor the Earth’s biodiversity. However, exploiting the full potential of molecular data for biodiversity monitoring and conservation is still a challenge. There is still the need to fully connect the generation and archiving of sequence data with other biodiversity infrastructures, thereby promoting Findability, Accessibility, Interoperability and Reusability (FAIR) of data. Here we present the ongoing activities and future plans of the European Life-Science Infrastructure (ELIXIR) and the European Molecular Biology Laboratory European Bioinformatics Institute’s (EMBL-EBI) European Nucleotide Archive (ENA, the European node of the International Nucleotide Sequence Database Collaboration - INSDC) towards an enriched set of sequence data connected to the wider biodiversity research community. ELIXIR has an emerging Biodiversity Community that was originally created as a focus group in 2019, to better align the work in biodiversity across the ELIXIR Nodes and with global initiatives in the biodiversity domain. This group has been working on understanding the capabilities, interests and ongoing projects that exist across the Nodes, developing connections with external partners in the biodiversity area (e.g. Global Biodiversity Information Facility, GBIF; LifeWatch ERIC) and developing a longer-term strategy for support of biodiversity by ELIXIR. A recent opinion piece by the group (Waterhouse et al. 2021) highlights opportunities for infrastructure developments in the area of biodiversity and provides recommendations for closer integration of molecular data with biodiversity research.
These recommendations include the alignment of taxonomies across domains and the general adoption of standardized metadata. ELIXIR and EMBL-EBI are involved in several biodiversity genomics initiatives, including the Earth BioGenome Project (EBP), the Darwin Tree of Life Project (DToL), the European Reference Genome Atlas (ERGA), and BIOSCAN Europe, where support is being provided to data curation, submission and visibility and in the definition of standards for the associated metadata (e.g. Lawniczak et al. 2022). Moreover, EMBL-EBI is a partner of UniEuk, an initiative that is working towards building a flexible universal taxonomic framework for eukaryotes. ELIXIR and EMBL-EBI are also part of the Biodiversity Community Integrated Knowledge Library (BiCIKL), a Horizon 2020 project that is working towards establishing FAIR practices in the biodiversity domain, and thereby developing tools and workflows for connecting data along the biodiversity research cycle (Penev et al. 2022). These projects and community efforts are contributing to improving metadata standards and pushing the development of tools and workflows to support enriched metadata and increased linkage with other biodiversity infrastructures. Overall, we need to continue to work towards a strong foundation of interlinked knowledge to be able to effectively respond to global challenges such as biodiversity loss and ecosystem change.
Improving FAIRness of eDNA and Metabarcoding Data: Standards and tools for European Nucleotide Archive data deposition
Advances in sequencing technologies have promoted the generation of molecular data for cataloguing and describing biodiversity. The analysis of environmental DNA (eDNA) through the application of metabarcoding techniques enables comprehensive descriptions of communities and their function, being fundamental for understanding and preserving biodiversity. Metabarcoding is becoming widely used and standard methods are being generated for a growing range of applications with high scalability. The generated data can be made available in its unprocessed form, as raw data (the sequenced reads) or as interpreted data, including sets of sequences derived after bioinformatics processing (Amplicon Sequence Variants (ASVs) or Operational Taxonomic Units (OTUs)) and occurrence tables (tables that describe the occurrences and abundances of species or OTUs/ASVs). However, for this data to be Findable, Accessible, Interoperable and Reusable (FAIR), and therefore fully available for meaningful interpretation, it needs to be deposited in public repositories together with enriched sample metadata, protocols and analysis workflows (ten Hoopen et al. 2017). Metabarcoding raw data and associated sample metadata is often stored and made available through the International Nucleotide Sequence Database Collaboration (INSDC) archives (Arita et al. 2020), of which the European Nucleotide Archive (ENA, Burgin et al. 2022) is the European database, but it is often deposited with minimal information, which hinders data reusability. Within the scope of the Horizon 2020 project, Biodiversity Community Integrated Knowledge Library (BiCIKL), which is building a community of interconnected data for biodiversity research (Penev et al. 2022), we are working towards improving the standards for molecular ecology data sharing, developing tools to facilitate data deposition and retrieval, and linking between data types.
Here we will present the ENA data model, showcasing how metabarcoding data can be shared, while providing enriched metadata, and how this data is linked with existing data in other research infrastructures in the biodiversity domain, such as the Global Biodiversity Information Facility (GBIF), where data is deposited following the guidelines published in Abarenkov et al. (2023). We will also present the results of our recent discussions on standards for this data type and discuss future plans towards continuing to improve data sharing and interoperability for molecular ecology.
The ELIXIR Biodiversity Community: Understanding short- and long-term changes in biodiversity [version 2; peer review: 2 approved, 1 not approved]
Biodiversity loss is now recognised as one of the major challenges for humankind to address over the next few decades. Unless major actions are taken, the sixth mass extinction will lead to catastrophic effects on the Earth’s biosphere and human health and well-being. ELIXIR can help address the technical challenges of biodiversity science, through leveraging its suite of services and expertise to enable data management and analysis activities that enhance our understanding of life on Earth and facilitate biodiversity preservation and restoration. This white paper, prepared by the ELIXIR Biodiversity Community, summarises the current status and responses, and presents a set of plans, both technical and community-oriented, that should enhance both how ELIXIR Services are applied in the biodiversity field and how ELIXIR builds connections across the many other infrastructures active in this area. We discuss the areas of highest priority, how they can be implemented in cooperation with the ELIXIR Platforms, and their connections to existing ELIXIR Communities and international consortia. The article provides a preliminary blueprint for a Biodiversity Community in ELIXIR and is an appeal to identify and involve new stakeholders.
MGnify: the microbiome sequence data analysis resource in 2023
The MGnify platform (https://www.ebi.ac.uk/metagenomics) facilitates the assembly, analysis and archiving of microbiome-derived nucleic acid sequences. The platform provides access to taxonomic assignments and functional annotations for nearly half a million analyses covering metabarcoding, metatranscriptomic, and metagenomic datasets, which are derived from a wide range of different environments. Over the past 3 years, MGnify has not only grown in terms of the number of datasets contained but also increased the breadth of analyses provided, such as the analysis of long-read sequences. The MGnify protein database now exceeds 2.4 billion non-redundant sequences predicted from metagenomic assemblies. This collection is now organised into a relational database, making it possible to understand the genomic context of the protein through navigation back to the source assembly and sample metadata, marking a major improvement. To extend beyond the functional annotations already provided in MGnify, we have applied deep learning-based annotation methods. The technology underlying MGnify's Application Programming Interface (API) and website has been upgraded, and we have enabled downstream analysis of the MGnify data through the introduction of a coupled Jupyter Lab environment.
Establishing the ELIXIR Microbiome Community
Microbiome research has grown substantially over the past decade in terms of the range of biomes sampled, identified taxa, and the volume of data derived from the samples. In particular, experimental approaches such as metagenomics, metabarcoding, metatranscriptomics and metaproteomics have provided profound insights into the vast, hitherto unknown, microbial biodiversity. The ELIXIR Marine Metagenomics Community, initiated amongst researchers focusing on marine microbiomes, has concentrated on promoting standards around microbiome-derived sequence analysis, as well as understanding the gaps in methods and reference databases, and solutions to computational overheads of performing such analyses. Nevertheless, the methods used and the challenges faced are not confined to marine studies, but are broadly applicable to all other biomes. Thus, expanding this Community to a more inclusive ELIXIR Microbiome Community will enable it to encompass a broad range of biomes and link expertise across ‘omics technologies. Furthermore, engaging with a large number of researchers will improve the efficiency and sustainability of bioinformatics infrastructure and resources for microbiome research (standards, data, tools, workflows, training), which will enable a deeper understanding of the function and taxonomic composition of the different microbial communities.
Specimen and sample metadata standards for biodiversity genomics: a proposal from the Darwin Tree of Life project
The vision of the Earth BioGenome Project is to complete reference genomes for all of the planet’s ~2M described eukaryotic species in the coming decade. To contribute to this global endeavour, the Darwin Tree of Life Project (DToL) was launched in 2019 with the aim of generating complete genomes for the ~70k described eukaryotic species that can be found in Britain and Ireland. One of the early tasks of the DToL project was to determine, define, and standardise the important metadata that must accompany every sample contributing to this ambitious project. This ensures high-quality contextual information is available for the associated data, enabling a richer set of information upon which to search and filter datasets as well as enabling interoperability between datasets used for downstream analysis. Here we describe some of the key factors we considered in the process of determining, defining, and documenting the metadata required for DToL project samples. The manifest and Standard Operating Procedure that are referred to throughout this paper are likely to be useful for other projects, and we encourage re-use while maintaining the standards and rules set out here.