
    Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections [Version 1]

    We describe an effective approach to the automated text digitisation of natural history specimen labels. These labels contain much useful data about the specimen, including its collector, country of origin and collection date. Our approach to automatically extracting these data takes the form of a pipeline, and recommendations are made for the pipeline's component parts based on some of the state-of-the-art technologies.

    Optical Character Recognition (OCR) can be used to digitise text on images of specimens. However, recognising text quickly and accurately from these images can be a challenge for OCR. We show that OCR performance can be improved by prior segmentation of specimen images into their component parts. This ensures that only text-bearing labels are submitted for OCR processing, as opposed to whole specimen images, which inevitably contain non-textual information that may lead to false positive readings. In our testing, Tesseract OCR version 4.0.0 offers promising text recognition accuracy with segmented images.

    Not all the text on specimen labels is printed. Handwritten text varies much more and does not conform to standard shapes and sizes of individual characters, which poses an additional challenge for OCR. Recently, deep learning has allowed significant advances in this area. Google's Cloud Vision, which is based on deep learning, is trained on large-scale datasets and is shown to be quite adept at this task. This may take us some way towards negating the need for humans to routinely transcribe handwritten text.

    Determining the countries and collectors of specimens has been the goal of previous automated text digitisation research activities. Our approach also focuses on these two pieces of information. An area of Natural Language Processing (NLP) known as Named Entity Recognition (NER) has matured enough to semi-automate this task. Our experiments demonstrated that existing approaches can accurately recognise location and person names within the text extracted from segmented images via Tesseract version 4.0.0. Potentially, NER could be used in conjunction with other online services, such as those of the Biodiversity Heritage Library (https://www.biodiversitylibrary.org/docs/api3.html), to map the named entities to entities in the biodiversity literature.

    We have highlighted the main recommendations for potential pipeline components. The document also provides guidance on selecting appropriate software solutions, including automatic language identification, terminology extraction, and the integration of all pipeline components into a scientific workflow to automate the overall digitisation process.
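    A minimal sketch of the kind of segment → OCR → NER pipeline described above, assuming hypothetical label bounding boxes stand in for the output of the segmentation step; pytesseract wraps the Tesseract engine and spaCy's pretrained English model supplies the NER component (illustrative choices, not necessarily the authors' exact tooling):

```python
# Minimal sketch of a segment -> OCR -> NER pipeline, assuming label
# bounding boxes are already available from a segmentation model.
# pytesseract and spaCy are illustrative stand-ins, not the paper's code.
from PIL import Image
import pytesseract   # Python wrapper around the Tesseract OCR engine
import spacy         # pretrained NER via the small English model

nlp = spacy.load("en_core_web_sm")

def digitise_specimen(image_path, label_boxes):
    """OCR each label crop, then pull out person and location entities."""
    image = Image.open(image_path)
    records = []
    for box in label_boxes:  # (left, upper, right, lower) pixel coordinates
        text = pytesseract.image_to_string(image.crop(box))
        doc = nlp(text)
        records.append({
            "text": text.strip(),
            "collectors": [e.text for e in doc.ents if e.label_ == "PERSON"],
            "locations": [e.text for e in doc.ents if e.label_ in ("GPE", "LOC")],
        })
    return records

# Hypothetical usage with one hand-specified label region:
# digitise_specimen("specimen_0001.jpg", [(1200, 2400, 2200, 3000)])
```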

    Community engagement: The ‘last mile’ challenge for European research e-infrastructures

    Europe is building its Open Science Cloud: a set of robust and interoperable e-infrastructures with the capacity to provide data and computational solutions through cloud-based services. The development and sustainable operation of such e-infrastructures are at the forefront of European funding priorities. The research community, however, is still reluctant to engage at the scale required to signal a Europe-wide change in the mode of operation of scientific practices. The striking differences in uptake rates between researchers from different scientific domains indicate that communities do not share the benefits of the above European investments equally. We highlight the need to support research communities in organically engaging with the European Open Science Cloud through the development of trustworthy and interoperable Virtual Research Environments. These domain-specific solutions can support communities in gradually bridging technical and socio-cultural gaps between traditional and open digital science practice, better diffusing the benefits of European e-infrastructures.

    Assembly of high nuclearity clusters from a family of tripodal tris-carboxylate ligands

    A family of four tris-carboxylic acid ligands, 1,3,5-tris(4′-carboxybiphenyl-2-yl)benzene (H3L1), 1,3,5-tris-2-carboxyphenylbenzene (H3L2), 1,3,5-tris(4″-carboxy-para-terphenyl-2-yl)benzene (H3L3) and 1,3,5-tris(3′-carboxybiphenyl-2-yl)benzene (H3L4), have been synthesised and reacted with first-row transition metal cations to give nine complexes, which have been structurally characterised by X-ray crystallography. The ligands share a common design motif, having three arms connected to a benzene core via three ortho-disubstituted phenyl linkers. The ligands vary in the length and direction of the carboxylic acid functionalised arms and are all able to adopt tripodal conformations in which the three arms are directed facially. The structures of [Zn8(μ4-O)(L1)4(HCO2)2(H2O)0.33(DMF)2] (1a-Zn), [Co14(L2)6(μ3-OH)8(HCO2)2(DMF)4(H2O)6] (2-Co), [Ni14(L2)6(μ3-OH)8(HCO2)2(DMF)4(H2O)6] (2-Ni), [Zn8(μ4-O)(L3)4(DMF)(H2O)4(NO3)2] (3-Zn), [Ni5(μ-OH)4(L2)2(H2O)6(DMF)4] (5-Ni), [Co8(μ4-O)4(L4)4(DMF)3(H2O)] (6-Co) and [Fe3(μ3-O)(L4)2(H2O)(DMF)2] (7-Fe) contain polynuclear clusters surrounded by ligands (L1–4)3− in tripodal conformations. The structure of [Zn2(HL1)2(DMF)4] (1b-Zn) shows it to be a binuclear complex in which the two ligands (HL1)2− are partially deprotonated, whilst {[Zn3(L2)2(DMF)(H2O)(C5H5N)]·6(DMF)}n (4-Zn) is a 2D coordination network containing {Zn2(RCO2)4(solv)2} paddlewheel units. The conformations of the ligand arms in the complexes have been analysed, confirming that the shared ortho-disubstituted phenyl ring motif is a powerful and versatile tool for designing ligands able to form high-nuclearity coordination clusters when reacted with transition metal cations.

    A benchmark dataset of herbarium specimen images with label data

    More and more herbaria are digitising their collections. Images of specimens are made available online to facilitate access to them and to allow extraction of information from them. Transcription of the data written on specimens is critical for general discoverability and enables incorporation into large aggregated research datasets. Different methods, such as crowdsourcing and artificial intelligence, are being developed to optimise transcription, but herbarium specimens pose difficulties in data extraction for many reasons. To provide developers of transcription methods with a means of optimisation, we have compiled a benchmark dataset of 1,800 herbarium specimen images with corresponding transcribed data. These images originate from nine different collections and include specimens that reflect the multiple potential obstacles that transcription methods may encounter, such as differences in language, text format (printed or handwritten), specimen age and nomenclatural type status. We are making these specimens available with a Creative Commons Zero licence waiver and with permanent online storage of the data. By doing this, we are minimising the obstacles to the use of these images for transcription training. This benchmark dataset of images may also be used wherever a defined and documented set of herbarium specimens is needed, such as for the extraction of morphological traits, handwriting recognition and colour analysis of specimens.
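    To illustrate how such a benchmark might be used to score a transcription method, here is a minimal sketch assuming a hypothetical local layout in which each image specimen.jpg sits beside its ground-truth transcription specimen.txt; it reports a mean character error rate (CER) for any OCR callable:

```python
# Hypothetical evaluation harness for an image/transcription benchmark.
# The file layout (*.jpg beside *.txt) is an assumption for illustration.
from pathlib import Path

def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def benchmark_cer(dataset_dir: str, ocr_fn) -> float:
    """Mean CER of ocr_fn(image_path) over all image/transcription pairs."""
    rates = []
    for img in Path(dataset_dir).glob("*.jpg"):
        truth = img.with_suffix(".txt").read_text().strip()
        pred = ocr_fn(img)  # any callable: image path -> recognised text
        rates.append(levenshtein(pred, truth) / max(len(truth), 1))
    return sum(rates) / len(rates)
```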

    Cross-validation of a semantic segmentation network for natural history collection specimens

    Semantic segmentation has been proposed as a tool to accelerate the processing of natural history collection images. However, developing a flexible and resilient segmentation network requires an approach to adaptation that allows different datasets to be processed with minimal training and validation. This paper presents a cross-validation approach designed to determine whether a semantic segmentation network possesses the flexibility required for application across different collections and institutions. The specific objectives of cross-validating the semantic segmentation network are to (a) evaluate the effectiveness of the network for segmenting image sets derived from collections other than the one on which the network was initially trained; and (b) test the adaptability of the segmentation network for use with other types of collections. Resilience to data variations between institutions and portability across different types of collections are required to confirm the network's general applicability. The proposed validation method is tested on the Natural History Museum semantic segmentation network, designed to process entomological microscope slides. The network is evaluated through a series of cross-validation experiments using data from two types of collections: microscope slides (from three institutions) and herbarium sheets (from seven institutions). The main contributions of this work are the method, software and ground truth sets created for this cross-validation, as they can be reused to test similar segmentation proposals in the context of the digitisation of natural history collections. Cross-validation of segmentation methods should be a required step in the integration of such methods into image processing workflows for natural history collections.
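    An illustrative sketch of the leave-one-institution-out idea described above (not the authors' released software): train the network on one collection, then score its predictions on every other collection using mean intersection-over-union. train_network and load_collection are hypothetical placeholders:

```python
# Cross-institution validation loop for a semantic segmentation network.
# `train_network` and `load_collection` are hypothetical placeholders
# standing in for whatever training code and data loaders are available.
import numpy as np

def mean_iou(pred: np.ndarray, truth: np.ndarray, n_classes: int) -> float:
    """Mean intersection-over-union across semantic classes."""
    ious = []
    for c in range(n_classes):
        p, t = pred == c, truth == c
        union = np.logical_or(p, t).sum()
        if union:  # skip classes absent from both masks
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

def cross_validate(institutions, n_classes, train_network, load_collection):
    """Leave-one-institution-out scores: train on one, test on the rest."""
    scores = {}
    for source in institutions:
        model = train_network(load_collection(source))
        for target in institutions:
            if target == source:
                continue
            images, masks = load_collection(target)
            preds = [model(img) for img in images]
            scores[(source, target)] = np.mean(
                [mean_iou(p, m, n_classes) for p, m in zip(preds, masks)])
    return scores
```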

    Unifying European Biodiversity Informatics (BioUnify)

    In order to preserve the variety of life on Earth, we must understand it better. Biodiversity research is at a pivotal point, with research projects generating data at an ever-increasing rate. Structuring, aggregating, linking and processing these data in a meaningful way is a major challenge. The systematic application of information management and engineering technologies to the study of biodiversity (biodiversity informatics) helps transform data into knowledge. However, existing e-infrastructures must take concerted action to develop and adopt common standards, provide for interoperability and avoid overlaps in functionality. This would unify the currently fragmented landscape that prevents European biodiversity research from reaching its full potential. The overarching goal of this COST Action is to coordinate existing research and capacity-building efforts, through a bottom-up, trans-disciplinary approach, by unifying biodiversity informatics communities across Europe in order to support the long-term vision of modelling biodiversity on Earth. BioUnify will: 1. specify technical requirements, and evaluate and improve models for efficient data and workflow storage, sharing and re-use, within and between different biodiversity communities; 2. mobilise taxonomic, ecological, genomic and biomonitoring data generated and curated by natural history collections, research networks and remote sensing sources in Europe; 3. leverage the results of ongoing biodiversity informatics projects by identifying and developing functional synergies at the individual, group and project levels; 4. raise technical awareness and transfer skills between biodiversity researchers and information technologists; 5. formulate a viable roadmap for achieving the long-term goals of European biodiversity informatics, one that ensures alignment with global activities and translates into efficient biodiversity policy.

    The Bari Manifesto: An interoperability framework for essential biodiversity variables

    Essential Biodiversity Variables (EBV) are fundamental variables that can be used for assessing biodiversity change over time, for determining adherence to biodiversity policy, for monitoring progress towards sustainable development goals, and for tracking biodiversity responses to disturbances and management interventions. Data from observations or models that provide measured or estimated EBV values, which we refer to as EBV data products, can help to capture the above processes and trends and can serve as a coherent framework for documenting trends in biodiversity. Using primary biodiversity records and other raw data as sources to produce EBV data products depends on cooperation and interoperability among multiple stakeholders, including those collecting and mobilising data for EBVs and those producing, publishing and preserving EBV data products. Here, we encapsulate ten principles for the current best practice in EBV-focused biodiversity informatics as 'The Bari Manifesto', serving as implementation guidelines for data and research infrastructure providers to support the emerging EBV operational framework based on trans-national and cross-infrastructure scientific workflows. The principles provide guidance on how to contribute towards the production of EBV data products that are globally oriented, while remaining appropriate to the producer's own mission, vision and goals. These ten principles cover: data management planning; data structure; metadata; services; data quality; workflows; provenance; ontologies/vocabularies; data preservation; and accessibility. For each principle, desired outcomes and goals have been formulated. Some specific actions related to fulfilling the Bari Manifesto principles are highlighted in the context of each of four groups of organizations contributing to enabling data interoperability - data standards bodies, research data infrastructures, the pertinent research communities, and funders. The Bari Manifesto provides a roadmap enabling support for routine generation of EBV data products, and increases the likelihood of success for a global EBV framework.
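    As a purely hypothetical illustration of how several of the ten principles might surface together in one machine-readable record (the manifesto prescribes principles and desired outcomes, not a schema), an EBV data product entry could document metadata, provenance, ontologies and accessibility side by side; every field name and value below is invented:

```python
# Invented example of an EBV data product record touching several of the
# ten Bari Manifesto principles. Not a schema defined by the manifesto.
ebv_data_product = {
    "title": "Example EBV data product: species population trend",
    "metadata": {"standard": "EML 2.2", "licence": "CC-BY 4.0"},
    "provenance": {
        "source_datasets": ["https://doi.org/10.xxxx/example"],  # placeholder DOI
        "workflow": "aggregation + trend model, each step recorded",
    },
    "ontologies": ["Darwin Core", "ENVO"],  # shared vocabularies
    "data_quality": {"checks": ["coordinate validation", "taxon matching"]},
    "accessibility": {"api": "REST", "preservation": "long-term repository"},
}
```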

    Epigenetic regulation of F2RL3 associates with myocardial infarction and platelet function

    DNA hypomethylation at the F2RL3 (F2R like thrombin or trypsin receptor 3) locus has been associated with both smoking and atherosclerotic cardiovascular disease; whether these smoking-related associations form a pathway to disease is unknown. F2RL3 encodes protease-activated receptor 4 (PAR4), a potent thrombin receptor expressed on platelets. Given the role of thrombin in platelet activation and the role of thrombus formation in myocardial infarction, alterations to this biological pathway could be important for ischemic cardiovascular disease.

    METHODS: We conducted multiple independent experiments to assess whether DNA hypomethylation at F2RL3 in response to smoking is associated with risk of myocardial infarction via changes to platelet reactivity. Using cohort data (N=3205), we explored the relationship between smoking, DNA hypomethylation at F2RL3, and myocardial infarction. We compared platelet reactivity in individuals with low versus high DNA methylation at F2RL3 (N=41). We used an in vitro model to explore the biological response of F2RL3 to cigarette smoke extract. Finally, a series of reporter constructs were used to investigate how differential methylation could affect F2RL3 gene expression.

    RESULTS: Observationally, DNA methylation at F2RL3 mediated an estimated 34% of the effect of smoking on increased risk of myocardial infarction. An association between methylation group (low/high) and platelet reactivity was observed in response to PAR4 stimulation. In cells, cigarette smoke extract exposure was associated with a 4.9% to 9.3% reduction in DNA methylation at F2RL3 and a corresponding 1.7-fold (95% CI, 1.2–2.4; P=0.04) increase in F2RL3 mRNA. Results from reporter assays suggest that the exon 2 region of F2RL3 may help control gene expression.

    CONCLUSIONS: Smoking-induced epigenetic DNA hypomethylation at F2RL3 appears to increase PAR4 expression, with potential downstream consequences for platelet reactivity. The combined evidence here identifies F2RL3 DNA methylation not only as a possible contributory pathway from smoking to cardiovascular disease risk, but also as a potential pathway from any exposure that influences F2RL3 regulation in a similar manner.
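    The 34% figure is a mediation estimate. As a toy numerical sketch (illustrative values only, not the paper's analysis), the proportion mediated can be read as the share of the total smoking effect that disappears after adjusting for methylation:

```python
# Toy illustration of the mediation logic; the effect sizes are invented
# so that the proportion mediated comes out near the paper's 34% estimate.
total = 0.50    # hypothetical total effect of smoking on MI risk (log-odds)
direct = 0.33   # hypothetical effect remaining after adjusting for F2RL3 methylation
proportion_mediated = (total - direct) / total
print(f"proportion mediated = {proportion_mediated:.0%}")  # -> 34%
```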

    Conceptual design blueprint for the DiSSCo digitization infrastructure - DELIVERABLE D8.1

    DiSSCo, the Distributed System of Scientific Collections, is a pan-European Research Infrastructure (RI) mobilising, unifying and delivering bio- and geo-diversity information connected to the specimens held in natural science collections to scientific communities and beyond. Bringing together 120 institutions across 21 countries and combining earlier investments in data interoperability practices with technological advancements in digitisation, cloud services and semantic linking, DiSSCo makes the data from natural science collections available as one virtual data cloud, connected with data emerging from new techniques and not already linked to specimens. These new data include DNA barcodes, whole genome sequences, proteomics and metabolomics data, chemical data, trait data, and imaging data (Computer-assisted Tomography (CT), Synchrotron, etc.), to name but a few; and will lead to a wide range of end-user services, beginning with finding, accessing, using and improving data. DiSSCo will deliver the diagnostic information required for novel approaches and new services that will transform the landscape of what is possible in ways that are hard to imagine today. With approximately 1.5 billion objects to be digitised, bringing natural science collections to the information age is expected to result in many tens of petabytes of new data over the next decades, used on average by 5,000–15,000 unique users every day. This requires new skills, clear policies, robust procedures and new technologies to create, work with and manage large digital datasets over their entire research data lifecycle, including their long-term storage, preservation and open access. Such processes and procedures must match, and be derived from, the latest thinking in open science and data management, realising the core principles of 'findable, accessible, interoperable and reusable' (FAIR). Synthesised from the results of the ICEDIG project ('Innovation and Consolidation for Large Scale Digitisation of Natural Heritage', EU Horizon 2020 grant agreement No. 777483), the DiSSCo Conceptual Design Blueprint covers the organisational arrangements, processes and practices, the architecture, tools and technologies, culture, skills and capacity building, and governance and business model proposals for constructing the digitisation infrastructure of DiSSCo. In this context, the digitisation infrastructure of DiSSCo must be interpreted as the infrastructure (machinery, processing, procedures, personnel, organisation) offering Europe-wide capabilities for mass digitisation and digitisation-on-demand, and for the subsequent management (i.e., curation, publication, processing) and use of the resulting data. The blueprint constitutes the essential background needed to continue work to raise the overall maturity of the DiSSCo Programme across multiple dimensions (organisational, technical, scientific, data, financial) to achieve readiness to begin construction. Today, collection digitisation efforts have reached most collection-holding institutions across Europe. Much of the leadership and many of the people involved in digitisation and working with digital collections wish to take steps forward and expand the efforts to benefit further from the already noticeable positive effects.
The collective results of examining technical, financial, policy and governance aspects show the way forward to operating a large distributed initiative, i.e., the Distributed System of Scientific Collections (DiSSCo), for natural science collections across Europe. Ample examples, opportunities and needs for innovation and consolidation in large-scale digitisation of natural heritage have been described. The blueprint makes one hundred and four (104) recommendations to be considered by other elements of the DiSSCo Programme of linked projects (i.e., SYNTHESYS+, COST MOBILISE, DiSSCo Prepare, and others to follow) and by the DiSSCo Programme leadership as the journey towards organisational, technical, scientific, data and financial readiness continues. Nevertheless, significant obstacles must be overcome as a matter of priority if DiSSCo is to move beyond its Design and Preparatory Phases during 2024. Specifically, these include:

    Organisational: Strengthen common purpose by adopting a common framework for policy harmonisation and capacity enhancement across broad areas, especially in respect of digitisation strategy and prioritisation, digitisation processes and techniques, data and digital media publication and open access, protection of and access to sensitive data, and administration of access and benefit sharing. Pursue the joint ventures and other relationships necessary to the successful delivery of the DiSSCo mission, especially ventures with GBIF and other international and regional digitisation and data aggregation organisations, in the context of infrastructure policy frameworks such as EOSC. Proceed with the explicit aim of avoiding divergences of approach in global natural science collections data management and research.

    Technical: Adopt and enhance the DiSSCo Digital Specimen Architecture and, as a matter of urgency, establish the persistent identifier scheme to be used by DiSSCo and (ideally) other comparable regional initiatives. Establish the (software) engineering development and (infrastructure) operations team and direction essential to the delivery of the services and functionalities expected from DiSSCo, such that earnest engineering can lead to an early start of DiSSCo operations.

    Scientific: Establish a common digital research agenda leveraging Digital (extended) Specimens as anchoring points for all specimen-associated and -derived information, demonstrating to research institutions and policy/decision-makers the new possibilities, opportunities and value of participating in the DiSSCo research infrastructure.

    Data: Adopt the FAIR Digital Object Framework and the International Image Interoperability Framework as the low-entropy means of achieving uniform access to rich data (image and non-image) that is findable, accessible, interoperable and reusable (FAIR). Develop and promote best-practice approaches towards achieving the best digitisation results in terms of quality (best, according to agreed minimum information and other specifications), time (highest throughput, fast) and cost (lowest, minimal per specimen).

    Financial: Broaden the attractiveness (i.e., improve the bankability) of DiSSCo as an infrastructure to invest in. Plan for ways to bridge the funding gap and avoid disruptions in the critical funding path that risk interrupting core operations, especially if a gap opens between the end of preparations and the beginning of implementation due to unsolved political difficulties.
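    To make the Data recommendation above more concrete, here is a minimal sketch of what a Digital Specimen object with a persistent identifier might look like under the FAIR Digital Object idea; all field names and identifiers are invented for illustration, and DiSSCo's actual object model may differ:

```python
# Hypothetical Digital Specimen object: a persistent identifier plus typed
# attributes linking specimen-derived data. Invented for illustration only.
from dataclasses import dataclass, field

@dataclass
class DigitalSpecimen:
    pid: str                    # persistent identifier, e.g. a Handle or DOI
    physical_specimen_id: str   # catalogue number in the holding institution
    institution: str
    images: list = field(default_factory=list)        # e.g. IIIF image URLs
    derived_data: dict = field(default_factory=dict)  # DNA barcodes, CT scans, ...

specimen = DigitalSpecimen(
    pid="hdl:20.5000/EXAMPLE-1",       # placeholder identifier
    physical_specimen_id="EX-0001",    # hypothetical catalogue number
    institution="Example Herbarium",
)
specimen.derived_data["dna_barcode"] = "https://example.org/barcode/123"
```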
    Strategically, it is vital to balance the multiple factors addressed by the blueprint against one another to achieve the desired goals of the DiSSCo programme. Decisions cannot be taken on one aspect alone without considering the others, and here the various governance structures of DiSSCo (General Assembly, advisory boards, and stakeholder forums) play a critical role over the coming years.