49 research outputs found

    Biodiversity informatics - entomological data processing, analysis and visualization

    Master's thesis, Bioinformatics and Computational Biology, Universidade de Lisboa, Faculdade de Ciências, 2019. This work focuses on the digitization, processing and analysis of data from natural history collections using biodiversity informatics tools. The data come from the insect collections of the Museu Nacional de História Natural e da Ciência (MNHNC) and the Instituto de Investigação Científica Tropical (IICT), Universidade de Lisboa. In 2014, a dataset with 30 535 records from the MNHNC insect collection was published on the Global Biodiversity Information Facility (GBIF). Since then, new records have been digitized and new data added, such as new taxonomic identifications. The MNHNC insect collection catalogue currently includes 39 139 validated records, about 98% of the total, covering 79 885 specimens. So that this updated dataset could be published, data-cleaning tools were applied to detect and correct errors, and records were georeferenced so that they can be located on a map from their coordinates. For data cleaning and homogenization, all fields were cleaned and formatted according to the Darwin Core metadata standard. This process included checking taxonomic identifications to detect synonyms and spelling errors, changing the format of dates and times, and applying a controlled vocabulary to the remaining fields. In parallel, tools were tested to speed up digitization at two different stages: transcription and georeferencing of data from specimen labels. 
Five georeferencing tools were tested, all of which provide Application Programming Interfaces (APIs) that can be used to georeference records automatically from locality names: Google Maps, MapQuest, GeoNames, OpenStreetMap and GEOLocate. Of these, Google Maps produced the best results, with 57.6% of results within 1000 m of the correct location. A citizen science project was also developed and tested on the Zooniverse platform, with two strands: one for transcribing data from photographs of labelled specimens, aimed at the general public, and one for taxonomic identification of specimens from photographs, aimed at specialists in the taxonomy of the respective group. The first strand resulted in the successful transcription of the data for all 130 specimens made available. The second resulted in the identification of the 61 specimens that had no previous identification and the verification of the identifications of the remaining 69 specimens. It is therefore concluded that citizen science projects would be a good way to accelerate the digitization project, provided that adequate error verification and correction methods are implemented. To cover all steps of the digitization process more completely, the horsefly (Diptera: Tabanidae) collections of the IICT and the MNHNC were selected. This group is of special interest because it includes important vectors of disease transmission to humans and livestock and is widely distributed worldwide. The IICT horsefly collection is particularly important because it was mostly compiled and studied by J. A. Travassos Santos Dias, a specialist in this group who published extensive works based on the collection's specimens. Both collections include type specimens of species described by Travassos Santos Dias and other authors. 
Despite their importance, the information associated with the specimens in the IICT/MNHNC collections had not yet been digitized. In this work, all specimens were photographed and their data transcribed, resulting in a dataset of 1 666 specimens. Records were georeferenced whenever possible. The specimens were collected between 1899 and 2018, mostly in Portugal, but also in São Tomé and Príncipe, Guinea-Bissau, Mozambique, Spain and other countries. To better visualize the geographic distribution of the specimens, distribution maps were created in R for the best-represented species in the collections. Publishing this dataset on GBIF will be an asset for studying the distribution of this group, given its broad geographic and temporal coverage and the fact that most specimens (85.1%) are identified to species or subspecies level. This work is based on data records associated with the insect collections of the Museu Nacional de História Natural e da Ciência (MNHNC) and the Instituto de Investigação Científica Tropical (IICT), Universidade de Lisboa. In 2014 a dataset with 30 535 records was published on the Global Biodiversity Information Facility (GBIF). Since then the data have been improved and new records acquired. Currently, the collection catalogue includes 39 139 validated records, corresponding to 79 885 specimens, with many more to be added from collections donated by private collectors or from unprocessed samples. The data for these specimens were cleaned, formatted, geocoded and published on GBIF. During this work, different APIs were tested to allow automated geocoding of sampling locations. Google Maps achieved the best results, with 57.6% of results within 1000 m of the correct location. A citizen science project was developed and tested to accelerate the digitization process, including two workflows with different objectives. 
One was focused on the transcription of specimen label data, which resulted in the data for 130 specimens being successfully transcribed. The other was focused on the taxonomic identification of specimens from photographs, directed to specialists in the respective group’s taxonomy, which resulted in 61 new identifications and the verification of identifications for the remaining 69 specimens. The MNHNC and IICT collections contain collections of horseflies (Order Diptera, Family Tabanidae) which are of particular importance due to their size and the completeness of the associated data. Horseflies are widely distributed worldwide and are important vectors in the transmission of diseases to humans and cattle. The IICT collection includes a sub-collection which was compiled and studied by J. A. Travassos Santos Dias, a prominent specialist in this group. The specimens in these collections were photographed, all the associated data were transcribed, taxonomic identifications were verified and records were geocoded, resulting in a dataset of 1666 specimens. These specimens were collected between 1899 and 2018, mainly in Portugal, but also in São Tomé and Príncipe, Guinea-Bissau, Mozambique, Spain and other countries. To better understand the distribution of this group, distribution maps were made for the best-represented species in the collections.
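
The geocoding comparison in the abstract above reports the share of results within 1000 m of the correct location. The thesis abstract does not say how distances were computed, so the following is only a minimal sketch of one way such a figure could be derived, assuming great-circle (haversine) distances on a spherical Earth. The function names and the sample coordinates are hypothetical and are not taken from the thesis.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def share_within(pairs, threshold_m=1000.0):
    """Fraction of (geocoded, reference) coordinate pairs at most threshold_m apart."""
    if not pairs:
        raise ValueError("no coordinate pairs supplied")
    hits = sum(1 for geocoded, reference in pairs
               if haversine_m(*geocoded, *reference) <= threshold_m)
    return hits / len(pairs)


# Hypothetical (geocoded, reference) pairs, for illustration only.
sample_pairs = [
    ((38.7223, -9.1393), (38.7223, -9.1393)),  # identical points
    ((38.7223, -9.1293), (38.7223, -9.1393)),  # under 1 km apart (0.01 degrees of longitude)
    ((38.8223, -9.1393), (38.7223, -9.1393)),  # about 11 km apart (0.1 degrees of latitude)
]
print(f"{share_within(sample_pairs):.1%} within 1000 m")
```

Running the same function over the output of each geocoding service against manually verified coordinates would give one comparable percentage per service.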

    Crowding the library : how and why libraries are using crowdsourcing to engage the public

    Over the past 10 years, there has been a noticeable increase in crowdsourcing projects in cultural heritage institutions, where digital technologies are being used to open up collections and encourage the public to engage with them in a very direct way. Libraries, archives and museums have long had a history and mandate of outreach and public engagement, but crowdsourcing marks a move towards a more participatory and inclusive model of engagement. If a library wants to start a crowdsourcing project, what does it need to know? This article is written from a Canadian university library perspective with the goal of helping the reader engage with the current crowdsourcing landscape. The article's contribution includes a literature review and a survey of popular projects and platforms, followed by a case study of a crowdsourcing pilot completed at the McGill Library. The article pulls these two threads of theory and practice together with a discussion of some of the best practices learned through the literature and real-life experience, giving the reader practical tools to help a library evaluate whether crowdsourcing is right for it and how to get a desired project off the ground.

    Crowdsourcing Natural History Archives: Tools for Extracting Transcriptions and Data

    This paper surveys current, successful, and innovative platforms that use crowdsourcing to extract full-text transcriptions and structured data. Archival manuscript items are the key case studies reviewed to develop a set of tools for the Biodiversity Heritage Library, for use in enhancing access to, and extracting information from, items in scientific and natural history collections.

    The implications of handwritten text recognition for accessing the past at scale

    Before Handwritten Text Recognition (HTR), manuscripts were costly to convert to machine-processable text for research and analysis. With HTR now achieving high levels of accuracy, we ask what near-future behaviour, interaction, experience, values and infrastructures may occur when HTR is applied to historical documents? When combined with mass-digitisation of GLAM (galleries, libraries, archives and museums) content, how will HTR’s application, use, and affordances generate new knowledge of the past, and affect our information environment? This paper’s findings emerge from a literature review surveying current understanding of the impact of HTR, to explore emerging issues over the coming decade. We aim to deconstruct the simplistic narrative that the speed, efficiency, and scale of HTR will “transform scholarship in the archives” (Muehlberger et al., 2019: 955), providing a more nuanced consideration of its application, possibilities, and opportunities. In doing so, our recommendations will assist researchers, data and platform providers, memory institutions and data scientists to understand how the results of HTR interact with the wider information environment. We find that HTR supports the creation of accurate transcriptions from historical manuscripts, and the enhancement of existing datasets. HTR facilitates access to a greater range of materials, including endangered languages, enabling a new focus on personal and private materials (diaries, letters), increasing access to historical voices not usually incorporated into the historical record, and increasing the scale and heterogeneity of available material. The production of general training models leads to a virtuous digitisation circle where similar datasets are easier – and therefore more likely – to be produced. 
This leads to the requirement for processes that will facilitate the storage and discoverability of HTR-generated content, and for memory institutions to rethink search and access to collections. Challenges include HTR’s dependency on digitisation, its relation to archival history and omission, and the entrenchment of bias in data sources. The paper details several near-future issues, including: the potential of HTR as the basis for automated metadata extraction; the integration of advanced Artificial Intelligence (AI) processes (including Large Language Models (LLMs) and generative AI) into HTR systems; legal and moral issues such as copyright, privacy and data ethics which are challenged by the use of HTR; how individual contributions to shared HTR models can be credited; and the environmental costs of HTR infrastructure. We identify the need for greater collaboration between communities including historians, information scientists, and data scientists to navigate these issues, and for further skills support to allow non-specialist audiences to make the most of HTR. Data literacy will become increasingly important, as will building frameworks to establish data sharing, data consent, and reuse principles, particularly in building open repositories to share models and datasets. Finally, we suggest that an understanding of how HTR is changing the information environment is a crucial aspect of future technological development.

    Draft: Crowdsourcing in cultural heritage: a practical guide to designing and running successful projects

    Have you ever wanted to recruit hundreds of members of the public to assist with the task of making cultural heritage collections findable online? Or to connect with passionate volunteers who'll share their discoveries with you? Crowdsourcing in cultural heritage is a broad term for projects that ask the public to help with tasks that contribute to a shared, significant goal or research interest related to cultural heritage collections or knowledge. As participants receive no financial reward, the activities and/or goals should be inherently rewarding for those volunteering their time. This definition is partly descriptive and partly proscriptive, and this chapter is largely concerned with explaining/describing how to meet the standards it implies. [A draft (not quite pre-print) version of my chapter for the Routledge International Handbook of Research Methods in Digital Humanities, edited by Kristen Schuster and Stuart Dunn, 2021. ISBN 9781138363021]