69 research outputs found

    A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching

    Recognizing toponyms and resolving them to their real-world referents is required to provide advanced semantic access to textual data. This process is often hindered by the high degree of variation in toponyms. Candidate selection is the task of identifying the potential entities that can be referred to by a previously recognized toponym. While it has traditionally received little attention, candidate selection has a significant impact on downstream tasks (i.e. entity resolution), especially in noisy or non-standard text. In this paper, we introduce a deep learning method for candidate selection through toponym matching, using state-of-the-art neural network architectures. We perform an intrinsic toponym matching evaluation based on several datasets, which cover various challenging scenarios (cross-lingual and regional variations, as well as OCR errors) and assess its performance in the context of geographical candidate selection in English and Spanish.

    Datasheets for Digital Cultural Heritage Datasets

    Sparked by issues of quality and lack of proper documentation for datasets, the machine learning community has begun developing standardised processes for establishing datasheets for machine learning datasets, with the intent to provide context and information on provenance, purposes, composition, the collection process, recommended uses or societal biases reflected in training datasets. This approach fits well with practices and procedures established in GLAM institutions, such as establishing collections’ descriptions. However, digital cultural heritage datasets are marked by specific characteristics. They are often the product of multiple layers of selection; they may have been created for different purposes than establishing a statistical sample according to a specific research question; they change over time and are heterogeneous. Punctuated by a series of recommendations to create datasheets for digital cultural heritage, the paper addresses the scope and characteristics of digital cultural heritage datasets; possible metrics and measures; and lessons from concepts similar to datasheets and/or established workflows in the cultural heritage sector. This paper includes a proposal for a datasheet template that has been adapted for use in cultural heritage institutions, and which proposes to incorporate information on the motivation and selection criteria, digitisation pipeline, data provenance, the use of linked open data, and version information.

    Long-term and large-scale multispecies dataset tracking population changes of common European breeding birds

    Around fifteen thousand fieldworkers annually count breeding birds using standardized protocols in 28 European countries. The observations are collected by using country-specific and standardized protocols, validated, summarized and finally used for the production of continent-wide annual and long-term indices of population size changes of 170 species. Here, we present the database and provide a detailed summary of the methodology used for fieldwork and calculation of the relative population size change estimates. We also provide a brief overview of how the data are used in research, conservation and policy. We believe this unique database, based on decades of bird monitoring alongside the comprehensive summary of its methodology, will facilitate and encourage further use of the Pan-European Common Bird Monitoring Scheme results.

    Transglutaminase 6: a protein associated with central nervous system development and motor function.

    Transglutaminases (TG) form a family of enzymes that catalyse various post-translational modifications of glutamine residues in proteins and peptides including intra- and intermolecular isopeptide bond formation, esterification and deamidation. We have characterized a novel member of the mammalian TG family, TG6, which is expressed in a human carcinoma cell line with neuronal characteristics and in mouse brain. Besides full-length protein, alternative splicing results in a short variant lacking the second β-barrel domain in man and a variant with truncated β-sandwich domain in mouse. Biochemical data show that TG6 is allosterically regulated by Ca(2+) and guanine nucleotides. Molecular modelling indicates that TG6 could have Ca(2+) and GDP-binding sites related to those of TG3 and TG2, respectively. Localization of mRNA and protein in the mouse identified abundant expression of TG6 in the central nervous system. Analysis of its temporal and spatial pattern of induction in mouse development indicates an association with neurogenesis. Neuronal expression of TG6 was confirmed by double-labelling of mouse forebrain cells with cell type-specific markers. Induction of differentiation in mouse Neuro 2a cells with NGF or dibutyryl cAMP is associated with an upregulation of TG6 expression. Familial ataxia has recently been linked to mutations in the TGM6 gene. Autoantibodies to TG6 were identified in immune-mediated ataxia in patients with gluten sensitivity. These findings suggest a critical role for TG6 in cortical and cerebellar neurons.

    Molecularly defined circuitry reveals input-output segregation in deep layers of the medial entorhinal cortex

    Deep layers of the medial entorhinal cortex are considered to relay signals from the hippocampus to other brain structures, but pathways for routing of signals to and from the deep layers are not well established. Delineating these pathways is important for a circuit level understanding of spatial cognition and memory. We find that neurons in layers 5a and 5b have distinct molecular identities, defined by the transcription factors Etv1 and Ctip2, and divergent targets, with extensive intratelencephalic projections originating in layer 5a, but not 5b. This segregation of outputs is mirrored by the organization of glutamatergic input from stellate cells in layer 2 and from the hippocampus, with both preferentially targeting layer 5b over 5a. Our results suggest a molecular and anatomical organization of input-output computations in deep layers of the MEC, reveal precise translaminar microcircuitry, and identify molecularly defined pathways for spatial signals to influence computation in deep layers.

    BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

    Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License.