A Deep Learning Approach to Geographical Candidate Selection through Toponym Matching
Recognizing toponyms and resolving them to their real-world referents is required to provide advanced semantic access to textual data. This process is often hindered by the high degree of variation in toponyms. Candidate selection is the task of identifying the potential entities that can be referred to by a previously recognized toponym. While it has traditionally received little attention, candidate selection has a significant impact on downstream tasks (i.e. entity resolution), especially in noisy or non-standard text. In this paper, we introduce a deep learning method for candidate selection through toponym matching, using state-of-the-art neural network architectures. We perform an intrinsic toponym matching evaluation based on several datasets, which cover various challenging scenarios (cross-lingual and regional variations, as well as OCR errors), and assess the method's performance in the context of geographical candidate selection in English and Spanish.
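Candidate selection is often bootstrapped with a simple string-similarity baseline before a learned matcher is applied. The sketch below is a minimal illustration of that baseline idea only, not the paper's neural method; the gazetteer entries and threshold are invented for the example.

```python
from difflib import SequenceMatcher

# Toy gazetteer: the place names here are illustrative, not from the paper.
GAZETTEER = ["London", "Londonderry", "New London", "Paris", "Lisbon"]

def select_candidates(toponym, gazetteer, threshold=0.7):
    """Return gazetteer entries whose string similarity to the recognized
    toponym exceeds a threshold (a simple baseline; the paper replaces
    this with a trained neural matcher)."""
    scored = []
    for entry in gazetteer:
        score = SequenceMatcher(None, toponym.lower(), entry.lower()).ratio()
        if score >= threshold:
            scored.append((entry, score))
    # Highest-scoring candidates first, for downstream entity resolution.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# An OCR-garbled spelling still surfaces the right candidate.
print(select_candidates("Lond0n", GAZETTEER))
```

Even this crude measure tolerates the OCR-style noise mentioned in the abstract, which is why it serves as a useful point of comparison for learned matchers.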
Library Carpentry: software skills training for library professionals
Librarians play a crucial role in cultivating world-class research, and in most disciplinary areas today world-class research relies on the use of software. This paper describes Library Carpentry, an introductory software skills training programme with a focus on the needs and requirements of library and information professionals. Using Library Carpentry as a case study of the development and delivery of software-skills-focused professional development, this paper describes the institutional and intellectual contexts in which Library Carpentry was conceived, the syllabus used for the initial exploratory programme, the administrative apparatus through which the programme was delivered, and the analysis of data collection exercises conducted during the programme. As many university librarians already have substantial expertise working with data, it argues that adding software skills (that is, coding and data manipulation that goes beyond the use of familiar office suites) to their armoury is an effective and important use of professional development resources.
Datasheets for Digital Cultural Heritage Datasets
Sparked by issues of quality and lack of proper documentation for datasets, the machine learning community has begun developing standardised processes for establishing datasheets for machine learning datasets, with the intent to provide context and information on provenance, purposes, composition, the collection process, recommended uses and societal biases reflected in training datasets. This approach fits well with practices and procedures established in GLAM institutions, such as establishing collections' descriptions. However, digital cultural heritage datasets are marked by specific characteristics: they are often the product of multiple layers of selection; they may have been created for purposes other than establishing a statistical sample according to a specific research question; and they change over time and are heterogeneous. Punctuated by a series of recommendations to create datasheets for digital cultural heritage, the paper addresses the scope and characteristics of digital cultural heritage datasets; possible metrics and measures; and lessons from concepts similar to datasheets and/or established workflows in the cultural heritage sector. This paper includes a proposal for a datasheet template that has been adapted for use in cultural heritage institutions, and which proposes to incorporate information on the motivation and selection criteria, digitisation pipeline, data provenance, the use of linked open data, and version information.
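The template fields named in the abstract can be pictured as a lightweight structured record. The sketch below is a hypothetical rendering of those fields; the field names follow the abstract, but the layout and example values are assumptions, not the paper's actual template.

```python
from dataclasses import dataclass, field

# Hypothetical datasheet record for a digital cultural heritage dataset.
# Field names mirror the abstract; values are illustrative placeholders.
@dataclass
class HeritageDatasheet:
    motivation: str             # why the dataset was created
    selection_criteria: str     # how items were chosen (layers of selection)
    digitisation_pipeline: str  # scanning/OCR steps that produced the data
    data_provenance: str        # institutional origin of the source material
    linked_open_data: list = field(default_factory=list)  # external vocabularies linked
    version: str = "1.0"        # datasets change over time, so record a version

sheet = HeritageDatasheet(
    motivation="Illustrative placeholder only",
    selection_criteria="Items digitised during an earlier microfilming project",
    digitisation_pipeline="Microfilm scan followed by OCR",
    data_provenance="A national library's newspaper collection",
)
print(sheet.version)
```

Keeping such a record versioned alongside the dataset addresses the point that heritage datasets, unlike fixed statistical samples, continue to change over time.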
Protein status elicits compensatory changes in food intake and food preferences
Background: Protein is an indispensable component within the human diet. It is unclear, however, whether behavioral strategies exist to avoid shortages.
Long-term and large-scale multispecies dataset tracking population changes of common European breeding birds
Around fifteen thousand fieldworkers annually count breeding birds using standardized protocols in 28 European countries. The observations are collected by using country-specific and standardized protocols, validated, summarized and finally used for the production of continent-wide annual and long-term indices of population size changes of 170 species. Here, we present the database and provide a detailed summary of the methodology used for fieldwork and calculation of the relative population size change estimates. We also provide a brief overview of how the data are used in research, conservation and policy. We believe this unique database, based on decades of bird monitoring alongside the comprehensive summary of its methodology, will facilitate and encourage further use of the Pan-European Common Bird Monitoring Scheme results.
Transglutaminase 6: a protein associated with central nervous system development and motor function.
Transglutaminases (TG) form a family of enzymes that catalyse various post-translational modifications of glutamine residues in proteins and peptides, including intra- and intermolecular isopeptide bond formation, esterification and deamidation. We have characterized a novel member of the mammalian TG family, TG6, which is expressed in a human carcinoma cell line with neuronal characteristics and in mouse brain. Besides full-length protein, alternative splicing results in a short variant lacking the second β-barrel domain in man and a variant with a truncated β-sandwich domain in mouse. Biochemical data show that TG6 is allosterically regulated by Ca(2+) and guanine nucleotides. Molecular modelling indicates that TG6 could have Ca(2+) and GDP-binding sites related to those of TG3 and TG2, respectively. Localization of mRNA and protein in the mouse identified abundant expression of TG6 in the central nervous system. Analysis of its temporal and spatial pattern of induction in mouse development indicates an association with neurogenesis. Neuronal expression of TG6 was confirmed by double-labelling of mouse forebrain cells with cell type-specific markers. Induction of differentiation in mouse Neuro 2a cells with NGF or dibutyryl cAMP is associated with an upregulation of TG6 expression. Familial ataxia has recently been linked to mutations in the TGM6 gene. Autoantibodies to TG6 were identified in immune-mediated ataxia in patients with gluten sensitivity. These findings suggest a critical role for TG6 in cortical and cerebellar neurons.
Molecularly defined circuitry reveals input-output segregation in deep layers of the medial entorhinal cortex
Deep layers of the medial entorhinal cortex (MEC) are considered to relay signals from the hippocampus to other brain structures, but pathways for routing of signals to and from the deep layers are not well established. Delineating these pathways is important for a circuit-level understanding of spatial cognition and memory. We find that neurons in layers 5a and 5b have distinct molecular identities, defined by the transcription factors Etv1 and Ctip2, and divergent targets, with extensive intratelencephalic projections originating in layer 5a, but not 5b. This segregation of outputs is mirrored by the organization of glutamatergic input from stellate cells in layer 2 and from the hippocampus, with both preferentially targeting layer 5b over 5a. Our results suggest a molecular and anatomical organization of input-output computations in deep layers of the MEC, reveal precise translaminar microcircuitry, and identify molecularly defined pathways for spatial signals to influence computation in deep layers.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License
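The "few demonstrations" framing in the abstract refers to in-context learning: the model is shown a handful of solved examples in its prompt before the actual query. A minimal sketch of building such a prompt follows; the task and demonstration pairs are invented for illustration, and no model is called here.

```python
def build_few_shot_prompt(demonstrations, query):
    """Format demonstration pairs and a final query into a single
    prompt string, as used for in-context learning with LLMs."""
    lines = []
    for text, label in demonstrations:
        lines.append(f"Input: {text}\nOutput: {label}")
    # The model is expected to continue the text after the final "Output:".
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# Invented sentiment-style demonstrations, purely for illustration.
demos = [("I loved this film", "positive"), ("Terribly boring", "negative")]
prompt = build_few_shot_prompt(demos, "A delightful surprise")
print(prompt)
```

The same prompt-construction pattern works with any autoregressive LLM: the completed examples condition the model to produce a label for the final, unanswered input.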