10 research outputs found
Towards Robust Named Entity Recognition for Historic German
Recent advances in language modeling using deep neural networks have shown
that these models learn representations, that vary with the network depth from
morphology to semantic relationships like co-reference. We apply pre-trained
language models to low-resource named entity recognition for Historic German.
We show on a series of experiments that character-based pre-trained language
models do not run into trouble when faced with low-resource datasets. Our
pre-trained character-based language models improve upon classical CRF-based
methods and previous work on Bi-LSTMs by boosting F1 score performance by up to
6%. Our pre-trained language and NER models are publicly available under
https://github.com/stefan-it/historic-ner .Comment: 8 pages, 5 figures, accepted at the 4th Workshop on Representation
Learning for NLP (RepL4NLP), held in conjunction with ACL 201
Data Centric Domain Adaptation for Historical Text with OCR Errors
We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings; and OCR errors by injecting synthetic OCR errors into the source domain and address data centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Our cross-domain as well as our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks
based on a few demonstrations or natural language instructions. While these
capabilities have led to widespread adoption, most LLMs are developed by
resource-rich organizations and are frequently kept from the public. As a step
towards democratizing this powerful technology, we present BLOOM, a
176B-parameter open-access language model designed and built thanks to a
collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer
language model that was trained on the ROOTS corpus, a dataset comprising
hundreds of sources in 46 natural and 13 programming languages (59 in total).
We find that BLOOM achieves competitive performance on a wide variety of
benchmarks, with stronger results after undergoing multitask prompted
finetuning. To facilitate future research and applications using LLMs, we
publicly release our models and code under the Responsible AI License
hmBERT: Historical Multilingual Language Models for Named Entity Recognition
Compared to standard Named Entity Recognition (NER), identifying persons,
locations, and organizations in historical texts constitutes a big challenge.
To obtain machine-readable corpora, the historical text is usually scanned and
Optical Character Recognition (OCR) needs to be performed. As a result, the
historical corpora contain errors. Also, entities like location or organization
can change over time, which poses another challenge. Overall, historical texts
come with several peculiarities that differ greatly from modern texts and large
labeled corpora for training a neural tagger are hardly available for this
domain. In this work, we tackle NER for historical German, English, French,
Swedish, and Finnish by training large historical language models. We
circumvent the need for large amounts of labeled data by using unlabeled data
for pretraining a language model. We propose hmBERT, a historical multilingual
BERT-based language model, and release the model in several versions of
different sizes. Furthermore, we evaluate the capability of hmBERT by solving
downstream NER as part of this year's HIPE-2022 shared task and provide
detailed analysis and insights. For the Multilingual Classical Commentary
coarse-grained NER challenge, our tagger HISTeria outperforms the other teams'
models for two out of three languages.Comment: Camera-ready HIPE-2022 Working Note Paper for CLEF 2022 (Conference
and Labs of the Evaluation Forum (CLEF 2022)
Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0
International audienceIn this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in 3 languages as test-bed, we use prompts to extract possible named entities. Our results show that a naive approach for prompt-based zero-shot multilingual Named Entity Recognition is error-prone, but highlights the potential of such an approach for historical languages lacking labeled datasets. Moreover, we also find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts
Entities, Dates, and Languages: Zero-Shot on Historical Texts with T0
In this work, we explore whether the recently demonstrated zero-shot abilities of the T0 model extend to Named Entity Recognition for out-of-distribution languages and time periods. Using a historical newspaper corpus in 3 languages as test-bed, we use prompts to extract possible named entities. Our results show that a naive approach for prompt-based zeroshot multilingual Named Entity Recognition is error-prone, but highlights the potential of such an approach for historical languages lacking labeled datasets. Moreover, we also find that T0-like models can be probed to predict the publication date and language of a document, which could be very relevant for the study of historical texts
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
Large language models (LLMs) have been shown to be able to perform new tasks based on a few demonstrations or natural language instructions. While these capabilities have led to widespread adoption, most LLMs are developed by resource-rich organizations and are frequently kept from the public. As a step towards democratizing this powerful technology, we present BLOOM, a 176B-parameter open-access language model designed and built thanks to a collaboration of hundreds of researchers. BLOOM is a decoder-only Transformer language model that was trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages (59 in total). We find that BLOOM achieves competitive performance on a wide variety of benchmarks, with stronger results after undergoing multitask prompted finetuning. To facilitate future research and applications using LLMs, we publicly release our models and code under the Responsible AI License