Search CORE

6 research outputs found

Can BERT Dig It? -- Named Entity Recognition for Information Retrieval in the Archaeology Domain

Author: Alex Brandsen
Karsten Lambers
Suzan Verberne
Milco Wansleeben
Publication date: 14/06/2021

The amount of archaeological literature is growing rapidly. Until recently, these data were only accessible through metadata search. We implemented a text retrieval engine for a large archaeological text collection (~658 million words). In archaeological IR, domain-specific entities such as locations, time periods, and artefacts play a central role. This motivated the development of a named entity recognition (NER) model to annotate the full collection with archaeological named entities. In this paper, we present ArcheoBERTje, a BERT model pre-trained on Dutch archaeological texts. We compare the model's quality and output on a Named Entity Recognition task to a generic multilingual model and a generic Dutch model. We also investigate ensemble methods for combining multiple BERT models, and combining the best BERT model with a domain thesaurus using Conditional Random Fields (CRF). We find that ArcheoBERTje outperforms both the multilingual and Dutch model significantly with a smaller standard deviation between runs, reaching an average F1 score of 0.735. The model also outperforms ensemble methods combining the three models. Combining ArcheoBERTje predictions and explicit domain knowledge from the thesaurus did not increase the F1 score. We quantitatively and qualitatively analyse the differences between the vocabulary and output of the BERT models on the full collection and provide some valuable insights into the effect of fine-tuning for specific domains. Our results indicate that for a highly specific text domain such as archaeology, further pre-training on domain-specific data increases the model's quality on NER by a much larger margin than shown for other domains in the literature, and that domain-specific pre-training makes the addition of domain knowledge from a thesaurus unnecessary.

arXiv.org e-Print Archive

Leiden University Scholarly Publications

User Requirement Solicitation for an Information Retrieval System Applied to Dutch Grey Literature in the Archaeology Domain

Author: Alex Brandsen
Karsten Lambers
Milco Wansleeben
Suzan Verberne
Publication venue: Ubiquity Press, Ltd.
Publication date: 01/03/2019

In this paper, we present the results of user requirement solicitation for a search system of grey literature in archaeology, specifically Dutch excavation reports. This search system uses Named Entity Recognition and Information Retrieval techniques to create an effective and effortless search experience. Specifically, we used Conditional Random Fields to identify entities, with an average accuracy of 56%. This is a baseline result, and we identified many possibilities for improvement. These entities were indexed in ElasticSearch and a user interface was developed on top of the index. This proof of concept was used in user requirement solicitation and evaluation with a group of end users. Feedback from this group indicated that there is a dire need for such a system, and that the first results are promising.

Directory of Open Access Journals

Leiden University Scholarly Publications

Excavation of the late Mesolithic site of Merselo-Haag (Venray)

Author: Leo Verhart
Milco Wansleeben
Publication date: 20/01/2007

Excavation documentation and artefact descriptions for the Merselo-Haag site. The data files of the excavation contain spatial distribution maps and artefact descriptions, mainly of flint artefacts.

Electronic Archiving System