Search CORE

2,521 research outputs found

Combining Visual and Textual Features for Semantic Segmentation of Historical Newspapers

Author: Barman Raphaël
Clematide Simon
Ehrmann Maud
Kaplan Frédéric
Oliveira Sofia Ares
Publication venue: 'Centre pour la Communication Scientifique Directe (CCSD)'
Publication date: 14/12/2020
Field of study

The massive amounts of digitized historical documents acquired over the last decades naturally lend themselves to automatic processing and exploration. Research work seeking to automatically process facsimiles and extract information thereby are multiplying with, as a first essential step, document layout analysis. If the identification and categorization of segments of interest in document images have seen significant progress over the last years thanks to deep learning techniques, many challenges remain with, among others, the use of finer-grained segmentation typologies and the consideration of complex, heterogeneous documents such as historical newspapers. Besides, most approaches consider visual features only, ignoring textual signal. In this context, we introduce a multimodal approach for the semantic segmentation of historical newspapers that combines visual and textual features. Based on a series of experiments on diachronic Swiss and Luxembourgish newspapers, we investigate, among others, the predictive power of visual and textual features and their capacity to generalize across time and sources. Results show consistent improvement of multimodal models in comparison to a strong visual baseline, as well as better robustness to high material variance

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Analysing Lexical Semantic Change with Contextualised Word Representations

Author: Del Tredici Marco
Fernández Raquel
Giulianelli Mario
Publication venue
Publication date: 01/01/2020
Field of study

This paper presents the first unsupervised approach to lexical semantic change that makes use of contextualised word representations. We propose a novel method that exploits the BERT neural language model to obtain representations of word usages, clusters these representations into usage types, and measures change along time with three proposed metrics. We create a new evaluation dataset and show that the model representations and the detected semantic shifts are positively correlated with human judgements. Our extensive qualitative analysis demonstrates that our method captures a variety of synchronic and diachronic linguistic phenomena. We expect our work to inspire further research in this direction.Comment: To appear in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL-2020

arXiv.org e-Print Archive

Crossref

International Migration, Integration and Social Cohesion online publications

UvA-DARE

Semantic Variation in Online Communities of Practice

Author: Del Tredici Marco
Fernández Raquel
Publication venue
Publication date: 01/01/2017
Field of study

We introduce a framework for quantifying semantic variation of common words in Communities of Practice and in sets of topic-related communities. We show that while some meaning shifts are shared across related communities, others are community-specific, and therefore independent from the discussed topic. We propose such findings as evidence in favour of sociolinguistic theories of socially-driven semantic variation. Results are evaluated using an independent language modelling task. Furthermore, we investigate extralinguistic features and show that factors such as prominence and dissemination of words are related to semantic variation.Comment: 13 pages, Proceedings of the 12th International Conference on Computational Semantics (IWCS 2017

arXiv.org e-Print Archive

International Migration, Integration and Social Cohesion online publications

UvA-DARE

Deep Learning for Period Classification of Historical Texts

Author: Liebeskind Chaya
Liebeskind Shmuel
Publication venue: HAL CCSD
Publication date: 22/10/2019
Field of study

In this study, we address the interesting task of classifying historical texts by their assumed period of writing. This task is useful in digital humanity studies where many texts have unidentified publication dates. For years, the typical approach for temporal text classification was supervised using machine-learning algorithms. These algorithms require careful feature engineering and considerable domain expertise to design a feature extractor to transform the raw text into a feature vector from which the classifier could learn to classify any unseen valid input. Recently, deep learning has produced extremely promising results for various tasks in natural language processing (NLP). The primary advantage of deep learning is that human engineers did not design the feature layers, but the features were extrapolated from data with a general-purpose learning procedure. We investigated deep learning models for period classification of historical texts. We compared three common models: paragraph vectors, convolutional neural networks (CNN), and recurrent neural networks (RNN). We demonstrate that the CNN and RNN models outperformed the paragraph vector model and supervised machine-learning algorithms. In addition, we constructed word embeddings for each time period and analyzed semantic changes of word meanings over time