611 research outputs found

Topic Modelling Discourse Dynamics in Historical Newspapers

Author: Hengchen Simon
Marjanen Jani
Pivovarova Lidia
Tolonen Mikko
Zosa Elaine
Publication venue: CEUR-WS.org
Publication date: 20/11/2020

This paper addresses methodological issues in diachronic data analysis for historical research. We apply two families of topic models (LDA and DTM) on a relatively large set of historical newspapers, with the aim of capturing and understanding discourse dynamics. Our case study focuses on newspapers and periodicals published in Finland between 1854 and 1917, but our method can easily be transposed to any diachronic data. Our main contributions are a) a combined sampling, training and inference procedure for applying topic models to huge and imbalanced diachronic text collections; b) a discussion on the differences between two topic models for this type of data; c) quantifying topic prominence for a period and thus a generalization of document-wise topic assignment to a discourse level; and d) a discussion of the role of humanistic interpretation with regard to analysing discourse dynamics through topic models. Peer reviewed

arXiv.org e-Print Archive

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Helsingin yliopiston digitaalinen arkisto

Synthetic Document Generator for Annotation-free Layout Recognition

Author: Raman Natraj
Shah Sameena
Veloso Manuela
Publication venue: Elsevier BV
Publication date: 24/07/2022

Analyzing the layout of a document to identify headers, sections, tables, figures etc. is critical to understanding its content. Deep learning based approaches for detecting the layout structure of document images have been promising. However, these methods require a large number of annotated examples during training, which are both expensive and time consuming to obtain. We describe here a synthetic document generator that automatically produces realistic documents with labels for spatial positions, extents and categories of the layout elements. The proposed generative process treats every physical component of a document as a random variable and models their intrinsic dependencies using a Bayesian Network graph. Our hierarchical formulation using stochastic templates allows parameter sharing between documents for retaining broad themes, and yet the distributional characteristics produce visually unique samples, thereby capturing complex and diverse layouts. We empirically illustrate that a deep layout detection model trained purely on the synthetic documents can match the performance of a model that uses real documents.

arXiv.org e-Print Archive

Bridging Cross-Modal Alignment for OCR-Free Content Retrieval in Scanned Historical Documents

Author: Molina Rodríguez Adrià
Universitat Autònoma de Barcelona. Departament de Ciències de la Computació
Universitat Autònoma de Barcelona. Escola d'Enginyeria
Publication date: 01/01/2023

In this work, we address the limitations of current approaches to document retrieval by incorporating vision-based topic extraction. While previous methods have primarily focused on visual elements or relied on optical character recognition (OCR) for text extraction, we propose a paradigm shift by directly incorporating vision into the topic space. We demonstrate that recognizing all visual elements within a document is unnecessary for identifying its underlying topic. Visual cues such as icons, writing style, and font can serve as sufficient indicators. By leveraging ranking loss functions and convolutional neural networks (CNNs), we learn complex topological representations that mimic the behavior of text representations. Our approach aims to eliminate the need for OCR and its associated challenges, including efficiency, performance, data-hunger, and expensive annotation. Furthermore, we highlight the significance of incorporating vision in historical documentation, where visually antiquated documents contain valuable cues. Our research contributes to the understanding of topic extraction from a vision perspective and offers insights into annotation-cheap document retrieval system.

Diposit Digital de Documents de la UAB

Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan languages

Publication venue: Croatian Language Technologies Society, Faculty of Humanities and Social Science
Publication date: 01/01/2010

Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan Languages publishes 17 papers that were presented at the conference organised in Dubrovnik, Croatia, 4-6 October 2010.

Repozitorij Filozofskog fakulteta u Zagrebu at University of Zagreb

Modeling Visual Rhetoric and Semantics in Multimedia

Author: Thomas Christopher
Publication date: 16/09/2020

Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high performance on a plethora of vision-related tasks has enabled computer vision researchers to begin to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portray. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher level abstract semantic concepts which are themselves latent within visual media. By latent, we mean the concept is not readily visually accessible within a single image (e.g. right vs left political bias), in contrast to explicit visual semantic concepts such as objects. Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in the case of image captions), we target the case where each modality conveys complementary information to tell a larger story.
We particularly focus on the problem of learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. A variety of techniques are presented which improve modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems.

D-Scholarship@Pitt

Word Embedding Driven Concept Detection in Philosophical Corpora

Author: Hayton-Ruffner Dylan
Publication venue: Bowdoin Digital Commons
Publication date: 01/01/2020

During the course of research, scholars often explore large textual databases for segments of text relevant to their conceptual analyses. This study proposes, develops and evaluates two algorithms for automated concept detection in theoretical corpora: ACS and WMD retrieval. Both novel algorithms are compared to key word retrieval, using a test set from the Digital Ricoeur corpus tagged by scholarly experts. WMD retrieval outperforms key word search on the concept detection task. Thus, WMD retrieval is a promising tool for concept detection and information retrieval systems focused on theoretical corpora.

Bowdoin College

Event-based Access to Historical Italian War Memoirs

Author: Nanni Federico
Ponzetto Simone Paolo
Rovera Marco
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2021

The progressive digitization of historical archives provides new, often domain-specific, textual resources that report on facts and events which have happened in the past; among these, memoirs are a very common type of primary source. In this paper, we present an approach for extracting information from Italian historical war memoirs and turning it into structured knowledge. This is based on the semantic notions of events, participants and roles. We quantitatively evaluate each of the key steps of our approach and provide a graph-based representation of the extracted knowledge, which allows moving between a Close and a Distant Reading of the collection. Comment: 23 pages, 6 figures

arXiv.org e-Print Archive

MAnnheim DOCument Server

How Do You Measure a Constitutional Moment? Using Algorithmic Topic Modeling To Evaluate Bruce Ackerman's Theory of Constitutional Change

Author: Young Daniel Taylor
Publication venue: Yale Law School Legal Scholarship Repository
Publication date: 01/01/2013

Yale Law School Legal Scholarship Repository

Event-based Access to Historical Italian War Memoirs

Author: Rovera M
Publication date: 01/01/2021

Institutional Research Information System University of Turin