1,923 research outputs found

Research and Development Efforts on the Digitized Historical Newspaper and Journal Collection of The National Library of Finland

Author: Kettunen Kimmo Tapio
Koistinen Jani Mika Olavi
Ruokolainen Teemu Petteri
Publication date: 03/04/2018

Since the late 1990s, the National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland. The present collection consists of about 12.8 million pages, mainly in Finnish and Swedish. Of these, about 7.36 million pages are freely available on the web site digi.kansalliskirjasto.fi (Digi). The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The open collection covers the period from 1771 to 1929. This paper presents work that has been carried out at the NLF on the historical newspaper and journal collection. We offer an overall account of research and development related to the data. Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software

Author: Antelme Daniel
Kettunen Kimmo
Liukkonen Erno Samuli
Paquet Thierry
Ruokolainen Teemu
Tranouez Pierrick
Publication venue: The Association for Computing Machinery
Publication date: 01/05/2019

This paper describes the first large-scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF), using data from one newspaper, Uusi Suometar 1869-1898. The historical digital newspaper archive environment of the NLF is based on the commercial docWorks software. The software is capable of article detection and extraction, but our material does not seem to behave well in the system in this respect. Therefore, we have been searching for an alternative article segmentation system and have now focused our efforts on PIVAJ, a machine-learning-based platform developed at the LITIS laboratory of the University of Rouen Normandy. As training and evaluation data for PIVAJ we chose one newspaper, Uusi Suometar. We established a data set that contains 56 issues of the newspaper from the years 1869-1898 with 4 pages each, i.e. 224 pages in total. Given the selected set of 56 issues, our first data annotation and experiment phase consisted of annotating a subset of 28 issues (112 pages) and conducting preliminary experiments. After the preliminary annotation and annotation of the first 28 issues accordingly. Subsequently, we annotated the remaining 28 issues. We then divided the annotated set into training and evaluation sets of 168 and 56 pages. We trained PIVAJ successfully and evaluated the results using the layout evaluation software developed by the PRImA research laboratory of the University of Salford. The results of our experiments show that PIVAJ achieves success rates of 67.9, 76.1, and 92.2 for the whole data set of 56 pages with three different evaluation scenarios introduced in [6]. On the whole, the results seem reasonable considering the varying layouts of the different issues of Uusi Suometar along the time scale of the data. Peer reviewed

HAL - Normandie Université

Helsingin yliopiston digitaalinen arkisto

Opening Digitized Newspapers for Different User Groups - Successes and Challenges

Author: Rautiainen Juha
Publication date: 21/06/2019

In recent years the National Library of Finland (NLF) has taken several initiatives to enable access to the digitized Finnish newspapers for wider use. Technical improvements in the presentation system (Digi) and agreements with Finnish copyright organizations have made it possible to provide access to copyrighted material in different ways. Since 2016 the NLF has opened over 4 million pages of digitized newspapers and journals from the years 1911-1929 for open online use in the Digi system. This has doubled the digitized material available outside of the legal deposit libraries. The opening has benefited both the general public and researchers. In a pilot project, digitized newspapers and journals from the years 1930-2010 are opened for research use. During the one-and-a-half-year pilot period, authorized users are able to access the materials in restricted use through the Digi system from their own premises and with their own devices. The NLF has promoted the use of newspapers as data in research by providing ready-made datasets in the Digi system. The datasets contain all the digitized newspaper pages from 1771 to 1910 in the ALTO XML format and some other data collections. The datasets are sufficient for many users, but customized packages have also been required. One of these cases is the Horizon 2020 funded EU project NewsEye, in which the NLF is participating. The aim of the project is to develop new integrated tools and methods for effective exploration and exploitation of digital newspapers by means of new technologies. The NLF provides the project with a set of 0.5 million pages of digitized newspapers selected together with the researchers in the project. All in all, efforts to increase the use of digitized newspapers have been successful in many ways. However, a number of issues still need to be considered in the future. The paper summarizes experiences so far.

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Helsingin yliopiston digitaalinen arkisto

OCR Quality Affects Perceived Usefulness of Historical Newspaper Clippings. A User Study

Author: Keskustalo Heikki
Kettunen Kimmo
Kumpulainen Sanna
Pääkkönen Tuula
Rautiainen Juha
Publication date: 01/01/2022

Publisher Copyright: © 2022 Copyright for this paper by its authors. Effects of Optical Character Recognition (OCR) quality on historical information retrieval have so far been studied in data-oriented scenarios regarding the effectiveness of retrieval results. Such studies have either focused on the effects of artificially degraded OCR quality (see, e.g., [1-2]) or utilized test collections containing texts based on authentic low-quality OCR data (see, e.g., [3]). In this paper the effects of OCR quality are studied in a user-oriented information retrieval setting. Thirty-two users subjectively evaluated the query results of six topics each (out of 30 topics) based on pre-formulated queries using a simulated work task setting. To the best of our knowledge, our simulated work task experiment is the first one showing empirically that users' subjective relevance assessments of retrieved documents are affected by a change in the quality of optically read text. Users of historical newspaper collections have so far commented on the effects of OCR'ed data quality mainly in impressionistic ways, and controlled user environments for studying the effects of OCR quality on users' relevance assessments of the retrieval results have so far been missing. To remedy this, the National Library of Finland (NLF) set up an experimental query environment for the contents of one Finnish historical newspaper, Uusi Suometar 1869-1918, to be able to compare users' evaluations of search results of two different OCR qualities for digitized newspaper articles. The query interface was able to present the same underlying document to the user in one of two alternatives, based either on the lower OCR quality or on the higher OCR quality, and the choice was randomized. The users did not know about quality differences in the article texts they evaluated.
The main result of the study is that improved optical character recognition quality affects perceived usefulness of historical newspaper articles significantly. The mean average evaluation score for the improved OCR results was 7.94% higher than the mean average evaluation score of the old OCR results. Peer reviewed

arXiv.org e-Print Archive

Helsingin yliopiston digitaalinen arkisto

Trepo - Institutional Repository of Tampere University

Trawling and trolling for terrorists in the digital Gulf of Bothnia : Cross-lingual text mining for the emergence of terrorism in Swedish and Finnish newspapers, 1780—1926

Author: Borin Lars
Brodén Daniel
Fridlund Mats
Jauhiainen Tommi
Malkki Leena
Olsson Leif-Jöran
Publication venue: de Gruyter
Publication date: 24/10/2022

In pursuing the historical emergence of the discourse on terrorism, this study trawls the “digital Gulf of Bothnia” in the form of a corpus of combined Swedish and Finnish digitized newspaper texts. Through a cross-lingual exploration of the uses of the concept of terrorism in historical Swedish and Finnish news, we examine meanings anchored in the two culturally close but still decidedly different national political contexts. The study is an outcome of an integrative interdisciplinary effort. Peer reviewed

Helsingin yliopiston digitaalinen arkisto

The Prior-project: From Archive Boxes to a Research Community

Author: Albretsen Jørgen
Engerer Volkmar Paul
Hasle Per Frederik Vilhelm
Roued-Cunliffe Henriette
Publication date: 21/03/2017

Copenhagen University Research Information System

VBN

Cool libraries in a melting world : Proceedings of the 23rd Polar Libraries Colloquy 2010, June 13-18, 2010, Bremerhaven, Germany

Author: Brannemann Marcel
Carle Daria O.
Publication venue: Alfred Wegener Institute for Polar and Marine Research
Publication date: 01/01/2010

Electronic Publication Information Center

Border crossing and trespassing? : Expanding digital humanities research to developing peripheries with the novel digital technologies

Author: Hyyryläinen Torsti
Ryynänen Toni
Publication venue: University of Oulu
Publication date: 01/01/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Wanca in Korp : Text corpora for underresourced Uralic languages

Author: Jauhiainen Heidi
Jauhiainen Tommi
Linden Krister
Publication venue: University of Oulu
Publication date: 01/01/2019

Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Digitised Newspapers – A New Eldorado for Historians?

Publication venue: Walter de Gruyter GmbH

Digitization technologies applied to historical newspapers have changed the research landscape historians were used to. An Eldorado? Despite unquestionable merits, the new digital affordance of historical newspapers also brings drawbacks and possible pitfalls which need to be carefully assessed.

OAPEN Library