    Research and Development Efforts on the Digitized Historical Newspaper and Journal Collection of The National Library of Finland

    Since the late 1990s, the National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland. The present collection consists of about 12.8 million pages, mainly in Finnish and Swedish. Of these, about 7.36 million pages are freely available on the web site digi.kansalliskirjasto.fi (Digi). The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The open collection covers the period from 1771 to 1929. This paper presents work that has been carried out at the NLF on the historical newspaper and journal collection. We offer an overall account of research and development related to the data.

    Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software

    This paper describes the first large-scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF), using data from one newspaper, Uusi Suometar 1869-1898. The historical digital newspaper archive environment of the NLF is based on the commercial docWorks software. The software is capable of article detection and extraction, but our material does not seem to behave well in the system in this respect. Therefore, we have been searching for an alternative article segmentation system and have now focused our efforts on PIVAJ, a machine-learning-based platform developed at the LITIS laboratory of the University of Rouen Normandy. As training and evaluation data for PIVAJ we chose one newspaper, Uusi Suometar. We established a data set that contains 56 issues of the newspaper from the years 1869-1898 with 4 pages each, i.e. 224 pages in total. Given the selected set of 56 issues, our first data annotation and experiment phase consisted of annotating a subset of 28 issues (112 pages) and conducting preliminary experiments. After the preliminary annotation and the corresponding annotation of the first 28 issues, we annotated the remaining 28 issues. We then divided the annotated set into training and evaluation sets of 168 and 56 pages. We trained PIVAJ successfully and evaluated the results using the layout evaluation software developed by the PRImA research laboratory of the University of Salford. The results of our experiments show that PIVAJ achieves success rates of 67.9, 76.1, and 92.2 for the whole evaluation set of 56 pages with three different evaluation scenarios introduced in [6]. On the whole, the results seem reasonable considering the varying layouts of the different issues of Uusi Suometar along the time scale of the data.

    Opening Digitized Newspapers for Different User Groups - Successes and Challenges

    In recent years the National Library of Finland (NLF) has taken several initiatives to enable access to the digitized Finnish newspapers for wider use. Technical improvements in the presentation system (Digi) and agreements with Finnish copyright organizations have made it possible to provide access to copyrighted material in different ways. Since 2016 the NLF has opened over 4 million pages of digitized newspapers and journals from the years 1911-1929 for open online use in the Digi system. This has doubled the digitized material available outside of the legal deposit libraries. The opening has benefited both the general public and researchers. In a pilot project, digitized newspapers and journals from the years 1930-2010 are opened for research use. During the one-and-a-half-year pilot period, authorized users are able to access the materials in restricted use through the Digi system from their own premises and with their own devices. The NLF has promoted the use of newspapers as data in research by providing ready-made datasets in the Digi system. The datasets contain all the digitized newspaper pages from 1771 to 1910 in the ALTO XML format and some other data collections. The datasets are sufficient for many users, but customized packages have also been required. One of these cases is the Horizon 2020 funded EU project NewsEye, in which the NLF is participating. The aim of the project is to develop new integrated tools and methods for effective exploration and exploitation of digital newspapers by means of new technologies. The NLF provides the project with a set of 0.5 million pages of digitized newspapers selected together with the researchers in the project. All in all, efforts to increase the use of digitized newspapers have been successful in many ways. However, a number of issues still need to be considered in the future. The paper summarizes experiences so far.

    OCR Quality Affects Perceived Usefulness of Historical Newspaper Clippings. A User Study

    Publisher Copyright: © 2022 Copyright for this paper by its authors. Effects of Optical Character Recognition (OCR) quality on historical information retrieval have so far been studied in data-oriented scenarios regarding the effectiveness of retrieval results. Such studies have either focused on the effects of artificially degraded OCR quality (see, e.g., [1-2]) or utilized test collections containing texts based on authentic low-quality OCR data (see, e.g., [3]). In this paper the effects of OCR quality are studied in a user-oriented information retrieval setting. Thirty-two users subjectively evaluated the query results of six topics each (out of 30 topics) based on pre-formulated queries, using a simulated work task setting. To the best of our knowledge, our simulated work task experiment is the first one showing empirically that users' subjective relevance assessments of retrieved documents are affected by a change in the quality of optically read text. Users of historical newspaper collections have so far commented on the effects of OCR'ed data quality mainly in impressionistic ways, and controlled user environments for studying the effects of OCR quality on users' relevance assessments of the retrieval results have so far been missing. To remedy this, the National Library of Finland (NLF) set up an experimental query environment for the contents of one Finnish historical newspaper, Uusi Suometar 1869-1918, to be able to compare users' evaluations of search results at two different OCR qualities for digitized newspaper articles. The query interface presented the same underlying document to the user in one of two versions, based either on the lower OCR quality or on the higher OCR quality, and the choice was randomized. The users did not know about the quality differences in the article texts they evaluated.
    The main result of the study is that improved optical character recognition quality significantly affects the perceived usefulness of historical newspaper articles. The mean average evaluation score for the improved OCR results was 7.94% higher than the mean average evaluation score of the old OCR results.

    Trawling and trolling for terrorists in the digital Gulf of Bothnia: Cross-lingual text mining for the emergence of terrorism in Swedish and Finnish newspapers, 1780–1926

    In pursuing the historical emergence of the discourse on terrorism, this study trawls the “digital Gulf of Bothnia” in the form of a corpus of combined Swedish and Finnish digitized newspaper texts. Through a cross-lingual exploration of the uses of the concept of terrorism in historical Swedish and Finnish news, we examine meanings anchored in the two culturally close but still decidedly different national political contexts. The study is an outcome of an integrative interdisciplinary effort.

    Digitised Newspapers – A New Eldorado for Historians?

    Digitization technologies applied to historical newspapers have changed the research landscape historians were used to. An Eldorado? Despite unquestionable merits, the new digital affordance of historical newspapers also brings drawbacks and possible pitfalls which need to be carefully assessed.