
    Clipping the Page – Automatic Article Detection and Marking Software in Production of Newspaper Clippings of a Digitized Historical Journalistic Collection

    This paper describes the use of article detection and extraction on the Finnish Digi (https://digi.kansalliskirjasto.fi/etusivu?set_language=en) newspaper material of the National Library of Finland (NLF), using data of one newspaper, Uusi Suometar 1869–1918. We use the PIVAJ software [1] for detection and marking of articles in our collection. From the separated articles we can produce automatic clippings for the user. The user can collect clippings for their own use both as images and as OCRed text. Together these functionalities improve the usability of the digitized journalistic collection by providing structured access to the contents of a page.
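
    The clipping step described above can be illustrated with a small hypothetical sketch: given the bounding box of a detected article on a page scan, compute the crop box an image library would use to cut the clipping out of the page image. The ArticleRegion class and its fields are illustrative assumptions, not PIVAJ's actual output format.

    ```python
    from dataclasses import dataclass

    @dataclass
    class ArticleRegion:
        # Hypothetical bounding box of one detected article on a page scan,
        # in pixels from the top-left corner of the page image.
        x: int
        y: int
        width: int
        height: int

        def crop_box(self):
            """Return the (left, upper, right, lower) box that an image
            library such as Pillow expects when cropping a clipping."""
            return (self.x, self.y, self.x + self.width, self.y + self.height)

    # Example: a detected article region turned into a crop box.
    region = ArticleRegion(x=120, y=340, width=800, height=1500)
    print(region.crop_box())  # → (120, 340, 920, 1840)
    ```

    The same box could then be passed to an OCR pass restricted to the region, yielding the clipping's text alongside the image.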

    Research and Development Efforts on the Digitized Historical Newspaper and Journal Collection of The National Library of Finland

    The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.8 million pages, mainly in Finnish and Swedish. Of these, about 7.36 million pages are freely available on the website digi.kansalliskirjasto.fi (Digi). The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. This paper presents work that has been carried out at the NLF related to the historical newspaper and journal collection. We offer an overall account of research and development related to the data.

    Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software

    This paper describes the first large-scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF), using data of one newspaper, Uusi Suometar 1869–1898. The historical digital newspaper archive environment of the NLF is based on the commercial docWorks software. The software is capable of article detection and extraction, but our material does not seem to behave well in the system in this respect. Therefore, we have been in search of an alternative article segmentation system and have now focused our efforts on PIVAJ, a machine-learning-based platform developed at the LITIS laboratory of the University of Rouen Normandy. As training and evaluation data for PIVAJ we chose one newspaper, Uusi Suometar. We established a data set that contains 56 issues of the newspaper from the years 1869–1898 with 4 pages each, i.e. 224 pages in total. Given the selected set of 56 issues, our first data annotation and experiment phase consisted of annotating a subset of 28 issues (112 pages) and conducting preliminary experiments. After the preliminary experiments, we revised the annotation of the first 28 issues accordingly. Subsequently, we annotated the remaining 28 issues. We then divided the annotated set into training and evaluation sets of 168 and 56 pages. We trained PIVAJ successfully and evaluated the results using the layout evaluation software developed by the PRImA research laboratory of the University of Salford. The results of our experiments show that PIVAJ achieves success rates of 67.9, 76.1, and 92.2 for the whole data set of 56 pages with three different evaluation scenarios introduced in [6]. On the whole, the results seem reasonable considering the varying layouts of the different issues of Uusi Suometar along the time scale of the data.
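
    The dataset arithmetic in the abstract above can be checked with a quick sketch; the variable names are illustrative, not from the paper.

    ```python
    # Dataset described in the abstract: 56 issues of Uusi Suometar,
    # 4 pages each, split into training and evaluation sets.
    issues = 56
    pages_per_issue = 4
    total_pages = issues * pages_per_issue            # 224 pages in total

    first_phase_pages = 28 * pages_per_issue          # 112 pages annotated first

    train_pages, eval_pages = 168, 56
    assert train_pages + eval_pages == total_pages    # the split covers the whole set

    print(total_pages, first_phase_pages, train_pages / total_pages)
    # → 224 112 0.75
    ```

    The 168/56 split thus corresponds to a conventional 75/25 division of the annotated pages.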

    Handbook of Document Image Processing and Recognition

    The Handbook of Document Image Processing and Recognition provides a consistent, comprehensive resource on the available methods and techniques in document image processing and recognition. It includes unified comparison and contrast analysis of algorithms in standard table formats, educating readers so that they can make informed decisions on their particular problems. The handbook is divided into several parts. Each part starts with an introduction written by the two editors. These introductions set the general framework for the main topic of each part and introduce the contribution of each chapter within that framework. The introductions are followed by several chapters written by established experts of the field. Each chapter provides the reader with a clear overview of the topic and of the state of the art in techniques used, including elements of comparison between them. Each chapter is structured in the same way: it starts with an introductory text, concludes with a summary of the main points addressed in the chapter, and ends with a comprehensive list of references. Whenever appropriate, the authors include specific sections describing and pointing to consolidated software and/or reference datasets. Numerous cross-references between the chapters ensure this is a truly integrated work, without unnecessary duplications and overlaps between chapters. This reference work is intended for use by a wide audience of readers from around the world, such as graduate students, researchers, librarians, lecturers, and other professionals.