15 research outputs found

    Old Content and Modern Tools : Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771–1910

    Named Entity Recognition (NER), search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, the performance of a NER system is genre- and domain-dependent and also used entity categories vary [Nadeau and Sekine 2007]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons, and organizations. In this paper we report trials and evaluation of NER with data from a digitized Finnish historical newspaper collection (Digi). Experiments, results, and discussion of this research serve development of the web collection of historical Finnish newspapers. Digi collection contains 1,960,921 pages of newspaper material from 1771–1910 in both Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 70–75 % [Kettunen and Pääkkönen 2016]. Our principal NE tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of the Aalto University. Three other tools are also evaluated briefly. This paper reports the first large scale results of NER in a historical Finnish OCRed newspaper collection. Results of this research supplement NER results of other languages with similar noisy data. As the results are also achieved with a small and morphologically rich language, they illuminate the relatively well-researched area of Named Entity Recognition from a new perspective. Peer reviewed.

    Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

    Named entity recognition (NER), search, classification and tagging of names and name like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system’s performance is genre and domain dependent and also used entity categories vary [1]. The most general set of named entities is usually some version of three partite categorization of locations, persons and organizations. In this paper we report first trials and evaluation of NER with data out of a digitized Finnish historical newspaper collection Digi. Digi collection contains 1,960,921 pages of newspaper material from years 1771–1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 74–75 % [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We show also results of limited category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of the Aalto University. FiNER is able to achieve up to 60.0 F-score with named entities in the evaluation data. SeCo’s tools achieve 30.0–60.0 F-score with locations and persons. Performance of FiNER and SeCo’s tools with the data shows that at best about half of named entities can be recognized even in a quite erroneous OCRed text. Peer reviewed.

    Research and Development Efforts on the Digitized Historical Newspaper and Journal Collection of The National Library of Finland

    The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.8 million pages mainly in Finnish and Swedish. Out of these about 7.36 million pages are freely available on the web site digi.kansalliskirjasto.fi (Digi). The copyright restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. This paper presents work that has been carried out in the NLF related to the historical newspaper and journal collection. We offer an overall account of research and development related to the data. Peer reviewed.

    Digitised Newspapers – A New Eldorado for Historians?

    Digitization technologies applied to historical newspapers have changed the research landscape historians were used to. An Eldorado? Despite unquestionable merits, the new digital affordance of historical newspapers also brings drawbacks and possible pitfalls which need to be carefully assessed.

    Digital Histories: Emergent Approaches within the New Digital History.

    The chapter focuses on the Finnish public service broadcasting company Yle (former Yleisradio), which was founded in 1926, and on the possible uses by digital historians of its online archive. The dataset used in the research is non-traditional in that it consists of Yle’s archival metadata. This digital material is analysed as a historical source material using the method of Named Entity Recognition (NER) as it is implemented in the digital tool the Finnish rule-based named-entity recogniser (FiNER). This chapter explores how a canon of salient Finnish events and persons is built up in the national audio-visual archive in the digital age. The authors suggest that the cultural contextualising and close reading of the themes pointed out by the results of NER-based analysis still play an important role in the analytical process as the metadata material, as well as the digital tool, has its limitations.

    Ensemble Named Entity Recognition (NER): Evaluating NER Tools in the Identification of Place Names in Historical Corpora

    The field of Spatial Humanities has advanced substantially in the past years. The identification and extraction of toponyms and spatial information mentioned in historical text collections has allowed its use in innovative ways, making possible the application of spatial analysis and the mapping of these places with geographic information systems. For instance, automated place name identification is possible with Named Entity Recognition (NER) systems. Statistical NER methods based on supervised learning, in particular, are highly successful with modern datasets. However, there are still major challenges to address when dealing with historical corpora. These challenges include language changes over time, spelling variations, transliterations, OCR errors, and sources written in multiple languages among others. In this article, considering a task of place name recognition over two collections of historical correspondence, we report an evaluation of five NER systems and an approach that combines these through a voting system. We found that although individual performance of each NER system was corpus dependent, the ensemble combination was able to achieve consistent measures of precision and recall, outperforming the individual NER systems. In addition, the results showed that these NER systems are not strongly dependent on preprocessing and translation to Modern English.
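
The voting idea mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the voting rule, so a simple "at least `min_votes` systems agree" threshold is assumed here, and the function names, system outputs, and gold set are all hypothetical.

```python
from collections import Counter


def vote(predictions, min_votes):
    """Keep every place name proposed by at least `min_votes` systems.

    `predictions` is a list with one set of place names per NER system.
    The threshold rule is an assumption made for illustration.
    """
    counts = Counter()
    for names in predictions:
        counts.update(set(names))  # each system votes once per name
    return {name for name, n in counts.items() if n >= min_votes}


def precision_recall(predicted, gold):
    """Precision and recall of a predicted set against a gold-standard set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical outputs of three NER systems for one letter.
system_outputs = [
    {"London", "Lancaster", "Kendal"},
    {"London", "Kendal", "May"},  # "May" is a spurious place name
    {"London", "Lancaster"},
]
gold = {"London", "Lancaster", "Kendal"}

ensemble = vote(system_outputs, min_votes=2)
print(sorted(ensemble), precision_recall(ensemble, gold))
```

In this toy data the spurious name proposed by a single system is voted out, while names missed by one system are recovered from the others. Raising `min_votes` trades recall for precision.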

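    Päiväysten poimintaa : automaattisten ja manuaalisten menetelmien vertailua digitoidussa historiallisessa kirjeaineistossa [Extracting Dates: A Comparison of Automatic and Manual Methods in a Digitized Historical Letter Collection]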

    The study examined how dates can be found in digitized historical material using automatic methods. Because historical documents are being digitized continuously, and an increasingly wide variety of material types is being converted into digital form, it is also necessary to develop computational methods for processing digitized materials. The research material consisted of about a thousand digitized letters from the period of the Winter War and the Continuation War, containing the correspondence of five private individuals. The letters were examined on the basis of their dates, since the aim was to find out how two different automatic methods could be used to pick out the letters of a given time period from the whole research material. For the study, this smaller experimental test collection was formed from the full, extensive collection of digitized letters, and three different test searches were run against it. The point of comparison was a manually produced interpretation of each letter's date, and the search results were evaluated in terms of precision and recall. The study approached dates partly through named entities, since a date is one of the named entities that entity recognizers can tag in text. One of the compared methods was FiNER, a named entity recognizer developed for Finnish, which made it possible to select for examination those letters in which a date entity tag had been marked. The other compared method was a self-developed program written in Python that works like a search engine, with which letters were retrieved from the whole research material. To obtain results for the letters tagged by FiNER, it was also necessary to use a slightly modified version of this self-developed search engine, so that only letters that had received a date entity were examined.
    The study found that FiNER recognizes the dates of letters rather poorly: few entity tags were assigned across the whole material, and most of the tags were located elsewhere in the text than at the actual date line. This affected the search results, since the target material examined by the two methods differed considerably. Letters were searched for by year, by year and month, and by exact date. Automatic means found the dates of the letters fairly well, and with the self-developed method the recall of the search results remained reasonably good, meaning that the relevant letters were found. Precision varied between searches and was at times rather poor because of non-relevant hits. Across the board, FiNER's results were worse in both precision and recall, because not all relevant letters had received a date entity tag in their text. Date notation varied considerably, and FiNER recognized only one particular kind of date. The study concluded that computational methods should be improved and developed so that digitized materials become as usable as possible. Data mining methods are a good aid in searching for information, and treating dates as named entities could help in finding them, since an entity tag would make it easier to extract the date from the letter text. However, the methods and recognizers need improvement so that more variants are recognized as well. Developing and improving computational methods for processing digitized materials and searching for information would broadly ease the work of researchers in different fields and improve the usability of the materials, which is why ever more effort should be invested in it.
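
The kind of date search and precision/recall evaluation described in this abstract can be sketched as follows. This is an illustration only, not the thesis author's program: the regular expression covers just the numeric day.month.year notation, and the function names, example letters, and relevance judgments are hypothetical.

```python
import re

# Assumed notation: day.month.year, e.g. "12.3.1940" or "12. 3. 1940".
DATE_RE = re.compile(r"\b(\d{1,2})\.\s?(\d{1,2})\.\s?(\d{4})\b")


def find_dates(text):
    """Return (year, month, day) tuples for every plausible date in text."""
    dates = []
    for day, month, year in DATE_RE.findall(text):
        d, m, y = int(day), int(month), int(year)
        if 1 <= d <= 31 and 1 <= m <= 12:
            dates.append((y, m, d))
    return dates


def search(letters, year, month=None, day=None):
    """Return ids of letters containing a date that matches the query.

    The query may be a year, a year and month, or an exact date.
    """
    hits = set()
    for letter_id, text in letters.items():
        for y, m, d in find_dates(text):
            if y == year and month in (None, m) and day in (None, d):
                hits.add(letter_id)
    return hits


def precision_recall(hits, relevant):
    """Precision and recall of search hits against manual judgments."""
    true_positives = len(hits & relevant)
    precision = true_positives / len(hits) if hits else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical letters; "b" was written in 1941 but mentions a 1940 date,
# and "c" was written in 1940 but spells the month out.
letters = {
    "a": "Helsinki 12.3.1940 Rakas äiti",
    "b": "Kotona 5. 7. 1941. Kiitos kirjeestäsi 12.3.1940.",
    "c": "Maaliskuun 14. päivänä 1940",
}
relevant = {"a", "c"}  # letters actually written in 1940

hits = search(letters, 1940)
print(sorted(hits), precision_recall(hits, relevant))
```

In this toy data the year query returns letters a and b. Letter b is a non-relevant hit caused by a date mentioned in the body text, which lowers precision, and letter c is missed because its date notation is not covered by the pattern, which lowers recall.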

    Underlying Sentiments in 1867: A Study of News Flows on the Execution of Emperor Maximilian I of Mexico in Digitized Newspaper Corpora

    This article focuses on the international news flow regarding the execution of Maximilian, the Emperor of Mexico. The execution occurred in June 1867, but it received global attention only at the beginning of July when the news started to spread over the borders, via telegraph, and rapidly through the network of newspapers. The article concentrates on international news on Maximilian's execution between 5 and 20 July 1867. The aim of the study is both empirical and methodological. It explores the sentiments underlying the news about the execution and the regional differences in these sentiments on an empirical level. On a methodological level, the article investigates the strategies to analyze sentiments via newspaper corpora in a multilingual research setting. The study is based on optically recognized historical newspapers in three languages (German, Spanish and English), and four regions (Austria, Germany, Mexico, and the United States). Our analysis shows content variations in the corpora, mainly that news was framed differently in each studied region, indicating that the local perception of the event and political interests shaped the news. In our corpus, the Mexican press, published in the middle of a political crisis, tended towards a neutral stance, the Austrian and German papers mainly were negative, and the United States showed mixed sentiments on the incident.

    Digital Histories

    Historical scholarship is currently undergoing a digital turn. All historians have experienced this change in one way or another, by writing on word processors, applying quantitative methods on digitalized source materials, or using internet resources and digital tools. Digital Histories showcases this emerging wave of digital history research. It presents work by historians who – on their own or through collaborations with e.g. information technology specialists – have uncovered new, empirical historical knowledge through digital and computational methods. The topics of the volume range from the medieval period to the present day, including various parts of Europe. The chapters apply an exemplary array of methods, such as digital metadata analysis, machine learning, network analysis, topic modelling, named entity recognition, collocation analysis, critical search, and text and data mining. The volume argues that digital history is entering a mature phase, digital history ‘in action’, where its focus is shifting from the building of resources towards the making of new historical knowledge. This also involves novel challenges that digital methods pose to historical research, including awareness of the pitfalls and limitations of the digital tools and the necessity of new forms of digital source criticisms. Through its combination of empirical, conceptual and contextual studies, Digital Histories is a timely and pioneering contribution taking stock of how digital research currently advances historical scholarship.