
7 research outputs found

HFST-SweNER – A New NER Resource for Swedish

Author: Borin Lars
Hardwick Sam
Kokkinakis Dimitrios
Linden Krister
Niemi Jyrki
Publication venue: European Language Resources Association (ELRA)
Publication date: 26/05/2014

Named entity recognition (NER) is a knowledge-intensive information extraction task that recognizes textual mentions of entities belonging to a predefined set of categories, such as locations, organizations and time expressions. NER is a challenging yet essential preprocessing technology for many natural language processing applications, and is particularly crucial for language understanding. NER has been actively explored in academia and industry, especially in recent years with the advent of social media data. This paper describes the conversion, modeling and adaptation of a Swedish NER system from a hybrid environment, with integrated functionality from various processing components, to the Helsinki Finite-State Transducer Technology (HFST) platform. This new HFST-based NER (HFST-SweNER) is a full-fledged open source implementation that supports a variety of generic named entity types and consists of multiple, reusable resource layers, e.g., various n-gram-based named entity lists (gazetteers). Peer reviewed.

Helsingin yliopiston digitaalinen arkisto

Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

Author: Kettunen Kimmo Tapio
Kuokkala Juha Markus
Mäkelä Eetu
Niemi Jyrki Antero
Ruokolainen Teemu Petteri
Publication venue: CEUR Workshop Proceedings
Publication date: 01/01/2016

Named entity recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system's performance is genre- and domain-dependent, and the entity categories used also vary [1]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons and organizations. In this paper we report first trials and evaluation of NER with data from Digi, a digitized Finnish historical newspaper collection. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910 in both Finnish and Swedish. We use only material from Finnish documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 74–75 % [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. FiNER is able to achieve up to 60.0 F-score with named entities in the evaluation data. SeCo's tools achieve 30.0–60.0 F-score with locations and persons. The performance of FiNER and SeCo's tools with the data shows that at best about half of named entities can be recognized even in a quite erroneous OCRed text. Peer reviewed.

Aaltodoc Publication Archive

Helsingin yliopiston digitaalinen arkisto

Kekkonen, Euroviisut ja Helsinki – kansallinen audiovisuaalinen perintö NER-analyysin tunnistamana [Kekkonen, Eurovision and Helsinki: National Audiovisual Heritage as Identified by NER Analysis]

Author: Kannisto Maiju
Kauppinen Pekka
Publication venue: Society of Social and Economic Research in the Universities of Turku
Publication date: 27/10/2022

When a national audiovisual archive is built, many choices are made about what kind of national heritage is preserved to be shared and remembered. In this article we study the canon of the audiovisual archive by means of named entity recognition with the FiNER tool. From the metadata of Yle's Elävä arkisto (Living Archive), we trace the persons, events, places and years that the archive emphasizes. The article opens up the possibilities and limitations of metadata material and named entity recognition in historical research.

UTUPub

Digital Histories: Emergent Approaches within the New Digital History.

Author: Kannisto Maiju
Kauppinen Pekka
Publication venue: Helsinki University Press
Publication date: 28/10/2022

The chapter focuses on the Finnish public service broadcasting company Yle (formerly Yleisradio), which was founded in 1926, and on the possible uses of its online archive by digital historians. The dataset used in the research is non-traditional in that it consists of Yle's archival metadata. This digital material is analysed as historical source material using Named Entity Recognition (NER) as implemented in FiNER, the Finnish rule-based named-entity recogniser. The chapter explores how a canon of salient Finnish events and persons is built up in the national audio-visual archive in the digital age. The authors suggest that cultural contextualisation and close reading of the themes highlighted by the NER-based analysis still play an important role in the analytical process, as both the metadata material and the digital tool have their limitations.

UTUPub

Old Content and Modern Tools : Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771–1910

Author: Kettunen Kimmo
Kuokkala Juha
Löfberg Laura
Mäkelä Eetu
Ruokolainen Teemu
Publication date: 09/11/2016

Named Entity Recognition (NER), the search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, the performance of a NER system is genre- and domain-dependent, and the entity categories used also vary [Nadeau and Sekine 2007]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons, and organizations. In this paper we report trials and evaluation of NER with data from a digitized Finnish historical newspaper collection (Digi). The experiments, results, and discussion of this research serve the development of the web collection of historical Finnish newspapers. The Digi collection contains 1,960,921 pages of newspaper material from 1771–1910 in both Finnish and Swedish. We use only material from Finnish documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 70–75 % [Kettunen and Pääkkönen 2016]. Our principal NE tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. Three other tools are also evaluated briefly. This paper reports the first large-scale results of NER in a historical Finnish OCRed newspaper collection. The results supplement NER results for other languages with similarly noisy data. As the results are also achieved with a small and morphologically rich language, they illuminate the relatively well-researched area of Named Entity Recognition from a new perspective. Peer reviewed.

arXiv.org e-Print Archive

Helsingin yliopiston digitaalinen arkisto

Digital Histories

Publication venue: Helsinki University Press

Historical scholarship is currently undergoing a digital turn. All historians have experienced this change in one way or another, by writing on word processors, applying quantitative methods to digitized source materials, or using internet resources and digital tools. Digital Histories showcases this emerging wave of digital history research. It presents work by historians who – on their own or through collaborations with e.g. information technology specialists – have uncovered new, empirical historical knowledge through digital and computational methods. The topics of the volume range from the medieval period to the present day and cover various parts of Europe. The chapters apply an exemplary array of methods, such as digital metadata analysis, machine learning, network analysis, topic modelling, named entity recognition, collocation analysis, critical search, and text and data mining. The volume argues that digital history is entering a mature phase, digital history 'in action', where its focus is shifting from the building of resources towards the making of new historical knowledge. This also involves novel challenges that digital methods pose to historical research, including awareness of the pitfalls and limitations of digital tools and the necessity of new forms of digital source criticism. Through its combination of empirical, conceptual and contextual studies, Digital Histories is a timely and pioneering contribution that takes stock of how digital research currently advances historical scholarship.

OAPEN Library