HFST-SweNER – A New NER Resource for Swedish
Named entity recognition (NER) is a knowledge-intensive information extraction task that recognizes textual mentions of entities belonging to a predefined set of categories, such as locations, organizations and time expressions. NER is a challenging yet essential preprocessing technology for many natural language processing applications, and is particularly crucial for language understanding. NER has been actively explored in academia and industry, especially in recent years, due to the advent of social media data. This paper describes the conversion, modeling and adaptation of a Swedish NER system from a hybrid environment, with integrated functionality from various processing components, to the Helsinki Finite-State Transducer Technology (HFST) platform. This new HFST-based NER (HFST-SweNER) is a full-fledged open source implementation that supports a variety of generic named entity types and consists of multiple, reusable resource layers, e.g., various n-gram-based named entity lists (gazetteers). Peer reviewed
Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910
Named entity recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system's performance is genre- and domain-dependent, and the entity categories used also vary [1]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons and organizations. In this paper we report first trials and evaluation of NER with data from a digitized Finnish historical newspaper collection, Digi. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910 in both Finnish and Swedish. We use only material from Finnish documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 74–75 % [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. FiNER is able to achieve up to 60.0 F-score with named entities in the evaluation data. SeCo's tools achieve 30.0–60.0 F-score with locations and persons. The performance of FiNER and SeCo's tools on this data shows that, at best, about half of the named entities can be recognized even in quite erroneous OCRed text. Peer reviewed
Kekkonen, Euroviisut ja Helsinki – kansallinen audiovisuaalinen perintö NER-analyysin tunnistamana [Kekkonen, Eurovision and Helsinki: National Audiovisual Heritage as Identified by NER Analysis]
When a national audiovisual archive is built, many kinds of choices are
made about what kind of national heritage is preserved to be shared and
remembered. In this article we study the canon of the audiovisual archive
by means of named entity recognition, using the FiNER tool. From the
metadata of Yle's Elävä arkisto (Living Archive), we trace the persons,
events, places and years that the archive emphasizes. The article opens
up the possibilities and limitations of metadata material and named
entity recognition in historical research.
Digital Histories: Emergent Approaches within the New Digital History.
The chapter focuses on the Finnish public service broadcasting company
Yle (formerly Yleisradio), which was founded in 1926, and on the possible
uses of its online archive by digital historians. The dataset used in
the research is non-traditional in that it consists of Yle's archival
metadata. This digital material is analysed as historical source
material using the method of Named Entity Recognition (NER), as
implemented in the Finnish rule-based named-entity recogniser (FiNER).
The chapter explores how a canon of salient Finnish events and persons
is built up in the national audio-visual archive in the digital age. The
authors suggest that cultural contextualisation and close reading of the
themes pointed out by the results of NER-based analysis still play an
important role in the analytical process, as the metadata material, as
well as the digital tool, has its limitations.
Old Content and Modern Tools : Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771–1910
Named Entity Recognition (NER), the search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, the performance of a NER system is genre- and domain-dependent, and the entity categories used also vary [Nadeau and Sekine 2007]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons, and organizations. In this paper we report trials and evaluation of NER with data from a digitized Finnish historical newspaper collection (Digi). The experiments, results, and discussion of this research serve the development of the web collection of historical Finnish newspapers. The Digi collection contains 1,960,921 pages of newspaper material from 1771–1910 in both Finnish and Swedish. We use only material from Finnish documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 70–75 % [Kettunen and Pääkkönen 2016]. Our principal NE tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. Three other tools are also evaluated briefly. This paper reports the first large-scale results of NER in a historical Finnish OCRed newspaper collection. The results of this research supplement NER results for other languages with similarly noisy data. As the results are also achieved with a small and morphologically rich language, they illuminate the relatively well-researched area of Named Entity Recognition from a new perspective. Peer reviewed
Digital Histories
Historical scholarship is currently undergoing a digital turn. All historians have experienced this change in one way or another, by writing on word processors, applying quantitative methods to digitized source materials, or using internet resources and digital tools. Digital Histories showcases this emerging wave of digital history research. It presents work by historians who – on their own or through collaborations with e.g. information technology specialists – have uncovered new, empirical historical knowledge through digital and computational methods. The topics of the volume range from the medieval period to the present day, including various parts of Europe. The chapters apply an exemplary array of methods, such as digital metadata analysis, machine learning, network analysis, topic modelling, named entity recognition, collocation analysis, critical search, and text and data mining. The volume argues that digital history is entering a mature phase, digital history ‘in action’, where its focus is shifting from the building of resources towards the making of new historical knowledge. This also involves novel challenges that digital methods pose to historical research, including awareness of the pitfalls and limitations of digital tools and the necessity of new forms of digital source criticism. Through its combination of empirical, conceptual and contextual studies, Digital Histories is a timely and pioneering contribution taking stock of how digital research currently advances historical scholarship.