20 research outputs found

    Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger

    Named Entity Recognition (NER), the search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data over the last two decades. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals, etc. In general, a NER system's performance is genre- and domain-dependent, and the entity categories used also vary considerably (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of a tripartite categorization into locations, persons and corporations. In this paper we report NER evaluation results on two datasets: Digi, a digitized Finnish historical newspaper collection, and Digitoday, modern Finnish technology news. The historical newspaper collection Digi contains 1,960,921 pages of newspaper material from the years 1771–1910, in both Finnish and Swedish; we use only the Finnish-language material in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70–75%, and its NER evaluation collection consists of 75,931 words (Kettunen and Pääkkönen, 2016; Kettunen et al., 2016). Digitoday's annotated collection consists of 240 articles from six different sections of the newspaper. Our newly evaluated tool for NER tagging is unconventional: it is a rule-based semantic tagger of Finnish, the FST (Löfberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER. On the historical newspaper data the FST achieves F-scores of up to 55–61 with locations and 51–52 with persons, and its performance with locations is comparable to FiNER's. With the modern Finnish technology news of Digitoday, FiNER achieves F-scores of up to 79 with locations at best. Person names show the worst performance; their F-score varies from 33 to 66.
The FST performs as well as FiNER with Digitoday's location names, but worse with persons. With corporations the FST is at its worst, while FiNER performs reasonably well. Overall, our results show that a general semantic tool like the FST can perform in the restricted semantic task of name recognition almost as well as a dedicated NE tagger. As NER is a popular task in information extraction and retrieval, our results show that NE tagging need not be the preserve of dedicated NE taggers; it can be performed equally well with more general multipurpose semantic tools.
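The F-scores reported above are the standard entity-level evaluation metric. As a minimal sketch (not the papers' actual evaluation code), entities can be represented as (start, end, type) tuples and scored by exact-match set comparison; the tuple representation is an assumption for illustration.

```python
# Minimal sketch of entity-level precision/recall/F-score, the metric
# reported in NER evaluations like the one above. Entities are assumed
# to be (start, end, type) tuples compared by exact match.

def ner_f_score(gold, predicted):
    """Return (precision, recall, F1) over two collections of entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
pred = [(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")]
p, r, f = ner_f_score(gold, pred)  # two of three spans match exactly
```

Exact-match scoring is strict: a correct span with the wrong type (the LOC/ORG confusion above) counts as both a false positive and a false negative, which is one reason OCR noise depresses NER scores so sharply.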

    Efficient Quality Management of Human-Based Electronic Services Leveraging Group Decision Making

    Human-based electronic services (people services) provide a powerful way of outsourcing tasks to a large crowd of remote workers over the Internet. Because of the limited control over the workforce in a potentially globally distributed environment, efficient quality management mechanisms are a prerequisite for the successful implementation of the people-service concept in a business context. Research has shown that multiple redundant results delivered by different workers can be aggregated to achieve a reliable result. However, existing implementations of this approach are highly inefficient, as they multiply the effort for task execution and cannot guarantee a certain quality level. Our weighted majority vote (WMV) approach addresses this issue by dynamically adjusting the level of redundancy depending on the historical error rates of the workers involved and the level of agreement among them. A practical evaluation in an OCR scenario demonstrates that the approach obtains reliable results at significantly lower cost than existing procedures.
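The core idea of the WMV approach can be sketched as follows. This is a hypothetical illustration, not the paper's exact formulation: vote weights are taken as the log-odds of each worker's historical accuracy, and additional redundant answers are requested only until the weighted lead of the top answer exceeds a confidence threshold (the names, the weighting function, and the stopping rule are all assumptions).

```python
import math
from collections import defaultdict

def weight(error_rate):
    # Log-odds weight from a worker's historical error rate, clamped to
    # avoid infinite weights for (apparently) perfect or hopeless workers.
    e = min(max(error_rate, 0.01), 0.49)
    return math.log((1 - e) / e)

def wmv(stream_of_votes, threshold=2.0):
    """stream_of_votes yields (answer, worker_error_rate) pairs.
    Stops requesting redundancy once the leading answer's weighted
    lead over the runner-up reaches the threshold."""
    totals = defaultdict(float)
    for answer, err in stream_of_votes:
        totals[answer] += weight(err)
        ranked = sorted(totals.values(), reverse=True)
        lead = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
        if lead >= threshold:
            break  # confident enough; no further redundant work needed
    return max(totals, key=totals.get)

votes = [("A", 0.10), ("B", 0.40), ("A", 0.15)]
result = wmv(votes)  # reliable worker's "A" quickly dominates
```

The cost saving comes from the early stop: for easy tasks answered first by low-error workers, one or two votes suffice, while disagreement or unreliable workers automatically trigger more redundancy.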

    Text retrieval from early printed books


    Claims processing automation - Modernization of an insurance company internal process

    Internship report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. Deep learning and text mining are involved in the research. This work describes the project I developed together with my colleagues at SAS Institute during my internship. In this project we supported an insurance company in automating their existing claims-processing system. As of today, the procedure of reading incoming claim requests, selecting the useful information and entering it into a data management system is done manually for hundreds of claims every day. The insurance company's requirement is to replace the existing procedure with an automated one, by implementing an OCR system that reads the raw data contained in the documents sent by customers and transforms it into clean, useful information to be inserted into the data management software. This research investigates how to address this problem; the objective is to automate the classification of the documents, to provide the company with a system for prioritizing the most urgent documents, and to execute technical and administrative checks on the extracted information. The automation is shown to be feasible; the completeness and accuracy of the extracted information are solid, proving that this specific task in the insurance sector can be realized and can help to reduce costs while improving time performance.

    To Honor our Heroes: Analysis of the Obituaries of Australians Killed in Action in WWI and WWII

    Obituaries represent a prominent way of expressing the human universal of grief. According to philosophers, obituaries are a ritualized way of evaluating both individuals who have passed away and the communities that helped to shape them. The basic idea is that you can tell what it takes to count as a good person of a particular type in a particular community by seeing how persons of that type are described and celebrated in their obituaries. Obituaries of those killed in conflict, in particular, are rich repositories of communal values, as they reflect the values and virtues that are admired and respected in individuals who are considered heroes in their communities. In this paper, we use natural language processing techniques to map the patterns of values and virtues attributed to Australian military personnel who were killed in action during World War I and World War II. Doing so reveals several clusters of values and virtues that tend to be attributed together. In addition, we use named entity recognition and geotagging to track the movements of these soldiers to various theatres of the wars, including North Africa, Europe, and the Pacific.

    Old Content and Modern Tools : Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771–1910

    Named Entity Recognition (NER), the search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals, etc. In general, the performance of a NER system is genre- and domain-dependent, and the entity categories used also vary [Nadeau and Sekine 2007]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons, and organizations. In this paper we report trials and evaluation of NER with data from a digitized Finnish historical newspaper collection (Digi). The experiments, results, and discussion of this research serve the development of the web collection of historical Finnish newspapers. The Digi collection contains 1,960,921 pages of newspaper material from 1771–1910 in both Finnish and Swedish; we use only the Finnish-language material in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70–75% [Kettunen and Pääkkönen 2016]. Our principal NE tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. Three other tools are also evaluated briefly. This paper reports the first large-scale results of NER on a historical Finnish OCRed newspaper collection. The results of this research supplement NER results for other languages with similarly noisy data. As the results are also achieved with a small and morphologically rich language, they illuminate the relatively well-researched area of Named Entity Recognition from a new perspective.

    Text Mining the History of Medicine

    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestion of synonyms of user-entered query terms, exploration of different concepts mentioned within search results, or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences and evolution in vocabulary, terminology, language structure and style compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid-19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and the relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible, semantically oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.

    Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm

    This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies on computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses necessary to computationally identify the different writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification compared to OCR, a cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.
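A common stylometric approach for noisy digitized text of this kind is character-n-gram profiling, since short n-grams absorb some recognition errors that word-level features do not. The sketch below is an illustrative example of that general technique, not the authors' actual pipeline; the toy training texts and all function names are assumptions.

```python
import math
from collections import Counter

# Illustrative character-n-gram authorship attribution: build a normalized
# n-gram frequency profile per author, then assign a test text to the
# author whose profile has the highest cosine similarity.

def profile(text, n=3):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    norm = math.sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()}

def cosine(p, q):
    # Dot product of two unit-normalized sparse profiles.
    return sum(w * q.get(g, 0.0) for g, w in p.items())

def attribute(test_text, training_texts):
    """training_texts: dict mapping author name -> training string."""
    test = profile(test_text)
    return max(training_texts,
               key=lambda a: cosine(test, profile(training_texts[a])))

# Toy usage with invented, clearly distinguishable training material.
training = {"Jacob": "grammar and grammar of the grammar",
            "Wilhelm": "fairy tales and fairy tales"}
author = attribute("a grammar", training)
```

Because the comparison is over overlapping character triples rather than whole words, a misread letter corrupts only the handful of n-grams that span it, which is consistent with the article's finding that attribution survives substantial OCR/HTR noise.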

    Collaboration And Conflict In The Adirondack Park: An Analysis Of Conservation Discourses Over Time

    The role of collaboration within conservation is of increasing interest to scholars, managers, and forest communities. Collaboration can take many forms, but one under-studied topic is the form and content of public discourses across conservation project timelines. To understand the discursive processes that influence conservation decision-making, this research evaluates the use of collaborative rhetoric and claims about place within discourses of conservation in the Adirondacks. Local newspaper articles and editorials published from January 1996 to December 2013 and concerning six major conservation projects were studied using content analysis. Results show that collaborative rhetoric increased during the study period, and conflict discourses declined, in concert with the rise of collaborative planning efforts. Data also show an increasing convergence between conservation sponsors and local communities regarding the economic benefits of conservation and the importance of public participation. The study has value in examining representations of place and media claims-making strategies within conservation discourses, an important topic as natural resource managers increasingly embrace community-based natural resource management.