20 research outputs found

    Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger

    Named Entity Recognition (NER), the search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data over the last two decades. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals, etc. In general, a NER system's performance is genre- and domain-dependent, and the entity categories used also vary considerably (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of a tripartite categorization into locations, persons and corporations. In this paper we report NER evaluation results on two datasets: Digi, a digitized Finnish historical newspaper collection, and Digitoday, modern Finnish technology news. The historical newspaper collection Digi contains 1,960,921 pages of newspaper material from the years 1771–1910, in both Finnish and Swedish; we use only the Finnish-language material in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70–75%, and its NER evaluation collection consists of 75,931 words (Kettunen and Pääkkönen, 2016; Kettunen et al., 2016). Digitoday's annotated collection consists of 240 articles from six different sections of the newspaper. Our newly evaluated tool for NER tagging is unconventional: it is a rule-based semantic tagger of Finnish, the FST (Löfberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER. On the historical newspaper data the FST achieves F-scores of up to 55–61 with locations and 51–52 with persons, and its performance with locations is comparable to FiNER's. With the modern Finnish technology news of Digitoday, FiNER achieves F-scores of up to 79 with locations at best. Person names show the worst performance; their F-score varies from 33 to 66.
The FST performs as well as FiNER with Digitoday's location names, but worse with persons. With corporations the FST is at its worst, while FiNER performs reasonably well. Overall, our results show that a general semantic tool like the FST can perform in the restricted semantic task of name recognition almost as well as a dedicated NE tagger. As NER is a popular task in information extraction and retrieval, our results show that NE tagging need not be the preserve of dedicated NE taggers; it can be performed equally well with more general multipurpose semantic tools.
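The F-scores reported above are the standard entity-level evaluation metric. As a minimal sketch (not the papers' actual evaluation code), entities can be represented as (start, end, type) tuples and scored by exact-match set comparison; the tuple representation is an assumption for illustration.

```python
# Minimal sketch of entity-level precision/recall/F-score, the metric
# reported in NER evaluations like the one above. Entities are assumed
# to be (start, end, type) tuples compared by exact match.

def ner_f_score(gold, predicted):
    """Return (precision, recall, F1) over two collections of entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

gold = [(0, 2, "PER"), (5, 6, "LOC"), (9, 11, "ORG")]
pred = [(0, 2, "PER"), (5, 6, "ORG"), (9, 11, "ORG")]
p, r, f = ner_f_score(gold, pred)  # two of three spans match exactly
```

Exact-match scoring is strict: a correct span with the wrong type (the LOC/ORG confusion above) counts as both a false positive and a false negative, which is one reason OCR noise depresses NER scores so sharply.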

    Efficient Quality Management of Human-Based Electronic Services Leveraging Group Decision Making

    Human-based electronic services (people services) provide a powerful way of outsourcing tasks to a large crowd of remote workers over the Internet. Because of the limited control over the workforce in a potentially globally distributed environment, efficient quality management mechanisms are a prerequisite for the successful implementation of the people-service concept in a business context. Research has shown that multiple redundant results delivered by different workers can be aggregated to achieve a reliable result. However, existing implementations of this approach are highly inefficient, as they multiply the effort for task execution and cannot guarantee a certain quality level. Our weighted majority vote (WMV) approach addresses this issue by dynamically adjusting the level of redundancy depending on the historical error rates of the workers involved and the level of agreement among them. A practical evaluation in an OCR scenario demonstrates that the approach obtains reliable results at significantly lower cost than existing procedures.
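The core idea of the WMV approach can be sketched as follows. This is a hypothetical illustration, not the paper's exact formulation: vote weights are taken as the log-odds of each worker's historical accuracy, and additional redundant answers are requested only until the weighted lead of the top answer exceeds a confidence threshold (the names, the weighting function, and the stopping rule are all assumptions).

```python
import math
from collections import defaultdict

def weight(error_rate):
    # Log-odds weight from a worker's historical error rate, clamped to
    # avoid infinite weights for (apparently) perfect or hopeless workers.
    e = min(max(error_rate, 0.01), 0.49)
    return math.log((1 - e) / e)

def wmv(stream_of_votes, threshold=2.0):
    """stream_of_votes yields (answer, worker_error_rate) pairs.
    Stops requesting redundancy once the leading answer's weighted
    lead over the runner-up reaches the threshold."""
    totals = defaultdict(float)
    for answer, err in stream_of_votes:
        totals[answer] += weight(err)
        ranked = sorted(totals.values(), reverse=True)
        lead = ranked[0] - (ranked[1] if len(ranked) > 1 else 0.0)
        if lead >= threshold:
            break  # confident enough; no further redundant work needed
    return max(totals, key=totals.get)

votes = [("A", 0.10), ("B", 0.40), ("A", 0.15)]
result = wmv(votes)  # reliable worker's "A" quickly dominates
```

The cost saving comes from the early stop: for easy tasks answered first by low-error workers, one or two votes suffice, while disagreement or unreliable workers automatically trigger more redundancy.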

    Text retrieval from early printed books


    Claims processing automation - Modernization of an insurance company internal process

    Internship report presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. Deep learning and text mining are involved in the research. This work describes the project I developed together with my colleagues at SAS Institute during my internship. In this project we supported an insurance company in automating their existing claims-processing system. As of today, the procedure of reading incoming claim requests, selecting the useful information and entering it into a data management system is done manually for hundreds of claims every day. The insurance company's requirement is to replace the existing procedure with an automated one, by implementing an OCR system that reads the raw data contained in the documents sent by customers and transforms it into clean, useful information to be inserted into the data management software. This research investigates how to address this problem; the objective is to automate the classification of the documents, to provide the company with a system for prioritizing the most urgent documents, and to execute technical and administrative checks on the extracted information. The automation is shown to be feasible; the completeness and accuracy of the extracted information are solid, proving that this specific task in the insurance sector can be realized and can help to reduce costs while improving time performance.

    To Honor our Heroes: Analysis of the Obituaries of Australians Killed in Action in WWI and WWII

    Obituaries represent a prominent way of expressing the human universal of grief. According to philosophers, obituaries are a ritualized way of evaluating both individuals who have passed away and the communities that helped to shape them. The basic idea is that you can tell what it takes to count as a good person of a particular type in a particular community by seeing how persons of that type are described and celebrated in their obituaries. Obituaries of those killed in conflict, in particular, are rich repositories of communal values, as they reflect the values and virtues that are admired and respected in individuals who are considered heroes in their communities. In this paper, we use natural language processing techniques to map the patterns of values and virtues attributed to Australian military personnel who were killed in action during World War I and World War II. Doing so reveals several clusters of values and virtues that tend to be attributed together. In addition, we use named entity recognition and geotagging to track the movements of these soldiers to various theatres of the wars, including North Africa, Europe, and the Pacific.

    Old Content and Modern Tools : Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771–1910

    Named Entity Recognition (NER), the search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals, etc. In general, the performance of a NER system is genre- and domain-dependent, and the entity categories used also vary [Nadeau and Sekine 2007]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons, and organizations. In this paper we report trials and evaluation of NER with data from a digitized Finnish historical newspaper collection (Digi). The experiments, results, and discussion of this research serve the development of the web collection of historical Finnish newspapers. The Digi collection contains 1,960,921 pages of newspaper material from 1771–1910 in both Finnish and Swedish; we use only the Finnish-language material in our evaluation. The OCRed newspaper collection contains many OCR errors; its estimated word-level correctness is about 70–75% [Kettunen and Pääkkönen 2016]. Our principal NE tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. Three other tools are also evaluated briefly. This paper reports the first large-scale results of NER on a historical Finnish OCRed newspaper collection. The results of this research supplement NER results for other languages with similarly noisy data. As the results are also achieved with a small and morphologically rich language, they illuminate the relatively well-researched area of Named Entity Recognition from a new perspective.

    Text Mining the History of Medicine

    Historical text archives constitute a rich and diverse source of information, which is becoming increasingly readily accessible due to large-scale digitisation efforts. However, it can be difficult for researchers to explore and search such large volumes of data in an efficient manner. Text mining (TM) methods can help, through their ability to recognise various types of semantic information automatically, e.g., instances of concepts (places, medical conditions, drugs, etc.), synonyms/variant forms of concepts, and relationships holding between concepts (which drugs are used to treat which medical conditions, etc.). TM analysis allows search systems to incorporate functionality such as automatic suggestion of synonyms of user-entered query terms, exploration of different concepts mentioned within search results, or isolation of documents in which concepts are related in specific ways. However, applying TM methods to historical text can be challenging, owing to differences and evolution in vocabulary, terminology, language structure and style compared to more modern text. In this article, we present our efforts to overcome the various challenges faced in the semantic analysis of published historical medical text dating back to the mid-19th century. Firstly, we used evidence from diverse historical medical documents from different periods to develop new resources that provide accounts of the multiple, evolving ways in which concepts, their variants and the relationships amongst them may be expressed. These resources were employed to support the development of a modular processing pipeline of TM tools for the robust detection of semantic information in historical medical documents with varying characteristics. We applied the pipeline to two large-scale medical document archives covering wide temporal ranges as the basis for the development of a publicly accessible, semantically oriented search system. The novel resources are available for research purposes, while the processing pipeline and its modules may be used and configured within the Argo TM platform.

    Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm

    This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies on computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Character Recognition), reporting on the effect this noise has on the analyses necessary to computationally identify the different writing styles of the two brothers. In summary, our findings show that OCR digitization serves as a reliable proxy for the more painstaking process of manual digitization, at least when it comes to authorship attribution. Our results suggest that attribution is viable even when using training and test sets from different digitization pipelines. With regard to HTR, this research demonstrates that even though automated transcription significantly increases the risk of text misclassification compared to OCR, a cleanliness above ≈ 20% is already sufficient to achieve a higher-than-chance probability of correct binary attribution.
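A common stylometric approach for noisy digitized text of this kind is character-n-gram profiling, since short n-grams absorb some recognition errors that word-level features do not. The sketch below is an illustrative example of that general technique, not the authors' actual pipeline; the toy training texts and all function names are assumptions.

```python
import math
from collections import Counter

# Illustrative character-n-gram authorship attribution: build a normalized
# n-gram frequency profile per author, then assign a test text to the
# author whose profile has the highest cosine similarity.

def profile(text, n=3):
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    norm = math.sqrt(sum(c * c for c in grams.values()))
    return {g: c / norm for g, c in grams.items()}

def cosine(p, q):
    # Dot product of two unit-normalized sparse profiles.
    return sum(w * q.get(g, 0.0) for g, w in p.items())

def attribute(test_text, training_texts):
    """training_texts: dict mapping author name -> training string."""
    test = profile(test_text)
    return max(training_texts,
               key=lambda a: cosine(test, profile(training_texts[a])))

# Toy usage with invented, clearly distinguishable training material.
training = {"Jacob": "grammar and grammar of the grammar",
            "Wilhelm": "fairy tales and fairy tales"}
author = attribute("a grammar", training)
```

Because the comparison is over overlapping character triples rather than whole words, a misread letter corrupts only the handful of n-grams that span it, which is consistent with the article's finding that attribution survives substantial OCR/HTR noise.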

    Collaboration And Conflict In The Adirondack Park: An Analysis Of Conservation Discourses Over Time

    The role of collaboration within conservation is of increasing interest to scholars, managers, and forest communities. Collaboration can take many forms, but one under-studied topic is the form and content of public discourses across conservation project timelines. To understand the discursive processes that influence conservation decision-making, this research evaluates the use of collaborative rhetoric and claims about place within discourses of conservation in the Adirondacks. Local newspaper articles and editorials published from January 1996 to December 2013 and concerning six major conservation projects were studied using content analysis. Results show that collaborative rhetoric increased during the study period, and conflict discourses declined, in concert with the rise of collaborative planning efforts. Data also show an increasing convergence between conservation sponsors and local communities regarding the economic benefits of conservation and the importance of public participation. The study has value in examining representations of place and media claims-making strategies within conservation discourses, an important topic as natural resource managers increasingly embrace community-based natural resource management.