7 research outputs found

    Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

    Named entity recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system's performance is genre and domain dependent, and the entity categories used also vary [1]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons and organizations. In this paper we report first trials and evaluation of NER with data from a digitized Finnish historical newspaper collection, Digi. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910 in both Finnish and Swedish. We use only Finnish documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 74–75% [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. FiNER is able to achieve up to 60.0 F-score with named entities in the evaluation data. SeCo's tools achieve 30.0–60.0 F-score with locations and persons. The performance of FiNER and SeCo's tools with the data shows that at best about half of the named entities can be recognized even in quite erroneous OCRed text. Peer reviewed.
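The F-scores quoted in the abstract above combine precision and recall into a single number. The following is a minimal sketch of that calculation, assuming the balanced F1 variant reported on a 0–100 scale (the abstract does not say which variant was used). The function name `f_score` and the counts in the usage line are hypothetical and are not taken from the paper.

```python
def f_score(true_positives: int, false_positives: int,
            false_negatives: int, beta: float = 1.0) -> float:
    """Return the F-beta score on a 0-100 scale (beta=1 is the balanced F1)."""
    if true_positives == 0:
        # No correct entities: precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    b2 = beta * beta
    return 100.0 * (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts for illustration only (not from the paper):
# 60 entities tagged correctly, 40 spurious tags, 40 entities missed.
print(round(f_score(60, 40, 40), 1))  # 60.0
```

With these made-up counts precision and recall are both 0.6, so the score is 60.0. On this reading, a tagger that finds about half of the entities with comparable precision lands in the range the abstract reports.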

    Language technologies for a multilingual Europe

    This volume of the series “Translation and Multilingual Natural Language Processing” includes most of the papers presented at the Workshop “Language Technology for a Multilingual Europe”, held at the University of Hamburg on September 27, 2011 in the framework of the conference GSCL 2011 with the topic “Multilingual Resources and Multilingual Applications”, along with several additional contributions. In addition to an overview article on Machine Translation and two contributions on the European initiatives META-NET and Multilingual Web, the volume includes six full research articles. Our intention with this workshop was to bring together various groups concerned with the umbrella topics of multilingualism and language technology, especially multilingual technologies. This encompassed, on the one hand, representatives from research and development in the field of language technologies, and, on the other hand, users from diverse areas such as, among others, industry, administration and funding agencies. The Workshop “Language Technology for a Multilingual Europe” was co-organised by the two GSCL working groups “Text Technology” and “Machine Translation” (http://gscl.info) as well as by META-NET (http://www.meta-net.eu).

    Can human association norm evaluate latent semantic analysis?

    This paper presents a comparison of a word association norm created by a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm with lists generated by the LSA algorithm. An argument is presented on how those automatically generated lists reflect real semantic relations.
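The Church and Hanks algorithm named in the abstract above scores word pairs by an association ratio, log2(P(x, y) / (P(x) P(y))), estimated from corpus co-occurrence counts. The following is a minimal sketch of that ratio under simplifying assumptions: the function name `association_ratios`, the toy token list and the window handling are illustrative choices, not the setup used in the paper, and the LSA side of the comparison is not covered.

```python
import math
from collections import Counter


def association_ratios(tokens, window=5):
    """Map each ordered pair (x, y) to log2(P(x, y) / (P(x) * P(y))).

    f(x, y) counts how often y occurs among the (window - 1) tokens
    that follow x; probabilities are relative frequencies over len(tokens).
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + window]:
            pair_freq[(x, y)] += 1
    return {
        (x, y): math.log2((f_xy / n) / ((word_freq[x] / n) * (word_freq[y] / n)))
        for (x, y), f_xy in pair_freq.items()
    }


# Toy corpus for illustration only; window=2 counts adjacent pairs.
toy_tokens = "strong tea strong coffee weak tea".split()
ratios = association_ratios(toy_tokens, window=2)
print(round(ratios[("strong", "tea")], 3))  # 0.585
```

Sorting the pairs that share a first word by this score gives an automatically generated association list for that word, which is the kind of list the paper compares against human responses.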

    K + K = 120 : Papers dedicated to László Kálmán and András Kornai on the occasion of their 60th birthdays

    Information-theoretic causal inference of lexical flow

    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible for the framework to be enhanced, e.g. by confidence values for each directionality decision.

    Information-theoretic causal inference of lexical flow

    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages.

    Intertextual Readings of the Nyāyabhūṣaṇa on Buddhist Anti-Realism

    This two-part dissertation has two goals: 1) a close philological reading of a 50-page section of a 10th-century Sanskrit philosophical work (Bhāsarvajña's Nyāyabhūṣaṇa), and 2) the creation and assessment of a novel intertextuality research system (Vātāyana) centered on the same work. The first half of the dissertation encompasses the philology project in four chapters: 1) background on the author, work, and key philosophical ideas in the passage; 2) descriptions of all known manuscript witnesses of this work and a new critical edition that substantially improves upon the editio princeps; 3) a word-for-word English translation richly annotated with both traditional explanatory material and novel digital links to not one but two interactive online research systems; and 4) a discussion of the Sanskrit author's dialectical strategy in the studied passage. The second half of the dissertation details the intertextuality research system in a further four chapters: 5) why it is needed and what can be learned from existing projects; 6) the creation of the system consisting of curated textual corpus, composite algorithm in natural language processing and information retrieval, and live web-app interface; 7) an evaluation of system performance measured against a small gold-standard dataset derived from traditional philological research; and 8) a discussion of the impact such new technology could have on humanistic research more broadly. System performance was assessed to be quite good, with a 'recall@5' of 80%, meaning that most previously known cases of mid-length quotation and even paraphrase could be automatically found and returned within the system's top five hits. Moreover, the system was also found to return a 34% surplus of additional significant parallels not found in the small benchmark.
This assessment confirms that Vātāyana can be useful to researchers by aiding them in their collection and organization of intertextual observations, leaving them more time to focus on interpretation. Seventeen appendices illustrate both these efforts and a number of side projects, the latter of which span translation alignment, network visualization of an important database of South Asian prosopography (PANDiT), and a multi-functional Sanskrit text-processing web application (Skrutable).

    Contents:
    Preface
    Table of Contents
    Abbreviations
        Terms and Symbols
        Nyāyabhūṣaṇa Witnesses
        Main Sanskrit Editions
    Introduction
        A Multi-Disciplinary Project in Intertextual Reading
        Main Object of Study: Nyāyabhūṣaṇa 104–154
        Project Outline
    Part I: Close Reading
        1 Background
            1.1 Bhāsarvajña
            1.2 The Nyāyabhūṣaṇa
                1.2.1 As One of Several Commentaries on Bhāsarvajña's Nyāyasāra
                1.2.2 In Modern Scholarship, with Focus on NBhū 104–154
            1.3 Philosophical Context
                1.3.1 Key Philosophical Concepts
                1.3.2 Intra-Textual Context within the Nyāyabhūṣaṇa
                1.3.3 Inter-Textual Context
        2 Edition of NBhū 104–154
            2.1 Source Materials
                2.1.1 Edition of Yogīndrānanda 1968 (E)
                2.1.2 Manuscripts (P1, P2, V)
                2.1.3 Diplomatic Transcripts
            2.2 Notes on Using the Edition
            2.3 Critical Edition of NBhū 104–154 with Apparatuses
        3 Translation of NBhū 104–154
            3.1 Notes on Translation Method
            3.2 Notes on Outline Headings
            3.3 Annotated Translation of NBhū 104–154
        4 Discussion
            4.1 Internal Structure of NBhū 104–154
            4.2 Critical Assessment of Bhāsarvajña's Argumentation
    Part II: Distant Reading with Digital Humanities
        5 Background in Intertextuality Detection
            5.1 Sanskrit Projects
            5.2 Non-Sanskrit Projects
            5.3 Operationalizing Intertextuality
        6 Building an Intertextuality Machine
            6.1 Corpus (Pramāṇa NLP)
            6.2 Algorithm (Vātāyana)
            6.3 User Interface (Vātāyana)
        7 Evaluating System Performance
            7.1 Previous Scholarship on NBhū 104–154 as Philological Benchmark
            7.2 System Performance Relative to Benchmark
        8 Discussion
    Conclusion
    Works Cited
        Main Sanskrit Editions
        Works Cited in Part I
        Works Cited in Part II
    Appendices
        Appendix 1: Correspondence of Joshi 1986 to Yogīndrānanda 1968
        Appendix 1D: Full-Text Alignment of Joshi 1986 to Yogīndrānanda 1968
        Appendix 2: Prosopographical Relations Important for NBhū 104–154
        Appendix 2D: Command-Line Tool “Pandit Grapher”
        Appendix 3: Previous Suggestions to Improve Text of NBhū 104–154
        Appendix 4D: Transcript and Collation Data for NBhū 104–154
        Appendix 5D: Command-Line Tool “cte2cex” for Transcript Data Conversion
        Appendix 6D: Deployment of Brucheion for Interactive Transcript Data
        Appendix 7: Highlighted Improvements to Text of NBhū 104–154
        Appendix 7D: Alternate Version of Edition With Highlighted Improvements
        Appendix 8D: Digital Forms of Translation of NBhū 104–154
        Appendix 9: Analytic Outline of NBhū 104–154 by Shodo Yamakami
        Appendix 10.1: New Analytic Outline of NBhū 104–154 (Overall)
        Appendix 10.2: New Analytic Outline of NBhū 104–154 (Detailed)
        Appendix 11D: Skrutable Text Processing Library and Web Application
        Appendix 12D: Pramāṇa NLP Corpus, Metadata, and LDA Modeling Info
        Appendix 13D: Vātāyana Intertextuality Research Web Application
        Appendix 14: Sample of Yamakami Citation Benchmark for NBhū 104–154
        Appendix 14D: Full Yamakami Citation Benchmark for NBhū 104–154
        Appendix 15: Vātāyana Recall@5 Scores for NBhū 104–154
        Appendix 16: PVA, PVin, and PVSV Vātāyana Search Hits for Entire NBhū
        Appendix 17: Sample Listing of Vātāyana Search Hits for Entire NBhū
        Appendix 17D: Full Listing of Vātāyana Search Hits for Entire NBhū
    Overview of Digital Appendices
    Zusammenfassung (Thesen zur Dissertation)
    Summary of Results
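The 'recall@5' figure in the abstract above measures how many previously known parallels appear among a system's top five hits. The following is a minimal sketch of such a metric, assuming one plausible reading: each benchmark case is a (query passage, known parallel) pair, and a case counts as found when the parallel is among the first k ranked hits for its query. The function name `recall_at_k`, the data layout and the identifiers in the example are hypothetical and do not describe how the dissertation computed its score.

```python
def recall_at_k(known_cases, ranked_hits, k=5):
    """Fraction of known (query, target) cases whose target is in the
    query's top-k ranked hits."""
    if not known_cases:
        return 0.0
    found = sum(
        1
        for query, target in known_cases
        if target in ranked_hits.get(query, [])[:k]
    )
    return found / len(known_cases)


# Hypothetical benchmark and system output, for illustration only.
known_cases = [("q1", "a"), ("q2", "b"), ("q3", "c"), ("q4", "d"), ("q5", "e")]
ranked_hits = {
    "q1": ["a", "x"],
    "q2": ["x", "y", "b"],
    "q3": ["x", "y", "z", "w", "c"],
    "q4": ["d"],
    "q5": ["x", "y", "z", "w", "v", "e"],  # known parallel only at rank 6
}
print(recall_at_k(known_cases, ranked_hits, k=5))  # 0.8
```

In this made-up example four of the five known parallels fall within the top five hits, giving 0.8. Hits that are not in the benchmark, such as the surplus parallels the abstract mentions, do not affect this number and would need a separate count.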