7 research outputs found
Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910
Named entity recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, a NER system's performance is genre and domain dependent, and the entity categories used also vary [1]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons and organizations. In this paper we report first trials and evaluation of NER with data from a digitized Finnish historical newspaper collection, Digi. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910, in both Finnish and Swedish. We use only Finnish-language material in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 74–75 % [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. FiNER achieves an F-score of up to 60.0 with named entities in the evaluation data. SeCo's tools achieve F-scores of 30.0–60.0 with locations and persons. The performance of FiNER and SeCo's tools with the data shows that at best about half of the named entities can be recognized even in quite erroneous OCRed text. Peer reviewed
Language technologies for a multilingual Europe
This volume of the series "Translation and Multilingual Natural Language Processing" includes most of the papers presented at the Workshop "Language Technology for a Multilingual Europe", held at the University of Hamburg on September 27, 2011 in the framework of the conference GSCL 2011 with the topic "Multilingual Resources and Multilingual Applications", along with several additional contributions. In addition to an overview article on Machine Translation and two contributions on the European initiatives META-NET and Multilingual Web, the volume includes six full research articles. Our intention with this workshop was to bring together various groups concerned with the umbrella topics of multilingualism and language technology, especially multilingual technologies. This encompassed, on the one hand, representatives from research and development in the field of language technologies, and, on the other hand, users from diverse areas such as, among others, industry, administration and funding agencies. The Workshop "Language Technology for a Multilingual Europe" was co-organised by the two GSCL working groups "Text Technology" and "Machine Translation" (http://gscl.info) as well as by META-NET (http://www.meta-net.eu).
Can human association norm evaluate latent semantic analysis?
This paper compares word association norms created by a psycholinguistic experiment with association lists generated by algorithms operating on text corpora. We compare lists generated by the Church and Hanks algorithm and lists generated by the LSA algorithm. An argument is presented on how those automatically generated lists reflect real semantic relations.
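As an illustration of the corpus-based side of this comparison, the following is a minimal sketch of a Church-and-Hanks-style association ratio (pointwise mutual information over a forward co-occurrence window). It is not the paper's implementation: the function names `association_ratios` and `top_associates`, the window handling, and the toy token list are assumptions made for this example only, and the counts are left unsmoothed.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def association_ratios(tokens: List[str], window: int = 5) -> Dict[Tuple[str, str], float]:
    """Return log2(P(x, y) / (P(x) * P(y))) for every ordered pair (x, y)
    where y occurs among the (window - 1) tokens that follow x.

    Simplified sketch: probabilities are raw relative frequencies.
    """
    n = len(tokens)
    word_freq = Counter(tokens)
    pair_freq: Counter = Counter()
    for i, x in enumerate(tokens):
        # Look only forward, so (x, y) and (y, x) are counted separately.
        for y in tokens[i + 1 : i + window]:
            pair_freq[(x, y)] += 1
    return {
        (x, y): math.log2((f_xy / n) / ((word_freq[x] / n) * (word_freq[y] / n)))
        for (x, y), f_xy in pair_freq.items()
    }


def top_associates(ratios: Dict[Tuple[str, str], float], cue: str, n: int = 3) -> List[str]:
    """List the words most strongly associated with `cue`, strongest first."""
    scored = [(score, y) for (x, y), score in ratios.items() if x == cue]
    return [y for score, y in sorted(scored, key=lambda item: -item[0])[:n]]


# Toy corpus, invented for illustration only.
toy_tokens = "doctor nurse doctor nurse bread butter bread butter".split()
toy_ratios = association_ratios(toy_tokens, window=2)
```

A list produced by `top_associates` for a cue word could then be compared with the responses that human subjects gave for the same cue in an association norm.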
Information-theoretic causal inference of lexical flow
This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. 
This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible for the framework to be enhanced, e.g. by confidence values for each directionality decision.
Intertextual Readings of the Nyāyabhūṣaṇa on Buddhist Anti-Realism
This two-part dissertation has two goals: 1) a close philological reading of a 50-page section of a 10th-century Sanskrit philosophical work (Bhāsarvajña's Nyāyabhūṣaṇa), and 2) the creation and assessment of a novel intertextuality research system (Vātāyana) centered on the same work.
The first half of the dissertation encompasses the philology project in four chapters: 1) background on the author, work, and key philosophical ideas in the passage; 2) descriptions of all known manuscript witnesses of this work and a new critical edition that substantially improves upon the editio princeps; 3) a word-for-word English translation richly annotated with both traditional explanatory material and novel digital links to not one but two interactive online research systems; and 4) a discussion of the Sanskrit author's dialectical strategy in the studied passage.
The second half of the dissertation details the intertextuality research system in a further four chapters: 5) why it is needed and what can be learned from existing projects; 6) the creation of the system consisting of curated textual corpus, composite algorithm in natural language processing and information retrieval, and live web-app interface; 7) an evaluation of system performance measured against a small gold-standard dataset derived from traditional philological research; and 8) a discussion of the impact such new technology could have on humanistic research more broadly. System performance was assessed to be quite good, with a 'recall@5' of 80%, meaning that most previously known cases of mid-length quotation and even paraphrase could be automatically found and returned within the system's top five hits. Moreover, the system was also found to return a 34% surplus of additional significant parallels not found in the small benchmark. This assessment confirms that Vātāyana can be useful to researchers by aiding them in their collection and organization of intertextual observations, leaving them more time to focus on interpretation.
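To make the 'recall@5' measure concrete, here is a minimal sketch of how such a score could be computed. It assumes, purely for illustration, that the benchmark is a list of (query passage, known parallel) pairs and that the system returns a ranked list of hit identifiers per query; the function name `recall_at_k`, the identifiers, and the toy data are hypothetical and do not come from the dissertation or from Vātāyana itself.

```python
from typing import Dict, List, Tuple


def recall_at_k(benchmark: List[Tuple[str, str]],
                ranked_hits: Dict[str, List[str]],
                k: int = 5) -> float:
    """Fraction of known (query, parallel) pairs whose parallel appears
    among the first k hits returned for that query."""
    if not benchmark:
        raise ValueError("benchmark must contain at least one pair")
    found = sum(
        1
        for query, parallel in benchmark
        if parallel in ranked_hits.get(query, [])[:k]
    )
    return found / len(benchmark)


# Toy data, invented for illustration: four known parallels, one of which
# the hypothetical system ranks only sixth.
toy_benchmark = [("q1", "t1"), ("q2", "t2"), ("q3", "t3"), ("q4", "t4")]
toy_hits = {
    "q1": ["t1", "x1", "x2"],
    "q2": ["x3", "t2"],
    "q3": ["a", "b", "c", "d", "e", "t3"],
    "q4": ["x4", "x5", "x6", "x7", "t4"],
}
toy_score = recall_at_k(toy_benchmark, toy_hits, k=5)
```

Under this definition, hits that are relevant but absent from the benchmark do not raise the score, which is why a surplus of additional parallels has to be reported as a separate figure.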
Seventeen appendices illustrate both these efforts and a number of side projects, the latter of which span translation alignment, network visualization of an important database of South Asian prosopography (PANDiT), and a multi-functional Sanskrit text-processing web application (Skrutable).
Preface
Table of Contents
Abbreviations
Terms and Symbols
Nyāyabhūṣaṇa Witnesses
Main Sanskrit Editions
Introduction
A Multi-Disciplinary Project in Intertextual Reading
Main Object of Study: Nyāyabhūṣaṇa 104–154
Project Outline
Part I: Close Reading
1 Background
1.1 Bhāsarvajña
1.2 The Nyāyabhūṣaṇa
1.2.1 As One of Several Commentaries on Bhāsarvajña's Nyāyasāra
1.2.2 In Modern Scholarship, with Focus on NBhū 104–154
1.3 Philosophical Context
1.3.1 Key Philosophical Concepts
1.3.2 Intra-Textual Context within the Nyāyabhūṣaṇa
1.3.3 Inter-Textual Context
2 Edition of NBhū 104–154
2.1 Source Materials
2.1.1 Edition of Yogīndrānanda 1968 (E)
2.1.2 Manuscripts (P1, P2, V)
2.1.3 Diplomatic Transcripts
2.2 Notes on Using the Edition
2.3 Critical Edition of NBhū 104–154 with Apparatuses
3 Translation of NBhū 104–154
3.1 Notes on Translation Method
3.2 Notes on Outline Headings
3.3 Annotated Translation of NBhū 104–154
4 Discussion
4.1 Internal Structure of NBhū 104–154
4.2 Critical Assessment of Bhāsarvajña's Argumentation
Part II: Distant Reading with Digital Humanities
5 Background in Intertextuality Detection
5.1 Sanskrit Projects
5.2 Non-Sanskrit Projects
5.3 Operationalizing Intertextuality
6 Building an Intertextuality Machine
6.1 Corpus (Pramāṇa NLP)
6.2 Algorithm (Vātāyana)
6.3 User Interface (Vātāyana)
7 Evaluating System Performance
7.1 Previous Scholarship on NBhū 104–154 as Philological Benchmark
7.2 System Performance Relative to Benchmark
8 Discussion
Conclusion
Works Cited
Main Sanskrit Editions
Works Cited in Part I
Works Cited in Part II
Appendices
Appendix 1: Correspondence of Joshi 1986 to Yogīndrānanda 1968
Appendix 1D: Full-Text Alignment of Joshi 1986 to Yogīndrānanda 1968
Appendix 2: Prosopographical Relations Important for NBhū 104–154
Appendix 2D: Command-Line Tool "Pandit Grapher"
Appendix 3: Previous Suggestions to Improve Text of NBhū 104–154
Appendix 4D: Transcript and Collation Data for NBhū 104–154
Appendix 5D: Command-Line Tool "cte2cex" for Transcript Data Conversion
Appendix 6D: Deployment of Brucheion for Interactive Transcript Data
Appendix 7: Highlighted Improvements to Text of NBhū 104–154
Appendix 7D: Alternate Version of Edition With Highlighted Improvements
Appendix 8D: Digital Forms of Translation of NBhū 104–154
Appendix 9: Analytic Outline of NBhū 104–154 by Shodo Yamakami
Appendix 10.1: New Analytic Outline of NBhū 104–154 (Overall)
Appendix 10.2: New Analytic Outline of NBhū 104–154 (Detailed)
Appendix 11D: Skrutable Text Processing Library and Web Application
Appendix 12D: Pramāṇa NLP Corpus, Metadata, and LDA Modeling Info
Appendix 13D: Vātāyana Intertextuality Research Web Application
Appendix 14: Sample of Yamakami Citation Benchmark for NBhū 104–154
Appendix 14D: Full Yamakami Citation Benchmark for NBhū 104–154
Appendix 15: Vātāyana Recall@5 Scores for NBhū 104–154
Appendix 16: PVA, PVin, and PVSV Vātāyana Search Hits for Entire NBhū
Appendix 17: Sample Listing of Vātāyana Search Hits for Entire NBhū
Appendix 17D: Full Listing of Vātāyana Search Hits for Entire NBhū
Overview of Digital Appendices
Zusammenfassung (Thesen Zur Dissertation)
Summary of Results