
    Comparing Sanskrit Texts for Critical Editions: the sequences move problem

    A critical edition takes into account various versions of the same text in order to show the differences between any two distinct versions, in terms of words that are missing, changed, omitted or displaced. Traditionally, Sanskrit is written without spaces between words, and the word order can be changed without altering the meaning of a sentence. This paper describes the characteristics that make Sanskrit text comparison a special case. It presents two different methods for comparing Sanskrit texts, which can be used to develop a computer-assisted critical edition. The first method uses the longest common subsequence (L.C.S.), while the second uses a global alignment algorithm. Comparing them, we see that the second method provides better results, but that neither method can detect when a word or a sentence fragment has been moved. We then present a method based on N-grams that can detect such a move when the fragment is not too far from its original location. We show how the method behaves on several examples and consider possible future developments.
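To make the "sequences move problem" concrete, here is a minimal sketch of a word-level comparison based on the longest common subsequence. It is not the authors' implementation: the function names `lcs_table` and `diff_words`, the labels `same`/`missing`/`added`, and the transliterated sample words are all assumptions made for illustration, and the texts are assumed to be already segmented into words.

```python
from typing import List, Tuple


def lcs_table(a: List[str], b: List[str]) -> List[List[int]]:
    """table[i][j] holds the LCS length of the suffixes a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = 1 + table[i + 1][j + 1]
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def diff_words(a: List[str], b: List[str]) -> List[Tuple[str, str]]:
    """Walk the LCS table and label each word as same, missing, or added."""
    table = lcs_table(a, b)
    ops: List[Tuple[str, str]] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            ops.append(("same", a[i]))
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            ops.append(("missing", a[i]))  # present in version A only
            i += 1
        else:
            ops.append(("added", b[j]))  # present in version B only
            j += 1
    ops.extend(("missing", word) for word in a[i:])
    ops.extend(("added", word) for word in b[j:])
    return ops


# Hypothetical sample: the same three words, with the first one moved to the end.
version_a = ["ramah", "vanam", "gacchati"]
version_b = ["vanam", "gacchati", "ramah"]
for label, word in diff_words(version_a, version_b):
    print(label, word)
```

In this sample the moved word is reported once as missing and once as added rather than as a single displacement, which is the limitation the abstract attributes to both the L.C.S. and the global alignment approaches.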

    Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit

    The primary focus of this thesis is to make Sanskrit manuscripts more accessible to end users through natural language technologies. The morphological richness, compounding, free word order, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions. We identify four fundamental tasks that are crucial for developing robust NLP technology for Sanskrit: word segmentation, dependency parsing, compound type identification, and poetry analysis. The first task, Sanskrit Word Segmentation (SWS), is a fundamental text processing task for any downstream application. However, it is challenging due to the sandhi phenomenon, which modifies characters at word boundaries. Similarly, existing dependency parsing approaches struggle with morphologically rich and low-resource languages like Sanskrit. Compound type identification is also challenging for Sanskrit due to the context-sensitive semantic relation between components. All these challenges result in sub-optimal performance in NLP applications like question answering and machine translation. Finally, Sanskrit poetry has not been extensively studied in computational linguistics. While addressing these challenges, this thesis makes various contributions: (1) The thesis proposes linguistically-informed neural architectures for these tasks. (2) We showcase the interpretability and multilingual extension of the proposed systems. (3) Our proposed systems report state-of-the-art performance. (4) Finally, we present a neural toolkit named SanskritShala, a web-based application that provides real-time analysis of input for various NLP tasks. Overall, this thesis contributes to making Sanskrit manuscripts more accessible by developing robust NLP technology and releasing various resources, datasets, and a web-based toolkit. Comment: Ph.D. dissertation

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data, using external morphological resources instead. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and the results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model. JRC.G.2-Global security and crisis management

    Linear mappings: semantic transfer from transformer models for cognate detection and coreference resolution

    Includes bibliographical references. 2022 Fall. Embeddings or vector representations of language and their properties are useful for understanding how Natural Language Processing technology works. The usefulness of embeddings, however, depends on how contextualized or information-rich such embeddings are. In this work, I apply a novel affine (linear) mapping technique first established in the field of computer vision to embeddings generated from large Transformer-based language models. In particular, I study its use in two challenging linguistic tasks: cross-lingual cognate detection and cross-document coreference resolution. Cognate detection for two Low-Resource Languages (LRLs), Assamese and Bengali, is framed as a binary classification problem using semantic (embedding-based), articulatory, and phonetic features. Linear maps for this task are extrinsically evaluated on the extent of transfer of semantic information between monolingual as well as multilingual models, including those specialized for low-resource Indian languages. For cross-document coreference resolution, whole-document contextual representations are generated for event and entity mentions from cross-document language models like CDLM and other BERT variants, and then linearly mapped to form coreferring clusters based on their cosine similarities. I evaluate my results against gold output using established coreference metrics like BCUB and MUC. My findings reveal that linearly transforming vectors from one model's embedding space to another carries certain semantic information with high fidelity, thereby revealing the existence of a canonical embedding space and its geometric properties for language models. Interestingly, even for a much more challenging task like coreference resolution, linear maps are able to transfer semantic information between "lighter" or less contextual models and "larger" models with near-equivalent performance, or even improved results in some cases.

    Intertextual Readings of the Nyāyabhūṣaṇa on Buddhist Anti-Realism

    This two-part dissertation has two goals: 1) a close philological reading of a 50-page section of a 10th-century Sanskrit philosophical work (Bhāsarvajña's Nyāyabhūṣaṇa), and 2) the creation and assessment of a novel intertextuality research system (Vātāyana) centered on the same work. The first half of the dissertation encompasses the philology project in four chapters: 1) background on the author, work, and key philosophical ideas in the passage; 2) descriptions of all known manuscript witnesses of this work and a new critical edition that substantially improves upon the editio princeps; 3) a word-for-word English translation richly annotated with both traditional explanatory material and novel digital links to not one but two interactive online research systems; and 4) a discussion of the Sanskrit author's dialectical strategy in the studied passage. The second half of the dissertation details the intertextuality research system in a further four chapters: 5) why it is needed and what can be learned from existing projects; 6) the creation of the system consisting of curated textual corpus, composite algorithm in natural language processing and information retrieval, and live web-app interface; 7) an evaluation of system performance measured against a small gold-standard dataset derived from traditional philological research; and 8) a discussion of the impact such new technology could have on humanistic research more broadly. System performance was assessed to be quite good, with a 'recall@5' of 80%, meaning that most previously known cases of mid-length quotation and even paraphrase could be automatically found and returned within the system's top five hits. Moreover, the system was also found to return a 34% surplus of additional significant parallels not found in the small benchmark. 
This assessment confirms that Vātāyana can be useful to researchers by aiding them in their collection and organization of intertextual observations, leaving them more time to focus on interpretation. Seventeen appendices illustrate both these efforts and a number of side projects, the latter of which span translation alignment, network visualization of an important database of South Asian prosopography (PANDiT), and a multi-functional Sanskrit text-processing web application (Skrutable).

    Contents:
    Preface
    Table of Contents
    Abbreviations
    Terms and Symbols
    Nyāyabhūṣaṇa Witnesses
    Main Sanskrit Editions
    Introduction
    A Multi-Disciplinary Project in Intertextual Reading
    Main Object of Study: Nyāyabhūṣaṇa 104–154
    Project Outline
    Part I: Close Reading
    1 Background
    1.1 Bhāsarvajña
    1.2 The Nyāyabhūṣaṇa
    1.2.1 As One of Several Commentaries on Bhāsarvajña's Nyāyasāra
    1.2.2 In Modern Scholarship, with Focus on NBhū 104–154
    1.3 Philosophical Context
    1.3.1 Key Philosophical Concepts
    1.3.2 Intra-Textual Context within the Nyāyabhūṣaṇa
    1.3.3 Inter-Textual Context
    2 Edition of NBhū 104–154
    2.1 Source Materials
    2.1.1 Edition of Yogīndrānanda 1968 (E)
    2.1.2 Manuscripts (P1, P2, V)
    2.1.3 Diplomatic Transcripts
    2.2 Notes on Using the Edition
    2.3 Critical Edition of NBhū 104–154 with Apparatuses
    3 Translation of NBhū 104–154
    3.1 Notes on Translation Method
    3.2 Notes on Outline Headings
    3.3 Annotated Translation of NBhū 104–154
    4 Discussion
    4.1 Internal Structure of NBhū 104–154
    4.2 Critical Assessment of Bhāsarvajña's Argumentation
    Part II: Distant Reading with Digital Humanities
    5 Background in Intertextuality Detection
    5.1 Sanskrit Projects
    5.2 Non-Sanskrit Projects
    5.3 Operationalizing Intertextuality
    6 Building an Intertextuality Machine
    6.1 Corpus (Pramāṇa NLP)
    6.2 Algorithm (Vātāyana)
    6.3 User Interface (Vātāyana)
    7 Evaluating System Performance
    7.1 Previous Scholarship on NBhū 104–154 as Philological Benchmark
    7.2 System Performance Relative to Benchmark
    8 Discussion
    Conclusion
    Works Cited
    Main Sanskrit Editions
    Works Cited in Part I
    Works Cited in Part II
    Appendices
    Appendix 1: Correspondence of Joshi 1986 to Yogīndrānanda 1968
    Appendix 1D: Full-Text Alignment of Joshi 1986 to Yogīndrānanda 1968
    Appendix 2: Prosopographical Relations Important for NBhū 104–154
    Appendix 2D: Command-Line Tool “Pandit Grapher”
    Appendix 3: Previous Suggestions to Improve Text of NBhū 104–154
    Appendix 4D: Transcript and Collation Data for NBhū 104–154
    Appendix 5D: Command-Line Tool “cte2cex” for Transcript Data Conversion
    Appendix 6D: Deployment of Brucheion for Interactive Transcript Data
    Appendix 7: Highlighted Improvements to Text of NBhū 104–154
    Appendix 7D: Alternate Version of Edition With Highlighted Improvements
    Appendix 8D: Digital Forms of Translation of NBhū 104–154
    Appendix 9: Analytic Outline of NBhū 104–154 by Shodo Yamakami
    Appendix 10.1: New Analytic Outline of NBhū 104–154 (Overall)
    Appendix 10.2: New Analytic Outline of NBhū 104–154 (Detailed)
    Appendix 11D: Skrutable Text Processing Library and Web Application
    Appendix 12D: Pramāṇa NLP Corpus, Metadata, and LDA Modeling Info
    Appendix 13D: Vātāyana Intertextuality Research Web Application
    Appendix 14: Sample of Yamakami Citation Benchmark for NBhū 104–154
    Appendix 14D: Full Yamakami Citation Benchmark for NBhū 104–154
    Appendix 15: Vātāyana Recall@5 Scores for NBhū 104–154
    Appendix 16: PVA, PVin, and PVSV Vātāyana Search Hits for Entire NBhū
    Appendix 17: Sample Listing of Vātāyana Search Hits for Entire NBhū
    Appendix 17D: Full Listing of Vātāyana Search Hits for Entire NBhū
    Overview of Digital Appendices
    Zusammenfassung (Thesen Zur Dissertation)
    Summary of Results
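The 'recall@5' figure quoted in the abstract above can be illustrated with a small sketch. It assumes a benchmark made of (query passage, known parallel) pairs and a mapping from each query to the system's ranked hits; the function name `recall_at_k` and all identifiers in the sample data are hypothetical, and the dissertation's own scoring procedure may differ in detail.

```python
from typing import Dict, List, Sequence, Tuple


def recall_at_k(
    benchmark: Sequence[Tuple[str, str]],
    ranked_hits: Dict[str, List[str]],
    k: int = 5,
) -> float:
    """Fraction of known parallels that appear among the top-k hits for their query."""
    if not benchmark:
        raise ValueError("benchmark must contain at least one known parallel")
    found = sum(
        1
        for query, known_parallel in benchmark
        if known_parallel in ranked_hits.get(query, [])[:k]
    )
    return found / len(benchmark)


# Hypothetical benchmark: five known quotations, one per query passage.
benchmark = [("q1", "t1"), ("q2", "t2"), ("q3", "t3"), ("q4", "t4"), ("q5", "t5")]

# Hypothetical ranked search results, best hit first.
ranked_hits = {
    "q1": ["t1", "x", "y"],
    "q2": ["x", "t2"],
    "q3": ["a", "b", "c", "d", "t3"],
    "q4": ["a", "b", "c", "d", "e", "t4"],  # known parallel only at rank 6
    "q5": ["t5"],
}

score = recall_at_k(benchmark, ranked_hits)
print(f"recall@5 = {score:.0%}")
```

Under this definition, additional significant parallels that the system returns but the benchmark does not list leave the score unchanged, which is why the abstract reports the surplus as a separate figure.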

    Modeling the Pāṇinian System of Sanskrit Grammar

    The present work is a study of the Aṣṭādhyāyī of Pāṇini from a new perspective. It attempts to explore the Pāṇinian system of Sanskrit grammar from a formal point of view and to investigate the possibilities of representing it in a logical, explicit and consistent manner. It puts forward an appropriate framework for such a representation. The Aṣṭādhyāyī is composed in an artificial yet natural language and is meant to be employed by individuals who are acquainted both with the Sanskrit language and with the techniques of grammar. The present rendering, by contrast, aims for a non-verbal representation in terms of mathematical categories and logical relations which can be implemented in an algorithmic manner. The formal framework suggested in this work would provide adequate tools for postulating and evaluating hypotheses about the grammatical system. Moreover, it would furnish the basis for a computer implementation of the grammar. Both these aspects are objects of enquiry in the field of theoretical studies on Pāṇini as well as in the emerging discipline of Sanskrit computational linguistics. This book undertakes the groundwork in these areas.

    Adaptive Reuse

    The present volume explores a specific aspect of creativity in South Asian systems of knowledge, literature and rituals. Under the heading of “adaptive reuse,” it discusses the relationship between innovation and the perpetuation of earlier forms and contents of knowledge and aesthetic expressions within the process of creating new works. Although this relation rarely became the topic of explicit reflections in the South Asian intellectual traditions, it is here investigated by taking a closer look at the treatment of older materials by later authors. “Adaptive reuse” is an important theoretical concept from the field of architecture, where it denotes the use of a partially rebuilt building for purposes other than those for which it was originally constructed. In the present volume, this concept is applied for the first time to a wider spectrum of cultural creation, namely the composition of texts and the creation of new concepts and rituals.