1,360 research outputs found

    ANNOTATION MODEL FOR LOANWORDS IN INDONESIAN CORPUS: A LOCAL GRAMMAR FRAMEWORK

    There is a considerable number of loanwords in the Indonesian language, as it has been, and continues to be, in contact with other languages. The contact takes place via different media; one of them is the machine-readable medium. As information in different languages can be obtained with a mouse click these days, the contact becomes more and more intense. This paper aims at proposing an annotation model and lexical resource for loanwords in Indonesian. The lexical resource is applied to a corpus by a corpus processing software called UNITEX. This software works under local grammar framework

    Linguistically-Informed Neural Architectures for Lexical, Syntactic and Semantic Tasks in Sanskrit

    The primary focus of this thesis is to make Sanskrit manuscripts more accessible to end-users through natural language technologies. The morphological richness, compounding, free word order, and low-resource nature of Sanskrit pose significant challenges for developing deep learning solutions. We identify four fundamental tasks that are crucial for developing robust NLP technology for Sanskrit: word segmentation, dependency parsing, compound type identification, and poetry analysis. The first task, Sanskrit Word Segmentation (SWS), is a fundamental text processing task for any downstream application. However, it is challenging due to the sandhi phenomenon that modifies characters at word boundaries. Similarly, the existing dependency parsing approaches struggle with morphologically rich and low-resource languages like Sanskrit. Compound type identification is also challenging for Sanskrit due to the context-sensitive semantic relation between components. All these challenges result in sub-optimal performance in NLP applications like question answering and machine translation. Finally, Sanskrit poetry has not been extensively studied in computational linguistics. While addressing these challenges, this thesis makes various contributions: (1) The thesis proposes linguistically-informed neural architectures for these tasks. (2) We showcase the interpretability and multilingual extension of the proposed systems. (3) Our proposed systems report state-of-the-art performance. (4) Finally, we present a neural toolkit named SanskritShala, a web-based application that provides real-time analysis of input for various NLP tasks. Overall, this thesis contributes to making Sanskrit manuscripts more accessible by developing robust NLP technology and releasing various resources, datasets, and a web-based toolkit. Comment: Ph.D. dissertation

    A BRIEF INTRODUCTION TO AYURVEDIC SYSTEM OF MEDICINE: PROBLEMS AND PROSPECTS OF DATABASE

    Today the medical world faces complex challenges. The times thus demand an integrated and pluralistic approach towards healthcare to cope effectively with this situation. There has been a growing interest in Ayurveda in the past few years. To initiate fruitful dialogues between Ayurveda and modern science, an in-depth understanding of both systems becomes an essential prerequisite. Such an exercise should emerge from a standpoint accepting that there are different world views existing in the world, Ayurveda being one among them. This may sound quite contrary to the common belief that science is only one, as expressed in the modern scientific paradigm. Both modern science and Ayurveda have universal attributes and share the common objective of the well-being of mankind. But they are quite different in their philosophical and epistemological foundations, conceptual framework and practical outlook. So, let us see what the fundamental differences between Sastra (Ayurveda) and modern science are

    Aspirated and Unaspirated Voiceless Consonants in Old Tibetan

    Although Tibetan orthography distinguishes aspirated and unaspirated voiceless consonants, various authors have viewed this distinction as not phonemic. An examination of the unaspirated voiceless initials in the Old Tibetan Inscriptions, together with unaspirated voiceless consonants in several Tibetan dialects, confirms that aspiration was either not phonemic in Old Tibetan, or only just emerging as a distinction due to loan words. The data examined also afford evidence for the nature of the phonetic word in Old Tibetan

    The contribution of corpus linguistics to lexicography and the future of Tibetan dictionaries

    The first alphabetized dictionary of Tibetan appeared in 1829 (cf. Bray 2008) and the intervening 184 years have witnessed the publication of scores of other Tibetan dictionaries (cf. Simon 1964). Hundreds of Tibetan dictionaries are now available; these include bilingual dictionaries, both to and from such languages as English, French, German, Latin, Japanese, etc. and specialized dictionaries focusing on medicine, plants, dialects, archaic terms, neologisms, etc. (cf. Walter 2006, McGrath 2008). However, if one classifies Tibetan dictionaries by the methods of their compilation, the accomplishments of Tibetan lexicography are less impressive. Methodologies of dictionary compilation divide heuristically into three types. First, some dictionaries lack explicit methodology; these works assemble words in an ad hoc manner and illustrate them with invented examples. Second, there are dictionaries that are compiled over very long periods of time on the basis of collections of slips recording attestations of words as used in context. Third, more recent dictionaries are compiled on the basis of electronic text corpora, which are processed computationally to aid in the precision, consistency and speed of dictionary compilation. These methods may be called respectively the 'informal method', the 'traditional method', and the 'modern method'. The overwhelming majority of Tibetan dictionaries were compiled with the informal method. Only five Tibetan dictionaries use the traditional methodology. No Tibetan dictionary yet compiled makes use of the modern method