3,701 research outputs found
Empowering OLAC Extension using Anusaaraka and Effective text processing using Double Byte coding
The paper reviews the hurdles faced while trying to implement the OLAC extension
for Dravidian and other Indian languages. The paper further explores possibilities
which could minimise or solve these problems. In this context, the Chinese
system of text processing and the anusaaraka system are scrutinised.
Comment: 5 pages, 4 figures
LERIL : Collaborative Effort for Creating Lexical Resources
The paper reports on efforts taken to create lexical resources pertaining to
Indian languages, using the collaborative model. The lexical resources being
developed are: (1) Transfer lexicon and grammar from English to several Indian
languages. (2) Dependency tree bank of annotated corpora for several Indian
languages. The dependency trees are based on the Paninian model. (3) Bilingual
dictionary of 'core meanings'.
Comment: Appeared in the Proceedings of Workshop on Language Resources in Asia, along with NLPRS-2001, Tokyo, 27-30 November 2001.
Fuzzy Modeling and Natural Language Processing for Panini's Sanskrit Grammar
Indian languages have a long history among the world's natural languages. Panini
was the first to define a grammar for the Sanskrit language, with about 4,000
rules, in the fifth century. These rules contain uncertain information, and
computer processing of Sanskrit is not possible with such uncertainty left
unresolved. In this paper, fuzzy logic and fuzzy reasoning are proposed to
eliminate uncertain information when reasoning with Sanskrit grammar. Sanskrit
language processing is also discussed in this paper.
Comment: Submitted to Journal of Computer Science and Engineering, see
http://sites.google.com/site/jcseuk/volume-1-issue-1-may-201
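The proposal can be illustrated with a minimal sketch of fuzzy rule evaluation, using the common min/max operators for fuzzy AND/OR; the rule, the conditions, and the membership degrees below are invented for illustration and are not taken from Panini's grammar or from the paper.

```python
# Standard min/max fuzzy connectives (Zadeh operators).
def fuzzy_and(*degrees):
    return min(degrees)

def fuzzy_or(*degrees):
    return max(degrees)

def fuzzy_not(degree):
    return 1.0 - degree

# Hypothetical degrees to which a candidate word-form satisfies two
# conditions of an (invented) sandhi rule.
ends_in_vowel = 0.9
follows_consonant_cluster = 0.4

# "Apply the rule if the form ends in a vowel AND does not follow a cluster":
# the rule's applicability is itself a degree, not a yes/no answer.
applicability = fuzzy_and(ends_in_vowel, fuzzy_not(follows_consonant_cluster))
print(round(applicability, 2))  # min(0.9, 0.6) = 0.6
```

Reasoning over many such graded rules, rather than forcing each to a crisp true/false, is the kind of treatment of uncertainty the abstract proposes.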
75 Languages, 1 Model: Parsing Universal Dependencies Universally
We present UDify, a multilingual multi-task model capable of accurately
predicting universal part-of-speech, morphological features, lemmas, and
dependency trees simultaneously for all 124 Universal Dependencies treebanks
across 75 languages. By leveraging a multilingual BERT self-attention model
pretrained on 104 languages, we found that fine-tuning it on all datasets
concatenated together with simple softmax classifiers for each UD task can
result in state-of-the-art UPOS, UFeats, Lemmas, UAS, and LAS scores, without
requiring any recurrent or language-specific components. We evaluate UDify for
multilingual learning, showing that low-resource languages benefit the most
from cross-linguistic annotations. We also evaluate for zero-shot learning,
with results suggesting that multilingual training provides strong UD
predictions even for languages that neither UDify nor BERT have ever been
trained on. Code for UDify is available at
https://github.com/hyperparticle/udify.
Comment: Accepted for publication at EMNLP 2019. 17 pages, 6 figures
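The architecture described above, one shared pretrained encoder feeding a simple softmax classifier per UD task, can be sketched as follows. This is a toy NumPy stand-in, not the actual UDify implementation: a single weight matrix replaces multilingual BERT, and the hidden size and label-inventory sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical sizes; the real model uses BERT's 768-dim states and the
# actual UD label inventories per treebank.
HIDDEN = 16
TASKS = {"upos": 17, "ufeats": 50, "lemma": 30, "deprel": 37}

# Shared encoder stand-in: one weight matrix in place of BERT self-attention.
W_enc = rng.normal(size=(HIDDEN, HIDDEN))

# One simple softmax classifier (a single linear layer) per UD task.
heads = {task: rng.normal(size=(HIDDEN, n)) for task, n in TASKS.items()}

def predict(token_embeddings):
    """Encode once, then score every task from the shared representation."""
    h = np.tanh(token_embeddings @ W_enc)  # shared contextual states
    return {task: softmax(h @ W) for task, W in heads.items()}

sentence = rng.normal(size=(5, HIDDEN))  # 5 tokens
preds = predict(sentence)
for task, probs in preds.items():
    print(task, probs.shape)
```

The point of the design is visible even in the toy: all tasks read from the same encoder output, so fine-tuning the encoder on the concatenated treebanks benefits every task at once, with no recurrent or language-specific components.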
Anusaaraka: Overcoming the Language Barrier in India
The anusaaraka system makes text in one Indian language accessible in another
Indian language. In the anusaaraka approach, the load is divided between man
and computer: the language load is taken by the machine, and the
interpretation of the text is left to the reader. The machine presents an image of
the source text in a language close to the target language. In the image, some
constructions of the source language (which do not have equivalents) spill over
to the output. Some special notation is also devised. After some
training, the user learns to read and understand the output. Because the Indian languages
are close, the learning time for the output language is short, and is expected
to be around two weeks.
The output can also be post-edited by a trained user to make it grammatically
correct in the target language. Style can also be changed, if necessary. Thus,
in this scenario, it can function as a human-assisted translation system.
Currently, anusaarakas are being built from Telugu, Kannada, Marathi, Bengali
and Punjabi to Hindi. They can be built for all Indian languages in the near
future. Everybody must pitch in to build such systems connecting all Indian
languages, using the free software model.
Comment: Published in "Anuvad: Approaches to Translation", Rukmini Bhaya Nair (editor), Sage, New Delhi, 200
Annotation Model for Loanwords in Indonesian Corpus: A Local Grammar Framework
There is a considerable number of loanwords in the Indonesian language, as it has
been, and continues to be, in contact with other languages. The contact takes place via
different media; one of them is the machine-readable medium. As information in different
languages can be obtained with a mouse click these days, the contact becomes more and more intense. This
paper aims at proposing an annotation model and lexical resource for loanwords in
Indonesian. The lexical resource is applied to a corpus by a corpus-processing software called
UNITEX. This software works under the local grammar framework.
An OLAC Extension for Dravidian Languages
OLAC was founded in 2000 for creating online databases of language resources.
This paper intends to review the bottom-up distributed character of the project
and proposes an extension of the architecture for Dravidian languages. An
ontological structure is considered for effective natural language processing
(NLP) and its advantages over statistical methods are reviewed.
Comment: 4 pages, 2 figures
English-Bhojpuri SMT System: Insights from the Karaka Model
This thesis has been divided into six chapters, namely: Introduction, Karaka
Model and its Impact on Dependency Parsing, LT Resources for Bhojpuri,
English-Bhojpuri SMT System: Experiment, Evaluation of EB-SMT System, and
Conclusion. Chapter one introduces this PhD research by detailing the
motivation of the study, the methodology used for the study and the literature
review of the existing MT related work in Indian Languages. Chapter two talks
of the theoretical background of Karaka and Karaka model. Along with this, it
talks about previous related work. It also discusses the impacts of the Karaka
model in NLP and dependency parsing. It compares Karaka dependency and
Universal Dependency. It also presents a brief idea of the implementation of
these models in the SMT system for the English-Bhojpuri language pair.
Comment: 211 pages; submitted at JNU, New Delhi
Role of Language in Identity Formation: An Analysis of Influence of Sanskrit on Identity Formation
The contents of Brahmajnaana, Buddhism, Jainism, the
Sabdabrahma Siddhanta and the Shaddarsanas will be discussed to
present the true meaning of the individual's identity and the "I". The
influence of spirituality contained in Upanishadic insight in the
development of Sanskrit language structure, Indian culture, and
individual identity formation will be developed. The cultural
and psychological aspects of a civilization on the formation of
its language structure and prominence given to various parts
of speech and vice versa will be touched upon. These aspects
will be also compared and contrasted with German, French,
Telugu and Hindi and their respective influence on cultural and
identity formation and vice versa. A cognitive science
interpretation of advaita and dvaita phases of mind and bhakti
and vibhakti modes of language acquisition and communication
in terms of physics and electronics will be given and be clubbed
to present an inclusive and comprehensive modern scientific
and social scientific understanding and interpretation of
Brahmajnaana, Buddhism, Jainism and the rest of current
theistic and atheistic awareness of the "I" and its spiritual, linguistic,
cognitive scientific and rationalistic ideas and opinions. The use
of this study for national integration and oneness of Indians
will be highlighted.
Language Access: An Information Based Approach
The anusaaraka system (a kind of machine translation system) makes text in
one Indian language accessible through another Indian language. The machine
presents an image of the source text in a language close to the target
language. In the image, some constructions of the source language (which do not
have equivalents in the target language) spill over to the output. Some special
notation is also devised.
Anusaarakas have been built for five language pairs: Telugu, Kannada,
Marathi, Bengali and Punjabi, each to Hindi. They are available for use through email
servers.
Anusaaraka follows the principle of substitutability and reversibility of the
strings produced. This implies preservation of information while going from a
source language to a target language.
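The substitutability/reversibility principle can be illustrated with a minimal sketch: if each source string maps to a unique target string, the mapping can be inverted without loss. The glossary below is a hypothetical toy, not anusaaraka's actual lexicon, and the word pairs are illustrative only.

```python
# Toy one-to-one source-to-target glossary (hypothetical word pairs).
forward = {"pustakam": "kitaab", "jalam": "paani", "gacchati": "jaataa"}
inverse = {tgt: src for src, tgt in forward.items()}

def to_target(tokens):
    # Unknown source constructions "spill over" unchanged, as in anusaaraka output.
    return [forward.get(t, t) for t in tokens]

def to_source(tokens):
    return [inverse.get(t, t) for t in tokens]

sentence = ["pustakam", "asti", "jalam"]
out = to_target(sentence)
# Reversibility: mapping to the target and back recovers the source exactly,
# so no information is lost in the transfer.
assert to_source(out) == sentence
print(out)
```

Because the mapping is one-to-one (and unmapped tokens pass through verbatim), the output always determines the input, which is what makes the preservation-of-information claim checkable.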
For narrow subject areas, specialized modules can be built by putting subject
domain knowledge into the system, which produce good quality grammatical
output. However, it should be remembered that such modules will work only in
narrow areas, and will sometimes go wrong. In such a situation, anusaaraka
output will still remain useful.
Comment: Published in the proceedings of the Knowledge Based Computer Systems conference, Tata McGraw-Hill, New Delhi, Dec. 2000