CREATING LANGUAGE RESOURCES FOR NLP IN INDIAN LANGUAGES

Abstract

1. BACKGROUND

The non-availability of lexical resources in electronic form is a major bottleneck for anyone working in NLP on Indian languages. Some measures were taken to alleviate this bottleneck quickly and efficiently. It was felt that if the development of these resources were linked to an example application, the application could act as a test bed for the developing resources and provide constant feedback. Moreover, immediate results in the form of a working system also motivate the developers during such time-consuming work. It was therefore decided to build a machine translation system as the example application, which would also serve as a vehicle for building lexical resources.

2. DEVELOPING LEXICAL RESOURCES

The following lexical resources were built or are being built as part of a planned effort:

a) Electronic dictionary (Shabdanjali English-Hindi dictionary)
b) Transfer lexicon and grammar (TransLexGram)
c) Part-of-speech tagged corpora

These are described below.

2.1 SHABDANJALI ELECTRONIC DICTIONARY

As a first step in this direction, a collaborative effort was undertaken to develop a bilingual electronic dictionary under the free software model. The interesting aspect of this effort was that the work was carried out by school children, teachers and others. People in about 8 cities were involved in the exercise. The school teachers participated, to some extent, in correcting and refining the work. The development of the dictionary took advantage of the bilingual ability of the contributors, who provided the basic data:

a) A number of Hindi equivalents required to cover the various senses of the English lexical item in various contexts.
b) An English example sentence for every Hindi equivalent.

(The developed resource is now available as an "open resource" under General Public License
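The contributor data described for the Shabdanjali dictionary (several Hindi equivalents per English lexical item, each paired with an English example sentence) could be modelled roughly as below. This is only a minimal sketch: the class names `Equivalent` and `Entry`, the method `add_equivalent`, and the sample words and sentences are hypothetical illustrations, not the actual Shabdanjali file format or contents.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Equivalent:
    """One Hindi equivalent covering one sense of the English item."""
    hindi: str
    example: str  # English example sentence illustrating this sense


@dataclass
class Entry:
    """One English lexical item with its Hindi equivalents (hypothetical layout)."""
    headword: str
    equivalents: List[Equivalent] = field(default_factory=list)

    def add_equivalent(self, hindi: str, example: str) -> None:
        # The source states that every Hindi equivalent comes with an
        # English example sentence, so an empty example is rejected here.
        if not example.strip():
            raise ValueError("each Hindi equivalent needs an English example sentence")
        self.equivalents.append(Equivalent(hindi, example))


# Illustrative data only; not taken from the Shabdanjali dictionary.
entry = Entry("bank")
entry.add_equivalent("बैंक", "He deposited the money in the bank.")
entry.add_equivalent("किनारा", "They sat on the bank of the river.")

for eq in entry.equivalents:
    print(f"{entry.headword} -> {eq.hindi}: {eq.example}")
```

Keeping the example sentence next to each equivalent, rather than in a separate list, makes the one-example-per-equivalent requirement easy to check when contributions from many people are merged.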
