60 research outputs found

    Building a foundation of HPSG-based treebank on Bangla language

    Includes bibliographical references (page 6). The importance of a large annotated corpus for NLP research is now widely recognized. In this paper, we describe the initial phase of developing a linguistically annotated corpus for Bangla, a non-configurational language. Since the formalism differs from those posited for configurational languages, several features have been added for constraint-based parsing in an HPSG-based formalism. We propose an outline of a semi-automated process that applies both a case-marking approach and some morphological analysis to constrain the parsing of a relatively free word order language, with the aim of creating a linguistically rich, highly lexicalized annotated corpus.

    A double metaphone encoding for approximate name searching and matching in Bangla

    Includes bibliographical references (page 6). Almost any word can be a Bangali name, and a name in turn is often spelled in many different ways, all of which are considered correct and interchangeable. The reason for the spelling complication is two-fold: (1) there is a large gap between script and pronunciation in Bangla, largely attributed to the large-scale Sanskritization process that started in the 12th century and continued throughout the Middle Ages, and (2) typical Bangla names have very different origins, from indigenous names derived primarily from Sanskrit, to imported Muslim names from Persian and Arabic, Christian names from Portuguese, and even names from popular Western TV soap operas. However, there is always a large degree of phonetic similarity among the spelling variants of a name, which is the key to searching and matching names in records. We present a Double Metaphone encoding for Bangla names, taking into account the various spelling and phonetic rules in use, which applications can use to search for and match names. We encode the spelling variants of a large number of names found in the literature to demonstrate that the encoding maps the variants of a name to equivalent codes. A name searching algorithm may employ various figures of merit to narrow the list of possibilities when searching for similar names; we demonstrate one such figure of merit, based on name encoding and edit distance, that has shown good promise.
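
The combination of a phonetic key and edit distance mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's Double Metaphone encoding: `toy_key` is a hypothetical stand-in that works on romanized names, and `rank_candidates` and the sample names are assumptions made only for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def toy_key(name: str) -> str:
    """Hypothetical phonetic key (NOT the paper's encoding).

    Keeps the first letter, drops vowels and h/w/y elsewhere,
    and collapses consecutive repeated letters.
    """
    dropped = set("aeiouhwy")
    key = []
    for i, ch in enumerate(name.lower()):
        if not ch.isalpha():
            continue
        if i > 0 and ch in dropped:
            continue
        if key and key[-1] == ch:
            continue
        key.append(ch)
    return "".join(key)


def rank_candidates(query: str, names: list[str]) -> list[str]:
    """Keep names that share the query's key, closest spelling first."""
    key = toy_key(query)
    matches = [n for n in names if toy_key(n) == key]
    return sorted(matches, key=lambda n: edit_distance(query.lower(), n.lower()))


if __name__ == "__main__":
    records = ["Chaudhuri", "Karim", "Choudhury"]
    print(rank_candidates("Chowdhury", records))
```

The key narrows the candidate list cheaply, and the edit distance then orders the survivors; a real system would replace `toy_key` with an encoding built from the language's spelling and phonetic rules.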

    Teaching compiler development to undergraduates using a template based approach

    Includes bibliographical references (page 6-7). Compiler Design remains one of the most dreaded courses in any undergraduate Computer Science curriculum, due in part to the complexity and breadth of the material covered in a typical 14-15 week semester. The situation is further complicated by the fact that most undergraduates have never implemented a software package as large as a working compiler, and doing so in such a short time span is a challenge indeed. This necessitates changes in the way we teach compilers, and specifically in the way we set up the project for the Compiler Design course at the undergraduate level. We describe a template-based method for teaching compiler design and implementation to undergraduates, in which the students fill in the blanks in a set of templates for each phase of the compiler, from the lexical scanner to the code generator. Compilers for new languages can be implemented by modifying only the parts necessary to implement the syntax and semantics of the language, leaving much of the remaining environment as is. The students not only learn how to design the various phases of the compiler, but also learn the software design and engineering techniques for implementing large software systems. In this paper, we describe a compiler teaching methodology that implements a full working compiler for an imperative C-like programming language with backend code generators for MIPS, Java Virtual Machine (JVM) and Microsoft’s .NET Common Language Runtime (CLR). Md. Zahurul Islam, Mumit Kha

    A light weight stemmer for Bengali and its use in spelling checker

    Includes bibliographical references (page 6). Stemming is an operation that splits a word into its constituent root and affix without performing complete morphological analysis. It is used to improve the performance of spelling checkers and information retrieval applications, where morphological analysis would be too computationally expensive. For spelling checkers specifically, stemming may drastically reduce the dictionary size, often a bottleneck for mobile and embedded devices. This paper presents a computationally inexpensive stemming algorithm for Bengali, which handles suffix removal in a domain-independent way. The evaluation of the proposed algorithm in a Bengali spelling checker indicates that it can be used effectively in information retrieval applications in general. Md. Zahurul Islam, Md. Nizam Uddin, Mumit Kha
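
To make the idea of suffix removal for dictionary lookup concrete, here is a minimal sketch of a longest-suffix-first stemmer. It is not the paper's algorithm: the suffix list `SUFFIXES`, the `min_stem` threshold, and the one-word root dictionary `ROOTS` are assumptions chosen only for illustration, and a list this small will over- or under-stem many real words.

```python
# Illustrative suffix list (an assumption, not the paper's):
# -দের (genitive plural), -রা (plural), -কে (objective case).
SUFFIXES = ("দের", "রা", "কে")


def stem(word: str, suffixes=SUFFIXES, min_stem: int = 2) -> str:
    """Strip the longest matching suffix, keeping at least min_stem code points."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def is_known(word: str, roots: set[str]) -> bool:
    """Spell-check lookup: accept a word if it or its stem is a known root."""
    return word in roots or stem(word) in roots


# Hypothetical root dictionary: only the root form is stored.
ROOTS = {"ছেলে"}

if __name__ == "__main__":
    for w in ("ছেলেদের", "ছেলেরা", "ছেলেকে"):
        print(w, "->", stem(w), is_known(w, ROOTS))
```

Because only roots are stored, the three inflected forms above are accepted from a single dictionary entry, which is the size reduction the abstract refers to.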

    Text to speech for Bangla language using festival

    Includes bibliographical references (page 6-7). In this paper, we present a Text to Speech (TTS) synthesis system for the Bangla language using the open-source Festival TTS engine. Festival is a complete TTS synthesis system, with components supporting front-end processing of the input text, language modeling, and speech synthesis using its signal processing module. The Bangla TTS system proposed here creates the voice data for Festival, and additionally extends Festival through its embedded Scheme scripting interface to incorporate Bangla language support. Festival is a concatenative TTS system using diphones or other unit selection speech units. Our TTS implementation uses two of the concatenative methods supported in Festival: unit selection and multisyn unit selection. The function of a Text-to-Speech system is to convert text in some language into its spoken equivalent through a series of modules. The modules constituting the TTS system are described in detail, which should help future development. Finally, the quality of the synthesized speech is assessed in terms of acceptability and intelligibility.

    Rule based automated pronunciation generator

    Includes bibliographical references (page 6-7). This paper presents a rule-based pronunciation generator for Bangla words. It takes a word and finds the pronunciations for the graphemes of the word. A grapheme is a unit of writing that cannot be analyzed into smaller components. Resolving the pronunciation of a polyphone grapheme (i.e. a grapheme that generates more than one phoneme) is the major hurdle that the Automated Pronunciation Generator (APG) encounters. Bangla is partially phonetic in nature, so we can define rules to handle most of the cases. Moreover, we still lack a balanced corpus that could be used for a statistical pronunciation generator. As a result, for the time being a rule-based approach to implementing the APG for Bangla turns out to be efficient. Ayesha Binte Mosaddeque, Naushad UzZaman, Mumit Kha

    Example based English-Bengali machine translation using wordnet

    Includes bibliographical references (page 4). In this paper we propose an architecture for English-Bengali Example Based Machine Translation (EBMT) using WordNet. The proposed EBMT system has five steps: 1) tagging; 2) parsing; 3) preparing the chunks of the sentence using sub-sentential EBMT; 4) matching the sentence rule using an efficient adapting scheme; 5) translating from the source language (English) to the target language (Bengali) within each chunk and generating the output with morphological analysis, with the help of WordNet. Using the word senses given by WordNet, we can detect ambiguity and improve the correctness of translation. Khan Md. Anwarus Salam, Mumit Khan, Tetsuro Nishin

    Comparison of different POS tagging technique (N-Gram, HMM and Brill's tagger) for Bangla

    Includes bibliographical references (page 6-7). There are different approaches to the problem of assigning a part-of-speech tag to each word of a text, which is known as Part-Of-Speech (POS) tagging. In this paper we compare the performance of a few POS tagging techniques for the Bangla language, namely a statistical approach (n-gram, HMM) and a transformation-based approach (Brill’s tagger). A supervised POS tagging approach requires a large annotated training corpus to tag properly. At this initial stage of POS tagging for Bangla, we have a very limited annotated corpus. We tried to see which technique maximizes performance with this limited resource. We also checked the performance for English and tried to conclude how these techniques might perform if a substantial annotated corpus becomes available. Naushad UzZaman, Fahim Muhammad Hasan, Mumit Kha
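
As a point of reference for the statistical family compared in this abstract, the simplest n-gram tagger (n = 1) assigns each word its most frequent training tag and falls back to a default tag for unseen words. The sketch below is not the authors' implementation; the English toy corpus, the tag names, and the default tag `"NN"` are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Map each lowercased word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NN"):
    """Tag each word; unseen words receive the default tag."""
    return [(w, model.get(w.lower(), default)) for w in words]


def accuracy(gold_sentences, model, default="NN"):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [w for w, _ in sentence]
        predicted = tag(words, model, default)
        for (_, gold), (_, pred) in zip(sentence, predicted):
            correct += gold == pred
            total += 1
    return correct / total if total else 0.0


# Hypothetical toy corpus, far smaller than any usable training set.
TRAIN = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VB")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VB")],
]
TEST = [[("the", "DT"), ("bird", "NN"), ("sings", "VB")]]
MODEL = train_unigram(TRAIN)

if __name__ == "__main__":
    print(tag(["the", "bird", "sings"], MODEL))
    print(accuracy(TEST, MODEL))
```

With so little training data, most test words are unseen and accuracy depends heavily on the default tag, which illustrates why the size of the annotated corpus matters for supervised taggers.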

    A high performance domain specific OCR for Bangla script

    Includes bibliographical references (page 5). Research on recognizing Bengali script began in the mid-1980s. A variety of techniques have been applied and their performance examined. In this paper we present a high performance domain specific OCR for recognizing Bengali script. We select the training data set from the script of the specified domain. We choose a Hidden Markov Model (HMM) for character classification because of its simple and straightforward representation. We examine the primary error types, which occur mainly at the preprocessing level, and handle them by adding a dedicated error-correcting module to the recognizer. Finally, we add a dictionary and some error-specific rules to correct probable errors after word formation. Together, these techniques significantly increase the performance of the OCR for a specific domain.

    Collaborative lexicon development for Bangla

    Includes bibliographical references (page 7). This paper addresses the issue of building a Bangla lexicon collaboratively through a stand-alone application and a web-based interface. The words in the lexicon will be annotated with a combination of tags covering parts-of-speech, syntactic, semantic and other grammatical features. Bangla words have been classified into several parts-of-speech categories, including various major word groups and subgroups. This paper aims to provide an integrated, user-friendly software interface for annotating a large existing Bangla word set, and proposes a mechanism for bringing linguists and other interested people into the lexicon-building process. The effort will be significant progress towards a properly annotated lexicon, and its outcome will significantly help morphological analysis, automatic grammar extraction and machine translation for Bangla. Dewan Shahriar Hossain Pavel, Asif Iqbal Sarkar, Faisal Muhammad Shah, Mumit Kha