Search CORE

249 research outputs found

Improving the quality of Gujarati-Hindi Machine Translation through part-of-speech tagging and stemmer-assisted transliteration

Author: Ameta Juhi
Joshi Nisheeth
Mathur Iti
Publication venue
Publication date: 01/06/2013
Field of study

Machine Translation for Indian languages is an emerging research area. Transliteration is one such module that we design while designing a translation system. Transliteration means mapping of source language text into the target language. Simple mapping decreases the efficiency of overall translation system. We propose the use of stemming and part-of-speech tagging for transliteration. The effectiveness of translation can be improved if we use part-of-speech tagging and stemming assisted transliteration.We have shown that much of the content in Gujarati gets transliterated while being processed for translation to Hindi language

CogPrints Cognitive Sciences Eprint Archive

Morphological analysis of the Slovak language

Author: Hládek Daniel
Juhár Jozef
Staš Ján
Publication venue: 'VSB Technical University of Ostrava, Faculty of Electrical Engineering and Computer Sciences'
Publication date: 01/01/2015
Field of study

This paper proposes a new statistic-based method of segmenting words by identification of a suffix. Ability to identify suffix can improve morphological analysis by allowing the classifier to assign tags to words previously unseen in the training corpus. Identified suffix of the word can be used to improve the accuracy of the part-of-speech tagging or other natural language processing task

Crossref

Directory of Open Access Journals

DSpace at VSB Technical University of Ostrava

Maximum Entropy Language Modeling for Russian ASR

Author: Fügen Christian
Kilgour Kevin
Shin Evgeniy
Stüker Sebastian
Waibel Alex
Publication venue: Association for Computational Linguistics
Publication date: 03/01/2024
Field of study

Russian is a challenging language for automatic speech recognition systems due to its rich morphology. This rich morphology stems from Russian’s highly inflectional nature and the frequent use of preand suffixes. Also, Russian has a very free word order, changes in which are used to reflect connotations of the sentences. Dealing with these phenomena is rather difficult for traditional n-gram models. We therefore investigate in this paper the use of a maximum entropy language model for Russian whose features are specifically designed to deal with the inflections in Russian, as well as the loose word order. We combine this with a subword based language model in order to alleviate the problem of large vocabulary sizes necessary for dealing with highly inflecting languages. Applying the maximum entropy language model during re-scoring improves the word error rate of our recognition system by 1.2% absolute, while the use of the sub-word based language model reduces the vocabulary size from 120k to 40k and the OOV rate from 4.8% to 2.1%

KITopen

Genetic Algorithm (GA) in Feature Selection for CRF Based Manipuri Multiword Expression (MWE) Identification

Author: Bandyopadhyay Sivaji
Nongmeikapam Kishorjit
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 10/11/2011
Field of study

This paper deals with the identification of Multiword Expressions (MWEs) in Manipuri, a highly agglutinative Indian Language. Manipuri is listed in the Eight Schedule of Indian Constitution. MWE plays an important role in the applications of Natural Language Processing(NLP) like Machine Translation, Part of Speech tagging, Information Retrieval, Question Answering etc. Feature selection is an important factor in the recognition of Manipuri MWEs using Conditional Random Field (CRF). The disadvantage of manual selection and choosing of the appropriate features for running CRF motivates us to think of Genetic Algorithm (GA). Using GA we are able to find the optimal features to run the CRF. We have tried with fifty generations in feature selection along with three fold cross validation as fitness function. This model demonstrated the Recall (R) of 64.08%, Precision (P) of 86.84% and F-measure (F) of 73.74%, showing an improvement over the CRF based Manipuri MWE identification without GA application.Comment: 14 pages, 6 figures, see http://airccse.org/journal/jcsit/1011csit05.pd

arXiv.org e-Print Archive

Crossref

Methods for Amharic part-of-speech tagging

Author: Argaw Atelach Alemu
Asker Lars
Gambäck Björn
Olsson Fredrik
Publication venue
Publication date: 01/01/2009
Field of study

The paper describes a set of experiments involving the application of three state-of- the-art part-of-speech taggers to Ethiopian Amharic, using three different tagsets. The taggers showed worse performance than previously reported results for Eng- lish, in particular having problems with unknown words. The best results were obtained using a Maximum Entropy ap- proach, while HMM-based and SVM- based taggers got comparable results

Crossref

Publikationer från Stockholms universitet

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Context-Aware Stemming algorithm for semantically related root words

Author: Abidoye Ademola P.
Adesina Ademola Olusola
Agbele Kehinde K.
Azeez Nureni A.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE) Inc.
Publication date: 01/01/2012
Field of study

There is a growing interest in the use of context-awareness as a technique for developing pervasive computing applications that are flexible and adaptable for users. In this context, however, information retrieval (IR) is often defined in terms of location and delivery of documents to a user to satisfy their information need. In most cases, morphological variants of words have similar semantic interpretations and can be considered as equivalent for the purpose of IR applications. Consequently, document indexing will also be more meaningful if semantically related root words are used instead of stems. The popular Porter’s stemmer was studied with the aim to produce intelligible stems. In this paper, we propose Context-Aware Stemming (CAS) algorithm, which is a modified version of the extensively used Porter’s stemmer. Considering only generated meaningful stemming words as the stemmer output, the results show that the modified algorithm significantly reduces the error rate of Porter’s algorithm from 76.7% to 6.7% without compromising the efficacy of Porter’s algorithm

EUSpace

University of the Western Cape Research Repository

Large vocabulary continuous speech recognition of an inflected language using stems and endings

Author: Bellman
Beyerlein
Comrie
Deshmukh
Dimec
Kwon
Mirjam Sepesy Maučec
Mohri
Ohtsuki
Popovič
Sepesy
Sixtus
Tomaž Rotovnik
Zdravko Kačič
Publication venue: 'Elsevier BV'
Publication date
Field of study

Crossref

Open-source resources and standards for Arabic word structure analysis: Fine grained morphological analysis of Arabic text corpora

Author: Sawalha Majdi Shaker Salem
Publication venue: University of Leeds
Publication date: 01/01/2011
Field of study

Morphological analyzers are preprocessors for text analysis. Many Text Analytics applications need them to perform their tasks. The aim of this thesis is to develop standards, tools and resources that widen the scope of Arabic word structure analysis - particularly morphological analysis, to process Arabic text corpora of different domains, formats and genres, of both vowelized and non-vowelized text. We want to morphologically tag our Arabic Corpus, but evaluation of existing morphological analyzers has highlighted shortcomings and shown that more research is required. Tag-assignment is significantly more complex for Arabic than for many languages. The morphological analyzer should add the appropriate linguistic information to each part or morpheme of the word (proclitic, prefix, stem, suffix and enclitic); in effect, instead of a tag for a word, we need a subtag for each part. Very fine-grained distinctions may cause problems for automatic morphosyntactic analysis – particularly probabilistic taggers which require training data, if some words can change grammatical tag depending on function and context; on the other hand, finegrained distinctions may actually help to disambiguate other words in the local context. The SALMA – Tagger is a fine grained morphological analyzer which is mainly depends on linguistic information extracted from traditional Arabic grammar books and prior knowledge broad-coverage lexical resources; the SALMA – ABCLexicon. More fine-grained tag sets may be more appropriate for some tasks. The SALMA –Tag Set is a theory standard for encoding, which captures long-established traditional fine-grained morphological features of Arabic, in a notation format intended to be compact yet transparent. The SALMA – Tagger has been used to lemmatize the 176-million words Arabic Internet Corpus. It has been proposed as a language-engineering toolkit for Arabic lexicography and for phonetically annotating the Qur’an by syllable and primary stress information, as well as, fine-grained morphological tagging

White Rose E-theses Online

OpenGrey Repository