Sub-word indexing and blind relevance feedback for English, Bengali, Hindi, and Marathi IR
The Forum for Information Retrieval Evaluation (FIRE) provides document collections, topics, and relevance assessments for information retrieval (IR) experiments on Indian languages. Several research questions are explored in this paper: (1) how to create a simple, language-independent corpus-based stemmer, (2) how to identify sub-words and which types of sub-words are suitable as indexing units, and (3) how to apply blind relevance feedback on sub-words and how feedback term selection is affected by the type of the indexing unit. More than 140 IR experiments are conducted using the BM25 retrieval model on the topic titles and descriptions (TD) for the FIRE 2008 English, Bengali, Hindi, and Marathi document collections.
The major findings are as follows. The corpus-based stemming approach is effective as a knowledge-light term conflation step and is useful when few language-specific resources are available. For English, the corpus-based stemmer performs nearly as well as the Porter stemmer and, when combined with query expansion, significantly better than the baseline of indexing words. In combination with blind relevance feedback, it also performs significantly better than the baseline for Bengali and Marathi IR.
Sub-words such as consonant-vowel sequences and word prefixes can yield similar or better performance in comparison to word indexing. No single method performs best for all languages. For English, indexing using the Porter stemmer performs best; for Bengali and Marathi, overlapping 3-grams obtain the best result; and for Hindi, 4-prefixes yield the highest MAP. However, in combination with blind relevance feedback using 10 documents and 20 terms, 6-prefixes for English and 4-prefixes for Bengali, Hindi, and Marathi IR yield the highest MAP.
Sub-word identification is a general case of decompounding. It results in one or more index terms for a single word form; it increases the number of index terms but decreases their average length. The corresponding retrieval experiments show that relevance feedback on sub-words benefits from selecting a larger number of index terms in comparison with retrieval on word forms. Similarly, selecting the number of relevance feedback terms depending on the ratio of word vocabulary size to sub-word vocabulary size almost always slightly increases information retrieval effectiveness compared to using a fixed number of terms for different languages.
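The two sub-word types that the abstract reports as strongest, overlapping character n-grams and fixed-length word prefixes, can be illustrated with a minimal Python sketch. The function names (`overlapping_ngrams`, `k_prefix`, `index_terms`), the whitespace tokenization, and the choice to keep short words whole are assumptions made for illustration; they are not taken from the paper. The sketch also slices by Unicode code point, which may split a consonant from its vowel sign in Bengali, Hindi, or Marathi text; how the paper handles this is not stated in the abstract.

```python
def overlapping_ngrams(word: str, n: int = 3) -> list[str]:
    """Return all overlapping character n-grams of a word.

    Words no longer than n are kept whole (an assumption of this sketch).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def k_prefix(word: str, k: int = 4) -> str:
    """Return the first k characters of a word (the whole word if shorter)."""
    return word[:k]


def index_terms(text: str, unit: str = "3gram") -> list[str]:
    """Map a text to index terms for a hypothetical indexing-unit setting."""
    terms: list[str] = []
    for token in text.lower().split():  # naive whitespace tokenization
        if unit == "3gram":
            terms.extend(overlapping_ngrams(token, 3))
        elif unit == "4prefix":
            terms.append(k_prefix(token, 4))
        else:  # fall back to plain word indexing
            terms.append(token)
    return terms


if __name__ == "__main__":
    print(index_terms("Information Retrieval", unit="4prefix"))
    print(index_terms("retrieval", unit="3gram"))
```

The n-gram setting produces several short index terms per word form while the prefix setting produces exactly one, which is consistent with the abstract's remark that sub-word identification increases the number of index terms and decreases their average length.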
Handwritten Text Recognition for Bengali
© 2016 IEEE.
Handwritten text recognition of Bengali is a difficult task because of the complex character shapes that arise from modified and compound characters, as well as the zone-wise writing styles of different individuals. Most research published so far on Bengali handwriting recognition deals with either isolated character recognition or isolated word recognition, and only a few papers have addressed the recognition of continuous handwritten Bengali. In this paper we present research on continuous handwritten Bengali text recognition. We follow a classical line-based recognition approach with a system based on hidden Markov models and n-gram language models. These models are trained with automatic methods from annotated data. We investigate both the maximum likelihood approach and the minimum phone error approach for training the optical models. We also investigate the use of word-based and character-based language models. The latter approach allows us to deal with the out-of-vocabulary word problem at test time when the training set is of limited size. The experiments yielded encouraging results.
This work has been partially supported through the European Union’s H2020 grant READ (Recognition and Enrichment of Archival Documents) (Ref: 674943) and partially supported by MINECO/FEDER, UE under project TIN2015-70924-C2-1-R.
Sánchez Peiró, JA.; Pal, U. (2016). Handwritten Text Recognition for Bengali. IEEE. https://doi.org/10.1109/ICFHR.2016.010
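The abstract's point about character-based language models and out-of-vocabulary words can be illustrated with a small Python sketch. It is not the system described in the paper: the bigram order, the add-one smoothing, the `^` and `$` boundary markers, the function names, and the Latin-script placeholder words are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter


def train_char_bigram(words):
    """Count character bigrams, using ^ and $ as word boundary markers."""
    bigrams, contexts, alphabet = Counter(), Counter(), set()
    for word in words:
        chars = ["^"] + list(word) + ["$"]
        alphabet.update(chars)
        for a, b in zip(chars, chars[1:]):
            bigrams[(a, b)] += 1
            contexts[a] += 1
    return bigrams, contexts, alphabet


def char_log_prob(word, model):
    """Log-probability of a word under the add-one smoothed bigram model."""
    bigrams, contexts, alphabet = model
    v = len(alphabet)  # smoothing constant: number of distinct symbols seen
    chars = ["^"] + list(word) + ["$"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + v))
        for a, b in zip(chars, chars[1:])
    )


def word_unigram_prob(word, words):
    """Unsmoothed word-level probability: zero for any unseen word."""
    return Counter(words)[word] / len(words)


if __name__ == "__main__":
    training = ["bangla", "bengal", "bengali"]  # placeholder vocabulary
    model = train_char_bigram(training)
    unseen = "bangali"
    print(word_unigram_prob(unseen, training))  # word model: no mass for an unseen word
    print(char_log_prob(unseen, model))         # character model: finite log-probability
```

An unsmoothed word-level model assigns zero probability to any word absent from the training set, whereas the character-level model can still score it from its character sequence, which is the property the abstract relies on when the training set is small.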