3 research outputs found

    Fusion de résultats en recherche d'information : application aux documents manuscrits en-ligne

    This work presents the results of a study on combining the two main existing approaches to retrieving online handwritten documents. The first applies information retrieval (IR) methods to the text produced by a handwriting recognition process. The second requires no explicit recognition and relies on a word spotting algorithm. Fusing the two improves retrieval performance. The results show that for texts with a word-level error rate below 23%, performance after fusion is comparable to that obtained on the ground truth. Improvements are also observed for heavily degraded texts.
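
The abstract does not say which fusion rule the authors use. As a minimal sketch of what fusing two ranked result lists can look like, the Python below applies CombSUM with min-max normalization, a common score-fusion rule chosen here purely as an assumption. The function names, document IDs, and scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every document as equally relevant.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def comb_sum(*runs):
    """Fuse several {doc_id: score} runs by summing normalized scores.

    A document missing from a run contributes 0 for that run.
    Returns (doc_id, fused_score) pairs, best first; ties break by doc_id.
    """
    fused = {}
    for run in runs:
        for doc, s in min_max_normalize(run).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical scores: one run from IR over recognized text,
# one run from word spotting.
ir_run = {"d1": 2.0, "d2": 1.0, "d3": 0.0}
spotting_run = {"d2": 0.9, "d3": 0.5, "d4": 0.1}
ranking = comb_sum(ir_run, spotting_run)
```

In this toy example, `d2` ranks first because both systems retrieve it, which is the intuition behind fusing a recognition-based run with a word spotting run.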

    Text retrieval from early printed books


    A novel image matching approach for word spotting

    Word spotting has been adopted by various researchers as a complementary technique to Optical Character Recognition for document analysis and retrieval. Its applications include document indexing, image retrieval and information filtering. The important factors in word spotting techniques are pre-processing, the selection and extraction of suitable features, and the image matching algorithm. The Correlation Similarity Measure (CORR) algorithm, originally defined for finding similarities between binary patterns, is regarded as a fast matching algorithm. In the word spotting literature, CORR has been used successfully to compare Gradient, Structural and Concavity (GSC) binary features extracted from binary word images. The problem with this approach is that binarizing the images discards very useful information. Furthermore, before GSC binary features can be extracted, the word images must be skew corrected and slant normalized, which is difficult and in some cases impossible for Arabic and modified Arabic scripts. We present a new approach in which the CORR algorithm is used to compare grayscale word images. This approach requires no binarization, skew correction or slant normalization of the word images. Features such as projection profiles, word profiles and transitional features are extracted from the grayscale word images and converted into their binary equivalents, which are then compared using the CORR algorithm with greater speed and higher accuracy. The experiments were conducted on grayscale versions of newly created handwritten databases of the Pashto and Dari languages, which are written in modified Arabic scripts. For each language we used 4599 words belonging to 21 different word classes, collected from 219 writers. The average precision rates achieved for Pashto and Dari were 93.18% and 93.75%, respectively. Matching a pair of images took 1.43 milliseconds. In addition, we present the handwritten databases for these two well-known Indo-Iranian languages. These are large databases containing six types of data (Dates, Isolated Digits, Numeral Strings, Isolated Characters, Different Words and Special Symbols), written by native speakers of the corresponding languages.
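
To illustrate the kind of comparison the abstract describes, here is a minimal Python sketch of a correlation similarity measure between two binary feature vectors. It uses the phi coefficient rescaled to [0, 1], a form commonly given for binary-vector correlation. Whether this matches the exact CORR definition used by the authors is an assumption, as are the function name and the handling of constant vectors.

```python
import math


def corr_similarity(x, y):
    """Correlation similarity between two equal-length binary vectors.

    Counts the four agreement/disagreement cases and returns
    0.5 + (s11*s00 - s10*s01) / (2 * sqrt(product of marginals)),
    so 1.0 means identical patterns and 0.0 means exact complements.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    s11 = s10 = s01 = s00 = 0
    for a, b in zip(x, y):
        if a and b:
            s11 += 1
        elif a:
            s10 += 1
        elif b:
            s01 += 1
        else:
            s00 += 1
    denom = math.sqrt((s10 + s11) * (s01 + s00) * (s11 + s01) * (s00 + s10))
    if denom == 0:
        # Correlation is undefined when either vector is constant.
        # Assumption: identical vectors score 1.0, anything else is neutral.
        return 1.0 if s10 == 0 and s01 == 0 else 0.5
    return 0.5 + (s11 * s00 - s10 * s01) / (2 * denom)


# Hypothetical binary feature vectors for two word images.
same = corr_similarity([1, 0, 1, 0], [1, 0, 1, 0])
opposite = corr_similarity([1, 0, 1, 0], [0, 1, 0, 1])
```

Because the measure needs only four counters and one square root per pair, it is cheap to evaluate, which is consistent with the abstract's emphasis on matching speed.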