Cross-document word matching for segmentation and retrieval of Ottoman divans
Motivated by the need for automatic indexing and analysis of the huge number of documents in Ottoman divan poetry, and for discovering new knowledge to preserve this heritage and keep it alive, in this study we propose a novel method for segmenting and retrieving words in Ottoman divans. Documents in Ottoman are difficult to segment into words without prior knowledge of the words. Divans have multiple copies (versions) by different writers in different writing styles, and word segmentation may be relatively easier to achieve in some of those versions than in others. Using this idea, segmentation of the harder versions (which is difficult, if not impossible, with traditional techniques) is performed using information carried over from the simpler version. One version of a document is used as the source dataset and the other version of the same document is used as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset for detecting word boundaries. We present the idea of cross-document word matching for the novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequences of sub-words. We improve the performance of simple features by considering the words in context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising in finding word boundaries and in retrieving words across documents.
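The idea of matching a query word against combinations of adjacent sub-words can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features or the distance, so each sub-word is reduced here to a single hypothetical number (its pixel width), and the names `contiguous_runs` and `best_run` are invented for illustration.

```python
from typing import List, Sequence, Tuple


def contiguous_runs(n: int, max_len: int) -> List[Tuple[int, int]]:
    """All half-open spans (start, end) covering 1..max_len adjacent sub-words."""
    return [(s, s + k) for k in range(1, max_len + 1) for s in range(n - k + 1)]


def best_run(query: float, subwords: Sequence[float],
             max_len: int = 3) -> Tuple[Tuple[int, int], float]:
    """Return the span whose combined toy feature is closest to the query.

    Assumption: a run of sub-words is described by the sum of their widths,
    and the matching cost is the absolute difference to the query width.
    """
    scored = [((s, e), abs(sum(subwords[s:e]) - query))
              for s, e in contiguous_runs(len(subwords), max_len)]
    return min(scored, key=lambda item: item[1])


# Toy target line: widths of five consecutive sub-words (hypothetical values).
subword_widths = [12, 30, 8, 25, 14]
span, cost = best_run(38, subword_widths)
print(span, cost)  # (1, 3) 0
```

The selected span marks candidate word boundaries in the target line: here the query is best explained by the second and third sub-words taken together.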
A line-based representation for matching words in historical manuscripts
In this study, we propose a new method for retrieving and recognizing words in historical documents. We represent word images with a set of line segments. Then we provide a criterion for word matching based on matching the lines. We carry out experiments on a benchmark dataset consisting of manuscripts by George Washington, as well as on Ottoman manuscripts. © 2011 Elsevier B.V. All rights reserved.
End-Shape Analysis for Automatic Segmentation of Arabic Handwritten Texts
Word segmentation is an important task for many methods related to document understanding, especially word spotting and word recognition. Several word segmentation approaches have been proposed for Latin-based languages, while only a few have been introduced for Arabic texts. The fact that Arabic writing is cursive by nature and unconstrained, with no clear boundaries between words, makes the processing of Arabic handwritten text a more challenging problem.
In this thesis, the design and implementation of an End-Shape Letter (ESL) based segmentation system for Arabic handwritten text is presented. This incorporates four novel aspects: (i) removal of secondary components, (ii) baseline estimation, (iii) ESL recognition, and (iv) the creation of a new off-line CENPARMI ESL database.
Arabic texts include small connected components, also called secondary components. Removing these components can improve the performance of several systems, such as baseline estimation. Thus, a robust method to remove secondary components that takes into consideration the challenges of Arabic handwriting is introduced. The method reconstructs the image based on a set of criteria. The results of this method were subsequently compared with those of two other methods that used the same database. The results show that the proposed method is effective.
Baseline estimation is a challenging task for Arabic texts since they include ligatures, overlapping, and secondary components. Therefore, we propose a learning-based approach that addresses these challenges. Our method analyzes the image and extracts baseline-dependent features. Then, the baseline is estimated using a classifier.
Algorithms dealing with text segmentation usually analyze the gaps between connected components. These algorithms are based on metric calculation, threshold finding, and/or gap classification. We use two well-known metrics, bounding box and convex hull, to test the metric-based method on Arabic handwritten texts and to include this technique in our approach. To determine the threshold, an unsupervised learning approach, known as the Gaussian Mixture Model, is used. Our ESL-based segmentation approach extracts the final letter of a word using a rule-based technique and recognizes these letters using the implemented ESL classifier.
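The thresholding step can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis's implementation: gap distances are assumed to be already measured (for example from bounding boxes), a two-component one-dimensional Gaussian mixture is fitted with plain EM, and a gap is labeled a word gap when the wider component explains it better. The function names, the initialization, and the variance floor are all hypothetical choices.

```python
import math
from typing import List, Sequence, Tuple

Model = Tuple[List[float], List[float], List[float]]  # weights, means, variances


def _pdf(x: float, mean: float, var: float) -> float:
    """Density of a 1-D Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(gaps: Sequence[float], iters: int = 50,
                      var_floor: float = 0.25) -> Model:
    """Fit a 2-component mixture to gap widths with EM (assumed setup)."""
    lo, hi = min(gaps), max(gaps)
    mean = [float(lo), float(hi)]          # start at the extremes
    init_var = max(((hi - lo) / 4) ** 2, var_floor)
    var = [init_var, init_var]
    weight = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each gap.
        resp = []
        for g in gaps:
            p = [weight[k] * _pdf(g, mean[k], var[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            mean[k] = sum(r[k] * g for r, g in zip(resp, gaps)) / nk
            spread = sum(r[k] * (g - mean[k]) ** 2 for r, g in zip(resp, gaps)) / nk
            var[k] = max(spread, var_floor)
            weight[k] = nk / len(gaps)
    return weight, mean, var


def is_word_gap(gap: float, model: Model) -> bool:
    """True when the wider-gap component is the more likely source of this gap."""
    weight, mean, var = model
    wide = 0 if mean[0] > mean[1] else 1
    p = [weight[k] * _pdf(gap, mean[k], var[k]) for k in range(2)]
    return p[wide] > p[1 - wide]


# Hypothetical gap widths in pixels along one text line.
gaps = [2, 3, 2, 4, 3, 15, 2, 3, 18, 3, 2, 16]
model = fit_two_gaussians(gaps)
print([is_word_gap(g, model) for g in gaps])
```

Comparing the two weighted densities is equivalent to applying a threshold at the point where they cross, which is one common way to read a decision threshold off a fitted mixture.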
To demonstrate the benefit of text segmentation, a holistic word spotting system is implemented. For this system, a word recognition system is implemented. A series of experiments with different sets of features is conducted. The system shows promising results.
OTS: A One-shot Learning Approach for Text Spotting in Historical Manuscripts
Historical manuscript processing poses challenges like limited annotated
training data and novel class emergence. To address this, we propose a novel
One-shot learning-based Text Spotting (OTS) approach that accurately and
reliably spots novel characters with just one annotated support sample. Drawing
inspiration from cognitive research, we introduce a spatial alignment module
that finds, focuses on, and learns the most discriminative spatial regions in
the query image based on one support image. In particular, since the low-resource
spotting task often faces the problem of example imbalance, we propose a novel
loss function called torus loss which can make the embedding space of distance
metric more discriminative. Our approach is highly efficient and requires only
a few training samples while exhibiting the remarkable ability to handle novel
characters and symbols. To enhance dataset diversity, a new manuscript dataset
that contains the ancient Dongba hieroglyphics (DBH) is created. We conduct
experiments on the publicly available VML-HD, TKH, and NC datasets and the newly
proposed DBH dataset. The experimental results demonstrate that OTS outperforms
the state-of-the-art methods in one-shot text spotting. Overall, our proposed
method offers promising applications in the field of text spotting in
historical manuscripts.
A new representation for matching words
Ankara: The Department of Computer Engineering and the Institute of Engineering and Sciences of Bilkent University, 2007. Thesis (Master's) -- Bilkent University, 2007. Includes bibliographical references, leaves 77-82.
Large archives of historical documents challenge researchers all over the world: they remain largely inaccessible because manual indexing and transcription of such a huge volume is difficult. In addition, electronic imaging tools and image processing techniques gain importance with the rapid increase in the digitization of materials in libraries and archives. In this thesis, a language-independent method is proposed for the representation of word images, which leads to retrieval and indexing of documents. While character recognition methods suffer from preprocessing and overtraining, we make use of another method, which is based on extracting words from documents and representing each word image with the features of invariant regions. The bag-of-words approach, which has been shown to be successful in classifying objects and scenes, is adapted for matching words. Since curvature, connection points, and dots are important visual features for distinguishing two words from each other, we make use of salient points, which have been shown to be successful in representing such distinctive areas and are heavily used for matching. The Difference of Gaussian (DoG) detector, which is able to find scale-invariant regions, and the Harris Affine detector, which detects affine-invariant regions, are used for detection of such areas, and the detected keypoints are described with Scale Invariant Feature Transform (SIFT) features. Then, each word image is represented by a set of visual terms obtained by vector quantization of the SIFT descriptors, and similar words are matched based on the similarity of these representations using different distance measures. These representations are used both for document retrieval and for word spotting. The experiments are carried out on Arabic, Latin, and Ottoman datasets, which include different writing styles and different writers. The results show that the proposed method is successful at retrieval and indexing of documents even across different scripts and different writers, and since it is language independent, it can easily be adapted to other languages as well. The retrieval performance of the system is comparable to the state-of-the-art methods in this field. In addition, the system is successful at capturing semantic similarities, which is useful for indexing, and it does not include any supervised step.
Ataer, Esra. M.S.
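The representation step can be sketched as follows, under loud assumptions: keypoint detection and SIFT description are omitted, descriptors are toy 2-D vectors instead of 128-D SIFT vectors, the codebook is given by hand (in practice it would typically come from clustering training descriptors), and L1 distance stands in for the unspecified "different distance measures". The function names are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Vector = Tuple[float, ...]


def quantize(descriptor: Vector, codebook: Sequence[Vector]) -> int:
    """Vector quantization: index of the nearest codeword (visual term)."""
    return min(range(len(codebook)),
               key=lambda i: math.dist(descriptor, codebook[i]))


def bag_of_visual_terms(descriptors: Sequence[Vector],
                        codebook: Sequence[Vector]) -> List[float]:
    """Normalized histogram of visual terms for one word image."""
    counts = Counter(quantize(d, codebook) for d in descriptors)
    total = len(descriptors) or 1  # avoid division by zero for empty input
    return [counts[i] / total for i in range(len(codebook))]


def l1_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """One possible distance between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


# Toy codebook with three visual terms and three toy word images.
codebook = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
word_a = [(1, 0), (9, 1), (0, 9), (1, 1)]
word_b = [(0, 1), (11, 0), (1, 10), (0, 0)]   # similar keypoint mix to word_a
word_c = [(10, 1), (9, 0), (11, 1), (0, 9)]   # different keypoint mix

hist_a, hist_b, hist_c = (bag_of_visual_terms(w, codebook)
                          for w in (word_a, word_b, word_c))
print(l1_distance(hist_a, hist_b), l1_distance(hist_a, hist_c))  # 0.0 1.0
```

Because the histogram discards keypoint positions, two word images match whenever they contain a similar mix of visual terms, which is what makes the representation independent of the script.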
Cross-document word matching for segmentation and retrieval of Ottoman divans
Motivated by the need for automatic indexing and analysis of the huge number of documents in Ottoman divan poetry, and for discovering new knowledge to preserve this heritage and keep it alive, in this study we propose a novel method for segmenting and retrieving words in Ottoman divans. Documents in Ottoman are difficult to segment into words without prior knowledge of the words. Divans have multiple copies (versions) by different writers in different writing styles, and word segmentation may be relatively easier to achieve in some of those versions than in others. Using this idea, segmentation of the harder versions (which is difficult, if not impossible, with traditional techniques) is performed using information carried over from the simpler version. One version of a document is used as the source dataset and the other version of the same document is used as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset for detecting word boundaries. We present the idea of cross-document word matching for the novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequences of sub-words. We improve the performance of simple features by considering the words in context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising in finding word boundaries and in retrieving words across documents. © 2014, Springer-Verlag London
A Line-based representation for matching words
Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2009. Thesis (Master's) -- Bilkent University, 2009. Includes bibliographical references, leaves 46-49.
With the increase in the number of documents available in the digital environment, efficient access to the documents becomes crucial. However, manual indexing of the documents is costly and can be carried out only in limited amounts. Therefore, automatic analysis of documents is crucial. Although plenty of effort has been spent on optical character recognition (OCR), most of the existing OCR systems fail to address the challenge of recognizing characters in historical documents on account of the poor quality of old documents, the high level of noise, and the variety of scripts. More importantly, OCR systems are usually language dependent and not available for all languages. Word spotting techniques have recently been proposed to access historical documents, based on the idea that humans read whole words at a time. In these studies, words rather than characters are considered the basic units. Due to the poor quality of historical documents, the representation and matching of words continue to be challenging problems for word spotting. In this study, we address these challenges and propose a simple but effective method for the representation of word images by a set of line descriptors. Then, two different matching criteria making use of the line-based representation are proposed. We apply our methods to the word spotting and redif extraction tasks. The proposed line-based representation does not require any specific pre-processing steps and is applicable to different languages and scripts. In the word spotting task, our results provide higher scores than the existing word spotting studies in terms of retrieval and recognition performance. In the redif extraction task, we obtain promising results, providing a motivation for further and more advanced studies on Ottoman literary texts.
Can, Ethem Fatih. M.S.
HMM word graph based keyword spotting in handwritten document images
Line-level keyword spotting (KWS) is presented on the basis of frame-level word posterior probabilities. These posteriors are obtained using word graphs derived from the recognition process of a full-fledged handwritten text recognizer based on hidden Markov models and N-gram language models. This approach has several advantages. First, since it uses a holistic, segmentation-free technology, it does not require any kind of word or character segmentation. Second, the use of language models allows the context of each spotted word to be taken into account, thereby considerably increasing KWS accuracy. And third, the proposed KWS scores are based on true posterior probabilities, taking into account all (or most) possible word segmentations of the input image. These scores are properly bounded and normalized. This mathematically clean formulation lends itself to smooth, threshold-based keyword queries which, in turn, permit comfortable trade-offs between search precision and recall. Experiments are carried out on several historic collections of handwritten text images, as well as a well-known data set of modern English handwritten text. According to the empirical results, the proposed approach achieves KWS results comparable to those obtained with the recently introduced "BLSTM neural networks KWS" approach and clearly outperforms the popular, state-of-the-art "Filler HMM" KWS method. Overall, the results clearly support all the above-claimed advantages of the proposed approach.
This work has been partially supported by the Generalitat Valenciana under the Prometeo/2009/014 project grant ALMA-MATER, and through the EU projects HIMANIS (JPICH programme, Spanish grant Ref. PCIN-2015-068) and READ (Horizon 2020 programme, grant Ref. 674943).
Toselli, AH.; Vidal, E.; Romero, V.; Frinken, V. (2016). HMM word graph based keyword spotting in handwritten document images. Information Sciences. 370:497-518. https://doi.org/10.1016/j.ins.2016.07.063
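The precision and recall trade-off behind threshold-based keyword queries can be illustrated with a small sketch. It assumes each text line already carries a bounded score in [0, 1] for the queried keyword and a ground-truth relevance flag; how the paper derives line-level scores from frame-level posteriors is not reproduced here, and the function name and the sample values are hypothetical.

```python
from typing import Sequence, Tuple


def precision_recall(scores: Sequence[float], relevant: Sequence[bool],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when lines scoring >= threshold are retrieved."""
    retrieved = [rel for s, rel in zip(scores, relevant) if s >= threshold]
    hits = sum(retrieved)
    total_relevant = sum(relevant)
    # Convention (an assumption): empty result sets get precision 1.0.
    precision = hits / len(retrieved) if retrieved else 1.0
    recall = hits / total_relevant if total_relevant else 1.0
    return precision, recall


# Hypothetical keyword scores for five lines and whether each truly contains it.
scores = [0.95, 0.80, 0.60, 0.30, 0.10]
relevant = [True, True, False, True, False]

for threshold in (0.7, 0.2):
    print(threshold, precision_recall(scores, relevant, threshold))
```

With these values, the stricter threshold retrieves only correct lines but misses one relevant line, while the looser threshold finds all relevant lines at the cost of one false positive.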