Motivated by the need for automatic indexing and analysis
of the huge number of documents in Ottoman divan poetry,
and for discovering new knowledge to preserve this heritage
and keep it alive, in this study we propose a novel method
for segmenting and retrieving words in Ottoman divans.
Documents in Ottoman are difficult to segment into words
without prior knowledge of
the words. The method exploits the fact that divans exist
in multiple copies (versions) by different writers in
different writing styles, and that word segmentation may be
considerably easier in some versions than in others.
Versions that are difficult, if not impossible, to segment
with traditional techniques are segmented using
information carried over from the simpler version. One
version of a document is used as the source dataset and the
other version of the same document is used as the target
dataset. Words in the source dataset are automatically
extracted and used as queries to be spotted in the target
dataset for detecting word boundaries. We present the idea
of cross-document word matching for the novel task of
segmenting historical documents into words. We propose a
matching scheme based on possible combinations of
sequences of sub-words. We improve the performance of
simple features by considering the words in context.
The method is applied to two versions of the Layla and
Majnun divan by Fuzuli. The results show that the proposed
word-matching-based segmentation method is promising for
finding word boundaries and for retrieving words across
documents.
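
As a rough illustration of the matching idea described above, the following Python sketch spots one source word in a target line by trying every run of consecutive target sub-words and keeping the closest one. It is only a sketch under stated assumptions, not the method evaluated in the study: the per-sub-word feature (a vertical projection profile), the resampling length, the L1 distance, the `max_run` limit, and all function names are hypothetical choices made for this example, and the context-based improvement is not modeled.

```python
from typing import List, Sequence, Tuple

# Assumption: each sub-word is described by a non-empty 1-D profile
# (e.g. ink counts per pixel column). The study's actual features may differ.
Profile = List[float]


def resample(profile: Sequence[float], length: int = 16) -> List[float]:
    """Nearest-neighbour resampling of a non-empty profile to a fixed length."""
    n = len(profile)
    return [profile[min(n - 1, (i * n) // length)] for i in range(length)]


def distance(a: Sequence[float], b: Sequence[float], length: int = 16) -> float:
    """Mean absolute difference between two profiles after resampling."""
    ra, rb = resample(a, length), resample(b, length)
    return sum(abs(x - y) for x, y in zip(ra, rb)) / length


def spot_word(
    query_subwords: Sequence[Profile],
    target_subwords: Sequence[Profile],
    max_run: int = 4,
) -> Tuple[float, int, int]:
    """Return (best distance, start, end) so that target_subwords[start:end]
    is the run of consecutive sub-words closest to the query word."""
    # The query word is the concatenation of its own sub-word profiles.
    query = [v for sw in query_subwords for v in sw]
    best = (float("inf"), 0, 0)
    for start in range(len(target_subwords)):
        for run in range(1, max_run + 1):
            end = start + run
            if end > len(target_subwords):
                break
            # Candidate word: concatenation of a run of target sub-words.
            candidate = [v for sw in target_subwords[start:end] for v in sw]
            d = distance(query, candidate)
            if d < best[0]:
                best = (d, start, end)
    return best
```

The returned `start` and `end` indices act as hypothesized word boundaries in the target line. Because candidates are built from runs of sub-words, a word written as two sub-words in the source can still match a target that splits the same strokes differently.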