8 research outputs found

    Cross-document word matching for segmentation and retrieval of Ottoman divans

    Motivated by the need for automatic indexing and analysis of the huge number of documents in Ottoman divan poetry, and for discovering new knowledge to preserve this heritage and keep it alive, we propose a novel method for segmenting and retrieving words in Ottoman divans. Ottoman documents are difficult to segment into words without prior knowledge of the words. Divans exist in multiple copies (versions) by different writers in different writing styles, and word segmentation may be considerably easier in some versions than in others. We exploit this: versions that are difficult, if not impossible, to segment with traditional techniques are segmented using information carried over from a simpler version. One version of a document is used as the source dataset and another version of the same document as the target dataset. Words in the source dataset are automatically extracted and used as queries to be spotted in the target dataset for detecting word boundaries. We present the idea of cross-document word matching for the novel task of segmenting historical documents into words. We propose a matching scheme based on possible combinations of sequences of sub-words, and we improve the performance of simple features by considering words in context. The method is applied to two versions of the Layla and Majnun divan by Fuzuli. The results show that the proposed word-matching-based segmentation method is promising both for finding word boundaries and for retrieving words across documents. © 2014, Springer-Verlag London
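
The abstract describes spotting source words in the target by matching against combinations of consecutive sub-words. The sketch below illustrates only that enumeration-and-scoring idea and is not the authors' implementation. It assumes each sub-word is represented by a short string standing in for an image descriptor, and it uses `difflib` string similarity as a placeholder for an image-matching score. The names `candidate_runs`, `spot_word`, and `max_len` are hypothetical.

```python
from difflib import SequenceMatcher


def candidate_runs(num_subwords, max_len):
    """Yield (start, end) index pairs for every contiguous run of sub-words."""
    for start in range(num_subwords):
        for end in range(start + 1, min(start + max_len, num_subwords) + 1):
            yield start, end


def spot_word(query, target_subwords, max_len=4):
    """Return (start, end, score) of the sub-word run that best matches the query.

    Strings stand in for image features here (an assumption for illustration);
    a real system would compare descriptors extracted from word images.
    """
    best = (0, 0, -1.0)
    for start, end in candidate_runs(len(target_subwords), max_len):
        candidate = "".join(target_subwords[start:end])
        score = SequenceMatcher(None, query, candidate, autojunk=False).ratio()
        if score > best[2]:
            best = (start, end, score)
    return best


# Toy target line, already split into sub-words; the query comes from the source version.
target = ["da", "r", "ka", "la", "m"]
start, end, score = spot_word("kalam", target)
print(start, end, score)
```

In this toy example the best run covers sub-words 2 to 4, so the word boundaries in the target fall before index 2 and after index 4. Limiting runs to `max_len` sub-words keeps the number of candidates linear in the line length.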

    Historical document analysis based on word matching

    Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2011. Thesis (Master's), Bilkent University, 2011. Includes bibliographical references (leaves 67-76). Historical documents constitute a heritage that should be preserved, and providing an automatic retrieval and indexing scheme for these archives would benefit researchers from many disciplines and countries. Unfortunately, applying ordinary Optical Character Recognition (OCR) techniques to these documents is nearly impossible, since the documents are degraded and deformed. Recently, word matching methods have been proposed to access these documents. In this thesis, two historical document analysis problems are tackled based on word matching: word segmentation in historical documents and Islamic pattern matching in kufic images. In the first task, a cross-document word matching approach is proposed to segment historical documents into words. A version of a document in which word segmentation is easy is used as the source data set, and another version in a different writing style, which is more difficult to segment into words, is used as the target data set. The source data set is segmented into words by a simple method, and the extracted words are used as queries to be spotted in the target data set. Experiments on an Ottoman data set show that cross-document word matching is a promising method for segmenting historical documents into words. In the second task, lines are first extracted and sub-patterns are automatically detected in the images. The sub-patterns are then matched based on a line representation in two ways: by their chain code representation and by their shape contexts. Promising results are obtained for finding the instances of a query pattern and for fully automatic detection of repeating patterns on a square kufic image collection. Arifoğlu, Damla. M.S.

    A line-based representation for matching words in historical manuscripts

    In this study, we propose a new method for retrieving and recognizing words in historical documents. We represent word images with a set of line segments. Then we provide a criterion for word matching based on matching the lines. We carry out experiments on a benchmark dataset consisting of manuscripts by George Washington, as well as on Ottoman manuscripts. © 2011 Elsevier B.V. All rights reserved

    The use of optical character recognition technology in Ottoman Turkish research

    With the increasing availability of electronic archives of Ottoman text images, computer-aided studies of those texts are attracting more interest as a field of study. While some historical languages have been the subject of computational studies more than others, Ottoman is still among the low-resource languages in terms of machine-readable texts, despite the richness of its document image archives. The first step towards enabling computational processing of Ottoman Turkish texts is building an electronic text corpus. To achieve this, AI-powered Optical Character Recognition (OCR) tools need to be developed for Ottoman. Current OCR tools have limitations: for example, they can be used only on printed book pages set in Naskh type with a simple layout. The metrics for evaluating OCR performance target both physical segmentation and textual content recognition. In this paper I focus on the content recognition performance of the selected tool. I stored the OCR output and the reference text in separate txt files and wrote a short script to compare the contents of the files in terms of word matching. Even setting aside the complexities of page layout, problems with word-level segmentation mean that the tool's over-90% performance in letter recognition does not guarantee that the text is rendered accurately. Producing high-quality machine-readable texts therefore still depends on manual correction, checking the texts character by character. Improving the performance of AI-powered OCR tools depends heavily on high-quality training data. Because of the language characteristics of the documents, preparing pre-processed parallel data sets for the training and testing phases requires linguistic expertise in Ottoman Turkish. The data and methods of related fields should therefore be integrated when developing AI-powered tools, in order to make progress in computerized historical linguistics.
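
The gap between letter-level and word-level accuracy described above can be illustrated with a small comparison script. The abstract does not show the author's script, so this is only a sketch under stated assumptions: whitespace tokenization, alignment with `difflib.SequenceMatcher` from the Python standard library, and accuracy defined as the fraction of reference items that are aligned. The function names and the English toy strings are hypothetical.

```python
from difflib import SequenceMatcher


def match_ratio(reference, hypothesis):
    """Fraction of reference items (characters or words) aligned with the hypothesis."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    matcher = SequenceMatcher(None, reference, hypothesis, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


def char_and_word_accuracy(reference_text, ocr_text):
    """Return (character accuracy, word accuracy) for one OCR output."""
    ref_words = reference_text.split()
    ocr_words = ocr_text.split()
    # Character level: compare the whitespace-normalized strings.
    char_acc = match_ratio(" ".join(ref_words), " ".join(ocr_words))
    # Word level: compare the token lists, so a wrong word boundary costs a whole word.
    word_acc = match_ratio(ref_words, ocr_words)
    return char_acc, word_acc


# Toy example: every letter is recognized, but two word boundaries are wrong.
reference = "the old book was read"
ocr_output = "theold bo ok was read"
char_acc, word_acc = char_and_word_accuracy(reference, ocr_output)
print(f"character accuracy: {char_acc:.2f}, word accuracy: {word_acc:.2f}")
```

In the toy example the character accuracy stays above 0.9 while only two of the five reference words survive intact, which mirrors the point that segmentation errors can hide behind a high letter-recognition score. For real files, the two strings would be read from the stored OCR output and reference txt files.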

    Weathered Words : Formulaic Language and Verbal Art

    Formulaic phraseology presents the epitome of words worn and weathered by trial and the tests of time. Scholarship on weathered words is exceptionally diverse and interdisciplinary. This volume focuses on verbal art, which makes Oral-Formulaic Theory (OFT) a major point of reference. Yet weathered words are but a part of OFT, and OFT is only a part of scholarship on weathered words. Each of the eighteen essays gathered here brings particular aspects of formulaic language into focus. No volume on such a diverse topic can be all-encompassing, but the essays highlight aspects of the phenomenon that may be eclipsed elsewhere: they diverge not only in style, but sometimes even in how they choose to define “formula.” As such, they offer overlapping frames that complement one another both in their convergences and their contrasts. While they view formulaicity from multifarious angles, they unite in a Picasso of perspectives on which the reader can reflect and draw insight. Peer reviewed