
    Text Line Segmentation of Historical Documents: a Survey

    Libraries and national archives hold a huge number of historical documents that have not been exploited electronically. Although automatic reading of complete pages remains, in most cases, a long-term objective, tasks such as word spotting, text/image alignment, authentication and extraction of specific fields are in use today. For all these tasks, a major step is segmenting the document into text lines. Because of the low quality and the complexity of these documents (background noise, artifacts due to aging, interfering lines), automatic text line segmentation remains an open research field. The objective of this paper is to present a survey of existing methods, developed during the last decade and dedicated to documents of historical interest.
    Comment: 25 pages, submitted version. To appear in International Journal on Document Analysis and Recognition. Online version available at http://www.springerlink.com/content/k2813176280456k3

    Robust off-line text independent writer identification using bagged discrete cosine transform features

    Efficient writer identification systems identify the authorship of an unknown sample of text with high confidence. This has made automatic writer identification a very important research topic in forensic document analysis. In this paper, we propose a robust system for offline, text-independent writer identification using bagged discrete cosine transform (BDCT) descriptors. Universal codebooks are first used to generate multiple predictor models. A final decision is then obtained by applying the majority voting rule to these predictor models. The BDCT approach allows DCT features to be effectively exploited for robust writer identification. The proposed system was first assessed on the original versions of handwritten documents from various datasets, and the results showed performance comparable with state-of-the-art systems. Next, blurry and noisy documents from two different datasets were considered in intensive experiments, where the system was shown to perform significantly better than its competitors. To the best of our knowledge, this is the first work that addresses the robustness aspect of automatic writer identification. This is particularly relevant to digital forensics, as the documents acquired by the analyst may not be in ideal condition.
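The majority voting step described in this abstract can be illustrated with a minimal sketch. It assumes each predictor model is simply a callable that maps a feature vector to a writer label; the names `majority_vote`, `bagged_predict` and the toy predictors are hypothetical and are not taken from the paper, which builds its predictors from universal codebooks over DCT descriptors.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# Hypothetical type: a predictor maps a feature vector to a writer label.
Predictor = Callable[[Sequence[float]], Hashable]


def majority_vote(labels: Sequence[Hashable]) -> Hashable:
    """Return the most frequent label; ties go to the label seen first."""
    if not labels:
        raise ValueError("at least one prediction is required")
    return Counter(labels).most_common(1)[0][0]


def bagged_predict(predictors: Sequence[Predictor],
                   features: Sequence[float]) -> Hashable:
    """Query every predictor model and combine their answers by majority vote."""
    return majority_vote([predict(features) for predict in predictors])


# Toy stand-ins for trained models (assumption: real models would be
# classifiers trained on DCT-based descriptors).
toy_predictors = [
    lambda f: "writer_A",
    lambda f: "writer_B",
    lambda f: "writer_A",
]
decision = bagged_predict(toy_predictors, [0.1, 0.2, 0.3])
```

With the three toy predictors above, two of the three votes go to the same label, so that label becomes the final decision.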

    Transductive Learning with String Kernels for Cross-Domain Text Classification

    For many text classification tasks, there is a major problem posed by the lack of labeled data in a target domain. Although classifiers for a target domain can be trained on labeled text data from a related source domain, the accuracy of such classifiers is usually lower in the cross-domain setting. Recently, string kernels have obtained state-of-the-art results in various text classification tasks such as native language identification or automatic essay scoring. Moreover, classifiers based on string kernels have been found to be robust to the distribution gap between different domains. In this paper, we formally describe an algorithm composed of two simple yet effective transductive learning approaches to further improve the results of string kernels in cross-domain settings. By adapting string kernels to the test set without using the ground-truth test labels, we report significantly better accuracy rates in cross-domain English polarity classification.
    Comment: Accepted at ICONIP 2018. arXiv admin note: substantial text overlap with arXiv:1808.0840
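As a rough illustration of what a string kernel computes, the sketch below implements a character n-gram spectrum kernel and its normalized form. This is an assumption made for illustration: the abstract does not state which kernel variant or n-gram length the authors use, and the transductive adaptation itself is not reproduced here. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_counts(text: str, n: int) -> Counter:
    """Count every contiguous character n-gram in text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def spectrum_kernel(a: str, b: str, n: int = 3) -> int:
    """Dot product of the character n-gram count vectors of a and b."""
    counts_a, counts_b = ngram_counts(a, n), ngram_counts(b, n)
    # Iterate over the smaller profile; a Counter returns 0 for missing keys.
    if len(counts_a) > len(counts_b):
        counts_a, counts_b = counts_b, counts_a
    return sum(count * counts_b[gram] for gram, count in counts_a.items())


def normalized_kernel(a: str, b: str, n: int = 3) -> float:
    """Spectrum kernel scaled so that identical non-empty profiles score 1."""
    denom = sqrt(spectrum_kernel(a, a, n) * spectrum_kernel(b, b, n))
    return spectrum_kernel(a, b, n) / denom if denom else 0.0


# Similarity between a labeled source-domain review and an unlabeled
# target-domain review; no test labels are needed to compute it.
similarity = normalized_kernel("great book, loved it", "great movie, loved it")
```

Because the kernel depends only on the strings themselves, similarities between training and test samples can be computed without test labels, which is the property a transductive approach relies on.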