    Text Search in Document Images Based on Hausdorff Distance Measures

    Hausdorff-type distances between sets of points in the plane are commonly used similarity measures for binary images. In this work we present several such measures in a unified manner and introduce a new, naturally arising variant of the Hausdorff distance. The matching performance of all the similarity measures is compared in computer experiments using real word images from a scanned book.

    Hausdorff distances for searching in binary text images

    Hausdorff distance (HD) seems to be the most efficient instrument for measuring how far two compact non-empty subsets of a metric space are from each other. This paper considers the possibilities provided by HD and by some of its modifications that many authors have recently used to measure the resemblance between binary text images. Summarizing part of the existing word image matching methods that rely on HD, we investigate a new, similar parameterized method which contains almost all of them as particular cases. Numerical experiments for searching words in binary text images are carried out with 333 pages of old Bulgarian typewritten text, 200 printed pages of a Bulgarian Chrestomathy from 1884, and 200 handwritten pages of a Slavonic manuscript from 1574. They outline how the parameters must be set in order to exploit the advantages of the proposed method for word matching in scanned document images.
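As a point of reference for the abstracts above, the following is a minimal sketch of the classical (unmodified) Hausdorff distance between two finite point sets, such as the foreground pixels of two binary word images. It is not the parameterized method proposed in the paper; the function names `directed_hausdorff`, `hausdorff`, and `foreground_points`, and the choice of the Euclidean metric, are assumptions made for illustration.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from a point of `a` to its nearest point in `b`."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Classical symmetric Hausdorff distance between two non-empty point sets."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


def foreground_points(image):
    """(row, col) coordinates of the nonzero pixels of a binary image
    given as a list of rows (hypothetical input format)."""
    return [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v]


if __name__ == "__main__":
    word_a = [[1, 1, 0, 0]]
    word_b = [[1, 0, 0, 1]]
    print(hausdorff(foreground_points(word_a), foreground_points(word_b)))
```

This brute-force version takes time proportional to the product of the two set sizes, which is acceptable for small word images. A single outlying pixel can dominate the maximum, which is one reason modified variants are often considered.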

    Word Image Matching Based on Hausdorff Distances

    Hausdorff distance (HD) and its modifications provide one of the best approaches for matching binary images. This paper proposes a formalism generalizing almost all of these HD-based methods. Numerical experiments for searching words in binary text images are carried out with old Bulgarian typewritten text, a printed Bulgarian Chrestomathy from 1884, and a Slavonic manuscript from 1574.

    A scalable framework for stylometric analysis query processing

    This is an accepted manuscript of an article published by IEEE in 2016 IEEE 16th International Conference on Data Mining (ICDM) on 02/02/2017, available online: https://ieeexplore.ieee.org/document/7837960 The accepted version of the publication may differ from the final published version. Stylometry is the statistical analysis of variations in an author's literary style. The technique has been used in many linguistic analysis applications, such as author profiling, authorship identification, and authorship verification. Over the past two decades, authorship identification has been extensively studied by researchers in the area of natural language processing. However, these studies are generally limited to (i) a small number of candidate authors, and (ii) documents of similar lengths. In this paper, we propose a novel solution that models authorship attribution as a set similarity problem to overcome these two limitations. We conducted extensive experimental studies on a real dataset collected from an online book archive, Project Gutenberg. Experimental results show that, in comparison to existing stylometry studies, our proposed solution can handle a larger number of documents of different lengths written by a larger pool of candidate authors with high accuracy.
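To make the idea of treating authorship attribution as a set similarity problem concrete, here is a minimal sketch. The abstract does not say which features or similarity measure the authors use, so the character trigram sets, the Jaccard coefficient, and the names `char_ngrams`, `jaccard`, and `attribute` are all assumptions chosen for illustration.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a text, after lowercasing and
    collapsing whitespace (an assumed feature choice)."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def attribute(query, profiles, n=3):
    """Return the author whose known text is most similar to `query`.

    `profiles` maps an author name to a sample of that author's writing.
    """
    q = char_ngrams(query, n)
    return max(profiles, key=lambda author: jaccard(q, char_ngrams(profiles[author], n)))


if __name__ == "__main__":
    profiles = {"A": "the quick brown fox", "B": "zzzz yyyy xxxx"}
    print(attribute("quick brown", profiles))
```

Comparing the query against every profile, as done here, scales linearly with the number of candidate authors. A scalable system would presumably need an index or a pruning strategy, which this sketch does not attempt.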

    A Comparison of Nature Inspired Algorithms for Multi-threshold Image Segmentation

    In the field of image analysis, segmentation is one of the most important preprocessing steps. One way to achieve segmentation is by means of threshold selection, where each pixel that belongs to a given class is labeled according to the selected threshold, resulting in pixel groups that share visual characteristics in the image. Several methods have been proposed to solve threshold selection problems; this work uses a method based on a mixture of Gaussian functions that approximates the 1D histogram of a gray-level image, with the parameters calculated using three nature-inspired algorithms (Particle Swarm Optimization, Artificial Bee Colony Optimization and Differential Evolution). Each Gaussian function approximates part of the histogram, representing a pixel class and therefore a threshold point. Experimental results compare the algorithms quantitatively and qualitatively and present the main advantages and drawbacks of each when applied to the multi-threshold problem. Comment: 16 pages, this is a draft of the final version of the article sent to the Journa
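The step from a fitted Gaussian mixture to threshold values can be sketched as follows. The abstract does not state how thresholds are derived from the Gaussians, so the rule used here (the first gray level between two adjacent means at which the weighted density of the brighter class reaches that of the darker one) is an assumption, as are the names `gaussian` and `thresholds`. The mixture parameters are taken as given; the fitting by PSO, ABC, or DE is not shown.

```python
import math


def gaussian(x, mean, sigma):
    """Value of the normal density with the given mean and standard deviation at x."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def thresholds(components, levels=256):
    """One threshold per pair of adjacent mixture components.

    `components` is a list of (weight, mean, sigma) tuples. For each adjacent
    pair (ordered by mean), the threshold is the first integer gray level
    between the two means where the brighter class becomes at least as
    likely as the darker one (assumed rule, not taken from the paper).
    """
    comps = sorted(components, key=lambda c: c[1])
    result = []
    for (w1, m1, s1), (w2, m2, s2) in zip(comps, comps[1:]):
        lo = max(0, math.ceil(m1))
        hi = min(levels - 1, math.floor(m2))
        crossing = (
            x for x in range(lo, hi + 1)
            if w2 * gaussian(x, m2, s2) >= w1 * gaussian(x, m1, s1)
        )
        result.append(next(crossing, hi))
    return result


if __name__ == "__main__":
    # Hypothetical mixture with three pixel classes.
    print(thresholds([(0.3, 40, 10), (0.4, 120, 15), (0.3, 200, 12)]))
```

With K Gaussians this produces K - 1 thresholds, which matches the reading that each Gaussian represents one pixel class.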

    Multiple Instance Learning: A Survey of Problem Characteristics and Applications

    Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and makes it possible to leverage weakly labeled data. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas is described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight into how the problem characteristics affect MIL algorithms, recommendations for future benchmarking, and promising avenues for research.