Text Search in Document Images Based on Hausdorff Distance Measures
Hausdorff-type distances between sets of points in the plane are commonly used similarity measures for binary images. In this work we present several such measures in a unified manner and introduce a new, naturally arising variant of the Hausdorff distance. The matching performance of all the similarity measures is compared in computer experiments using real word images from a scanned book.
Hausdorff distances for searching in binary text images
Hausdorff distance (HD) seems the most efficient instrument for measuring how far two compact non-empty subsets of a metric space are from each other. This paper considers the possibilities provided by HD and some of its modifications, used recently by many authors to measure resemblance between binary text images. Summarizing part of the existing word image matching methods that rely on HD, we investigate a new, similar parameterized method that contains almost all of them as particular cases. Numerical experiments on searching for words in binary text images are carried out with 333 pages of old Bulgarian typewritten text, 200 printed pages of a Bulgarian Chrestomathy from 1884, and 200 handwritten pages of a Slavonic manuscript from 1574. They outline how the parameters must be set in order to exploit the advantages of the proposed method for word matching in scanned document images.
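As a minimal sketch of the basic measure the abstract refers to, the classical Hausdorff distance between the foreground pixels of two binary images can be computed as follows. This is the textbook definition, not the parameterized method the paper proposes; the function names and the toy images are illustrative assumptions.

```python
from math import hypot


def foreground_points(image):
    """Return (row, col) coordinates of the nonzero pixels of a binary image."""
    return [(r, c) for r, row in enumerate(image) for c, v in enumerate(row) if v]


def directed_hausdorff(a, b):
    """Largest distance from any point of `a` to its nearest point of `b`."""
    return max(min(hypot(ax - bx, ay - by) for bx, by in b) for ax, ay in a)


def hausdorff(a, b):
    """Classical (symmetric) Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


# Toy binary images standing in for two word images (hypothetical data).
word_a = [[1, 1, 0],
          [0, 0, 0]]
word_b = [[1, 0, 0],
          [0, 0, 1]]

distance = hausdorff(foreground_points(word_a), foreground_points(word_b))
print(distance)
```

This brute-force version compares every pair of points, so it is only practical for small images. Because it takes a maximum, a single outlying pixel can dominate the result.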
Word Image Matching Based on Hausdorff Distances
Hausdorff distance (HD) and its modifications provide
one of the best approaches for matching binary images.
This paper proposes a formalism generalizing almost
all of these HD-based methods. Numerical experiments
on searching for words in binary text images are carried
out with old Bulgarian typewritten text, a printed Bulgarian
Chrestomathy from 1884, and a Slavonic manuscript
from 1574.
A scalable framework for stylometric analysis query processing
This is an accepted manuscript of an article published by IEEE in 2016 IEEE 16th International Conference on Data Mining (ICDM) on 02/02/2017, available online: https://ieeexplore.ieee.org/document/7837960
The accepted version of the publication may differ from the final published version.
Stylometry is the statistical analysis of variations in an author's literary style. The technique has been used in many linguistic analysis applications, such as author profiling, authorship identification, and authorship verification. Over the past two decades, authorship identification has been extensively studied by researchers in the area of natural language processing. However, these studies are generally limited to (i) a small number of candidate authors, and (ii) documents with similar lengths. In this paper, we propose a novel solution that models authorship attribution as a set similarity problem to overcome these two limitations. We conducted extensive experimental studies on a real dataset collected from an online book archive, Project Gutenberg. Experimental results show that, in comparison to existing stylometry studies, our proposed solution can handle a larger number of documents of different lengths written by a larger pool of candidate authors with high accuracy.
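To illustrate what treating authorship attribution as a set similarity problem can look like, here is a minimal sketch. The abstract does not say which features or similarity measure the paper uses, so the character trigram sets, the Jaccard coefficient, and all names and sample texts below are assumptions made purely for illustration.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a text, after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def attribute(query, candidates):
    """Return the candidate author whose known text is most similar to the query."""
    q = char_ngrams(query)
    return max(candidates, key=lambda author: jaccard(q, char_ngrams(candidates[author])))


# Hypothetical toy corpus: one short known text per candidate author.
known_texts = {
    "author_a": "the whale rose from the sea",
    "author_b": "it is a truth universally acknowledged",
}
best = attribute("the sea and the whale", known_texts)
print(best)
```

Set-based measures like this one compare which features occur rather than how often they occur.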
A Comparison of Nature Inspired Algorithms for Multi-threshold Image Segmentation
In the field of image analysis, segmentation is one of the most important
preprocessing steps. One way to achieve segmentation is by means of threshold
selection, where each pixel that belongs to a determined class is labeled
according to the selected threshold, resulting in pixel groups that share
visual characteristics in the image. Several methods have been proposed to
solve threshold selection problems; this work uses a method
based on a mixture of Gaussian functions to approximate the 1D histogram of a
gray-level image, whose parameters are calculated using three
nature-inspired algorithms (Particle Swarm Optimization, Artificial Bee Colony
Optimization and Differential Evolution). Each Gaussian function approximates
part of the histogram, representing a pixel class and therefore a threshold point.
Experimental results are shown, comparing the algorithms quantitatively and
qualitatively and discussing the main advantages and drawbacks of each when applied
to the multi-threshold problem.
Comment: 16 pages, this is a draft of the final version of the article sent to
the Journa
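The thresholding idea in this abstract can be sketched as follows, assuming the mixture parameters have already been fitted (the paper estimates them with PSO, ABC, or DE, which is not reproduced here). The rule used to turn adjacent Gaussians into a threshold, namely the first gray level at which the next class's weighted density dominates, is an assumption, and the component values are made up.

```python
from bisect import bisect_right
from math import exp, pi, sqrt


def gaussian(x, weight, mean, std):
    """Weighted Gaussian density, one component of the histogram model."""
    return weight * exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))


def mixture(x, components):
    """Mixture value at gray level x: the sum of all weighted components."""
    return sum(gaussian(x, *c) for c in components)


def thresholds(components):
    """For each pair of adjacent classes, find the first gray level between
    their means at which the brighter class's weighted density dominates.
    A pair contributes no threshold if that never happens."""
    ordered = sorted(components, key=lambda c: c[1])
    cuts = []
    for left, right in zip(ordered, ordered[1:]):
        for level in range(int(round(left[1])), int(round(right[1])) + 1):
            if gaussian(level, *right) >= gaussian(level, *left):
                cuts.append(level)
                break
    return cuts


# Hypothetical fitted parameters: (weight, mean, standard deviation) per class.
components = [(0.5, 50.0, 10.0), (0.5, 150.0, 10.0)]
cuts = thresholds(components)

# A pixel's class index is the number of thresholds at or below its gray level.
labels = [bisect_right(cuts, pixel) for pixel in (20, 99, 100, 200)]
print(cuts, labels)
```

With K Gaussian components this yields at most K - 1 thresholds, which is why each component corresponds to both a pixel class and a threshold point.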
Multiple Instance Learning: A Survey of Problem Characteristics and Applications
Multiple instance learning (MIL) is a form of weakly supervised learning
where training instances are arranged in sets, called bags, and a label is
provided for the entire bag. This formulation is gaining interest because it
naturally fits various problems and allows weakly labeled data to be leveraged.
Consequently, it has been used in diverse application fields such as computer
vision and document classification. However, learning from bags raises
important challenges that are unique to MIL. This paper provides a
comprehensive survey of the characteristics which define and differentiate the
types of MIL problems. Until now, these problem characteristics have not been
formally identified and described. As a result, the variations in performance
of MIL algorithms from one data set to another are difficult to explain. In
this paper, MIL problem characteristics are grouped into four broad categories:
the composition of the bags, the types of data distribution, the ambiguity of
instance labels, and the task to be performed. Methods specialized to address
each category are reviewed. Then, the extent to which these characteristics
manifest themselves in key MIL application areas is described. Finally,
experiments are conducted to compare the performance of 16 state-of-the-art MIL
methods on selected problem characteristics. This paper provides insight into how
the problem characteristics affect MIL algorithms, recommendations for future
benchmarking, and promising avenues for research.
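To make the bag formulation concrete, here is a minimal sketch of how a bag-level prediction can be derived from instance-level scores. It uses the standard MIL assumption (a bag is positive if at least one of its instances is positive), which the abstract does not state; the scores, the threshold, and the names are illustrative assumptions.

```python
def bag_label(instance_scores, threshold=0.5):
    """Standard MIL assumption: the bag is positive (1) if its highest-scoring
    instance reaches the threshold, otherwise negative (0)."""
    return int(max(instance_scores) >= threshold)


# Hypothetical bags: each holds instance scores from some instance-level classifier.
bags = {
    "bag_1": [0.1, 0.2, 0.9],  # one strongly positive instance
    "bag_2": [0.1, 0.3, 0.2],  # no instance reaches the threshold
}
predictions = {name: bag_label(scores) for name, scores in bags.items()}
print(predictions)
```

Only the bag label is observed during training, so which instance triggered a positive bag remains ambiguous.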