
    Separability versus Prototypicality in Handwritten Word Retrieval

    User appreciation of a word-image retrieval system is based on the quality of a hit list for a query. Using support vector machines for ranking in large-scale, handwritten document collections, we observed that many hit lists suffered from bad instances in the top ranks. An analysis of this problem revealed that two functions needed to be optimised, concerning both separability and prototypicality. By ranking images in two stages, the number of distracting images is reduced, making the method very convenient for massive-scale, continuously trainable retrieval engines. Instead of cumbersome SVM training, we present a nearest-centroid method and show that precision improvements of up to 35 percentage points can be achieved, yielding up to 100% precision in data sets with a large amount of instances, while maintaining high recall performances.

    Separability versus prototypicality in handwritten word-image retrieval

    Hit lists are at the core of retrieval systems. The top ranks are important, especially if user feedback is used to train the system. Analysis of hit lists revealed counter-intuitive instances in the top ranks for good classifiers. In this study, we propose that two functions need to be optimised: (a) in order to reduce a massive set of instances to a likely subset among ten thousand or more classes, separability is required. However, the results need to be intuitive after ranking, reflecting (b) the prototypicality of instances. By optimising these requirements sequentially, the number of distracting images is strongly reduced, followed by nearest-centroid based instance ranking that retains an intuitive (low-edit distance) ranking. We show that in handwritten word-image retrieval, precision improvements of up to 35 percentage points can be achieved, yielding up to 100% top hit precision and 99% top-7 precision in data sets with 84 000 instances, while maintaining high recall performances. The method is conveniently implemented in a massive-scale, continuously trainable retrieval engine, Monk.
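    The paper's own implementation is not shown here, but the nearest-centroid ranking it describes can be sketched as follows. All names (`rank_by_centroid`, the feature-matrix layout) are hypothetical; the sketch assumes each word image is already reduced to a fixed-length feature vector.

    ```python
    import numpy as np

    def rank_by_centroid(features, labels, query_class):
        """Rank instances of query_class by distance to the class centroid."""
        # Select the feature vectors belonging to the query class
        # (in practice, the "likely subset" produced by the first,
        # separability-oriented stage).
        mask = labels == query_class
        class_feats = features[mask]
        # Nearest-centroid: the class prototype is the mean feature vector.
        centroid = class_feats.mean(axis=0)
        # Rank instances by Euclidean distance to the centroid, most
        # prototypical first, giving an intuitive hit-list ordering.
        dists = np.linalg.norm(class_feats - centroid, axis=1)
        return np.flatnonzero(mask)[np.argsort(dists)]
    ```

    Unlike SVM training, the centroid can be updated incrementally as new labels arrive, which is what makes the approach convenient for a continuously trainable engine.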

    A large-scale field test on word-image classification in large historical document collections using a traditional and two deep-learning methods

    This technical report describes a practical field test on word-image classification in a very large collection of more than 300 diverse handwritten historical manuscripts, with 1.6 million unique labeled images and more than 11 million images used in testing. Results indicate that several deep-learning tests completely failed (mean accuracy 83%). In the tests with more than 1000 output units (lexical words) in one-hot encoding for classification, performance steeply drops to almost zero percent accuracy, even with a modest size of the pre-final (i.e., penultimate) layer (150 units). A traditional feature method (BOVW) displays a consistent performance over numbers of classes and numbers of training examples (mean accuracy 87%). Additional tests using nearest mean on the output of the pre-final layer of an Inception V3 network, for each book, only yielded mediocre results (mean accuracy 49%), but were not sensitive to high numbers of classes. Notably, this experiment was only possible on the basis of labels that were harvested on the basis of a traditional method which already works starting from a single labeled image per class. It is expected that the performance of the failed deep-learning tests can be repaired, but only on the basis of human handcrafting (sic) of network architecture and hyperparameters. When the failed problematic books are not considered, end-to-end CNN training yields about 95% accuracy. This average is dominated by a large subset of Chinese characters, performances for other script styles being lower.
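    The nearest-mean test on penultimate-layer features can be sketched as below. The function name and array layout are hypothetical; the sketch assumes features have already been extracted from the pre-final layer of a pretrained network.

    ```python
    import numpy as np

    def nearest_mean_classify(train_feats, train_labels, query_feat):
        """Assign query_feat to the class with the closest mean feature vector.

        Works from as little as a single labeled example per class,
        since each class mean can be estimated from one vector.
        """
        classes = np.unique(train_labels)
        # One prototype (mean of penultimate-layer features) per class.
        means = np.stack([train_feats[train_labels == c].mean(axis=0)
                          for c in classes])
        # Nearest mean in Euclidean distance wins.
        dists = np.linalg.norm(means - query_feat, axis=1)
        return classes[np.argmin(dists)]
    ```

    Because no output layer is retrained, this approach scales to high class counts without the steep performance drop reported for large one-hot output layers.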

    OptGAN: Optimizing and Interpreting the Latent Space of the Conditional Text-to-Image GANs

    Text-to-image generation intends to automatically produce a photo-realistic image, conditioned on a textual description. It can be potentially employed in the field of art creation, data augmentation, photo-editing, etc. Although many efforts have been dedicated to this task, it remains particularly challenging to generate believable, natural scenes. To facilitate the real-world applications of text-to-image synthesis, we focus on studying the following three issues: 1) How to ensure that generated samples are believable, realistic or natural? 2) How to exploit the latent space of the generator to edit a synthesized image? 3) How to improve the explainability of a text-to-image generation framework? In this work, we constructed two novel data sets (i.e., the Good & Bad bird and face data sets) consisting of successful as well as unsuccessful generated samples, according to strict criteria. To effectively and efficiently acquire high-quality images by increasing the probability of generating Good latent codes, we use a dedicated Good/Bad classifier for generated images. It is based on a pre-trained front end and fine-tuned on the basis of the proposed Good & Bad data set. After that, we present a novel algorithm which identifies semantically-understandable directions in the latent space of a conditional text-to-image GAN architecture by performing independent component analysis on the pre-trained weight values of the generator. Furthermore, we develop a background-flattening loss (BFL), to improve the background appearance in the edited image. Subsequently, we introduce linear interpolation analysis between pairs of keywords. This is extended into a similar triangular 'linguistic' interpolation in order to take a deep look into what a text-to-image synthesis model has learned within the linguistic embeddings. Our data set is available at https://zenodo.org/record/6283798#.YhkN_ujMI2w.
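    The linear interpolation analysis mentioned above reduces to a simple operation on latent codes; a minimal sketch follows, with a hypothetical function name and no dependence on the paper's actual GAN. Feeding each intermediate code to the generator would show how the synthesized image morphs between two conditions.

    ```python
    import numpy as np

    def interpolate_latents(z_a, z_b, steps=5):
        """Linearly interpolate between two latent codes z_a and z_b.

        Returns `steps` codes from z_a (alpha=0) to z_b (alpha=1),
        endpoints included.
        """
        alphas = np.linspace(0.0, 1.0, steps)
        # Convex combination of the two codes at each step.
        return [(1.0 - a) * z_a + a * z_b for a in alphas]
    ```

    The triangular 'linguistic' interpolation generalizes this to convex combinations of three keyword embeddings instead of two.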

    Towards robust real-world historical handwriting recognition

    In this thesis, we make a bridge from the past to the future by using artificial-intelligence methods for text recognition in a historical Dutch collection of the Natuurkundige Commissie that explored Indonesia (1820-1850). In spite of the successes of systems like 'ChatGPT', reading historical handwriting is still quite challenging for AI. Whereas GPT-like methods work on digital texts, historical manuscripts are only available as extremely diverse collections of (pixel) images. Despite their great results, current deep-learning methods are very data-greedy and time-consuming, depend heavily on human experts from the humanities for labeling, and require machine-learning experts for designing the models. Ideally, the use of deep-learning methods should require minimal human effort, have an algorithm observe the evolution of the training process, and avoid inefficient use of the already sparse amount of labeled data. We present several approaches towards dealing with these problems, aiming to improve the robustness of current methods and to improve the autonomy in training. We applied our novel word and line text recognition approaches on nine data sets differing in time period, language, and difficulty: three locally collected historical Latin-based data sets from Naturalis, Leiden; four public Latin-based benchmark data sets for comparability with other approaches; and two Arabic data sets. Using ensemble voting of just five neural networks, a level of accuracy was achieved which required hundreds of neural networks in earlier studies. Moreover, we increased the speed of evaluation of each training epoch without the need of labeled data.
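    The ensemble voting over five networks can be illustrated with a plurality vote per image; the function name is hypothetical and the thesis's actual voting scheme may weight or break ties differently.

    ```python
    from collections import Counter

    def ensemble_vote(predictions):
        """Plurality vote over per-network label predictions for one image.

        predictions: one predicted label from each trained network.
        Ties resolve to the label that first reaches the winning count.
        """
        # The label predicted by the most networks wins.
        return Counter(predictions).most_common(1)[0][0]
    ```

    With five networks, a label backed by three or more of them always wins, which is how a small ensemble can match the accuracy of much larger ones when the members are diverse.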