758 research outputs found
Attribute CNNs for Word Spotting in Handwritten Documents
Word spotting has become a field of strong research interest in document
image analysis in recent years. Recently, AttributeSVMs were proposed, which
predict a binary attribute representation. At the time, this influential
method defined the state-of-the-art in segmentation-based word spotting. In
this work, we present an approach for learning attribute representations with
Convolutional Neural Networks (CNNs). By taking a probabilistic perspective on
training CNNs, we derive two different loss functions for binary and
real-valued word string embeddings. In addition, we propose two different CNN
architectures specifically designed for word spotting, which can be trained
in an end-to-end fashion. In a number of experiments, we
investigate the influence of different word string embeddings and optimization
strategies. We show our Attribute CNNs to achieve state-of-the-art results for
segmentation-based word spotting on a large variety of data sets.
Comment: under review at IJDA
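The abstract derives two loss functions from a probabilistic view of CNN training, one for binary and one for real-valued word string embeddings. As a rough illustration (my assumption, not the authors' exact derivation), binary attribute vectors pair naturally with binary cross-entropy, while real-valued embeddings pair with a cosine loss:

```python
import math

def bce_loss(pred, target, eps=1e-7):
    """Binary cross-entropy over a predicted attribute vector.

    Suited to binary word string embeddings, where each output is the
    probability that one attribute (e.g. a character in a region) is present.
    """
    assert len(pred) == len(target)
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for p, t in zip(pred, target)) / len(pred)

def cosine_loss(pred, target, eps=1e-7):
    """1 - cosine similarity, for real-valued embeddings where only the
    direction of the embedding vector matters for retrieval."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred)) + eps
    norm_t = math.sqrt(sum(t * t for t in target)) + eps
    return 1.0 - dot / (norm_p * norm_t)
```

In a real setup these would be applied to the CNN's sigmoid or linear outputs; the sketch only shows the per-vector loss terms.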
A Computationally Efficient Pipeline Approach to Full Page Offline Handwritten Text Recognition
Offline handwriting recognition with deep neural networks is usually limited
to words or lines due to large computational costs. In this paper, a less
computationally expensive full page offline handwritten text recognition
framework is introduced. This framework includes a pipeline that locates
handwritten text with an object detection neural network and recognises the
text within the detected regions using features extracted with a multi-scale
convolutional neural network (CNN) fed into a bidirectional long short term
memory (LSTM) network. This framework achieves error rates comparable to
state-of-the-art frameworks while using less memory and time. The results in
this paper demonstrate the potential of this framework, and future work can
investigate production-ready, deployable handwritten text recognisers.
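A full-page pipeline like the one described must order the detector's boxes before line-level recognition can produce a coherent page transcription. A minimal sketch of one plausible reading-order heuristic (my assumption, not the paper's exact procedure), grouping boxes into text lines by vertical centre and sorting left-to-right within each line:

```python
def reading_order(boxes, line_tol=0.5):
    """Order detected text boxes top-to-bottom, then left-to-right.

    boxes: list of (x, y, w, h) axis-aligned boxes. Boxes whose vertical
    centres lie within line_tol * median height of the previous box are
    treated as belonging to the same text line.
    """
    if not boxes:
        return []
    heights = sorted(h for _, _, _, h in boxes)
    med_h = heights[len(heights) // 2]
    # group into lines by vertical centre
    ordered = sorted(boxes, key=lambda b: b[1] + b[3] / 2)
    lines, current = [], [ordered[0]]
    for b in ordered[1:]:
        prev = current[-1]
        if abs((b[1] + b[3] / 2) - (prev[1] + prev[3] / 2)) <= line_tol * med_h:
            current.append(b)
        else:
            lines.append(current)
            current = [b]
    lines.append(current)
    # within each line, read left to right
    return [b for line in lines for b in sorted(line, key=lambda b: b[0])]
```

Real systems handle skewed or multi-column pages with more care; this only covers the simple single-column case.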
Annotation-free Learning of Deep Representations for Word Spotting using Synthetic Data and Self Labeling
Word spotting is a popular tool for supporting the first exploration of
historic, handwritten document collections. Today, the best performing methods
rely on machine learning techniques, which require a high amount of annotated
training material. As training data is usually not available in the application
scenario, annotation-free methods aim at solving the retrieval task without
representative training samples. In this work, we present an annotation-free
method that still employs machine learning techniques and therefore outperforms
other learning-free approaches. The weakly supervised training scheme relies on
a lexicon that does not need to fit the dataset precisely. In combination with
a confidence-based selection of pseudo-labeled training samples, we achieve
state-of-the-art query-by-example performances. Furthermore, our method
supports query-by-string retrieval, which is usually not possible for other
annotation-free methods.
Comment: Accepted to Workshop on Document Analysis Systems (DAS) 202
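The confidence-based selection of pseudo-labeled samples can be sketched as follows. One plausible confidence criterion (an assumption for illustration, not necessarily the paper's exact measure) is the margin between the best and second-best lexicon match for each sample:

```python
def lexicon_confidence(scores):
    """Confidence of the best lexicon match as its margin over the runner-up.

    scores: dict mapping lexicon words to similarity scores (higher = better).
    Returns (best_word, margin).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0]
    (w1, s1), (_, s2) = ranked[0], ranked[1]
    return w1, s1 - s2

def select_pseudo_labels(samples, min_margin=0.2):
    """Keep only samples whose best lexicon match is confident enough.

    samples: list of (sample_id, scores-dict).
    Returns (sample_id, pseudo_label) pairs to use for retraining.
    """
    kept = []
    for sid, scores in samples:
        word, margin = lexicon_confidence(scores)
        if margin >= min_margin:
            kept.append((sid, word))
    return kept
```

Samples with ambiguous matches are discarded rather than risk training on a wrong transcription.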
R-PHOC: Segmentation-Free Word Spotting using CNN
This paper proposes a region-based convolutional neural network for
segmentation-free word spotting. Our network takes as input an image and a
set of word candidate bounding boxes and embeds all bounding boxes into an
embedding space, where word spotting can be cast as a simple nearest
neighbour search between the query representation and each of the candidate
bounding boxes. We make use of the PHOC embedding as it has previously
achieved significant success in segmentation-based word spotting. Word candidates are
generated using a simple procedure based on grouping connected components using
some spatial constraints. Experiments show that R-PHOC, which operates on
images directly, can improve the current state-of-the-art on the standard GW
dataset and in some cases performs as well as PHOCNet, which was designed for
segmentation-based word spotting.
Comment: Accepted in ICDAR'201
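The PHOC (Pyramidal Histogram of Characters) embedding mentioned above maps a word string to a binary vector: the word is split into 1, 2, 3, ... regions per pyramid level, and each region records which characters fall inside it. A minimal sketch of the standard construction, assuming a lowercase alphabet only (practical PHOCs also include digits and frequent bigrams):

```python
def phoc(word, alphabet="abcdefghijklmnopqrstuvwxyz", levels=(1, 2)):
    """Pyramidal Histogram of Characters for a word string.

    Each character occupies the interval [i/n, (i+1)/n] of the word;
    it is assigned to a pyramid region if at least half of that interval
    overlaps the region.
    """
    n = len(word)
    vec = []
    for level in levels:
        for r in range(level):
            lo, hi = r / level, (r + 1) / level
            bits = [0] * len(alphabet)
            for i, ch in enumerate(word):
                c_lo, c_hi = i / n, (i + 1) / n
                overlap = min(hi, c_hi) - max(lo, c_lo)
                if ch in alphabet and overlap / (c_hi - c_lo) >= 0.5:
                    bits[alphabet.index(ch)] = 1
            vec.extend(bits)
    return vec
```

A query string embedded this way can be compared against embedded bounding boxes with a simple nearest-neighbour search, which is exactly how spotting is cast in the abstract.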
Depression Severity Estimation from Multiple Modalities
Depression is a major debilitating disorder which can affect people of all
ages. With a continuous increase in the number of annual cases of depression,
there is a need to develop automatic techniques for the detection of the
presence and extent of depression. In this AVEC challenge we explore different
modalities (speech, language and visual features extracted from face) to design
and develop automatic methods for the detection of depression. In psychology
literature, the PHQ-8 questionnaire is well established as a tool for measuring
the severity of depression. In this paper we aim to automatically predict the
PHQ-8 scores from features extracted from the different modalities. We show
that visual features extracted from facial landmarks obtain the best
performance in terms of estimating the PHQ-8 results with a mean absolute error
(MAE) of 4.66 on the development set. Behavioral characteristics from speech
provide an MAE of 4.73. Language features yield a slightly higher MAE of 5.17.
When switching to the test set, our Turn Features derived from audio
transcriptions achieve the best performance, scoring an MAE of 4.11
(corresponding to an RMSE of 4.94), which makes our system the winner of the
AVEC 2017 depression sub-challenge.
Comment: 8 pages, 1 figure
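The two error metrics reported above are standard regression measures over the predicted PHQ-8 scores. As a quick reference implementation (not from the paper):

```python
import math

def mae(pred, true):
    """Mean absolute error between predicted and true PHQ-8 scores."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def rmse(pred, true):
    """Root mean squared error; penalises large errors more than MAE."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))
```

Since RMSE weights large deviations more heavily, it is always at least as large as MAE on the same predictions, consistent with the reported pair (MAE 4.11, RMSE 4.94).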
PHOCNet: A Deep Convolutional Neural Network for Word Spotting in Handwritten Documents
In recent years, deep convolutional neural networks have achieved
state-of-the-art performance in various computer vision tasks such as
classification, detection, and segmentation. Due to their outstanding performance, CNNs are more
and more used in the field of document image analysis as well. In this work, we
present a CNN architecture that is trained with the recently proposed PHOC
representation. We show empirically that our CNN architecture is able to
outperform state-of-the-art results for various word spotting benchmarks while
exhibiting short training and test times.
Comment: published as a conference paper at the International Conference on
Frontiers in Handwriting Recognition 201
WSRNet: Joint Spotting and Recognition of Handwritten Words
In this work, we present a unified model that can handle both Keyword
Spotting and Word Recognition with the same network architecture. The proposed
network comprises a non-recurrent CTC branch and a Seq2Seq branch that is
further augmented with an Autoencoding module. The related joint loss leads to
a boost in recognition performance, while the Seq2Seq branch is used to create
efficient word representations. We show how to further process these
representations with binarization and a retraining scheme to provide compact
and highly efficient descriptors, suitable for keyword spotting. Numerical
results validate the usefulness of the proposed architecture, as our method
outperforms the previous state-of-the-art in keyword spotting, and provides
results in the ballpark of the leading methods for word recognition.
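The abstract mentions binarizing the Seq2Seq representations to obtain compact descriptors for keyword spotting. A minimal sketch of the generic idea (sign binarization plus Hamming-distance ranking; the paper's actual binarization and retraining scheme is more involved):

```python
def binarize(embedding):
    """Sign-binarize a real-valued word embedding into a compact bit vector."""
    return [1 if x >= 0 else 0 for x in embedding]

def hamming(a, b):
    """Hamming distance between two bit vectors; a cheap ranking metric
    for keyword spotting over binarized descriptors."""
    return sum(x != y for x, y in zip(a, b))
```

Bit vectors can be packed into machine words, so ranking a query against millions of word images reduces to XOR-and-popcount operations.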
Neural Word Search in Historical Manuscript Collections
We address the problem of segmenting and retrieving word images in
collections of historical manuscripts given a text query. This is commonly
referred to as "word spotting". To this end, we first propose an end-to-end
trainable model based on deep neural networks that we dub Ctrl-F-Net. The model
simultaneously generates region proposals and embeds them into a word embedding
space, wherein a search is performed. We further introduce a simplified version
called Ctrl-F-Mini. It is faster with similar performance, though it is limited
to more easily segmented manuscripts. We evaluate both models on common
benchmark datasets and surpass the previous state of the art. Finally, in
collaboration with historians, we employ the Ctrl-F-Net to search within a
large manuscript collection of over 100 thousand pages, written across two
centuries. With only 11 training pages, we enable large-scale data collection
in manuscript-based historical research, speeding up data collection and
increasing the number of manuscripts processed by orders of magnitude.
Given the time-consuming manual work required to study old manuscripts in the
humanities, quick and robust tools for word spotting have the potential to
revolutionise domains like history, religion and language.
Comment: Extension of arXiv:1703.07645. This version adds results on two
additional benchmark datasets (Botany and Konzilsprotokolle) and improves the
experiment done in section 5.3.
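Models that generate region proposals, like Ctrl-F-Net, typically prune overlapping proposals before (or after) scoring them. A standard greedy non-maximum suppression sketch, offered as generic background rather than the paper's specific proposal mechanism:

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    if inter == 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression: keep the highest-scoring proposal,
    drop proposals overlapping it by IoU >= thresh, repeat.
    Returns the indices of kept boxes."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < thresh]
    return keep
```

After suppression, the surviving regions are embedded into the word embedding space and searched against the query.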
Learning Deep Representations for Word Spotting Under Weak Supervision
Convolutional Neural Networks have made their mark in various fields of
computer vision in recent years. They have achieved state-of-the-art
performance in the field of document analysis as well. However, CNNs require a
large amount of annotated training data and, hence, great manual effort. In our
approach, we introduce a method to drastically reduce the manual annotation
effort while retaining the high performance of a CNN for word spotting in
handwritten documents. The model is learned with weak supervision using a
combination of synthetically generated training data and a small subset of the
training partition of the handwritten data set. We show that the network
achieves results highly competitive to the state-of-the-art in word spotting
with shorter training times and a fraction of the annotation effort.
Comment: submitted to DAS 201
Candidate Fusion: Integrating Language Modelling into a Sequence-to-Sequence Handwritten Word Recognition Architecture
Sequence-to-sequence models have recently become very popular for tackling
handwritten word recognition problems. However, how to effectively integrate
an external language model into such a recognizer is still a challenging
problem. The main challenge when training a language model is that its corpus
usually differs from the one used to train the handwritten word recognition
system. This mismatch between the two corpora can bias the transcriptions,
yielding similar or even worse recognition performance. In this work, we introduce
Candidate Fusion, a novel way to integrate an external language model into a
sequence-to-sequence architecture: the language model's suggestions are fed
as an additional input to the sequence-to-sequence recognizer. Hence,
Candidate Fusion provides two improvements. On the one hand,
the sequence-to-sequence recognizer has the flexibility not only to combine the
information from itself and the language model, but also to choose the
importance of the information provided by the language model. On the other
hand, the external language model can adapt itself to the training corpus and
even learn the most common errors produced by the recognizer. Finally,
comprehensive experiments show that Candidate Fusion outperforms
state-of-the-art language models for handwritten word recognition tasks.
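The general idea of blending recognizer and language-model evidence can be sketched with a simple weighted score combination. This is a generic shallow-fusion-style illustration under my own assumptions, not the learned fusion mechanism the abstract describes:

```python
def fuse_candidates(recognizer_scores, lm_scores, alpha=0.7):
    """Blend recognizer and language-model scores for candidate words.

    recognizer_scores / lm_scores: dicts mapping candidate words to scores.
    alpha controls how much the recognizer is trusted over the language model.
    Returns the candidates ranked by fused score, best first.
    """
    fused = {}
    for word, r in recognizer_scores.items():
        l = lm_scores.get(word, 0.0)
        fused[word] = alpha * r + (1 - alpha) * l
    return sorted(fused, key=fused.get, reverse=True)
```

With a fixed alpha the trade-off is static; the appeal of a learned integration, as described above, is that the recognizer itself decides how much weight the language model's suggestion deserves for each input.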