Learning to Read by Spelling: Towards Unsupervised Text Recognition
This work presents a method for visual text recognition without using any
paired supervisory data. We formulate the text recognition task as one of
aligning the conditional distribution of strings predicted from given text
images, with lexically valid strings sampled from target corpora. This enables
fully automated, unsupervised learning from just line-level text images and
unpaired text-string samples, obviating the need for large aligned datasets. We
present a detailed analysis of various aspects of the proposed method, namely:
(1) the impact of the length of training sequences on convergence,
(2) relation between character frequencies and the order in which they are
learnt, (3) generalisation ability of our recognition network to inputs of
arbitrary lengths, and (4) impact of varying the text corpus on recognition
accuracy. Finally, we demonstrate excellent text recognition accuracy on both
synthetically generated text images and scanned images of real printed books,
using no labelled training examples.
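The alignment objective described above is most naturally read as an adversarial one. Below is a minimal sketch of such a formulation, assuming a convolutional recogniser that emits per-position character distributions and a recurrent discriminator that scores whether a character sequence looks like corpus text; the module names, vocabulary size, shapes and the Gumbel-softmax relaxation are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch: adversarial alignment of predicted character distributions
# with unpaired corpus strings. Illustrative only; not the paper's exact model.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 27  # hypothetical: 26 letters + space/blank

class Recognizer(nn.Module):
    """Maps a line image (B, 1, 32, W) to per-position character distributions."""
    def __init__(self, vocab=VOCAB, hidden=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, hidden, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(hidden * 8, vocab)   # height 32 -> 8 after two stride-2 convs

    def forward(self, images):
        f = self.conv(images)                      # (B, C, 8, W/4)
        f = f.flatten(1, 2).transpose(1, 2)        # (B, W/4, C*8)
        return self.head(f)                        # (B, T, vocab) logits

class Discriminator(nn.Module):
    """Scores a sequence of character distributions as corpus-like or not."""
    def __init__(self, vocab=VOCAB, hidden=128):
        super().__init__()
        self.embed = nn.Linear(vocab, hidden)      # accepts soft one-hots
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)

    def forward(self, char_probs):
        h, _ = self.rnn(self.embed(char_probs))
        return self.score(h[:, -1])                # (B, 1)

def training_step(recognizer, disc, images, corpus_strings, opt_r, opt_d):
    # "Fake" samples: distributions predicted from unlabelled text images.
    logits = recognizer(images)
    fake = F.gumbel_softmax(logits, tau=1.0, hard=False)
    # "Real" samples: one-hot encodings of valid strings from the corpus.
    real = F.one_hot(corpus_strings, VOCAB).float()

    # Discriminator tries to separate corpus strings from predictions.
    d_loss = F.binary_cross_entropy_with_logits(disc(real), torch.ones(real.size(0), 1)) + \
             F.binary_cross_entropy_with_logits(disc(fake.detach()), torch.zeros(fake.size(0), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Recognizer tries to make its predictions indistinguishable from corpus text.
    r_loss = F.binary_cross_entropy_with_logits(disc(fake), torch.ones(fake.size(0), 1))
    opt_r.zero_grad(); r_loss.backward(); opt_r.step()
```

In this setup the recogniser receives feedback only through the discriminator, so no image-text pairs are ever required, which is the sense in which the training is unsupervised.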
Inductive Visual Localisation: Factorised Training for Superior Generalisation
End-to-end trained Recurrent Neural Networks (RNNs) have been successfully
applied to numerous problems that require processing sequences, such as image
captioning, machine translation, and text recognition. However, RNNs often
struggle to generalise to sequences longer than the ones encountered during
training. In this work, we propose to optimise neural networks explicitly for
induction. The idea is to first decompose the problem into a sequence of
inductive steps and then to explicitly train the RNN to reproduce such steps.
Generalisation is achieved as the RNN is not allowed to learn an arbitrary
internal state; instead, it is tasked with mimicking the evolution of a valid
state. In particular, the state is restricted to a spatial memory map that
tracks parts of the input image which have been accounted for in previous
steps. The RNN is trained for single inductive steps, where it produces updates
to the memory in addition to the desired output. We evaluate our method on two
different visual recognition problems involving visual sequences: (1) text
spotting, i.e. joint localisation and reading of text in images containing
multiple lines (or a block) of text, and (2) sequential counting of objects in
aerial images. We show that inductive training of recurrent models enhances
their generalisation ability on challenging image datasets. Comment: In BMVC 2018 (spotlight).
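A rough sketch of what a single inductive step could look like, assuming the spatial memory is a one-channel map over image locations that is updated additively; the layer choices and per-step losses are assumptions for illustration, not the paper's exact training recipe.

```python
# Hedged sketch of a single inductive step, assuming a spatial memory map that
# marks which parts of the image have already been accounted for. Illustrative only.
import torch
import torch.nn as nn

class InductiveStep(nn.Module):
    """One step: (image features, memory map) -> (step output, memory update)."""
    def __init__(self, feat_dim=64, out_dim=10):
        super().__init__()
        # The memory is a single-channel map concatenated onto the image features.
        self.fuse = nn.Conv2d(feat_dim + 1, feat_dim, kernel_size=3, padding=1)
        self.out_head = nn.Linear(feat_dim, out_dim)            # step output logits
        self.mem_head = nn.Conv2d(feat_dim, 1, kernel_size=1)   # additive memory update

    def forward(self, feats, memory):
        x = torch.relu(self.fuse(torch.cat([feats, memory], dim=1)))
        pooled = x.mean(dim=(2, 3))                             # global pooling for the output
        return self.out_head(pooled), self.mem_head(x)

def train_on_decomposed_steps(model, optimizer, steps):
    """Each training example is a *single* step: (feats, memory_t, target, memory_t+1).

    Because the model is never free to invent an arbitrary internal state, it
    cannot overfit to the training sequence length, which is the point of the
    factorised / inductive training described in the abstract.
    """
    ce = nn.CrossEntropyLoss()
    for feats, mem_t, target_out, mem_next in steps:
        pred_out, mem_update = model(feats, mem_t)
        loss = ce(pred_out, target_out) + \
               nn.functional.mse_loss(mem_t + mem_update, mem_next)
        optimizer.zero_grad(); loss.backward(); optimizer.step()
```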
A Deep Generative Framework for Paraphrase Generation
Paraphrase generation is an important problem in NLP, with applications in
question answering, information retrieval, information extraction, and
conversation systems, to name a few. In this paper, we address the problem of
generating paraphrases
automatically. Our proposed method is based on a combination of deep generative
models (VAE) with sequence-to-sequence models (LSTM) to generate paraphrases,
given an input sentence. Traditional VAEs, when combined with recurrent neural
networks, can generate free text but are not well suited to generating
paraphrases of a given sentence. We address this problem by conditioning both
the encoder and the decoder of the VAE on the original sentence, so that the
model can generate paraphrases of that sentence. Unlike most existing models,
our model is simple, modular, and can generate multiple paraphrases for a given
sentence. Quantitative evaluation of the proposed method on a benchmark
paraphrase dataset demonstrates its efficacy and its significant performance
improvement over state-of-the-art methods, while qualitative human evaluation
indicates that the generated paraphrases are well-formed, grammatically correct,
and relevant to the input sentence. Furthermore, we evaluate our method on a
newly released question paraphrase dataset and establish a new baseline for
future research.
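The conditioning described above can be sketched roughly as follows, assuming token-id inputs, teacher forcing, and single-layer LSTMs throughout; the layer sizes and the exact way the condition vector is injected are illustrative assumptions rather than the paper's precise design.

```python
# Hedged sketch of a sentence-conditioned VAE for paraphrase generation.
# Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ConditionalParaphraseVAE(nn.Module):
    def __init__(self, vocab=10000, emb=128, hidden=256, latent=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.cond_enc = nn.LSTM(emb, hidden, batch_first=True)   # encodes the original sentence
        self.para_enc = nn.LSTM(emb + hidden, hidden, batch_first=True)   # conditioned encoder
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.dec = nn.LSTM(emb + hidden + latent, hidden, batch_first=True)  # conditioned decoder
        self.out = nn.Linear(hidden, vocab)

    def forward(self, original, paraphrase):
        # Condition vector: last hidden state of the original-sentence encoder.
        _, (h_cond, _) = self.cond_enc(self.embed(original))
        cond = h_cond[-1]                                        # (B, hidden)

        # Encoder sees the paraphrase *and* the condition at every step.
        T = paraphrase.size(1)
        enc_in = torch.cat([self.embed(paraphrase),
                            cond.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        _, (h, _) = self.para_enc(enc_in)
        mu, logvar = self.to_mu(h[-1]), self.to_logvar(h[-1])
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation

        # Decoder is also conditioned on the original sentence (plus z).
        dec_in = torch.cat([self.embed(paraphrase),
                            cond.unsqueeze(1).expand(-1, T, -1),
                            z.unsqueeze(1).expand(-1, T, -1)], dim=-1)
        dec_out, _ = self.dec(dec_in)
        logits = self.out(dec_out)                               # (B, T, vocab)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return logits, kl
```

At inference time one would encode only the original sentence, sample z from the prior, and decode greedily or with beam search; sampling several values of z yields the multiple paraphrases per input sentence that the abstract mentions.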
A Time-Series-Based Feature Extraction Approach for Prediction of Protein Structural Class
This paper presents a novel feature vector based on physicochemical properties of amino acids for the prediction of protein structural classes. The proposed method is divided into three stages. First, protein sequences are mapped to a discrete time-series representation using a physicochemical scale. Next, a wavelet-based time-series technique is used to extract features from the mapped amino-acid sequence, and a fixed-length feature vector for classification is constructed. The resulting feature space summarizes the variance information of ten different biological properties of amino acids. Finally, an optimized support vector machine model is constructed for the prediction of each protein structural class. The proposed approach is evaluated using leave-one-out cross-validation tests on two standard datasets. Comparison of our results with existing approaches shows that the overall accuracy achieved by our approach is better than that of existing methods.
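The three stages above can be illustrated roughly as follows, using a single hypothetical physicochemical scale (Kyte-Doolittle hydrophobicity) in place of the ten properties the paper aggregates; the wavelet family, decomposition level, and SVM hyper-parameters are assumptions for illustration only.

```python
# Hedged sketch of the three stages described above, using one physicochemical
# scale; the paper summarizes ten such properties.
import numpy as np
import pywt
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Stage 1: map an amino-acid sequence to a discrete time series via a scale.
HYDROPHOBICITY = {  # Kyte-Doolittle values (one of many possible scales)
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5, 'E': -3.5,
    'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8,
    'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

def sequence_to_series(seq):
    return np.array([HYDROPHOBICITY.get(aa, 0.0) for aa in seq], dtype=float)

# Stage 2: wavelet decomposition; summarise each coefficient band by its
# variance, so sequences of different lengths map to a fixed-length vector.
def wavelet_features(series, wavelet='db4', level=4):
    coeffs = pywt.wavedec(series, wavelet, level=level)
    return np.array([c.var() for c in coeffs])      # level+1 features per property

# Stage 3: SVM classifier evaluated with leave-one-out cross-validation.
def evaluate(sequences, labels):
    X = np.vstack([wavelet_features(sequence_to_series(s)) for s in sequences])
    clf = SVC(kernel='rbf', C=10.0, gamma='scale')  # hyper-parameters would be tuned
    scores = cross_val_score(clf, X, labels, cv=LeaveOneOut())
    return scores.mean()                            # overall leave-one-out accuracy
```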
Helping Hands: An Object-Aware Ego-Centric Video Recognition Model
We introduce an object-aware decoder for improving the performance of
spatio-temporal representations on ego-centric videos. The key idea is to
enhance object-awareness during training by tasking the model to predict hand
positions, object positions, and the semantic label of the objects using paired
captions when available. At inference time the model only requires RGB frames
as inputs, and is able to track and ground objects (although it has not been
trained explicitly for this). We demonstrate the performance of the
object-aware representations learnt by our model, by: (i) evaluating it for
strong transfer, i.e. through zero-shot testing, on a number of downstream
video-text retrieval and classification benchmarks; and (ii) using the
representations learned as input for long-term video understanding tasks (e.g.
Episodic Memory in Ego4D). In all cases the performance improves over the state
of the art -- even compared to networks trained with far larger batch sizes. We
also show that, by using noisy image-level detections as pseudo-labels during
training, the model learns to provide better bounding boxes using video
consistency, as well as to ground the words in the associated text
descriptions. Overall, we show that the model can act as a drop-in replacement
for an ego-centric video model to improve performance through visual-text
grounding. Comment: ICCV 2023.
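One way the training-time auxiliary supervision could be wired up is sketched below, assuming a DETR-style set of learned queries that attend over backbone features and predict boxes and noun labels; the head layout, loss weights, and class count are assumptions, not the paper's exact design.

```python
# Hedged sketch: auxiliary object-awareness heads used only at training time.
# Backbone interface, head layout, and loss weights are illustrative assumptions.
import torch
import torch.nn as nn

class ObjectAwareHeads(nn.Module):
    def __init__(self, feat_dim=512, num_queries=8, num_classes=300):
        super().__init__()
        # A small set of learned queries attends over frame features to pick out
        # hands and objects (DETR-style in spirit).
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.box_head = nn.Linear(feat_dim, 4)            # (cx, cy, w, h), normalised
        self.cls_head = nn.Linear(feat_dim, num_classes)  # noun label of the object

    def forward(self, frame_feats):
        # frame_feats: (B, N_tokens, feat_dim) from the video backbone.
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        slots, _ = self.attn(q, frame_feats, frame_feats)
        return self.box_head(slots).sigmoid(), self.cls_head(slots)

def training_loss(video_feats, heads, box_targets, label_targets, retrieval_loss):
    """Auxiliary losses added to the usual video-text objective during training."""
    boxes, labels = heads(video_feats)
    aux = nn.functional.l1_loss(boxes, box_targets) + \
          nn.functional.cross_entropy(labels.flatten(0, 1), label_targets.flatten())
    return retrieval_loss + 0.5 * aux                     # 0.5 is an assumed weight
```

Because the heads only contribute extra losses, they can be dropped at inference (or kept to ground objects), leaving a model that consumes RGB frames alone, as the abstract states.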
Adaptive Text Recognition through Visual Matching
In this work, our objective is to address the problems of generalization and
flexibility for text recognition in documents. We introduce a new model that
exploits the repetitive nature of characters in languages, and decouples the
visual representation learning and linguistic modelling stages. By doing this,
we turn text recognition into a shape matching problem, and thereby achieve
generalization in appearance and flexibility in classes. We evaluate the new
model on both synthetic and real datasets across different alphabets and show
that it can handle challenges that traditional architectures are not able to
solve without expensive retraining, including: (i) it can generalize to unseen
fonts without new exemplars from them; (ii) it can flexibly change the number
of classes, simply by changing the exemplars provided; and (iii) it can
generalize to new languages and new characters that it has not been trained for
by providing a new glyph set. We show significant improvements over
state-of-the-art models for all these cases. Comment: ECCV 2020.
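A minimal sketch of the shape-matching idea, assuming a shared convolutional encoder for the text line and the glyph exemplars and a cosine-similarity map that is decoded downstream (e.g. with CTC); the encoder and feature sizes are illustrative assumptions rather than the paper's architecture.

```python
# Hedged sketch of recognition by visual matching: characters are read off a
# similarity map between line-image features and glyph-exemplar features, so the
# class set is defined entirely by the exemplars supplied at inference time.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualMatcher(nn.Module):
    def __init__(self, feat_dim=64):
        super().__init__()
        # Shared visual encoder for both the text line and the glyph exemplars,
        # so the model never needs per-class weights.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def encode_columns(self, image):
        f = self.encoder(image)          # (B, C, H', W')
        return F.normalize(f.mean(dim=2).transpose(1, 2), dim=-1)  # (B, W', C)

    def forward(self, line_image, exemplars):
        # exemplars: (K, 1, H, W), one glyph image per class in the current alphabet.
        line = self.encode_columns(line_image)                     # (B, T, C)
        glyphs = self.encode_columns(exemplars).mean(dim=1)        # (K, C)
        # Cosine-similarity map between line positions and glyph classes.
        sim = torch.einsum('btc,kc->btk', line, glyphs)            # (B, T, K)
        return sim  # per-position class scores; decode with CTC/argmax downstream
```

Swapping the exemplars for a new font, alphabet, or language changes the class set without any retraining, which is the flexibility the abstract describes.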
Is an Object-Centric Video Representation Beneficial for Transfer?
The objective of this work is to learn an object-centric video
representation, with the aim of improving transferability to novel tasks, i.e.,
tasks different from the pre-training task of action classification. To this
end, we introduce a new object-centric video recognition model based on a
transformer architecture. The model learns a set of object-centric summary
vectors for the video, and uses these vectors to fuse the visual and
spatio-temporal trajectory 'modalities' of the video clip. We also introduce a
novel trajectory contrast loss to further enhance objectness in these summary
vectors. With experiments on four datasets -- SomethingSomething-V2,
SomethingElse, Action Genome and EpicKitchens -- we show that the
object-centric model outperforms prior video representations (both
object-agnostic and object-aware) when: (1) classifying actions on unseen
objects and in unseen environments; (2) low-shot learning of novel classes; (3)
linear probing on other downstream tasks; and (4) standard action
classification. Comment: Accepted to ACCV 2022.
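The fusion of the two 'modalities' and the trajectory contrast loss could look roughly like the sketch below, assuming learned summary vectors that cross-attend first to visual tokens and then to trajectory tokens, plus an InfoNCE-style contrast term; the shapes, class count, and exact loss form are assumptions for illustration.

```python
# Hedged sketch: object-centric summary vectors that fuse visual and trajectory
# tokens, with an InfoNCE-style "trajectory contrast" term. Shapes and the exact
# loss form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SummaryFusion(nn.Module):
    def __init__(self, dim=256, num_summaries=8, num_classes=174):
        super().__init__()
        self.summaries = nn.Parameter(torch.randn(num_summaries, dim))
        self.vis_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.traj_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes)   # e.g. SomethingSomething-V2 classes

    def forward(self, visual_tokens, traj_tokens):
        # visual_tokens: (B, Nv, dim) patch features; traj_tokens: (B, Nt, dim)
        # encoded object-trajectory features for the same clip.
        s = self.summaries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        s, _ = self.vis_attn(s, visual_tokens, visual_tokens)   # attend to appearance
        s, _ = self.traj_attn(s, traj_tokens, traj_tokens)      # then to trajectories
        logits = self.cls_head(s.mean(dim=1))                   # clip-level prediction
        return logits, s

def trajectory_contrast(summaries, traj_tokens, temperature=0.07):
    """Pull a clip's summary vectors toward its own trajectory features and away
    from those of other clips in the batch (a standard InfoNCE formulation)."""
    a = F.normalize(summaries.mean(dim=1), dim=-1)      # (B, dim)
    b = F.normalize(traj_tokens.mean(dim=1), dim=-1)    # (B, dim)
    logits = a @ b.t() / temperature                    # (B, B)
    targets = torch.arange(a.size(0))
    return F.cross_entropy(logits, targets)
```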