Interpretable Transformations with Encoder-Decoder Networks
Deep feature spaces have the capacity to encode complex transformations of
their input data. However, understanding the relative feature-space
relationship between two transformed encoded images is difficult. For instance,
what is the relative feature space relationship between two rotated images?
What is decoded when we interpolate in feature space? Ideally, we want to
disentangle confounding factors, such as pose, appearance, and illumination,
from object identity. Disentangling these is difficult because they interact in
very nonlinear ways. We propose a simple method to construct a deep feature
space, with explicitly disentangled representations of several known
transformations. A person or algorithm can then manipulate the disentangled
representation, for example, to re-render an image with explicit control over
parameterized degrees of freedom. The feature space is constructed using a
transforming encoder-decoder network with a custom feature transform layer,
acting on the hidden representations. We demonstrate the advantages of explicit
disentangling on a variety of datasets and transformations, and as an aid for
traditional tasks, such as classification.
Comment: Accepted at ICCV 201
Graph Element Networks: adaptive, structured computation and memory
We explore the use of graph neural networks (GNNs) to model spatial processes
in which there is no a priori graphical structure. Similar to finite element
analysis, we assign nodes of a GNN to spatial locations and use a computational
process defined on the graph to model the relationship between an initial
function defined over a space and a resulting function in the same space. We
use GNNs as a computational substrate, and show that the locations of the nodes
in space as well as their connectivity can be optimized to focus on the most
complex parts of the space. Moreover, this representational strategy allows the
learned input-output relationship to generalize over the size of the underlying
space and run the same model at different levels of precision, trading
computation for accuracy. We demonstrate this method on a traditional PDE
problem, a physical prediction problem from robotics, and learning to predict
scene images from novel viewpoints.
Comment: Accepted to ICML 201
Character Queries: A Transformer-based Approach to On-Line Handwritten Character Segmentation
On-line handwritten character segmentation is often associated with
handwriting recognition and even though recognition models include mechanisms
to locate relevant positions during the recognition process, it is typically
insufficient to produce a precise segmentation. Decoupling the segmentation
from the recognition unlocks the potential to further utilize the result of the
recognition. We specifically focus on the scenario where the transcription is
known beforehand, in which case the character segmentation becomes an
assignment problem between sampling points of the stylus trajectory and
characters in the text. Inspired by the k-means clustering algorithm, we view
it from the perspective of cluster assignment and present a Transformer-based
architecture where each cluster is formed based on a learned character query in
the Transformer decoder block. In order to assess the quality of our approach,
we create character segmentation ground truths for two popular on-line
handwriting datasets, IAM-OnDB and HANDS-VNOnDB, and evaluate multiple methods
on them, demonstrating that our approach achieves the overall best results.
Comment: ICDAR 2023 Best Student Paper Award. Code available at
https://github.com/jungomi/character-querie
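The cluster-assignment view mentioned in this abstract can be illustrated with the assignment step of plain k-means. This is only a minimal sketch of the clustering idea the abstract cites as inspiration, not the Transformer-based architecture itself: the function name `assign_points` and the use of one fixed 2D center per character are assumptions made for illustration.

```python
def assign_points(points, centers):
    """Return, for each point, the index of its nearest center.

    points  -- list of (x, y) sampling points of a stylus trajectory
    centers -- list of (x, y) cluster centers, one per character (hypothetical)
    """
    assignments = []
    for px, py in points:
        # Squared Euclidean distance from this point to every center.
        distances = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers]
        # The nearest center wins; ties go to the lowest index.
        assignments.append(distances.index(min(distances)))
    return assignments


points = [(0.0, 0.0), (0.2, 0.1), (1.0, 0.0), (1.1, 0.2)]
centers = [(0.1, 0.0), (1.0, 0.1)]
labels = assign_points(points, centers)
```

In this toy input the first two points fall to the first center and the last two to the second, which is the kind of point-to-character labelling a segmentation ground truth records.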
A Vietnamese Handwritten Text Recognition Pipeline for Tetanus Medical Records
Machine learning techniques are successful for optical character recognition tasks, especially in recognizing handwriting. However, recognizing Vietnamese handwriting is challenging due to the presence of six extra distinctive tonal symbols and vowels. This challenge is amplified by the handwriting of health workers in emergency care settings, where staff are under constant pressure to record the well-being of patients. In this study, we aim to digitize the handwriting of Vietnamese health workers. We develop a complete handwritten text recognition pipeline that receives scanned documents, detects and enhances the handwritten text areas of interest, transcribes the images into computer text, and finally auto-corrects invalid words and terms to achieve high accuracy. From experiments with medical documents written by 30 doctors and nurses from the Tetanus Emergency Care unit at the Hospital for Tropical Diseases, we obtain promising results of 2% and 12% for Character Error Rate and Word Error Rate, respectively.
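Character Error Rate and Word Error Rate are conventionally computed as the edit distance between the reference and the recognized text, divided by the reference length. The sketch below follows that common definition; the abstract does not state the exact evaluation protocol, so the whitespace tokenization and the function names are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits per reference word (whitespace split)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

For example, `cer("abcd", "abcf")` gives 0.25 (one substitution over four characters). A single wrong character spoils a whole word, which is one reason the word-level rate is usually the higher of the two.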
Writer adaptation for offline text recognition: An exploration of neural network-based methods
Handwriting recognition has seen significant success with the use of deep
learning. However, a persistent shortcoming of neural networks is that they are
not well-equipped to deal with shifting data distributions. In the field of
handwritten text recognition (HTR), this shows itself in poor recognition
accuracy for writers that are not similar to those seen during training. An
ideal HTR model should be adaptive to new writing styles in order to handle the
vast amount of possible writing styles. In this paper, we explore how HTR
models can be made writer adaptive by using only a handful of examples from a
new writer (e.g., 16 examples) for adaptation. Two HTR architectures are used
as base models, using a ResNet backbone along with either an LSTM or
Transformer sequence decoder. Using these base models, two methods are
considered to make them writer adaptive: 1) model-agnostic meta-learning
(MAML), an algorithm commonly used for tasks such as few-shot classification,
and 2) writer codes, an idea originating from automatic speech recognition.
Results show that an HTR-specific version of MAML known as MetaHTR improves
performance compared to the baseline with a 1.4 to 2.0 improvement in word
error rate (WER). The improvement due to writer adaptation is between 0.2 and
0.7 WER, where a deeper model seems to lend itself better to adaptation using
MetaHTR than a shallower model. However, applying MetaHTR to larger HTR models
or sentence-level HTR may become prohibitive due to its high computational and
memory requirements. Lastly, writer codes based on learned features or Hinge
statistical features did not lead to improved recognition performance.
Comment: 21 pages including appendices, 6 figures, 10 table
Interpret Vision Transformers as ConvNets with Dynamic Convolutions
There has been a debate about whether vision Transformers or ConvNets make the
better backbone for computer vision models. Although they are
usually considered as two completely different architectures, in this paper, we
interpret vision Transformers as ConvNets with dynamic convolutions, which
enables us to characterize existing Transformers and dynamic ConvNets in a
unified framework and compare their design choices side by side. In addition,
our interpretation can also guide the network design as researchers now can
consider vision Transformers from the design space of ConvNets and vice versa.
We demonstrate such potential through two specific studies. First, we inspect
the role of softmax in vision Transformers as the activation function and find
it can be replaced by commonly used ConvNets modules, such as ReLU and Layer
Normalization, which results in a faster convergence rate and better
performance. Second, following the design of depth-wise convolution, we create
a corresponding depth-wise vision Transformer that is more efficient with
comparable performance. The potential of the proposed unified interpretation is
not limited to the given examples and we hope it can inspire the community and
give rise to more advanced network architectures.
Vectorization and Rasterization: Self-Supervised Learning for Sketch and Handwriting
Self-supervised learning has gained prominence due to its efficacy at
learning powerful representations from unlabelled data that achieve excellent
performance on many challenging downstream tasks. However, supervision-free
pre-text tasks are challenging to design and usually modality specific.
Although there is a rich literature of self-supervised methods for either
spatial (such as images) or temporal data (sound or text) modalities, a common
pre-text task that benefits both modalities is largely missing. In this paper,
we are interested in defining a self-supervised pre-text task for sketches and
handwriting data. This data is uniquely characterised by its existence in dual
modalities of rasterized images and vector coordinate sequences. We address and
exploit this dual representation by proposing two novel cross-modal translation
pre-text tasks for self-supervised feature learning: Vectorization and
Rasterization. Vectorization learns to map image space to vector coordinates
and rasterization maps vector coordinates to image space. We show that our
learned encoder modules benefit both raster-based and vector-based downstream
approaches to analysing hand-drawn data. Empirical evidence shows that our
novel pre-text tasks surpass existing single and multi-modal self-supervision
methods.
Comment: IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2021
Code : https://github.com/AyanKumarBhunia/Self-Supervised-Learning-for-Sketc
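The two modalities described in this abstract can be made concrete with a hand-written rasterizer that turns vector coordinate sequences into a binary image. This is a minimal sketch of a fixed rasterization routine, not the learned Rasterization model of the paper: the function name `rasterize`, the coordinate range [0, 1], and the sampling scheme are all assumptions.

```python
def rasterize(strokes, width, height):
    """Draw polyline strokes onto a binary grid.

    strokes -- list of strokes, each a list of (x, y) points with x, y in [0, 1]
    Returns a height x width grid of 0/1 values, indexed as grid[row][col].
    """
    grid = [[0] * width for _ in range(height)]
    steps = max(width, height)  # samples taken along each segment
    for stroke in strokes:
        # Walk consecutive point pairs; a single-point stroke draws nothing.
        for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
            for s in range(steps + 1):
                t = s / steps
                # Linear interpolation, clamped so x = 1.0 or y = 1.0 stays in range.
                col = min(width - 1, int((x0 + t * (x1 - x0)) * width))
                row = min(height - 1, int((y0 + t * (y1 - y0)) * height))
                grid[row][col] = 1
    return grid


grid = rasterize([[(0.0, 0.0), (1.0, 0.0)]], 4, 4)
```

The reverse direction, recovering stroke coordinates from such a grid, has no equally simple closed form, which suggests why it is posed as a learning task.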
A survey on knowledge-enhanced multimodal learning
Multimodal learning has been a field of increasing interest, aiming to
combine various modalities in a single joint representation. Especially in the
area of visiolinguistic (VL) learning, multiple models and techniques have been
developed, targeting a variety of tasks that involve images and text. VL models
have reached unprecedented performances by extending the idea of Transformers,
so that both modalities can learn from each other. Massive pre-training
procedures enable VL models to acquire a certain level of real-world
understanding, although many gaps can be identified: the limited comprehension
of commonsense, factual, temporal and other everyday knowledge aspects
questions the extendability of VL tasks. Knowledge graphs and other knowledge
sources can fill those gaps by explicitly providing missing information,
unlocking novel capabilities of VL models. At the same time, knowledge graphs
enhance explainability, fairness and validity of decision making, issues of
utmost importance for such complex implementations. The current survey aims
to unify the fields of VL representation learning and knowledge graphs, and
provides a taxonomy and analysis of knowledge-enhanced VL models.