Neural Data-to-Text Generation via Jointly Learning the Segmentation and Correspondence
The neural attention model has achieved great success in data-to-text generation tasks. Though usually excelling at producing fluent text, it suffers from the problems of missing information, repetition, and "hallucination". Due to
the black-box nature of the neural attention architecture, avoiding these
problems in a systematic way is non-trivial. To address this concern, we
propose to explicitly segment target text into fragment units and align them
with their data correspondences. The segmentation and correspondence are
jointly learned as latent variables without any human annotations. We further
impose a soft statistical constraint to regularize the segmental granularity.
The resulting architecture maintains the same expressive power as neural
attention models, while being able to generate fully interpretable outputs with
several times less computational cost. On both E2E and WebNLG benchmarks, we
show the proposed model consistently outperforms its neural attention
counterparts.
Comment: Accepted at ACL 2020
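A minimal numpy sketch of the core idea described above: treat the segmentation of the target text and the segment-to-record alignment as latent variables, and marginalize over both with a semi-Markov dynamic program so they can be learned without annotations. The scorer `seg_score` and all names here are hypothetical stand-ins, not the authors' code.

```python
import numpy as np

T, K, MAX_LEN = 8, 3, 4          # target length, number of data records, max segment length
rng = np.random.default_rng(0)
# seg_score[i, j, k]: stand-in for P(tokens i..j-1 | record k) from a neural scorer
seg_score = rng.random((T + 1, T + 1, K))

def marginal_likelihood(seg_score):
    """Sum over all segmentations and record alignments (forward recursion)."""
    alpha = np.zeros(T + 1)
    alpha[0] = 1.0
    for j in range(1, T + 1):
        for i in range(max(0, j - MAX_LEN), j):
            # a segment covering tokens [i, j) may align to any record k
            alpha[j] += alpha[i] * seg_score[i, j].sum()
    return alpha[T]

print(marginal_likelihood(seg_score))
```

Because the recursion sums rather than maximizes, gradients flow to every segmentation, which is what lets the latent structure be trained jointly with generation.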
Training a Hidden Markov Model with a Bayesian Spiking Neural Network
It is of some interest to understand how statistically based mechanisms for
signal processing might be integrated with biologically motivated mechanisms
such as neural networks. This paper explores a novel hybrid approach for
classifying segments of sequential data, such as individual spoken words. The
approach combines a hidden Markov model (HMM) with a spiking neural network
(SNN). The HMM, consisting of states and transitions, forms a fixed backbone
with nonadaptive transition probabilities. The SNN, however, implements a
biologically based Bayesian computation that derives from the spike
timing-dependent plasticity (STDP) learning rule. The emission (observation)
probabilities of the HMM are represented in the SNN and trained with the STDP
rule. A separate SNN with the same architecture is associated with each state of the HMM. Because of the STDP training, each SNN implements an
expectation maximization algorithm to learn the emission probabilities for one
HMM state. The model was studied on synthesized spike-train data and also on
spoken word data. Preliminary results suggest its performance compares
favorably with other biologically motivated approaches. Because of the model's
uniqueness and initial promise, it warrants further study. It provides some new
ideas on how the brain might implement the equivalent of an HMM in a neural
circuit.
Comment: Bayesian Spiking Neural Network, Revision submitted: April-27-201
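A rough sketch of the training setup described above: the HMM backbone (states and transition probabilities) stays fixed, and only the per-state emission distributions are learned. Here a plain Baum-Welch emission-only update stands in for the SNN/STDP mechanism that the paper uses for the same expectation-maximization role; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
S, V = 3, 5                       # hidden states, discrete observation symbols
A = np.full((S, S), 1.0 / S)      # fixed, nonadaptive transitions
pi = np.full(S, 1.0 / S)
B = rng.dirichlet(np.ones(V), size=S)   # emissions: the only trained parameters
obs = rng.integers(0, V, size=50)

for _ in range(20):
    # E-step: forward-backward to get state posteriors gamma[t, s]
    T = len(obs)
    alpha = np.zeros((T, S)); beta = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: re-estimate emissions only; transitions stay fixed
    for v in range(V):
        B[:, v] = gamma[obs == v].sum(axis=0)
    B /= B.sum(axis=1, keepdims=True)

print(B.round(3))
```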
NENET: An Edge Learnable Network for Link Prediction in Scene Text
Scene text detection methods based on deep neural networks have shown promising results. Instead of using word bounding box regression, recent state-of-the-art methods have started focusing on character bounding boxes and pixel-level prediction. This necessitates linking adjacent characters, which we address in this paper using a novel Graph Neural Network (GNN) architecture that allows us to learn both node and edge features, as opposed to only the node features in a typical GNN. The main advantage of using a GNN for link
prediction lies in its ability to connect characters which are spatially
separated and have an arbitrary orientation. We show our concept on the well
known SynthText dataset, achieving top results as compared to state-of-the-art
methods.
Comment: 9 pages
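A hypothetical sketch of the key idea: a message-passing layer that updates edge features alongside node features, so a candidate link (edge) between two character boxes can be classified directly from its learned representation. Shapes and weight names are illustrative assumptions, not the NENET architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
N, E, D = 4, 5, 8                               # nodes, edges, feature dim
x = rng.random((N, D))                          # node features (e.g. character boxes)
e = rng.random((E, D))                          # edge features (candidate links)
src = np.array([0, 0, 1, 2, 3]); dst = np.array([1, 2, 3, 3, 0])
W_e = rng.random((3 * D, D)); W_n = rng.random((2 * D, D))

def layer(x, e):
    # edge update: combine the edge with its two endpoint nodes
    e_new = np.tanh(np.concatenate([x[src], x[dst], e], axis=1) @ W_e)
    # node update: aggregate incoming edge messages, then mix with the node
    agg = np.zeros_like(x)
    np.add.at(agg, dst, e_new)
    x_new = np.tanh(np.concatenate([x, agg], axis=1) @ W_n)
    return x_new, e_new

x, e = layer(x, e)
link_logits = e @ rng.random(D)   # per-edge score for link prediction
print(link_logits)
```

Scoring edges rather than nodes is what lets spatially separated, arbitrarily oriented characters be connected.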
Table understanding in structured documents
Table detection and extraction have been studied in the context of documents like reports, where tables are clearly outlined and stand out visually from the document structure. We study this topic in the rather more challenging domain of layout-heavy business documents, particularly invoices. Invoices present the novel challenge of tables that often lack outlines, whether borders or surrounding text flow, and that have ragged columns and widely varying data content. We also show that we can extract specific information from structurally different tables or table-like structures with one model. We present a comprehensive representation of a page using a graph over word boxes, positional embeddings, and trainable textual features, and rephrase table detection as a text box labeling problem. We work on our newly presented dataset of pro forma invoices, invoices, and debit note documents using this representation, and propose multiple baselines to solve
this labeling problem. We then propose a novel neural network model that
achieves strong, practical results on the presented dataset and analyze the
model performance and effects of graph convolutions and self-attention in
detail.
Comment: Changed from the previous version based on icdar2019 feedback; now 6 pages, 2 figures. Slightly changed the paper name and abstract to be less misleading. Corrected grammar, shortened content heavily, corrected misleading information, and improved readability. Currently in review for the icdar2019-wml subconference/workshop
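A rough sketch of the page representation described above: word boxes become graph nodes carrying positional and textual features, neighbours are linked by spatial proximity, and table detection becomes per-box labeling. The 2-nearest-neighbour construction and all names are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np

boxes = np.array([[10, 10, 60, 22],    # x0, y0, x1, y1 per word box
                  [70, 10, 130, 22],
                  [10, 30, 60, 42],
                  [70, 30, 130, 42]], dtype=float)
text_feat = np.eye(4)                  # stand-in for trainable textual features

centers = (boxes[:, :2] + boxes[:, 2:]) / 2.0
pos_embed = centers / centers.max()    # naive positional embedding

# link each word box to its 2 nearest neighbours by centre distance
dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=-1)
np.fill_diagonal(dist, np.inf)
edges = [(i, j) for i in range(len(boxes)) for j in np.argsort(dist[i])[:2]]

node_feat = np.concatenate([pos_embed, text_feat], axis=1)
print(edges)              # graph structure over word boxes
print(node_feat.shape)    # input to a graph network that labels each box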
A Review of Research on Devnagari Character Recognition
English Character Recognition (CR) has been extensively studied in the last half century and has progressed to a level sufficient to produce technology-driven applications. But the same is not the case for Indian languages, which are complicated in terms of structure and computation. Rapidly growing computational power may enable the implementation of Indic CR methodologies.
Digital document processing is gaining popularity for application to office and
library automation, bank and postal services, publishing houses and
communication technology. Devnagari, being the national language of India, spoken by more than 500 million people, should be given special attention so that document retrieval and analysis of rich ancient and modern Indian literature can be done effectively. This article is intended to serve as a
guide and update for the readers, working in the Devnagari Optical Character
Recognition (DOCR) area. An overview of DOCR systems is presented and the
available DOCR techniques are reviewed. The current status of DOCR is discussed
and directions for future research are suggested.
Comment: 8 pages, 1 figure, 8 tables, journal paper
Natural Language Processing (almost) from Scratch
We propose a unified neural network architecture and learning algorithm that
can be applied to various natural language processing tasks including:
part-of-speech tagging, chunking, named entity recognition, and semantic role
labeling. This versatility is achieved by trying to avoid task-specific
engineering and therefore disregarding a lot of prior knowledge. Instead of
exploiting man-made input features carefully optimized for each task, our
system learns internal representations on the basis of vast amounts of mostly
unlabeled training data. This work is then used as a basis for building a
freely available tagging system with good performance and minimal computational
requirements.
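A sketch of the unified architecture in spirit: one shared word-embedding table and window encoder feed several task-specific output layers, so no per-task feature engineering is needed. Dimensions, tag-set sizes, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, WIN, H = 1000, 50, 5, 100                      # vocab, embed dim, window, hidden
emb = rng.normal(size=(V, D))                        # shared across all tasks
W1 = rng.normal(size=(WIN * D, H))                   # shared window encoder
heads = {"pos": rng.normal(size=(H, 45)),            # task-specific output layers
         "ner": rng.normal(size=(H, 9))}

def tag(window_ids, task):
    """Score tags for the centre word of a WIN-word window."""
    h = np.tanh(emb[window_ids].reshape(-1) @ W1)    # shared representation
    return h @ heads[task]

print(tag(np.array([5, 17, 2, 90, 4]), "pos").shape)   # (45,) tag scores
```

Because `emb` and `W1` are shared, mostly unlabeled data can shape the internal representations that every task then reuses.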
Deep Structured Prediction with Nonlinear Output Transformations
Deep structured models are widely used for tasks like semantic segmentation,
where explicit correlations between variables provide important prior
information which generally helps to reduce the data needs of deep nets.
However, current deep structured models are restricted by oftentimes very local
neighborhood structure, which cannot be increased for computational complexity
reasons, and by the fact that the output configuration, or a representation
thereof, cannot be transformed further. Very recent approaches which address
those issues include graphical model inference inside deep nets so as to permit
subsequent non-linear output space transformations. However, optimization of
those formulations is challenging and not well understood. Here, we develop a
novel model which generalizes existing approaches, such as structured
prediction energy networks, and discuss a formulation which maintains
applicability of existing inference techniques.
Comment: Appearing in NIPS 2018
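A toy sketch of energy-based structured prediction in the spirit of the structured prediction energy networks mentioned above: define an energy over a relaxed output y in [0,1]^n and run projected gradient descent as inference, then apply a further non-linear transformation to the inferred output, which is the kind of step the paper's formulation permits. Everything here is a simplified stand-in, not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n)); A = A + A.T            # pairwise interactions
b = rng.normal(size=n)                              # unary terms from the input x

def energy(y):
    return y @ A @ y + b @ y

y = np.full(n, 0.5)                                 # relaxed output in [0,1]^n
for _ in range(100):                                # inference = minimise energy
    grad = 2 * A @ y + b                            # gradient of y'Ay + b'y
    y = np.clip(y - 0.05 * grad, 0.0, 1.0)          # projected gradient step

# non-linear output-space transformation applied to the inferred structure
z = np.tanh(rng.normal(size=(n, n)) @ y)
print(energy(y), z.shape)
```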
Neural Speech Synthesis with Transformer Network
Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the
success of Transformer network in neural machine translation (NMT), in this
paper, we introduce and adapt the multi-head attention mechanism to replace the
RNN structures and also the original attention mechanism in Tacotron2. With the
help of multi-head self-attention, the hidden states in the encoder and decoder
are constructed in parallel, which improves the training efficiency. Meanwhile,
any two inputs at different times are connected directly by self-attention
mechanism, which solves the long range dependency problem effectively. Using
phoneme sequences as input, our Transformer TTS network generates mel
spectrograms, followed by a WaveNet vocoder to output the final audio results.
Experiments are conducted to test the efficiency and performance of our new
network. For efficiency, our Transformer TTS network speeds up training by about 4.25 times compared with Tacotron2. For performance,
rigorous human tests show that our proposed model achieves state-of-the-art
performance (outperforms Tacotron2 with a gap of 0.048) and is very close to
human quality (4.39 vs. 4.44 in MOS).
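A minimal numpy sketch of the multi-head self-attention that replaces the RNNs in this design: every position attends to every other position directly, so all hidden states are computed in parallel and long-range dependencies cost a single step. Head count and sizes are illustrative, not Tacotron2's or this paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, H = 10, 64, 4                      # sequence length, model dim, heads
x = rng.normal(size=(T, D))              # e.g. encoded phoneme sequence
Wq, Wk, Wv, Wo = (rng.normal(size=(D, D)) * 0.1 for _ in range(4))

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x):
    d_h = D // H
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # split into heads: (H, T, d_h)
    q, k, v = (m.reshape(T, H, d_h).transpose(1, 0, 2) for m in (q, k, v))
    att = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_h))   # (H, T, T)
    out = (att @ v).transpose(1, 0, 2).reshape(T, D)         # merge heads
    return out @ Wo

print(multi_head_self_attention(x).shape)   # (T, D), all positions in parallel
```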
Improving patch-based scene text script identification with ensembles of conjoined networks
This paper focuses on the problem of script identification in scene text
images. Facing this problem with state-of-the-art CNN classifiers is not
straightforward, as they fail to address a key characteristic of scene text
instances: their extremely variable aspect ratio. Instead of resizing input
images to a fixed aspect ratio as in the typical use of holistic CNN
classifiers, we propose here a patch-based classification framework in order to
preserve discriminative parts of the image that are characteristic of its
class. We describe a novel method based on the use of ensembles of conjoined
networks to jointly learn discriminative stroke-parts representations and their
relative importance in a patch-based classification scheme. Our experiments
with this learning procedure demonstrate state-of-the-art results in two public
script identification datasets. In addition, we propose a new public benchmark
dataset for the evaluation of multi-lingual scene text end-to-end reading
systems. Experiments done in this dataset demonstrate the key role of script
identification in a complete end-to-end system that combines our script
identification method with a previously published text detector and an
off-the-shelf OCR engine.
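An illustrative sketch of the patch-based scheme described above: instead of resizing the whole word image to a fixed aspect ratio, slide fixed-size patches across it, score each patch per script, and aggregate with per-patch importance weights. The scorer and weights are random stand-ins for the conjoined networks and their learned importance.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, P, C = 32, 200, 32, 5             # image height/width, patch size, scripts

image = rng.random((H, W))              # wide word image, aspect ratio preserved
patches = [image[:, i:i + P] for i in range(0, W - P + 1, P)]

def score_patch(patch):                 # stand-in for the shared CNN branch
    return rng.random(C)

scores = np.stack([score_patch(p) for p in patches])    # (num_patches, C)
weights = rng.random(len(patches))                      # stand-in importance
weights /= weights.sum()
final = weights @ scores                                # weighted ensemble vote
print("predicted script:", final.argmax())
```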
Graph U-Nets
We consider the problem of representation learning for graph data.
Convolutional neural networks can naturally operate on images, but have
significant challenges in dealing with graph data. Since images are special cases of graphs whose nodes lie on 2D lattices, graph embedding tasks have a natural correspondence with image pixel-wise prediction tasks such as
segmentation. While encoder-decoder architectures like U-Nets have been
successfully applied on many image pixel-wise prediction tasks, similar methods
are lacking for graph data. This is due to the fact that pooling and
up-sampling operations are not natural on graph data. To address these
challenges, we propose novel graph pooling (gPool) and unpooling (gUnpool)
operations in this work. The gPool layer adaptively selects some nodes to form
a smaller graph based on their scalar projection values on a trainable
projection vector. We further propose the gUnpool layer as the inverse
operation of the gPool layer. The gUnpool layer restores the graph into its
original structure using the position information of nodes selected in the
corresponding gPool layer. Based on our proposed gPool and gUnpool layers, we
develop an encoder-decoder model on graphs, known as the graph U-Net. Our
experimental results on node classification and graph classification tasks
demonstrate that our methods achieve consistently better performance than
previous models.
Comment: 10 pages, ICML 2019
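A direct numpy sketch of the gPool and gUnpool operations as described in the abstract: gPool scores nodes by scalar projection onto a trainable vector and keeps the top-k, and gUnpool scatters the pooled features back to the remembered node positions. Gating and adjacency handling are simplified; names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, k = 6, 4, 3
x = rng.random((N, D))                        # node features
p = rng.normal(size=D)                        # trainable projection vector

def gpool(x, k):
    y = x @ p / np.linalg.norm(p)             # scalar projection per node
    idx = np.argsort(y)[-k:]                  # select top-k nodes
    gate = np.tanh(y[idx])[:, None]           # gate so p receives gradient
    return x[idx] * gate, idx

def gunpool(x_small, idx, n):
    out = np.zeros((n, x_small.shape[1]))     # empty rows for dropped nodes
    out[idx] = x_small                        # restore original positions
    return out

x_small, idx = gpool(x, k)
x_restored = gunpool(x_small, idx, N)
print(x_small.shape, x_restored.shape)        # (3, 4) (6, 4)
```

Remembering `idx` is what makes the unpooling exact, mirroring how U-Net skip connections restore spatial resolution in images.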