Document Modeling with Graph Attention Networks for Multi-grained Machine Reading Comprehension
Natural Questions is a challenging new machine reading comprehension benchmark with answers at two granularities: a long answer (typically a paragraph) and a short answer (one or more entities inside the long answer).
Despite the effectiveness of existing methods on this benchmark, they treat
these two sub-tasks individually during training while ignoring their
dependencies. To address this issue, we present a novel multi-grained machine reading comprehension framework that models the hierarchical nature of documents at different levels of granularity: documents, paragraphs, sentences, and tokens. We utilize graph attention networks to
obtain different levels of representations so that they can be learned
simultaneously. The long and short answers can then be extracted from the paragraph-level and token-level representations, respectively. In this way, we model the dependencies between the two answer granularities so that each provides evidence for the other. We jointly train the two sub-tasks, and our
experiments show that our approach significantly outperforms previous systems
on both the long and short answer criteria.
Comment: ACL 2020
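The hierarchy-aware modeling rests on graph attention: each node (token, sentence, paragraph, or document) updates its representation by attending over its neighbors in the document graph. Below is a minimal single-head graph attention layer in PyTorch illustrating the general mechanism; this is a sketch of standard GAT, not the authors' implementation, and the hierarchy-specific graph construction is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head GAT layer; adj is a 0/1 adjacency mask with self-loops."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (N, in_dim) node features, one node per document-graph element
        z = self.proj(h)                                   # (N, out_dim)
        n = z.size(0)
        # Attention logits e_ij = LeakyReLU(a^T [z_i || z_j]) for all pairs
        zi = z.unsqueeze(1).expand(n, n, -1)
        zj = z.unsqueeze(0).expand(n, n, -1)
        e = F.leaky_relu(self.attn(torch.cat([zi, zj], dim=-1))).squeeze(-1)
        # Restrict attention to graph edges, normalize per neighborhood
        alpha = torch.softmax(e.masked_fill(adj == 0, float("-inf")), dim=-1)
        return torch.relu(alpha @ z)                       # (N, out_dim)

# Toy usage: 5 nodes in a chain, plus self-loops
adj = torch.eye(5) + torch.diag(torch.ones(4), 1) + torch.diag(torch.ones(4), -1)
out = GraphAttentionLayer(16, 8)(torch.randn(5, 16), adj)
```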
IRC: Cross-layer design exploration of Intermittent Robust Computation units for IoTs
Energy-harvesting-powered computing offers intriguing and vast opportunities
to dramatically transform the landscape of the Internet of Things (IoT) devices
by utilizing ambient sources of energy to achieve battery-free computing. In
order to operate within the restricted energy capacity and intermittency profile, we propose the Intermittent Robust Computation (IRC) Unit, a new duty-cycle-variable computing approach that leverages the non-volatility inherent in spin-based switching devices. The foundations of IRC will be
advanced from the device-level upwards, by extending a Spin Hall Effect
Magnetic Tunnel Junction (SHE-MTJ) device. The device will then be used to
realize SHE-MTJ Majority/Polymorphic Gate (MG/PG) logic approaches and
libraries. Then a Logic-Embedded Flip-Flop (LE-FF) is developed to realize
rudimentary Boolean logic functions along with an inherent state-holding
capability within a compact footprint. Finally, the NV-Clustering synthesis
procedure and corresponding tool module are proposed to instantiate the LE-FF
library cells within conventional Register Transfer Language (RTL)
specifications. This selectively clusters together logic and NV state-holding
functionality, based on energy and area minimization criteria. It also realizes
middleware-coherent, intermittent computation without checkpointing,
micro-tasking, or software bloat and energy overheads vital to IoT. Simulation
results for various benchmark circuits, including ISCAS'89, validate functionality and demonstrate power dissipation, area, and delay benefits.
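For intuition, the Boolean behavior of the majority/polymorphic gate (MG/PG) logic described above can be modeled in a few lines. This is a behavioral sketch only; the SHE-MTJ device physics and non-volatile state-holding are abstracted away entirely.

```python
def maj3(a: int, b: int, c: int) -> int:
    """3-input majority: output is 1 when at least two inputs are 1."""
    return 1 if (a + b + c) >= 2 else 0

def polymorphic_gate(a: int, b: int, ctrl: int) -> int:
    """Acts as AND when ctrl=0 and as OR when ctrl=1."""
    return maj3(a, b, ctrl)

if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            assert polymorphic_gate(a, b, 0) == (a & b)   # AND mode
            assert polymorphic_gate(a, b, 1) == (a | b)   # OR mode
```

Tying one majority-gate input to a control line is what makes the gate polymorphic: the same hardware computes AND or OR depending on the control value.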
Full-Network Embedding in a Multimodal Embedding Pipeline
The current state-of-the-art for image annotation and image retrieval tasks
is obtained through deep neural networks, which combine an image representation
and a text representation into a shared embedding space. In this paper we
evaluate the impact of using the Full-Network embedding in this setting,
replacing the original image representation in a competitive multimodal
embedding generation scheme. Unlike the one-layer image embeddings typically
used by most approaches, the Full-Network embedding provides a multi-scale
representation of images, which results in richer characterizations. To measure
the influence of the Full-Network embedding, we evaluate its performance on
three different datasets, and compare the results with the original multimodal
embedding generation scheme when using a one-layer image embedding, and with
the rest of the state-of-the-art. Results for image annotation and image
retrieval tasks indicate that the Full-Network embedding is consistently
superior to the one-layer embedding. These results motivate the integration of
the Full-Network embedding on any multimodal embedding generation scheme,
something feasible thanks to the flexibility of the approach.Comment: In 2nd Workshop on Semantic Deep Learning (SemDeep-2) at the 12th
International Conference on Computational Semantics (IWCS) 201
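As a rough sketch of what a multi-layer image embedding looks like in practice, one can hook several stages of a pretrained CNN and concatenate their pooled activations. The real Full-Network embedding also standardizes and discretizes the per-feature values, which is omitted here; the layer names below assume a torchvision ResNet-18 as a stand-in backbone.

```python
import torch
import torchvision.models as models

def full_network_embedding(model, image, layer_names):
    """Concatenate globally pooled activations from several conv stages."""
    feats, hooks = [], []

    def save(module, inputs, output):
        feats.append(output.mean(dim=(2, 3)))  # global average pool -> (B, C)

    for name, module in model.named_modules():
        if name in layer_names:
            hooks.append(module.register_forward_hook(save))
    with torch.no_grad():
        model(image)
    for h in hooks:
        h.remove()
    return torch.cat(feats, dim=1)  # one multi-scale vector per image

# weights=None keeps the example download-free; in practice use pretrained weights
cnn = models.resnet18(weights=None).eval()
emb = full_network_embedding(cnn, torch.randn(1, 3, 224, 224),
                             {"layer2", "layer3", "layer4"})
print(emb.shape)  # torch.Size([1, 896]) = 128 + 256 + 512 channels
```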
Measuring Human Perception to Improve Handwritten Document Transcription
The subtleties of human perception, as measured by vision scientists through
the use of psychophysics, are important clues to the internal workings of
visual recognition. For instance, measured reaction time can indicate whether a
visual stimulus is easy or hard for a subject to recognize. In
this paper, we consider how to incorporate psychophysical measurements of
visual perception into the loss function of a deep neural network being trained
for a recognition task, under the assumption that such information can enforce
consistency with human behavior. As a case study to assess the viability of
this approach, we look at the problem of handwritten document transcription.
While good progress has been made towards automatically transcribing modern
handwriting, significant challenges remain in transcribing historical
documents. Here we describe a general enhancement strategy, underpinned by the
new loss formulation, which can be applied to the training regime of any deep
learning-based document transcription system. Through experimentation, reliable
performance improvement is demonstrated for the standard IAM and RIMES datasets
for three different network architectures. Further, we go on to show
feasibility for our approach on a new dataset of digitized Latin manuscripts,
originally produced by scribes in the Cloister of St. Gall in the 9th century.
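One plausible form of such a loss is sketched below, under the assumption that each training sample carries a normalized human reaction-time measurement; the paper's exact formulation and weighting direction may differ.

```python
import torch
import torch.nn.functional as F

def psychophysical_loss(logits, targets, reaction_times):
    """Cross entropy reweighted by per-sample human reaction times."""
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    # Map reaction times to weights in (0, 1]: slower response = harder item.
    # Down-weighting hard items is one choice; the opposite is also plausible.
    weights = 1.0 / (1.0 + reaction_times)
    return (weights * per_sample).mean()

# Toy batch: 4 samples, 10 classes, reaction times in seconds
loss = psychophysical_loss(torch.randn(4, 10),
                           torch.tensor([1, 3, 0, 7]),
                           torch.tensor([0.3, 1.2, 0.5, 2.0]))
```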
OneNet: Joint Domain, Intent, Slot Prediction for Spoken Language Understanding
In practice, most spoken language understanding systems process user input in a pipelined manner: first the domain is predicted, then the intent and semantic slots are inferred according to the semantic frames of the predicted domain. The
pipeline approach, however, has some disadvantages: error propagation and lack
of information sharing. To address these issues, we present a unified neural
network that jointly performs domain, intent, and slot predictions. Our
approach adopts a principled architecture for multitask learning to fold in the
state-of-the-art models for each task. With a few more ingredients, e.g.
orthography-sensitive input encoding and curriculum training, our model
delivered significant improvements in all three tasks across all domains over
strong baselines, including one using oracle prediction for domain detection,
on real user data of a commercial personal assistant.
Comment: 5-page conference paper accepted to IEEE ASRU 2017. Will be published in December 2017
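The joint-prediction idea can be sketched as a shared encoder feeding three task heads trained with one combined objective, so the tasks share information rather than being chained in a pipeline. Dimensions and head designs below are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class OneNetSketch(nn.Module):
    def __init__(self, vocab_size, dim, n_domains, n_intents, n_slot_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.domain_head = nn.Linear(2 * dim, n_domains)   # utterance-level
        self.intent_head = nn.Linear(2 * dim, n_intents)   # utterance-level
        self.slot_head = nn.Linear(2 * dim, n_slot_tags)   # per-token tags

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        h, _ = self.encoder(self.embed(tokens))            # (B, T, 2*dim)
        pooled = h.mean(dim=1)                 # shared utterance vector
        return (self.domain_head(pooled),
                self.intent_head(pooled),
                self.slot_head(h))

# Joint training: sum the three task losses so gradients flow through the
# shared encoder from every task (exact loss weighting is not reproduced).
```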
jiant: A Software Toolkit for Research on General-Purpose Text Understanding Models
We introduce jiant, an open source toolkit for conducting multitask and
transfer learning experiments on English NLU tasks. jiant enables modular and
configuration-driven experimentation with state-of-the-art models and
implements a broad set of tasks for probing, transfer learning, and multitask
training experiments. jiant implements over 50 NLU tasks, including all GLUE
and SuperGLUE benchmark tasks. We demonstrate that jiant reproduces published
performance on a variety of tasks and models, including BERT and RoBERTa. jiant
is available at https://jiant.info.
PyText: A Seamless Path from NLP research to production
We introduce PyText - a deep learning based NLP modeling framework built on
PyTorch. PyText addresses the often-conflicting requirements of enabling rapid
experimentation and of serving models at scale. It achieves this by providing
simple and extensible interfaces for model components, and by using PyTorch's
capabilities of exporting models for inference via the optimized Caffe2
execution engine. We report our own experience of migrating experimentation and
production workflows to PyText, which enabled us to iterate faster on novel
modeling ideas and then seamlessly ship them at industrial scale.
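The research-to-production step PyText automates can be illustrated generically with PyTorch's own ONNX exporter; PyText's actual export wrappers and the Caffe2 serving side are not shown, and the toy model below is purely illustrative.

```python
import torch
import torch.nn as nn

class TinyTextClassifier(nn.Module):
    """Stand-in for a trained research model."""
    def __init__(self, vocab_size=1000, dim=64, n_classes=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.fc = nn.Linear(dim, n_classes)

    def forward(self, tokens):                 # tokens: (batch, seq_len)
        return self.fc(self.embed(tokens).mean(dim=1))

model = TinyTextClassifier().eval()
example_input = torch.randint(0, 1000, (1, 16))
# Export a serving-ready graph; an optimized engine then runs it in production
torch.onnx.export(model, example_input, "classifier.onnx",
                  input_names=["tokens"], output_names=["logits"])
```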
An Online Attention-based Model for Speech Recognition
Attention-based end-to-end models such as Listen, Attend and Spell (LAS) simplify the whole pipeline of traditional automatic speech recognition (ASR) systems and have become popular in the field of speech recognition. In previous work, researchers have shown that such architectures can achieve results comparable to state-of-the-art ASR systems, especially when using a bidirectional encoder and a global soft attention (GSA) mechanism. However, the bidirectional encoder and GSA are two obstacles to real-time speech recognition. In this work, we aim to make the LAS baseline streamable by removing these two obstacles. On the
encoder side, we use a latency-controlled (LC) bidirectional structure to
reduce the delay of forward computation. Meanwhile, an adaptive monotonic
chunk-wise attention (AMoChA) mechanism is proposed to replace GSA for the
calculation of attention weight distribution. Furthermore, we propose two
methods to alleviate the huge performance degradation when combining LC and
AMoChA. Finally, we obtain an online LAS model, LC-AMoChA, which shows only a 3.5% relative performance reduction from the LAS baseline on our internal Mandarin corpus.
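The chunk-wise half of the idea can be sketched as follows: at each decoder step, soft attention is computed only over a fixed-size window of encoder states ending at a monotonically advancing boundary, so the model never waits on future frames the way global soft attention does. The adaptive chunk-size prediction that gives AMoChA its name is omitted from this sketch.

```python
import torch

def chunkwise_attention(query, encoder_states, t, chunk_size=4):
    """Soft attention over a fixed window ending at boundary index t.

    query: (dim,) decoder state; encoder_states: (T, dim); 0 <= t < T.
    """
    lo = max(0, t - chunk_size + 1)
    chunk = encoder_states[lo : t + 1]        # only past/current frames
    scores = chunk @ query                    # dot-product energies
    weights = torch.softmax(scores, dim=0)
    return weights @ chunk                    # context vector, (dim,)

ctx = chunkwise_attention(torch.randn(8), torch.randn(20, 8), t=10)
```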
Detecting Work Zones in SHRP 2 NDS Videos Using Deep Learning Based Computer Vision
Naturalistic driving studies seek to observe human driver behavior across the variety of environmental conditions necessary to analyze, understand, and predict that behavior using statistical and physical models. The
second Strategic Highway Research Program (SHRP 2) funds a number of
transportation safety-related projects including its primary effort, the
Naturalistic Driving Study (NDS), and an effort supplementary to the NDS, the
Roadway Information Database (RID). This work seeks to expand the range of
answerable research questions that researchers might pose to the NDS and RID
databases. Specifically, we present the SHRP 2 NDS Video Analytics (SNVA)
software application, which extracts information from NDS-instrumented
vehicles' forward-facing camera footage and efficiently integrates that
information into the RID, tying the video content to geolocations and other
trip attributes. Of particular interest to researchers and other stakeholders
is the integration of work zone, traffic signal state and weather information.
The version of SNVA introduced in this paper focuses on work zone detection,
the highest priority. The ability to automate the discovery and cataloging of
this information, and to do so quickly, is especially important given the two
petabyte (2PB) size of the NDS video data set.
Comment: IEEE 17th International Conference on Machine Learning and Applications (ICMLA 2018), 3 figures, 1 table, 2 algorithms
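The shape of such a video-to-database pipeline can be sketched simply; this is illustrative, not SNVA's code. Frame-level classifier outputs are collapsed into time-stamped events that can then be joined to trip geolocation records in the RID.

```python
from itertools import groupby

def detect_work_zone_events(frame_predictions, fps=30.0):
    """Collapse per-frame 0/1 predictions into (start_s, end_s) events."""
    events, idx = [], 0
    for label, run in groupby(frame_predictions):
        length = sum(1 for _ in run)
        if label == 1:
            events.append((idx / fps, (idx + length) / fps))
        idx += length
    return events

# Frames 2-4 and 6-7 flagged as work zone at 30 fps
print(detect_work_zone_events([0, 0, 1, 1, 1, 0, 1, 1, 0]))
# -> [(0.0667, 0.1667), (0.2, 0.2667)] approximately
```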
Speech-Driven Facial Reenactment Using Conditional Generative Adversarial Networks
We present a novel approach to generating photo-realistic images of a face
with accurate lip sync, given an audio input. Using a recurrent neural network, we obtain mouth landmarks from audio features. We then exploit the power of conditional generative adversarial networks to produce a highly realistic face conditioned on a set of landmarks. These two networks
together are capable of producing a sequence of natural faces in sync with an
input audio track.
Comment: Submitted for ECCV 2018
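Structurally, the described pipeline is an RNN from audio features to mouth landmarks followed by a conditional generator from landmarks to face frames. The sketch below uses illustrative shapes and layers; the discriminator and adversarial losses are omitted.

```python
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    """RNN mapping per-frame audio features to mouth landmark coordinates."""
    def __init__(self, audio_dim=26, hidden=128, n_landmarks=20):
        super().__init__()
        self.rnn = nn.LSTM(audio_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_landmarks * 2)   # (x, y) per landmark

    def forward(self, audio_feats):                     # (B, T, audio_dim)
        h, _ = self.rnn(audio_feats)
        return self.out(h)                              # (B, T, 2*n_landmarks)

class LandmarksToFace(nn.Module):
    """Conditional generator: landmark vector (+ noise) -> 64x64 RGB frame."""
    def __init__(self, n_landmarks=20, noise_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 2 + noise_dim, 256), nn.ReLU(),
            nn.Linear(256, 64 * 64 * 3), nn.Tanh())

    def forward(self, landmarks, z):        # applied frame by frame
        x = self.net(torch.cat([landmarks, z], dim=-1))
        return x.view(-1, 3, 64, 64)
```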