795 research outputs found
Bilateral Multi-Perspective Matching for Natural Language Sentences
Natural language sentence matching is a fundamental technology for a variety
of tasks. Previous approaches either match sentences from a single direction or
only apply single granular (word-by-word or sentence-by-sentence) matching. In
this work, we propose a bilateral multi-perspective matching (BiMPM) model
under the "matching-aggregation" framework. Given two sentences and ,
our model first encodes them with a BiLSTM encoder. Next, we match the two
encoded sentences in two directions and . In
each matching direction, each time step of one sentence is matched against all
time-steps of the other sentence from multiple perspectives. Then, another
BiLSTM layer is utilized to aggregate the matching results into a fix-length
matching vector. Finally, based on the matching vector, the decision is made
through a fully connected layer. We evaluate our model on three tasks:
paraphrase identification, natural language inference and answer sentence
selection. Experimental results on standard benchmark datasets show that our
model achieves the state-of-the-art performance on all tasks.Comment: To appear in Proceedings of IJCAI 201
One-Shot Learning for Semantic Segmentation
Low-shot learning methods for image classification support learning from
sparse data. We extend these techniques to support dense semantic image
segmentation. Specifically, we train a network that, given a small set of
annotated images, produces parameters for a Fully Convolutional Network (FCN).
We use this FCN to perform dense pixel-level prediction on a test image for the
new semantic class. Our architecture shows a 25% relative meanIoU improvement
compared to the best baseline methods for one-shot segmentation on unseen
classes in the PASCAL VOC 2012 dataset and is at least 3 times faster.Comment: To appear in the proceedings of the British Machine Vision Conference
(BMVC) 2017. The code is available at https://github.com/lzzcd001/OSLS
Unsupervised Learning of Visual Representations using Videos
Is strong supervision necessary for learning a good visual representation? Do
we really need millions of semantically-labeled images to train a Convolutional
Neural Network (CNN)? In this paper, we present a simple yet surprisingly
powerful approach for unsupervised learning of CNN. Specifically, we use
hundreds of thousands of unlabeled videos from the web to learn visual
representations. Our key idea is that visual tracking provides the supervision.
That is, two patches connected by a track should have similar visual
representation in deep feature space since they probably belong to the same
object or object part. We design a Siamese-triplet network with a ranking loss
function to train this CNN representation. Without using a single image from
ImageNet, just using 100K unlabeled videos and the VOC 2012 dataset, we train
an ensemble of unsupervised networks that achieves 52% mAP (no bounding box
regression). This performance comes tantalizingly close to its
ImageNet-supervised counterpart, an ensemble which achieves a mAP of 54.4%. We
also show that our unsupervised network can perform competitively in other
tasks such as surface-normal estimation
A Hybrid Siamese Neural Network for Natural Language Inference in Cyber-Physical Systems
Cyber-Physical Systems (CPS), as a multi-dimensional complex system that connects the physical world and the cyber world, has a strong demand for processing large amounts of heterogeneous data. These tasks also include Natural Language Inference (NLI) tasks based on text from different sources. However, the current research on natural language processing in CPS does not involve exploration in this field. Therefore, this study proposes a Siamese Network structure that combines Stacked Residual Long Short-Term Memory (bidirectional) with the Attention mechanism and Capsule Network for the NLI module in CPS, which is used to infer the relationship between text/language data from different sources. This model is mainly used to implement NLI tasks and conduct a detailed evaluation in three main NLI benchmarks as the basic semantic understanding module in CPS. Comparative experiments prove that the proposed method achieves competitive performance, has a certain generalization ability, and can balance the performance and the number of trained parameters
Looking Beyond Appearances: Synthetic Training Data for Deep CNNs in Re-identification
Re-identification is generally carried out by encoding the appearance of a
subject in terms of outfit, suggesting scenarios where people do not change
their attire. In this paper we overcome this restriction, by proposing a
framework based on a deep convolutional neural network, SOMAnet, that
additionally models other discriminative aspects, namely, structural attributes
of the human figure (e.g. height, obesity, gender). Our method is unique in
many respects. First, SOMAnet is based on the Inception architecture, departing
from the usual siamese framework. This spares expensive data preparation
(pairing images across cameras) and allows the understanding of what the
network learned. Second, and most notably, the training data consists of a
synthetic 100K instance dataset, SOMAset, created by photorealistic human body
generation software. Synthetic data represents a good compromise between
realistic imagery, usually not required in re-identification since surveillance
cameras capture low-resolution silhouettes, and complete control of the
samples, which is useful in order to customize the data w.r.t. the surveillance
scenario at-hand, e.g. ethnicity. SOMAnet, trained on SOMAset and fine-tuned on
recent re-identification benchmarks, outperforms all competitors, matching
subjects even with different apparel. The combination of synthetic data with
Inception architectures opens up new research avenues in re-identification.Comment: 14 page
- …