Neural Recovery Machine for Chinese Dropped Pronoun
Dropped pronouns (DPs) are ubiquitous in pro-drop languages such as Chinese and
Japanese. Previous work mainly focused on painstakingly engineering empirical
features for DP recovery. In this paper, we propose a neural recovery machine
(NRM) to model and recover DPs in Chinese, thereby avoiding the non-trivial
feature engineering process. The experimental results show that the proposed
NRM significantly outperforms state-of-the-art approaches on two heterogeneous
datasets. Further experimental results on Chinese zero pronoun (ZP) resolution
show that ZP resolution performance can also be improved by recovering ZPs to
DPs.
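The abstract does not specify the NRM architecture. One common way to cast DP recovery is as sequence labeling: for each token, predict which pronoun (if any) was dropped before it. Below is a minimal PyTorch sketch under that assumption; the BiGRU tagger and label-set size are illustrative choices, not the paper's actual model.

```python
# Hypothetical sketch: dropped-pronoun recovery as sequence labeling.
# The BiGRU tagger and label set are assumptions, not the paper's NRM.
import torch
import torch.nn as nn

class DPRecoveryTagger(nn.Module):
    def __init__(self, vocab_size, num_labels, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True,
                              bidirectional=True)
        # One label per token: the pronoun to insert before it, or "none".
        self.classifier = nn.Linear(2 * hidden, num_labels)

    def forward(self, token_ids):
        states, _ = self.encoder(self.embed(token_ids))
        return self.classifier(states)  # (batch, seq_len, num_labels)

tagger = DPRecoveryTagger(vocab_size=30000, num_labels=17)
logits = tagger(torch.randint(0, 30000, (2, 20)))
print(logits.shape)  # torch.Size([2, 20, 17])
```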
Multi-Modal Generative Adversarial Network for Short Product Title Generation in Mobile E-Commerce
Nowadays, more and more customers prefer to browse and purchase products using
mobile E-Commerce apps such as Taobao and Amazon. Since merchants are usually
inclined to write redundant, over-informative product titles to attract
customers' attention, it is important to display concise short product titles
on the limited screens of mobile phones. To address this discrepancy, previous
studies mainly consider the textual information of long product titles and lack
a human-like view during training and evaluation. In this paper, we propose a
Multi-Modal Generative Adversarial Network (MM-GAN) for short product title
generation in E-Commerce, which innovatively incorporates image information and
attribute tags from the product, as well as textual information from the
original long titles. MM-GAN poses short title generation as a reinforcement
learning process, where the generated titles are evaluated by the discriminator
in a human-like view. Extensive experiments on a large-scale E-Commerce dataset
demonstrate that our algorithm outperforms other state-of-the-art methods.
Moreover, we deploy our model in a real-world online E-Commerce environment and
improve the click-through rate and click conversion rate by 1.66% and 1.87%,
respectively.
Comment: Accepted by NAACL-HLT 2019. arXiv admin note: substantial text
overlap with arXiv:1811.0449
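The abstract frames generation as reinforcement learning with the discriminator acting as a human-like evaluator. A minimal, self-contained sketch of that training signal is given below; the toy generator and discriminator are stand-ins, not the authors' MM-GAN.

```python
# REINFORCE with a discriminator score as reward (toy modules, assumed shapes).
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Maps multimodal product features to per-step token logits."""
    def __init__(self, feat_dim=64, vocab=1000, steps=8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, steps * vocab)
        self.steps, self.vocab = steps, vocab
    def forward(self, feats):
        return self.proj(feats).view(-1, self.steps, self.vocab)

class ToyDiscriminator(nn.Module):
    """Scores (features, title) pairs in [0, 1], mimicking a human-like view."""
    def __init__(self, feat_dim=64, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, feat_dim)
        self.score = nn.Linear(2 * feat_dim, 1)
    def forward(self, feats, titles):
        t = self.embed(titles).mean(dim=1)
        return torch.sigmoid(self.score(torch.cat([feats, t], dim=1))).squeeze(1)

gen, disc = ToyGenerator(), ToyDiscriminator()
opt = torch.optim.Adam(gen.parameters(), lr=1e-4)
feats = torch.randn(4, 64)                 # stand-in multimodal features

dist = torch.distributions.Categorical(logits=gen(feats))
titles = dist.sample()                     # sampled short titles, (4, 8)
log_prob = dist.log_prob(titles).sum(dim=1)
with torch.no_grad():
    reward = disc(feats, titles)           # discriminator score as reward
loss = -(reward * log_prob).mean()         # policy-gradient objective
opt.zero_grad(); loss.backward(); opt.step()
```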
Multimodal Transformer with Multi-View Visual Representation for Image Captioning
Image captioning aims to automatically generate a natural language
description of a given image, and most state-of-the-art models adopt an
encoder-decoder framework. The framework consists of a convolutional neural
network (CNN)-based image encoder that extracts region-based visual features
from the input image, and a recurrent neural network (RNN)-based caption
decoder that generates the output caption words from the visual features via an
attention mechanism. Despite the success of existing studies, current methods
only model the co-attention that characterizes inter-modal interactions while
neglecting the self-attention that characterizes intra-modal interactions.
Inspired by the success of the Transformer model in machine translation, we
extend it here to a Multimodal Transformer (MT) model for image captioning.
Compared to existing image captioning approaches, the MT model simultaneously
captures intra- and inter-modal interactions in a unified attention block. Due
to the in-depth modular composition of such attention blocks, the MT model can
perform complex multimodal reasoning and output accurate captions. Moreover, to
further improve image captioning performance, multi-view visual features are
seamlessly introduced into the MT model. We quantitatively and qualitatively
evaluate our approach on the benchmark MSCOCO image captioning dataset and
conduct extensive ablation studies to investigate the reasons behind its
effectiveness. The experimental results show that our method significantly
outperforms previous state-of-the-art methods. With an ensemble of seven
models, our solution ranks first on the real-time leaderboard of the MSCOCO
image captioning challenge at the time of writing.
Comment: Submitted to a journal.
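One way to read "intra- and inter-modal interactions in a unified attention block" is self-attention over the concatenation of word and region features, so that word-word, region-region, and word-region interactions are all computed in a single attention call. The sketch below illustrates that reading; it is not the paper's exact MT block.

```python
# Self-attention over concatenated word and region tokens (illustrative).
import torch
import torch.nn as nn

d_model = 512
attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

words = torch.randn(2, 12, d_model)    # caption word features
regions = torch.randn(2, 36, d_model)  # region-based visual features

tokens = torch.cat([words, regions], dim=1)  # one unified token sequence
out, weights = attn(tokens, tokens, tokens)  # intra- and inter-modal at once
print(out.shape)      # torch.Size([2, 48, 512])
print(weights.shape)  # torch.Size([2, 48, 48])
```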
Few-shot Compositional Font Generation with Dual Memory
Generating a new font library is a very labor-intensive and time-consuming
job for glyph-rich scripts. Despite the remarkable success of existing font
generation methods, they have significant drawbacks: they either require a
large number of reference images to generate a new font set, or they fail to
capture detailed styles from only a few samples. In this paper, we focus on
compositional scripts, a widely used class of writing systems in which each
glyph can be decomposed into several components. By exploiting this
compositionality, we propose a novel font generation framework, named Dual
Memory-augmented Font Generation Network (DM-Font), which enables us to
generate a high-quality font library from only a few samples. We employ memory
components and global-context awareness in the generator to take advantage of
the compositionality. In experiments on Korean handwriting fonts and Thai
printing fonts, we observe that our method generates samples of significantly
better quality, with faithful stylization, than state-of-the-art generation
methods both quantitatively and qualitatively. Source code is available at
https://github.com/clovaai/dmfont.
Comment: ECCV 2020 camera-ready.
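The core compositional idea can be sketched independently of the full architecture: a memory keyed by component id is written with style features from the few reference glyphs, and unseen glyphs are assembled by reading back the styles of their components. Everything below (the encoder, decoder, and memory interface) is a hypothetical simplification, not DM-Font's dual memory.

```python
# Hedged sketch of component-keyed style memory for few-shot font generation.
import torch
import torch.nn as nn

class ComponentMemory:
    """Maps component id -> style feature written from reference glyphs."""
    def __init__(self):
        self.store = {}
    def write(self, comp_id, feat):
        self.store[comp_id] = feat
    def read(self, comp_ids):
        return torch.stack([self.store[c] for c in comp_ids])

encoder = nn.Linear(64, 32)   # stand-in for a component-wise style encoder
decoder = nn.Linear(32, 64)   # stand-in for the glyph generator

memory = ComponentMemory()
# Few-shot phase: write component styles observed in the reference glyphs.
for comp_id in [0, 1, 2]:
    memory.write(comp_id, encoder(torch.randn(64)))

# Generation phase: an unseen glyph composed of components 0 and 2.
style = memory.read([0, 2]).mean(dim=0)
glyph = decoder(style)
print(glyph.shape)  # torch.Size([64])
```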
Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Vision-to-language tasks aim to integrate computer vision and natural
language processing, and have attracted the attention of many researchers.
Typical approaches encode an image into feature representations and decode them
into natural language sentences, but they neglect high-level semantic concepts
and the subtle relationships between image regions and natural language
elements. To make full use of this information, this paper exploits text-guided
attention and semantic-guided attention (SA) to find more correlated spatial
information and reduce the semantic gap between vision and language. Our method
includes two levels of attention networks. One is the text-guided attention
network, which is used to select text-related regions. The other is the SA
network, which is used to highlight concept-related regions and region-related
concepts. Finally, all of this information is incorporated to generate captions
or answers. Experiments on image captioning and visual question answering have
been carried out, and the results show the strong performance of the proposed
approach.
Comment: 15 pages, 6 figures, 50 references.
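The text-guided attention described above can be illustrated in a few lines: a text feature scores each region feature, and the softmax-weighted sum highlights text-related regions. This is a generic attention sketch, not the paper's exact network.

```python
# Text-guided attention over image regions (generic illustrative sketch).
import torch
import torch.nn.functional as F

regions = torch.randn(36, 512)  # region features from a CNN
query = torch.randn(512)        # text feature guiding the attention

scores = regions @ query / 512 ** 0.5  # scaled dot-product relevance
alpha = F.softmax(scores, dim=0)       # attention weights over regions
context = alpha @ regions              # (512,) text-related visual context
print(context.shape)
```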
A Survey on Food Computing
Food is essential to human life and fundamental to the human experience.
Food-related study may support multifarious applications and services, such as
guiding human behavior, improving human health, and understanding culinary
culture. With the rapid development of social networks, mobile networks, and
the Internet of Things (IoT), people commonly upload, share, and record food
images, recipes, cooking videos, and food diaries, leading to large-scale food
data. Large-scale food data offers rich knowledge about food and can help
tackle many central issues of human society. Therefore, it is time to group
several disparate food-related issues under one field. Food computing acquires
and analyzes heterogeneous food data from disparate sources for the perception,
recognition, retrieval, recommendation, and monitoring of food. In food
computing, computational approaches are applied to address food-related issues
in medicine, biology, gastronomy, and agronomy. Both large-scale food data and
recent breakthroughs in computer science are transforming the way we analyze
food data. As a result, a vast amount of work has been conducted in the food
area, targeting different food-oriented tasks and applications. However, there
are very few systematic reviews that shape this area well and provide a
comprehensive and in-depth summary of current efforts or detail the open
problems in this area. In this paper, we formalize food computing and present
such a comprehensive overview of its various emerging concepts, methods, and
tasks. We summarize the key challenges and future directions ahead for food
computing. This is the first comprehensive survey targeting the study of
computing technology for the food area, and it also offers a collection of
research studies and technologies to benefit researchers and practitioners
working in different food-related fields.
Comment: Accepted by ACM Computing Surveys.
Hooks in the Headline: Learning to Generate Headlines with Controlled Styles
Current summarization systems produce only plain, factual headlines and do
not meet the practical need for memorable titles that increase exposure. We
propose a new task, Stylistic Headline Generation (SHG), to enrich headlines
with three style options (humor, romance, and clickbait) in order to attract
more readers. With no style-specific article-headline pairs (only a standard
headline summarization dataset and mono-style corpora), our method TitleStylist
generates style-specific headlines by combining the summarization and
reconstruction tasks into a multitasking framework. We also introduce a novel
parameter-sharing scheme to further disentangle the style from the text.
Through both automatic and human evaluation, we demonstrate that TitleStylist
can generate relevant, fluent headlines in the three target styles: humor,
romance, and clickbait. The attraction score of the headlines generated by our
model surpasses that of the state-of-the-art summarization model by 9.68%, and
even outperforms human-written references.
Comment: ACL 2020.
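The multitasking framework can be sketched as one shared sequence model optimized jointly on (a) headline summarization from paired data and (b) reconstruction of the mono-style corpora. The toy model and equal sequence lengths below are simplifications for illustration, not TitleStylist's actual architecture or parameter-sharing scheme.

```python
# Joint summarization + style-reconstruction loss over shared parameters.
import torch
import torch.nn as nn

vocab, d, T = 1000, 64, 8
embed = nn.Embedding(vocab, d)
shared = nn.GRU(d, d, batch_first=True)   # parameters shared by both tasks
head = nn.Linear(d, vocab)
xent = nn.CrossEntropyLoss()

def seq_loss(src, tgt):
    h, _ = shared(embed(src))
    return xent(head(h).reshape(-1, vocab), tgt.reshape(-1))

article  = torch.randint(0, vocab, (4, T))  # toy article tokens
headline = torch.randint(0, vocab, (4, T))  # toy reference headline
noisy    = torch.randint(0, vocab, (4, T))  # corrupted mono-style text
clean    = torch.randint(0, vocab, (4, T))  # original mono-style text

loss = seq_loss(article, headline) + seq_loss(noisy, clean)  # multitask sum
loss.backward()
```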
cvpaper.challenge in 2016: Futuristic Computer Vision through 1,600 Papers Survey
The paper presents the futuristic challenges discussed in the
cvpaper.challenge. In 2015 and 2016, we thoroughly studied 1,600+ papers from
several conferences and journals, including CVPR, ICCV, ECCV, NIPS, PAMI, and
IJCV.
LiveBot: Generating Live Video Comments Based on Visual and Textual Contexts
We introduce the task of automatic live commenting. Live commenting, also
called 'video barrage', is an emerging feature on online video sites that
allows real-time comments from viewers to fly across the screen like bullets or
roll along the right side of the screen. The live comments are a mixture of
opinions about the video and chit-chat with other commenters. Automatic live
commenting requires AI agents to comprehend the videos and interact with the
human viewers who make the comments, so it is a good testbed for an AI agent's
ability to deal with both dynamic vision and language. In this work, we
construct a large-scale live comment dataset with 2,361 videos and 895,929 live
comments. We then introduce two neural models that generate live comments from
visual and textual contexts, achieving better performance than previous neural
baselines such as the sequence-to-sequence model. Finally, we provide a
retrieval-based evaluation protocol for automatic live commenting, in which the
model is asked to sort a set of candidate comments by log-likelihood score and
is evaluated with metrics such as mean reciprocal rank. Putting it all
together, we demonstrate the first 'LiveBot'.
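The retrieval-based protocol is concrete enough to sketch directly: rank each candidate set by the model's score and report the mean reciprocal rank (MRR) of the true comment. The scoring function below is a toy stand-in for a trained commenting model.

```python
# Retrieval-style evaluation: sort candidates by score, compute MRR.
import random

def mean_reciprocal_rank(batches, score):
    """batches: list of (candidate_comments, index_of_true_comment)."""
    total = 0.0
    for candidates, true_idx in batches:
        ranked = sorted(range(len(candidates)),
                        key=lambda i: score(candidates[i]), reverse=True)
        rank = ranked.index(true_idx) + 1  # 1-based rank of the true comment
        total += 1.0 / rank
    return total / len(batches)

# Toy stand-in for a model's log-likelihood score.
score = lambda comment: -abs(len(comment) - 10) + random.random() * 0.1
batches = [(["nice shot!", "lol", "this song is great"], 0)]
print(mean_reciprocal_rank(batches, score))
```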
Deep Learning applied to NLP
Convolutional Neural Networks (CNNs) are typically associated with Computer
Vision. CNNs are responsible for major breakthroughs in Image Classification
and are at the core of most Computer Vision systems today. More recently, CNNs
have been applied to problems in Natural Language Processing with some
interesting results. In this paper, we explain the basics of CNNs, their
different variations, and how they have been applied to NLP.
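A standard example of a CNN applied to NLP, in the spirit of this overview, is Kim (2014)-style sentence classification: 1D convolutions of several widths slide over word embeddings, and max-over-time pooling feeds a classifier. The hyperparameters below are toy values.

```python
# 1D text CNN with multiple filter widths and max-over-time pooling.
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab=10000, emb=100, n_filters=64, classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb, n_filters, kernel_size=k) for k in (3, 4, 5)])
        self.fc = nn.Linear(3 * n_filters, classes)

    def forward(self, ids):
        x = self.embed(ids).transpose(1, 2)      # (batch, emb, seq_len)
        feats = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(feats, dim=1))  # (batch, classes)

model = TextCNN()
print(model(torch.randint(0, 10000, (2, 20))).shape)  # torch.Size([2, 2])
```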