Neural Recovery Machine for Chinese Dropped Pronoun
Dropped pronouns (DPs) are ubiquitous in pro-drop languages such as Chinese
and Japanese. Previous work mainly focused on painstakingly exploring
empirical features for DP recovery. In this paper, we propose a neural
recovery machine (NRM) to model and recover DPs in Chinese, so as to avoid
the non-trivial feature engineering process. The experimental results show
that the proposed NRM significantly outperforms state-of-the-art approaches
on two heterogeneous datasets. Further experimental results on Chinese zero
pronoun (ZP) resolution show that the performance of ZP resolution can also
be improved by recovering ZPs to DPs.
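The abstract does not detail the NRM architecture, but DP recovery can be
cast as per-token sequence labeling: each position is tagged with the
pronoun to insert before the token, or "NONE". A minimal sketch, with a
hand-written toy scorer standing in for the learned neural model (the
pronoun inventory, tokens, and heuristic below are all hypothetical):

```python
# Toy illustration: dropped-pronoun recovery as per-token tagging.
# The scoring function is a hand-written stand-in for a learned
# neural model; it is NOT the paper's NRM.

PRONOUNS = ["NONE", "ta", "wo", "ni"]  # toy pronoun inventory

def score(prev_token, token):
    """Hypothetical scorer: favor inserting a pronoun after a comma
    and before a verb-like token (toy heuristic only)."""
    if prev_token == "," and token.endswith("le"):
        return {"NONE": 0.1, "ta": 0.7, "wo": 0.1, "ni": 0.1}
    return {"NONE": 0.9, "ta": 0.05, "wo": 0.03, "ni": 0.02}

def recover(tokens):
    out, prev = [], "<s>"
    for tok in tokens:
        scores = score(prev, tok)
        label = max(scores, key=scores.get)
        if label != "NONE":
            out.append(label)  # recovered dropped pronoun
        out.append(tok)
        prev = tok
    return out

print(recover(["chi", "fan", ",", "zou-le"]))
# -> ['chi', 'fan', ',', 'ta', 'zou-le']
```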
Fast and Accurate Neural Word Segmentation for Chinese
Neural models with minimal feature engineering have achieved competitive
performance against traditional methods on the task of Chinese word
segmentation. However, both the training and inference procedures of current
neural models are computationally inefficient. This paper presents a greedy
neural word segmenter with balanced word and character embedding inputs to
alleviate these drawbacks. Our segmenter is truly end-to-end, capable of
performing segmentation much faster and even more accurately than
state-of-the-art neural models on Chinese benchmark datasets.
Comment: To appear in ACL 201
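The greedy decoding strategy the abstract mentions can be illustrated with a
toy left-to-right segmenter. The dictionary of word scores below is a
hypothetical stand-in for the paper's balanced word/character embedding
model; only the greedy commit-as-you-go control flow is the point:

```python
# Toy greedy segmentation: at each position, commit to the
# highest-scoring known word starting there, falling back to a
# single character. Scores are hand-set stand-ins for learned
# word/character representations.

WORD_SCORES = {  # hypothetical scores
    "bei": 1.0, "jing": 1.0, "beijing": 2.5,
    "da": 1.0, "xue": 1.0, "daxue": 2.2,
}
MAX_WORD_LEN = 7

def segment(chars):
    words, i = [], 0
    while i < len(chars):
        best, best_score = chars[i], 0.0  # fallback: single char
        for j in range(i + 1, min(i + MAX_WORD_LEN, len(chars)) + 1):
            cand = "".join(chars[i:j])
            s = WORD_SCORES.get(cand, -1.0)
            if s > best_score:
                best, best_score = cand, s
        words.append(best)
        i += len(best)
    return words

print(segment(list("beijingdaxue")))  # -> ['beijing', 'daxue']
```

Greedy decoding commits to one word per step, which is what makes the
segmenter fast relative to models that score whole label sequences.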
Open Vocabulary Learning for Neural Chinese Pinyin IME
Pinyin-to-character (P2C) conversion is the core component of a pinyin-based
Chinese input method engine (IME). However, the conversion is seriously
compromised by the ambiguity of Chinese characters corresponding to pinyin,
as well as by predefined fixed vocabularies. To alleviate these
inconveniences, we propose a neural P2C conversion model augmented by an
online-updated vocabulary with a sampling mechanism to support open
vocabulary learning while the IME is in use. Our experiments show that the
proposed method outperforms commercial IMEs and state-of-the-art traditional
models on a standard corpus and a true input history dataset in terms of
multiple metrics, and thus the online-updated vocabulary indeed helps our
IME follow user input behavior effectively.
Comment: Accepted by ACL 201
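The open-vocabulary idea can be illustrated with a toy dictionary-based
converter (a hypothetical stand-in for the paper's neural model and
sampling mechanism): the IME updates its pinyin-to-word vocabulary from the
user's confirmed conversions, so new words become convertible immediately.

```python
# Toy P2C converter with an online-updated vocabulary. Each pinyin
# string maps to candidate words with counts; the user's confirmed
# choice is fed back into the vocabulary. This stands in for the
# paper's neural model, not a faithful reimplementation.

from collections import defaultdict

class ToyIME:
    def __init__(self):
        # pinyin string -> {word: count}
        self.vocab = defaultdict(dict)
        self.vocab["nihao"]["你好"] = 5  # seed entry

    def convert(self, pinyin):
        """Return the most frequent known word, or None if OOV."""
        cands = self.vocab[pinyin]
        return max(cands, key=cands.get) if cands else None

    def confirm(self, pinyin, word):
        """Online vocabulary update from the user's confirmed choice."""
        self.vocab[pinyin][word] = self.vocab[pinyin].get(word, 0) + 1

ime = ToyIME()
assert ime.convert("moyu") is None   # initially out of vocabulary
ime.confirm("moyu", "摸鱼")          # user teaches the IME
assert ime.convert("moyu") == "摸鱼"  # now convertible
```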
Synonym Discovery with Etymology-based Word Embeddings
We propose a novel approach to learn word embeddings based on an extended
version of the distributional hypothesis. Our model derives word embedding
vectors from the etymological composition of words, rather than the context
in which they appear. It has the strength of not requiring a large text
corpus, but instead requires reliable access to the etymological roots of
words, making it especially fit for languages with logographic writing
systems. The model consists of three steps: (1) building an etymological
graph, which is a bipartite network of words and etymological roots, (2)
obtaining the biadjacency matrix of the etymological graph and reducing its
dimensionality, and (3) using the columns/rows of the resulting matrices as
embedding vectors. We test our model on the Chinese and Sino-Korean
vocabularies. Our graphs are formed by a set of 117,000 Chinese words and a
set of 135,000 Sino-Korean words. In both cases we show that our model
performs well on the task of synonym discovery.
Comment: 6 pages, IEEE Symposium Series on Computational Intelligence (IEEE
SSCI 2017)
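The three steps above can be sketched on a toy graph. The words, roots, and
reduced dimension below are invented for illustration, and truncated SVD is
used as one concrete choice of dimensionality reduction:

```python
# Steps (1)-(3) on a toy etymological graph: build the word-root
# biadjacency matrix, reduce its dimensionality with a truncated
# SVD, and use the resulting rows as word embeddings.

import numpy as np

words = ["daxue", "xuexiao", "dadao"]   # hypothetical words
roots = ["da", "xue", "xiao", "dao"]    # hypothetical roots
edges = {"daxue": ["da", "xue"],
         "xuexiao": ["xue", "xiao"],
         "dadao": ["da", "dao"]}

# (1)+(2) biadjacency matrix B: B[i, j] = 1 iff word i has root j
B = np.zeros((len(words), len(roots)))
for i, w in enumerate(words):
    for r in edges[w]:
        B[i, roots.index(r)] = 1.0

# (2) dimensionality reduction via truncated SVD
U, S, Vt = np.linalg.svd(B, full_matrices=False)
k = 2
word_emb = U[:, :k] * S[:k]   # (3) rows = word embedding vectors

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words sharing a root end up closer than words sharing none,
# which is the basis for synonym discovery in this setup.
print(cos(word_emb[0], word_emb[1]), cos(word_emb[1], word_emb[2]))
```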
COMIC: Towards A Compact Image Captioning Model with Attention
Recent works in image captioning have shown very promising raw performance.
However, we observe that most of these encoder-decoder style networks with
attention do not scale naturally to large vocabulary sizes, making them
difficult to deploy on embedded systems with limited hardware resources.
This is because the size of the word and output embedding matrices grows
proportionally with the vocabulary size, adversely affecting the compactness
of these networks. To address this limitation, this paper introduces a new
idea in the domain of image captioning: we tackle the hitherto unexplored
problem of the compactness of image captioning models. We show that our
proposed model, named COMIC for COMpact Image Captioning, achieves
comparable results on five common evaluation metrics with state-of-the-art
approaches on both the MS-COCO and InstaPIC-1.1M datasets, despite having an
embedding vocabulary that is 39x - 99x smaller. The source code and models
are available at:
https://github.com/jiahuei/COMIC-Compact-Image-Captioning-with-Attention
Comment: Added source code link and new results in Table
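The scaling argument can be made concrete with a quick parameter count. The
embedding dimension and vocabulary sizes below are illustrative assumptions,
not the paper's configuration:

```python
# Why embedding matrices dominate at large vocabularies: both the
# input word embedding (V x d) and the output projection (d x V)
# grow linearly with vocabulary size V. Numbers are illustrative.

def embedding_params(vocab_size, dim):
    """Input embedding + output projection parameter count."""
    return 2 * vocab_size * dim

d = 512                                      # assumed embedding dim
full = embedding_params(10000, d)            # typical caption vocab
compact = embedding_params(10000 // 39, d)   # ~39x smaller vocab

print(full, compact, round(full / compact, 1))
# shrinking the vocabulary shrinks these matrices proportionally
```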
Cross-Lingual Transfer of Semantic Roles: From Raw Text to Semantic Roles
We describe a transfer method based on annotation projection to develop a
dependency-based semantic role labeling system for languages for which no
supervised linguistic information other than parallel data is available.
Unlike previous work that presumes the availability of supervised features
such as lemmas, part-of-speech tags, and dependency parse trees, we make use
of only word and character features. Our deep model considers
character-based representations as well as unsupervised stem embeddings to
alleviate the need for supervised features. Our model outperforms a
state-of-the-art method that uses supervised lexico-syntactic features on 6
out of 7 languages in the Universal Proposition Bank.
Comment: Accepted at the 13th International Conference on Computational
Semantics (IWCS 2019)
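As a simplified illustration of annotation projection (the sentence,
alignments, and role labels below are invented), source-side semantic role
labels can be carried to the target side through word alignments in a
parallel corpus:

```python
# Toy annotation projection for SRL: given role labels on the
# source tokens and word alignments into the target sentence, copy
# each label across its alignment link. Unaligned target tokens
# keep the "O" (no role) label. All data here is invented.

def project(src_labels, alignments, tgt_len):
    """alignments: list of (src_index, tgt_index) pairs."""
    tgt_labels = ["O"] * tgt_len
    for s, t in alignments:
        if src_labels[s] != "O":
            tgt_labels[t] = src_labels[s]
    return tgt_labels

src = ["The", "cat", "ate", "fish"]
src_labels = ["O", "A0", "PRED", "A1"]
alignments = [(1, 0), (2, 2), (3, 1)]  # hypothetical 1-to-1 links
print(project(src_labels, alignments, tgt_len=3))
# -> ['A0', 'A1', 'PRED']
```

Real projection must additionally handle noisy, many-to-many alignments,
which is part of what the paper's transfer method addresses.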
Are You Talking to a Machine? Dataset and Methods for Multilingual Image Question Answering
In this paper, we present the mQA model, which is able to answer questions
about the content of an image. The answer can be a sentence, a phrase or a
single word. Our model contains four components: a Long Short-Term Memory
(LSTM) to extract the question representation, a Convolutional Neural Network
(CNN) to extract the visual representation, an LSTM for storing the linguistic
context in an answer, and a fusing component to combine the information from
the first three components and generate the answer. We construct a Freestyle
Multilingual Image Question Answering (FM-IQA) dataset to train and evaluate
our mQA model. It contains over 150,000 images and 310,000 freestyle Chinese
question-answer pairs and their English translations. The quality of the
generated answers of our mQA model on this dataset is evaluated by human judges
through a Turing Test. Specifically, we mix the answers provided by humans
and by our model. The human judges need to distinguish our model from the
humans, and also provide a score (i.e., 0, 1, or 2; the larger the better)
indicating the quality of the answer. We propose strategies to monitor the
quality of this evaluation process. The experiments show that in 64.7% of
cases the human judges cannot distinguish our model from humans. The average
score is 1.454 (1.918 for humans). The details of this work, including the
FM-IQA dataset, can be found on the project page:
http://idl.baidu.com/FM-IQA.html
Comment: Dataset released on the project page, see
http://idl.baidu.com/FM-IQA.html ; NIPS 2015 camera ready versio
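The four-component design can be sketched with random vectors standing in
for the LSTM/CNN outputs. The dimensions and the additive fusion below are
illustrative assumptions, not the paper's exact architecture:

```python
# Sketch of mQA-style fusion: three component representations
# (question LSTM, image CNN, answer-context LSTM) are combined by
# a fusion layer, then projected to vocabulary logits to pick the
# next answer word. Weights are random; dims are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 5                    # assumed sizes

q_vec = rng.standard_normal(d)     # stand-in: question LSTM output
img_vec = rng.standard_normal(d)   # stand-in: CNN image feature
ans_vec = rng.standard_normal(d)   # stand-in: answer-context LSTM

# Fusion: project each component, sum, and apply a nonlinearity
W_q, W_i, W_a = (rng.standard_normal((d, d)) for _ in range(3))
fused = np.tanh(W_q @ q_vec + W_i @ img_vec + W_a @ ans_vec)

# Output layer: fused representation -> next-word logits
W_out = rng.standard_normal((vocab, d))
next_word = int(np.argmax(W_out @ fused))
print(fused.shape, next_word)
```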
Convolutional Neural Network with Word Embeddings for Chinese Word Segmentation
The character-based sequence labeling framework is flexible and efficient
for Chinese word segmentation (CWS). Recently, many character-based neural
models have been applied to CWS. While they obtain good performance, they
have two obvious weaknesses. The first is that they heavily rely on manually
designed bigram features, i.e., they are not good at capturing n-gram
features automatically. The second is that they make no use of full word
information. For the first weakness, we propose a convolutional neural model
that is able to capture rich n-gram features without any feature
engineering. For the second, we propose an effective approach to integrate
the proposed model with word embeddings. We evaluate the model on two
benchmark datasets: PKU and MSR. Without any feature engineering, the model
obtains competitive performance -- 95.7% on PKU and 97.3% on MSR. Armed with
word embeddings, the model achieves state-of-the-art performance on both
datasets -- 96.5% on PKU and 98.0% on MSR, without using any external
labeled resource.
Comment: will be published by IJCNLP201
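A 1-D convolution over character embeddings captures n-gram features without
hand-designed bigrams, which is the core of the first proposal. A minimal
numpy sketch (embedding size, filter width, and random weights are
illustrative, not the paper's trained model):

```python
# 1-D convolution over a character embedding sequence: each filter
# of width k responds to character k-grams, replacing hand-crafted
# bigram features. Dimensions and random weights are illustrative.

import numpy as np

rng = np.random.default_rng(1)
seq_len, d, k, n_filters = 6, 4, 2, 3  # assumed sizes

chars = rng.standard_normal((seq_len, d))  # character embeddings
filters = rng.standard_normal((n_filters, k, d))

def conv1d(x, f):
    """Valid 1-D convolution: one feature vector per k-gram window."""
    n_filt, width, _ = f.shape
    steps = len(x) - width + 1
    out = np.empty((steps, n_filt))
    for t in range(steps):
        window = x[t:t + width]            # a character k-gram
        out[t] = np.tensordot(f, window, axes=([1, 2], [0, 1]))
    return out

feats = conv1d(chars, filters)
print(feats.shape)  # one n_filters-dim feature per bigram window
```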
Listening to Chaotic Whispers: A Deep Learning Framework for News-oriented Stock Trend Prediction
Stock trend prediction plays a critical role in seeking maximized profit
from stock investment. However, precise trend prediction is very difficult
due to the highly volatile and non-stationary nature of the stock market.
Exploding information on the Internet, together with the advancing
development of natural language processing and text mining techniques, has
enabled investors to unveil market trends and volatility from online
content. Unfortunately, the quality, trustworthiness, and comprehensiveness
of online content related to the stock market vary drastically, and a large
portion consists of low-quality news, comments, or even rumors. To address
this challenge, we imitate the learning process of human beings facing such
chaotic online news, driven by three principles: sequential content
dependency, diverse influence, and effective and efficient learning. To
capture the first two principles, we design a Hybrid Attention Network to
predict the stock trend based on the sequence of recent related news.
Moreover, we apply a self-paced learning mechanism to imitate the third
principle. Extensive experiments on real-world stock market data demonstrate
the effectiveness of our approach.
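The "diverse influence" principle maps naturally onto attention: each news
item receives a learned weight before aggregation. A toy numpy sketch, with
random vectors standing in for learned news encodings (this is not the
paper's trained Hybrid Attention Network):

```python
# Toy news-level attention: score each news vector against a
# learned query, softmax the scores into weights, and aggregate
# into a single day representation. All values are random
# stand-ins for learned encodings.

import numpy as np

rng = np.random.default_rng(2)
n_news, d = 4, 6                       # assumed sizes

news_vecs = rng.standard_normal((n_news, d))  # encoded news items
query = rng.standard_normal(d)                # attention query

scores = news_vecs @ query
weights = np.exp(scores - scores.max())
weights /= weights.sum()               # softmax attention weights

day_repr = weights @ news_vecs         # weighted aggregation
print(weights.round(3), day_repr.shape)
```

Low-quality or irrelevant news items would ideally receive small weights,
which is how attention addresses the uneven trustworthiness of the content.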
Neural Responding Machine for Short-Text Conversation
We propose the Neural Responding Machine (NRM), a neural network-based
response generator for short-text conversation. NRM takes the general
encoder-decoder framework: it formalizes the generation of a response as a
decoding process based on the latent representation of the input text, while
both encoding and decoding are realized with recurrent neural networks
(RNNs). The NRM is trained with a large amount of one-round conversation
data collected from a microblogging service. An empirical study shows that
NRM can generate grammatically correct and content-wise appropriate
responses to over 75% of the input texts, outperforming state-of-the-art
models in the same setting, including retrieval-based and SMT-based models.
Comment: accepted as a full paper at ACL 201
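The encoder-decoder scheme can be sketched with a tiny vanilla RNN. The
weights are random and untrained; this only shows the dataflow (input text
to latent vector to generated tokens), not the actual NRM:

```python
# Sketch of the encoder-decoder framework: a vanilla RNN encodes
# the input token embeddings into a latent vector; a decoder RNN
# conditioned on that vector emits response tokens greedily.
# Weights are random and untrained; dims are illustrative.

import numpy as np

rng = np.random.default_rng(3)
d, vocab = 8, 6                        # assumed sizes

E = rng.standard_normal((vocab, d))    # token embeddings
W_enc, U_enc = (rng.standard_normal((d, d)) for _ in range(2))
W_dec, U_dec = (rng.standard_normal((d, d)) for _ in range(2))
W_out = rng.standard_normal((vocab, d))

def encode(tokens):
    h = np.zeros(d)                    # latent representation
    for t in tokens:
        h = np.tanh(W_enc @ E[t] + U_enc @ h)
    return h

def decode(h, steps=3):
    tok, out = 0, []                   # assume 0 is a start symbol
    for _ in range(steps):
        h = np.tanh(W_dec @ E[tok] + U_dec @ h)
        tok = int(np.argmax(W_out @ h))  # greedy next token
        out.append(tok)
    return out

latent = encode([1, 4, 2])
response = decode(latent)
print(latent.shape, response)
```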