Automatic Translating Between Ancient Chinese and Contemporary Chinese with Limited Aligned Corpora
The Chinese language has evolved substantially over its long history, and
native speakers now have trouble reading sentences written in ancient
Chinese. In this paper, we propose an end-to-end neural model
to automatically translate between ancient and contemporary Chinese. However,
the existing ancient-contemporary Chinese parallel corpora are not aligned at
the sentence level and sentence-aligned corpora are limited, which makes it
difficult to train the model. To build sentence-level parallel training
data for the model, we propose an unsupervised algorithm that constructs
sentence-aligned ancient-contemporary pairs by exploiting the fact that
aligned sentence pairs share many tokens. Based on the aligned corpus, we
propose an end-to-end neural model with a copy mechanism and local attention
to translate between ancient and contemporary Chinese. Experiments show that
the proposed unsupervised algorithm achieves 99.4% F1 score for sentence
alignment, and the translation model achieves 26.95 BLEU from ancient to
contemporary, and 36.34 BLEU from contemporary to ancient.
Comment: Accepted by NLPCC 201
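The shared-token criterion behind the alignment algorithm is easy to
illustrate in code. The sketch below scores candidate ancient-contemporary
pairs with character-level Jaccard overlap and greedily keeps the best match
per sentence; the scoring function, the 0.2 threshold, and the greedy
one-to-one matching are illustrative assumptions, not the paper's exact
algorithm.

```python
# Sketch: align ancient/contemporary sentences via shared tokens.
# Jaccard scoring and greedy one-to-one matching are illustrative
# assumptions, not the paper's exact procedure.

def jaccard(a: str, b: str) -> float:
    """Character-level Jaccard similarity between two sentences."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def align(ancient: list[str], contemporary: list[str],
          threshold: float = 0.2) -> list[tuple[int, int]]:
    """Greedily pair each ancient sentence with its best-overlapping
    contemporary sentence, keeping only pairs above the threshold."""
    pairs, used = [], set()
    for i, src in enumerate(ancient):
        best_j, best_s = -1, threshold
        for j, tgt in enumerate(contemporary):
            if j in used:
                continue
            s = jaccard(src, tgt)
            if s > best_s:
                best_j, best_s = j, s
        if best_j >= 0:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs

if __name__ == "__main__":
    anc = ["学而时习之，不亦说乎", "温故而知新"]
    con = ["温习旧知识从而得知新的理解", "学了又按时温习，不也很愉快吗"]
    print(align(anc, con))  # [(0, 1), (1, 0)]
```

In practice the threshold and the matching strategy (greedy versus dynamic
programming over sentence order) would be tuned on held-out aligned data.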
VulDeePecker: A Deep Learning-Based System for Vulnerability Detection
The automatic detection of software vulnerabilities is an important research
problem. However, existing solutions to this problem rely on human experts to
define features and often miss many vulnerabilities (i.e., they incur a high
false-negative rate). In this paper, we initiate the study of using deep
learning-based vulnerability detection to relieve human experts from the
tedious and subjective task of manually defining features. Since deep
learning was developed for problems that are quite different from
vulnerability detection, we need guiding principles for applying deep
learning to vulnerability detection. In particular, we need to find
representations of software programs that are suitable for deep learning. For
this purpose, we propose using code gadgets to represent programs and then
transform them into vectors, where a code gadget is a number of (not
necessarily consecutive) lines of code that are semantically related to each
other. This leads to the design and implementation of a deep learning-based
vulnerability detection system, called Vulnerability Deep Pecker
(VulDeePecker). In order to evaluate VulDeePecker, we present the first
vulnerability dataset for deep learning approaches. Experimental results show
that VulDeePecker achieves far fewer false negatives (with reasonable false
positives) than other approaches. We further apply VulDeePecker to 3 software
products (namely Xen, Seamonkey, and Libav) and detect 4 vulnerabilities that
were not reported in the National Vulnerability Database but were "silently"
patched by the vendors when releasing later versions of these products; in
contrast, these vulnerabilities are almost entirely missed by the other
vulnerability detection systems we experimented with.
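The gadget-to-vector pipeline maps naturally onto a small model. Below is a
minimal sketch, assuming a token-embedding layer feeding a bidirectional LSTM
classifier (the paper trains a BLSTM over code-gadget token vectors); the toy
tokenizer, vocabulary, and hyperparameters are assumptions made here for
illustration, not VulDeePecker's actual configuration.

```python
# Sketch: classify "code gadgets" (semantically related code lines) with a
# bidirectional LSTM, loosely following the VulDeePecker pipeline. The
# vocabulary and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn

class GadgetClassifier(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.blstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                             bidirectional=True)
        self.out = nn.Linear(2 * hidden, 1)  # vulnerable vs. not

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        h, _ = self.blstm(self.embed(token_ids))  # (batch, time, 2*hidden)
        pooled = h.mean(dim=1)                    # average over positions
        return torch.sigmoid(self.out(pooled)).squeeze(-1)

# Toy usage: one gadget, already mapped to integer token ids and padded.
vocab = {"<pad>": 0, "strcpy": 1, "(": 2, ")": 3,
         "buf": 4, "src": 5, ";": 6, ",": 7}
gadget = torch.tensor([[1, 2, 4, 7, 5, 3, 6, 0]])  # strcpy ( buf , src ) ;
model = GadgetClassifier(vocab_size=len(vocab))
print(model(gadget))  # probability that the gadget is vulnerable
```

A real pipeline would slice gadgets out of programs via data- and
control-dependence analysis and normalize identifiers before tokenization;
this sketch only shows the classification end of that pipeline.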
Lattice-Based Recurrent Neural Network Encoders for Neural Machine Translation
Neural machine translation (NMT) relies heavily on word-level modelling to
learn semantic representations of input sentences. However, for languages
without natural word delimiters (e.g., Chinese), where input sentences have
to be tokenized first, conventional NMT faces two issues: 1) it is difficult
to find an optimal tokenization granularity for source sentence modelling,
and 2) errors in 1-best tokenizations may propagate to the encoder. To handle
these issues, we propose word-lattice based Recurrent Neural Network (RNN)
encoders for NMT, which generalize the standard RNN to word-lattice topology.
The proposed encoders take as input a word lattice that compactly encodes
multiple tokenizations, and learn to generate new hidden states from
arbitrarily many inputs and hidden states at preceding time steps. As such,
the word-lattice based encoders not only alleviate the negative impact of
tokenization errors but are also more expressive and flexible in embedding
input sentences. Experimental results on Chinese-English translation
demonstrate the superiority of the proposed encoders over the conventional
encoder.
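The key generalization, a hidden state computed from arbitrarily many
predecessor states and inputs, can be sketched compactly. The code below uses
a plain tanh RNN cell and mean-pooling over incoming lattice edges; both
choices, and all names, are simplifying assumptions (the paper generalizes
GRU and LSTM cells and explores several ways of combining predecessors).

```python
# Sketch: a word-lattice RNN encoder. Each lattice node may have several
# predecessor (hidden state, word) pairs; here they are pooled by averaging
# before a vanilla tanh RNN update. Pooling and the plain cell are
# simplifying assumptions relative to the paper's lattice GRU/LSTM.
import numpy as np

def lattice_rnn(nodes, edges, embed, W_x, W_h, b):
    """nodes: topologically ordered node ids; edges: {node: [(pred, word)]}.
    Returns a hidden state for every lattice node."""
    hidden_dim = b.shape[0]
    h = {n: np.zeros(hidden_dim) for n in nodes}
    for n in nodes:
        preds = edges.get(n, [])
        if not preds:
            continue  # lattice start node keeps the zero state
        # Pool over arbitrarily many incoming (hidden, input) pairs.
        h_in = np.mean([h[p] for p, _ in preds], axis=0)
        x_in = np.mean([embed[w] for _, w in preds], axis=0)
        h[n] = np.tanh(W_x @ x_in + W_h @ h_in + b)
    return h

# Toy lattice over "中国人": node 0 -> 3 via "中国人", 0 -> 2 via "中国",
# 2 -> 3 via "人" (two tokenizations sharing start and end nodes).
rng = np.random.default_rng(0)
embed = {w: rng.standard_normal(4) for w in ["中国人", "中国", "人"]}
W_x = rng.standard_normal((5, 4))
W_h = rng.standard_normal((5, 5))
b = np.zeros(5)
edges = {2: [(0, "中国")], 3: [(0, "中国人"), (2, "人")]}
print(lattice_rnn([0, 2, 3], edges, embed, W_x, W_h, b)[3])
```

Because start and end nodes are shared across tokenizations, every
segmentation of the sentence contributes to the final node's hidden state,
which is what lets the encoder hedge against 1-best tokenization errors.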