3,359 research outputs found
Short Text Classification Research Based on TW-CNN
Short texts are characterized by short length and sparse features, which makes conventional classifiers less effective on them. Motivated by this, this paper extracts features at both the "topic" and "word" levels by proposing a convolutional neural network (CNN) based on topics and words, named TW-CNN. It uses Latent Dirichlet Allocation (LDA), a topic model, and word2vec to build two distinct word vector matrices, which are then taken as the inputs of two separate CNNs. After convolution and pooling, the two CNNs yield two different vector representations of the text. These representations are concatenated with the document-topic vector obtained from LDA, forming the final representation vector of the text, on which softmax classification is performed. Experiments on short news texts show that the TW-CNN model improves over traditional CNNs
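The two-branch architecture described above can be sketched in plain numpy. This is a minimal forward pass only, not the paper's implementation: all dimensions, filter sizes, and weights below are hypothetical, and the LDA/word2vec matrices are stood in for by random arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_pool(x, filters):
    """1D convolution over time with ReLU, then 1-max pooling.
    x: (seq_len, emb_dim); filters: (n_filters, window, emb_dim)."""
    n_filters, window, _ = filters.shape
    seq_len = x.shape[0]
    feats = np.empty(n_filters)
    for f in range(n_filters):
        acts = [max(0.0, float(np.sum(filters[f] * x[i:i + window])))
                for i in range(seq_len - window + 1)]
        feats[f] = max(acts)  # keep the strongest activation per filter
    return feats

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# hypothetical sizes: 12 tokens, 50-dim vectors, 10 topics, 8 filters, 4 classes
seq_len, emb_dim, n_topics, n_filters, n_classes = 12, 50, 10, 8, 4

word_mat  = rng.standard_normal((seq_len, emb_dim))  # stand-in for word2vec rows
topic_mat = rng.standard_normal((seq_len, emb_dim))  # stand-in for LDA topic-word rows
doc_topic = rng.random(n_topics)
doc_topic /= doc_topic.sum()                         # stand-in for LDA doc-topic vector

filters_w = rng.standard_normal((n_filters, 3, emb_dim)) * 0.1
filters_t = rng.standard_normal((n_filters, 3, emb_dim)) * 0.1

# concatenate the two pooled CNN outputs with the document-topic vector
rep = np.concatenate([conv_pool(word_mat, filters_w),
                      conv_pool(topic_mat, filters_t),
                      doc_topic])
W_out = rng.standard_normal((n_classes, rep.size)) * 0.1
probs = softmax(W_out @ rep)  # class distribution over the 4 classes
```

The final representation is simply the concatenation of the two pooled feature vectors and the doc-topic vector; in training, the softmax layer and filters would be learned jointly.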
Active Discriminative Text Representation Learning
We propose a new active learning (AL) method for text classification with
convolutional neural networks (CNNs). In AL, one selects the instances to be
manually labeled with the aim of maximizing model performance with minimal
effort. Neural models capitalize on word embeddings as representations
(features), tuning these to the task at hand. We argue that AL strategies for
multi-layered neural models should focus on selecting instances that most
affect the embedding space (i.e., induce discriminative word representations).
This is in contrast to traditional AL approaches (e.g., entropy-based
uncertainty sampling), which specify higher level objectives. We propose a
simple approach for sentence classification that selects instances containing
words whose embeddings are likely to be updated with the greatest magnitude,
thereby rapidly learning discriminative, task-specific embeddings. We extend
this approach to document classification by jointly considering: (1) the
expected changes to the constituent word representations; and (2) the model's
current overall uncertainty regarding the instance. The relative emphasis
placed on these criteria is governed by a stochastic process that favors
selecting instances likely to improve representations at the outset of
learning, and then shifts toward general uncertainty sampling as AL progresses.
Empirical results show that our method outperforms baseline AL approaches on
both sentence and document classification tasks. We also show that, as
expected, the method quickly learns discriminative word embeddings. To the best
of our knowledge, this is the first work on AL addressing neural models for
text classification.
Comment: This paper was accepted by AAAI 201
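The selection criterion described in the abstract can be sketched as follows. This is a simplified stand-in, not the authors' method: the paper scores expected updates to word embeddings inside a CNN, whereas this sketch approximates that with an expected-gradient-length score under a plain linear softmax model, plus the stochastic switch toward entropy-based uncertainty sampling as rounds progress. All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def expected_gradient_norm(x, W):
    """Expected magnitude of the log-loss gradient w.r.t. the representation x,
    taking the expectation over the model's own label distribution."""
    p = softmax(W @ x)
    # gradient w.r.t. x if the true label were c: W[c] - p @ W
    return sum(pc * float(np.linalg.norm(W[c] - p @ W)) for c, pc in enumerate(p))

def entropy(x, W):
    p = softmax(W @ x)
    return -float(np.sum(p * np.log(p + 1e-12)))

def select(pool, W, t, horizon):
    """Pick one unlabeled instance. Early rounds (small t) favor the
    expected-gradient criterion; later rounds favor plain uncertainty."""
    score = expected_gradient_norm if rng.random() > t / horizon else entropy
    return max(range(len(pool)), key=lambda i: score(pool[i], W))

pool = [rng.standard_normal(20) for _ in range(50)]  # fixed-length text reps
W = rng.standard_normal((3, 20)) * 0.2               # 3-class softmax weights
picked = select(pool, W, t=0, horizon=100)           # round 0: expected-gradient score
```

In the paper's setting the chosen instance would then be labeled and the model (including its embeddings) updated before the next round.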
2kenize: Tying Subword Sequences for Chinese Script Conversion
Simplified Chinese to Traditional Chinese character conversion is a common
preprocessing step in Chinese NLP. Despite this, current approaches have poor
performance because they do not take into account that a simplified Chinese
character can correspond to multiple traditional characters. Here, we propose a
model that can disambiguate between mappings and convert between the two
scripts. The model is based on subword segmentation, two language models, as
well as a method for mapping between subword sequences. We further construct
benchmark datasets for topic classification and script conversion. Our proposed
method outperforms previous Chinese character conversion approaches by 6 points
in accuracy. These results are further confirmed in a downstream application,
where 2kenize is used to convert a pretraining dataset for topic classification.
An error analysis reveals that our method's particular strengths are in dealing
with code-mixing and named entities.
Comment: Accepted to ACL 202
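The core disambiguation problem (one simplified character mapping to several traditional candidates) can be illustrated with a toy beam search over candidate conversions scored by a language model. This is only an illustration of the idea: 2kenize itself operates on subword sequences with two language models, while the mapping table and bigram scores below are tiny hypothetical examples.

```python
# Toy one-to-many mapping: simplified char -> candidate traditional chars.
# 发 is the classic ambiguous case: 髮 (hair) vs 發 (develop/emit).
CAND = {"头": ["頭"], "发": ["發", "髮"], "展": ["展"]}

# Toy bigram log-probabilities over traditional characters (hypothetical scores).
BIGRAM = {
    ("<s>", "頭"): -0.5, ("<s>", "發"): -1.0, ("<s>", "髮"): -2.0,
    ("頭", "髮"): -0.1, ("頭", "發"): -3.0,
    ("發", "展"): -0.2, ("髮", "展"): -4.0,
}
UNSEEN = -6.0  # back-off score for unseen bigrams

def convert(simplified, beam_width=5):
    """Beam search over candidate traditional spellings, highest LM score wins."""
    beams = [("<s>", 0.0, "")]  # (previous char, log-prob, output so far)
    for ch in simplified:
        expanded = []
        for prev, lp, out in beams:
            for cand in CAND.get(ch, [ch]):  # pass through unmapped chars
                step = BIGRAM.get((prev, cand), UNSEEN)
                expanded.append((cand, lp + step, out + cand))
        beams = sorted(expanded, key=lambda b: -b[1])[:beam_width]
    return beams[0][2]

print(convert("头发"))  # 頭髮 (hair), not the wrong 頭發
print(convert("发展"))  # 發展 (development)
```

The language model is what resolves the ambiguity: the same simplified 发 converts differently depending on its context.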
Data-Driven and Deep Learning Methodology for Deceptive Advertising and Phone Scams Detection
The advance of smartphones and cellular networks boosts the need for mobile
advertising and targeted marketing. However, it also introduces previously
unseen security threats. We found that phone scams using fake calling numbers
with very short lifetimes are increasingly popular and have been used to trick
users, causing harm worldwide. Meanwhile, deceptive advertising (deceptive
ads), fake ads that trick users into installing unnecessary apps via alluring
or daunting text and pictures, is an emerging threat that seriously harms the
reputation of advertisers. Against these two new threats, the conventional
blacklist (or whitelist) approach and machine learning approaches with
predefined features have proven ineffective. Building on the success of deep
learning in producing highly capable models, our system efficiently and
effectively detects phone scams and deceptive ads using a unified framework
based on deep neural networks (DNNs) and convolutional neural networks (CNNs).
The proposed system has been deployed for operational use, and experimental
results demonstrate its effectiveness. We maintain our research results and
release experimental material at http://DeceptiveAds.TWMAN.ORG and
http://PhoneScams.TWMAN.ORG, where any updates will be posted.
Comment: 6 pages, TAAI 2017 version