3,359 research outputs found
Short Text Classification Research Based on TW-CNN
Short texts are characterized by short length and sparse features, which makes conventional classifiers less effective on them. Motivated by this, this paper extracts features at both the "topic" and "word" levels by proposing a convolutional neural network (CNN) based on topics and words, named TW-CNN. It uses Latent Dirichlet Allocation (LDA), a topic model, and word2vec to build two distinct word vector matrices, which are then taken as the inputs of two separate CNNs. After convolution and pooling, the two CNNs yield two different vector representations of the text. These representations are concatenated with the document-topic vector obtained from LDA, forming the final representation vector of the text, on which softmax classification is performed. Experiments on short news texts show that the TW-CNN model improves over traditional CNNs
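The two-branch architecture described above can be sketched in plain numpy. This is a minimal forward pass only, not the paper's implementation: all dimensions, filter sizes, and weights below are hypothetical, and the LDA/word2vec matrices are stood in for by random arrays.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_pool(x, filters):
    """1D convolution over time with ReLU, then 1-max pooling.
    x: (seq_len, emb_dim); filters: (n_filters, window, emb_dim)."""
    n_filters, window, _ = filters.shape
    seq_len = x.shape[0]
    feats = np.empty(n_filters)
    for f in range(n_filters):
        acts = [max(0.0, float(np.sum(filters[f] * x[i:i + window])))
                for i in range(seq_len - window + 1)]
        feats[f] = max(acts)  # keep the strongest activation per filter
    return feats

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# hypothetical sizes: 12 tokens, 50-dim vectors, 10 topics, 8 filters, 4 classes
seq_len, emb_dim, n_topics, n_filters, n_classes = 12, 50, 10, 8, 4

word_mat  = rng.standard_normal((seq_len, emb_dim))  # stand-in for word2vec rows
topic_mat = rng.standard_normal((seq_len, emb_dim))  # stand-in for LDA topic-word rows
doc_topic = rng.random(n_topics)
doc_topic /= doc_topic.sum()                         # stand-in for LDA doc-topic vector

filters_w = rng.standard_normal((n_filters, 3, emb_dim)) * 0.1
filters_t = rng.standard_normal((n_filters, 3, emb_dim)) * 0.1

# concatenate the two pooled CNN outputs with the document-topic vector
rep = np.concatenate([conv_pool(word_mat, filters_w),
                      conv_pool(topic_mat, filters_t),
                      doc_topic])
W_out = rng.standard_normal((n_classes, rep.size)) * 0.1
probs = softmax(W_out @ rep)  # class distribution over the 4 classes
```

The final representation is simply the concatenation of the two pooled feature vectors and the doc-topic vector; in training, the softmax layer and filters would be learned jointly.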
Active Discriminative Text Representation Learning
We propose a new active learning (AL) method for text classification with
convolutional neural networks (CNNs). In AL, one selects the instances to be
manually labeled with the aim of maximizing model performance with minimal
effort. Neural models capitalize on word embeddings as representations
(features), tuning these to the task at hand. We argue that AL strategies for
multi-layered neural models should focus on selecting instances that most
affect the embedding space (i.e., induce discriminative word representations).
This is in contrast to traditional AL approaches (e.g., entropy-based
uncertainty sampling), which specify higher level objectives. We propose a
simple approach for sentence classification that selects instances containing
words whose embeddings are likely to be updated with the greatest magnitude,
thereby rapidly learning discriminative, task-specific embeddings. We extend
this approach to document classification by jointly considering: (1) the
expected changes to the constituent word representations; and (2) the model's
current overall uncertainty regarding the instance. The relative emphasis
placed on these criteria is governed by a stochastic process that favors
selecting instances likely to improve representations at the outset of
learning, and then shifts toward general uncertainty sampling as AL progresses.
Empirical results show that our method outperforms baseline AL approaches on
both sentence and document classification tasks. We also show that, as
expected, the method quickly learns discriminative word embeddings. To the best
of our knowledge, this is the first work on AL addressing neural models for
text classification.
Comment: This paper was accepted by AAAI 201
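The selection criterion described in the abstract can be sketched as follows. This is a simplified stand-in, not the authors' method: the paper scores expected updates to word embeddings inside a CNN, whereas this sketch approximates that with an expected-gradient-length score under a plain linear softmax model, plus the stochastic switch toward entropy-based uncertainty sampling as rounds progress. All names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def expected_gradient_norm(x, W):
    """Expected magnitude of the log-loss gradient w.r.t. the representation x,
    taking the expectation over the model's own label distribution."""
    p = softmax(W @ x)
    # gradient w.r.t. x if the true label were c: W[c] - p @ W
    return sum(pc * float(np.linalg.norm(W[c] - p @ W)) for c, pc in enumerate(p))

def entropy(x, W):
    p = softmax(W @ x)
    return -float(np.sum(p * np.log(p + 1e-12)))

def select(pool, W, t, horizon):
    """Pick one unlabeled instance. Early rounds (small t) favor the
    expected-gradient criterion; later rounds favor plain uncertainty."""
    score = expected_gradient_norm if rng.random() > t / horizon else entropy
    return max(range(len(pool)), key=lambda i: score(pool[i], W))

pool = [rng.standard_normal(20) for _ in range(50)]  # fixed-length text reps
W = rng.standard_normal((3, 20)) * 0.2               # 3-class softmax weights
picked = select(pool, W, t=0, horizon=100)           # round 0: expected-gradient score
```

In the paper's setting the chosen instance would then be labeled and the model (including its embeddings) updated before the next round.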
2kenize: Tying Subword Sequences for Chinese Script Conversion
Simplified Chinese to Traditional Chinese character conversion is a common
preprocessing step in Chinese NLP. Despite this, current approaches have poor
performance because they do not take into account that a simplified Chinese
character can correspond to multiple traditional characters. Here, we propose a
model that can disambiguate between mappings and convert between the two
scripts. The model is based on subword segmentation, two language models, as
well as a method for mapping between subword sequences. We further construct
benchmark datasets for topic classification and script conversion. Our proposed
method outperforms previous Chinese character conversion approaches by 6 points
in accuracy. These results are further confirmed in a downstream application,
where 2kenize is used to convert a pretraining dataset for topic classification.
An error analysis reveals that our method's particular strengths are in dealing
with code-mixing and named entities.
Comment: Accepted to ACL 202
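The core disambiguation problem (one simplified character mapping to several traditional candidates) can be illustrated with a toy beam search over candidate conversions scored by a language model. This is only an illustration of the idea: 2kenize itself operates on subword sequences with two language models, while the mapping table and bigram scores below are tiny hypothetical examples.

```python
# Toy one-to-many mapping: simplified char -> candidate traditional chars.
# 发 is the classic ambiguous case: 髮 (hair) vs 發 (develop/emit).
CAND = {"头": ["頭"], "发": ["發", "髮"], "展": ["展"]}

# Toy bigram log-probabilities over traditional characters (hypothetical scores).
BIGRAM = {
    ("<s>", "頭"): -0.5, ("<s>", "發"): -1.0, ("<s>", "髮"): -2.0,
    ("頭", "髮"): -0.1, ("頭", "發"): -3.0,
    ("發", "展"): -0.2, ("髮", "展"): -4.0,
}
UNSEEN = -6.0  # back-off score for unseen bigrams

def convert(simplified, beam_width=5):
    """Beam search over candidate traditional spellings, highest LM score wins."""
    beams = [("<s>", 0.0, "")]  # (previous char, log-prob, output so far)
    for ch in simplified:
        expanded = []
        for prev, lp, out in beams:
            for cand in CAND.get(ch, [ch]):  # pass through unmapped chars
                step = BIGRAM.get((prev, cand), UNSEEN)
                expanded.append((cand, lp + step, out + cand))
        beams = sorted(expanded, key=lambda b: -b[1])[:beam_width]
    return beams[0][2]

print(convert("头发"))  # 頭髮 (hair), not the wrong 頭發
print(convert("发展"))  # 發展 (development)
```

The language model is what resolves the ambiguity: the same simplified 发 converts differently depending on its context.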
Data-Driven and Deep Learning Methodology for Deceptive Advertising and Phone Scams Detection
The advance of smartphones and cellular networks boosts the need for mobile
advertising and targeted marketing. However, it also introduces previously
unseen security threats. We found that phone scams using fake calling numbers
with very short lifetimes are increasingly popular and have been used to trick
users, causing harm worldwide. Meanwhile, deceptive advertising (deceptive
ads), fake ads that trick users into installing unnecessary apps via alluring
or daunting text and pictures, is an emerging threat that seriously harms the
reputation of advertisers. Against these two new threats, the conventional
blacklist (or whitelist) approach and machine learning approaches with
predefined features have proven ineffective. Building on the success of deep
learning in producing highly capable models, our system efficiently and
effectively detects phone scams and deceptive ads using a unified framework
based on deep neural networks (DNNs) and convolutional neural networks (CNNs).
The proposed system has been deployed for operational use, and experimental
results demonstrate its effectiveness. We maintain our research results and
release experimental material at http://DeceptiveAds.TWMAN.ORG and
http://PhoneScams.TWMAN.ORG, where any updates will be posted.
Comment: 6 pages, TAAI 2017 version