Which Encoding is the Best for Text Classification in Chinese, English, Japanese and Korean?
This article offers an empirical study on the different ways of encoding
Chinese, Japanese, Korean (CJK) and English languages for text classification.
Different encoding levels are studied, including UTF-8 bytes, characters,
words, romanized characters and romanized words. For all encoding levels,
whenever applicable, we provide comparisons with linear models, fastText and
convolutional networks. For convolutional networks, we compare between encoding
mechanisms using character glyph images, one-hot (or one-of-n) encoding, and
embedding. In total there are 473 models, using 14 large-scale text
classification datasets in 4 languages including Chinese, English, Japanese and
Korean. Some conclusions from these results include that byte-level one-hot
encoding based on UTF-8 consistently produces competitive results for
convolutional networks, that word-level n-grams linear models are competitive
even without perfect word segmentation, and that fastText provides the best
result using character-level n-gram encoding but can overfit when the features
are overly rich.
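As a rough illustration of the byte-level one-hot encoding mentioned above, the sketch below turns a string into one 256-dimensional indicator vector per UTF-8 byte. The function name `utf8_one_hot` and the list-of-lists layout are assumptions made for illustration, not details taken from the paper.

```python
def utf8_one_hot(text):
    """Return one 256-dimensional one-hot row per UTF-8 byte of `text`.

    Hypothetical helper for illustration; the paper's exact input
    layout (padding, maximum length, etc.) is not reproduced here.
    """
    rows = []
    for byte in text.encode("utf-8"):  # iterating bytes yields ints 0..255
        row = [0] * 256
        row[byte] = 1
        rows.append(row)
    return rows


# A Latin letter takes one byte; this CJK character takes three.
encoded = utf8_one_hot("a中")
```

Because every language shares the same 256-symbol byte vocabulary, no word segmentation or romanization step is needed before encoding.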
Learning to Weight for Text Classification
In information retrieval (IR) and related tasks, term weighting approaches
typically consider the frequency of the term in the document and in the
collection in order to compute a score reflecting the importance of the term
for the document. In tasks characterized by the presence of training data (such
as text classification) it seems logical that the term weighting function
should take into account the distribution (as estimated from training data) of
the term across the classes of interest. Although 'supervised term weighting'
approaches that use this intuition have been described before, they have failed
to show consistent improvements. In this article we analyse the possible
reasons for this failure, and call consolidated assumptions into question.
Following this criticism we propose a novel supervised term weighting approach
that, instead of relying on any predefined formula, learns a term weighting
function optimised on the training set of interest; we dub this approach
Learning to Weight (LTW). The experiments that we run on several
well-known benchmarks, and using different learning methods, show that our
method outperforms previous term weighting approaches in text classification.
Comment: To appear in IEEE Transactions on Knowledge and Data Engineering
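For context, the kind of predefined, unsupervised formula that this abstract contrasts with a learned weighting function can be sketched as follows. This is one common tf-idf variant (raw term frequency times log(N / df)), chosen here as an assumption for illustration; it is not the LTW method, and the toy corpus is hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Compute tf * log(N / df) for every term of every tokenized document.

    One common unsupervised variant, shown only as a baseline;
    it is not the learned LTW function.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weighted


# Toy corpus (hypothetical data).
docs = [["cheap", "pills"], ["cheap", "flights"], ["cheap", "cheap"]]
weights = tfidf_weights(docs)
```

Note that this formula never looks at class labels: a term occurring in every document gets weight zero no matter how it is distributed across classes, which is the information a supervised weighting scheme tries to exploit.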
A no-regret generalization of hierarchical softmax to extreme multi-label classification
Extreme multi-label classification (XMLC) is a problem of tagging an instance
with a small subset of relevant labels chosen from an extremely large pool of
possible labels. Large label spaces can be efficiently handled by organizing
labels as a tree, like in the hierarchical softmax (HSM) approach commonly used
for multi-class problems. In this paper, we investigate probabilistic label
trees (PLTs) that have been recently devised for tackling XMLC problems. We
show that PLTs are a no-regret multi-label generalization of HSM when
precision@k is used as a model evaluation metric. Critically, we prove that the
pick-one-label heuristic - a reduction technique from multi-label to
multi-class that is routinely used along with HSM - is not consistent in
general. We also show that our implementation of PLTs, referred to as
extremeText (XT), obtains significantly better results than HSM with the
pick-one-label heuristic and XML-CNN, a deep network specifically designed for
XMLC problems. Moreover, XT is competitive with many state-of-the-art approaches
in terms of statistical performance, model size and prediction time, which makes
it suitable for deployment in an online system.
Comment: Accepted at NIPS 201
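The label-tree idea can be sketched as follows: a label's score is the product of conditional node probabilities along its root-to-leaf path, and precision@k is computed over the k highest-scoring labels. The tree, the probability values, and the helper names are hypothetical; this is a minimal sketch under those assumptions, not the extremeText implementation.

```python
def label_probability(leaf, parent, cond_prob):
    """Multiply conditional node probabilities on the path from `leaf` to the root."""
    prob = 1.0
    node = leaf
    while node is not None:
        prob *= cond_prob[node]
        node = parent[node]
    return prob


def precision_at_k(scores, relevant, k):
    """Fraction of the k highest-scoring labels that are relevant."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(label in relevant for label in top_k) / k


# Hypothetical tree: root -> {a, b}; a -> {l1, l2}; b -> {l3}.
parent = {"root": None, "a": "root", "b": "root", "l1": "a", "l2": "a", "l3": "b"}
# Hypothetical per-node estimates of P(node positive | parent positive, x).
cond_prob = {"root": 1.0, "a": 0.8, "b": 0.5, "l1": 0.5, "l2": 0.25, "l3": 0.9}

scores = {leaf: label_probability(leaf, parent, cond_prob) for leaf in ("l1", "l2", "l3")}
p_at_2 = precision_at_k(scores, relevant={"l1", "l2"}, k=2)
```

In this toy tree the probabilities of sibling nodes do not sum to one. That is the multi-label difference from hierarchical softmax, where each internal node distributes a total probability of one among its children.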
Investigating the Working of Text Classifiers
Text classification is one of the most widely studied tasks in natural
language processing. Motivated by the principle of compositionality, large
multilayer neural network models have been employed for this task in an attempt
to effectively utilize the constituent expressions. Almost all of the reported
work trains large networks using discriminative approaches, which come with a
caveat of no proper capacity control, as they tend to latch on to any signal
that may not generalize. Using various recent state-of-the-art approaches for
text classification, we explore whether these models actually learn to compose
the meaning of the sentences or still just focus on some keywords or lexicons
for classifying the document. To test our hypothesis, we carefully construct
datasets where the training and test splits have no direct overlap of such
lexicons, but overall language structure would be similar. We study various
text classifiers and observe that there is a big performance drop on these
datasets. Finally, we show that even simple models with our proposed
regularization techniques, which disincentivize focusing on key lexicons, can
substantially improve classification accuracy.
Comment: Proceedings of COLING 2018, the 27th International Conference on
Computational Linguistics: Technical Papers (COLING 2018), NIPS 2017 Workshop
on Deep Learning: Bridging Theory and Practice
An Empirical Evaluation of Text Representation Schemes on Multilingual Social Web to Filter the Textual Aggression
This paper attempts to study the effectiveness of text representation schemes
on two tasks, namely user aggression detection and fact detection in social
media content. In user aggression detection, the aim is to identify the level of
aggression in content generated on social media and written in
English, Devanagari Hindi and Romanized Hindi. Aggression levels are
categorized into three predefined classes, namely `Non-aggressive`, `Overtly
Aggressive`, and `Covertly Aggressive`. During disaster-related incidents,
social media such as Twitter are flooded with millions of posts. In such emergency
situations, identification of factual posts is important for organizations
involved in the relief operation. We frame this problem as a combination
of classification and ranking problems. This paper presents a comparison of
various text representation schemes based on BoW techniques, distributed
word/sentence representations, and transfer learning on classifiers. The weighted
F1-score is used as the primary evaluation metric. Results show that text
representation using BoW performs better than word embeddings on machine
learning classifiers, while pre-trained word embedding techniques perform
better on classifiers based on deep neural networks. Recent transfer learning
models such as ELMo and ULMFiT are fine-tuned for the aggression classification
task; however, their results are not on par with the pre-trained word embedding
models. Overall, word embeddings using fastText produce a better weighted
F1-score than Word2Vec and GloVe. Results are further improved using pre-trained
vector models.
Statistical significance tests are employed to ensure the significance of the
classification results. When the test dataset is lexically different from the
training dataset, deep neural models are more robust and perform
substantially better than machine learning classifiers.
Comment: 21 pages, 2 figures
Null It Out: Guarding Protected Attributes by Iterative Nullspace Projection
The ability to control for the kinds of information encoded in neural
representation has a variety of use cases, especially in light of the challenge
of interpreting these models. We present Iterative Null-space Projection
(INLP), a novel method for removing information from neural representations.
Our method is based on repeated training of linear classifiers that predict a
certain property we aim to remove, followed by projection of the
representations on their null-space. By doing so, the classifiers become
oblivious to that target property, making it hard to linearly separate the data
according to it. While applicable for multiple uses, we evaluate our method on
bias and fairness use-cases, and show that our method is able to mitigate bias
in word embeddings, as well as to increase fairness in a setting of multi-class
classification.
Comment: Accepted as a long paper in ACL 202
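The projection step at the core of this method can be sketched as follows. Given the weight vector `w` of a linear classifier, each representation `x` is replaced by `x - (x·w / w·w) w`, so that `w·x` becomes zero for every point. The sketch skips classifier training entirely: the hand-picked `directions` stand in for the weights of classifiers that the method would retrain on the projected data at each iteration, and all names and data are hypothetical.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def project_to_nullspace(x, w):
    """Remove from `x` its component along the classifier direction `w`."""
    scale = dot(x, w) / dot(w, w)
    return [xi - scale * wi for xi, wi in zip(x, w)]


def iterative_projection(points, directions):
    """Apply one nullspace projection per direction, in order.

    `directions` stands in for the weight vectors of linear classifiers
    that would be retrained on the projected data at each iteration.
    """
    for w in directions:
        points = [project_to_nullspace(x, w) for x in points]
    return points


# Hypothetical 3-d representations and two hand-picked orthogonal directions.
points = [[2.0, 1.0, 3.0], [-1.0, 4.0, 0.5]]
directions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
cleaned = iterative_projection(points, directions)
```

After the loop, every cleaned point has a zero score under each of the given directions, so a linear classifier using any of those weight vectors can no longer separate the data.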
A Content-Based Approach to Email Triage Action Prediction: Exploration and Evaluation
Email has remained a principal form of communication among people, both in
enterprise and social settings. With a deluge of emails crowding our mailboxes
daily, there is a dire need for smart email systems that can recover important
emails and make personalized recommendations. In this work, we study the
problem of predicting user triage actions to incoming emails where we take the
reply prediction as a working example. Different from existing methods, we
formulate the triage action prediction as a recommendation problem and focus on
the content-based approach, where the users are represented using the content
of current and past emails. We also introduce additional similarity features to
further explore the affinities between users and emails. Experiments on the
publicly available Avocado email collection demonstrate the advantages of our
proposed recommendation framework and our method is able to achieve better
performance compared to the state-of-the-art deep recommendation methods. More
importantly, we provide valuable insight into the effectiveness of different
textual and user representations and show that traditional bag-of-words
approaches, with help from the similarity features, compete favorably with
the more advanced neural embedding methods.
Comment: User representations, Personalization, Email response prediction,
Similarity feature
Hierarchical Neural Networks for Sequential Sentence Classification in Medical Scientific Abstracts
Prevalent models based on artificial neural networks (ANNs) for sentence
classification often classify sentences in isolation without considering the
context in which sentences appear. This hampers the application of traditional
sentence classification approaches to the problem of sequential sentence classification,
where structured prediction is needed for better overall classification
performance. In this work, we present a hierarchical sequential labeling
network to make use of the contextual information within surrounding sentences
to help classify the current sentence. Our model outperforms the
state-of-the-art results by 2%-3% on two benchmark datasets for sequential
sentence classification in medical scientific abstracts.
Comment: Accepted by EMNLP 201
Ranking-Based Autoencoder for Extreme Multi-label Classification
Extreme multi-label classification (XML) is an important yet challenging
machine learning task that assigns to each instance its most relevant
candidate labels from an extremely large label collection, where the numbers of
labels, features and instances could be in the thousands or millions. XML is
increasingly in demand in Internet industries, along with growing
business scale, scope, and data accumulation. The extremely large label
collections yield challenges such as computational complexity, inter-label
dependency and noisy labeling. Many methods have been proposed to tackle these
challenges, based on different mathematical formulations. In this paper, we
propose a deep learning XML method, with a word-vector-based self-attention,
followed by a ranking-based AutoEncoder architecture. The proposed method has
three major advantages: 1) the autoencoder simultaneously considers the
inter-label dependencies and the feature-label dependencies, by projecting
labels and features onto a common embedding space; 2) the ranking loss not only
improves the training efficiency and accuracy but also can be extended to
handle noisy labeled data; 3) the efficient attention mechanism improves
feature representation by highlighting feature importance. Experimental results
on benchmark datasets show the proposed method is competitive with
state-of-the-art methods.
Comment: Accepted by NAACL-HLT 2019 as a long paper
Semi-Supervised Multi-aspect Detection of Misinformation using Hierarchical Joint Decomposition
Distinguishing between misinformation and real information is one of the most
challenging problems in today's interconnected world. The vast majority of the
state-of-the-art in detecting misinformation is fully supervised, requiring a
large number of high-quality human annotations. However, the availability of
such annotations cannot be taken for granted, since it is very costly,
time-consuming, and challenging to do so in a way that keeps up with the
proliferation of misinformation. In this work, we are interested in exploring
scenarios where the number of annotations is limited. In such scenarios, we
investigate how tapping into a diverse set of resources that characterize a
news article, henceforth referred to as "aspects", can compensate for the lack
of labels. In particular, our contributions in this paper are twofold: 1) We
propose the use of three different aspects: article content, context of social
sharing behaviors, and host website/domain features, and 2) We introduce a
principled tensor-based embedding framework that combines all those aspects
effectively. We propose HiJoD, a 2-level decomposition pipeline that not only
outperforms state-of-the-art methods, with F1-scores of 74% and 81% on the Twitter
and Politifact datasets respectively, but is also an order of magnitude faster
than similar ensemble approaches.