A Survey on Semi-Supervised Learning Techniques
Semi-supervised learning is a learning paradigm concerned with how computers
and natural systems such as human beings acquire knowledge in the presence of
both labeled and unlabeled data. Semi-supervised methods are often preferred
over purely supervised or unsupervised learning because of their improved
performance in the presence of large volumes of data. Labels are hard to
obtain while unlabeled data are plentiful, so semi-supervised learning is a
sensible way to reduce human labor and improve accuracy. There has been a
large spectrum of ideas on semi-supervised learning. In this paper we bring
out some of the key approaches to semi-supervised learning.
Comment: 5 pages, 3 figures, published in the International Journal of Computer Trends and Technology (IJCTT)
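One of the classic approaches a survey like this covers is self-training: a model trained on the labeled data pseudo-labels its most confident unlabeled points and retrains. A minimal sketch, using a toy 1-D threshold "classifier" and made-up numbers rather than any method from the paper:

```python
# Self-training sketch: pseudo-label confident unlabeled points, retrain.
# The 1-D threshold model and all numbers below are toy stand-ins.

def fit_threshold(points, labels):
    """Fit a 1-D decision threshold as the midpoint of the class means."""
    pos = [x for x, y in zip(points, labels) if y == 1]
    neg = [x for x, y in zip(points, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def self_train(labeled_x, labeled_y, unlabeled_x, rounds=3, margin=1.0):
    x, y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(rounds):
        t = fit_threshold(x, y)
        # Pseudo-label only points far enough from the decision boundary.
        confident = [u for u in pool if abs(u - t) > margin]
        for u in confident:
            x.append(u)
            y.append(1 if u > t else 0)
        pool = [u for u in pool if abs(u - t) <= margin]
    return fit_threshold(x, y)

t = self_train([0.0, 10.0], [0, 1], [1.0, 2.0, 8.0, 9.0])
```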
Semi-supervised emotion lexicon expansion with label propagation and specialized word embeddings
There exist two main approaches to automatically extract affective
orientation: lexicon-based and corpus-based. In this work, we argue that these
two methods are compatible and show that combining them can improve the
accuracy of emotion classifiers. In particular, we introduce a novel variant of
the Label Propagation algorithm that is tailored to distributed word
representations, we apply batch gradient descent to accelerate the optimization
of label propagation and to make the optimization feasible for large graphs,
and we propose a reproducible method for emotion lexicon expansion. We conclude
that label propagation can expand an emotion lexicon in a meaningful way and
that the expanded emotion lexicon can be leveraged to improve the accuracy of
an emotion classifier.
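The core mechanism above, label propagation over a word-similarity graph, can be sketched in a generic iterative form (f ← αPf + (1 − α)y). This is not the paper's embedding-tailored variant, and the four-word graph and its similarity weights are made up for illustration:

```python
def label_propagation(W, seeds, alpha=0.8, iters=50):
    """Propagate seed polarity scores over a weighted similarity graph.

    W: symmetric adjacency matrix (list of lists of edge weights).
    seeds: dict {node: score in [-1, 1]} for the labeled words."""
    n = len(W)
    # Row-normalize the adjacency matrix into transition probabilities.
    P = [[w / (sum(row) or 1.0) for w in row] for row in W]
    f = [seeds.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        f = [alpha * sum(P[i][j] * f[j] for j in range(n))
             + (1 - alpha) * seeds.get(i, 0.0)
             for i in range(n)]
    return f

# "great"(+1 seed) -- "good" -- "bad" -- "awful"(-1 seed)
W = [[0, 1.0, 0, 0],
     [1.0, 0, 0.2, 0],
     [0, 0.2, 0, 1.0],
     [0, 0, 1.0, 0]]
scores = label_propagation(W, {0: 1.0, 3: -1.0})
```

The unlabeled words pick up the polarity of their neighborhood: "good" ends up positive, "bad" negative, with magnitudes decaying away from the seeds.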
Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification
In recent years great success has been achieved in sentiment classification
for English, thanks in part to the availability of copious annotated resources.
Unfortunately, most languages do not enjoy such an abundance of labeled data.
To tackle the sentiment classification problem in low-resource languages
without adequate annotated data, we propose an Adversarial Deep Averaging
Network (ADAN) to transfer the knowledge learned from labeled data on a
resource-rich source language to low-resource languages where only unlabeled
data exists. ADAN has two discriminative branches: a sentiment classifier and
an adversarial language discriminator. Both branches take input from a shared
feature extractor to learn hidden representations that are simultaneously
indicative for the classification task and invariant across languages.
Experiments on Chinese and Arabic sentiment classification demonstrate that
ADAN significantly outperforms state-of-the-art systems.
Comment: TACL journal version
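ADAN's training signal combines the two branches: the shared feature extractor minimizes the sentiment loss while maximizing the language discriminator's loss. A toy numeric sketch of one plausible form of that joint objective (L = L_sent − λ·L_lang; the paper's exact formulation may differ):

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for predicted probability p and label y."""
    return -math.log(p if y == 1 else 1.0 - p)

def adan_objective(p_sent, y_sent, p_lang, y_lang, lam=0.5):
    # Lower is better for the extractor: good sentiment predictions,
    # confused language discriminator.
    return cross_entropy(p_sent, y_sent) - lam * cross_entropy(p_lang, y_lang)

# Same sentiment confidence, but the discriminator is confused (p_lang at
# chance) in one case and confidently correct in the other.
confused = adan_objective(0.9, 1, 0.5, 1)
exposed = adan_objective(0.9, 1, 0.9, 1)
```

The objective is lower when the discriminator cannot tell the languages apart, which is exactly what pushes the shared features toward language invariance.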
Theory-Driven Automated Content Analysis of Suicidal Tweets: Using Typicality-Based Classification for LDA Dataset
This study provides a methodological framework for classifying tweets
according to variables of the Theory of Planned Behavior (TPB). We present a
sequential process of automated text analysis that combines supervised and
unsupervised approaches to enable the computer to detect one of the TPB
variables in each tweet. We conducted Latent Dirichlet Allocation (LDA) and
nearest-neighbor classification, and then assessed the "typicality" of newly
labeled tweets in order to predict the classification boundary. Furthermore,
this study reports findings from a content analysis of suicide-related tweets,
identifying traits of the information environment on Twitter. Consistent with
the extant literature on suicide coverage, the findings demonstrate that
tweets often contain information that prompts perceived behavioral control of
committing suicide, while rarely providing deterring information on suicide.
We conclude by highlighting implications for methodological advances and
empirical theory studies.
Comment: Accepted to ICA 2018:
https://convention2.allacademic.com/one/ica/ica18/index.php?program_focus=view_paper&selected_paper_id=1366376&cmd=online_program_direct_link&sub_action=online_progra
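The typicality step described above can be sketched roughly: after a nearest-neighbor labeling pass, score how typical a newly labeled tweet is of its class and keep the label only if the score clears a threshold. The 1-D "embeddings", the two TPB-style class names, and the inverse-mean-distance typicality measure below are all illustrative stand-ins for the paper's actual features and measure:

```python
def typicality(x, class_seeds):
    """Toy typicality: inverse of mean distance to the class's seed examples."""
    mean_dist = sum(abs(x - s) for s in class_seeds) / len(class_seeds)
    return 1.0 / (1.0 + mean_dist)

SEEDS = {"attitude": [0.1, 0.2], "perceived_control": [0.9, 1.0]}

def classify_if_typical(x, seeds=SEEDS, threshold=0.8):
    # 1-nearest-neighbor label, then a typicality check before accepting it.
    label = min(seeds, key=lambda c: min(abs(x - s) for s in seeds[c]))
    return label if typicality(x, seeds[label]) >= threshold else None

near = classify_if_typical(0.15)       # close to the "attitude" seeds
ambiguous = classify_if_typical(0.55)  # between the classes: rejected
```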
SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis
Recently, sentiment analysis has seen remarkable advances with the help of
pre-training approaches. However, sentiment knowledge, such as sentiment words
and aspect-sentiment pairs, is ignored in the process of pre-training, despite
the fact that they are widely used in traditional sentiment analysis
approaches. In this paper, we introduce Sentiment Knowledge Enhanced
Pre-training (SKEP) in order to learn a unified sentiment representation for
multiple sentiment analysis tasks. With the help of automatically-mined
knowledge, SKEP conducts sentiment masking and constructs three sentiment
knowledge prediction objectives, so as to embed sentiment information at the
word, polarity and aspect level into pre-trained sentiment representation. In
particular, the prediction of aspect-sentiment pairs is converted into
multi-label classification, aiming to capture the dependency between words in a
pair. Experiments on three kinds of sentiment tasks show that SKEP
significantly outperforms strong pre-training baselines and achieves new
state-of-the-art results on most of the test datasets. We release our code at
https://github.com/baidu/Senta.
Comment: Accepted by ACL 2020
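The sentiment-masking idea above can be sketched in a few lines: words found in an automatically mined lexicon (the two entries below are hypothetical) are replaced by a mask token and become prediction targets for the word and polarity objectives, instead of masking tokens uniformly at random as in standard masked language modeling:

```python
LEXICON = {"great": "positive", "terrible": "negative"}  # toy mined lexicon

def sentiment_mask(tokens, lexicon=LEXICON, mask="[MASK]"):
    """Mask sentiment words and record (word, polarity) prediction targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok in lexicon:
            masked.append(mask)
            targets[i] = (tok, lexicon[tok])  # word + polarity to predict
        else:
            masked.append(tok)
    return masked, targets

masked, targets = sentiment_mask(["the", "food", "was", "great"])
```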
DEDPUL: Difference-of-Estimated-Densities-based Positive-Unlabeled Learning
Positive-Unlabeled (PU) learning is an analog to supervised binary
classification for the case when only the positive sample is clean, while the
negative sample is contaminated with latent instances of positive class and
hence can be considered an unlabeled mixture. The objectives are to classify
the unlabeled sample and to train an unbiased PN classifier, which generally
requires identifying the mixing proportions of positives and negatives first.
Recently, the unbiased risk estimation framework has achieved state-of-the-art
performance in PU learning. This approach, however, exhibits two major
bottlenecks. First, the mixing proportions are assumed to be identified, i.e.
known in the domain or estimated with additional methods. Second, the approach
relies on the classifier being a neural network. In this paper, we propose
DEDPUL, a method that solves PU Learning without the aforementioned issues. The
mechanism behind DEDPUL is to apply a computationally cheap post-processing
procedure to the predictions of any classifier trained to distinguish positive
and unlabeled data. Instead of assuming the proportions to be identified,
DEDPUL estimates them while classifying the unlabeled sample. Experiments
show that DEDPUL outperforms the current state-of-the-art in both proportion
estimation and PU classification.
Comment: Implementation of DEDPUL and experimental data are available at
https://github.com/dimonenka/DEDPU
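The post-processing idea above can be sketched in simplified form: take the scores of any classifier trained to separate positive from unlabeled data, estimate the score densities on each sample (crude histograms here, where the paper uses kernel density estimates), and read off the positive proportion from the unlabeled-to-positive density ratio. All numbers below are toy values:

```python
def hist_density(scores, bins=4):
    """Crude histogram density estimate over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    return [c / len(scores) for c in counts]

def estimate_prior(pos_scores, unl_scores, bins=4):
    f_pos = hist_density(pos_scores, bins)
    f_unl = hist_density(unl_scores, bins)
    # Smallest ratio over bins where the positive density is non-zero.
    return min(u / p for p, u in zip(f_pos, f_unl) if p > 0)

pos = [0.8, 0.9, 0.85, 0.95]                        # labeled positives
unl = [0.1, 0.2, 0.15, 0.25, 0.8, 0.9, 0.85, 0.95]  # half positive-like
alpha = estimate_prior(pos, unl)
```

On this toy data the estimate recovers that about half of the unlabeled sample looks positive, without retraining the underlying classifier, which is what makes the post-processing computationally cheap.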
Analysis of Chinese Tourists in Japan by Text Mining of a Hotel Portal Site
With an increasingly large number of Chinese tourists in Japan, the hotel
industry is in need of an affordable market research tool that does not rely on
expensive and time-consuming surveys or interviews. Because this problem is
real and relevant to the hotel industry in Japan, and otherwise unexplored in
other studies, we extracted a list of potential keywords from Chinese reviews
of Japanese hotels on the hotel portal site Ctrip using a mathematical model,
and then used them in a sentiment analysis with a machine learning classifier.
While most studies that use information collected from the internet rely on
pre-existing data analysis tools, in our study we designed the mathematical
model to achieve the highest possible classification performance, while also
exploring the potential business implications these results may have.
Comment: arXiv admin note: substantial text overlap with arXiv:1904.11797, arXiv:1904.13213, arXiv:1904.1203
Cross-language Learning with Adversarial Neural Networks: Application to Community Question Answering
We address the problem of cross-language adaptation for question-question
similarity reranking in community question answering, with the objective to
port a system trained on one input language to another input language given
labeled training data for the first language and only unlabeled data for the
second language. In particular, we propose to use adversarial training of
neural networks to learn high-level features that are discriminative for the
main learning task, and at the same time are invariant across the input
languages. The evaluation results show sizable improvements for our
cross-language adversarial neural network (CLANN) model over a strong
non-adversarial system.
Comment: CoNLL-2017: The SIGNLL Conference on Computational Natural Language
Learning; cross-language adversarial neural network (CLANN) model;
adversarial training; cross-language adaptation; community question
answering; question-question similarity
Multiple Document Representations from News Alerts for Automated Bio-surveillance Event Detection
Due to globalization, geographic boundaries no longer serve as effective
shields for the spread of infectious diseases. In order to aid bio-surveillance
analysts in disease tracking, recent research has been devoted to developing
information retrieval and analysis methods utilizing the vast corpora of
publicly available documents on the internet. In this work, we present methods
for the automated retrieval and classification of documents related to active
public health events. We demonstrate classification performance on an
auto-generated corpus, using recurrent neural network, TF-IDF, and Naive Bayes
log count ratio document representations. By jointly modeling the title and
description of a document, we achieve 97% recall and 93.3% accuracy with our
best performing bio-surveillance event classification model: logistic
regression on the combined output from a pair of bidirectional recurrent neural
networks.
Comment: Presented at the 5th Pacific Northwest Regional NLP Workshop: NW-NLP
201
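Among the representations compared above, TF-IDF is easy to sketch end to end. Below, a nearest-centroid classifier stands in for the paper's logistic regression, and the four "news alert" snippets with their event/other labels are invented for illustration:

```python
import math
from collections import Counter

def tfidf(docs):
    """One smoothed TF-IDF dict per tokenized document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: (c / len(d)) * math.log((1 + n) / (1 + df[t]))
             for t, c in Counter(d).items()} for d in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    c = Counter()
    for v in vectors:
        for t, w in v.items():
            c[t] += w / len(vectors)
    return c

docs = [["outbreak", "of", "influenza", "reported"],
        ["new", "influenza", "cases", "confirmed"],
        ["stock", "market", "rises", "again"],
        ["market", "gains", "continue"]]
labels = ["event", "event", "other", "other"]

vecs = tfidf(docs)
cents = {lab: centroid([v for v, l in zip(vecs, labels) if l == lab])
         for lab in set(labels)}

def predict(tokens):
    q = {t: c / len(tokens) for t, c in Counter(tokens).items()}
    return max(cents, key=lambda lab: cosine(q, cents[lab]))
```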
Positive-Unlabeled Classification under Class Prior Shift and Asymmetric Error
Two bottlenecks of binary classification from positive and unlabeled data (PU
classification) are the requirements that the given unlabeled patterns be
drawn from the test marginal distribution and that the penalty of a false
positive error be identical to that of a false negative error. However, such
requirements are often not fulfilled in practice. In this paper, we generalize
PU classification to class prior shift and asymmetric error scenarios. Based
on the analysis
of the Bayes optimal classifier, we show that given a test class prior, PU
classification under class prior shift is equivalent to PU classification with
asymmetric error. Then, we propose two different frameworks to handle these
problems, namely, a risk minimization framework and a density ratio estimation
framework. Finally, we demonstrate the effectiveness of the proposed
frameworks and compare them through experiments using benchmark datasets.
Comment: 21 pages, 1 figure; added a citation of the work by Lu et al. (2018)
and regard our risk minimization framework as a special case of their work;
improved a figure and the clarity of the paper; changed the margin to a4paper
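The equivalence the abstract describes can be illustrated numerically. With asymmetric costs, the Bayes-optimal rule thresholds the posterior at c_fp / (c_fp + c_fn) rather than 1/2; shifting the class prior from π to π′ rescales the posterior odds by the prior odds ratio, which moves the effective threshold in the same way. The formulas below are the standard cost-sensitive ones, not lifted verbatim from the paper:

```python
def cost_threshold(c_fp, c_fn):
    """Posterior threshold minimizing expected misclassification cost."""
    return c_fp / (c_fp + c_fn)

def shift_posterior(p, pi, pi_new):
    """Adjust a posterior fitted under prior pi to a test prior pi_new."""
    odds = (p / (1 - p)) * (pi_new / pi) * ((1 - pi) / (1 - pi_new))
    return odds / (1 + odds)

symmetric = cost_threshold(1, 1)   # 0.5: the usual Bayes rule
costly_fn = cost_threshold(1, 4)   # 0.2: missed positives are 4x worse
shifted = shift_posterior(0.5, 0.5, 0.8)
```

A point on the old decision boundary (posterior 0.5) moves to posterior 0.8 when the positive prior rises from 0.5 to 0.8, just as making false negatives costlier lowers the threshold, so either view leads to the same reweighted decision rule.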