A Survey on Semi-Supervised Learning Techniques
Semi-supervised learning is a learning paradigm concerned with how computers
and natural systems such as human beings acquire knowledge in the presence of
both labeled and unlabeled data. Semi-supervised methods are often preferred
over purely supervised or unsupervised learning because of their improved
performance in the presence of large volumes of data. Labels are hard to
obtain while unlabeled data are plentiful, so semi-supervised learning is a
sensible way to reduce human labor and improve accuracy. There has been a
large spectrum of ideas on semi-supervised learning. In this paper we bring
out some of the key approaches to semi-supervised learning.
Comment: 5 pages, 3 figures, published in the International Journal of Computer Trends and Technology (IJCTT)
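One of the classic approaches a survey like this covers is self-training: a model trained on the labeled data pseudo-labels its most confident unlabeled points and retrains. A minimal sketch, using a toy 1-D threshold "classifier" and made-up numbers rather than any method from the paper:

```python
# Self-training sketch: pseudo-label confident unlabeled points, retrain.
# The 1-D threshold model and all numbers below are toy stand-ins.

def fit_threshold(points, labels):
    """Fit a 1-D decision threshold as the midpoint of the class means."""
    pos = [x for x, y in zip(points, labels) if y == 1]
    neg = [x for x, y in zip(points, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def self_train(labeled_x, labeled_y, unlabeled_x, rounds=3, margin=1.0):
    x, y = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(rounds):
        t = fit_threshold(x, y)
        # Pseudo-label only points far enough from the decision boundary.
        confident = [u for u in pool if abs(u - t) > margin]
        for u in confident:
            x.append(u)
            y.append(1 if u > t else 0)
        pool = [u for u in pool if abs(u - t) <= margin]
    return fit_threshold(x, y)

t = self_train([0.0, 10.0], [0, 1], [1.0, 2.0, 8.0, 9.0])
```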
Semi-supervised emotion lexicon expansion with label propagation and specialized word embeddings
There exist two main approaches to automatically extract affective
orientation: lexicon-based and corpus-based. In this work, we argue that these
two methods are compatible and show that combining them can improve the
accuracy of emotion classifiers. In particular, we introduce a novel variant of
the Label Propagation algorithm that is tailored to distributed word
representations, we apply batch gradient descent to accelerate the optimization
of label propagation and to make the optimization feasible for large graphs,
and we propose a reproducible method for emotion lexicon expansion. We conclude
that label propagation can expand an emotion lexicon in a meaningful way and
that the expanded emotion lexicon can be leveraged to improve the accuracy of
an emotion classifier.
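The core mechanism above, label propagation over a word-similarity graph, can be sketched in a generic iterative form (f ← αPf + (1 − α)y). This is not the paper's embedding-tailored variant, and the four-word graph and its similarity weights are made up for illustration:

```python
def label_propagation(W, seeds, alpha=0.8, iters=50):
    """Propagate seed polarity scores over a weighted similarity graph.

    W: symmetric adjacency matrix (list of lists of edge weights).
    seeds: dict {node: score in [-1, 1]} for the labeled words."""
    n = len(W)
    # Row-normalize the adjacency matrix into transition probabilities.
    P = [[w / (sum(row) or 1.0) for w in row] for row in W]
    f = [seeds.get(i, 0.0) for i in range(n)]
    for _ in range(iters):
        f = [alpha * sum(P[i][j] * f[j] for j in range(n))
             + (1 - alpha) * seeds.get(i, 0.0)
             for i in range(n)]
    return f

# "great"(+1 seed) -- "good" -- "bad" -- "awful"(-1 seed)
W = [[0, 1.0, 0, 0],
     [1.0, 0, 0.2, 0],
     [0, 0.2, 0, 1.0],
     [0, 0, 1.0, 0]]
scores = label_propagation(W, {0: 1.0, 3: -1.0})
```

The unlabeled words pick up the polarity of their neighborhood: "good" ends up positive, "bad" negative, with magnitudes decaying away from the seeds.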
Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification
In recent years great success has been achieved in sentiment classification
for English, thanks in part to the availability of copious annotated resources.
Unfortunately, most languages do not enjoy such an abundance of labeled data.
To tackle the sentiment classification problem in low-resource languages
without adequate annotated data, we propose an Adversarial Deep Averaging
Network (ADAN) to transfer the knowledge learned from labeled data on a
resource-rich source language to low-resource languages where only unlabeled
data exists. ADAN has two discriminative branches: a sentiment classifier and
an adversarial language discriminator. Both branches take input from a shared
feature extractor to learn hidden representations that are simultaneously
indicative for the classification task and invariant across languages.
Experiments on Chinese and Arabic sentiment classification demonstrate that
ADAN significantly outperforms state-of-the-art systems.
Comment: TACL journal version
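ADAN's training signal combines the two branches: the shared feature extractor minimizes the sentiment loss while maximizing the language discriminator's loss. A toy numeric sketch of one plausible form of that joint objective (L = L_sent − λ·L_lang; the paper's exact formulation may differ):

```python
import math

def cross_entropy(p, y):
    """Binary cross-entropy for predicted probability p and label y."""
    return -math.log(p if y == 1 else 1.0 - p)

def adan_objective(p_sent, y_sent, p_lang, y_lang, lam=0.5):
    # Lower is better for the extractor: good sentiment predictions,
    # confused language discriminator.
    return cross_entropy(p_sent, y_sent) - lam * cross_entropy(p_lang, y_lang)

# Same sentiment confidence, but the discriminator is confused (p_lang at
# chance) in one case and confidently correct in the other.
confused = adan_objective(0.9, 1, 0.5, 1)
exposed = adan_objective(0.9, 1, 0.9, 1)
```

The objective is lower when the discriminator cannot tell the languages apart, which is exactly what pushes the shared features toward language invariance.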
Theory-Driven Automated Content Analysis of Suicidal Tweets: Using Typicality-Based Classification for LDA Dataset
This study provides a methodological framework for classifying tweets
according to variables of the Theory of Planned Behavior (TPB). We present a
sequential process of automated text analysis that combines supervised and
unsupervised approaches to enable the computer to detect one of the TPB
variables in each tweet. We conducted Latent Dirichlet Allocation (LDA) and
nearest-neighbor classification, and then assessed the "typicality" of newly
labeled tweets in order to predict the classification boundary. Furthermore,
this study reports findings from a content analysis of suicide-related tweets,
identifying traits of the information environment on Twitter. Consistent with
the extant literature on suicide coverage, the findings demonstrate that
tweets often contain information that prompts perceived behavioral control of
committing suicide, while rarely providing deterring information on suicide.
We conclude by highlighting implications for methodological advances and
empirical theory studies.
Comment: Accepted to ICA 2018:
https://convention2.allacademic.com/one/ica/ica18/index.php?program_focus=view_paper&selected_paper_id=1366376&cmd=online_program_direct_link&sub_action=online_progra
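The typicality step described above can be sketched roughly: after a nearest-neighbor labeling pass, score how typical a newly labeled tweet is of its class and keep the label only if the score clears a threshold. The 1-D "embeddings", the two TPB-style class names, and the inverse-mean-distance typicality measure below are all illustrative stand-ins for the paper's actual features and measure:

```python
def typicality(x, class_seeds):
    """Toy typicality: inverse of mean distance to the class's seed examples."""
    mean_dist = sum(abs(x - s) for s in class_seeds) / len(class_seeds)
    return 1.0 / (1.0 + mean_dist)

SEEDS = {"attitude": [0.1, 0.2], "perceived_control": [0.9, 1.0]}

def classify_if_typical(x, seeds=SEEDS, threshold=0.8):
    # 1-nearest-neighbor label, then a typicality check before accepting it.
    label = min(seeds, key=lambda c: min(abs(x - s) for s in seeds[c]))
    return label if typicality(x, seeds[label]) >= threshold else None

near = classify_if_typical(0.15)       # close to the "attitude" seeds
ambiguous = classify_if_typical(0.55)  # between the classes: rejected
```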
SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis
Recently, sentiment analysis has seen remarkable advances with the help of
pre-training approaches. However, sentiment knowledge, such as sentiment words
and aspect-sentiment pairs, is ignored in the process of pre-training, despite
the fact that they are widely used in traditional sentiment analysis
approaches. In this paper, we introduce Sentiment Knowledge Enhanced
Pre-training (SKEP) in order to learn a unified sentiment representation for
multiple sentiment analysis tasks. With the help of automatically-mined
knowledge, SKEP conducts sentiment masking and constructs three sentiment
knowledge prediction objectives, so as to embed sentiment information at the
word, polarity and aspect level into pre-trained sentiment representation. In
particular, the prediction of aspect-sentiment pairs is converted into
multi-label classification, aiming to capture the dependency between words in a
pair. Experiments on three kinds of sentiment tasks show that SKEP
significantly outperforms strong pre-training baselines and achieves new
state-of-the-art results on most of the test datasets. We release our code at
https://github.com/baidu/Senta.
Comment: Accepted by ACL 2020
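The sentiment-masking idea above can be sketched in a few lines: words found in an automatically mined lexicon (the two entries below are hypothetical) are replaced by a mask token and become prediction targets for the word and polarity objectives, instead of masking tokens uniformly at random as in standard masked language modeling:

```python
LEXICON = {"great": "positive", "terrible": "negative"}  # toy mined lexicon

def sentiment_mask(tokens, lexicon=LEXICON, mask="[MASK]"):
    """Mask sentiment words and record (word, polarity) prediction targets."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if tok in lexicon:
            masked.append(mask)
            targets[i] = (tok, lexicon[tok])  # word + polarity to predict
        else:
            masked.append(tok)
    return masked, targets

masked, targets = sentiment_mask(["the", "food", "was", "great"])
```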
DEDPUL: Difference-of-Estimated-Densities-based Positive-Unlabeled Learning
Positive-Unlabeled (PU) learning is an analog to supervised binary
classification for the case when only the positive sample is clean, while the
negative sample is contaminated with latent instances of positive class and
hence can be considered an unlabeled mixture. The objectives are to classify
the unlabeled sample and to train an unbiased PN classifier, which generally
requires identifying the mixing proportions of positives and negatives first.
Recently, the unbiased risk estimation framework has achieved state-of-the-art
performance in PU learning. This approach, however, exhibits two major
bottlenecks. First, the mixing proportions are assumed to be identified, i.e.
known in the domain or estimated with additional methods. Second, the approach
relies on the classifier being a neural network. In this paper, we propose
DEDPUL, a method that solves PU Learning without the aforementioned issues. The
mechanism behind DEDPUL is to apply a computationally cheap post-processing
procedure to the predictions of any classifier trained to distinguish positive
and unlabeled data. Instead of assuming the proportions to be identified,
DEDPUL estimates them while classifying the unlabeled sample. Experiments
show that DEDPUL outperforms the current state-of-the-art in both proportion
estimation and PU classification.
Comment: Implementation of DEDPUL and experimental data are available at
https://github.com/dimonenka/DEDPU
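The post-processing idea above can be sketched in simplified form: take the scores of any classifier trained to separate positive from unlabeled data, estimate the score densities on each sample (crude histograms here, where the paper uses kernel density estimates), and read off the positive proportion from the unlabeled-to-positive density ratio. All numbers below are toy values:

```python
def hist_density(scores, bins=4):
    """Crude histogram density estimate over [0, 1]."""
    counts = [0] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    return [c / len(scores) for c in counts]

def estimate_prior(pos_scores, unl_scores, bins=4):
    f_pos = hist_density(pos_scores, bins)
    f_unl = hist_density(unl_scores, bins)
    # Smallest ratio over bins where the positive density is non-zero.
    return min(u / p for p, u in zip(f_pos, f_unl) if p > 0)

pos = [0.8, 0.9, 0.85, 0.95]                        # labeled positives
unl = [0.1, 0.2, 0.15, 0.25, 0.8, 0.9, 0.85, 0.95]  # half positive-like
alpha = estimate_prior(pos, unl)
```

On this toy data the estimate recovers that about half of the unlabeled sample looks positive, without retraining the underlying classifier, which is what makes the post-processing computationally cheap.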
Analysis of Chinese Tourists in Japan by Text Mining of a Hotel Portal Site
With an increasingly large number of Chinese tourists in Japan, the hotel
industry is in need of an affordable market research tool that does not rely on
expensive and time-consuming surveys or interviews. Because this problem is
real and relevant to the hotel industry in Japan, and otherwise unexplored in
other studies, we extracted a list of potential keywords from Chinese reviews
of Japanese hotels on the hotel portal site Ctrip using a mathematical model,
and then used them in a sentiment analysis with a machine learning classifier.
While most studies that use information collected from the internet rely on
pre-existing data analysis tools, in our study we designed the mathematical
model to achieve the highest possible classification performance, while also
exploring the potential business implications these results may have.
Comment: arXiv admin note: substantial text overlap with arXiv:1904.11797, arXiv:1904.13213, arXiv:1904.1203
Cross-language Learning with Adversarial Neural Networks: Application to Community Question Answering
We address the problem of cross-language adaptation for question-question
similarity reranking in community question answering, with the objective to
port a system trained on one input language to another input language given
labeled training data for the first language and only unlabeled data for the
second language. In particular, we propose to use adversarial training of
neural networks to learn high-level features that are discriminative for the
main learning task, and at the same time are invariant across the input
languages. The evaluation results show sizable improvements for our
cross-language adversarial neural network (CLANN) model over a strong
non-adversarial system.
Comment: CoNLL-2017: The SIGNLL Conference on Computational Natural Language
Learning; cross-language adversarial neural network (CLANN) model;
adversarial training; cross-language adaptation; community question
answering; question-question similarity
Multiple Document Representations from News Alerts for Automated Bio-surveillance Event Detection
Due to globalization, geographic boundaries no longer serve as effective
shields for the spread of infectious diseases. In order to aid bio-surveillance
analysts in disease tracking, recent research has been devoted to developing
information retrieval and analysis methods utilizing the vast corpora of
publicly available documents on the internet. In this work, we present methods
for the automated retrieval and classification of documents related to active
public health events. We demonstrate classification performance on an
auto-generated corpus, using recurrent neural network, TF-IDF, and Naive Bayes
log count ratio document representations. By jointly modeling the title and
description of a document, we achieve 97% recall and 93.3% accuracy with our
best performing bio-surveillance event classification model: logistic
regression on the combined output from a pair of bidirectional recurrent neural
networks.
Comment: Presented at the 5th Pacific Northwest Regional NLP Workshop: NW-NLP
201
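Among the representations compared above, TF-IDF is easy to sketch end to end. Below, a nearest-centroid classifier stands in for the paper's logistic regression, and the four "news alert" snippets with their event/other labels are invented for illustration:

```python
import math
from collections import Counter

def tfidf(docs):
    """One smoothed TF-IDF dict per tokenized document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: (c / len(d)) * math.log((1 + n) / (1 + df[t]))
             for t, c in Counter(d).items()} for d in docs]

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def centroid(vectors):
    c = Counter()
    for v in vectors:
        for t, w in v.items():
            c[t] += w / len(vectors)
    return c

docs = [["outbreak", "of", "influenza", "reported"],
        ["new", "influenza", "cases", "confirmed"],
        ["stock", "market", "rises", "again"],
        ["market", "gains", "continue"]]
labels = ["event", "event", "other", "other"]

vecs = tfidf(docs)
cents = {lab: centroid([v for v, l in zip(vecs, labels) if l == lab])
         for lab in set(labels)}

def predict(tokens):
    q = {t: c / len(tokens) for t, c in Counter(tokens).items()}
    return max(cents, key=lambda lab: cosine(q, cents[lab]))
```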
Positive-Unlabeled Classification under Class Prior Shift and Asymmetric Error
Two bottlenecks of binary classification from positive and unlabeled data (PU
classification) are the requirements that the given unlabeled patterns be
drawn from the test marginal distribution and that the penalty of a false
positive error be identical to that of a false negative error. However, such
requirements are often not fulfilled in practice. In this paper, we generalize
PU classification to class prior shift and asymmetric error scenarios. Based
on the analysis
of the Bayes optimal classifier, we show that given a test class prior, PU
classification under class prior shift is equivalent to PU classification with
asymmetric error. Then, we propose two different frameworks to handle these
problems, namely, a risk minimization framework and a density ratio estimation
framework. Finally, we demonstrate the effectiveness of the proposed
frameworks and compare them through experiments using benchmark datasets.
Comment: 21 pages, 1 figure; added a citation of the work by Lu et al. (2018)
and regard our risk minimization framework as a special case of their work;
improved a figure and the clarity of the paper; changed the margin to a4paper
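The equivalence the abstract describes can be illustrated numerically. With asymmetric costs, the Bayes-optimal rule thresholds the posterior at c_fp / (c_fp + c_fn) rather than 1/2; shifting the class prior from π to π′ rescales the posterior odds by the prior odds ratio, which moves the effective threshold in the same way. The formulas below are the standard cost-sensitive ones, not lifted verbatim from the paper:

```python
def cost_threshold(c_fp, c_fn):
    """Posterior threshold minimizing expected misclassification cost."""
    return c_fp / (c_fp + c_fn)

def shift_posterior(p, pi, pi_new):
    """Adjust a posterior fitted under prior pi to a test prior pi_new."""
    odds = (p / (1 - p)) * (pi_new / pi) * ((1 - pi) / (1 - pi_new))
    return odds / (1 + odds)

symmetric = cost_threshold(1, 1)   # 0.5: the usual Bayes rule
costly_fn = cost_threshold(1, 4)   # 0.2: missed positives are 4x worse
shifted = shift_posterior(0.5, 0.5, 0.8)
```

A point on the old decision boundary (posterior 0.5) moves to posterior 0.8 when the positive prior rises from 0.5 to 0.8, just as making false negatives costlier lowers the threshold, so either view leads to the same reweighted decision rule.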