4,093 research outputs found

    A Survey on Semi-Supervised Learning Techniques

    Full text link
    Semisupervised learning is a learning standard which deals with the study of how computers and natural systems such as human beings acquire knowledge in the presence of both labeled and unlabeled data. Semisupervised learning based methods are preferred when compared to the supervised and unsupervised learning because of the improved performance shown by the semisupervised approaches in the presence of large volumes of data. Labels are very hard to attain while unlabeled data are surplus, therefore semisupervised learning is a noble indication to shrink human labor and improve accuracy. There has been a large spectrum of ideas on semisupervised learning. In this paper we bring out some of the key approaches for semisupervised learning.Comment: 5 Pages, 3 figures, Published with International Journal of Computer Trends and Technology (IJCTT

    Semi-supervised emotion lexicon expansion with label propagation and specialized word embeddings

    Full text link
    There exist two main approaches to automatically extract affective orientation: lexicon-based and corpus-based. In this work, we argue that these two methods are compatible and show that combining them can improve the accuracy of emotion classifiers. In particular, we introduce a novel variant of the Label Propagation algorithm that is tailored to distributed word representations, we apply batch gradient descent to accelerate the optimization of label propagation and to make the optimization feasible for large graphs, and we propose a reproducible method for emotion lexicon expansion. We conclude that label propagation can expand an emotion lexicon in a meaningful way and that the expanded emotion lexicon can be leveraged to improve the accuracy of an emotion classifier

    Adversarial Deep Averaging Networks for Cross-Lingual Sentiment Classification

    Full text link
    In recent years great success has been achieved in sentiment classification for English, thanks in part to the availability of copious annotated resources. Unfortunately, most languages do not enjoy such an abundance of labeled data. To tackle the sentiment classification problem in low-resource languages without adequate annotated data, we propose an Adversarial Deep Averaging Network (ADAN) to transfer the knowledge learned from labeled data on a resource-rich source language to low-resource languages where only unlabeled data exists. ADAN has two discriminative branches: a sentiment classifier and an adversarial language discriminator. Both branches take input from a shared feature extractor to learn hidden representations that are simultaneously indicative for the classification task and invariant across languages. Experiments on Chinese and Arabic sentiment classification demonstrate that ADAN significantly outperforms state-of-the-art systems.Comment: TACL journal versio

    Theory-Driven Automated Content Analysis of Suicidal Tweets : Using Typicality-Based Classification for LDA Dataset

    Full text link
    This study provides a methodological framework for the computer to classify tweets according to variables of the Theory of Planned Behavior. We present a sequential process of automated text analysis which combined supervised approach and unsupervised approach in order to make the computer to detect one of TPB variables in each tweet. We conducted Latent Dirichlet Allocation (LDA), Nearest Neighbor, and then assessed "typicality" of newly labeled tweets in order to predict classification boundary. Furthermore, this study reports findings from a content analysis of suicide-related tweets which identify traits of information environment in Twitter. Consistent with extant literature about suicide coverage, the findings demonstrate that tweets often contain information which prompt perceived behavior control of committing suicide, while rarely provided deterring information on suicide. We conclude by highlighting implications for methodological advances and empirical theory studies.Comment: Accepted to ICA 2018: https://convention2.allacademic.com/one/ica/ica18/index.php?program_focus=view_paper&selected_paper_id=1366376&cmd=online_program_direct_link&sub_action=online_progra

    SKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis

    Full text link
    Recently, sentiment analysis has seen remarkable advance with the help of pre-training approaches. However, sentiment knowledge, such as sentiment words and aspect-sentiment pairs, is ignored in the process of pre-training, despite the fact that they are widely used in traditional sentiment analysis approaches. In this paper, we introduce Sentiment Knowledge Enhanced Pre-training (SKEP) in order to learn a unified sentiment representation for multiple sentiment analysis tasks. With the help of automatically-mined knowledge, SKEP conducts sentiment masking and constructs three sentiment knowledge prediction objectives, so as to embed sentiment information at the word, polarity and aspect level into pre-trained sentiment representation. In particular, the prediction of aspect-sentiment pairs is converted into multi-label classification, aiming to capture the dependency between words in a pair. Experiments on three kinds of sentiment tasks show that SKEP significantly outperforms strong pre-training baseline, and achieves new state-of-the-art results on most of the test datasets. We release our code at https://github.com/baidu/Senta.Comment: Accepted by ACL202

    DEDPUL: Difference-of-Estimated-Densities-based Positive-Unlabeled Learning

    Full text link
    Positive-Unlabeled (PU) learning is an analog to supervised binary classification for the case when only the positive sample is clean, while the negative sample is contaminated with latent instances of positive class and hence can be considered as an unlabeled mixture. The objectives are to classify the unlabeled sample and train an unbiased PN classifier, which generally requires to identify the mixing proportions of positives and negatives first. Recently, unbiased risk estimation framework has achieved state-of-the-art performance in PU learning. This approach, however, exhibits two major bottlenecks. First, the mixing proportions are assumed to be identified, i.e. known in the domain or estimated with additional methods. Second, the approach relies on the classifier being a neural network. In this paper, we propose DEDPUL, a method that solves PU Learning without the aforementioned issues. The mechanism behind DEDPUL is to apply a computationally cheap post-processing procedure to the predictions of any classifier trained to distinguish positive and unlabeled data. Instead of assuming the proportions to be identified, DEDPUL estimates them alongside with classifying unlabeled sample. Experiments show that DEDPUL outperforms the current state-of-the-art in both proportion estimation and PU Classification.Comment: Implementation of DEDPUL and experimental data are available at https://github.com/dimonenka/DEDPU

    Analysis of Chinese Tourists in Japan by Text Mining of a Hotel Portal Site

    Full text link
    With an increasingly large number of Chinese tourists in Japan, the hotel industry is in need of an affordable market research tool that does not rely on expensive and time-consuming surveys or interviews. Because this problem is real and relevant to the hotel industry in Japan, and otherwise completely unexplored in other studies, we have extracted a list of potential keywords from Chinese reviews of Japanese hotels in the hotel portal site Ctrip1 using a mathematical model to then use them in a sentiment analysis with a machine learning classifier. While most studies that use information collected from the internet use pre-existing data analysis tools, in our study, we designed the mathematical model to have the highest possible performing results in classification, while also exploring on the potential business implications these may have.Comment: arXiv admin note: substantial text overlap with arXiv:1904.11797, arXiv:1904.13213, arXiv:1904.1203

    Cross-language Learning with Adversarial Neural Networks: Application to Community Question Answering

    Full text link
    We address the problem of cross-language adaptation for question-question similarity reranking in community question answering, with the objective to port a system trained on one input language to another input language given labeled training data for the first language and only unlabeled data for the second language. In particular, we propose to use adversarial training of neural networks to learn high-level features that are discriminative for the main learning task, and at the same time are invariant across the input languages. The evaluation results show sizable improvements for our cross-language adversarial neural network (CLANN) model over a strong non-adversarial system.Comment: CoNLL-2017: The SIGNLL Conference on Computational Natural Language Learning; cross-language adversarial neural network (CLANN) model; adversarial training; cross-language adaptation; community question answering; question-question similarit

    Multiple Document Representations from News Alerts for Automated Bio-surveillance Event Detection

    Full text link
    Due to globalization, geographic boundaries no longer serve as effective shields for the spread of infectious diseases. In order to aid bio-surveillance analysts in disease tracking, recent research has been devoted to developing information retrieval and analysis methods utilizing the vast corpora of publicly available documents on the internet. In this work, we present methods for the automated retrieval and classification of documents related to active public health events. We demonstrate classification performance on an auto-generated corpus, using recurrent neural network, TF-IDF, and Naive Bayes log count ratio document representations. By jointly modeling the title and description of a document, we achieve 97% recall and 93.3% accuracy with our best performing bio-surveillance event classification model: logistic regression on the combined output from a pair of bidirectional recurrent neural networks.Comment: Presented at the 5th Pacific Northwest Regional NLP Workshop: NW-NLP 201

    Positive-Unlabeled Classification under Class Prior Shift and Asymmetric Error

    Full text link
    Bottlenecks of binary classification from positive and unlabeled data (PU classification) are the requirements that given unlabeled patterns are drawn from the test marginal distribution, and the penalty of the false positive error is identical to the false negative error. However, such requirements are often not fulfilled in practice. In this paper, we generalize PU classification to the class prior shift and asymmetric error scenarios. Based on the analysis of the Bayes optimal classifier, we show that given a test class prior, PU classification under class prior shift is equivalent to PU classification with asymmetric error. Then, we propose two different frameworks to handle these problems, namely, a risk minimization framework and density ratio estimation framework. Finally, we demonstrate the effectiveness of the proposed frameworks and compare both frameworks through experiments using benchmark datasets.Comment: 21 pages, 1 figure, added a citation of the work by Lu et al. (2018) and regard our risk minimization framework as a special case of their work. Improved a figure and clarity of the paper, changed the margin to a4pape
    • …
    corecore