580 research outputs found

    Detecting Sockpuppets in Deceptive Opinion Spam

    This paper explores the problem of sockpuppet detection in deceptive opinion spam using authorship attribution and verification approaches. Two methods are explored. The first is a feature subsampling scheme that uses the KL-Divergence on stylistic language models of an author to find discriminative features. The second is a transduction scheme, spy induction, that leverages the diversity of authors in the unlabeled test set by sending a set of spies (positive samples) from the training set to retrieve hidden samples in the unlabeled test set using nearest and farthest neighbors. Experiments using ground truth sockpuppet data show the effectiveness of the proposed schemes. Comment: 18 pages, Accepted at CICLing 2017, 18th International Conference on Intelligent Text Processing and Computational Linguistic
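
    The abstract does not spell out how KL-divergence is turned into a feature ranking, so the following is only a minimal sketch of one common reading: build a smoothed unigram model for an author and one for a background corpus, then rank each token by its individual term in KL(author || background). The function names, the add-alpha smoothing, and the use of plain unigrams as the "stylistic language model" are all assumptions made for illustration, not the paper's method.

```python
import math
from collections import Counter


def unigram_model(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary (assumed smoothing)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_contributions(p, q):
    """Per-feature terms p(w) * log(p(w) / q(w)); their sum is KL(p || q)."""
    return {w: p[w] * math.log(p[w] / q[w]) for w in p}


def top_discriminative(author_tokens, background_tokens, k=3):
    """Return the k tokens with the largest contribution to KL(author || background)."""
    vocab = sorted(set(author_tokens) | set(background_tokens))
    p = unigram_model(author_tokens, vocab)
    q = unigram_model(background_tokens, vocab)
    contrib = kl_contributions(p, q)
    return sorted(contrib, key=lambda w: contrib[w], reverse=True)[:k]


if __name__ == "__main__":
    author = "honestly honestly honestly the room was great".split()
    background = "the room was great the staff was nice".split()
    print(top_discriminative(author, background, k=1))
```

    Tokens the author overuses relative to the background get large positive terms, which is why they surface first; a real system would likely use richer stylistic features than raw unigrams.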

    Detecting Positive and Negative Deceptive Opinions using PU-learning

    A large number of opinion reviews are now posted on the Web. Such reviews are a very important source of information for customers and companies. The former rely more than ever on online reviews to make their purchase decisions, and the latter to respond promptly to their clients' expectations. Unfortunately, because of the commercial interests involved, there is an increasing number of deceptive opinions, that is, fictitious opinions that have been deliberately written to sound authentic in order to deceive consumers by promoting a low-quality product (positive deceptive opinions) or criticizing a potentially good-quality one (negative deceptive opinions). In this paper we focus on the detection of both types of deceptive opinions, positive and negative. Due to the scarcity of examples of deceptive opinions, we propose to approach the detection problem with PU-learning. PU-learning is a semi-supervised technique for building a binary classifier on the basis of positive (i.e., deceptive opinions) and unlabeled examples only. Concretely, we propose a novel method that, compared with the original version, is much more conservative when selecting the negative examples (i.e., non-deceptive opinions) from the unlabeled ones. The results show that the proposed PU-learning method consistently outperformed the original PU-learning approach. In particular, they show an average improvement of 8.2% and 1.6% over the original approach in the detection of positive and negative deceptive opinions respectively. 2014 Elsevier Ltd. All rights reserved. This work is the result of the collaboration in the framework of the WIQEI IRSES project (Grant No. 269180) within the FP 7 Marie Curie. The work of the third author was in the framework of the DIANA-APPLICATIONS-Finding Hidden Knowledge in Texts: Applications (TIN2012-38603-C02-01) project, and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems. Hernández Fusilier, D.; Montes Gómez, M.; Rosso, P.; Guzmán Cabrera, R. (2015). Detecting Positive and Negative Deceptive Opinions using PU-learning. Information Processing and Management. 51(4):433-443. https://doi.org/10.1016/j.ipm.2014.11.001
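
    To make the idea of conservative negative selection concrete, here is a minimal sketch of an iterative PU-learning loop. It assumes one plausible reading of the abstract: all unlabeled reviews start as provisional negatives, and at each round only those still classified as negative are retained, so the negative set can only shrink. The tiny Naive Bayes classifier, the stopping rule, and all function names are illustrative assumptions rather than the authors' implementation.

```python
import math
from collections import Counter


def train_nb(pos_docs, neg_docs, alpha=1.0):
    """Train a two-class multinomial Naive Bayes model on tokenized documents."""
    vocab = {w for d in pos_docs + neg_docs for w in d}
    n_docs = len(pos_docs) + len(neg_docs)
    model = {}
    for label, docs in (("pos", pos_docs), ("neg", neg_docs)):
        counts = Counter(w for d in docs for w in d)
        total = sum(counts.values()) + alpha * len(vocab)
        log_prior = math.log(len(docs) / n_docs)
        log_lik = {w: math.log((counts[w] + alpha) / total) for w in vocab}
        model[label] = (log_prior, log_lik)
    return model


def predict(model, doc):
    """Return 'pos' or 'neg'; words unseen in training are ignored."""
    scores = {}
    for label, (log_prior, log_lik) in model.items():
        scores[label] = log_prior + sum(log_lik[w] for w in doc if w in log_lik)
    return max(scores, key=scores.get)


def pu_learn(positives, unlabeled, max_iter=10):
    """Iteratively shrink the provisional negative set (assumed conservative variant).

    Both `positives` and `unlabeled` are assumed to be non-empty.
    """
    negatives = list(unlabeled)  # step 0: treat every unlabeled doc as negative
    model = train_nb(positives, negatives)
    for _ in range(max_iter):
        kept = [d for d in negatives if predict(model, d) == "neg"]
        # Stop when nothing would remain or when the set has stabilized.
        if not kept or len(kept) == len(negatives):
            break
        negatives = kept
        model = train_nb(positives, negatives)
    return model, negatives


if __name__ == "__main__":
    positives = [["amazing", "best", "hotel", "ever"], ["best", "amazing", "stay", "ever"]]
    unlabeled = [
        ["amazing", "best", "ever", "hotel"],   # hidden deceptive review
        ["room", "small", "noisy", "street"],
        ["breakfast", "cold", "room", "small"],
    ]
    model, reliable_negatives = pu_learn(positives, unlabeled)
    print(len(reliable_negatives), predict(model, ["best", "stay", "ever"]))
```

    Because examples are only ever removed from the negative set, a review that looks deceptive in an early round is never readmitted as a negative, which is the sense in which this variant is conservative.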

    "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection

    Automatic fake news detection is a challenging problem in deception detection, and it has tremendous real-world political and social impacts. However, statistical approaches to combating fake news have been dramatically limited by the lack of labeled benchmark datasets. In this paper, we present LIAR: a new, publicly available dataset for fake news detection. We collected a decade-long set of 12.8K manually labeled short statements in various contexts from PolitiFact.com, which provides a detailed analysis report and links to source documents for each case. This dataset can be used for fact-checking research as well. Notably, this new dataset is an order of magnitude larger than the previously largest public fake news datasets of similar type. Empirically, we investigate automatic fake news detection based on surface-level linguistic patterns. We have designed a novel, hybrid convolutional neural network to integrate meta-data with text. We show that this hybrid approach can improve a text-only deep learning model. Comment: ACL 201

    Survey of review spam detection using machine learning techniques


    Online Deception Detection Refueled by Real World Data Collection

    The lack of large realistic datasets presents a bottleneck in online deception detection studies. In this paper, we apply a data collection method based on social network analysis to quickly identify high-quality deceptive and truthful online reviews from Amazon. The dataset contains more than 10,000 deceptive reviews and is diverse in product domains and reviewers. Using this dataset, we explore effective general features for online deception detection that perform well across domains. We demonstrate that with generalized features (advertising speak and writing complexity scores), deception detection performance can be further improved by adding deceptive reviews from assorted domains in training. Finally, reviewer-level evaluation gives an interesting insight into different deceptive reviewers' writing styles. Comment: 10 pages, Accepted to Recent Advances in Natural Language Processing (RANLP) 201

    Review Spam Detection Using Machine Learning Techniques

    With the increasing popularity of the internet, online marketing is becoming more and more popular, because many products and services are easily available online. Reviews of these products and services are therefore very important for customers as well as organizations. Unfortunately, driven by the desire for profit or promotion, fraudsters produce fake reviews. These fake reviews prevent customers and organizations from reaching accurate conclusions about the products. Hence, fake reviews, or review spam, must be detected and eliminated so that potential customers are not deceived. In our work, supervised and semi-supervised learning techniques have been applied to detect review spam. The most suitable data sets in the research area of review spam detection have been used in the proposed work. For supervised learning, we try to obtain feature sets from different automated approaches, such as LIWC, POS tagging, and n-grams, that can best distinguish spam from non-spam reviews. Along with these features, sentiment analysis, data mining, and opinion mining techniques have also been applied. For semi-supervised learning, a PU-learning algorithm is used along with six different classifiers (Decision Tree, Naive Bayes, Support Vector Machine, k-Nearest Neighbor, Random Forest, Logistic Regression) to detect review spam in the available data set. Finally, the proposed technique is compared with some existing review spam detection techniques.

    Fake Opinion Detection: How Similar are Crowdsourced Datasets to Real Data?

    Identifying deceptive online reviews is a challenging task for Natural Language Processing (NLP). Collecting corpora for the task is difficult, because normally it is not possible to know whether reviews are genuine. A common workaround involves collecting (supposedly) truthful reviews online and adding them to a set of deceptive reviews obtained through crowdsourcing services. Models trained this way are generally successful at discriminating between 'genuine' online reviews and the crowdsourced deceptive reviews. It has been argued that the deceptive reviews obtained via crowdsourcing are very different from real fake reviews, but the claim has never been properly tested. In this paper, we compare (false) crowdsourced reviews with a set of 'real' fake reviews published online. We evaluate their degree of similarity and their usefulness in training models for the detection of untrustworthy reviews. We find that the deceptive reviews collected via crowdsourcing are significantly different from the fake reviews published online. In the case of the artificially produced deceptive texts, it turns out that their domain similarity with the targets affects the models' performance much more than their untruthfulness. This suggests that the use of crowdsourced datasets for opinion spam detection may not result in models applicable to the real task of detecting deceptive reviews. As an alternative way to create large datasets for the fake review detection task, we propose methods based on the probabilistic annotation of unlabeled texts, relying on the meta-information generally available on e-commerce sites. Such methods are independent of the content of the reviews and allow reliable models for the detection of fake reviews to be trained. Leticia Cagnina thanks CONICET for the continued financial support. This work was funded by MINECO/FEDER (Grant No. SomEMBED TIN2015-71147-C2-1-P). 
The work of Paolo Rosso was partially funded by the MISMIS-FAKEnHATE Spanish MICINN research project (PGC2018-096212-B-C31). Massimo Poesio was in part supported by the UK Economic and Social Research Council (Grant Number ES/M010236/1). Fornaciari, T.; Cagnina, L.; Rosso, P.; Poesio, M. (2020). Fake Opinion Detection: How Similar are Crowdsourced Datasets to Real Data? Language Resources and Evaluation. 54(4):1019-1058. https://doi.org/10.1007/s10579-020-09486-5

    Fake Review Detection using Data Mining

    Online spam reviews are deceptive evaluations of products and services. They are often written as a deliberate manipulation strategy to deceive readers. Recognizing such reviews is an important but challenging problem. In this work, I try to solve this problem by using different data mining techniques, and I explore the strengths and weaknesses of those techniques in detecting fake reviews. I start with different supervised techniques such as Support Vector Machine (SVM), Multinomial Naive Bayes (MNB), and Multilayer Perceptron. The results attest that all of these supervised techniques can successfully detect fake reviews with more than 86% accuracy. Then, I work on a semi-supervised technique which reduces the dimensionality of the input feature vector but offers similar performance to existing approaches. I use a combination of topic modeling and SVM to implement the semi-supervised technique. I also compare the results with other approaches that consider all the words of a dataset as input features. I found that topic words are enough as input features to achieve accuracy similar to that of approaches that use all the words as input features. Finally, I propose an unsupervised learning approach named Words Basket Analysis for fake review detection. I use five Amazon product review datasets for an experiment and report the performance of the proposed approach on these datasets.