
    Detection of opinion spam with character n-grams

    Full text link
    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-18117-2_21. In this paper we consider the detection of opinion spam as a stylistic classification task because, given a particular domain, deceptive and truthful opinions are similar in content but differ in the way they are written (style). In particular, we propose using character n-grams as features, since they have been shown to capture lexical content as well as stylistic information. We evaluated our approach on a standard corpus of 1600 hotel reviews, considering both positive and negative reviews, and compared the results obtained with character n-grams against those obtained with word n-grams. Moreover, we evaluated the effectiveness of character n-grams when the training set size is decreased, in order to simulate realistic training conditions. The results show that character n-grams are good features for the detection of opinion spam; they seem to capture the content of deceptive opinions and the writing style of the deceiver better than word n-grams. In particular, the results show an improvement of 2.3% and 2.1% over the word-based representations in the detection of positive and negative deceptive opinions, respectively. Furthermore, character n-grams make it possible to obtain good performance even with a very small training corpus: using only 25% of the training set, a Naive Bayes classifier reached F1 values of up to 0.80 for both opinion polarities.
    This work is the result of collaboration in the framework of the WIQEI IRSES project (Grant No. 269180) within the FP7 Marie Curie programme. The second author was partially supported by the LACCIR programme under project ID R1212LAC006. The work of the third author was carried out in the framework of the DIANA-APPLICATIONS "Finding Hidden Knowledge in Texts: Applications" project (TIN2012-38603-C02-01) and the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems.
    Hernández Fusilier, D.; Montes Gomez, M.; Rosso, P.; Guzmán Cabrera, R. (2015). Detection of opinion spam with character n-grams. In: Computational Linguistics and Intelligent Text Processing: 16th International Conference, CICLing 2015, Cairo, Egypt, April 14-20, 2015, Proceedings, Part II, pp. 285-294. Springer International Publishing. https://doi.org/10.1007/978-3-319-18117-2_21
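    The abstract's recipe maps almost directly onto off-the-shelf tooling. The snippet below is a minimal sketch in Python with scikit-learn, not the authors' implementation: the toy reviews and labels are invented, and the choice of the `char_wb` analyzer with character 3-grams is an assumption about one reasonable configuration.

```python
# Minimal sketch (not the authors' code): character n-gram features feeding a
# Naive Bayes classifier, roughly in the spirit of the setup described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labelled data: 1 = deceptive, 0 = truthful.
reviews = [
    "The room was spotless and the staff went above and beyond!",
    "Best hotel ever, amazing amazing amazing, you must stay here!",
    "Decent location but the carpet was worn and breakfast ended early.",
    "Absolutely perfect in every way, I will tell all my friends!",
]
labels = [0, 1, 0, 1]

# 'char_wb' builds character n-grams inside word boundaries; the exact
# analyzer and n-gram size are illustrative assumptions.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(3, 3)),
    MultinomialNB(),
)
model.fit(reviews, labels)
print(model.predict(["The pool area was great but checkout was slow."]))
```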

    Review Spam Detection Using Machine Learning Techniques

    Get PDF
    Nowadays, with the increasing popularity of the internet, online marketing is becoming more and more common, because a lot of products and services are easily available online. Hence, reviews about all these products and services are very important for customers as well as organizations. Unfortunately, driven by the desire for profit or promotion, fraudsters produce fake reviews. These fake reviews prevent customers and organizations from reaching accurate conclusions about the products. Hence, fake reviews, or review spam, must be detected and eliminated in order to prevent potential customers from being deceived. In our work, supervised and semi-supervised learning techniques have been applied to detect review spam. The most appropriate data sets in the research area of review spam detection have been used in the proposed work. For supervised learning, we try to obtain feature sets from different automated approaches, such as LIWC, POS tagging and n-grams, that can best distinguish spam from non-spam reviews. Along with these features, sentiment analysis, data mining and opinion mining techniques have also been applied. For semi-supervised learning, the PU-learning algorithm is used along with six different classifiers (Decision Tree, Naive Bayes, Support Vector Machine, k-Nearest Neighbor, Random Forest, Logistic Regression) to detect review spam in the available data set. Finally, the proposed technique is compared with some existing review spam detection techniques.
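    The semi-supervised part of the abstract refers to PU-learning. The sketch below illustrates the common iterative two-step variant of that idea in Python with scikit-learn, assuming known spam plays the role of the positive class and the unlabeled pool is repeatedly pruned of examples the classifier flags as spam. The function name `pu_learning`, the TF-IDF features, and the Naive Bayes base classifier are illustrative choices, not the work's actual pipeline.

```python
# Hedged sketch of iterative PU-learning: P = known spam, U = unlabeled reviews.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def pu_learning(P_texts, U_texts, max_iter=10):
    """Return indices of unlabeled reviews still considered reliable non-spam."""
    vec = TfidfVectorizer()
    X_all = vec.fit_transform(P_texts + U_texts)
    P = X_all[: len(P_texts)]
    U = X_all[len(P_texts):]
    keep = np.arange(U.shape[0])          # unlabeled examples not yet flagged
    for _ in range(max_iter):
        if len(keep) == 0:
            break
        # Step 1: treat the remaining unlabeled pool as negative and train.
        X = vstack([P, U[keep]])
        y = np.r_[np.ones(P.shape[0]), np.zeros(len(keep))]
        clf = MultinomialNB().fit(X, y)
        # Step 2: remove unlabeled reviews the classifier labels as spam.
        flagged = clf.predict(U[keep]) == 1
        if not flagged.any():
            break                          # converged: nothing new extracted
        keep = keep[~flagged]
    return keep

# Toy usage with made-up reviews.
spam = ["buy now best product ever five stars", "amazing deal click here best ever"]
unlabeled = ["solid build quality, arrived on time",
             "best product ever, unbelievable deal, five stars"]
print("indices kept as reliable non-spam:", pu_learning(spam, unlabeled))
```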

    Towards a Modular Ontology for Cloud Consumer Review Mining

    Get PDF
    Nowadays, online consumer reviews are used to enhance the effectiveness of finding useful product information that impacts consumers' decision-making processes. Many studies have been proposed to analyze these reviews for various purposes, such as opinion-based recommendation, spam review detection and opinion leader analysis. A standard model that represents the different aspects of an online review (review, product/service, user) is needed to facilitate the review analysis task. This research proposes SOPA, a modular ontology for cloud Service OPinion Analysis. SOPA represents the content of a product/service and the related opinions extracted from online reviews written in a specific context. SOPA is evaluated and validated using cloud consumer reviews from social media and using quality metrics. The experiments revealed that the SOPA-related modules exhibit high cohesion and low coupling, besides their usefulness and applicability in real use case studies.
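    To make the idea of a modular review ontology concrete, here is a hypothetical sketch built with rdflib in Python. The class and property names (Review, CloudService, Reviewer, Opinion, writtenBy, about, expresses) are invented for illustration and do not reflect SOPA's actual vocabulary or module structure.

```python
# Hypothetical sketch of a small review-opinion ontology module using rdflib.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

EX = Namespace("http://example.org/review-ontology#")  # placeholder namespace

g = Graph()
g.bind("ex", EX)

# Core classes of a review-mining module: the review, the reviewed service,
# the reviewer, and the opinion expressed about a service aspect.
for cls in (EX.Review, EX.CloudService, EX.Reviewer, EX.Opinion):
    g.add((cls, RDF.type, RDFS.Class))

# Object properties linking the classes.
g.add((EX.writtenBy, RDFS.domain, EX.Review))
g.add((EX.writtenBy, RDFS.range, EX.Reviewer))
g.add((EX.about, RDFS.domain, EX.Review))
g.add((EX.about, RDFS.range, EX.CloudService))
g.add((EX.expresses, RDFS.domain, EX.Review))
g.add((EX.expresses, RDFS.range, EX.Opinion))

# A toy instance: one review of a fictional cloud service.
g.add((EX.review42, RDF.type, EX.Review))
g.add((EX.review42, EX.about, EX.SomeCloudStorage))
g.add((EX.review42, RDFS.comment, Literal("Uploads are fast but support is slow.")))

print(g.serialize(format="turtle"))
```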

    Detecting Singleton Review Spammers Using Semantic Similarity

    Full text link
    Online reviews have increasingly become a very important resource for consumers when making purchases, yet it is becoming more and more difficult for people to make well-informed buying decisions without being deceived by fake reviews. Prior work on the opinion spam problem mostly considered classifying fake reviews using behavioral user patterns, focusing on prolific users who write more than a couple of reviews and discarding one-time reviewers. The number of singleton reviewers, however, is expected to be high for many review websites. While behavioral patterns are effective when dealing with elite users, for one-time reviewers the review text itself needs to be exploited. In this paper we tackle the problem of detecting fake reviews written by the same person using multiple names, posting each review under a different name. We propose two methods to detect similar reviews and show that the results generally outperform the vectorial similarity measures used in prior work. The first method extends the semantic similarity between words to the review level. The second method is based on topic modeling and exploits the similarity of the reviews' topic distributions using two models: bag-of-words and bag-of-opinion-phrases. The experiments were conducted on reviews from three different datasets: Yelp (57K reviews), Trustpilot (9K reviews) and the Ott dataset (800 reviews).
    Comment: 6 pages, WWW 201
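    The second method described above, comparing reviews through their topic distributions, can be sketched with standard components. The example below uses scikit-learn's LDA and the Jensen-Shannon distance from SciPy, with a made-up corpus, topic count, and duplicate threshold, so it illustrates the general technique rather than the paper's bag-of-opinion-phrases model.

```python
# Hedged sketch: flag near-duplicate reviews via similar LDA topic distributions.
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "great room friendly staff close to downtown",
    "friendly staff great room near downtown area",
    "battery died after two weeks very disappointed",
]

vec = CountVectorizer()
X = vec.fit_transform(corpus)
lda = LatentDirichletAllocation(n_components=5, random_state=0)
theta = lda.fit_transform(X)          # per-review topic distributions

# Reviews with close topic distributions (small JS distance) are candidate
# duplicates possibly posted by the same singleton spammer. The 0.2 threshold
# is arbitrary, chosen only for this toy example.
dist = jensenshannon(theta[0], theta[1])
print(f"JS distance between review 0 and 1: {dist:.3f}")
print("possible duplicate" if dist < 0.2 else "likely different authors")
```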

    Unfair Commercial Practices, Spam and Fake Online Reviews. The Italian Perspective and Comparative Profiles

    Get PDF
    This paper starts its analysis from Legislative Decree no. 146/2007, which incorporated Directive 2005/29/EC into the Italian Consumer Code. This Directive concerns unfair commercial practices and is useful in illustrating the practices undertaken by unscrupulous businesses against consumers. Ten years after this legislation entered into force in Italian law, the balance is still not positive, because consumers do not seem to be fully protected from devious entrepreneurial strategies designed to prevent them from taking an informed decision of a commercial nature. More specifically, in my study I analyze the gaps in the legislation, above all regarding unfair trade practices classified as spam and fake reviews (otherwise known as 'opinion spam'), against which Italian private law (unlike other legal systems) is insufficient to protect consumers.

    Detecting Sockpuppets in Deceptive Opinion Spam

    Full text link
    This paper explores the problem of sockpuppet detection in deceptive opinion spam using authorship attribution and verification approaches. Two methods are explored. The first is a feature subsampling scheme that uses the KL-divergence on stylistic language models of an author to find discriminative features. The second is a transduction scheme, spy induction, which leverages the diversity of authors in the unlabeled test set by sending a set of spies (positive samples) from the training set to retrieve hidden samples in the unlabeled test set using nearest and farthest neighbors. Experiments using ground-truth sockpuppet data show the effectiveness of the proposed schemes.
    Comment: 18 pages, accepted at CICLing 2017, 18th International Conference on Intelligent Text Processing and Computational Linguistics
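    The KL-divergence feature subsampling idea can be illustrated in a few lines of NumPy: score each vocabulary item by its contribution to the divergence between an author's feature distribution and a background distribution, then keep the highest-scoring features. The function name and toy counts below are assumptions for illustration, not the paper's actual setup.

```python
# Hedged sketch: per-feature KL contribution as a discriminativeness score.
import numpy as np

def kl_feature_scores(author_counts, background_counts, eps=1e-9):
    """Per-feature contribution to KL(author || background)."""
    p = author_counts / author_counts.sum()
    q = background_counts / background_counts.sum()
    return p * np.log((p + eps) / (q + eps))

# Toy counts over a shared vocabulary (values are made up).
author = np.array([30.0, 5.0, 1.0, 12.0, 2.0])
background = np.array([10.0, 10.0, 10.0, 10.0, 10.0])

scores = kl_feature_scores(author, background)
top_k = np.argsort(scores)[::-1][:2]   # indices of the most author-specific features
print("most discriminative feature indices:", top_k)
```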