18,051 research outputs found

    A Survey on Trust and Distrust Propagation for Web Pages

    Search engines are the hub for information retrieval from the web, but due to web spam we may not get the desired information from them. The phrase "web spam" refers to web pages designed to manipulate web search results through unacceptable tactics. Web spam pages use different techniques to achieve undeserved rankings on the web. Over the last decades, researchers have been designing techniques to identify web spam pages so that they do not deteriorate the quality of the search results. In this paper we present a survey of different web spam techniques with their underlying principles and algorithms. We have surveyed all the major spam detection techniques and provide a brief discussion of the pros and cons of the existing techniques. Finally, we summarize the various observations and underlying principles that are applied in spam detection techniques.
    Keywords: TrustRank, Anti-TrustRank, Good-Bad Rank, Spam Detection, Demotion
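    The two algorithm families named in the keywords, TrustRank and Anti-TrustRank, both reduce to a biased PageRank-style propagation from a labelled seed set. Below is a minimal sketch of the TrustRank direction; the toy graph, seed set, and parameter values are illustrative assumptions, and Anti-TrustRank is essentially the same computation run over the reversed graph from known-spam seeds.

        def trust_rank(out_links, trusted_seeds, beta=0.85, iterations=20):
            """Propagate trust from a seed set of hand-verified good pages.

            out_links: dict mapping page -> list of pages it links to.
            trusted_seeds: set of pages assumed to be non-spam.
            """
            # Collect every page that appears as a source or a target.
            pages = set(out_links) | {q for qs in out_links.values() for q in qs}
            # Teleport vector: all trust mass starts at the trusted seeds.
            d = {p: (1.0 / len(trusted_seeds) if p in trusted_seeds else 0.0)
                 for p in pages}
            t = dict(d)
            for _ in range(iterations):
                nxt = {p: (1 - beta) * d[p] for p in pages}
                for p, targets in out_links.items():
                    if targets:
                        share = beta * t[p] / len(targets)
                        for q in targets:
                            nxt[q] += share  # trust flows along out-links
                t = nxt
            return t  # low trust on a well-linked page is a spam signal

        # Toy graph: 'a' is the trusted seed; the spam pair only links to
        # itself, so very little trust mass ever reaches it.
        graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"],
                 "s1": ["s2"], "s2": ["s1"]}
        print(trust_rank(graph, trusted_seeds={"a"}))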

    CAPTBHA: Completely Automated Proof-of-Concept Test to Tell Bot and Human Apart. Implementation of a Bot Detection Technique Based on Web Navigation Behaviour in Jack-Maps

    Keywords: bot detection, web navigation behaviour, link obfuscation, Support Vector Machine, K-Nearest Neighbor, Naïve Bayes, Jack-Maps, Web 2.0, Spam 2.0
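    The keywords name three classifiers (Support Vector Machine, K-Nearest Neighbor, Naïve Bayes) applied to navigation-behaviour features. A minimal sketch of such a comparison follows; the per-session feature columns and toy data are invented for illustration and are not taken from this record.

        import numpy as np
        from sklearn.svm import SVC
        from sklearn.neighbors import KNeighborsClassifier
        from sklearn.naive_bayes import GaussianNB
        from sklearn.model_selection import cross_val_score

        # Hypothetical per-session features: [requests/min, avg dwell time (s),
        # fraction of requests hitting obfuscated trap links, pages visited].
        X = np.array([
            [120, 0.2, 0.9, 50],   # bot-like: fast, no dwell, follows traps
            [90,  0.5, 0.8, 40],
            [3,   30., 0.0, 6],    # human-like: slow, long dwell, avoids traps
            [5,   22., 0.1, 8],
            [110, 0.3, 0.7, 45],
            [4,   40., 0.0, 5],
        ])
        y = np.array([1, 1, 0, 0, 1, 0])  # 1 = bot, 0 = human

        for name, clf in [("SVM", SVC()),
                          ("k-NN", KNeighborsClassifier(n_neighbors=3)),
                          ("Naive Bayes", GaussianNB())]:
            scores = cross_val_score(clf, X, y, cv=3)
            print(f"{name}: mean accuracy {scores.mean():.2f}")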

    Survey on Web Spam Detection using Link and Content Based Features

    Web spam is one of the pressing problems for search engines because it severely reduces the quality of their results. Web spam also has an economic impact: spammers obtain a large amount of free advertising for their sites through search engines and thereby increase their web traffic. In this paper we survey efficient spam detection techniques based on a classifier that combines new link-based features with language models (LM). The link-based features relate to qualitative data extracted from the web pages and to the qualitative properties of the page links. The detection technique applies the LM approach to different sources of information from a web page that belong to the context of a link, in order to provide high-quality indicators of web spam. Specifically, the detection technique applies the Kullback-Leibler divergence to different combinations of these sources of information in order to characterize the relationship between two linked pages.
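    As a concrete illustration of the step this abstract describes, the sketch below computes the Kullback-Leibler divergence between a unigram language model of a link's source context and one of the target page. The tokenization, smoothing, and example strings are simplifications assumed here, not the surveyed paper's exact setup.

        import math
        from collections import Counter

        def language_model(text, vocab, alpha=1.0):
            """Unigram model with add-alpha smoothing over a shared vocabulary."""
            counts = Counter(text.lower().split())
            total = sum(counts.values()) + alpha * len(vocab)
            return {w: (counts[w] + alpha) / total for w in vocab}

        def kl_divergence(p, q):
            """KL(P || Q) = sum_w P(w) * log(P(w) / Q(w))."""
            return sum(p[w] * math.log(p[w] / q[w]) for w in p)

        source_ctx = "cheap replica watches buy now cheap watches"
        target_page = "history of mechanical watchmaking and horology"
        vocab = set((source_ctx + " " + target_page).lower().split())
        p = language_model(source_ctx, vocab)
        q = language_model(target_page, vocab)
        # A large divergence means the link context and the target page talk
        # about different things, which is treated as a spam indicator.
        print(f"KL(source || target) = {kl_divergence(p, q):.3f}")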

    An improved framework for content and link-based web spam detection: a combined approach

    In the modern digital era, the Web is used for searching information with different search engines (SE) as a tool. However, web spammers misuse the Web for financial benefit by using web spamming techniques to rank irrelevant and spam web pages higher than relevant pages in the search engine results pages (SERPs). Those top-ranked unrelated web pages contain insufficient or inappropriate information for the user, and web spamming techniques dramatically affect the quality of the search engine. Researchers have introduced several web spam detection techniques, such as content-based features, link-based features, label propagation, label refinement, click-based web spamming detection, and real-time web spam detection. However, identifying all spam pages on the Web with high accuracy remains unsolved. This work proposes a content-based web spam detection framework, a link-based web spam detection framework, and a combined approach that identifies both types of web spam with high accuracy and can detect the newly evolved link pyramid. The content-based framework uses three proposed and two improved content-based algorithms for web spam detection. The link-based framework initially exposes the relationship network behind the link spamming and then uses the paid-links database algorithm, the spam signals algorithm, and an improved link farms algorithm for link-based web spam identification. Finally, the combination of both content- and link-based frameworks enhances the accuracy of web spam detection. The combined approach's performance has been evaluated and compared with the J48 classifier, the C4.5 decision tree classifier, an SVM classifier, and a heuristic combined approach. Experiments were conducted to obtain the threshold values using the proposed collection architecture on the well-known datasets WEBSPAM-UK2006 and WEBSPAM-UK2007. The results show that the proposed methods outperform the other methods with 82.1% precision and an F-measure of 80.6%, illustrating the framework's effectiveness and applicability.
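    The record does not spell out the individual algorithms (paid-links database, spam signals, link farms), so the sketch below only illustrates the general shape of the combined decision step: fusing a content-based score with a link-based score against a learned threshold. All feature definitions, weights, and threshold values here are placeholders, not the authors' actual methods.

        def content_score(page):
            """Placeholder content features: keyword stuffing and thin text."""
            words = page["text"].split()
            if not words:
                return 1.0
            top_freq = max(words.count(w) for w in set(words)) / len(words)
            thin = 1.0 if len(words) < 50 else 0.0
            return 0.7 * top_freq + 0.3 * thin

        def link_score(page):
            """Placeholder link features: share of flagged-host in-links."""
            inlinks = page["inlinks"]
            if not inlinks:
                return 0.0
            flagged = sum(1 for src in inlinks if src in page["flagged_hosts"])
            return flagged / len(inlinks)

        def is_spam(page, w_content=0.5, w_link=0.5, threshold=0.45):
            # Thresholds would be learned on a training split in practice.
            combined = w_content * content_score(page) + w_link * link_score(page)
            return combined >= threshold

        page = {
            "text": "buy cheap pills " * 10,
            "inlinks": ["farm1.example", "farm2.example", "blog.example"],
            "flagged_hosts": {"farm1.example", "farm2.example"},
        }
        print(is_spam(page))  # True for this contrived spammy page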

    Addressing the new generation of spam (Spam 2.0) through Web usage models

    New Internet collaborative media introduce new ways of communicating that are not immune to abuse. A fake eye-catching profile on a social networking website, a promotional review, a response to a thread in an online forum with unsolicited content, or a manipulated wiki page are examples of the new generation of spam on the web, referred to as Web 2.0 Spam or Spam 2.0. Spam 2.0 is defined as the propagation of unsolicited, anonymous, mass content to infiltrate legitimate Web 2.0 applications. The current literature does not address Spam 2.0 in depth, and the outcomes of efforts to date are inadequate. The aim of this research is to formalise a definition for Spam 2.0 and provide Spam 2.0 filtering solutions. Early detection, extendibility, robustness, and adaptability are key factors in the design of the proposed method. This dissertation provides a comprehensive survey of state-of-the-art web spam and Spam 2.0 filtering methods to highlight the unresolved issues and open problems, while at the same time effectively capturing the knowledge in the domain of spam filtering. The dissertation proposes three solutions in the area of Spam 2.0 filtering: (1) characterising and profiling Spam 2.0, (2) an Early-Detection based Spam 2.0 Filtering (EDSF) approach, and (3) an On-the-Fly Spam 2.0 Filtering (OFSF) approach. All the proposed solutions are tested against real-world datasets and their performance is compared with that of existing Spam 2.0 filtering methods. This work has coined the term 'Spam 2.0', provided insight into the nature of Spam 2.0, and proposed filtering mechanisms to address this new and rapidly evolving problem.

    Combining Textual Content and Hyperlinks in Web Spam Detection

    In this work, we tackle the problem of spam detection on the Web. Spam web pages have become a problem for web search engines, due to the negative effects that this phenomenon can cause in their retrieval results. Our approach is based on a random-walk algorithm that obtains a ranking of pages according to their relevance and their spam likelihood. We introduce the novelty of taking into account the content of the web pages to characterize the web graph and to obtain an a priori estimation of the spam likelihood of the web pages. Our graph-based algorithm computes two scores for each node in the graph. Intuitively, these values represent how bad or good (spam-like or not) a web page is, according to its textual content and its relations in the graph. Our experiments show that our proposed technique outperforms other link-based techniques for spam detection.
    Ministerio de Educación y Ciencia HUM2007-66607-C04-0
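    A minimal sketch of the two-score idea described here: run the same content-biased random walk twice, once with a teleport vector built from an a priori "spam-like" content estimate and once with its complement. The graph, priors, and damping factor below are illustrative assumptions, not the authors' actual values.

        def biased_walk(out_links, prior, beta=0.85, iterations=30):
            """PageRank-style walk whose teleport vector is a content prior."""
            pages = set(out_links) | {q for qs in out_links.values() for q in qs}
            z = sum(prior.get(p, 0.0) for p in pages) or 1.0
            d = {p: prior.get(p, 0.0) / z for p in pages}  # normalised teleport
            s = dict(d)
            for _ in range(iterations):
                nxt = {p: (1 - beta) * d[p] for p in pages}
                for p, targets in out_links.items():
                    if targets:
                        share = beta * s[p] / len(targets)
                        for q in targets:
                            nxt[q] += share
                s = nxt
            return s

        graph = {"a": ["b"], "b": ["c"], "c": ["a"],
                 "s1": ["s2", "a"], "s2": ["s1"]}
        # Hypothetical a priori spam likelihoods estimated from page content.
        spam_prior = {"s1": 0.9, "s2": 0.8, "a": 0.1, "b": 0.1, "c": 0.1}
        good_prior = {p: 1.0 - v for p, v in spam_prior.items()}

        bad = biased_walk(graph, spam_prior)    # "how bad" score per node
        good = biased_walk(graph, good_prior)   # "how good" score per node
        for p in sorted(bad):
            print(p, "spam-like" if bad[p] > good[p] else "non-spam")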

    Spam detection with a content-based random-walk algorithm

    In this work we tackle the problem of spam detection on the Web. Spam web pages have become a problem for web search engines, due to the negative effects that this phenomenon can cause in their retrieval results. Our approach is based on a random-walk algorithm that obtains a ranking of pages according to their relevance and their spam likelihood. We introduce the novelty of taking into account the content of the web pages to characterize the web graph and to obtain an a priori estimation of the spam likelihood of the web pages. Our graph-based algorithm computes two scores for each node in the graph. Intuitively, these values represent how bad or good (spam-like or not) a web page is, according to its textual content and the relations in the graph. Our experiments show that our proposed technique outperforms other link-based techniques for spam detection.
    Ministerio de Educación y Ciencia HUM2007-66607-C04-0

    Survey on Review Spam Detection

    The proliferation of e-commerce sites has made the Web an excellent source for gathering customer reviews about products; since there is no quality control, anyone can write anything, which leads to review spam. This paper reviews the substantial research on review spam detection techniques. Further, it provides a state-of-the-art overview of previous attempts to study review spam detection.

    Web Spambot Detection Based on Web Navigation Behaviour

    Web robots have been widely used for various beneficial and malicious activities. Web spambots are a type of web robot that spreads spam content throughout the web, typically by targeting Web 2.0 applications. They are intelligently designed to replicate human behaviour in order to bypass system checks. Spam content not only wastes valuable resources but can also mislead users to unsolicited websites and award undeserved search engine rankings to spammers' campaign websites. While most research in anti-spam filtering focuses on identifying spam content on the web, only a few studies have investigated the origin of spam content; hence, the identification and detection of web spambots remains an open area of research. In this paper, we describe an automated supervised machine-learning solution that utilises web navigation behaviour to detect web spambots. We propose a new feature set (referred to as an action set) as a representation of user behaviour to differentiate web spambots from human users. Our experimental results show that our solution achieves 96.24% accuracy in classifying web spambots.
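    A minimal sketch of the action-set idea: represent each session's navigation log as a bag of actions and train a supervised classifier on it. The action names, sessions, labels, and the choice of a linear SVM are assumptions for illustration; the paper's exact feature construction is not given in this abstract.

        from collections import Counter
        from sklearn.feature_extraction import DictVectorizer
        from sklearn.svm import SVC

        # Hypothetical labelled sessions: 1 = spambot, 0 = human.
        sessions = [
            (["post_comment", "post_comment", "register", "post_comment"], 1),
            (["view_page", "search", "view_page", "post_comment"], 0),
            (["register", "post_comment", "post_comment"], 1),
            (["view_page", "view_page", "search", "view_page"], 0),
        ]

        # Each session becomes a bag (count) of actions: its "action set".
        X = [Counter(actions) for actions, _ in sessions]
        y = [label for _, label in sessions]

        vec = DictVectorizer()
        clf = SVC(kernel="linear")
        clf.fit(vec.fit_transform(X), y)

        new_session = ["register", "post_comment", "post_comment", "post_comment"]
        pred = clf.predict(vec.transform([Counter(new_session)]))
        print("spambot" if pred[0] == 1 else "human")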