36 research outputs found

    Combining Textual Content and Hyperlinks in Web Spam Detection

    In this work, we tackle the problem of spam detection on the Web. Spam web pages have become a problem for Web search engines because of the negative effects this phenomenon can have on their retrieval results. Our approach is based on a random-walk algorithm that obtains a ranking of pages according to their relevance and their spam likelihood. The novelty of our approach is that it takes the textual content of web pages into account to characterize the web graph and to obtain an a priori estimate of the spam likelihood of each page. Our graph-based algorithm computes two scores for each node in the graph; intuitively, these values represent how bad or good (spam-like or not) a web page is, according to its textual content and its relations in the graph. Our experiments show that the proposed technique outperforms other link-based techniques for spam detection. Funding: Ministerio de Educación y Ciencia HUM2007-66607-C04-0.
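    The abstract does not give the propagation rule, so the sketch below only illustrates the general idea of spreading a content-based prior through the link graph with a PageRank-style random walk; it is not the paper's algorithm. The function name, the damping value, and the two-prior setup are illustrative assumptions.

```python
import numpy as np

def propagate_scores(adj, prior, damping=0.85, iters=50):
    """PageRank-style propagation of an a priori score vector over a link graph.

    adj   : list of lists, adj[i] = indices of pages that page i links to
    prior : numpy array of non-negative a priori scores (e.g. content-based
            spam or non-spam likelihoods); normalized to sum to 1 below
    """
    n = len(adj)
    prior = prior / prior.sum()
    score = prior.copy()
    for _ in range(iters):
        new = (1.0 - damping) * prior
        for i, out_links in enumerate(adj):
            if out_links:                      # distribute score over out-links
                share = damping * score[i] / len(out_links)
                for j in out_links:
                    new[j] += share
            else:                              # dangling node: return mass to the prior
                new += damping * score[i] * prior
        score = new
    return score

# Two runs give the two per-node scores described in the abstract:
# good_score = propagate_scores(adj, good_prior)   # seeded with non-spam content prior
# spam_score = propagate_scores(adj, spam_prior)   # seeded with spam content prior
```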

    Efficient computation of the Weighted Clustering Coefficient

    The clustering coefficient of an unweighted network has been extensively used to quantify how tightly connected the neighborhood around a node is, and it has been widely adopted for assessing the quality of nodes in a social network. Computing the clustering coefficient is challenging because it requires counting the number of triangles in the graph. Several recent works proposed efficient sampling, streaming and MapReduce algorithms that overcome this computational bottleneck. However, the intensity of the interaction between nodes, usually represented as weights on the edges of the graph, is also an important measure of the statistical cohesiveness of a network. Various notions of weighted clustering coefficient have recently been proposed, but all of these are hard to compute on large-scale graphs. In this work we show how standard sampling techniques can be used to obtain efficient estimators for the most commonly used measures of weighted clustering coefficient. Furthermore, we propose a novel graph-theoretic notion of clustering coefficient in weighted networks.
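    The paper's own estimators are not reproduced in the abstract; as an illustration of the sampling idea, the sketch below estimates one common weighted clustering coefficient (the Onnela-style geometric-mean definition) for a node by sampling neighbor pairs instead of enumerating all of them. The adjacency representation and sample size are illustrative assumptions.

```python
import random

def onnela_weighted_cc(adj, node):
    """Exact Onnela-style weighted clustering coefficient of one node.
    adj: dict mapping u -> {v: weight}; weights normalized by the max weight."""
    w_max = max(w for nbrs in adj.values() for w in nbrs.values())
    nbrs = list(adj[node])
    k = len(nbrs)
    if k < 2:
        return 0.0
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            u, v = nbrs[i], nbrs[j]
            if v in adj[u]:  # wedge (u, node, v) closes into a triangle
                total += (adj[node][u] * adj[node][v] * adj[u][v] / w_max**3) ** (1 / 3)
    return 2 * total / (k * (k - 1))

def sampled_weighted_cc(adj, node, samples=1000):
    """Wedge-sampling estimator: average the same per-pair contribution over
    randomly sampled neighbor pairs; unbiased for the exact value above."""
    w_max = max(w for nbrs in adj.values() for w in nbrs.values())
    nbrs = list(adj[node])
    if len(nbrs) < 2:
        return 0.0
    acc = 0.0
    for _ in range(samples):
        u, v = random.sample(nbrs, 2)
        if v in adj[u]:
            acc += (adj[node][u] * adj[node][v] * adj[u][v] / w_max**3) ** (1 / 3)
    return acc / samples
```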

    Investigating Spam Mass Variations for Detecting Web Spam

    In this paper, we investigate variations of Spam Mass for filtering web spam. First, we propose two strategies for designing new variations of the Spam Mass algorithm. We then compare the different versions of Spam Mass experimentally on the WEBSPAM-UK2006 data set. Finally, we show that the proposed strategies improve recall by up to a factor of 1.33 and precision by up to a factor of 1.02 over the original version of Spam Mass.
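    Spam Mass itself is not defined in the abstract. As background, a minimal sketch of the standard relative-mass idea (a page's PageRank minus its PageRank when the random jump is restricted to a trusted "good core", divided by its PageRank), written with networkx for brevity; the good-core input and the threshold in the comment are illustrative, and this is not the paper's variation.

```python
import networkx as nx

def relative_spam_mass(graph, good_core):
    """Relative spam mass of each page: the fraction of its PageRank that
    does not come from a trusted 'good core' of pages."""
    pr = nx.pagerank(graph, alpha=0.85)
    personalization = {node: (1.0 if node in good_core else 0.0) for node in graph}
    pr_good = nx.pagerank(graph, alpha=0.85, personalization=personalization)
    return {node: (pr[node] - pr_good[node]) / pr[node] for node in graph}

# Pages whose relative spam mass exceeds a chosen threshold are flagged as spam:
# spam = {p for p, m in relative_spam_mass(G, good_core).items() if m > 0.9}
```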

    Survey on Web Spam Detection using Link and Content Based Features

    Web spam is one of the pressing problems of search engines because it severely reduces the quality of the pages they return. Web spam also has an economic impact, since it effectively gives spammers free advertising for their data or sites on search engines and thus an increase in web traffic volume. In this paper we survey efficient spam detection techniques based on a classifier that combines new link-based features with language models. The link-based features are derived from data extracted from the web pages and from qualitative properties of the page links. The language model (LM) approach is applied to different sources of information from a web page that belong to the context of a link, in order to provide high-quality indicators of web spam. Specifically, the detection technique applies the Kullback-Leibler divergence to different combinations of these sources of information in order to characterize the relationship between two linked pages.
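    The abstract does not spell out how the divergence is computed. A minimal sketch, assuming smoothed unigram language models built from two text sources (for example, the context of a link and the content of the linked page); the whitespace tokenization and smoothing constant are illustrative choices, not taken from the surveyed work.

```python
import math
from collections import Counter

def unigram_lm(text, vocab, alpha=0.1):
    """Unigram language model over a fixed vocabulary with additive smoothing."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(source_text, target_text):
    """KL(P_source || P_target) between the unigram models of two text sources,
    e.g. link context vs. linked-page content.  Larger values indicate the two
    sources talk about different things, which is a spam indicator."""
    vocab = set(source_text.lower().split()) | set(target_text.lower().split())
    p = unigram_lm(source_text, vocab)
    q = unigram_lm(target_text, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

# Example: divergence between anchor text and the target page's title.
# kl_divergence("cheap replica watches", "university research group homepage")
```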

    Identification of Web Spam through Clustering of Website Structures

    Spam websites are domains whose owners are not interested in using them as gateways to their activities; instead, the domains are parked to be sold on the secondary market of web domains. To turn the cost of the annual registration fees into an opportunity for revenue, spam websites most often host a large number of ads in the hope that someone who lands on the site by chance clicks on one of them. Since parking has become a widespread activity, a large number of specialized companies have emerged and made parking a straightforward task that simply requires setting the domain's name servers appropriately. Although parking is a legal activity, spam websites have a deep negative impact on the information quality of the web and can significantly degrade the performance of most web mining tools. For example, these websites can influence search engine results or introduce an extra burden on crawling systems. In addition, spam websites represent a cost for ad bidders, who are obliged to pay for impressions or clicks that have a negligible probability of producing revenue. In this paper, we experimentally show that spam websites hosted by the same service provider tend to have a similar look-and-feel. Exploiting this structural similarity, we address the problem of automatically identifying spam websites. In addition, we use the outcome of the classification to compile the list of name servers used by spam websites, so that such sites can be discarded right after the first DNS query, before the first connection. A dump of our dataset (including web pages and meta information) and the corresponding manual classification is freely available upon request.
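    The paper's actual structural features are not described in the abstract. As an illustration of the general idea of comparing "look-and-feel", the sketch below fingerprints a page by shingles of its HTML tag sequence and scores pairs of pages with Jaccard similarity; the feature choice, shingle length, and threshold are hypothetical, not the authors' method.

```python
from html.parser import HTMLParser

class TagSequence(HTMLParser):
    """Collects the sequence of opening tags of a page as a rough
    structural ('look-and-feel') fingerprint."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def tag_shingles(html, n=4):
    parser = TagSequence()
    parser.feed(html)
    tags = parser.tags
    return {tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)}

def structural_similarity(html_a, html_b):
    """Jaccard similarity between the tag-shingle sets of two pages;
    pages generated from the same parking template score close to 1."""
    a, b = tag_shingles(html_a), tag_shingles(html_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Simple grouping rule: put two sites in the same cluster when their landing
# pages are structurally similar above a threshold.
# same_cluster = structural_similarity(page1_html, page2_html) > 0.8
```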

    Comprehensive Literature Review on Machine Learning Structures for Web Spam Classification

    Various Web spam features and machine learning structures have been proposed in recent years to classify Web spam. The aim of this paper was to provide a comprehensive comparison of machine learning algorithms within the Web spam detection community. Several machine learning algorithms and ensemble meta-algorithms were evaluated as classifiers, using the area under the receiver operating characteristic curve (AUC) as the performance measure, on two publicly available datasets (WEBSPAM-UK2006 and WEBSPAM-UK2007). The results show that random forest combined with variations of AdaBoost achieved an AUC of 0.937 on WEBSPAM-UK2006 and 0.852 on WEBSPAM-UK2007.
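    The abstract does not give the exact ensemble configuration. A minimal sketch of the kind of setup it names (AdaBoost over random-forest base learners, scored by ROC AUC), assuming scikit-learn with a precomputed feature matrix `X` and labels `y`; the hyperparameters, split, and the `estimator=` parameter name (recent scikit-learn releases) are assumptions, not the paper's settings.

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_auc(X, y):
    """Train an AdaBoost ensemble over random-forest base learners and
    report the area under the ROC curve on a held-out split."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
    model = AdaBoostClassifier(
        estimator=RandomForestClassifier(n_estimators=50, random_state=0),
        n_estimators=20,
        random_state=0,
    )
    model.fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]   # probability of the spam class
    return roc_auc_score(y_te, scores)
```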

    A Method for Detecting Artificial Texts Based on Computing a Measure of Membership in Invariants

    The work is devoted to the identification of texts generated automatically (artificially) with the use of software algorithms. This is an important and topical issue because such texts are spreading widely on the Internet. The generated «copies» of web pages are used to attract readers to online resources, as well as to disseminate a large number of unique copies of pages with content of a specific orientation. This article describes how the origin of a text can be determined, using as an example texts generated by synonymization, the most common method of generating artificial web content. The author defines an invariant of artificial texts as a set of values of text characteristics, which allows texts to be classified according to the way they were created. The article proposes a method for identifying artificial texts based on computing the measure of an input text's membership in the invariants, which allows a decision to be made about the origin of the text. The article also presents values obtained in a series of experiments on identifying artificial texts.
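    The abstract names neither the text characteristics nor the membership measure. The sketch below is a hypothetical illustration of the general scheme: compute a few characteristics of a text and report the fraction that fall inside the ranges of an "artificial text" invariant. The characteristics, ranges, and threshold are invented for illustration and do not come from the paper.

```python
import re

# Hypothetical invariant: for each characteristic, the range of values assumed
# to be typical of artificially generated (synonymized) texts.
ARTIFICIAL_INVARIANT = {
    "avg_word_length":     (5.2, 6.8),
    "type_token_ratio":    (0.55, 0.80),
    "avg_sentence_length": (8.0, 14.0),
}

def text_characteristics(text):
    words = re.findall(r"\w+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        "avg_word_length":     sum(len(w) for w in words) / max(len(words), 1),
        "type_token_ratio":    len(set(words)) / max(len(words), 1),
        "avg_sentence_length": len(words) / max(len(sentences), 1),
    }

def membership_measure(text, invariant=ARTIFICIAL_INVARIANT):
    """Fraction of characteristics whose values fall inside the invariant's
    ranges; values close to 1 suggest an artificially generated text."""
    chars = text_characteristics(text)
    inside = sum(lo <= chars[name] <= hi for name, (lo, hi) in invariant.items())
    return inside / len(invariant)

# Decision rule: flag a text as artificial when the measure exceeds a threshold.
# is_artificial = membership_measure(some_text) > 0.66
```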