38 research outputs found

    PageRank optimization applied to spam detection

    Full text link
    We give a new link spam detection and PageRank demotion algorithm called MaxRank. Like TrustRank and AntiTrustRank, it starts with a seed of hand-picked trusted and spam pages. We define the MaxRank of a page as the frequency with which the page is visited by a random surfer minimizing an average cost per time unit. On a given page, the random surfer selects a set of hyperlinks and clicks with uniform probability on any of these hyperlinks. The cost function penalizes spam pages and hyperlink removals. The goal is to determine a hyperlink deletion policy that minimizes this score. The MaxRank is interpreted as a modified PageRank vector, used to sort web pages instead of the usual PageRank vector. The bias vector of this ergodic control problem, which is unique up to an additive constant, is a measure of the "spamicity" of each page and is used to detect spam pages. We give a scalable algorithm for MaxRank computation that allowed us to run experiments on the WEBSPAM-UK2007 dataset. We show that our algorithm outperforms both TrustRank and AntiTrustRank for spam and nonspam page detection.
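
    To make the ergodic (average-cost) control formulation concrete, here is a minimal sketch, not the paper's MaxRank implementation: relative value iteration on a toy web graph where the controller chooses which hyperlinks to keep on each page, with a cost that penalizes spam pages and link removals. The graph, costs and penalty values are illustrative assumptions.

```python
"""
Minimal sketch (not the paper's MaxRank code): relative value iteration
for an average-cost control problem on a toy web graph.  On each page the
surfer chooses a nonempty subset of its hyperlinks to keep and then clicks
uniformly among the kept links; the stage cost penalizes landing on spam
pages and removing links.  All numbers below are assumptions.
"""
from itertools import combinations
import numpy as np

# Toy graph: adjacency lists (assumed); page 3 plays the role of a spam page.
outlinks = {0: [1, 2, 3], 1: [0, 3], 2: [0, 1], 3: [0]}
spam_cost = {0: 0.0, 1: 0.0, 2: 0.0, 3: 10.0}   # cost of visiting each page
removal_penalty = 0.5                            # cost per deleted hyperlink
n = len(outlinks)

def actions(page):
    """All nonempty subsets of the page's outlinks (the links kept)."""
    links = outlinks[page]
    for k in range(1, len(links) + 1):
        for subset in combinations(links, k):
            yield subset

# Relative value iteration for the average-cost criterion.
h = np.zeros(n)                 # relative value (bias) estimates
for _ in range(2000):
    h_new = np.empty(n)
    for s in range(n):
        best = np.inf
        for kept in actions(s):
            removed = len(outlinks[s]) - len(kept)
            stage = spam_cost[s] + removal_penalty * removed
            future = np.mean([h[t] for t in kept])   # uniform click on kept links
            best = min(best, stage + future)
        h_new[s] = best
    gain = h_new[0]             # average-cost estimate (reference state 0)
    h = h_new - gain            # keep values bounded (relative VI)

print("estimated average cost per step:", gain)
print("bias ('spamicity'-like) vector:", h)
```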

    Detecting search spam using a link-similarity measure for web pages

    Get PDF
    The paper describes a link-similarity measure for web pages and proposes a simple algorithm for identifying clusters of web pages that are suspicious from the point of view of link spam. The clustering is based on a weighted page-similarity graph, which can be derived from the directed graph of links between web pages.
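
    The abstract does not specify the similarity measure, so the following sketch assumes a simple stand-in: Jaccard similarity of outlink sets. It builds a weighted similarity graph from a directed link graph, drops weak edges, and reports connected components as candidate link-spam clusters; the graph and threshold are made up.

```python
"""
Illustrative sketch only: Jaccard similarity of outlink sets is assumed
as the page-similarity measure.  A weighted similarity graph is built
from a directed link graph, edges below a threshold are dropped, and the
connected components are reported as candidate link-spam clusters.
"""
from itertools import combinations

# Assumed toy directed link graph: page -> set of pages it links to.
links = {
    "a": {"x", "y", "z"}, "b": {"x", "y", "z"},   # near-identical outlinks
    "c": {"x", "y"},      "d": {"p", "q"},
}

def jaccard(s, t):
    return len(s & t) / len(s | t) if s | t else 0.0

# Weighted similarity graph (only edges above the threshold are kept).
threshold = 0.5
sim_edges = {}
for u, v in combinations(links, 2):
    w = jaccard(links[u], links[v])
    if w >= threshold:
        sim_edges.setdefault(u, set()).add(v)
        sim_edges.setdefault(v, set()).add(u)

def components(nodes, adj):
    """Connected components of the thresholded similarity graph."""
    seen, comps = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            comp.add(u)
            stack.extend(adj.get(u, ()))
        comps.append(comp)
    return comps

for cluster in components(links, sim_edges):
    if len(cluster) > 1:
        print("suspicious cluster:", sorted(cluster))
```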

    Minimizing the total tardiness of independent jobs with due dates on a single machine in a planning and control system for small-batch production (СПУДВ)

    Get PDF
    The paper considers the problem of minimizing the total tardiness of independent jobs with due dates on a single machine, which is part of the mathematical software of the СПУДВ system. The problem is NP-hard, which makes it difficult to find not only exact but even approximate solution methods. An efficient exact ПДС-algorithm (an algorithm with polynomial and exponential components) is proposed for solving the problem, based on a new approach to due-date problems that consists in making optimal use of the time reserves of non-tardy jobs.
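
    The sketch below is not the paper's ПДС-algorithm; it is only a brute-force exact search on a tiny, made-up instance of the single-machine total-tardiness problem, included to make the objective function concrete.

```python
"""
Not the paper's ПДС-algorithm: a brute-force exact search on a tiny
instance of 1 || sum T_j, to make the objective concrete.  Each job j has
an assumed processing time p[j] and due date d[j]; tardiness of a job is
max(0, completion time - due date), and we minimize the total.
"""
from itertools import permutations

p = {1: 3, 2: 2, 3: 4, 4: 1}     # processing times (assumed)
d = {1: 4, 2: 3, 3: 9, 4: 5}     # due dates (assumed)

def total_tardiness(order):
    t, total = 0, 0
    for j in order:
        t += p[j]                       # completion time of job j
        total += max(0, t - d[j])       # tardiness of job j
    return total

best = min(permutations(p), key=total_tardiness)
print("optimal sequence:", best, "total tardiness:", total_tardiness(best))
```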

    The hw-rank: an h-index variant for ranking web pages

    Get PDF
    We introduce a novel ranking of search results based on a variant of the h-index for directed information networks such as the Web. The h-index was originally introduced to measure an individual researcher's scientific output and influence, but here a variant of it is applied to assess the "importance" of web pages. Like PageRank, the "importance" of a page is defined by the "importance" of the pages linking to it. However, unlike the computation of PageRank, which involves the whole web graph, computing the h-index for web pages (the hw-rank) is a local computation in which only the neighbors of the neighbors of the given node are considered. Preliminary results show a strong correlation between the rankings produced by the hw-rank and by PageRank, and the hw-rank is simpler and cheaper to compute. Further, larger-scale experiments are needed to assess the applicability of the method.
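
    The abstract does not spell out the hw-rank formula, so the sketch below assumes one plausible local h-index variant for illustration: the h-index of the in-degrees of the pages that link to a given page. Only a node's neighbours and their neighbours are touched, in contrast to the global PageRank computation.

```python
"""
Sketch under assumptions (not the paper's exact definition): the hw-rank
of a page is taken here to be the h-index of the in-degrees of its
in-neighbours, computed locally from the link graph.
"""
# Assumed toy web graph: page -> pages it links to.
out = {
    "a": ["b", "c"], "b": ["c"], "c": ["a", "d"],
    "d": ["c"], "e": ["c", "b"],
}

# In-link lists derived from the directed graph.
inlinks = {}
for u, targets in out.items():
    for v in targets:
        inlinks.setdefault(v, []).append(u)

def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    values = sorted(values, reverse=True)
    h = 0
    for i, v in enumerate(values, start=1):
        if v >= i:
            h = i
    return h

def hw_rank(page):
    """h-index of the in-degrees of the page's in-neighbours (assumed variant)."""
    neighbour_indegrees = [len(inlinks.get(u, [])) for u in inlinks.get(page, [])]
    return h_index(neighbour_indegrees)

for page in out:
    print(page, "hw-rank:", hw_rank(page))
```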

    Reverse Engineering Socialbot Infiltration Strategies in Twitter

    Full text link
    Data extracted from social networks like Twitter are increasingly being used to build applications and services that mine and summarize public reactions to events, such as traffic monitoring platforms, identification of epidemic outbreaks, and public perception about people and brands. However, such services are vulnerable to attacks from socialbots - automated accounts that mimic real users - seeking to tamper with statistics by posting automatically generated messages and interacting with legitimate users. Potentially, if created at large scale, socialbots could be used to bias or even invalidate many existing services by infiltrating the social networks and acquiring the trust of other users over time. This study aims at understanding infiltration strategies of socialbots in the Twitter microblogging platform. To this end, we create 120 socialbot accounts with different characteristics and strategies (e.g., gender specified in the profile, how active they are, the method used to generate their tweets, and the group of users they interact with), and investigate the extent to which these bots are able to infiltrate the Twitter social network. Our results show that even socialbots employing simple automated mechanisms are able to successfully infiltrate the network. Additionally, using a 2^k factorial design, we quantify the infiltration effectiveness of different bot strategies. Our analysis unveils findings that are key for the design of detection and countermeasure approaches.
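
    As a reminder of what the 2^k factorial analysis mentioned above computes, here is a sketch on made-up numbers (not the paper's data): each bot configuration toggles k = 3 binary factors, and main effects are estimated as contrast averages of the measured response.

```python
"""
Illustration of a 2^k factorial analysis on made-up numbers (not the
paper's data).  Each configuration toggles k = 3 binary factors (assumed
labels below), and the response is a fictitious "followers acquired"
count.  The main effect of a factor is the average response at its high
level minus the average at its low level.
"""
factors = ["gender_set", "high_activity", "generated_tweets"]   # assumed labels

# Assumed responses for every +/- combination of the three factors.
response = {
    (-1, -1, -1): 10, (+1, -1, -1): 12, (-1, +1, -1): 25, (+1, +1, -1): 28,
    (-1, -1, +1): 14, (+1, -1, +1): 15, (-1, +1, +1): 33, (+1, +1, +1): 36,
}

def main_effect(i):
    """Average response at the high level minus average at the low level."""
    high = [y for levels, y in response.items() if levels[i] == +1]
    low = [y for levels, y in response.items() if levels[i] == -1]
    return sum(high) / len(high) - sum(low) / len(low)

for i, name in enumerate(factors):
    print(f"main effect of {name}: {main_effect(i):+.2f}")
```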

    Ergodic Control and Polyhedral approaches to PageRank Optimization

    Full text link
    We study a general class of PageRank optimization problems which consist in finding an optimal outlink strategy for a web site subject to design constraints. We consider both a continuous problem, in which one can choose the intensity of a link, and a discrete one, in which each page has obligatory links, facultative links and forbidden links. We show that the continuous problem, as well as its discrete variant when there are no constraints coupling different pages, can both be modeled by constrained Markov decision processes with ergodic reward, in which the webmaster determines the transition probabilities of web surfers. Although the number of actions turns out to be exponential, we show that an associated polytope of transition measures has a concise representation, from which we deduce that the continuous problem is solvable in polynomial time, and that the same is true for the discrete problem when there are no coupling constraints. We also provide efficient algorithms, adapted to very large networks. Then, we investigate the qualitative features of optimal outlink strategies, and identify in particular assumptions under which there exists a "master" page to which all controlled pages should point. We report numerical results on fragments of the real web graph.
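
    A small-scale illustration of the discrete, uncoupled variant, by brute force rather than the paper's polynomial-time algorithms: one controlled page has obligatory and facultative outlinks, every subset of facultative links is tried, and the configuration giving that page the highest PageRank is kept. The graph, damping factor and link classes are assumptions for the example.

```python
"""
Brute-force illustration of discrete outlink optimization (not the
paper's algorithms).  Page 0 has obligatory and facultative outlinks;
every subset of the facultative links is tried and the subset giving
page 0 the highest PageRank is reported.  The graph and the damping
factor are assumed.
"""
from itertools import chain, combinations
import numpy as np

n, damping = 5, 0.85
fixed_out = {0: [1], 1: [2], 2: [0, 3], 3: [0], 4: [0, 2]}   # obligatory links
facultative = [2, 3, 4]                                      # optional links of page 0

def pagerank(out, tol=1e-10):
    """Standard power iteration with uniform teleportation."""
    P = np.zeros((n, n))
    for u, targets in out.items():
        for v in targets:
            P[u, v] = 1.0 / len(targets)
    r = np.full(n, 1.0 / n)
    for _ in range(10_000):
        r_new = (1 - damping) / n + damping * (r @ P)
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r

def subsets(items):
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))

best_choice, best_score = None, -1.0
for keep in subsets(facultative):
    out = dict(fixed_out)
    out[0] = fixed_out[0] + list(keep)          # obligatory + chosen facultative links
    score = pagerank(out)[0]
    if score > best_score:
        best_choice, best_score = keep, score

print("best facultative links for page 0:", best_choice, "PageRank:", round(best_score, 4))
```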

    An Analysis of Optimal Link Bombs

    Get PDF
    We analyze the phenomenon of collusion for the purpose of boosting the PageRank of a node in an interlinked environment. We investigate the optimal attack pattern for a group of nodes (attackers) attempting to improve the ranking of a specific node (the victim). We consider attacks where the attackers can only manipulate their own outgoing links. We show that the optimal attacks in this scenario are uncoordinated, i.e. the attackers link directly to the victim and to no one else; in particular, the attacker nodes do not link to each other. We also discuss optimal attack patterns for a group that wants to hide itself by not pointing directly to the victim. In these disguised attacks, the attackers link to nodes ℓ hops away from the victim. We show that an optimal disguised attack exists and how it can be computed. The optimal disguised attack also allows us to find optimal link farm configurations. A link farm can be considered a special case of our approach: the target page of the link farm is the victim and the other nodes in the link farm are the attackers for the purpose of improving the rank of the victim. The target page can, however, control its own outgoing links for the purpose of improving its own rank, which can be modeled as an optimal 1-hop disguised attack on itself. Our results are unique in the literature as we show optimality not only in the PageRank score, but also in the rank induced by the PageRank score. We further validate our results with experiments on a variety of random graph models.
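
    A toy comparison of the two attack shapes discussed above, on an assumed graph (this only illustrates the quantity being studied; it is not a proof of the optimality result): the victim's PageRank when the attackers point only at the victim versus when they also link to each other.

```python
"""
Toy comparison on an assumed graph: the victim's PageRank under a
direct-only attack versus an attack where the attackers also interlink.
"""
import numpy as np

n, damping, victim = 6, 0.85, 0
attackers = [3, 4, 5]
base_out = {0: [1], 1: [2], 2: [0]}        # honest part of the graph (assumed)

def pagerank(out):
    P = np.zeros((n, n))
    for u in range(n):
        targets = out.get(u, [])
        if targets:
            for v in targets:
                P[u, v] = 1.0 / len(targets)
        else:
            P[u, :] = 1.0 / n              # dangling node: jump uniformly
    r = np.full(n, 1.0 / n)
    for _ in range(200):
        r = (1 - damping) / n + damping * (r @ P)
    return r

# Uncoordinated attack: every attacker links to the victim only.
direct = {**base_out, **{a: [victim] for a in attackers}}

# Coordinated variant: attackers link to the victim and to each other.
coordinated = {**base_out,
               **{a: [victim] + [b for b in attackers if b != a] for a in attackers}}

print("victim PageRank, direct-only attack :", round(pagerank(direct)[victim], 4))
print("victim PageRank, interlinked attack :", round(pagerank(coordinated)[victim], 4))
```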

    An Efficient Clustering System for the Measure of Page (Document) Authoritativeness

    Get PDF
    A collection of documents D1 in a search result R1 forms a cluster if all the documents in D1 are similar to one another and dissimilar to another collection, say D2, for a given query Q1. This implies that, for a new query Q2, the search result R2 may be an intersection or a union of documents from D1, D2, or other collections, forming D3. Within these collections (D1, D2, D3, etc.), one or two pages will be more relevant to the query that retrieved them; such a page is regarded as more 'authoritative' than the others. Therefore, in a query context, a given search result contains pages of authority. The most important measure of a search engine's effectiveness is the quality of its results. This work clusters search results to ease the matching of retrieved documents to the user's need by attaching a page authority value (pav) to each page. We developed a classifier on the boundary between supervised and unsupervised learning that is computationally feasible and produces the most authoritative pages. A novel searching and clustering engine was built using several measure-factors, such as anchor text, proximity, PageRank, and features of neighbouring pages, to rate the retrieved pages. Documents and corpora with known relevance measures from the Text Retrieval Conference (TREC), the Initiative for the Evaluation of XML Retrieval (INEX) and the Reuters collection were fed into the system and compared against existing search engines (Google, VIVISIMO and Wikipedia), with very impressive results in our evaluation. Additionally, the system attaches a pav to every retrieved and classified page to indicate its relevance relative to the others. A document is a good match for a query if the document model is likely to generate the query, which in turn happens if the document contains the query words often. This approach thus provides a different realization of some of the basic ideas of document ranking, applied through a few simple rules: number of occurrences, document zone and relevance measures. The biggest problem facing users of web search engines today is the quality of the results they get back: while the results are often amusing and expand users' horizons, they are often frustrating and consume precious time. We provide a page ranker that does not depend heavily on weights supplied by the page developer but instead considers the actual factors inside and outside the target page. Although still experimental on research collections, it lets the user extract his or her relevant pages with ease from the first ten results of a search. Keywords: page authoritativeness, page rank, search results, clustering algorithm, web crawling
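
    The abstract names the measure-factors (anchor text, proximity, PageRank, neighbour features) but not how they are combined, so the sketch below assumes a simple weighted sum for the page authority value (pav); feature scores, weights and page names are all hypothetical.

```python
"""
Hypothetical sketch: a weighted sum of the measure-factors is assumed for
the page authority value (pav); the combination rule, feature values,
weights and page names are made up for illustration.
"""
# Assumed per-page feature scores, each already normalised to [0, 1].
pages = {
    "pageA": {"anchor_text": 0.9, "proximity": 0.7, "pagerank": 0.6, "neighbours": 0.8},
    "pageB": {"anchor_text": 0.4, "proximity": 0.9, "pagerank": 0.3, "neighbours": 0.5},
    "pageC": {"anchor_text": 0.2, "proximity": 0.3, "pagerank": 0.9, "neighbours": 0.4},
}

# Assumed relative importance of each factor (sums to 1).
weights = {"anchor_text": 0.35, "proximity": 0.25, "pagerank": 0.25, "neighbours": 0.15}

def pav(features):
    """Weighted combination of the measure-factors (assumed formula)."""
    return sum(weights[name] * value for name, value in features.items())

# Attach a pav to every page and list the results in authority order.
ranked = sorted(pages.items(), key=lambda item: pav(item[1]), reverse=True)
for name, features in ranked:
    print(f"{name}: pav = {pav(features):.3f}")
```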

    Detecting fake news in tweets from text and propagation graph: IRISA's participation to the FakeNews task at MediaEval 2020

    Get PDF
    This paper presents the participation of IRISA in the task of fake news detection from tweets, relying either on the text or on propagation information. For the text-based detection, variants of BERT-based classification are proposed. In order to improve this standard approach, we investigate the benefit of augmenting the dataset by creating tweets with fine-tuned generative models. For the graph-based detection, we propose models characterizing the propagation of the news or the users' reputation.
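
    A minimal sketch of a BERT-based tweet classifier in the spirit of the text track, not IRISA's actual system or checkpoints: it fine-tunes a generic English BERT on a tiny made-up list of labelled tweets. The dataset, label convention and hyper-parameters are illustrative assumptions.

```python
"""
Minimal BERT-based tweet classification sketch (not the authors' system).
Fine-tunes bert-base-uncased on a tiny assumed dataset; label 1 = fake.
"""
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tweets = ["5G towers spread the virus, share before it is deleted!",   # fake (assumed)
          "Health authorities published updated case counts today."]   # real (assumed)
labels = [1, 0]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tokenizer(tweets, padding=True, truncation=True, max_length=64, return_tensors="pt")
dataset = list(zip(enc["input_ids"], enc["attention_mask"], torch.tensor(labels)))
loader = DataLoader(dataset, batch_size=2, shuffle=True)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(2):                       # tiny demo run, not a real training budget
    for input_ids, attention_mask, y in loader:
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
        out.loss.backward()
        optimizer.step()
        optimizer.zero_grad()

# Inference on a new (made-up) tweet.
model.eval()
with torch.no_grad():
    test = tokenizer("Miracle cure suppressed by governments!", return_tensors="pt")
    pred = model(**test).logits.argmax(dim=-1).item()
print("predicted label:", "fake" if pred == 1 else "real")
```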