PageRank optimization applied to spam detection
We give a new link spam detection and PageRank demotion algorithm called
MaxRank. Like TrustRank and AntiTrustRank, it starts with a seed of hand-picked
trusted and spam pages. We define the MaxRank of a page as the frequency with which the page is visited by a random surfer who minimizes an average cost per time unit.
On a given page, the random surfer selects a set of hyperlinks and clicks with
uniform probability on any of these hyperlinks. The cost function penalizes
spam pages and hyperlink removals. The goal is to determine a hyperlink
deletion policy that minimizes this score. The MaxRank is interpreted as a
modified PageRank vector, used to sort web pages instead of the usual PageRank
vector. The bias vector of this ergodic control problem, which is unique up to
an additive constant, is a measure of the "spamicity" of each page, used to
detect spam pages. We give a scalable algorithm for MaxRank computation that allowed us to run experiments on the WEBSPAM-UK2007 dataset. We
show that our algorithm outperforms both TrustRank and AntiTrustRank for spam
and nonspam page detection.
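As a rough illustration of the setting (not the authors' algorithm), the sketch below computes the stationary distribution of a uniform-click random surfer under a hyperlink-deletion policy and greedily searches for deletions that lower an average cost per time unit; the toy graph, page costs, damping factor, and removal penalty are all assumptions made for the example.

```python
# Minimal sketch, assuming an illustrative cost model: visiting a spam page is
# costly, and so is removing a hyperlink. Greedy single-link deletion stands in
# for the ergodic-control solution described in the abstract.
import numpy as np

def stationary(outlinks, pages, damping=0.85, tol=1e-10):
    """Stationary distribution of the uniform-click surfer with damping."""
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    P = np.full((n, n), (1 - damping) / n)
    for p, targets in outlinks.items():
        if targets:
            for q in targets:
                P[idx[p], idx[q]] += damping / len(targets)
        else:  # dangling page: jump uniformly
            P[idx[p], :] += damping / n
    pi = np.full(n, 1.0 / n)
    while True:
        new = pi @ P
        if np.abs(new - pi).sum() < tol:
            return dict(zip(pages, new))
        pi = new

def average_cost(policy, full_graph, pages, spam, removal_penalty=0.1):
    """Average cost per time unit: time spent on spam plus a penalty per removed link."""
    pi = stationary(policy, pages)
    spam_cost = sum(pi[p] for p in spam)
    removed = sum(len(full_graph[p]) - len(policy[p]) for p in pages)
    return spam_cost + removal_penalty * removed

# Tiny example graph; page "s" is a known spam page.
graph = {"a": ["b", "s"], "b": ["a", "c"], "c": ["a"], "s": ["a"]}
pages, spam = list(graph), {"s"}
policy = {p: list(t) for p, t in graph.items()}

# Greedy search over single-link deletions.
improved = True
while improved:
    improved = False
    best = average_cost(policy, graph, pages, spam)
    for p in pages:
        for q in list(policy[p]):
            trial = {k: [x for x in v if not (k == p and x == q)] for k, v in policy.items()}
            cost = average_cost(trial, graph, pages, spam)
            if cost < best - 1e-12:
                policy, best, improved = trial, cost, True

print("kept links:", policy)
print("modified PageRank:", stationary(policy, pages))
```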
Detecting search spam using a link-similarity measure for web pages
The paper describes a link-similarity measure for web pages and proposes a simple algorithm for identifying clusters of web pages that are suspicious with respect to the use of link spam. The clustering is based on a weighted page-similarity graph, which can be derived from the directed graph of links between web pages.
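The abstract does not specify the exact similarity measure, so the sketch below uses Jaccard similarity of out-link sets (bibliographic coupling) as a stand-in, builds the weighted similarity graph, and takes its connected components above a threshold as suspicious clusters; the toy web graph is a made-up example.

```python
# Illustrative sketch only: Jaccard similarity of out-link sets stands in for
# the paper's link-similarity measure; clusters are connected components of
# the thresholded weighted similarity graph.
from itertools import combinations

def similarity_graph(web_graph, threshold=0.5):
    """Weighted edges (p, q, w) with w = Jaccard(out(p), out(q)) >= threshold."""
    out = {p: set(t) for p, t in web_graph.items()}
    edges = []
    for p, q in combinations(out, 2):
        union = out[p] | out[q]
        if union:
            w = len(out[p] & out[q]) / len(union)
            if w >= threshold:
                edges.append((p, q, w))
    return edges

def suspicious_clusters(web_graph, threshold=0.5):
    """Connected components (size > 1) of the similarity graph, via union-find."""
    parent = {p: p for p in web_graph}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for p, q, _ in similarity_graph(web_graph, threshold):
        parent[find(p)] = find(q)
    groups = {}
    for p in web_graph:
        groups.setdefault(find(p), set()).add(p)
    return [g for g in groups.values() if len(g) > 1]

# Toy web graph: s1..s3 form a link-farm-like group all pointing to "target".
web = {
    "s1": ["target", "s2"],
    "s2": ["target", "s3"],
    "s3": ["target", "s1"],
    "target": ["home"],
    "home": ["news"],
    "news": [],
}
print(suspicious_clusters(web, threshold=0.3))  # e.g. [{'s1', 's2', 's3'}]
```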
Minimizing the total tardiness of independent jobs with due dates on a single machine in a planning and control system for small-batch production (СПУДВ)
The paper considers the problem of minimizing the total tardiness of independent jobs with due dates on a single machine, a problem that forms part of the mathematical support of the СПУДВ system. The problem is NP-hard, which makes it difficult to find not only exact but also approximate solution methods. An efficient exact PDS-algorithm (an algorithm with polynomial and exponential components) is proposed for solving the problem, based on a new approach to problems with due dates that consists in the optimal use of the time reserves of non-tardy jobs.
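For readers unfamiliar with the objective, the following sketch shows the total-tardiness criterion on a single machine and the classic earliest-due-date (EDD) ordering as a simple baseline sequence; it is not the exact PDS-algorithm from the paper, and the job data are invented.

```python
# Baseline sketch: total tardiness of a job sequence on one machine, with the
# earliest-due-date (EDD) rule as a simple starting heuristic.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    processing_time: int
    due_date: int

def total_tardiness(sequence):
    """Sum over jobs of max(0, completion_time - due_date)."""
    t, tardiness = 0, 0
    for job in sequence:
        t += job.processing_time
        tardiness += max(0, t - job.due_date)
    return tardiness

jobs = [Job("J1", 4, 5), Job("J2", 2, 3), Job("J3", 6, 8), Job("J4", 3, 14)]

edd = sorted(jobs, key=lambda j: j.due_date)  # earliest due date first
print([j.name for j in edd], "total tardiness =", total_tardiness(edd))
```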
The hw-rank: an h-index variant for ranking web pages
We introduce a novel ranking of search results based on a variant of the h-index for directed information networks such as the Web. The h-index was originally introduced to measure an individual researcher's scientific output and influence, but here a variant of it is applied to assess the "importance" of web pages. Like PageRank, the "importance" of a page is defined by the "importance" of the pages linking to it. However, unlike the computation of PageRank, which involves the whole web graph, computing the h-index for web pages (the hw-rank) is a local computation in which only the neighbors of the neighbors of the given node are considered. Preliminary results show a strong correlation between rankings produced by the hw-rank and PageRank, while the hw-rank is simpler and cheaper to compute. Further, larger-scale experiments are needed to assess the applicability of the method.
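A hedged sketch of the general idea: score a page by the h-index of the in-degrees of the pages linking to it, which only requires the page's in-neighbors and their in-neighbors. This is an illustration of an h-index variant for nodes, not necessarily the exact hw-rank definition from the paper; the toy in-link data are assumptions.

```python
# h-index-style score for web pages from purely local information.
def h_index(values):
    """Largest h such that at least h values are >= h."""
    values = sorted(values, reverse=True)
    h = 0
    for i, v in enumerate(values, start=1):
        if v >= i:
            h = i
    return h

def hw_scores(inlinks):
    """inlinks: {page: set of pages linking to it}. Score = h-index of in-neighbors' in-degrees."""
    indegree = {p: len(s) for p, s in inlinks.items()}
    return {p: h_index([indegree.get(q, 0) for q in s]) for p, s in inlinks.items()}

inlinks = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a"},
    "d": {"a", "b", "c"},
}
print(hw_scores(inlinks))  # e.g. {'a': 2, 'b': 1, 'c': 1, 'd': 2}
```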
Reverse Engineering Socialbot Infiltration Strategies in Twitter
Data extracted from social networks like Twitter are increasingly being used
to build applications and services that mine and summarize public reactions to
events, such as traffic monitoring platforms, identification of epidemic
outbreaks, and public perception about people and brands. However, such services are vulnerable to attacks from socialbots: automated accounts that mimic real users and seek to tamper with statistics by posting automatically generated messages and interacting with legitimate users. If created at large scale, socialbots could potentially be used to bias or even invalidate many existing services by infiltrating the social networks and acquiring the trust of other users over time. This study aims at understanding the infiltration strategies of
socialbots in the Twitter microblogging platform. To this end, we create 120
socialbot accounts with different characteristics and strategies (e.g., gender
specified in the profile, how active they are, the method used to generate
their tweets, and the group of users they interact with), and investigate the
extent to which these bots are able to infiltrate the Twitter social network.
Our results show that even socialbots employing simple automated mechanisms are
able to successfully infiltrate the network. Additionally, using a factorial design, we quantify the infiltration effectiveness of different bot strategies. Our analysis unveils findings that are key for the design of detection and countermeasure approaches.
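To make the factorial-design step concrete, the sketch below estimates main effects of binary strategy factors from a 2^3 experiment; the factor names and response values are invented for illustration and are not the paper's data.

```python
# Main-effect estimation in a 2-level factorial design (toy data).
import statistics

factors = ["active_posting", "generated_tweets", "targets_topic_group"]

# Synthetic response: followers acquired for each of the 2^3 factor settings.
response = {
    (0, 0, 0): 4,  (0, 0, 1): 9,  (0, 1, 0): 6,  (0, 1, 1): 12,
    (1, 0, 0): 10, (1, 0, 1): 18, (1, 1, 0): 13, (1, 1, 1): 25,
}

def main_effect(factor_index):
    """Mean response at the high level minus mean response at the low level."""
    high = [y for x, y in response.items() if x[factor_index] == 1]
    low = [y for x, y in response.items() if x[factor_index] == 0]
    return statistics.mean(high) - statistics.mean(low)

for i, name in enumerate(factors):
    print(f"{name}: main effect = {main_effect(i):+.2f} followers")
```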
Ergodic Control and Polyhedral approaches to PageRank Optimization
We study a general class of PageRank optimization problems which consist in
finding an optimal outlink strategy for a web site subject to design
constraints. We consider both a continuous problem, in which one can choose the intensity of a link, and a discrete one, in which, on each page, there are obligatory links, facultative links, and forbidden links. We show that the
continuous problem, as well as its discrete variant when there are no
constraints coupling different pages, can both be modeled by constrained Markov
decision processes with ergodic reward, in which the webmaster determines the
transition probabilities of websurfers. Although the number of actions turns
out to be exponential, we show that an associated polytope of transition
measures has a concise representation, from which we deduce that the continuous
problem is solvable in polynomial time, and that the same is true for the
discrete problem when there are no coupling constraints. We also provide
efficient algorithms, adapted to very large networks. Then, we investigate the
qualitative features of optimal outlink strategies, and identify in particular
assumptions under which there exists a "master" page to which all controlled
pages should point. We report numerical results on fragments of the real web
graph.
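The discrete setting can be pictured with a brute-force toy example: for a single controlled page with obligatory and facultative outlinks, enumerate which facultative links to keep so as to maximize the PageRank of a target page. The paper's algorithms avoid this exponential enumeration; the graph, damping factor, and link categories below are assumptions for illustration only.

```python
# Brute-force illustration of the discrete outlink-selection problem.
from itertools import chain, combinations

import numpy as np

def pagerank(outlinks, pages, damping=0.85, iters=200):
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    P = np.full((n, n), (1 - damping) / n)
    for p, targets in outlinks.items():
        share = damping / (len(targets) if targets else n)
        for q in (targets if targets else pages):
            P[idx[p], idx[q]] += share
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ P
    return dict(zip(pages, pi))

graph = {"home": [], "blog": ["home"], "ext1": ["blog"], "ext2": ["ext1"]}
pages = list(graph)
controlled = "home"
obligatory = ["blog"]           # links the webmaster must keep
facultative = ["ext1", "ext2"]  # links the webmaster may add or drop
target = "home"                 # page whose PageRank we want to maximize

best = None
subsets = chain.from_iterable(combinations(facultative, k) for k in range(len(facultative) + 1))
for extra in subsets:
    trial = dict(graph, **{controlled: obligatory + list(extra)})
    score = pagerank(trial, pages)[target]
    if best is None or score > best[1]:
        best = (extra, score)

print("facultative links to keep:", best[0], "PageRank:", round(best[1], 4))
```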
An Analysis of Optimal Link Bombs
We analyze the phenomenon of collusion for the purpose of boosting the
pagerank of a node in an interlinked environment. We investigate the optimal
attack pattern for a group of nodes (attackers) attempting to improve the
ranking of a specific node (the victim). We consider attacks where the
attackers can only manipulate their own outgoing links. We show that the
optimal attacks in this scenario are uncoordinated, i.e., each attacker links directly to the victim and to no one else; the attacking nodes do not link to each other. We
also discuss optimal attack patterns for a group that wants to hide itself by
not pointing directly to the victim. In these disguised attacks, the attackers link to nodes a number of hops away from the victim. We show that an optimal disguised
attack exists and how it can be computed. The optimal disguised attack also
allows us to find optimal link farm configurations. A link farm can be
considered a special case of our approach: the target page of the link farm is
the victim and the other nodes in the link farm are the attackers for the
purpose of improving the rank of the victim. The target page can however
control its own outgoing links for the purpose of improving its own rank, which
can be modeled as an optimal disguised attack of 1-hop on itself. Our results
are unique in the literature as we show optimality not only in the pagerank
score, but also in the rank based on the pagerank score. We further validate
our results with experiments on a variety of random graph models.
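A small numerical check of the uncoordinated-attack claim (not a proof, and not the paper's experiments) is sketched below: the victim's PageRank is compared when the attackers link only to the victim versus when they also link to each other; the background graph is made up.

```python
# Compare the victim's PageRank under uncoordinated vs. coordinated attacks.
import numpy as np

def pagerank(outlinks, damping=0.85, iters=200):
    pages = list(outlinks)
    idx = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    P = np.full((n, n), (1 - damping) / n)
    for p, targets in outlinks.items():
        share = damping / (len(targets) if targets else n)
        for q in (targets if targets else pages):
            P[idx[p], idx[q]] += share
    pi = np.full(n, 1.0 / n)
    for _ in range(iters):
        pi = pi @ P
    return dict(zip(pages, pi))

background = {"v": ["p1"], "p1": ["p2"], "p2": ["v", "p1"]}
attackers = ["a1", "a2", "a3"]

# Uncoordinated: each attacker links to the victim only.
uncoordinated = dict(background, **{a: ["v"] for a in attackers})
# Coordinated: attackers split their links between the victim and each other.
coordinated = dict(background, **{a: ["v"] + [b for b in attackers if b != a] for a in attackers})

print("victim PageRank, uncoordinated:", round(pagerank(uncoordinated)["v"], 4))
print("victim PageRank, coordinated:  ", round(pagerank(coordinated)["v"], 4))
```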
An Efficient Clustering System for the Measure of Page (Document) Authoritativeness
A collection of documents D1 in a search result R1 is a cluster if all the documents in D1 are similar to one another and dissimilar to another collection, say D2, for a given query Q1. Consequently, for a new query Q2 the search result R2 may contain an intersection or a union of documents from D1, D2, or further collections, forming D3. Within these collections D1, D2, D3, and so on, one or two pages are more relevant to the query that invoked them; such a page is regarded as more 'authoritative' than the others. In a query context, therefore, a given search result contains pages of authority. The most important measure of a search engine's efficiency is the quality of its search results. This work clusters search results to ease the matching of retrieved documents to the user's need by attaching a page authority value (pav) to each page. We developed a classifier that lies between supervised and unsupervised learning, is computationally feasible, and surfaces the most authoritative pages. A novel searching and clustering engine was built using several measure-factors, such as anchor text, proximity, page rank, and features of neighbors, to rate the retrieved pages. Documents from corpora with known relevance judgements, from the Text Retrieval Conference (TREC), the Initiative for the Evaluation of XML Retrieval (INEX), and the Reuters collection, were fed into our system and evaluated against existing search engines (Google, VIVISIMO, and Wikipedia). We obtained very impressive results in this evaluation. Additionally, our system attaches a pav to every retrieved and classified page to indicate its relevance relative to the others. A document is a good match to a query if the document model is likely to generate the query, which in turn happens if the document contains the query words often. This approach thus provides a different realization of some of the basic ideas of document ranking, applied through a few simple rules: number of occurrences, document zone, and relevance measures. The biggest problem facing users of web search engines today is the quality of the results they get back; while the results are often amusing and expand users' horizons, they are also often frustrating and consume precious time. We provide a page ranker that does not depend heavily on weights imposed by the page developer but considers the actual factors within and around the target page. Though still experimental on research collections, it allows the user to extract the relevant pages with ease from within the first ten results of a search listing.
Keywords: page authoritativeness, page rank, search results, clustering algorithm, web crawling
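As a hedged sketch of how the named measure-factors might be combined into a single page authority value, the example below scores each result as a weighted sum of normalized features; the feature definitions, weights, and data are illustrative assumptions, not the authors' formulas.

```python
# Combine anchor text, proximity, page rank, and neighbor features into a pav.
from dataclasses import dataclass

@dataclass
class PageFeatures:
    url: str
    anchor_text_match: float   # fraction of query terms found in anchor texts
    proximity: float           # closeness of query terms within the page, in [0, 1]
    pagerank: float            # link-based importance score, normalized to [0, 1]
    neighbor_relevance: float  # mean relevance of linking/linked pages, in [0, 1]

WEIGHTS = {"anchor_text_match": 0.3, "proximity": 0.2,
           "pagerank": 0.3, "neighbor_relevance": 0.2}

def pav(page: PageFeatures) -> float:
    """Page authority value as a weighted sum of normalized measure-factors."""
    return sum(WEIGHTS[name] * getattr(page, name) for name in WEIGHTS)

results = [
    PageFeatures("http://example.org/a", 0.9, 0.7, 0.6, 0.8),
    PageFeatures("http://example.org/b", 0.4, 0.9, 0.2, 0.3),
    PageFeatures("http://example.org/c", 0.7, 0.5, 0.9, 0.6),
]

# Attach a pav to every result and present the most authoritative pages first.
for page in sorted(results, key=pav, reverse=True):
    print(f"{page.url}  pav={pav(page):.2f}")
```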
Detecting fake news in tweets from text and propagation graph: IRISA's participation to the FakeNews task at MediaEval 2020
This paper presents the participation of IRISA in the task of fake news detection from tweets, relying either on the text or on propagation information. For the text-based detection, variants of BERT-based classification are proposed. To improve this standard approach, we investigate the benefit of augmenting the dataset by creating tweets with fine-tuned generative models. For the graph-based detection, we propose models characterizing the propagation of the news or the users' reputation.
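A minimal sketch of a BERT-based tweet classifier of the kind mentioned above is shown below using the Hugging Face transformers library; the model name, toy tweets, labels, and training loop are illustrative assumptions and are not IRISA's pipeline or the MediaEval data.

```python
# Toy fine-tuning loop for a binary fake/real tweet classifier.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

tweets = ["5G towers cause the virus, share before it gets deleted!",
          "Health authorities published updated case counts this morning."]
labels = torch.tensor([1, 0])  # 1 = fake, 0 = real (toy labels)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few toy training steps
    batch = tokenizer(tweets, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

model.eval()
with torch.no_grad():
    logits = model(**tokenizer(["Breaking: miracle cure confirmed!!!"],
                               return_tensors="pt")).logits
print("predicted class:", logits.argmax(dim=-1).item())
```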