An Analysis of Optimal Link Bombs
We analyze the phenomenon of collusion for the purpose of boosting the
pagerank of a node in an interlinked environment. We investigate the optimal
attack pattern for a group of nodes (attackers) attempting to improve the
ranking of a specific node (the victim). We consider attacks where the
attackers can only manipulate their own outgoing links. We show that the
optimal attacks in this scenario are uncoordinated, i.e., the attackers link
directly to the victim and to no one else; the attacking nodes do not link to each other. We
also discuss optimal attack patterns for a group that wants to hide itself by
not pointing directly to the victim. In these disguised attacks, the attackers
link to nodes a given number of hops away from the victim. We show that an optimal disguised
attack exists and how it can be computed. The optimal disguised attack also
allows us to find optimal link farm configurations. A link farm can be
considered a special case of our approach: the target page of the link farm is
the victim and the other nodes in the link farm are the attackers for the
purpose of improving the rank of the victim. The target page can, however,
control its own outgoing links for the purpose of improving its own rank, which
can be modeled as an optimal disguised attack of 1-hop on itself. Our results
are unique in the literature as we show optimality not only in the pagerank
score, but also in the rank based on the pagerank score. We further validate
our results with experiments on a variety of random graph models.
Comment: Full version of a paper which appeared in AIRWeb 200
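As a minimal illustration of the uncoordinated-attack result, the sketch below compares the victim's pagerank when attackers link only to the victim versus when they also link to each other. The base graph, node names, and group sizes are illustrative assumptions, not taken from the paper.

```python
# Sketch: compare a victim's PageRank under two attack patterns.
# The toy base graph and attacker count are illustrative assumptions.
import networkx as nx

def victim_pagerank(extra_attacker_edges):
    G = nx.DiGraph()
    # A small "honest" Web: a cycle plus a hub, purely illustrative.
    G.add_edges_from([(0, 1), (1, 2), (2, 0), (3, 0), (3, 1), (3, 2)])
    victim = "v"
    attackers = [f"a{i}" for i in range(5)]
    for a in attackers:
        G.add_edge(a, victim)                      # every attacker points at the victim
    G.add_edges_from(extra_attacker_edges(attackers))  # optional extra attacker links
    return nx.pagerank(G, alpha=0.85)[victim]

uncoordinated = victim_pagerank(lambda atk: [])    # attackers link to the victim only
coordinated = victim_pagerank(lambda atk: [(a, b) for a in atk
                                           for b in atk if a != b])  # plus each other
print(f"uncoordinated: {uncoordinated:.4f}, coordinated: {coordinated:.4f}")
# On this toy graph the uncoordinated pattern gives the victim the higher score,
# since mass kept inside the attacker clique is discounted at each extra hop.
```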
Evaluation of Spam Impact on Arabic Websites Popularity
The expansion of the Web and its information into all aspects of life raises the concern of how to trust information published on the Web, especially when the publisher may not be known. Websites strive to be more popular and to make themselves visible to search engines and, eventually, to users. Website popularity can be measured using several metrics, such as Web traffic (e.g., a Website's number of visitors and number of visited pages). Link or page popularity refers to the total number of hyperlinks referring to a certain Web page. In this study, several top-ranked Arabic Websites are selected for evaluating possible Web spam behavior. Websites use spam techniques to boost their ranks within the Search Engine Results Page (SERP). The results of this study showed that some of these popular Websites use techniques that are considered spam techniques according to Search Engine Optimization guidelines.
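A minimal sketch of the link-popularity metric described above, counting distinct pages that link to a URL; the crawl data here is an illustrative stand-in, not from the study.

```python
# Sketch: link (page) popularity as the number of distinct pages linking to a URL.
# The link list below is illustrative example data only.
from collections import defaultdict

crawled_links = [
    ("http://site-a.example/index", "http://news.example/story1"),
    ("http://site-b.example/blog", "http://news.example/story1"),
    ("http://site-b.example/blog", "http://portal.example/home"),
]

in_links = defaultdict(set)
for source, target in crawled_links:
    in_links[target].add(source)          # count each linking page once

link_popularity = {url: len(sources) for url, sources in in_links.items()}
print(link_popularity)
# {'http://news.example/story1': 2, 'http://portal.example/home': 1}
```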
Malware distributions and graph structure of the Web
Knowledge about the graph structure of the Web is important for understanding
this complex socio-technical system and for devising proper policies supporting
its future development. Knowledge about the differences between clean and
malicious parts of the Web is important for understanding potential threats to
its users and for devising protection mechanisms. In this study, we apply
data science methods to a large crawl of surface and deep Web pages with the
aim of increasing such knowledge. To accomplish this, we answer the following
questions. Which theoretical distributions explain important local
characteristics and network properties of websites? How are these
characteristics and properties different between clean and malicious
(malware-affected) websites? What is the prediction power of local
characteristics and network properties to classify malware websites? To the
best of our knowledge, this is the first large-scale study describing the
differences in global properties between malicious and clean parts of the Web.
In other words, our work is building on and bridging the gap between
\textit{Web science} that tackles large-scale graph representations and
\textit{Web cyber security} that is concerned with malicious activities on the
Web. The results presented herein can also help antivirus vendors in devising
approaches to improve their detection algorithms.
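The classification question posed above (how well local characteristics and network properties predict malware websites) has roughly the pipeline shape sketched below. The feature set and the randomly generated data are assumptions for illustration; real labels would come from a malware/blacklist feed.

```python
# Sketch: classify websites as clean vs. malware-affected from host-graph features.
# Feature choice and the synthetic data are illustrative assumptions only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_sites = 1000
# Example local/network properties per website: in-degree, out-degree, PageRank, page count.
X = rng.lognormal(mean=1.0, sigma=1.0, size=(n_sites, 4))
y = rng.integers(0, 2, size=n_sites)   # placeholder labels (0 = clean, 1 = malware)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print("AUC per fold:", scores.round(3))   # ~0.5 here because the labels are random
```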
Machine Learning Techniques For Detecting Untrusted Pages on the Web
The Web is both an excellent medium for sharing information and an attractive platform for delivering products and services. This platform is, to some extent, mediated by search engines in order to meet the needs of users seeking information. Search engines are the “dragons” that keep a valuable treasure: information.
Many web pages are unscrupulous and try to fool search engines in order to reach the top of the ranking. The goal of this project is to detect such spam pages. We particularly consider content spam and link spam, where untrusted pages use the link structure to inflate their importance. We pose this as a machine learning problem and build a classifier that classifies pages into two categories: trustworthy and untrusted. We use both link features, that is, structural characteristics of the web graph, and content-based features as input to the classifier.
We propose link-based and content-based techniques for automating the detection of Web spam, a term referring to pages that use deceptive techniques to obtain undeservedly high scores in search engines. We propose a Naïve Bayes classifier to detect content spam, and PageRank and TrustRank to detect link spam.
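A minimal sketch of the content-spam side of such a pipeline: a Naive Bayes classifier over page text. The tiny training set is an illustrative assumption; a real system would train on a labelled corpus and add link features (PageRank, TrustRank) separately.

```python
# Sketch: content-spam detection with a Naive Bayes classifier over page text.
# The four "pages" below are illustrative examples, not real training data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pages = [
    "cheap pills buy now best price cheap cheap",        # spam-like keyword stuffing
    "casino free money click here click here win",
    "department seminar on graph algorithms next week",
    "open source release notes and installation guide",
]
labels = [1, 1, 0, 0]                  # 1 = untrusted/spam, 0 = trustworthy

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(pages, labels)
print(model.predict(["free money best price click here"]))   # -> [1]
```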
Essays on the Computation of Economic Equilibria and Its Applications.
The computation of economic equilibria is a central
problem in algorithmic game theory. In this dissertation, we
investigate the existence of economic equilibria in several
markets and games, the complexity of computing economic
equilibria, and its application to rankings.
It is well known that a competitive economy always has an
equilibrium under mild conditions. In this dissertation, we study
the complexity of computing competitive equilibria. We show that
given a competitive economy that fully respects all the conditions
of Arrow-Debreu's existence theorem, it is PPAD-hard to compute an
approximate competitive equilibrium. Furthermore, it is still
PPAD-complete to compute an approximate equilibrium for economies
with additively separable piecewise linear concave utility
functions.
Degeneracy is an important concept in game theory. We study the
complexity of deciding degeneracy in games. We show that it is
NP-complete to decide whether a bimatrix game is degenerate.
With the advent of the Internet, an agent can easily have access
to multiple accounts. In this dissertation we study the path
auction game, which is a model for QoS routing, supply chain
management, and so on, with multiple edge ownership. We show that
the condition of multiple edge ownership eliminates the
possibility of reasonable solution concepts, such as a
strategyproof or false-name-proof mechanism or Pareto efficient
Nash equilibria.
The stationary distribution (an equilibrium point) of a Markov
chain is widely used for ranking purposes. One of the most
important applications is PageRank, part of the ranking algorithm
of Google. By making use of perturbation theories of Markov
chains, we show the optimal manipulation strategies of a Web
spammer against PageRank under a few natural constraints. Finally,
we make a connection between the ranking vector of PageRank or the
Invariant method and the equilibrium of a Cobb-Douglas market.
Furthermore, we propose the CES ranking method based on the
Constant Elasticity of Substitution (CES) utility functions.
Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies.
http://deepblue.lib.umich.edu/bitstream/2027.42/64821/1/duye_1.pd
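As background for the ranking application described above (PageRank as the stationary distribution of a perturbed Markov chain), here is a minimal power-iteration sketch; the link matrix is an arbitrary example and the damping factor is the customary 0.85, not values from the dissertation.

```python
# Sketch: PageRank as the stationary distribution of a damped Markov chain,
# computed by power iteration. The adjacency matrix is an arbitrary toy example.
import numpy as np

def pagerank(adj, alpha=0.85, tol=1e-10):
    n = adj.shape[0]
    out = adj.sum(axis=1, keepdims=True)
    dangling = (adj.sum(axis=1) == 0)       # nodes with no out-links
    out[out == 0] = 1                       # avoid division by zero for dangling rows
    P = adj / out                           # row-stochastic transition matrix
    pi = np.full(n, 1.0 / n)
    while True:
        # dangling mass is spread uniformly; teleportation adds (1 - alpha)/n
        new = alpha * (pi @ P + pi[dangling].sum() / n) + (1 - alpha) / n
        if np.abs(new - pi).sum() < tol:
            return new
        pi = new

adj = np.array([[0, 1, 1, 0],
                [0, 0, 1, 0],
                [1, 0, 0, 0],
                [0, 0, 1, 0]], dtype=float)
print(pagerank(adj).round(4))               # stationary (equilibrium) distribution
```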
Measuring success of open source projects using web search engines
What makes an open source project successful?
In this paper we show that the traditional factors of success of open source projects, such as the number of downloads, deployments, or commits, are sometimes inconvenient or even insufficient. We then correlate the success of an open source project with its popularity on the Web. We present several ideas for how such popularity could be measured using Web search engines and provide experimental results from a quantitative analysis of the proposed measures on representative large samples of open source projects from SourceForge.
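A minimal sketch of what search-engine-based popularity measures could look like; the queries and the `search_hit_count` helper are placeholders and assumptions, since the paper's actual measures and search backend are not given here.

```python
# Sketch: Web-popularity measures for an open source project, built on search hit
# counts. `search_hit_count` is a hypothetical placeholder; plug in a real
# search engine API to use it.
def search_hit_count(query: str) -> int:
    raise NotImplementedError("plug in a search engine API here")

def popularity_measures(project: str, homepage: str) -> dict:
    return {
        "name_mentions": search_hit_count(f'"{project}"'),
        "name_with_context": search_hit_count(f'"{project}" open source'),
        "inlinks_to_homepage": search_hit_count(f"link:{homepage}"),
    }

# Example (requires a real search backend):
# print(popularity_measures("gimp", "www.gimp.org"))
```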
Categorizing Blog Spam
The internet has matured into the focal point of our era. Its ecosystem is vast, complex, and in many regards unaccounted for. One of the most prevalent aspects of the internet is spam. Similar to the rest of the internet, spam has evolved from simply meaning ‘unwanted emails’ to a blanket term that encompasses any unsolicited or illegitimate content that appears in the wide range of media that exists on the internet.
Many forms of spam permeate the internet, and spam architects continue to develop tools and methods to avoid detection. On the other side, cyber security engineers continue to develop more sophisticated detection tools to curb the harmful effects that come with spam. This virtual arms race has no end in sight. Most efforts thus far have gone toward accurately distinguishing spam from ham, and rightfully so, since initial detection is essential. However, research is lacking in understanding the current ecosystem of spam, spam campaigns, and the behavior of the botnets that drive the majority of spam traffic.
This thesis focuses on characterizing spam, particularly the spam that appears in forums, where the spam is delivered by bots posing as legitimate users. Forum spam is used primarily to push advertisements or to boost other websites’ perceived popularity by including HTTP links in the content of the post. We conduct an experiment to collect a sample of the blog posts and network activity of the spambots that operate on the internet. We then present a corpus available for analysis and proceed with our own analysis. We cluster associated groups of users and IP addresses into entities, which we accept as a model of the underlying botnets that interact with our honeypots. We use Natural Language Processing (NLP) and Machine Learning (ML) to determine that creating semantic-based models of botnets is sufficient for distinguishing them from one another. We also find that the syntactic structure of posts has little variation from botnet to botnet. Finally, we confirm that, to a large degree, botnet behavior and content hold across different domains.
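As a rough illustration of the semantic clustering step described above, the sketch below groups spam posts by content using TF-IDF and k-means. The technique choice, the example posts, and the number of clusters are assumptions for illustration, not the thesis's actual models.

```python
# Sketch: group spam forum posts into candidate botnet clusters by their content.
# TF-IDF + k-means stands in for the semantic models described; posts and the
# number of clusters are illustrative assumptions.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

posts = [
    "buy cheap watches free shipping http://watches.example",
    "cheap replica watches best deals http://watches.example/sale",
    "play online casino win big bonus http://casino.example",
    "casino bonus codes win real money http://casino.example/bonus",
]

X = TfidfVectorizer().fit_transform(posts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)   # posts advertising the same campaign should share a label
```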
Efficient External-Memory Algorithms for Graph Mining
The explosion of big data in areas like the Web and social networks has posed major challenges to research activities, including data mining, information retrieval, and security. This dissertation focuses on a particular area, graph mining, and specifically proposes several novel algorithms to solve the problems of triangle listing and of computing the neighborhood function in large-scale graphs.
We first study the classic problem of triangle listing. We generalize the existing in-memory algorithms into a single framework of 18 triangle-search techniques. We then develop a novel external-memory approach, which we call Pruned Companion Files (PCF), that supports disk operation of all 18 algorithms. When compared to state-of-the-art available implementations MGT and PDTL, PCF runs 5-10 times faster and exhibits orders of magnitude less I/O.
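For readers unfamiliar with triangle listing, here is a minimal in-memory sketch in the classic intersection-based style that such frameworks generalize; it is an illustrative baseline, not the external-memory PCF method itself, and the toy graph is an assumption.

```python
# Sketch: in-memory triangle listing by orienting edges toward higher (degree, id)
# and intersecting forward neighbor sets, so each triangle is listed exactly once.
# This is a classic baseline technique, not the external-memory PCF algorithm.
from collections import defaultdict

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]   # toy undirected graph

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

rank = {v: (len(adj[v]), v) for v in adj}                   # degree-based ordering
fwd = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}

triangles = [(u, v, w) for u in adj for v in fwd[u] for w in fwd[u] & fwd[v]]
print(triangles)   # [(0, 1, 2), (1, 2, 3)] for this toy graph
```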
We next focus on the I/O complexity of triangle listing. Recent work by Pagh et al. provides an appealing theoretical I/O complexity for triangle listing via graph partitioning by random coloring of nodes. Since no implementation of Pagh et al.'s algorithm is available and little is known about how it compares with PCF, we carefully implement it, undertake an investigation into the properties of these algorithms, model their I/O cost, understand their shortcomings, and shed light on the conditions under which each method outperforms the other. This insight leads us to develop a novel framework we call Trigon that surpasses the I/O performance of both techniques on all graphs and under all RAM conditions.
We finally turn our attention to the neighborhood function. Exact computation of the neighborhood function is expensive in terms of CPU and I/O cost, so previous work mostly focuses on approximations. We show that the novel techniques we developed for triangle listing can also be applied to this problem. We next study an application of the neighborhood function to the ranking of Internet hosts. Our method computes the neighborhood function of each host as an indication of its reputation. The evaluation shows that our method is robust to ranking manipulation and admits less spam into its top-ranked list compared to PageRank and TrustRank.
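A minimal definition-level sketch of the neighborhood function, computed exactly with a BFS from every node; the toy graph is an assumption and the code is in-memory, unlike the I/O-efficient techniques the dissertation develops.

```python
# Sketch: the neighborhood function N(h) = number of node pairs (s, v), including
# s = v, with distance at most h, computed exactly via BFS from every node.
from collections import deque, defaultdict

edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 4)]            # toy undirected graph
adj = defaultdict(list)
for u, v in edges:
    adj[u].append(v)
    adj[v].append(u)

def neighborhood_function(adj):
    counts = defaultdict(int)
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:                                             # plain BFS from s
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        for d in dist.values():
            counts[d] += 1                                   # pairs at distance exactly d
    return {h: sum(c for d, c in counts.items() if d <= h) for h in sorted(counts)}

print(neighborhood_function(adj))    # cumulative reachable pairs per hop radius
```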
Detect Spammers in Online Social Networks
Fake followers in online social networks (OSNs) are accounts created to boost the rank of certain targets. These spammers can be generated by programs or by human beings, making them hard to identify. In this thesis, we propose a novel spammer detection method that detects near-duplicate accounts, i.e., accounts sharing most of their followers. It is hard to discover such near-duplicates on large social networks that provide only limited remote access. We identify the near-duplicates and the corresponding spammers by estimating the Jaccard similarity using star sampling, a combination of uniform random sampling and breadth-first crawling. We then apply our method to Sina Weibo and Twitter. For Weibo, we find 395 near-duplicates, 12 million suspected spammers, and 741 million spam links. For Twitter, we find 129 near-duplicates, 4.93 million suspected spammers, and 2.608 billion spam links. Moreover, we cluster the near-duplicates and the corresponding spammers and analyze the properties of each group.
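A minimal sketch of the core quantity used above: estimating the Jaccard similarity of two accounts' follower sets from uniform samples of user ids. The follower sets, universe, and sample size are illustrative assumptions; the full star-sampling crawl and the near-duplicate threshold are not reproduced here.

```python
# Sketch: estimate the Jaccard similarity of two accounts' follower sets by
# uniform sampling. Follower sets and sample sizes are toy assumptions.
import random

def estimate_jaccard(followers_a, followers_b, universe, samples=20000, seed=0):
    rng = random.Random(seed)
    hits_union = hits_inter = 0
    for _ in range(samples):
        u = rng.choice(universe)             # uniform sample of user ids
        in_a, in_b = u in followers_a, u in followers_b
        hits_union += in_a or in_b
        hits_inter += in_a and in_b
    return hits_inter / hits_union if hits_union else 0.0

universe = list(range(100_000))
a = set(range(0, 60_000))                    # toy follower sets with heavy overlap
b = set(range(20_000, 80_000))
print(f"estimated Jaccard ~ {estimate_jaccard(a, b, universe):.3f} (true value 0.5)")
```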