
    An Analysis of Optimal Link Bombs

    Get PDF
    We analyze the phenomenon of collusion for the purpose of boosting the PageRank of a node in an interlinked environment. We investigate the optimal attack pattern for a group of nodes (attackers) attempting to improve the ranking of a specific node (the victim). We consider attacks where the attackers can only manipulate their own outgoing links. We show that the optimal attacks in this scenario are uncoordinated, i.e., the attackers link directly to the victim and to no one else; the attacking nodes do not link to each other. We also discuss optimal attack patterns for a group that wants to hide itself by not pointing directly to the victim. In these disguised attacks, the attackers link to nodes l hops away from the victim. We show that an optimal disguised attack exists and how it can be computed. The optimal disguised attack also allows us to find optimal link farm configurations. A link farm can be considered a special case of our approach: the target page of the link farm is the victim, and the other nodes in the link farm are the attackers working to improve the rank of the victim. The target page can, however, control its own outgoing links for the purpose of improving its own rank, which can be modeled as an optimal 1-hop disguised attack on itself. Our results are unique in the literature in that we show optimality not only in the PageRank score, but also in the rank based on that score. We further validate our results with experiments on a variety of random graph models. Comment: full version of a paper which appeared in AIRWeb 200
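
    As a rough illustration of the comparison above, the sketch below (assuming networkx is installed; the graph size, damping factor, and node names are arbitrary choices, not taken from the paper) computes the victim's PageRank under an uncoordinated attack, where attackers link only to the victim, and under a coordinated variant where they also link to each other.

```python
# Minimal sketch (not the paper's code): compare the victim's PageRank when the
# attackers link only to the victim versus also linking to each other.
import networkx as nx

def victim_pagerank(base, attackers, victim, coordinated):
    g = base.copy()
    g.add_nodes_from(attackers + [victim])
    for a in attackers:
        g.add_edge(a, victim)          # every attacker points at the victim
        if coordinated:
            for b in attackers:
                if b != a:
                    g.add_edge(a, b)   # coordinated variant: attackers also link to each other
    return nx.pagerank(g, alpha=0.85)[victim]

base = nx.gnp_random_graph(200, 0.05, seed=1, directed=True)
attackers = [f"attacker_{i}" for i in range(10)]
print("uncoordinated:", victim_pagerank(base, attackers, "victim", coordinated=False))
print("coordinated:  ", victim_pagerank(base, attackers, "victim", coordinated=True))
```

    On typical runs the uncoordinated configuration should yield the higher victim score, consistent with the optimality claim above.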

    Evaluation of Spam Impact on Arabic Websites Popularity

    Get PDF
    The expansion of the Web and its information into all aspects of life raises the concern of how to trust information published on the Web, especially when the publisher may not be known. Websites strive to be more popular and to make themselves visible to search engines and, eventually, to users. Website popularity can be measured using several metrics, such as Web traffic (e.g., the number of visitors and the number of visited pages). Link or page popularity refers to the total number of hyperlinks referring to a certain Web page. In this study, several top-ranked Arabic websites are selected for evaluating possible Web spam behavior. Websites use spam techniques to boost their ranks within the Search Engine Results Page (SERP). The results of this study show that some of these popular websites are using techniques that are considered spam according to Search Engine Optimization guidelines.
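
    A minimal sketch of the link-popularity metric mentioned above: counting inbound hyperlinks per target page from a list of crawled (source, target) link pairs. The URLs are placeholders, not sites from the study.

```python
# Minimal sketch: link (page) popularity as the number of inbound hyperlinks,
# computed from a crawled list of (source_url, target_url) pairs.
# The link list below is illustrative, not data from the study.
from collections import Counter

crawled_links = [
    ("http://a.example/page1", "http://popular.example/"),
    ("http://b.example/news",  "http://popular.example/"),
    ("http://c.example/blog",  "http://other.example/item"),
]

inlink_counts = Counter(target for _, target in crawled_links)
for url, count in inlink_counts.most_common():
    print(count, url)
```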

    Malware distributions and graph structure of the Web

    Full text link
    Knowledge about the graph structure of the Web is important for understanding this complex socio-technical system and for devising proper policies to support its future development. Knowledge about the differences between clean and malicious parts of the Web is important for understanding potential threats to its users and for devising protection mechanisms. In this study, we apply data science methods to a large crawl of surface and deep Web pages with the aim of increasing such knowledge. To accomplish this, we answer the following questions. Which theoretical distributions explain important local characteristics and network properties of websites? How do these characteristics and properties differ between clean and malicious (malware-affected) websites? What is the predictive power of local characteristics and network properties for classifying malware websites? To the best of our knowledge, this is the first large-scale study describing the differences in global properties between malicious and clean parts of the Web. In other words, our work builds on and bridges the gap between \textit{Web science}, which tackles large-scale graph representations, and \textit{Web cyber security}, which is concerned with malicious activities on the Web. The results presented herein can also help antivirus vendors devise approaches to improve their detection algorithms.
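
    A minimal sketch of the kind of distribution analysis described above: estimating a power-law exponent for a degree distribution with the standard maximum-likelihood formula. The synthetic graph stands in for the crawl; this is not the study's data or pipeline.

```python
# Minimal sketch (not the study's pipeline): estimate a power-law exponent for a
# degree distribution via the MLE alpha = 1 + n / sum(ln(k_i / k_min)).
import math
import networkx as nx

def power_law_alpha(degrees, k_min=1):
    ks = [k for k in degrees if k >= k_min and k > 0]
    return 1.0 + len(ks) / sum(math.log(k / k_min) for k in ks)

# Illustrative host graph; in the study these properties would come from a large
# Web crawl with clean / malware-affected labels attached to each website.
g = nx.barabasi_albert_graph(5000, 3, seed=42)
degrees = [d for _, d in g.degree()]
print("estimated power-law exponent:", round(power_law_alpha(degrees, k_min=3), 2))
```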

    Machine Learning Techniques For Detecting Untrusted Pages on the Web

    Get PDF
    The Web is both an excellent medium for sharing information and an attractive platform for delivering products and services. This platform is, to some extent, mediated by search engines in order to meet the needs of users seeking information. Search engines are the “dragons” that keep a valuable treasure: information. Many web pages are unscrupulous and try to fool search engines to get to the top of the ranking. The goal of this project is to detect such spam pages. We particularly consider content spam and link spam, where untrusted pages use link structure to increase their importance. We pose this as a machine learning problem and build a classifier to classify pages into two categories: trustworthy and untrusted. We use link features, in other words structural characteristics of the web graph, and content-based features as input to the classifier. We propose link-based and content-based techniques for automating the detection of Web spam, a term referring to pages which use deceptive techniques to obtain undeservedly high scores in search engines. We propose a Naïve Bayesian classifier to detect content spam, and PageRank and TrustRank to detect link spam.
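
    A minimal sketch of the content-spam step described above: a Naïve Bayes classifier over bag-of-words features, assuming scikit-learn. The tiny training corpus and labels are illustrative placeholders, not the project's dataset.

```python
# Minimal sketch of the content-spam step: a Naive Bayes classifier over
# bag-of-words features. Assumes scikit-learn; the tiny corpus is illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_pages = [
    "cheap pills buy now discount casino winner",       # spam-like content
    "free money click here best prices guaranteed",     # spam-like content
    "conference proceedings on information retrieval",  # legitimate content
    "university lecture notes for graph algorithms",    # legitimate content
]
train_labels = ["untrusted", "untrusted", "trustworthy", "trustworthy"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_pages, train_labels)
print(model.predict(["buy discount pills now", "lecture notes on retrieval"]))
```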

    Essays on the Computation of Economic Equilibria and Its Applications.

    Full text link
    The computation of economic equilibria is a central problem in algorithmic game theory. In this dissertation, we investigate the existence of economic equilibria in several markets and games, the complexity of computing economic equilibria, and its application to rankings. It is well known that a competitive economy always has an equilibrium under mild conditions. In this dissertation, we study the complexity of computing competitive equilibria. We show that, given a competitive economy that fully respects all the conditions of Arrow-Debreu's existence theorem, it is PPAD-hard to compute an approximate competitive equilibrium. Furthermore, it is still PPAD-complete to compute an approximate equilibrium for economies with additively separable piecewise linear concave utility functions. Degeneracy is an important concept in game theory. We study the complexity of deciding degeneracy in games and show that it is NP-complete to decide whether a bimatrix game is degenerate. With the advent of the Internet, an agent can easily have access to multiple accounts. In this dissertation we study the path auction game, a model for QoS routing, supply chain management, and so on, with multiple edge ownership. We show that multiple edge ownership eliminates the possibility of reasonable solution concepts, such as a strategyproof or false-name-proof mechanism or Pareto-efficient Nash equilibria. The stationary distribution (an equilibrium point) of a Markov chain is widely used for ranking purposes. One of the most important applications is PageRank, part of the ranking algorithm of Google. By making use of perturbation theories of Markov chains, we show the optimal manipulation strategies of a Web spammer against PageRank under a few natural constraints. Finally, we make a connection between the ranking vector of PageRank, or the Invariant method, and the equilibrium of a Cobb-Douglas market. Furthermore, we propose the CES ranking method based on Constant Elasticity of Substitution (CES) utility functions.
    Ph.D. Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/64821/1/duye_1.pd
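
    A minimal sketch of the stationary-distribution computation referenced above: power iteration on a column-stochastic transition matrix with a PageRank-style damping factor. The 4-state chain is illustrative; the dissertation's perturbation analysis is not reproduced here.

```python
# Minimal sketch: stationary distribution of a Markov chain via power iteration,
# with PageRank-style damping. The 4-state transition matrix is illustrative.
import numpy as np

def stationary(P, damping=0.85, tol=1e-12, max_iter=1000):
    """P[i, j] = probability of moving from state j to state i (column-stochastic)."""
    n = P.shape[0]
    v = np.full(n, 1.0 / n)
    teleport = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        v_next = damping * P @ v + (1.0 - damping) * teleport
        if np.abs(v_next - v).sum() < tol:
            return v_next
        v = v_next
    return v

P = np.array([[0.0, 0.5, 0.0, 0.0],
              [1.0, 0.0, 0.5, 0.0],
              [0.0, 0.5, 0.0, 1.0],
              [0.0, 0.0, 0.5, 0.0]])   # each column sums to 1
print(stationary(P))
```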

    Measuring success of open source projects using web search engines

    Get PDF
    What makes an open source project successful? In this paper we show that the traditional measures of success of open source projects, such as the number of downloads, deployments, or commits, are sometimes inconvenient or even insufficient. We then correlate the success of an open source project with its popularity on the Web. We present several ideas for how such popularity could be measured using Web search engines and provide experimental results from a quantitative analysis of the proposed measures on representative large samples of open source projects from SourceForge.
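
    One way such a search-based popularity measure could look in practice is sketched below; `search_hit_count` is a hypothetical stand-in for a call to a search engine API (the paper does not prescribe a specific engine or query form), and the log scaling is an illustrative choice.

```python
# Minimal sketch of a search-engine-based popularity measure for open source
# projects. `search_hit_count` is a hypothetical placeholder for querying a real
# search engine API; here it returns canned numbers for illustration only.
import math

def search_hit_count(query: str) -> int:
    canned = {'"projectA" download': 8000, '"projectB" download': 90}
    return canned.get(query, 0)

def popularity(project: str) -> float:
    """Log-scaled count of pages mentioning the project together with 'download'."""
    hits = search_hit_count(f'"{project}" download')
    return math.log10(hits + 1)

for p in ["projectA", "projectB"]:
    print(p, round(popularity(p), 2))
```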

    Categorizing Blog Spam

    Get PDF
    The internet has matured into the focal point of our era. Its ecosystem is vast, complex, and in many regards unaccounted for. One of the most prevalent aspects of the internet is spam. Like the rest of the internet, spam has evolved from simply meaning ‘unwanted emails’ into a blanket term that encompasses any unsolicited or illegitimate content appearing in the wide range of media that exists on the internet. Many forms of spam permeate the internet, and spam architects continue to develop tools and methods to avoid detection. On the other side, cyber security engineers continue to develop more sophisticated detection tools to curb the harmful effects that come with spam. This virtual arms race has no end in sight. Most efforts thus far have gone toward accurately separating spam from ham, and rightfully so, since initial detection is essential. However, research is lacking in understanding the current ecosystem of spam, spam campaigns, and the behavior of the botnets that drive the majority of spam traffic. This thesis focuses on characterizing spam, particularly the spam that appears in forums, where the spam is delivered by bots posing as legitimate users. Forum spam is used primarily to push advertisements or to boost other websites’ perceived popularity by including HTTP links in the content of the post. We conduct an experiment to collect a sample of the blog posts and network activity of the spambots that exist on the internet. We then present a corpus on which analysis can be conducted and proceed with our own analysis. We cluster associated groups of users and IP addresses into entities, which we accept as a model of the underlying botnets that interact with our honeypots. We use Natural Language Processing (NLP) and Machine Learning (ML) to determine that creating semantic-based models of botnets is sufficient for distinguishing them from one another. We also find that the syntactic structure of posts varies little from botnet to botnet. Finally, we confirm that, to a large degree, botnet behavior and content hold across different domains.
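
    A minimal sketch of how semantic models can separate spam sources, in the spirit of the analysis above: TF-IDF features over post text plus k-means clustering, assuming scikit-learn. The posts and the number of clusters are placeholders, not the honeypot corpus or the thesis's exact method.

```python
# Minimal sketch: semantic grouping of forum-spam posts with TF-IDF + k-means.
# Assumes scikit-learn; the posts and the number of clusters are illustrative.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

posts = [
    "cheap replica watches free shipping visit our store",
    "best replica watches huge discount visit store now",
    "online casino bonus poker slots play and win today",
    "casino poker bonus win big jackpot play slots online",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(posts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for post, label in zip(posts, labels):
    print(label, post)
```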

    Efficient External-Memory Algorithms for Graph Mining

    Get PDF
    The explosion of big data in areas like the Web and social networks has posed big challenges to research activities, including data mining, information retrieval, and security. This dissertation focuses on a particular area, graph mining, and specifically proposes several novel algorithms to solve the problems of triangle listing and computation of the neighborhood function in large-scale graphs. We first study the classic problem of triangle listing. We generalize the existing in-memory algorithms into a single framework of 18 triangle-search techniques. We then develop a novel external-memory approach, which we call Pruned Companion Files (PCF), that supports disk operation of all 18 algorithms. When compared to the state-of-the-art available implementations MGT and PDTL, PCF runs 5-10 times faster and exhibits orders of magnitude less I/O. We next focus on the I/O complexity of triangle listing. Recent work by Pagh et al. provides an appealing theoretical I/O complexity for triangle listing via graph partitioning by random coloring of nodes. Since no implementation of Pagh is available and little is known about the comparison between Pagh and PCF, we carefully implement Pagh, investigate the properties of these algorithms, model their I/O cost, understand their shortcomings, and shed light on the conditions under which each method defeats the other. This insight leads us to develop a novel framework we call Trigon that surpasses the I/O performance of both techniques in all graphs and under all RAM conditions. We finally turn our attention to the neighborhood function. Exact computation of the neighborhood function is expensive in terms of CPU and I/O cost, and previous work mostly focuses on approximations. We show that our novel techniques developed for triangle listing can also be applied to this problem. We next study an application of the neighborhood function to the ranking of Internet hosts. Our method computes the neighborhood function of each host as an indication of its reputation. The evaluation shows that our method is robust to ranking manipulation and brings less spam to its top ranking list than PageRank and TrustRank.
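
    For orientation, the sketch below lists triangles in memory by intersecting neighbor sets under a total order on vertex ids; this is only the textbook baseline, not the PCF or Trigon external-memory algorithms described above.

```python
# Minimal in-memory sketch of triangle listing by neighbor intersection with a
# total order on vertices (not PCF or Trigon, which are external-memory methods).
from collections import defaultdict

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3), (3, 4)]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

# Orient each edge from lower to higher vertex id so every triangle is listed once.
higher = {u: {v for v in adj[u] if v > u} for u in adj}
triangles = [(u, v, w)
             for u in sorted(higher)
             for v in sorted(higher[u])
             for w in sorted(higher[u] & higher[v])]
print(triangles)   # [(0, 1, 2), (1, 2, 3)]
```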

    Detect Spammers in Online Social Networks

    Get PDF
    Fake followers in online social networks (OSNs) are accounts created to boost the rank of some target. These spammers can be generated by programs or by human beings, making them hard to identify. In this thesis, we propose a novel spammer detection method that detects near-duplicate accounts, i.e., accounts that share most of their followers. It is hard to discover such near-duplicates on large social networks that provide only limited remote access. We identify the near-duplicates and the corresponding spammers by estimating the Jaccard similarity using star sampling, a combination of uniform random sampling and breadth-first crawling. We then apply our method to Sina Weibo and Twitter. For Weibo, we find 395 near-duplicates, 12 million suspected spammers, and 741 million spam links. For Twitter, we find 129 near-duplicates, 4.93 million suspected spammers, and 2.608 billion spam links. Moreover, we cluster the near-duplicates and the corresponding spammers, and analyze the properties of each group.
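
    A minimal sketch of the underlying idea of estimating the Jaccard similarity between two accounts' follower sets from a uniform random sample; it is not the thesis's star-sampling implementation, and the synthetic follower sets are placeholders.

```python
# Minimal sketch: estimate Jaccard similarity of two follower sets from a uniform
# random sample of candidate followers (not the thesis's star-sampling code).
import random

random.seed(0)
followers_a = set(random.sample(range(1_000_000), 50_000))
followers_b = set(random.sample(range(1_000_000), 50_000)) | set(list(followers_a)[:30_000])

def jaccard_estimate(a, b, universe_size, sample_size=20_000):
    sample = random.sample(range(universe_size), sample_size)
    inter = sum(1 for x in sample if x in a and x in b)
    union = sum(1 for x in sample if x in a or x in b)
    return inter / union if union else 0.0

exact = len(followers_a & followers_b) / len(followers_a | followers_b)
print("exact:    ", round(exact, 3))
print("estimated:", round(jaccard_estimate(followers_a, followers_b, 1_000_000), 3))
```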