
    A Survey to Fix the Threshold and Implementation for Detecting Duplicate Web Documents

    The rapid growth of information accessible on the World Wide Web has made automated tools for locating, tracking, and analyzing information resources a necessity. Web mining is the branch of data mining concerned with the analysis of the World Wide Web; it draws on concepts from data mining, Internet technology, the World Wide Web and, more recently, the Semantic Web. Web mining can be defined as the process of discovering hidden yet potentially useful knowledge from data available on the web, and it comprises three sub-areas: web content mining, web structure mining, and web usage mining. Web content mining extracts knowledge from web pages and other web objects; web structure mining extracts knowledge about the link structure connecting web pages and other web objects; and web usage mining extracts the usage patterns created by users accessing web pages. Search engine technology has been central to the development of the World Wide Web: search engines are the chief gateways to information on the web, and the ability to locate content of particular interest amid an enormous mass of pages has made businesses more productive. Search engines answer queries by relying on web crawling, which populates an indexed repository of web pages. Crawlers build a local repository of the segment of the web they visit by navigating the web graph and retrieving pages. There are two main types of crawling: generic and focused. Generic crawlers crawl documents and links across diverse topics, while focused crawlers limit the pages they fetch with the aid of previously acquired specialized knowledge. The repositories built by web crawlers provide the input to systems that index, mine, and otherwise analyze pages, such as search engines. The rapid growth of the Internet and the growing need to integrate heterogeneous data are accompanied by the problem of near-duplicate data: even when near-duplicate documents are not bitwise identical, they are remarkably similar. Duplicate and near-duplicate web pages inflate index storage, slow down serving or increase its cost, and annoy users, causing serious problems for web search engines. It is therefore essential to design algorithms that detect such pages.
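    As a concrete illustration of the class of techniques this survey concerns, the sketch below flags near duplicates by comparing word-level shingle sets against a tunable similarity threshold. The shingle size, the Jaccard measure, and the 0.8 threshold are illustrative assumptions, not the survey's own recommendation.

```python
# A minimal sketch of one common near-duplicate detection approach:
# word-level shingling plus Jaccard similarity against a tunable threshold.
# Parameter choices here are illustrative assumptions only.

def shingles(text: str, k: int = 4) -> set:
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(page_a: str, page_b: str, threshold: float = 0.8) -> bool:
    """Flag two pages as near duplicates when similarity reaches the threshold."""
    return jaccard(shingles(page_a), shingles(page_b)) >= threshold

if __name__ == "__main__":
    a = "the quick brown fox jumps over the lazy dog near the river bank"
    b = a + " today"  # near-identical page differing by one trailing word
    sim = jaccard(shingles(a), shingles(b))
    print(round(sim, 3), near_duplicates(a, b))  # high similarity, flagged at 0.8
```

    In practice, web-scale systems avoid comparing every pair of pages by using fingerprinting schemes such as minhash or simhash to generate candidate pairs first; the threshold is then tuned on those candidates.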

    POISED: Spotting Twitter Spam Off the Beaten Paths

    Cybercriminals have found in online social networks a propitious medium to spread spam and malicious content. Existing techniques for detecting spam include predicting the trustworthiness of accounts and analyzing the content of these messages. However, advanced attackers can still successfully evade these defenses. Online social networks bring together people who have personal connections or share common interests to form communities. In this paper, we first show that users within a networked community share some topics of interest. Moreover, content shared on these social networks tends to propagate according to the interests of people. Dissemination paths may emerge where some communities post similar messages, based on the interests of those communities. Spam and other malicious content, on the other hand, follow different spreading patterns. Building on this insight, we present POISED, a system that leverages the differences in propagation between benign and malicious messages on social networks to identify spam and other unwanted content. We test our system on a dataset of 1.3M tweets collected from 64K users, and we show that our approach is effective in detecting malicious messages, reaching 91% precision and 93% recall. We also show that POISED's detection is more comprehensive than previous systems, by comparing it to three state-of-the-art spam detection systems proposed by the research community in the past. POISED significantly outperforms each of these systems. Moreover, through simulations, we show that POISED is effective in the early detection of spam messages and resilient against two well-known adversarial machine learning attacks.
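    The following is a minimal sketch of the kind of community-interest signal the paper's insight suggests: score a message by how much its topic profile overlaps the interests of the community it reaches, and flag low-overlap messages. The word-histogram profile, the overlap measure, and the cut-off are assumptions made for illustration; POISED's actual propagation-based model is considerably richer.

```python
# Hedged sketch: flag messages whose topics do not match the community they
# propagate in. Feature, names, and threshold are illustrative assumptions.

from collections import Counter

def topic_histogram(words):
    """Normalized word-frequency histogram used as a crude topic profile."""
    counts = Counter(words)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def overlap(msg_profile, community_profile):
    """Histogram intersection in [0, 1]; low overlap means off-topic."""
    return sum(min(msg_profile.get(w, 0.0), p) for w, p in community_profile.items())

def looks_like_spam(message_words, community_words, min_overlap=0.2):
    """Flag a message that is off-topic for the community it spreads in."""
    return overlap(topic_histogram(message_words),
                   topic_histogram(community_words)) < min_overlap

if __name__ == "__main__":
    community = "machine learning papers datasets models training gpu".split()
    benign = "new training datasets for machine learning".split()
    spam = "win free prize click this link now".split()
    print(looks_like_spam(benign, community), looks_like_spam(spam, community))
    # expected: False True
```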

    Combating Threats to the Quality of Information in Social Systems

    Many large-scale social systems such as Web-based social networks, online social media sites, and Web-scale crowdsourcing systems have been growing rapidly, enabling millions of human participants to generate, share, and consume content on a massive scale. This reliance on users can lead to many positive effects, including large-scale growth in the size and content of the community, bottom-up discovery of “citizen-experts”, serendipitous discovery of new resources beyond the scope of the system designers, and new social-based information search and retrieval algorithms. But the relative openness and reliance on users, coupled with the widespread interest in and growth of these social systems, carry risks and raise growing concerns over the quality of information in these systems. In this dissertation research, we focus on countering threats to the quality of information in self-managing social systems. Concretely, we identify three classes of threats to these systems: (i) content pollution by social spammers, (ii) coordinated campaigns for strategic manipulation, and (iii) threats to collective attention. To combat these threats, we propose three inter-related methods for detecting evidence of these threats, mitigating their impact, and improving the quality of information in social systems. We augment this three-fold defense with an exploration of their origins in “crowdturfing” – a sinister counterpart to the enormous positive opportunities of crowdsourcing. In particular, this dissertation research makes four unique contributions:
    • The first contribution of this dissertation research is a framework for detecting and filtering social spammers and content polluters in social systems. To detect and filter individual social spammers and content polluters, we propose and evaluate a novel social honeypot-based approach.
    • Second, we present a set of methods and algorithms for detecting coordinated campaigns in large-scale social systems. We propose and evaluate a content-driven framework for effectively linking free-text posts with common “talking points” and extracting campaigns from large-scale social systems (a minimal sketch of this idea follows this abstract).
    • Third, we present a dual study of the robustness of social systems to collective attention threats through both a data-driven modeling approach and deployment over a real system trace. We evaluate the effectiveness of countermeasures deployed based on the first moments of a bursting phenomenon in a real system.
    • Finally, we study the underlying ecosystem of crowdturfing for engaging in each of the three threat types. We present a framework for “pulling back the curtain” on crowdturfers to reveal their underlying ecosystem on both crowdsourcing sites and social media.
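    Below is a minimal sketch in the spirit of the content-driven campaign extraction named in the second contribution: posts are normalized down to their reused “talking point”, and groups pushed by many distinct accounts are surfaced as candidate campaigns. The normalization rule and the account threshold are illustrative assumptions, not the dissertation's method.

```python
# Hedged sketch of content-driven campaign extraction: group posts that share
# a normalized "talking point" and flag unusually large groups. Normalization
# and the size threshold are illustrative assumptions.

import re
from collections import defaultdict

def talking_point(post: str) -> str:
    """Strip URLs, @mentions, and punctuation to expose the reused message core."""
    post = re.sub(r"https?://\S+|@\w+", "", post.lower())
    return " ".join(re.findall(r"[a-z0-9]+", post))

def candidate_campaigns(posts, min_accounts=3):
    """Return talking points pushed by at least min_accounts distinct accounts."""
    groups = defaultdict(set)
    for account, text in posts:
        groups[talking_point(text)].add(account)
    return {tp: accts for tp, accts in groups.items() if len(accts) >= min_accounts}

if __name__ == "__main__":
    posts = [
        ("u1", "Buy cheap pills now! http://a.example"),
        ("u2", "Buy cheap pills NOW http://b.example"),
        ("u3", "buy cheap pills now @friend"),
        ("u4", "Lovely weather in Austin today"),
    ]
    print(candidate_campaigns(posts, min_accounts=3))
    # the three reworded "buy cheap pills now" posts surface as one candidate campaign
```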

    Detecting Abnormal Behavior in Web Applications

    The rapid advance of web technologies has made the Web an essential part of our daily lives. However, network attacks have exploited vulnerabilities of web applications and caused substantial damage to Internet users. Detecting network attacks is the first and an important step in network security, and a major branch of this area is anomaly detection. This dissertation concentrates on detecting abnormal behaviors in web applications using the following methodology: for a web application, we conduct a set of measurements to reveal the existence of abnormal behaviors in it, observe the differences between normal and abnormal behaviors, and, by applying a variety of methods in information extraction, such as heuristic algorithms, machine learning, and information theory, extract features useful for building a classification system to detect abnormal behaviors.

    In particular, we have studied four detection problems in web security. The first is detecting unauthorized hotlinking behavior that plagues hosting servers on the Internet. We analyze a group of common hotlinking attacks and the web resources targeted by them, and then present an anti-hotlinking framework for protecting materials on hosting servers. The second problem is detecting aggressive automation on Twitter. Our work determines whether a Twitter user is a human, bot, or cyborg based on the degree of automation. We observe the differences among the three categories in terms of tweeting behavior, tweet content, and account properties, and propose a classification system that combines features extracted from an unknown user to determine the likelihood of its being a human, bot, or cyborg. Furthermore, we shift the detection perspective from automation to spam and introduce the third problem, detecting social spam campaigns on Twitter. Evolved from individual spammers, spam campaigns manipulate and coordinate multiple accounts to spread spam on Twitter and display collective characteristics. We design an automatic classification system based on machine learning and apply multiple features to classify spam campaigns. Complementary to conventional spam detection methods, our work brings efficiency and robustness. Finally, we extend our detection research into the blogosphere to capture blog bots. In this problem, detecting the human presence is an effective defense against the automatic posting ability of blog bots. We introduce behavioral biometrics, mainly mouse and keyboard dynamics, to distinguish between human and bot. By passively monitoring user browsing activities, this detection method does not require any direct user participation and improves the user experience.
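    The sketch below illustrates one automation signal of the kind the human/bot/cyborg study relies on: very regular inter-posting intervals look bot-like, while human timing is bursty and irregular. The entropy estimate, the bin width, and the cut-off are illustrative assumptions; the actual classifier combines many behavioral, content, and account-property features.

```python
# Hedged sketch of a timing-regularity feature for spotting automation.
# Estimator and threshold are illustrative assumptions only.

import math
from collections import Counter

def interval_entropy(timestamps, bin_seconds=60):
    """Shannon entropy (bits) of inter-post intervals, bucketed into fixed bins."""
    gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
    bins = Counter(int(g // bin_seconds) for g in gaps)
    total = sum(bins.values())
    return -sum((c / total) * math.log2(c / total) for c in bins.values())

def likely_bot(timestamps, entropy_threshold=1.0):
    """Low interval entropy (very regular timing) is treated as bot-like."""
    return interval_entropy(timestamps) < entropy_threshold

if __name__ == "__main__":
    bot_times = [0, 300, 600, 900, 1200, 1500]          # posts every 5 minutes
    human_times = [0, 47, 980, 1130, 5400, 5460, 9000]  # bursty, irregular
    print(likely_bot(bot_times), likely_bot(human_times))
    # expected: True False
```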