4,775 research outputs found

    Locality-Sensitive Hashing for Efficient Web Application Security Testing

    Web application security has become a major concern in recent years, as more and more content and services are available online. A useful method for identifying security vulnerabilities is black-box testing, which relies on automated crawling of web applications. However, crawling Rich Internet Applications (RIAs) is a very challenging task. One of the key obstacles crawlers face is the state similarity problem: how to determine whether two client-side states are equivalent. As current methods do not completely solve this problem, a successful scan of many real-world RIAs is still not possible. We present a novel approach to detect redundant content for security testing purposes. The algorithm applies locality-sensitive hashing using MinHash sketches to analyze the Document Object Model (DOM) structure of web pages and to efficiently estimate the similarity between them. Our experimental results show that this approach allows a successful scan of RIAs that cannot be crawled otherwise.
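
    The MinHash idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the choice of shingles (k-grams of opening tag names extracted with a regular expression), the hash construction, and the helper names `dom_shingles`, `minhash_sketch`, and `estimate_jaccard` are all assumptions made for illustration.

```python
import hashlib
import re


def dom_shingles(html: str, k: int = 3) -> set:
    """Return the set of k-grams over the sequence of opening tag names.

    Assumption: tag names alone are a rough proxy for DOM structure.
    """
    tags = [t.lower() for t in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)]
    if len(tags) < k:
        return {tuple(tags)}
    return {tuple(tags[i:i + k]) for i in range(len(tags) - k + 1)}


def _hash(seed: int, shingle: tuple) -> int:
    """Deterministic 64-bit hash of a shingle, salted with a seed."""
    data = f"{seed}:{'/'.join(shingle)}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(shingles: set, num_hashes: int = 128) -> list:
    """For each seeded hash function, keep the minimum hash over all shingles."""
    return [min(_hash(seed, s) for s in shingles) for seed in range(num_hashes)]


def estimate_jaccard(sketch_a: list, sketch_b: list) -> float:
    """The fraction of matching sketch positions estimates Jaccard similarity."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


# Two pages with the same tag structure but different text, and one unrelated page.
page_a = "<html><body><div><ul><li>a</li><li>b</li><li>c</li></ul></div></body></html>"
page_b = "<html><body><div><ul><li>x</li><li>y</li><li>z</li></ul></div></body></html>"
page_c = "<html><body><table><tr><td>x</td></tr></table><p>y</p></body></html>"

sk_a, sk_b, sk_c = (minhash_sketch(dom_shingles(p)) for p in (page_a, page_b, page_c))
sim_ab = estimate_jaccard(sk_a, sk_b)
sim_ac = estimate_jaccard(sk_a, sk_c)
```

    Under these assumptions, a crawler could treat two client-side states as equivalent when the estimated similarity exceeds a chosen threshold, comparing fixed-size sketches instead of full DOM trees.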

    An Efficient Approach for Finding Near Duplicate Web pages using Minimum Weight Overlapping Method

    The existence of billions of web documents has severely affected the performance and reliability of web search. Near-duplicate web pages play an important role in this performance degradation when data from heterogeneous sources are integrated. Web mining faces huge problems due to the existence of such documents. These pages increase the index storage space and thereby increase the serving cost. Introducing efficient methods to detect and remove such documents from the Web not only decreases the computation time but also increases the relevancy of search results. We present a novel idea for finding near-duplicate web pages that can be applied to plagiarism detection, spam detection, and focused web crawling. We propose an efficient method for finding near duplicates of an input web page in a huge repository. The proposed TDW-matrix-based algorithm has three phases: rendering, filtering, and verification. The first phase receives an input web page and a threshold; the second applies prefix filtering and positional filtering to reduce the size of the record set; and the verification phase returns an optimal set of near-duplicate web pages using the Minimum Weight Overlapping (MWO) method. The experimental results show that our algorithm performs better in terms of two benchmark measures, precision and recall, and reduces the size of the competing record set. DOI: http://dx.doi.org/10.11591/ijece.v1i2.7
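
    The filter-then-verify pattern described in this abstract can be sketched as follows. This is a generic prefix filter for unweighted Jaccard similarity over token sets, not the paper's TDW matrix or MWO method, and it omits positional filtering. The function names, the rare-token-first ordering, and the toy repository are assumptions for illustration.

```python
import math
from collections import Counter


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard similarity, used in the verification step."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def prefix(tokens: set, df: Counter, threshold: float) -> set:
    """Return the prefix of a record under a global rare-token-first order.

    If two records have Jaccard similarity >= threshold, their prefixes of
    length |x| - ceil(threshold * |x|) + 1 must share at least one token.
    """
    ordered = sorted(tokens, key=lambda t: (df[t], t))
    # The small epsilon guards against floating-point overshoot in ceil().
    plen = len(ordered) - math.ceil(threshold * len(ordered) - 1e-9) + 1
    return set(ordered[:plen])


def near_duplicates(query: set, repository: dict, threshold: float) -> list:
    """Filter candidates by prefix overlap, then verify with exact Jaccard."""
    # Document frequency defines the global token order (rare tokens first).
    df = Counter(t for doc in repository.values() for t in doc)
    query_prefix = prefix(query, df, threshold)
    results = []
    for doc_id, doc in repository.items():
        if query_prefix & prefix(doc, df, threshold):  # filtering phase
            sim = jaccard(query, doc)                  # verification phase
            if sim >= threshold:
                results.append((doc_id, sim))
    return sorted(results, key=lambda r: (-r[1], r[0]))


# Hypothetical toy repository of tokenized pages.
repository = {
    "a": {"web", "page", "near", "duplicate", "detection"},
    "b": {"web", "page", "near", "duplicate", "detect"},
    "c": {"monte", "carlo", "pagerank"},
}
query = {"web", "page", "near", "duplicate", "detection"}
matches = near_duplicates(query, repository, threshold=0.5)
```

    In this toy run, record "c" shares no prefix token with the query and is discarded before the exact similarity is computed, which is how the filtering phase shrinks the candidate set.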

    Do we really need to catch them all? A new User-guided Social Media Crawling method

    With the growing use of popular social media services like Facebook and Twitter, it is challenging to collect all content from the networks without access to the core infrastructure or paying for it. Thus, if all content cannot be collected, one must consider which data are of most importance. In this work we present a novel User-guided Social Media Crawling method (USMC) that is able to collect data from social media, utilizing the wisdom of the crowd to decide the order in which user-generated content should be collected to cover as many user interactions as possible. USMC is validated by crawling 160 public Facebook pages, containing content from 368 million users including 1.3 billion interactions, and it is compared with two other crawling methods. The results show that it is possible to cover approximately 75% of the interactions on a Facebook page by sampling just 20% of its posts, and at the same time reduce the crawling time by 53%. In addition, the social network constructed from the 20% sample contains more than 75% of the users and edges compared to the social network created from all posts, and it has a similar degree distribution.

    Monte Carlo methods of PageRank computation

    We describe and analyze an on-line Monte Carlo method of PageRank computation. PageRank is estimated from the results of a large number of short independent simulation runs initiated from each page that contains outgoing hyperlinks. The method does not require any storage of the hyperlink matrix and is highly parallelizable. We study confidence intervals and identify drawbacks of the absolute error criterion and the relative error criterion. Further, we suggest a so-called weighted relative error criterion, which ensures good accuracy in a relatively small number of simulation runs. Moreover, with the weighted relative error measure, the complexity of the algorithm does not depend on the web structure.
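
    A minimal sketch of estimating PageRank by simulation is given below. It uses a simple end-point variant: each walk follows a random outgoing link with probability `damping` and otherwise stops, and the score of a page is the fraction of walks that end there. This is not necessarily the exact variant analyzed in the paper; in particular, walks here start from every page, pages without outgoing links jump to a uniformly random page, and the graph, parameter names, and seed are assumptions for illustration.

```python
import random


def monte_carlo_pagerank(links: dict, runs_per_page: int = 1000,
                         damping: float = 0.85, seed: int = 0) -> dict:
    """Estimate PageRank from the end points of short random walks.

    `links` maps each page to a list of pages it links to; every link
    target must also appear as a key.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    pages = sorted(links)
    end_counts = {p: 0 for p in pages}
    for start in pages:
        for _ in range(runs_per_page):
            current = start
            # Continue with probability `damping`, stop otherwise.
            while rng.random() < damping:
                out = links[current]
                # Assumption: a page without outlinks jumps uniformly at random.
                current = rng.choice(out) if out else rng.choice(pages)
            end_counts[current] += 1
    total = runs_per_page * len(pages)
    return {p: end_counts[p] / total for p in pages}


# Hypothetical four-page graph: page 2 receives links from pages 0, 1, and 3.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
ranks = monte_carlo_pagerank(graph, runs_per_page=5000)
```

    Because every walk is independent, the loop over start pages could be split across workers and the end-point counts summed afterwards, which matches the parallelizability noted in the abstract.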

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims to provide a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool for data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media, and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential for cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed for one domain in other domains. Comment: Knowledge-based System