
    Automated Discovery of Internet Censorship by Web Crawling

    Censorship of the Internet is widespread around the world. As access to the web becomes increasingly ubiquitous, filtering of this resource becomes more pervasive. Transparency about the specific content citizens are denied access to is atypical. To counter this, numerous techniques for maintaining URL filter lists have been proposed by individuals and organisations that aim to provide empirical data on censorship for the benefit of the public and the wider censorship research community. We present a new approach for discovering filtered domains in different countries. The method is fully automated and requires no human interaction. The system uses web crawling techniques to traverse between filtered sites and implements a robust method for determining whether a domain is filtered. We demonstrate the effectiveness of the approach by running experiments to search for filtered content in four different censorship regimes. Our results show that we perform better than the current state of the art and have built domain filter lists an order of magnitude larger than the most widely available public lists as of January 2018. Further, we build a dataset mapping the interlinking of blocked content between domains and exhibit the tightly networked nature of censored web resources.
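
    As a purely illustrative sketch of the crawl-and-test loop this abstract describes, the following Python code traverses links between sites and flags domains that appear filtered. The block-page markers, the in-country proxy, and the is_filtered heuristic are assumptions for illustration, not the authors' implementation.

    from collections import deque
    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    # Assumed block-page signatures; a real deployment would need per-country signatures.
    BLOCK_MARKERS = ("access denied", "this site is blocked")

    def is_filtered(domain, proxy_in_country):
        """Heuristic test: compare a fetch routed through the censored network
        with an uncensored control fetch."""
        url = f"http://{domain}/"
        try:
            censored = requests.get(url, proxies=proxy_in_country, timeout=10)
            control = requests.get(url, timeout=10)
        except requests.RequestException:
            return False  # connection errors are ambiguous; skip them here
        if censored.status_code in (403, 451) and control.ok:
            return True
        return any(marker in censored.text.lower() for marker in BLOCK_MARKERS)

    def extract_domains(html, base_url):
        """Collect the outbound domains linked from a page."""
        soup = BeautifulSoup(html, "html.parser")
        return {urlparse(urljoin(base_url, a["href"])).netloc
                for a in soup.find_all("a", href=True)}

    def discover_filtered(seed_domains, proxy_in_country, limit=1000):
        """Breadth-first traversal between filtered sites."""
        queue, seen, filtered = deque(seed_domains), set(seed_domains), set()
        while queue and len(seen) < limit:
            domain = queue.popleft()
            if not is_filtered(domain, proxy_in_country):
                continue
            filtered.add(domain)
            try:
                control = requests.get(f"http://{domain}/", timeout=10)
            except requests.RequestException:
                continue
            for linked in extract_domains(control.text, f"http://{domain}/"):
                if linked and linked not in seen:
                    seen.add(linked)
                    queue.append(linked)
        return filtered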

    Updating collection representations for federated search

    To facilitate the search for relevant information across a set of online distributed collections, a federated information retrieval system typically represents each collection, centrally, by a set of vocabularies or sampled documents. Accurate retrieval therefore depends on how precisely each representation reflects the underlying content stored in that collection. As collections evolve over time, collection representations should also be updated to reflect any change; however, no solution for this has yet been proposed. In this study we examine the implications of out-of-date representation sets on retrieval accuracy and propose three different policies for managing the necessary updates. Each policy is evaluated on a testbed of forty-four dynamic collections over an eight-week period. Our findings show that out-of-date representations significantly degrade performance over time, but that adopting a suitable update policy can minimise this problem.
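
    The abstract does not spell out the three update policies, so the Python sketch below shows just one plausible age-based policy: re-sample a collection's representation once the previous sample is older than a fixed window. The sampling function, window length, and record layout are assumptions for illustration.

    import random
    import time
    from dataclasses import dataclass, field

    STALENESS_WINDOW = 7 * 24 * 3600  # assumed policy parameter: re-sample weekly

    @dataclass
    class CollectionRepresentation:
        name: str
        sampled_docs: list = field(default_factory=list)
        last_sampled: float = 0.0

    def resample(collection_docs, sample_size=300):
        """Draw a fresh document sample to stand in for the collection."""
        return random.sample(collection_docs, min(sample_size, len(collection_docs)))

    def update_if_stale(rep, live_docs):
        """Age-based policy: refresh the representation only when it is stale."""
        if time.time() - rep.last_sampled < STALENESS_WINDOW:
            return False
        rep.sampled_docs = resample(live_docs)
        rep.last_sampled = time.time()
        return True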

    A Bandwidth-Conserving Architecture for Crawling Virtual Worlds

    A virtual world is a computer-based simulated environment intended for its users to inhabit via avatars. Content in virtual worlds such as Second Life or OpenSimulator is increasingly presented using three-dimensional (3D) dynamic presentation technologies that challenge traditional search technologies. As 3D environments become both more prevalent and more fragmented, the need for a data crawler and distributed search service will continue to grow. By increasing the visibility of content across virtual world servers to better collect and integrate 3D data, we can also improve crawling and searching efficiency and accuracy by avoiding re-crawling unchanged regions or downloading unmodified objects that already exist in our collection. This saves bandwidth and Internet traffic during content collection and indexing and, for a fixed amount of bandwidth, maximizes the freshness of the collection. This work presents a new services paradigm for virtual world crawler interaction that is co-operative and exploits information about 3D objects in the virtual world. Our approach analyzes redundant information crawled from virtual worlds in order to decrease the amount of data collected by crawlers, keep search engine collections up to date, and provide an efficient mechanism for collecting and searching information from multiple virtual worlds. Experimental results with data crawled from Second Life servers demonstrate that our approach saves crawling bandwidth and discovers more hidden objects and new regions to crawl, facilitating the search service in virtual worlds.
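
    One way to picture the change-aware, co-operative crawl described above is the small Python sketch below: regions are skipped when the server reports an unchanged version, and objects are skipped when their advertised content hash already exists in the collection. The region and object identifiers are illustrative assumptions, not the Second Life or OpenSimulator APIs.

    class ChangeAwareCrawler:
        def __init__(self):
            self.region_versions = {}   # region id -> last version seen
            self.object_hashes = set()  # content hashes already in the collection

        def region_changed(self, region_id, version):
            """Re-crawl a region only if the server reports a newer version."""
            if self.region_versions.get(region_id) == version:
                return False
            self.region_versions[region_id] = version
            return True

        def should_download(self, advertised_hash):
            """Skip objects whose advertised content hash is already stored,
            so unmodified objects are never re-downloaded."""
            if advertised_hash in self.object_hashes:
                return False
            self.object_hashes.add(advertised_hash)
            return True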

    Spam detection in collaborative tagging

    The proposed algorithm identifies spammers and demotes their ranks, shielding users from their malicious intent and surfacing popular and relevant resources in a collaborative tagging system, in online dating sites, or in any other discussion-based forum such as Quora or Amazon feedback. It follows the lines of an existing algorithm but extends it with additional dimensions. We assume that a user's expertise with respect to a resource or document depends on two factors. First, an expert should have rich content resources in their repertoire and the dexterity to find good resources, where the richness of a resource is in turn determined by the expertise of the users who tagged it. Second, an expert should be among the first to identify interesting documents.
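
    A toy Python sketch of the two factors just described, framed as mutual reinforcement: a user's expertise grows with the quality of the resources they tag and with how early they tag them, while a resource's quality grows with the expertise of its taggers. The scoring and normalisation scheme is an illustrative assumption, not the paper's exact algorithm.

    from collections import defaultdict

    def score_experts(taggings, iterations=20):
        """taggings: list of (user, resource, tag_order) tuples, where tag_order
        is the user's position among the resource's taggers (0 = first)."""
        expertise = {user: 1.0 for user, _, _ in taggings}
        quality = {resource: 1.0 for _, resource, _ in taggings}

        for _ in range(iterations):
            # Resource quality: sum of the expertise of the users who tagged it.
            new_quality = defaultdict(float)
            for user, resource, _ in taggings:
                new_quality[resource] += expertise[user]
            # User expertise: quality of tagged resources, boosted for early taggers.
            new_expertise = defaultdict(float)
            for user, resource, order in taggings:
                new_expertise[user] += (1.0 / (1 + order)) * new_quality[resource]
            # Normalise so scores remain comparable between iterations; spammers
            # tagging low-quality resources sink toward the bottom of the ranking.
            q_norm = sum(new_quality.values()) or 1.0
            e_norm = sum(new_expertise.values()) or 1.0
            quality = {r: v / q_norm for r, v in new_quality.items()}
            expertise = {u: v / e_norm for u, v in new_expertise.items()}
        return expertise, quality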

    A novel defense mechanism against web crawler intrusion

    Web robots, also known as crawlers or spiders, are used by search engines, hackers, and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this research, a novel method to identify web crawlers is proposed to prevent unwanted crawlers from accessing websites. The proposed method suggests a five-factor identification process to detect unwanted crawlers. This study provides pretest and posttest results along with a systematic evaluation of web pages with the proposed identification technique versus web pages without it. An experiment was performed with repeated measures for two groups, each containing ninety web pages. The outputs of the logistic regression analysis of the treatment and control groups confirm the novel five-factor identification process as an effective mechanism for preventing unwanted web crawlers. This study concludes that the proposed five-factor identification process is a very effective technique, as demonstrated by a successful outcome.
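
    The abstract does not list the five factors, so the Python sketch below uses five hypothetical request-level signals purely as placeholders, scored with logistic regression since that is the analysis the study reports. The feature names and decision threshold are assumptions for illustration.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical stand-ins for the five identification factors.
    FEATURES = [
        "robots_txt_requested",   # did the client fetch /robots.txt?
        "missing_referrer_rate",  # share of requests without a Referer header
        "request_rate",           # requests per second from the client
        "head_request_rate",      # share of HEAD requests
        "unknown_user_agent",     # user agent absent from a known-browser list
    ]

    def train_crawler_detector(X, y):
        """Fit a logistic regression over the five assumed factors.
        X: (n_samples, 5) feature matrix; y: 1 = unwanted crawler, 0 = human."""
        model = LogisticRegression()
        model.fit(X, y)
        return model

    def is_unwanted_crawler(model, features):
        """Flag a session when the predicted crawler probability exceeds 0.5."""
        probability = model.predict_proba(np.array([features]))[0, 1]
        return probability > 0.5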