Automated Discovery of Internet Censorship by Web Crawling
Censorship of the Internet is widespread around the world. As access to the
web becomes increasingly ubiquitous, filtering of this resource becomes more
pervasive. Transparency about the specific content that citizens are denied
access to is atypical. To counter this, numerous techniques for maintaining
URL filter lists have been proposed by individuals and organisations that aim
to provide empirical data on censorship for the benefit of the public and the
wider censorship research community.
We present a new approach for discovering filtered domains in different
countries. This method is fully automated and requires no human interaction.
The system uses web crawling techniques to traverse between filtered sites and
implements a robust method for determining if a domain is filtered. We
demonstrate the effectiveness of the approach by running experiments to search
for filtered content in four different censorship regimes. Our results show
that we perform better than the current state of the art and have built domain
filter lists an order of magnitude larger than the most widely available public
lists as of Jan 2018. Further, we build a dataset mapping the interlinking
nature of blocked content between domains and exhibit the tightly networked
nature of censored web resources.
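The traversal the abstract describes can be pictured as a breadth-first crawl that only expands through domains confirmed as filtered. The sketch below is illustrative, not the paper's implementation: `get_links` and `is_filtered` are hypothetical stand-ins for the paper's link-extraction and filter-detection components.

```python
from collections import deque

def crawl_filtered_domains(seeds, get_links, is_filtered, max_domains=1000):
    """Breadth-first traversal between filtered sites (illustrative sketch).

    seeds:       known-blocked domains to start from
    get_links:   callable returning outbound domains for a given domain
    is_filtered: callable deciding whether a domain is blocked
    """
    found = {d for d in seeds if is_filtered(d)}
    queue = deque(found)
    while queue and len(found) < max_domains:
        domain = queue.popleft()
        for linked in get_links(domain):
            # Only filtered domains are added and expanded, so the crawl
            # moves from blocked site to blocked site.
            if linked not in found and is_filtered(linked):
                found.add(linked)
                queue.append(linked)
    return found
```

Because only filtered domains are enqueued, the crawl exploits the tightly networked nature of censored resources that the abstract reports, rather than exploring the open web at large.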
Updating collection representations for federated search
To facilitate the search for relevant information across a set of online distributed collections, a federated information retrieval system typically represents each collection, centrally, by a set of vocabularies or sampled documents. Accurate retrieval is therefore related to how precisely each representation reflects the underlying content stored in that collection. As collections evolve over time, collection representations should also be updated to reflect any change; however, no current solution for doing so has been proposed. In this study we examine the implications of out-of-date representation sets on retrieval accuracy, and propose three different policies for managing the necessary updates. Each policy is evaluated on a testbed of forty-four dynamic collections over an eight-week period. Our findings show that out-of-date representations significantly degrade performance over time; however, adopting a suitable update policy can minimise this problem.
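One simple way an update policy like those evaluated above could work is to compare a small fresh probe sample of a collection against its stored vocabulary and refresh the representation only when they diverge. The Jaccard measure and threshold below are assumptions for illustration, not the paper's actual policies.

```python
def vocabulary_overlap(old_terms, new_terms):
    """Jaccard similarity between two vocabularies (term sets)."""
    if not old_terms and not new_terms:
        return 1.0
    return len(old_terms & new_terms) / len(old_terms | new_terms)

def needs_update(representation, probe_sample, threshold=0.6):
    """Flag a collection representation as stale when a fresh probe
    sample diverges too far from the stored vocabulary.
    (Illustrative policy; threshold is an assumed parameter.)"""
    return vocabulary_overlap(representation, probe_sample) < threshold
```

A federated broker could run such a check periodically per collection, re-sampling only the flagged ones, which trades a small probing cost against full re-acquisition of every representation.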
A Bandwidth-Conserving Architecture for Crawling Virtual Worlds
A virtual world is a computer-based simulated environment intended for its users to inhabit via avatars. Content in virtual worlds such as Second Life or OpenSimulator is increasingly presented using three-dimensional (3D) dynamic presentation technologies that challenge traditional search technologies. As 3D environments become both more prevalent and more fragmented, the need for a data crawler and distributed search service will continue to grow. By increasing the visibility of content across virtual world servers, so that 3D data can be better collected and integrated, we can also improve crawling and searching efficiency and accuracy: the crawler avoids re-crawling unchanged regions and downloading unmodified objects that already exist in our collection. This saves bandwidth and Internet traffic during content collection and indexing and, for a fixed amount of bandwidth, maximizes the freshness of the collection. This work presents a new services paradigm for virtual world crawler interaction that is co-operative and exploits information about 3D objects in the virtual world. Our approach analyzes redundant information crawled from virtual worlds in order to decrease the amount of data collected by crawlers, keep search engine collections up to date, and provide an efficient mechanism for collecting and searching information from multiple virtual worlds. Experimental results with data crawled from Second Life servers demonstrate that our approach saves crawling bandwidth, and discovers more hidden objects and new regions to crawl, facilitating the search service in virtual worlds.
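The idea of skipping unmodified objects can be sketched with per-object fingerprints: the crawler downloads an object only when the fingerprint of its server-side metadata differs from the copy already indexed. This is a minimal sketch under the assumption that the server can expose cheap change metadata (e.g. a version or last-modified value); it is not the paper's co-operative protocol.

```python
import hashlib

def object_fingerprint(obj_id, metadata):
    """Cheap fingerprint from object metadata the server exposes
    (assumed here to be a version/last-modified string)."""
    return hashlib.sha256(f"{obj_id}:{metadata}".encode()).hexdigest()

def plan_downloads(server_listing, local_index):
    """Return only the objects whose fingerprint differs from our
    indexed copy, so unchanged objects cost no download bandwidth.

    server_listing: {object_id: metadata} advertised by the region server
    local_index:    {object_id: fingerprint} of our current collection
    """
    to_fetch = []
    for obj_id, metadata in server_listing.items():
        fp = object_fingerprint(obj_id, metadata)
        if local_index.get(obj_id) != fp:
            to_fetch.append(obj_id)
    return to_fetch
```

For a fixed bandwidth budget, the requests saved on unchanged objects can be spent crawling new regions instead, which is the freshness trade-off the abstract describes.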
Spam detection in collaborative tagging
The proposed algorithm identifies spammers and demotes their ranks, shielding users from their malicious intent and surfacing popular, relevant resources in a collaborative tagging system, as well as in other online forums with discussion or feedback, such as Quora or Amazon reviews. It builds on an existing algorithm but extends it with additional dimensions. We assume that the virtuosity of a user with respect to a resource or document depends on two factors. First, an expert should have rich content resources in their repertoire and the dexterity to find good resources, where the richness of a resource is in turn determined by the virtuosity of the users who tagged it. Second, an expert should be among the first to identify intriguing or riveting documents.
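The circular definition above (expert users tag rich resources; rich resources are those tagged by experts) can be resolved by mutual reinforcement, iterated to a fixed point in the style of HITS. The sketch below illustrates only that first factor; the temporal "first to tag" factor is omitted, and none of this is the paper's exact algorithm.

```python
def expertise_scores(tags, iterations=20):
    """Mutual-reinforcement scoring over tagging events (illustrative).

    tags: list of (user, resource) tagging events.
    Returns (user_score, resource_score) dicts, each normalised to [0, 1].
    """
    users = {u for u, _ in tags}
    resources = {r for _, r in tags}
    user_score = {u: 1.0 for u in users}
    res_score = {r: 1.0 for r in resources}
    for _ in range(iterations):
        # A resource's quality is the summed expertise of its taggers.
        res_score = {r: sum(user_score[u] for u, r2 in tags if r2 == r)
                     for r in resources}
        # A user's expertise is the summed quality of resources they tag.
        user_score = {u: sum(res_score[r] for u2, r in tags if u2 == u)
                      for u in users}
        # Normalise so scores stay bounded across iterations.
        um, rm = max(user_score.values()), max(res_score.values())
        user_score = {u: s / um for u, s in user_score.items()}
        res_score = {r: s / rm for r, s in res_score.items()}
    return user_score, res_score
```

Spammers who tag only their own low-quality resources receive little reinforcement from genuine experts, so their scores, and hence their ranks, are demoted.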
A novel defense mechanism against web crawler intrusion
Web robots, also known as crawlers or spiders, are used by search engines, hackers and spammers to gather information about web pages. Timely detection and prevention of unwanted crawlers increases the privacy and security of websites. In this research, a novel method to identify web crawlers is proposed to prevent unwanted crawlers from accessing websites. The proposed method uses a five-factor identification process to detect unwanted crawlers. This study provides pretest and posttest results, along with a systematic evaluation of web pages with the proposed identification technique versus web pages without it. An experiment was performed with repeated measures for two groups, each containing ninety web pages. The outputs of the logistic regression analysis of the treatment and control groups confirm the novel five-factor identification process as an effective mechanism for preventing unwanted web crawlers. This study concludes that the proposed five-factor identification process is a very effective technique, as demonstrated by a successful outcome.
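A multi-factor identification process of this kind can be sketched as a score over several behavioural signals with a decision threshold. The abstract does not name the paper's five factors, so the five used below (request rate, robots.txt access, honeypot hits, asset loading, user-agent string) are purely illustrative assumptions.

```python
def crawler_score(session):
    """Count how many of five illustrative bot signals a session trips.
    NOTE: these factors are assumed for illustration; the paper's actual
    five factors are not listed in the abstract.
    session: dict of observed per-visitor behaviour."""
    factors = [
        session.get("requests_per_minute", 0) > 60,     # inhuman request rate
        session.get("fetched_robots_txt", False),       # typical crawler habit
        session.get("hit_hidden_link", False),          # honeypot link humans never see
        not session.get("loaded_assets", True),         # skips images/CSS like a bot
        "bot" in session.get("user_agent", "").lower(), # self-identifying agent string
    ]
    return sum(factors)

def is_unwanted_crawler(session, threshold=3):
    """Classify the session; the threshold is an assumed tuning parameter."""
    return crawler_score(session) >= threshold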
- …