4 research outputs found
Automated Discovery of Internet Censorship by Web Crawling
Censorship of the Internet is widespread around the world. As access to the
web becomes increasingly ubiquitous, filtering of this resource becomes more
pervasive. Transparency about specific content that citizens are denied access
to is atypical. To counter this, numerous techniques for maintaining URL filter
lists have been proposed by various individuals and organisations that aim to
empirical data on censorship for benefit of the public and wider censorship
research community.
We present a new approach for discovering filtered domains in different
countries. This method is fully automated and requires no human interaction.
The system uses web crawling techniques to traverse between filtered sites and
implements a robust method for determining if a domain is filtered. We
demonstrate the effectiveness of the approach by running experiments to search
for filtered content in four different censorship regimes. Our results show
that we perform better than the current state of the art and have built domain
filter lists an order of magnitude larger than the most widely available public
lists as of Jan 2018. Further, we build a dataset mapping the interlinking
nature of blocked content between domains and exhibit the tightly networked
nature of censored web resources
FilteredWeb: A Framework for the Automated Search-Based Discovery of Blocked URLs
Various methods have been proposed for creating and maintaining lists of
potentially filtered URLs to allow for measurement of ongoing internet
censorship around the world. Whilst testing a known resource for evidence of
filtering can be relatively simple, given appropriate vantage points,
discovering previously unknown filtered web resources remains an open
challenge.
We present a new framework for automating the process of discovering filtered
resources through the use of adaptive queries to well-known search engines. Our
system applies information retrieval algorithms to isolate characteristic
linguistic patterns in known filtered web pages; these are then used as the
basis for web search queries. The results of these queries are then checked for
evidence of filtering, and newly discovered filtered resources are fed back
into the system to detect further filtered content.
Our implementation of this framework, applied to China as a case study, shows
that this approach is demonstrably effective at detecting significant numbers
of previously unknown filtered web pages, making a significant contribution to
the ongoing detection of internet filtering as it develops.
Our tool is currently deployed and has been used to discover 1355 domains
that are poisoned within China as of Feb 2017 - 30 times more than are
contained in the most widely-used public filter list. Of these, 759 are outside
of the Alexa Top 1000 domains list, demonstrating the capability of this
framework to find more obscure filtered content. Further, our initial analysis
of filtered URLs, and the search terms that were used to discover them, gives
further insight into the nature of the content currently being blocked in
China.Comment: To appear in "Network Traffic Measurement and Analysis Conference
2017" (TMA2017