FilteredWeb: A Framework for the Automated Search-Based Discovery of Blocked URLs
Various methods have been proposed for creating and maintaining lists of
potentially filtered URLs to allow for measurement of ongoing internet
censorship around the world. Whilst testing a known resource for evidence of
filtering can be relatively simple, given appropriate vantage points,
discovering previously unknown filtered web resources remains an open
challenge.
We present a new framework for automating the process of discovering filtered
resources through the use of adaptive queries to well-known search engines. Our
system applies information retrieval algorithms to isolate characteristic
linguistic patterns in known filtered web pages; these are then used as the
basis for web search queries. The results of these queries are then checked for
evidence of filtering, and newly discovered filtered resources are fed back
into the system to detect further filtered content.
Our implementation of this framework, applied to China as a case study, shows
that this approach is demonstrably effective at detecting significant numbers
of previously unknown filtered web pages, making a significant contribution to
the ongoing detection of internet filtering as it develops.
Our tool is currently deployed and has been used to discover 1355 domains
that are poisoned within China as of February 2017, 30 times more than are
contained in the most widely used public filter list. Of these, 759 are
outside the Alexa Top 1000 domains list, demonstrating the capability of this
framework to find more obscure filtered content. Our initial analysis of the
filtered URLs, and of the search terms used to discover them, gives further
insight into the nature of the content currently being blocked in China.
Comment: To appear in "Network Traffic Measurement and Analysis Conference
2017" (TMA2017).
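The discovery loop the abstract describes (extract characteristic terms from known blocked pages, query a search engine, probe each result for filtering, feed newly confirmed blocked pages back in) can be sketched in miniature. This is a toy sketch, not the paper's implementation: `SEARCH_INDEX`, `FILTERED`, `PAGE_TEXT`, and the simple frequency-ratio term scoring are all hypothetical stand-ins for a live search-engine API, an in-country censorship probe, and the paper's information-retrieval scoring.

```python
from collections import Counter

# Hypothetical stand-ins: a tiny "search engine" index, ground truth for
# which URLs are blocked, and the text of each page once fetched.
SEARCH_INDEX = {"protest": ["u2"], "rights": ["u3"], "banned": ["u4"]}
FILTERED = {"u1", "u2", "u4"}
PAGE_TEXT = {"u2": "protest banned speech", "u4": "banned rights talk"}

def characteristic_terms(blocked_docs, background_docs, k=3):
    """Rank terms that occur disproportionately in blocked pages; a crude
    frequency-ratio stand-in for the paper's IR-based scoring."""
    blocked = Counter(w for d in blocked_docs for w in d.lower().split())
    background = Counter(w for d in background_docs for w in d.lower().split())
    score = {w: c / (1 + background[w]) for w, c in blocked.items()}
    return [w for w, _ in sorted(score.items(), key=lambda x: -x[1])[:k]]

def discover(seed_pages, background_docs, rounds=2):
    """Feedback loop: terms from known blocked pages drive search queries,
    and each newly confirmed blocked page feeds the next round."""
    known = dict(seed_pages)  # url -> page text of known filtered pages
    for _ in range(rounds):
        terms = characteristic_terms(list(known.values()), background_docs)
        for term in terms:
            for url in SEARCH_INDEX.get(term, []):
                if url not in known and url in FILTERED:
                    known[url] = PAGE_TEXT.get(url, "")
    return set(known)
```

With the toy data above, seeding the loop with one known blocked page surfaces `u2` in the first round, whose text contributes the term `banned`, which in turn surfaces `u4` in the second round, illustrating the feedback step.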
Web Content Extraction - a Meta-Analysis of its Past and Thoughts on its Future
In this paper, we present a meta-analysis of several Web content extraction
algorithms, and make recommendations for the future of content extraction on
the Web. First, we find that nearly all Web content extractors do not consider
a very large, and growing, portion of modern Web pages. Second, it is well
understood that wrapper induction extractors tend to break as the Web changes;
heuristic/feature engineering extractors were thought to be immune to a Web
site's evolution, but we find that this is not the case: heuristic content
extractor performance also tends to degrade over time due to the evolution of
Web site forms and practices. We conclude with recommendations for future work
that address these and other findings.
Comment: Accepted for publication in SIGKDD Explorations.
Mapping Big Data into Knowledge Space with Cognitive Cyber-Infrastructure
Big data research has attracted great attention in science, technology,
industry and society. It is developing with the evolving scientific paradigm,
the fourth industrial revolution, and the transformational innovation of
technologies. However, its nature and fundamental challenges have not been
fully recognized, and a methodology of its own has not yet been formed. This
paper explores
and answers the following questions: What is big data? What are the basic
methods for representing, managing and analyzing big data? What is the
relationship between big data and knowledge? Can we find a mapping from big
data into knowledge space? What kind of infrastructure is required to support
not only big data management and analysis but also knowledge discovery, sharing
and management? What is the relationship between big data and the scientific
paradigm?
What is the nature and fundamental challenge of big data computing? A
multi-dimensional perspective is presented toward a methodology of big data
computing.
Comment: 59 pages.
Hierarchical clustering-based navigation of image search results
Image search results usually span multiple topics at the semantic level, and
even semantically consistent images have diverse appearances at the visual
level. Organizing the results into semantically and visually consistent
clusters is therefore necessary to facilitate users' navigation. To address
this, this paper presents HiCluster, an effective method for organizing image
search results that employs both textual and visual analysis. First,
query-related key phrases are extracted to enumerate the specific semantics of
the given query, and these phrases are grouped into semantic clusters using a
K-lines-based clustering algorithm. Second, the images corresponding to each
key phrase are clustered with the Bregman Bubble Clustering (BBC) algorithm,
which partially groups images in the whole set while discarding scattered,
noisy ones. Finally, a novel user interface (UI) presents users with diverse
and helpful information based on the hierarchical clustering structure.
Experiments on web images demonstrate the effectiveness and potential of the
system.
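The second-stage step can be illustrated with a simplified stand-in for Bregman Bubble Clustering: each point is assigned to its nearest center, but points far from every center are discarded as noise rather than forced into a cluster. The centers, radius, and 1-D "feature" values below are hypothetical; real BBC selects dense "bubbles" by optimizing a Bregman-divergence objective rather than using a fixed radius.

```python
def bubble_cluster(points, centers, radius):
    """Assign each point to its nearest center, discarding points farther
    than `radius` from every center as scattered noise (a toy stand-in for
    Bregman Bubble Clustering's partial grouping behavior)."""
    clusters = {c: [] for c in centers}
    noise = []
    for p in points:
        nearest = min(centers, key=lambda c: abs(p - c))
        if abs(p - nearest) <= radius:
            clusters[nearest].append(p)
        else:
            noise.append(p)
    return clusters, noise

# 1-D stand-ins for image feature vectors under a single key phrase:
# two tight groups plus one scattered outlier.
clusters, noise = bubble_cluster(
    [1.0, 1.1, 5.0, 5.2, 9.9], centers=[1.0, 5.0], radius=0.5)
```

The key property mirrored here is that the outlier at 9.9 ends up in `noise` instead of being absorbed into the nearest cluster, which is what lets the second stage discard noisy images.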
A Broad Evaluation of the Tor English Content Ecosystem
Tor is among the most well-known dark nets in the world. It has noble uses,
including as a platform for free speech and information dissemination under the
guise of true anonymity, but may be culturally better known as a conduit for
criminal activity and as a platform to market illicit goods and data. Past
studies on the content of Tor support this notion, but were carried out by
targeting popular domains likely to contain illicit content. A survey of past
studies may thus not yield a complete evaluation of the content and use of Tor.
This work addresses this gap by presenting a broad evaluation of the content of
the English Tor ecosystem. We perform a comprehensive crawl of the Tor dark web
and, through topic and network analysis, characterize the types of information
and services hosted across a broad swath of Tor domains and their hyperlink
relational structure. We recover nine domain types defined by the information
or service they host and, among other findings, unveil how some types of
domains intentionally silo themselves from the rest of Tor. We also present
measurements that (regrettably) suggest how marketplaces of illegal drugs and
services do emerge as the dominant type of Tor domain. Our study is the product
of crawling over 1 million pages from 20,000 Tor seed addresses, yielding a
collection of over 150,000 Tor pages. We intend to make the domain structure
publicly available as a dataset at
https://github.com/wsu-wacs/TorEnglishContent.
Comment: 11 pages.
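The crawl-then-analyze pipeline can be sketched in miniature: a breadth-first crawl over a hyperlink graph, followed by a check for domain types whose pages never link outside their own type, the "siloing" behavior the abstract mentions. The link graph, the page-type labels, and the silo criterion below are illustrative assumptions, not the paper's measured data or exact method.

```python
from collections import deque

def crawl(seeds, links, limit=1_000_000):
    """Breadth-first crawl over a hyperlink graph (url -> outgoing urls)."""
    seen, frontier = set(seeds), deque(seeds)
    while frontier and len(seen) < limit:
        for nxt in links.get(frontier.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def siloed_types(page_type, links):
    """Return domain types whose pages only ever link to their own type."""
    out = {t: set() for t in page_type.values()}
    for src, dsts in links.items():
        out[page_type[src]].update(page_type[d] for d in dsts)
    return {t for t, targets in out.items() if targets <= {t}}

# Toy graph: two 'forum' pages linking only among themselves (siloed),
# and a 'market' page that links out to the forums.
LINKS = {"f1": ["f2"], "f2": ["f1"], "m1": ["f1", "m2"], "m2": []}
TYPES = {"f1": "forum", "f2": "forum", "m1": "market", "m2": "market"}
```

Starting the crawl from `m1` reaches all four pages, and the silo check flags only the `forum` type, since its outgoing edges never leave the type.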