Towards using web-crawled data for domain adaptation in statistical machine translation
This paper reports on ongoing work on domain adaptation of statistical machine translation using domain-specific data obtained by domain-focused web crawling. We present a strategy for crawling monolingual and parallel data and for exploiting them for testing, language modelling, and system tuning in a phrase-based machine translation framework. The proposed approach is evaluated on the domains of Natural Environment and Labour Legislation and two language pairs: English–French and English–Greek.
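The adaptation step described above is commonly realized by interpolating an in-domain language model (trained on the crawled text) with a general one. A minimal sketch under that assumption; the function, probabilities, and the interpolation weight are illustrative, not the paper's exact pipeline:

```python
# Sketch (assumed setup): domain adaptation via linear interpolation of an
# in-domain LM, trained on web-crawled text, with a general-domain LM.
# The weight `lam` would normally be tuned on held-out in-domain data.

def interpolated_prob(p_in_domain: float, p_general: float, lam: float = 0.7) -> float:
    """P(w|h) = lam * P_in(w|h) + (1 - lam) * P_gen(w|h)."""
    return lam * p_in_domain + (1 - lam) * p_general

# A word that is frequent in-domain gets boosted relative to the general LM.
p = interpolated_prob(p_in_domain=0.02, p_general=0.005)
```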
Improving Data Collection on Article Clustering by Using Distributed Focused Crawler
Collecting or harvesting data from the Internet is often done using a web crawler. A general web crawler can be developed to concentrate on a certain topic; this type of crawler is called a focused crawler. To improve data-collection performance, a focused crawler alone is not enough, even though it makes efficient use of network bandwidth and storage capacity. This research proposes a distributed focused crawler to improve crawling performance while remaining efficient in network bandwidth and storage capacity. The distributed focused crawler implements crawl scheduling, site ordering to determine the URL queue, and focused crawling using Naïve Bayes. This research also tests web crawling performance under multithreading and observes CPU and memory utilization. The conclusion is that web crawling performance decreases when too many threads are used: CPU and memory utilization become very high, while the performance of the distributed focused crawler drops.
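The Naïve Bayes component mentioned above can be sketched as a scorer over anchor text that orders the URL queue. A minimal self-contained version, assuming this role for the classifier; the training data and helper names are illustrative:

```python
# Minimal sketch (assumed details, not the paper's exact system): a Naive
# Bayes scorer over anchor text, used to order a focused crawler's URL queue.
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit per-class word counts for binary labels 0/1 (Laplace-smoothed at scoring time)."""
    counts = {0: Counter(), 1: Counter()}
    totals = {0: 0, 1: 0}
    for text, y in zip(docs, labels):
        words = text.lower().split()
        counts[y].update(words)
        totals[y] += len(words)
    vocab = {w for c in counts.values() for w in c}
    return counts, totals, vocab

def score(text, counts, totals, vocab):
    """Log-odds that the linked page is on-topic; unseen words are skipped."""
    logodds = 0.0
    for w in text.lower().split():
        if w not in vocab:
            continue
        p1 = (counts[1][w] + 1) / (totals[1] + len(vocab))
        p0 = (counts[0][w] + 1) / (totals[0] + len(vocab))
        logodds += math.log(p1 / p0)
    return logodds

# Hypothetical training data: anchor texts labeled on-topic (1) / off-topic (0).
docs = ["machine learning tutorial", "neural network course",
        "football scores today", "celebrity gossip news"]
labels = [1, 1, 0, 0]
model = train_nb(docs, labels)

# Order the frontier so the most promising links are fetched first.
frontier = ["deep neural network guide", "latest football news"]
frontier.sort(key=lambda a: score(a, *model), reverse=True)
```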
iCrawl: Improving the Freshness of Web Collections by Integrating Social Web and Focused Web Crawling
Researchers in the Digital Humanities and journalists need to monitor, collect, and analyze fresh online content about current events such as the Ebola outbreak or the Ukraine crisis on demand. However, existing focused crawling approaches only consider topical aspects while ignoring temporal aspects, and therefore cannot achieve thematically coherent and fresh Web collections. Social media in particular provide a rich source of fresh content, which is not used by state-of-the-art focused crawlers. In this paper we address the issue of collecting fresh and relevant Web and Social Web content for a topic of interest through the seamless integration of Web and Social Media crawling in a novel integrated focused crawler. The crawler collects Web and Social Media content in a single system and exploits the stream of fresh Social Media content to guide the crawling.
Comment: Published in the Proceedings of the 15th ACM/IEEE-CS Joint Conference on Digital Libraries 201
Building domain-specific web collections for scientific digital libraries: A meta-search enhanced focused crawling method
Collecting domain-specific documents from the Web using focused crawlers has been considered one of the most important strategies for building digital libraries that serve the scientific community. However, because most focused crawlers use local search algorithms to traverse the Web space, they can easily be trapped within a limited sub-graph of the Web surrounding the starting URLs, and thus build domain-specific collections that are not comprehensive and diverse enough for scientists and researchers. In this study, we investigated the problems that local search algorithms cause for traditional focused crawlers and proposed a new crawling approach, meta-search enhanced focused crawling, to address them. We conducted two user evaluation experiments to examine the performance of the proposed approach, and the results showed that it can build domain-specific collections of higher quality than traditional focused crawling techniques.
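The meta-search enhancement can be pictured as injecting search-engine results into the frontier so the crawler can tunnel out of the sub-graph around its seeds. A sketch under that assumption; `fake_meta_search`, the URLs, and the threshold are placeholders, not the paper's implementation:

```python
# Sketch of meta-search enhanced focused crawling (assumed mechanics): when
# the local frontier runs low, query search engines with topic keywords and
# inject the returned URLs as new seeds. `fake_meta_search` stands in for
# real engine APIs.
from collections import deque

def fake_meta_search(query):
    """Placeholder for querying several search engines and merging results."""
    return [f"https://example.org/{query.replace(' ', '-')}/{i}" for i in range(3)]

def replenish_frontier(frontier, topic_keywords, min_size=5):
    """Inject meta-search results whenever the frontier falls below min_size."""
    if len(frontier) < min_size:
        for query in topic_keywords:
            for url in fake_meta_search(query):
                if url not in frontier:
                    frontier.append(url)
    return frontier

frontier = deque(["https://example.org/start"])
frontier = replenish_frontier(frontier, ["focused crawling", "digital libraries"])
```

Because the injected seeds come from a global index rather than the local link neighborhood, they can reach communities no local out-link would.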
A Study of Focused Web Crawling Techniques
In recent years, the amount of data on the Web has grown exponentially. Due to this growth, it is crucial to find accurate and significant information on the Web. Web crawlers are tools or programs that discover web pages on the World Wide Web by following hyperlinks. Search engines index web pages, which can later be retrieved in response to a user query. The immense size and diversity of the Web make it difficult for any crawler to retrieve all pertinent information. Consequently, different variants of Web crawling techniques have emerged as an active research area. In this paper, we survey learnable focused crawlers.
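The focused crawlers surveyed here generally share one skeleton: a best-first loop over a priority frontier. A minimal sketch, assuming a toy link graph in place of HTTP fetching and a stand-in relevance scorer:

```python
# Minimal best-first focused-crawler loop (assumed common structure): pop the
# highest-scoring URL from a priority frontier, then push its out-links.
import heapq

def relevance(url):
    """Hypothetical relevance score; real crawlers use a trained classifier."""
    return 1.0 if "topic" in url else 0.1

def crawl(seeds, fetch_links, budget=10):
    """Best-first crawl: always expand the most promising frontier URL next."""
    frontier = [(-relevance(u), u) for u in seeds]  # max-heap via negation
    heapq.heapify(frontier)
    visited = []
    while frontier and len(visited) < budget:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.append(url)
        for link in fetch_links(url):
            if link not in visited:
                heapq.heappush(frontier, (-relevance(link), link))
    return visited

# Toy link graph in place of real HTTP fetching.
graph = {
    "a/topic": ["b/topic", "c/other"],
    "b/topic": ["d/other"],
    "c/other": [], "d/other": [],
}
order = crawl(["a/topic"], lambda u: graph.get(u, []))
```

Note how the on-topic page `b/topic` is visited before the off-topic `c/other` even though both were discovered at the same step; that reordering is the essence of focused crawling.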
Tree-based Focused Web Crawling with Reinforcement Learning
A focused crawler aims at discovering as many web pages relevant to a target topic as possible, while avoiding irrelevant ones. Reinforcement Learning (RL) has been utilized to optimize focused crawling. In this paper, we propose TRES, an RL-empowered framework for focused crawling. We model the crawling environment as a Markov Decision Process, which the RL agent aims to solve by determining a good crawling strategy. Starting from a few human-provided keywords and a small text corpus that are expected to be relevant to the target topic, TRES follows a keyword-set expansion procedure, which guides crawling, and trains a classifier that constitutes the reward function. To avoid a computationally infeasible brute-force method for selecting the best action, we propose Tree-Frontier, a decision-tree-based algorithm that adaptively discretizes the large state and action spaces and finds only a few representative actions. Tree-Frontier lets the agent select near-optimal actions with high probability by greedily choosing the best representative action. Experimentally, we show that TRES significantly outperforms state-of-the-art methods in terms of harvest rate (the ratio of relevant pages crawled), while Tree-Frontier reduces by orders of magnitude the number of actions that need to be evaluated at each timestep.
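A drastically simplified view of the RL framing above: each frontier URL is an action, the agent acts greedily on a learned value estimate, the classifier supplies a 0/1 reward, and harvest rate is the mean reward. The value table and rewards below are illustrative, not TRES's actual learned quantities:

```python
# Sketch (assumed simplification of the RL formulation): greedy action
# selection over frontier URLs, with reward 1 when the fetched page is
# classified relevant, else 0.

def harvest_rate(rewards):
    """Fraction of crawled pages that were relevant (mean 0/1 reward)."""
    return sum(rewards) / len(rewards) if rewards else 0.0

def greedy_action(frontier, q_value):
    """Select the frontier URL with the highest estimated value."""
    return max(frontier, key=q_value)

# Hypothetical value estimates for three frontier URLs.
q = {"u1": 0.9, "u2": 0.2, "u3": 0.6}
best = greedy_action(list(q), q.get)
rate = harvest_rate([1, 1, 0, 1])
```

Tree-Frontier's contribution is that `max` above need not scan every frontier URL: a decision tree groups similar state-action pairs so only a few representative actions are evaluated.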
Hybrid focused crawling on the Surface and the Dark Web
Focused crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating the Web link structure and selecting which hyperlinks to follow by estimating their relevance to the topic of interest. This work proposes a generic focused crawling framework for discovering resources on any given topic that reside on the Surface or the Dark Web. The proposed crawler is able to seamlessly navigate through the Surface Web and several darknets present in the Dark Web (i.e., Tor, I2P, and Freenet) during a single crawl by automatically adapting its crawling behavior and its classifier-guided hyperlink selection strategy based on the destination network type and the strength of the local evidence present in the vicinity of a hyperlink. It investigates 11 hyperlink selection methods, including a novel strategy based on the dynamic linear combination of a link-based classifier and a parent Web page classifier. This hybrid focused crawler is demonstrated for the discovery of Web resources containing recipes for producing homemade explosives. The evaluation experiments indicate the effectiveness of the proposed focused crawler for both the Surface and the Dark Web.
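The dynamic linear combination mentioned above can be sketched as a weighted average of the two classifier scores, with the weight shifting toward the link-based classifier when the textual evidence around the hyperlink is strong. The weighting scheme and numbers are illustrative assumptions, not the paper's exact formula:

```python
# Sketch of the classifier-guided link selection (assumed form): combine a
# link-based score (anchor text and its surroundings) with the parent page's
# score, trusting the link-based classifier more when local evidence is strong.

def combined_score(link_score, parent_score, local_evidence_strength):
    """Dynamic linear combination; local_evidence_strength is assumed in [0, 1]."""
    w = local_evidence_strength
    return w * link_score + (1 - w) * parent_score

# Strong local evidence around the hyperlink: trust the link classifier.
s1 = combined_score(0.9, 0.2, local_evidence_strength=0.8)
# Weak local evidence (e.g., a bare "click here"): fall back to the parent page.
s2 = combined_score(0.9, 0.2, local_evidence_strength=0.1)
```

This matters on darknets, where anchor text is often sparse and the parent page may be the only usable signal.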
Focused Crawler Optimization Using Genetic Algorithm
As the size of the Web continues to grow, searching it for useful information has become more difficult. A focused crawler intends to explore the Web conforming to a specific topic. This paper discusses the problems caused by local search algorithms: a crawler can be trapped within a limited Web community and overlook suitable Web pages outside its track. A genetic algorithm, as a global search algorithm, is modified to address these problems. The genetic algorithm is used to optimize Web crawling and to select more suitable Web pages for the crawler to fetch. Several evaluation experiments were conducted to examine the effectiveness of the approach. The crawler delivered collections consisting of 3,396 Web pages from 5,390 visited links, a Roulette-Wheel selection filtering rate of 63%, with a precision level of 93% across 5 different categories. The results showed that the genetic algorithm empowered the focused crawler to traverse the Web comprehensively, despite its relatively small collections. Furthermore, it showed great potential for building exemplary collections compared to traditional focused crawling methods.
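Roulette-wheel selection, the operator named above, picks candidates with probability proportional to fitness, so occasionally a low-scoring link survives and can pull the crawler out of a local Web community. A generic sketch of the operator as it might be applied to crawl candidates; the fitness values are illustrative:

```python
# Roulette-wheel (fitness-proportionate) selection, sketched for crawl
# candidates (assumed adaptation): each URL's fitness is its estimated
# topical relevance; selection probability is proportional to fitness.
import random

def roulette_wheel(candidates, fitness, rng):
    """Pick one candidate with probability proportional to its fitness."""
    total = sum(fitness[c] for c in candidates)
    r = rng.uniform(0, total)
    acc = 0.0
    for c in candidates:
        acc += fitness[c]
        if acc >= r:
            return c
    return candidates[-1]  # guard against floating-point rounding

fitness = {"a": 0.7, "b": 0.2, "c": 0.1}
rng = random.Random(0)  # seeded for reproducibility
picks = [roulette_wheel(list(fitness), fitness, rng) for _ in range(1000)]
share_a = picks.count("a") / len(picks)  # expected near 0.7
```

Unlike a purely greedy frontier, "b" and "c" are still chosen a nontrivial fraction of the time, which is exactly the global-search behavior the paper exploits.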