554 research outputs found
Accelerated focused crawling through online relevance feedback
The organization of HTML into a tag tree structure, which is rendered by browsers as roughly rectangular regions with embedded text and HREF links, greatly helps surfers locate and click on links that best satisfy their information need. Can an automatic program emulate this human behavior and thereby learn to predict the relevance of an unseen HREF target page w.r.t. an information need, based on information limited to the HREF source page? Such a capability would be of great interest in focused crawling and resource discovery, because it can fine-tune the priority of unvisited URLs in the crawl frontier and reduce the number of irrelevant pages that are fetched and discarded.
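The frontier re-prioritization described above can be pictured as a best-first queue keyed by predicted relevance, updated as new relevance feedback arrives. The sketch below is illustrative only, not the paper's implementation; all names are invented for the example.

```python
import heapq

class CrawlFrontier:
    """Best-first crawl frontier: URLs are popped in order of the
    predicted relevance of their (still unseen) target pages."""

    def __init__(self):
        self._heap = []   # (negated score, url); negation gives a max-heap
        self._best = {}   # url -> best relevance prediction seen so far

    def add(self, url, predicted_relevance):
        # A URL may be discovered via several source pages; keep the
        # highest prediction and lazily invalidate older heap entries.
        if predicted_relevance > self._best.get(url, float("-inf")):
            self._best[url] = predicted_relevance
            heapq.heappush(self._heap, (-predicted_relevance, url))

    def pop(self):
        # Skip stale entries whose score was later improved.
        while self._heap:
            neg_score, url = heapq.heappop(self._heap)
            if self._best.get(url) == -neg_score:
                del self._best[url]
                return url, -neg_score
        raise IndexError("frontier is empty")
```

The lazy-deletion pattern avoids rebuilding the heap each time a relevance prediction is revised, which matters when feedback arrives continuously during the crawl.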
A Conceptual Framework for Efficient Web Crawling in Virtual Integration Contexts
Virtual Integration systems require a crawling tool able to
navigate and reach relevant pages in the Web in an efficient way. Existing
proposals in the crawling area are aware of the efficiency problem,
but still most of them need to download pages in order to classify them
as relevant or not. In this paper, we present a conceptual framework for
designing crawlers supported by a web page classifier that relies solely
on URLs to determine page relevance. Such a crawler is able to choose
in each step only the URLs that lead to relevant pages, and therefore
reduces the number of unnecessary pages downloaded, optimising bandwidth
usage and making it efficient and suitable for virtual integration systems.
Our preliminary experiments show that such a classifier is able to distinguish
between links leading to different kinds of pages, without previous
intervention from the user.

Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
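The abstract does not spell out the URL-only classifier, so as a minimal stand-in one can imagine a bag-of-tokens Naive Bayes over URL components. Everything below (the tokenization, the class, the smoothing choice) is an assumption for illustration, not the paper's method.

```python
import math
import re
from collections import Counter, defaultdict

def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens (host parts,
    path segments, query words)."""
    return [t for t in re.split(r"[^a-z0-9]+", url.lower()) if t]

class UrlNaiveBayes:
    """Tiny multinomial Naive Bayes over URL tokens, with add-one
    smoothing, classifying a page's relevance from its URL alone."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()               # label -> number of URLs
        self.vocab = set()

    def fit(self, urls, labels):
        for url, label in zip(urls, labels):
            toks = url_tokens(url)
            self.token_counts[label].update(toks)
            self.doc_counts[label] += 1
            self.vocab.update(toks)

    def predict(self, url):
        total_docs = sum(self.doc_counts.values())
        best_label, best_logp = None, float("-inf")
        for label in self.doc_counts:
            logp = math.log(self.doc_counts[label] / total_docs)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in url_tokens(url):
                logp += math.log((self.token_counts[label][tok] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label
```

Because nothing is downloaded at prediction time, such a classifier can decide which frontier URLs to fetch at essentially zero bandwidth cost, which is the efficiency property the abstract is after.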
Methodologies for the Automatic Location of Academic and Educational Texts on the Internet
Traditionally, online databases of web resources have been compiled by a human editor, or through the submissions of authors or interested parties. Considerable resources are needed to maintain a constant level of input and relevance in the face of increasing material quantity and quality, and much of what is in databases is of an ephemeral nature. These pressures dictate that many databases stagnate after an initial period of enthusiastic data entry. The solution to this problem would seem to be the automatic harvesting of resources; however, this process necessitates the automatic classification of resources as ‘appropriate’ to a given database, a problem only solved by complex text content analysis.
This paper outlines the component methodologies necessary to construct such an automated harvesting system, including a number of novel approaches. In particular, this paper looks at the specific problems of automatically identifying academic research work and Higher Education pedagogic materials. Where appropriate, experimental data is presented from searches in the field of Geography as well as the Earth and Environmental Sciences. In addition, appropriate software is reviewed where it exists, and future directions are outlined.
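One simple family of techniques for deciding whether a harvested page is academic is cue-phrase scoring. The cues and weights below are illustrative guesses, not values from the paper; a real system would learn them from labeled examples.

```python
# Cue phrases that tend to mark academic research texts; the weights
# are illustrative guesses, not values taken from the paper.
ACADEMIC_CUES = {
    "abstract": 2.0,
    "references": 2.0,
    "bibliography": 1.5,
    "doi": 1.5,
    "et al": 1.0,
    "keywords": 1.0,
}

def academic_score(text, threshold=3.0):
    """Return (score, is_academic) for a page's plain text, where the
    score is the summed weight of cue phrases found in the text."""
    lowered = text.lower()
    score = sum(w for cue, w in ACADEMIC_CUES.items() if cue in lowered)
    return score, score >= threshold
```

Such a scorer is crude but cheap, which makes it usable as a first filter before the heavier text-content analysis the abstract mentions.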
Advanced Data Mining Techniques for Compound Objects
Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in large data collections. The most important step within the process of KDD is data mining, which is concerned with the extraction of the valid patterns. KDD is necessary to analyze the steadily growing amount of data caused by the enhanced performance of modern computer systems. However, with the growing amount of data, the complexity of data objects increases as well. Modern methods of KDD should therefore examine more complex objects than simple feature vectors to solve real-world KDD applications adequately. Multi-instance and multi-represented objects are two important types of object representations for complex objects. Multi-instance objects consist of a set of object representations that all belong to the same feature space. Multi-represented objects are constructed as a tuple of feature representations where each feature representation belongs to a different feature space.
The contribution of this thesis is the development of new KDD methods for the classification and clustering of complex objects. To this end, the thesis introduces solutions for real-world applications that are based on multi-instance and multi-represented object representations. On the basis of these solutions, it is shown that a more general object representation often provides better results for many relevant KDD applications.
The first part of the thesis is concerned with two KDD problems for which employing multi-instance objects provides efficient and effective solutions. The first is data mining in CAD parts, e.g. the use of hierarchical clustering for the automatic construction of product hierarchies. The introduced solution decomposes a single part into a set of feature vectors and compares these sets using a metric on multi-instance objects. Furthermore, multi-step query processing using a novel filter step is employed, enabling the user to process similarity queries efficiently. On the basis of this similarity search system, it is possible to run several distance-based data mining algorithms, such as the hierarchical clustering algorithm OPTICS, to derive product hierarchies.
The second important application is the classification of and search for complete websites in the World Wide Web (WWW). A website is a set of HTML documents that is published by the same person, group or organization and usually serves a common purpose. To perform data mining on websites, the thesis presents several methods to classify them. After introducing naive methods that model websites as single webpages, two more sophisticated approaches to website classification are introduced. The first approach uses a preprocessing step that maps the individual HTML documents within each website to so-called page classes. The second approach directly compares websites as sets of word vectors and uses nearest-neighbor classification. To search the WWW for new, relevant websites, a focused crawler is introduced that efficiently retrieves relevant websites, minimizing the number of HTML documents that must be downloaded while increasing the accuracy of website retrieval.
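Nearest-neighbor classification over websites represented as sets of word vectors needs a distance between sets. The thesis does not give its exact formula in this abstract, so the sketch below uses a symmetric "average minimum distance" as a plausible stand-in; all names are invented for the example.

```python
import math

def euclid(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def set_distance(site_a, site_b):
    """Symmetric average of the directed average-minimum-distance
    between two websites, each given as a list of page word vectors.
    This is one plausible set distance, not necessarily the thesis's."""
    def directed(s, t):
        return sum(min(euclid(v, w) for w in t) for v in s) / len(s)
    return 0.5 * (directed(site_a, site_b) + directed(site_b, site_a))

def nearest_neighbor_label(query_site, labeled_sites):
    """labeled_sites: list of (set_of_vectors, label); return the label
    of the training website with minimal set distance to the query."""
    return min(labeled_sites, key=lambda sl: set_distance(query_site, sl[0]))[1]
```

The directed term matches every page of one site to its closest page in the other, so two sites are close when each page has a good counterpart, which captures the "set of pages" view of a website.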
The second part of the thesis is concerned with data mining in multi-represented objects. An important example application for this kind of complex object is proteins, which can be represented as a tuple of a protein sequence and a text annotation. To analyze multi-represented objects, a clustering method for them is introduced that is based on the density-based clustering algorithm DBSCAN. This method uses all available representations to find a global clustering of the given data objects. However, in many applications there already exists a sophisticated class ontology for the given data objects, e.g. proteins. To map new objects into such an ontology, a new method for the hierarchical classification of multi-represented objects is described. The system employs the hierarchical structure of the ontology to classify new proteins efficiently, using support vector machines.
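The top-down use of an ontology can be sketched as a tree whose inner nodes each carry a routing classifier (support vector machines in the thesis; here an arbitrary callable stands in). The class and function names below are invented for illustration.

```python
class OntologyNode:
    """A node in a class ontology. Inner nodes carry a classifier that
    routes an object to one of their children; leaves are final classes."""

    def __init__(self, name, children=None, route=None):
        self.name = name
        self.children = children or {}   # child name -> OntologyNode
        self.route = route               # callable(obj) -> child name

def classify_top_down(root, obj):
    """Descend from the root, letting each inner node's classifier pick
    a child, until a leaf is reached. Returns the path of node names."""
    path = [root.name]
    node = root
    while node.children:
        node = node.children[node.route(obj)]
        path.append(node.name)
    return path
```

The efficiency gain is that each object only faces the classifiers along one root-to-leaf path instead of one classifier per class in a flat scheme.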
Tree-based Focused Web Crawling with Reinforcement Learning
A focused crawler aims at discovering as many web pages relevant to a target
topic as possible, while avoiding irrelevant ones. Reinforcement Learning (RL)
has been utilized to optimize focused crawling. In this paper, we propose TRES,
an RL-empowered framework for focused crawling. We model the crawling
environment as a Markov Decision Process, which the RL agent aims at solving by
determining a good crawling strategy. Starting from a few human-provided
keywords and a small text corpus that are expected to be relevant to the
target topic, TRES follows a keyword-set expansion procedure, which guides
crawling, and trains a classifier that constitutes the reward function. To
avoid a computationally infeasible brute-force search for the best
action, we propose Tree-Frontier, a decision-tree-based algorithm that
adaptively discretizes the large state and action spaces and evaluates
only a few representative actions. Tree-Frontier lets the agent select
near-optimal actions with high probability by acting greedily over the best representative
action. Experimentally, we show that TRES significantly outperforms
state-of-the-art methods in terms of harvest rate (ratio of relevant pages
crawled), while Tree-Frontier reduces by orders of magnitude the number of
actions that must be evaluated at each timestep.
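The core efficiency idea, evaluating one representative per region of the discretized action space instead of every frontier URL, can be shown with a greatly simplified sketch. Real Tree-Frontier learns its splits adaptively from samples; here a fixed, user-supplied discretization stands in, and all names are invented.

```python
from collections import defaultdict

def representatives(frontier, bucket):
    """Group frontier actions (url, features) into buckets via a
    discretization function and keep one representative per bucket."""
    buckets = defaultdict(list)
    for url, feats in frontier:
        buckets[bucket(feats)].append((url, feats))
    return [items[0] for items in buckets.values()]

def select_action(frontier, bucket, value):
    """Greedy selection over representatives only: the (possibly
    expensive) value function is evaluated once per bucket rather than
    once per frontier action."""
    reps = representatives(frontier, bucket)
    return max(reps, key=lambda action: value(action[1]))
```

With a frontier of N URLs falling into k buckets, the value function runs k times instead of N; the reported orders-of-magnitude reduction corresponds to k << N.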
Never-ending Learning of User Interfaces
Machine learning models have been trained to predict semantic information
about user interfaces (UIs) to make apps more accessible, easier to test, and
to automate. Currently, most models rely on datasets that are collected and
labeled by human crowd-workers, a process that is costly and surprisingly
error-prone for certain tasks. For example, it is possible to guess if a UI
element is "tappable" from a screenshot (i.e., based on visual signifiers) or
from potentially unreliable metadata (e.g., a view hierarchy), but one way to
know for certain is to programmatically tap the UI element and observe the
effects. We built the Never-ending UI Learner, an app crawler that
automatically installs real apps from a mobile app store and crawls them to
discover new and challenging training examples to learn from. The Never-ending
UI Learner has crawled for more than 5,000 device-hours, performing over half a
million actions on 6,000 apps to train three computer vision models for i)
tappability prediction, ii) draggability prediction, and iii) screen
similarity.
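The tap-and-observe labeling idea reduces to a small loop: act on each element, diff the screen, and record whether anything changed. The `tap` and `screenshot` callables below abstract a real device driver and are hypothetical; a real crawler would also navigate back after each tap, which this sketch omits.

```python
def mine_tappability_labels(elements, tap, screenshot):
    """For each UI element, tap it and compare before/after screen
    states; an observed change labels the element as tappable.
    `tap(element)` and `screenshot()` are stand-ins for a real device
    driver (hypothetical names, not an actual API)."""
    labels = {}
    for element in elements:
        before = screenshot()
        tap(element)
        after = screenshot()
        labels[element] = (before != after)
    return labels
```

Labels mined this way come from observed effects rather than human judgment, which is what lets the crawler sidestep the annotation errors the abstract describes.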
Multiple-Goal Heuristic Search
This paper presents a new framework for anytime heuristic search where the
task is to achieve as many goals as possible within the allocated resources. We
show the inadequacy of traditional distance-estimation heuristics for tasks of
this type and present alternative heuristics that are more appropriate for
multiple-goal search. In particular, we introduce the marginal-utility
heuristic, which estimates the cost and the benefit of exploring a subtree
below a search node. We developed two methods for online learning of the
marginal-utility heuristic. One is based on local similarity of the partial
marginal utility of sibling nodes, and the other generalizes marginal-utility
over the state feature space. We apply our adaptive and non-adaptive
multiple-goal search algorithms to several problems, including focused
crawling, and show their superiority over existing methods.
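In its simplest reading, the marginal-utility heuristic ranks open nodes by estimated goals gained per unit of search cost spent in the subtree below them. The sketch below shows that ratio-based selection rule; the estimation of the two quantities (which the paper learns online) is left abstract, and the names are invented.

```python
def marginal_utility(expected_goals, expected_cost):
    """Marginal utility of expanding a subtree: estimated number of new
    goals found per unit of search cost. Higher is better."""
    if expected_cost <= 0:
        raise ValueError("expected_cost must be positive")
    return expected_goals / expected_cost

def pick_node(open_nodes):
    """Expand the open node with the best goals-per-cost ratio.
    open_nodes: iterable of (node, expected_goals, expected_cost)."""
    return max(open_nodes, key=lambda n: marginal_utility(n[1], n[2]))
```

Unlike a distance-to-goal heuristic, this rule can prefer a nearby cluster of cheap goals over a single distant one, which is exactly the multiple-goal behavior the abstract argues for.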
Information Retrieval on the World Wide Web and Active Logic: A Survey and Problem Definition
As more information becomes available on the World Wide Web (there are
currently over 4 billion pages covering most areas of human endeavor), it
becomes more difficult to provide effective search tools for information
access. Today, people access web information through two main kinds of
search interfaces: Browsers (clicking and following hyperlinks) and Query
Engines (queries in the form of a set of keywords showing the topic of
interest). The first process is tentative and time-consuming, and the second
may not satisfy the user because of many inaccurate and irrelevant results.
Web search tools need to offer better support for expressing one's
information need and for returning high-quality results. There appears to be a need
for systems that do reasoning under uncertainty and are flexible enough to
recover from the contradictions, inconsistencies, and irregularities that such reasoning involves.
Active Logic is a formalism that has been developed with real-world
applications and their challenges in mind. Motivating its design is the
thought that one of the factors that supports the flexibility of human
reasoning is that it takes place step-wise, in time. Active Logic is one of
a family of inference engines (step-logics) that explicitly reason in time,
and incorporate a history of their reasoning as they run. This
characteristic makes Active Logic systems more flexible than traditional AI
systems and therefore more suitable for commonsense, real-world reasoning.
In this report we will mainly survey recent advances in machine learning and
crawling problems related to the web. We will review the continuum of
supervised to semi-supervised to unsupervised learning problems, highlight
the specific challenges which distinguish information retrieval in the
hypertext domain and will summarize the key areas of recent and ongoing
research. We will concentrate on topic-specific search engines, focused
crawling, and finally will propose an Information Integration Environment,
based on the Active Logic framework.
Keywords: Web Information Retrieval, Web Crawling, Focused Crawling, Machine
Learning, Active Logic
(Also UMIACS-TR-2001-69)