Query-related data extraction of hidden web documents
The larger part of the information on the Web is stored in document databases and is not indexed by general-purpose search engines (e.g., Google and Yahoo). Such information is dynamically generated by querying these databases, which are referred to as Hidden Web databases. Documents returned in response to a user query are typically presented using template-generated Web pages. This paper proposes a novel approach that identifies Web page templates by analysing the textual contents and the adjacent tag structures of a document in order to extract query-related data. Preliminary results demonstrate that our approach effectively detects templates and retrieves data with high recall and precision.
Information extraction from template-generated hidden web documents
The larger part of the information on the Web is stored in document databases and is not indexed by general-purpose
search engines (such as Google and Yahoo). Such databases, referred to as Hidden Web databases, dynamically generate
a list of documents in response to a user query. These documents are typically presented to users as template-generated
Web pages. This paper presents a new approach that identifies Web page templates in order to extract query-related
information from documents. We propose two forms of representation to analyse the content of a document:
Text with Immediate Adjacent Tag Segments (TIATS) and Text with Neighbouring Adjacent Tag Segments (TNATS).
Our techniques exploit the tag structures that surround the textual contents of documents in order to detect Web page
templates, thereby extracting query-related information. Experimental results demonstrate that TNATS detects Web page
templates most effectively and extracts information with high recall and precision.
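The idea of pairing each text fragment with the tags around it can be illustrated with a minimal sketch. It is loosely inspired by the description above and is not the paper's TIATS or TNATS definition, which the abstract does not spell out. The assumptions here are that a "segment" is a triple (preceding tag, text, following tag), and that any segment appearing identically in a second page from the same source belongs to the template. The names `SegmentParser`, `segments` and `extract_data` are hypothetical.

```python
from html.parser import HTMLParser


class SegmentParser(HTMLParser):
    """Collect (previous_tag, text, next_tag) triples for each text node."""

    def __init__(self):
        super().__init__()
        self.segments = []     # finished triples
        self._last_tag = None  # most recent tag seen before the current text
        self._pending = None   # (previous_tag, text) waiting for its next tag

    def _close_pending(self, tag):
        # A tag ends the current text run, so the triple can be completed.
        if self._pending is not None:
            prev_tag, text = self._pending
            self.segments.append((prev_tag, text, tag))
            self._pending = None
        self._last_tag = tag

    def handle_starttag(self, tag, attrs):
        self._close_pending("<" + tag + ">")

    def handle_endtag(self, tag):
        self._close_pending("</" + tag + ">")

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._pending is None:
            self._pending = (self._last_tag, text)
        else:
            # Merge text chunks that are not separated by a tag.
            self._pending = (self._pending[0], self._pending[1] + " " + text)

    def close(self):
        super().close()
        self._close_pending(None)  # flush trailing text, if any


def segments(html):
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    return parser.segments


def extract_data(page, other_page):
    """Return text whose segment does not also occur in other_page.

    Assumption: segments shared by both pages are template content.
    """
    template = set(segments(other_page))
    return [text for (prev, text, nxt) in segments(page)
            if (prev, text, nxt) not in template]
```

Given two result pages that differ only in one list item, `extract_data` returns that item's text and drops headings and footers shared by both pages. A real detector would need to tolerate small template variations, which this exact-match sketch does not.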
Automatic Classification of Text Databases through Query Probing
Many text databases on the web are "hidden" behind search interfaces, and
their documents are only accessible through querying. Search engines typically
ignore the contents of such search-only databases. Recently, Yahoo-like
directories have started to manually organize these databases into categories
that users can browse to find these valuable resources. We propose a novel
strategy to automate the classification of search-only text databases. Our
technique starts by training a rule-based document classifier, and then uses
the classifier's rules to generate probing queries. The queries are sent to the
text databases, which are then classified based on the number of matches that
they produce for each query. We report some initial exploratory experiments
that show that our approach is promising to automatically characterize the
contents of text databases accessible on the web. Comment: 7 pages, 1 figure
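The probing strategy described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the rules in `RULES` are made up, the database is simulated by an in-memory list of strings, and the `min_share` threshold for assigning a category is a hypothetical decision rule that the abstract does not specify.

```python
from collections import Counter

# Hypothetical classifier rules: a conjunction of terms implies a category.
RULES = [
    (("ibm", "computer"), "Computers"),
    (("jordan", "bulls"), "Sports"),
    (("cancer",), "Health"),
]


def count_matches(documents, query_terms):
    """Simulate a search interface: count documents containing all terms."""
    return sum(
        all(term in doc.lower().split() for term in query_terms)
        for doc in documents
    )


def classify(documents, rules, min_share=0.5):
    """Probe with one query per rule and tally matches per category.

    Assumption: a category is assigned when it receives at least
    min_share of all matches across the probing queries.
    """
    matches = Counter()
    for terms, category in rules:
        matches[category] += count_matches(documents, terms)
    total = sum(matches.values())
    if total == 0:
        return []
    return sorted(c for c, n in matches.items() if n / total >= min_share)
```

In a real setting `count_matches` would be replaced by a call to the database's search interface that reads the reported number of hits, so no documents need to be downloaded.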
Implementation of Multidimensional Databases with Document-Oriented NoSQL
NoSQL (Not Only SQL) systems are becoming popular due to known advantages such as horizontal scalability and elasticity. In this paper, we study the implementation of data warehouses with document-oriented NoSQL systems. We propose mapping rules that transform the multidimensional data model to logical document-oriented models. We consider three different logical models and we use them to instantiate data warehouses. We focus on data loading, model-to-model conversion and OLAP cuboid computation.
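As a rough illustration of what mapping a multidimensional fact to documents might look like, the sketch below shows two hypothetical layouts (a flat one and a nested one) and a simple aggregation over flat documents. The abstract does not name its three logical models, so these layouts, the input shape with `dimensions` and `measures` keys, and the function names are all assumptions made for the example.

```python
def to_flat(fact):
    """Flat layout: every attribute is a top-level field, prefixed by dimension."""
    doc = {}
    for dim, attrs in fact["dimensions"].items():
        for name, value in attrs.items():
            doc[f"{dim}_{name}"] = value
    doc.update(fact["measures"])
    return doc


def to_nested(fact):
    """Nested layout: one sub-document per dimension, plus one for measures."""
    doc = {dim: dict(attrs) for dim, attrs in fact["dimensions"].items()}
    doc["measures"] = dict(fact["measures"])
    return doc


def rollup(flat_docs, key, measure):
    """Aggregate a measure by one attribute, as a tiny stand-in for a cuboid."""
    totals = {}
    for doc in flat_docs:
        totals[doc[key]] = totals.get(doc[key], 0) + doc[measure]
    return totals
```

Converting between the two layouts is a per-document transformation, which is one way to picture the model-to-model conversion mentioned above.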
Pattern based processing of XPath queries
As the popularity of areas including document storage and
distributed systems continues to grow, the demand for high
performance XML databases is increasingly evident. This
has led to a number of research efforts aimed at exploiting
the maturity of relational database systems in order to
increase XML query performance. In our approach, we use an
index structure based on a metamodel for XML databases
combined with relational database technology to facilitate
fast access to XML document elements. The query process
involves transforming XPath expressions to SQL which can
be executed over our optimised query engine. As there are
many different types of XPath queries, varying processing
logic may be applied to boost performance not only to
individual XPath axes, but across multiple axes simultaneously.
This paper describes a pattern based approach to XPath
query processing, which permits the execution of a group of
XPath location steps in parallel.
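To make the XPath-to-SQL idea concrete, here is a minimal sketch that translates absolute paths using only the child axis (for example `/library/book/title`) into a single SQL statement with one self-join per location step. It does not reproduce the paper's metamodel-based index or its pattern matching. The `node` table schema, the function names and the sample data are hypothetical, and SQLite from the Python standard library stands in for the relational engine.

```python
import sqlite3


def xpath_to_sql(xpath):
    """Translate a child-axis-only absolute path into SQL plus parameters."""
    steps = [s for s in xpath.strip("/").split("/") if s]
    if not steps:
        raise ValueError("empty path")
    joins = ["node AS n0"]
    where = ["n0.parent_id IS NULL", "n0.name = ?"]
    for i in range(1, len(steps)):
        # Each further step joins children to the nodes matched so far.
        joins.append(f"JOIN node AS n{i} ON n{i}.parent_id = n{i - 1}.id")
        where.append(f"n{i}.name = ?")
    last = len(steps) - 1
    sql = (
        f"SELECT n{last}.id, n{last}.text FROM " + " ".join(joins)
        + " WHERE " + " AND ".join(where)
        + f" ORDER BY n{last}.id"
    )
    return sql, steps


def demo_connection():
    """Build a small in-memory document using the assumed node table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE node (id INTEGER PRIMARY KEY, parent_id INTEGER,"
        " name TEXT, text TEXT)"
    )
    conn.executemany(
        "INSERT INTO node VALUES (?, ?, ?, ?)",
        [
            (1, None, "library", None),
            (2, 1, "book", None),
            (3, 2, "title", "XML Basics"),
            (4, 1, "book", None),
            (5, 4, "title", "SQL Tuning"),
            (6, 1, "title", "Catalogue"),
        ],
    )
    return conn


def run_xpath(conn, xpath):
    sql, params = xpath_to_sql(xpath)
    return conn.execute(sql, params).fetchall()
```

Element names are passed as query parameters instead of being interpolated into the SQL text, so the generated statement stays safe for arbitrary step names. Because all steps are evaluated in one statement, the relational optimiser is free to choose the join order.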