Introduction to OXPath
Contemporary web pages with increasingly sophisticated interfaces rival
traditional desktop applications for interface complexity and are often called
web applications or RIAs (Rich Internet Applications). They often require the
execution of JavaScript in a web browser and can issue AJAX requests to
generate content dynamically in response to user interaction. For automatic
data acquisition, it is therefore essential to render web pages correctly and
mimic user actions in order to obtain the relevant data from the web page
content. In brief, to obtain data through existing web interfaces and
transform it into structured form, contemporary wrappers should be able to:
1) interact with sophisticated interfaces of web applications; 2) precisely
acquire relevant data; 3) scale with the number of crawled web pages or states
of a web application; 4) provide an embeddable programming API for integration
with existing web technologies. OXPath is a state-of-the-art technology that
meets these requirements and has demonstrated its efficiency in
comprehensive experiments. OXPath integrates Firefox for correct rendering of
web pages and extends XPath 1.0 for DOM node selection, interaction, and
extraction. It provides means for converting extracted data into different
formats, such as XML, JSON, and CSV, and for saving data into relational databases.
This tutorial explains the main features of the OXPath language and the setup of
a suitable working environment. The guidelines for using OXPath are provided in
the form of prototypical examples.
Comment: 63 pages