A Semantic Scraping Model for Web Resources - Applying Linked Data to Web Page Screen Scraping
Despite the increasing presence of Semantic Web facilities, only a limited share of the resources available on the Internet provides semantic access. Recent initiatives such as the emerging Linked Data Web provide semantic access to existing data by porting resources to the Semantic Web using technologies such as database-to-semantic mapping and scraping. Nevertheless, existing scraping solutions are ad hoc, complemented with graphical interfaces that speed up scraper development. This article proposes a generic framework for web scraping based on semantic technologies. The framework is structured in three levels: scraping services, the semantic scraping model, and syntactic scraping. The first level provides an interface through which generic applications or intelligent agents gather information from the web at a high level. The second level defines a semantic RDF model of the scraping process, providing a declarative approach to the scraping task. Finally, the third level implements the RDF scraping model for specific technologies. The work has been validated in a scenario that illustrates its application to mashup technologies.
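The three-level idea, a declarative model driving a syntactic scraping layer, can be illustrated with a minimal sketch. The triple vocabulary (`ex:title`, `ex:author`), the regex "selectors", and the sample page below are invented for illustration; the framework's actual model is expressed in RDF, which is flattened here into plain Python tuples for brevity.

```python
import re

# A declarative "scraping model": triples mapping a property to an HTML
# fragment pattern. The regex stands in for a real selector; in the
# framework this mapping would be an RDF graph (semantic level).
MODEL = [
    ("ex:title",  r"<h1>(.*?)</h1>"),
    ("ex:author", r'<span class="author">(.*?)</span>'),
]

def scrape(html, model):
    """Interpret the declarative model against a page (syntactic level)."""
    out = {}
    for prop, pattern in model:
        match = re.search(pattern, html)
        if match:
            out[prop] = match.group(1)
    return out

page = '<h1>Linked Data</h1><span class="author">J. Doe</span>'
print(scrape(page, MODEL))  # {'ex:title': 'Linked Data', 'ex:author': 'J. Doe'}
```

The point of the split is that the model above is data, not code: a scraping service can ship new extraction rules without redeploying the interpreter.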
Abmash: Mashing Up Legacy Web Applications by Automated Imitation of Human Actions
Many business web-based applications do not offer application programming interfaces (APIs) that let other applications access their data and functions programmatically. This makes their composition difficult (for instance, synchronizing data between two applications). To address this challenge, this paper presents Abmash, an approach that facilitates the integration of such legacy web applications by automatically imitating human interactions with them. By interacting automatically with the graphical user interface (GUI) of web applications, the system supports all forms of integration, including bi-directional interactions, and is able to interact with AJAX-based applications. Furthermore, the integration programs are easy to write, since they deal with end-user, visual user-interface elements. The integration code is simple enough to be called a "mashup".
Comment: Software: Practice and Experience (2013)
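Abmash's core idea, addressing page elements by what the user sees rather than by markup internals, can be sketched with the standard-library HTML parser. The `find_clickable` helper and the sample form below are illustrative only; they are not Abmash's actual API (which is Java-based and drives a real browser).

```python
from html.parser import HTMLParser

class VisibleElementFinder(HTMLParser):
    """Collect clickable elements keyed by their visible label text,
    imitating how a human locates a button on the rendered page."""
    CLICKABLE = {"a", "button"}

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open clickable tags: [tag, text]
        self.labels = {}   # visible label text -> tag name

    def handle_starttag(self, tag, attrs):
        if tag in self.CLICKABLE:
            self._stack.append([tag, ""])

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][1] += data

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            name, text = self._stack.pop()
            self.labels[text.strip()] = name

def find_clickable(html, label):
    """Return the tag of the element whose visible text is `label`."""
    finder = VisibleElementFinder()
    finder.feed(html)
    return finder.labels.get(label)

page = '<form><button type="submit">Save</button><a href="/x">Cancel</a></form>'
print(find_clickable(page, "Save"))    # button
print(find_clickable(page, "Cancel"))  # a
```

An integration program written against labels like "Save" keeps working even if the underlying markup (ids, classes, tag nesting) changes, which is what makes this style robust for legacy applications without APIs.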
Performance improvement of user-generated data retrieval from the Web, based on adaptive intelligent methods
User-generated content on Web forums is added much more often than it is deleted or changed, so fetching it during incremental crawling differs from classic Web page crawling. Adding new content to a forum can move existing content onto new or existing pages. Incremental forum crawling is therefore not a trivial task: ignoring the way content is presented, distributed, and sorted can lead to re-downloading posts that were already indexed in previous crawl cycles. At the same time, there is a wide spectrum of forum technologies offering different navigational paths to the latest posts, as well as different ways of presenting and sorting user-generated content.
This thesis presents the Structure-driven Incremental Forum crawler (SInFo), which specializes in fetching the latest content during incremental forum crawling using advanced optimization techniques and machine learning. The main goal of the presented crawler is to avoid already-indexed content in new crawling cycles regardless of the forum's technology. To achieve this, the following Web forum features are exploited: (1) the sort order on index and thread pages, and (2) the navigation paths between pages that the forum technology offers. Since the content-creation date plays an important role in determining the sort order, and generated dates can appear in different formats and languages, detecting and normalizing these dates is not trivial; machine-learning models were used for this task. The detection of navigational paths, on the other hand, is achieved by interpreting the URL format of links and scanning the pages they point to.
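The date-normalization step can be approximated with a small rule-based sketch. The format list and the tiny German month table below are assumptions for illustration; the thesis relies on machine-learning models precisely because real forum dates come in far more formats and languages than any hand-picked list covers.

```python
from datetime import datetime

# Candidate date formats, standing in for the learned models in the thesis.
FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%b %d, %Y", "%d %B %Y"]

# Minimal month-name translation for non-English dates (illustrative only).
MONTHS = {"januar": "January", "mai": "May", "dezember": "December"}

def normalize_date(raw):
    """Return an ISO date string for a forum timestamp, or None."""
    text = raw.strip()
    for foreign, english in MONTHS.items():
        text = text.replace(foreign, english).replace(foreign.capitalize(), english)
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(normalize_date("Mar 5, 2021"))      # 2021-03-05
print(normalize_date("05.03.2021"))       # 2021-03-05
print(normalize_date("5 Dezember 2021"))  # 2021-12-05
```

Once timestamps are normalized, inferring the sort order of an index or thread page reduces to checking whether the sequence of post dates on it is ascending or descending.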
It has been shown that, using the proposed methods and techniques, fetching the pages with the latest content minimizes the number of duplicate content downloads and maximizes the utilization of the navigational structure and paths of the forum technology at hand. The experiments were performed on a wide range of existing popular forum technologies as well as on individual stand-alone forum deployments. SInFo demonstrated high precision and a minimal number of duplicate content transfers in each new crawl cycle. Most of the duplicates the proposed crawler encountered came from pages that had to be visited in order to correctly determine the navigational path or to find the appropriate URL. Additionally, the machine-learning models, although complex, achieve good performance during crawling and high accuracy in date detection and normalization, reaching an F1-measure of 99%.
Vision-based Web Data Records Extraction
This paper studies the problem of extracting data records from the response pages returned by web databases or search engines. Existing solutions to this problem are based primarily on analyzing the HTML DOM trees and tags of the response pages. While these solutions can achieve good results, they depend heavily on the specifics of HTML and may have to be changed should the response pages be written in a totally different markup language. In this paper, we propose a novel, language-independent technique for the data extraction problem. Our proposed solution performs the extraction using only the visual information of the response pages as rendered in web browsers. We analyze several types of visual features in this paper. We also propose a new measure, revision, to evaluate extraction performance; it reflects the ratio of perfectly extracted pages among all response pages. Our experimental results indicate that this vision-based approach can achieve very high extraction accuracy.
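Reading the revision measure as "the fraction of response pages whose record set is extracted perfectly, with no missed and no spurious records", it can be sketched as below. This is our interpretation of the description in the abstract, not the paper's reference implementation; the record representation and page pairing are assumptions.

```python
def revision(extracted_pages, ground_truth_pages):
    """Fraction of pages on which the extracted record set exactly
    matches the ground truth -- the 'perfect extraction ratio'.

    Each page is a set of records; the two lists are assumed to be
    paired page-by-page.
    """
    assert len(extracted_pages) == len(ground_truth_pages)
    perfect = sum(
        1 for got, want in zip(extracted_pages, ground_truth_pages)
        if got == want
    )
    return perfect / len(ground_truth_pages)

pages_got  = [{"r1", "r2"}, {"r3"},       {"r4", "r5"}]
pages_want = [{"r1", "r2"}, {"r3", "r6"}, {"r4", "r5"}]
print(revision(pages_got, pages_want))  # 2 of 3 pages perfect -> 0.666...
```

Unlike record-level precision and recall, this page-level measure gives no credit for a page that is almost right, which is why it is a stricter gauge of practical extraction quality.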