A Semantic Scraping Model for Web Resources - Applying Linked Data to Web Page Screen Scraping
Despite the increasing presence of Semantic Web facilities, only a limited share of the resources available on the Internet provides semantic access. Recent initiatives such as the emerging Linked Data Web provide semantic access to existing data by porting resources to the Semantic Web using technologies such as database-to-semantic mapping and scraping. Nevertheless, existing scraping solutions are ad hoc, complemented with graphical interfaces that speed up scraper development. This article proposes a generic framework for web scraping based on semantic technologies. The framework is structured in three levels: scraping services, the semantic scraping model, and syntactic scraping. The first level provides an interface through which generic applications or intelligent agents gather information from the web at a high level. The second level defines a semantic RDF model of the scraping process, providing a declarative approach to the scraping task. Finally, the third level implements the RDF scraping model for specific technologies. The work has been validated in a scenario that illustrates its application to mashup technologies.
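The three-level idea, a declarative model driving a syntactic scraping layer, can be illustrated with a minimal sketch. The triple vocabulary (`ex:title`, `ex:author`), the regex "selectors", and the sample page below are invented for illustration; the framework's actual model is expressed in RDF, which is flattened here into plain Python tuples for brevity.

```python
import re

# A declarative "scraping model": triples mapping a property to an HTML
# fragment pattern. The regex stands in for a real selector; in the
# framework this mapping would be an RDF graph (semantic level).
MODEL = [
    ("ex:title",  r"<h1>(.*?)</h1>"),
    ("ex:author", r'<span class="author">(.*?)</span>'),
]

def scrape(html, model):
    """Interpret the declarative model against a page (syntactic level)."""
    out = {}
    for prop, pattern in model:
        match = re.search(pattern, html)
        if match:
            out[prop] = match.group(1)
    return out

page = '<h1>Linked Data</h1><span class="author">J. Doe</span>'
print(scrape(page, MODEL))  # {'ex:title': 'Linked Data', 'ex:author': 'J. Doe'}
```

The point of the split is that the model above is data, not code: a scraping service can ship new extraction rules without redeploying the interpreter.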
Abmash: Mashing Up Legacy Web Applications by Automated Imitation of Human Actions
Many business web-based applications do not offer application programming interfaces (APIs) that let other applications access their data and functions programmatically. This makes their composition difficult (for instance, synchronizing data between two applications). To address this challenge, this paper presents Abmash, an approach that facilitates the integration of such legacy web applications by automatically imitating human interactions with them. By interacting automatically with the graphical user interface (GUI) of web applications, the system supports all forms of integration, including bi-directional interactions, and is able to interact with AJAX-based applications. Furthermore, the integration programs are easy to write, since they deal with end-user, visual user-interface elements. The integration code is simple enough to be called a "mashup".
Comment: Software: Practice and Experience (2013)
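Abmash's core idea, addressing page elements by what the user sees rather than by markup internals, can be sketched with the standard-library HTML parser. The `find_clickable` helper and the sample form below are illustrative only; they are not Abmash's actual API (which is Java-based and drives a real browser).

```python
from html.parser import HTMLParser

class VisibleElementFinder(HTMLParser):
    """Collect clickable elements keyed by their visible label text,
    imitating how a human locates a button on the rendered page."""
    CLICKABLE = {"a", "button"}

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open clickable tags: [tag, text]
        self.labels = {}   # visible label text -> tag name

    def handle_starttag(self, tag, attrs):
        if tag in self.CLICKABLE:
            self._stack.append([tag, ""])

    def handle_data(self, data):
        if self._stack:
            self._stack[-1][1] += data

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1][0] == tag:
            name, text = self._stack.pop()
            self.labels[text.strip()] = name

def find_clickable(html, label):
    """Return the tag of the element whose visible text is `label`."""
    finder = VisibleElementFinder()
    finder.feed(html)
    return finder.labels.get(label)

page = '<form><button type="submit">Save</button><a href="/x">Cancel</a></form>'
print(find_clickable(page, "Save"))    # button
print(find_clickable(page, "Cancel"))  # a
```

An integration program written against labels like "Save" keeps working even if the underlying markup (ids, classes, tag nesting) changes, which is what makes this style robust for legacy applications without APIs.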
Performance improvement of user-generated data retrieval from the Web, based on adaptive intelligent methods
User-generated content on Web forums is added much more often than it is deleted or changed, so fetching it during incremental crawling differs from classic Web page crawling. Adding new content to a forum can move existing content onto new or existing pages. Incremental forum crawling is therefore not a trivial task: ignoring the way content is presented, distributed, and sorted can lead to re-downloading posts that were already indexed in previous crawl cycles. At the same time, there is a wide spectrum of forum technologies offering different navigational paths to the latest posts, as well as different ways of presenting and sorting user-generated content.
This thesis presents the Structure-driven Incremental Forum crawler (SInFo), which specializes in fetching the latest content during incremental forum crawling using advanced optimization techniques and machine learning. The main goal of the presented crawler is to avoid already-indexed content in new crawling cycles regardless of the forum's technology. To achieve this, the following Web forum features are exploited: (1) the sort order on index and thread pages, and (2) the navigation paths between pages that the forum technology offers. Since the content-creation date plays an important role in determining the sort order, and generated dates can appear in different formats and languages, detecting and normalizing these dates is not trivial; machine-learning models were used for this task. The detection of navigational paths, on the other hand, is achieved by interpreting the URL format of links and scanning the pages they point to.
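The date-normalization step can be approximated with a small rule-based sketch. The format list and the tiny German month table below are assumptions for illustration; the thesis relies on machine-learning models precisely because real forum dates come in far more formats and languages than any hand-picked list covers.

```python
from datetime import datetime

# Candidate date formats, standing in for the learned models in the thesis.
FORMATS = ["%Y-%m-%d", "%d.%m.%Y", "%b %d, %Y", "%d %B %Y"]

# Minimal month-name translation for non-English dates (illustrative only).
MONTHS = {"januar": "January", "mai": "May", "dezember": "December"}

def normalize_date(raw):
    """Return an ISO date string for a forum timestamp, or None."""
    text = raw.strip()
    for foreign, english in MONTHS.items():
        text = text.replace(foreign, english).replace(foreign.capitalize(), english)
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None

print(normalize_date("Mar 5, 2021"))      # 2021-03-05
print(normalize_date("05.03.2021"))       # 2021-03-05
print(normalize_date("5 Dezember 2021"))  # 2021-12-05
```

Once timestamps are normalized, inferring the sort order of an index or thread page reduces to checking whether the sequence of post dates on it is ascending or descending.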
It has been shown that, using the proposed methods and techniques, fetching the pages with the latest content minimizes the number of duplicate content downloads and maximizes the utilization of the navigational structure and paths of the forum technology at hand. The experiments were performed on a wide range of existing popular forum technologies as well as on individual stand-alone forum deployments. SInFo demonstrated high precision and a minimal number of duplicate content transfers in each new crawl cycle. Most of the duplicates the proposed crawler encountered came from pages that had to be visited in order to correctly determine the navigational path or to find the appropriate URL. Additionally, the machine-learning models, although complex, achieve good performance during crawling and high accuracy in date detection and normalization, reaching an F1-measure of 99%.
Vision-based Web Data Records Extraction
This paper studies the problem of extracting data records from the response pages returned by web databases or search engines. Existing solutions to this problem are based primarily on analyzing the HTML DOM trees and tags of the response pages. While these solutions can achieve good results, they depend heavily on the specifics of HTML and may have to be changed should the response pages be written in a totally different markup language. In this paper, we propose a novel, language-independent technique for the data extraction problem. Our proposed solution performs the extraction using only the visual information of the response pages as rendered in web browsers. We analyze several types of visual features in this paper. We also propose a new measure, revision, to evaluate extraction performance; it reflects the ratio of perfectly extracted pages among all response pages. Our experimental results indicate that this vision-based approach can achieve very high extraction accuracy.
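Reading the revision measure as "the fraction of response pages whose record set is extracted perfectly, with no missed and no spurious records", it can be sketched as below. This is our interpretation of the description in the abstract, not the paper's reference implementation; the record representation and page pairing are assumptions.

```python
def revision(extracted_pages, ground_truth_pages):
    """Fraction of pages on which the extracted record set exactly
    matches the ground truth -- the 'perfect extraction ratio'.

    Each page is a set of records; the two lists are assumed to be
    paired page-by-page.
    """
    assert len(extracted_pages) == len(ground_truth_pages)
    perfect = sum(
        1 for got, want in zip(extracted_pages, ground_truth_pages)
        if got == want
    )
    return perfect / len(ground_truth_pages)

pages_got  = [{"r1", "r2"}, {"r3"},       {"r4", "r5"}]
pages_want = [{"r1", "r2"}, {"r3", "r6"}, {"r4", "r5"}]
print(revision(pages_got, pages_want))  # 2 of 3 pages perfect -> 0.666...
```

Unlike record-level precision and recall, this page-level measure gives no credit for a page that is almost right, which is why it is a stricter gauge of practical extraction quality.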