8 research outputs found

    A Semantic Scraping Model for Web Resources - Applying Linked Data to Web Page Screen Scraping

    In spite of the increasing presence of Semantic Web facilities, only a limited share of the resources available on the Internet provide semantic access. Recent initiatives such as the emerging Linked Data Web are providing semantic access to available data by porting existing resources to the Semantic Web using different technologies, such as database-semantic mapping and scraping. Nevertheless, existing scraping solutions are ad hoc and are complemented with graphical interfaces for speeding up scraper development. This article proposes a generic framework for web scraping based on semantic technologies. The framework is structured in three levels: scraping services, semantic scraping model and syntactic scraping. The first level provides an interface to generic applications or intelligent agents for gathering information from the web at a high level. The second level defines a semantic RDF model of the scraping process, in order to provide a declarative approach to the scraping task. Finally, the third level provides an implementation of the RDF scraping model for specific technologies. The work has been validated in a scenario that illustrates its application to mashup technologies.
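    The idea of a declarative scraping model can be illustrated with a minimal sketch. It is not the framework from the article: the mapping format, the `ex:` vocabulary, the `scrape` function and the sample page are all assumptions made for illustration, and plain tuples stand in for RDF triples.

```python
import xml.etree.ElementTree as ET

# Hypothetical declarative mapping: which page fragments become which properties.
# The vocabulary (ex:Post, ex:title, ex:author) is illustrative only.
MAPPING = {
    "type": "ex:Post",
    "selector": ".//div[@class='post']",
    "properties": {
        "ex:title": ".//h2",
        "ex:author": ".//span[@class='author']",
    },
}


def scrape(xhtml: str, mapping: dict) -> list[tuple[str, str, str]]:
    """Apply the mapping to a well-formed XHTML string and return triples."""
    root = ET.fromstring(xhtml)
    triples = []
    for i, node in enumerate(root.findall(mapping["selector"])):
        subject = f"_:item{i}"  # blank-node style identifier
        triples.append((subject, "rdf:type", mapping["type"]))
        for prop, path in mapping["properties"].items():
            found = node.find(path)
            # Emit a triple only when the fragment exists and has text.
            if found is not None and found.text:
                triples.append((subject, prop, found.text.strip()))
    return triples


PAGE = """
<html><body>
  <div class="post"><h2>First post</h2><span class="author">alice</span></div>
  <div class="post"><h2>Second post</h2></div>
</body></html>
"""

for triple in scrape(PAGE, MAPPING):
    print(triple)
```

    The scraping logic stays generic while the page-specific knowledge lives in the mapping, which is the property a declarative model is meant to provide.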

    Abmash: Mashing Up Legacy Web Applications by Automated Imitation of Human Actions

    Many business web-based applications do not offer application programming interfaces (APIs) that enable other applications to access their data and functions programmatically. This makes their composition difficult (for instance, synchronizing data between two applications). To address this challenge, this paper presents Abmash, an approach to facilitate the integration of such legacy web applications by automatically imitating human interactions with them. By automatically interacting with the graphical user interface (GUI) of web applications, the system supports all forms of integration, including bi-directional interactions, and is able to interact with AJAX-based applications. Furthermore, the integration programs are easy to write since they deal with end-user, visual user-interface elements. The integration code is simple enough to be called a "mashup". Comment: Software: Practice and Experience (2013)

    Automatic Data Extraction from Template-Generated Web Pages


    Performance improvement of user-generated data retrieval from the Web, based on adaptive intelligent methods

    ΠšΠΎΡ€ΠΈΡΠ½ΠΈΡ‡ΠΊΠΈ гСнСрисан ΡΠ°Π΄Ρ€ΠΆΠ°Ρ˜ Π½Π° Π²Π΅Π± Ρ„ΠΎΡ€ΡƒΠΌΡƒ сС ΠΌΠ½ΠΎΠ³ΠΎ Ρ‡Π΅ΡˆΡ›Π΅ додајС Π½Π΅Π³ΠΎ ΡˆΡ‚ΠΎ сС Π±Ρ€ΠΈΡˆΠ΅ ΠΈΠ»ΠΈ мСња ΠΏΠ° сС самим Ρ‚ΠΈΠΌ, Ρ†ΠΈΡ™Π°ΡšΠ΅ истог, ΠΏΡ€ΠΈΠ»ΠΈΠΊΠΎΠΌ ΠΈΠ½ΠΊΡ€Π΅ΠΌΠ΅Π½Ρ‚Π°Π»Π½ΠΎΠ³ ΠΏΡ€Π΅Ρ‚Ρ€Π°ΠΆΠΈΠ²Π°ΡšΠ°, Ρ€Π°Π·Π»ΠΈΠΊΡƒΡ˜Π΅ Ρƒ односу Π½Π° класично ΠΏΡ€Π΅Ρ‚Ρ€Π°ΠΆΠΈΠ²Π°ΡšΠ΅ страна Π²Π΅Π± ΡΠ°Ρ˜Ρ‚Π°. Π”ΠΎΠ΄Π°Π²Π°ΡšΠ΅ Π½ΠΎΠ²ΠΎΠ³ ΡΠ°Π΄Ρ€ΠΆΠ°Ρ˜Π° Π½Π° Ρ„ΠΎΡ€ΡƒΠΌΡƒ ΠΌΠΎΠΆΠ΅ Ρ€Π΅Π·ΡƒΠ»Ρ‚ΠΎΠ²Π°Ρ‚ΠΈ ΠΏΠΎΠΌΠ΅Ρ€Π°ΡšΠ΅ΠΌ Π²Π΅Ρ› ΠΏΠΎΡΡ‚ΠΎΡ˜Π΅Ρ›Π΅Π³ ΡΠ°Π΄Ρ€ΠΆΠ°Ρ˜Π° Π½Π° Π½ΠΎΠ²Π΅ ΠΈΠ»ΠΈ ΠΏΠΎΡΡ‚ΠΎΡ˜Π΅Ρ›Π΅ странС. Π˜Π½ΠΊΡ€Π΅ΠΌΠ΅Π½Ρ‚Π°Π»Π½ΠΎ ΠΏΡ€Π΅Ρ‚Ρ€Π°ΠΆΠΈΠ²Π°ΡšΠ΅ Ρ„ΠΎΡ€ΡƒΠΌΠ° нијС Ρ‚Ρ€ΠΈΠ²ΠΈΡ˜Π°Π»Π°Π½ Π·Π°Π΄Π°Ρ‚Π°ΠΊ, Ρ˜Π΅Ρ€ ΠΈΠ³Π½ΠΎΡ€ΠΈΡΠ°ΡšΠ΅ Π½Π°Ρ‡ΠΈΠ½Π° Π½Π° којС јС ΡΠ°Π΄Ρ€ΠΆΠ°Ρ˜ ΠΏΡ€Π΅Π·Π΅Π½Ρ‚ΠΎΠ²Π°Π½, дистрибуиран ΠΈ сортиран ΠΌΠΎΠΆΠ΅ довСсти Π΄ΠΎ прСноса постова који су Π²Π΅Ρ› Π±ΠΈΠ»ΠΈ индСксирани Ρƒ ΠΏΡ€Π΅Ρ‚Ρ…ΠΎΠ΄Π½ΠΈΠΌ циклусима ΠΏΡ€Π΅Ρ‚Ρ€Π°ΠΆΠΈΠ²Π°ΡšΠ°. Π‘ Π΄Ρ€ΡƒΠ³Π΅ странС ΠΏΠΎΡΡ‚ΠΎΡ˜ΠΈ ΡˆΠΈΡ€ΠΎΠΊ спСктар форумских Ρ‚Π΅Ρ…Π½ΠΎΠ»ΠΎΠ³ΠΈΡ˜Π° којС ΠΎΠΌΠΎΠ³ΡƒΡ›Π°Π²Π°Ρ˜Ρƒ Ρ€Π°Π·Π»ΠΈΡ‡ΠΈΡ‚Π΅ Π½Π°Π²ΠΈΠ³Π°Ρ†ΠΈΠΎΠ½Π΅ ΠΏΡƒΡ‚Π°ΡšΠ΅ ΠΊΠ° својим најновијим постовима ΠΊΠ°ΠΎ ΠΈ Ρ€Π°Π·Π»ΠΈΡ‡ΠΈΡ‚Π΅ Π½Π°Ρ‡ΠΈΠ½Π΅ ΠΏΡ€Π΅Π·Π΅Π½Ρ‚ΠΎΠ²Π°ΡšΠ° ΠΈ ΡΠΎΡ€Ρ‚ΠΈΡ€Π°ΡšΠ° истих. ЈСдан ΠΎΠ΄ Π³Π»Π°Π²Π½ΠΈΡ… Ρ€Π΅Π·ΡƒΠ»Ρ‚Π°Ρ‚Π° Ρ‚Π΅Π·Π΅ јС структурно Π²ΠΎΡ’Π΅Π½ΠΈ ΠΈΠ½ΠΊΡ€Π΅ΠΌΠ΅Π½Ρ‚Π°Π»Π½ΠΈ ΠΏΡ€Π΅Ρ‚Ρ€Π°ΠΆΠΈΠ²Π°Ρ‡ Ρ„ΠΎΡ€ΡƒΠΌΠ° (SInFo) који јС ΡΠΏΠ΅Ρ†ΠΈΡ˜Π°Π»ΠΈΠ·ΠΎΠ²Π°Π½ Π·Π° Ρ†ΠΈΡ™Π°ΡšΠ΅ најновијСг ΡΠ°Π΄Ρ€ΠΆΠ°Ρ˜Π° ΠΏΡ€ΠΈΠ»ΠΈΠΊΠΎΠΌ ΠΈΠ½ΠΊΡ€Π΅ΠΌΠ΅Π½Ρ‚Π°Π»Π½ΠΎΠ³ ΠΏΡ€Π΅Ρ‚Ρ€Π°ΠΆΠΈΠ²Π°ΡšΠ° ΠΊΠΎΡ€ΠΈΡˆΡ›Π΅ΡšΠ΅ΠΌ Π½Π°ΠΏΡ€Π΅Π΄Π½ΠΈΡ… ΠΎΠΏΡ‚ΠΈΠΌΠΈΠ·Π°Ρ†ΠΈΠΎΠ½ΠΈΡ… Ρ‚Π΅Ρ…Π½ΠΈΠΊΠ° ΠΈ машинског ΡƒΡ‡Π΅ΡšΠ°. Π“Π»Π°Π²Π½ΠΈ Ρ†ΠΈΡ™ прСдстављСног ΠΏΡ€Π΅Ρ‚Ρ€Π°ΠΆΠΈΠ²Π°Ρ‡Π° Ρ˜Π΅ΡΡ‚Π΅ избСгавањС Π²Π΅Ρ› индСксираног ΡΠ°Π΄Ρ€ΠΆΠ°Ρ˜Π° Ρƒ Π½ΠΎΠ²ΠΈΠΌ циклусима ΠΏΡ€Π΅Ρ‚Ρ€Π°ΠΆΠΈΠ²Π°ΡšΠ° Ρ„ΠΎΡ€ΡƒΠΌΠ° Π±Π΅Π· ΠΎΠ±Π·ΠΈΡ€Π° Π½Π° ΡšΠ΅Π³ΠΎΠ²Ρƒ Ρ‚Π΅Ρ…Π½ΠΎΠ»ΠΎΠ³ΠΈΡ˜Ρƒ. 
Π”Π° Π±ΠΈ овај Ρ†ΠΈΡ™ ΠΌΠΎΠ³Π°ΠΎ Π±ΠΈΡ‚ΠΈ ΠΈΡΠΏΡƒΡšΠ΅Π½, слСдСћС карактСристикС Π²Π΅Π± Ρ„ΠΎΡ€ΡƒΠΌΠ° су ΠΈΡΠΊΠΎΡ€ΠΈΡˆΡ›Π΅Π½Π΅: (1) Π½Π°Ρ‡ΠΈΠ½ ΡΠΎΡ€Ρ‚ΠΈΡ€Π°ΡšΠ° Π½Π° индСксним ΠΈ дискусионим странама ΠΈ (2) доступнС Π½Π°Π²ΠΈΠ³Π°Ρ†ΠΈΠΎΠ½Π΅ ΠΏΡƒΡ‚Π°ΡšΠ΅ ΠΈΠ·ΠΌΠ΅Ρ’Ρƒ страна којС Ρ‚Ρ€Π΅Π½ΡƒΡ‚Π½Π° Π²Π΅Π± форумска Ρ‚Π΅Ρ…Π½ΠΎΠ»ΠΎΠ³ΠΈΡ˜Π° Π½ΡƒΠ΄ΠΈ. Π‘ ΠΎΠ±Π·ΠΈΡ€ΠΎΠΌ Π½Π° Ρ‚ΠΎ Π΄Π° ΠΏΡ€ΠΈΠ»ΠΈΠΊΠΎΠΌ ΡƒΡ‚Π²Ρ€Ρ’ΠΈΠ²Π°ΡšΠ° Ρ‚ΠΈΠΏΠ° ΡΠΎΡ€Ρ‚ΠΈΡ€Π°ΡšΠ° Π±ΠΈΡ‚Π½Ρƒ ΡƒΠ»ΠΎΠ³Ρƒ ΠΈΠΌΠ° Π΄Π°Ρ‚ΡƒΠΌ ΠΊΡ€Π΅ΠΈΡ€Π°ΡšΠ° ΡΠ°Π΄Ρ€Π°ΠΆΠ°Ρ˜Π°, Π΄Π΅Ρ‚Π΅ΠΊΡ†ΠΈΡ˜Π° ΠΈ Π½ΠΎΡ€ΠΌΠ°Π»ΠΈΠ·Π°Ρ†ΠΈΡ˜Π° истих нијС Ρ˜Π΅Π΄Π½ΠΎΡΡ‚Π°Π²Π°Π½ Π·Π°Π΄Π°Ρ‚Π°ΠΊ. Π—Π° овај Π·Π°Π΄Π°Ρ‚Π°ΠΊ су ΠΊΠΎΡ€ΠΈΡˆΡ›Π΅Π½ΠΈ ΠΌΠΎΠ΄Π΅Π»ΠΈ машинског ΡƒΡ‡Π΅ΡšΠ°, Ρ˜Π΅Ρ€ гСнСрисани Π΄Π°Ρ‚ΡƒΠΌΠΈ ΠΌΠΎΠ³Ρƒ Π±ΠΈΡ‚ΠΈ Ρƒ Ρ€Π°Π·Π»ΠΈΡ‡ΠΈΡ‚ΠΈΠΌ Ρ„ΠΎΡ€ΠΌΠ°Ρ‚ΠΈΠΌΠ° ΠΈ Π½Π° Ρ€Π°Π·Π»ΠΈΡ‡Ρ‚ΠΈΠΌ Ρ˜Π΅Π·ΠΈΡ†ΠΈΠΌΠ°. Π‘ Π΄Ρ€ΡƒΠ³Π΅ странС, Π΄Π΅Ρ‚Π΅ΠΊΡ†ΠΈΡ˜Π° Π½Π°Π²ΠΈΠ³Π°Ρ†ΠΈΠΎΠ½ΠΈΡ… ΠΏΡƒΡ‚Π°ΡšΠ° сС постиТС ΠΈΠ½Ρ‚Π΅Ρ€ΠΏΡ€Π΅Ρ‚Π°Ρ†ΠΈΡ˜ΠΎΠΌ Ρ„ΠΎΡ€ΠΌΠ°Ρ‚Π° URL Π»ΠΈΠ½ΠΊΠΎΠ²Π° ΠΈ ΡΠΊΠ΅Π½ΠΈΡ€Π°ΡšΠ΅ΠΌ страна Π½Π° којС ΠΎΠ½ΠΈ ΡƒΠΊΠ°Π·ΡƒΡ˜Ρƒ. Показано јС Π΄Π° сС ΠΊΠΎΡ€ΠΈΡˆΡ›Π΅ΡšΠ΅ΠΌ ΠΏΡ€Π΅Π΄Π»ΠΎΠΆΠ΅Π½ΠΈΡ… ΠΌΠ΅Ρ‚ΠΎΠ΄Π° ΠΈ Ρ‚Π΅Ρ…Π½ΠΈΠΊΠ°, ΠΏΡ€ΠΈΠ»ΠΈΠΊΠΎΠΌ Ρ†ΠΈΡ™Π°ΡšΠ° страна са најновијим ΡΠ°Π΄Ρ€ΠΆΠ°Ρ˜Π΅ΠΌ, ΠΌΠΈΠ½ΠΈΠΌΠΈΠ·ΡƒΡ˜Π΅ Π±Ρ€ΠΎΡ˜ ΠΏΡ€Π΅ΡƒΠ·ΠΈΠΌΠ°ΡšΠ° Π΄ΡƒΠΏΠ»ΠΈΡ€Π°Π½ΠΎΠ³ ΡΠ°Π΄Ρ€ΠΆΠ°Ρ˜Π° ΠΈ ΠΌΠ°ΠΊΡΠΈΠΌΠΈΠ·ΡƒΡ˜Π΅ ΠΈΡΠΊΠΎΡ€ΠΈΡˆΡ›Π΅Π½ΠΎΡΡ‚ Π½Π°Π²ΠΈΠ³Π°Ρ†ΠΈΠΎΠ½Π΅ структурС ΠΈ ΠΏΡƒΡ‚Π°ΡšΠ° Ρ‚Ρ€Π΅Π½ΡƒΡ‚Π½Π΅ Ρ„ΠΎΡ€ΡƒΠΌ Ρ‚Π΅Ρ…Π½ΠΎΠ»ΠΎΠ³ΠΈΡ˜Π΅. ЕкспСримСнти су ΠΈΠ·Π²Π΅Π΄Π΅Π½ΠΈ Π½Π° ΡˆΠΈΡ€ΠΎΠΊΠΎΠΌ спСктру Π²Π΅Ρ› ΠΏΠΎΡΡ‚ΠΎΡ˜Π΅Ρ›ΠΈΡ… ΠΏΠΎΠΏΡƒΠ»Π°Ρ€Π½ΠΈΡ… форумских Ρ‚Π΅Ρ…Π½ΠΎΠ»ΠΎΠ³ΠΈΡ˜Π° ΠΊΠ°ΠΎ ΠΈ Π½Π° ΠΈΠ½Π΄ΠΈΠ²ΠΈΠ΄ΡƒΠ°Π»Π½ΠΈΠΌ stand-alone форумским Ρ‚Π΅Ρ…Π½ΠΎΠ»ΠΎΠ³ΠΈΡ˜Π°ΠΌΠ°. SInFo јС ΠΏΠΎΠΊΠ°Π·Π°ΠΎ високу прСцизност ΠΈ ΠΌΠΈΠ½ΠΈΠΌΠ°Π»Π°Π½ Π±Ρ€ΠΎΡ˜ прСноса Π΄ΡƒΠΏΠ»ΠΎΠ³ ΡΠ°Π΄Ρ€Π°ΠΆΠ°Ρ˜Π° Ρƒ сваком Π½ΠΎΠ²ΠΎΠΌ циклусу ΠΏΡ€Π΅Ρ‚Ρ€Π°ΠΆΠΈΠ²Π°ΡšΠ°. 
Π’Π΅Ρ›ΠΈΠ½Π° Π΄ΡƒΠΏΠ»ΠΈΠΊΠ°Ρ‚Π° Π½Π° којС јС ΠΏΡ€Π΅Π΄Π»ΠΎΠΆΠ΅Π½ΠΈ ΠΏΡ€Π΅Ρ‚Ρ€Π°ΠΆΠΈΠ²Π°Ρ‡ Π½Π°ΠΈΠ»Π°Π·ΠΈΠΎ јС са страна којС су ΠΌΠΎΡ€Π°Π»Π΅ Π±ΠΈΡ‚ΠΈ посСћСнС ΠΊΠ°ΠΊΠΎ Π±ΠΈ сС исправно ΡƒΡ‚Π²Ρ€Π΄ΠΈΠ»Π° Π½Π°Π²ΠΈΠ³Π°Ρ†ΠΈΠΎΠ½Π° ΠΏΡƒΡ‚Π°ΡšΠ° ΠΈΠ»ΠΈ ΠΏΡ€ΠΎΠ½Π°ΡˆΠ°ΠΎ ΠΎΠ΄Π³ΠΎΠ²Π°Ρ€Π°Ρ˜ΡƒΡ›ΠΈ URL. Π”ΠΎΠ΄Π°Ρ‚Π½ΠΎ, ΠΌΠΎΠ΄Π΅Π»ΠΈ машинског ΡƒΡ‡Π΅ΡšΠ°, ΠΈΠ°ΠΊΠΎ су комплСксни постиТу Π΄ΠΎΠ±Ρ€Π΅ пСрформансС ΠΏΡ€ΠΈΠ»ΠΈΠΊΠΎΠΌ ΠΏΡ€Π΅Ρ‚Ρ€Π°ΠΆΠΈΠ²Π°ΡšΠ° ΠΈ ΠΈΠΌΠ°Ρ˜Ρƒ високу прСцизност Ρƒ Π΄Π΅Ρ‚Π΅ΠΊΡ†ΠΈΡ˜ΠΈ ΠΈ Π½ΠΎΡ€ΠΌΠ°Π»ΠΈΠ·Π°Ρ†ΠΈΡ˜ΠΈ Π΄Π°Ρ‚ΡƒΠΌΠ°, достиТући F1-ΠΌΠ΅Ρ€Ρƒ ΠΎΠ΄ 99%.User-generated content on Web forums is added much more often than it is deleted or changed, so its targeting during incremental crawling differs from the Web site pages crawling. Adding new content to a forum can result in moving existing content to new or existing pages. Incremental forum crawling is not a trivial task, because ignoring in which way the content is presented, distributed and sorted can lead to the transfer of posts that have already been indexed in the previous crawl cycles. On the other hand, there is a wide spectrum of forum technologies that allow different navigational paths to its latest posts, as well as different ways of presenting and sorting user generated content. This thesis presents Structure-driven Incremental Forum crawler (SInFo) that specializes in targeting the latest content in incremental forum crawling using advanced optimization techniques and machine learning. The main goal of the presented system is to avoid already indexed content in new crawling cycles regardless of its technology. In order to achieve this, the following Web Forum features have been used: (1) the sort method on the index and thread pages and (2) the available navigation paths between the pages that the current Web Forum technology offers. Since the date of content creation plays an important role in determining the type of sort, their detection and normalization is not a trivial task. 
Machine learning models were used for this task, because the generated dates can be in different formats and in different languages. On the other hand, the detection of navigational paths is achieved by interpreting the URL format and scanning the pages they target. It has been shown that using the proposed methods and techniques while targeting pages with the latest content can achieve a minimum number of duplicate content downloads and maximize the utilization of the navigational structure and paths of the current forum technology. The experiments were performed on a wide range of already existing popular forum technologies as well as on individual stand-alone forum technologies. SInFo has demonstrated high precision and a minimum number of duplicate content transfers in each new crawl cycle. Most of the duplicates that the proposed system encountered are from pages that had to be visited in order to correctly determine the navigational path or to find the appropriate URL. Additionally, machine learning models, although complex, achieved good performance while crawling and have high accuracy in date detection and normalization, reaching an F1-measure of 99%
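    A minimal sketch of the incremental idea, assuming thread pages are sorted newest-first: posts are read until the first one that is not newer than the previous crawl. This is not SInFo itself. The abstract describes machine learning models for date detection and normalization; the fixed `KNOWN_FORMATS` list below is a deliberately simple stand-in, and all names and sample data are hypothetical.

```python
from datetime import datetime
from typing import Iterable, Optional

# Assumed date formats. A fixed list replaces the learned models here.
KNOWN_FORMATS = ["%Y-%m-%d %H:%M", "%d.%m.%Y %H:%M", "%m/%d/%Y %H:%M"]


def normalize_date(raw: str) -> Optional[datetime]:
    """Try each known format and return the first successful parse."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None


def collect_new_posts(pages: Iterable[list], last_crawl: datetime) -> list:
    """pages yields lists of (post_id, raw_date) tuples, newest first."""
    new_posts = []
    for page in pages:
        for post_id, raw_date in page:
            posted = normalize_date(raw_date)
            if posted is not None and posted <= last_crawl:
                # Newest-first order: everything from here on was indexed before.
                return new_posts
            # New post, or unparseable date (kept rather than risk missing it).
            new_posts.append(post_id)
    return new_posts


pages = [
    [("p9", "2024-03-03 10:00"), ("p8", "02.03.2024 18:30")],
    [("p7", "03/01/2024 09:00"), ("p6", "2024-02-27 12:00")],
    [("p5", "2024-02-20 08:00")],
]
last_crawl = datetime(2024, 3, 1, 0, 0)
print(collect_new_posts(pages, last_crawl))  # ['p9', 'p8', 'p7']
```

    If `pages` is a generator that fetches lazily, the early return also means the third page is never downloaded, which is where the saving in duplicate transfers would come from.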

    Vision-based Web Data Records Extraction

    This paper studies the problem of extracting data records from the response pages returned by web databases or search engines. Existing solutions to this problem are based primarily on analyzing the HTML DOM trees and tags of the response pages. While these solutions can achieve good results, they depend too heavily on the specifics of HTML and may have to be changed should the response pages be written in a totally different markup language. In this paper, we propose a novel, language-independent technique to solve the data extraction problem. Our proposed solution performs the extraction using only the visual information of the response pages when they are rendered in web browsers. We analyze several types of visual features in this paper. We also propose a new measure, revision, to evaluate the extraction performance. This measure reflects the perfect-extraction ratio among all response pages. Our experimental results indicate that this vision-based approach can achieve very high extraction accuracy.
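    To make the notion of extraction from visual information concrete, here is a small sketch that groups rendered blocks into records using only their bounding boxes. It is not the algorithm from the paper: the `Block` type, the vertical-gap heuristic and the `max_gap` threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A rendered page block with its bounding box in pixels (hypothetical type)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def group_into_records(blocks: list, max_gap: int = 10) -> list:
    """Group blocks top to bottom, starting a new record at each large vertical gap."""
    ordered = sorted(blocks, key=lambda b: b.y)
    records = []
    for block in ordered:
        if records:
            prev = records[-1][-1]
            gap = block.y - (prev.y + prev.height)
            if gap <= max_gap:
                # Visually close to the previous block: same record.
                records[-1].append(block)
                continue
        records.append([block])
    return records


BLOCKS = [
    Block("Title A", 20, 0, 400, 20),
    Block("Snippet A", 20, 22, 400, 30),
    Block("Title B", 20, 80, 400, 20),
    Block("Snippet B", 20, 102, 400, 30),
]

for record in group_into_records(BLOCKS):
    print([b.text for b in record])
```

    Nothing in the grouping step inspects tags or the DOM, so the same logic would apply to any markup language that a browser can render into positioned blocks.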
