Automatic supervised information extraction of structured web data
The overall purpose of this project is, in short, to create a system able to extract vital
information from product web pages just as a human would: information such as the name of the
product, its description, its price, the company that produces it, and so on. At first glance this
may not seem extraordinary or technically difficult, since web scraping techniques have existed for
a long time (the Python library Beautiful Soup, for instance, an HTML parser released in 2004). But
let us consider for a moment what it actually means to extract desired information from
any given web source: the way information is displayed varies enormously, not only visually
but also semantically. For instance, some hotel booking pages display all prices for
the different room types at once, while sites like Amazon present a medium-sized consumer
product in detail and then smaller product recommendations further down the page,
the latter being the layout preferred by most retail companies; and each site has its
own styling and search engine. With that said, the task of mining valuable data from the
web no longer sounds as easy as it first seemed. Hence the purpose of this project is to shed
some light on the Automatic Supervised Information Extraction of Structured Web Data problem.
It is important to ask whether developing such a solution is really valuable at all. An endeavour
of this cost in time and computing resources should lead to a useful end result, at least on paper,
to justify it. In this author's opinion, it does lead to a potentially valuable result. The
targeted extraction of publicly available, consumer-oriented content at large scale in
an accurate, reliable and future-proof manner could provide an incredibly useful and large amount
of data. This data, if kept up to date, could create endless opportunities for Business Intelligence,
although exactly which ones is beyond the scope of this work. A simple metaphor explains the
potential value of this work: if an oil company were told where all the oil reserves on the
planet are, it would still need to invest in machinery, workers and time to exploit them
successfully, but half of the job would already have been done.
As the reader will see in this work, the issue is tackled by building a somewhat complex
architecture that ends in an Artificial Neural Network. A quick overview of the architecture is
as follows: first, find the URLs inside a given site that lead to the product pages containing
the data to be extracted (such as URLs leading to "action figure" products on
ebay.com); second, for each URL, extract its HTML and take a screenshot of the
page, storing this data in a suitable and scalable fashion; third, label the data that will be fed to
the NN; fourth, prepare that data to be input to the NN; fifth, train the NN; and
sixth, deploy the NN to make (hopefully accurate) predictions.
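The extraction and labelling steps of the pipeline above can be sketched as a minimal, offline example. Everything here is an illustrative assumption rather than the project's actual implementation: the sample HTML, the tag-based heuristics standing in for human labelling, and the omission of crawling, screenshots and the neural network itself.

```python
from html.parser import HTMLParser

# Hypothetical product page; in the real pipeline this HTML (plus a
# screenshot) would be fetched per crawled URL and stored for labelling.
SAMPLE_HTML = """
<html><body>
  <h1>Action Figure Deluxe</h1>
  <span class="price">$19.99</span>
  <p class="desc">A poseable collectible figure.</p>
</body></html>
"""

class TextNodeExtractor(HTMLParser):
    """Step two (simplified): collect visible text nodes with tag context."""
    def __init__(self):
        super().__init__()
        self._stack = []
        self.nodes = []  # (tag, class attribute, text) triples

    def handle_starttag(self, tag, attrs):
        self._stack.append((tag, dict(attrs).get("class", "")))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            tag, cls = self._stack[-1]
            self.nodes.append((tag, cls, text))

def label_node(tag, cls, text):
    """Step three (simplified): heuristics standing in for human labels."""
    if tag == "h1":
        return "name"
    if "price" in cls or text.startswith("$"):
        return "price"
    return "description"

parser = TextNodeExtractor()
parser.feed(SAMPLE_HTML)
labelled = [(label_node(t, c, x), x) for t, c, x in parser.nodes]
# labelled pairs each text node with a field label, ready to be turned
# into training examples for the NN.
```

In the real system these labels would come from human annotation and the features would include visual information from the screenshots; the sketch only shows the shape of the data flowing between steps.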
Privacy Issues of the W3C Geolocation API
The W3C's Geolocation API may rapidly standardize the transmission of
location information on the Web, but, in dealing with such sensitive
information, it also raises serious privacy concerns. We analyze the manner and
extent to which the current W3C Geolocation API provides mechanisms to support
privacy. We propose a privacy framework for the consideration of location
information and use it to evaluate the W3C Geolocation API, both the
specification and its use in the wild, and recommend some modifications to the
API as a result of our analysis.
Cross-Browser Document Capture System
A web page is seldom displayed in the exact same manner in different browser and operating system combinations. There are several reasons for different rendering outcomes: interpretation of web standards by the browser, the browser's rendering engine, available fonts in the operating system, plugins installed in the browser, screen resolution, etc. Neglecting to consider these differences as a web designer may lead to webpage layout issues that result in lost customers.
Web designers might consider it common practice to test webpages on several browsers to eliminate cross-browser layout issues. Experiments show that finding visual differences manually is a dull and cumbersome task for people. Knowing this, another member working at Browserbite has created an algorithm that has proved to be much faster and more accurate than humans at finding layout issues. The algorithm works by comparing a baseline (an oracle, in software testing terms) image of a webpage to image captures of the same webpage in different browsers, finding differences in layout and position that a human might consider erroneous.
This thesis concentrates on the problem of creating the input to the aforementioned algorithm. A selective overview of existing solutions and services for webpage capture and automation is given, measuring their performance where possible. A list of requirements is established for a cross-platform capture solution to be commercialized. A fast and cross-platform method of capturing full webpages is then introduced, along with an overview of a scalable Software-as-a-Service system implemented for asynchronous cross-browser and cross-platform capture across several virtual and physical machines.
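The comparison step that this capture system feeds can be illustrated with a toy pixel-grid diff. The grids, threshold and coordinates below are invented stand-ins, not Browserbite's actual data or algorithm, which operates on full screenshots and region-level analysis.

```python
# Baseline vs. capture represented as small 2D grids of pixel intensities.
# The real algorithm compares full-page screenshots; these values are
# illustrative assumptions only.
baseline = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [5, 5, 5, 5],
]
capture = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [5, 5, 7, 7],  # two pixels changed, as if a widget had shifted
]

def diff_pixels(a, b, threshold=1):
    """Return (row, col) coordinates where the capture deviates from
    the baseline by at least `threshold`."""
    return [
        (y, x)
        for y, row in enumerate(a)
        for x, value in enumerate(row)
        if abs(value - b[y][x]) >= threshold
    ]

mismatches = diff_pixels(baseline, capture)
# mismatches -> [(2, 2), (2, 3)]
```

A region of such mismatched coordinates, rather than individual pixels, is what a layout-comparison algorithm would flag as a probable cross-browser rendering defect.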
Proceedings ICPW'07: 2nd International Conference on the Pragmatic Web, 22-23 Oct. 2007, Tilburg: NL
An Experimental Digital Library Platform - A Demonstrator Prototype for the DigLib Project at SICS
Within the framework of the Digital Library project at SICS, this thesis describes the implementation of a demonstrator prototype of a digital library (DigLib); an experimental platform integrating several functions in one common interface. It includes descriptions of the structure and formats of the digital library collection, the tailoring of the search engine Dienst, the construction of a keyword extraction tool, and the design and development of the interface. The platform was realised through sicsDAIS, an agent interaction and presentation system, and is to be used for testing and evaluating various tools for information seeking. The platform supports various user interaction strategies by providing: search in bibliographic records (Dienst); an index of keywords (the Keyword Extraction Function, KEF); and browsing through the hierarchical structure of the collection. KEF was developed for this thesis work, and extracts and presents keywords from Swedish documents. Although based on a comparatively simple algorithm, KEF contributes by filling a long-felt need in the area of Information Retrieval. Evaluations of the tasks and the interface still remain to be done, but the digital library is very much up and running. By implementing the platform through sicsDAIS, DigLib can deploy additional tools and search engines without interfering with already running modules. If desired, agents providing services other than those SICS can supply can be plugged in.
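A frequency-based extractor in the spirit of KEF's "comparatively simple algorithm" might look as follows. The stopword list, length filter and sample sentence are illustrative assumptions, not KEF's actual implementation.

```python
from collections import Counter
import re

# Tiny Swedish stopword list; a real system would use a much larger one.
STOPWORDS = {"och", "att", "det", "som", "en", "ett", "är", "av", "för", "på", "i"}

def extract_keywords(text, top_n=3):
    """Rank non-stopword terms by frequency and return the top candidates."""
    # Tokenize on letters, including the Swedish characters å, ä, ö.
    words = re.findall(r"[a-zåäöA-ZÅÄÖ]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]

# Invented sample sentence, not taken from the DigLib collection.
sample = ("Biblioteket är ett digitalt bibliotek och biblioteket "
          "erbjuder sökning i samlingen av dokument och dokument.")
print(extract_keywords(sample))
```

Real keyword extraction for Swedish would also need stemming or lemmatization (so that "biblioteket" and "bibliotek" count as one term), which this sketch deliberately omits.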