Search CORE

161,847 research outputs found

Comparative Mining of B2C Web Sites by Discovering Web Database Schemas

Author: Peravali Bindu
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2016
Field of study

Discovering potentially useful and previously unknown historical knowledge from heterogeneous E-Commerce (B2C) web site contents to answer comparative queries such as “list all laptop prices from Walmart and Staples between 2013 and 2015 including make, type, screen size, CPU power, year of make”, would require the difficult task of finding the schema of web documents from different web pages, extracting target information and performing web content data integration, building their virtual or physical data warehouse and mining from it. Automatic data extractors (wrappers) such as the WebOMiner system use data extraction techniques based on parsing the web page html source code into a document object model (DOM) tree, traversing the DOM for pattern discovery to recognize and extract different web data types (e.g., text, image, links, and lists). Some limitations of the existing systems include using complicated matching techniques such as tree matching, non-deterministic finite state automata (NFA), domain ontology and inability to answer complex comparative historical and derived queries. This thesis proposes building the WebOMiner_S which uses web structure and content mining approaches on the DOM¬ tree html code to simplify and make more easily extendable the WebOMiner system data extraction process. We propose to replace the use of NFA in the WebOMiner with a frequent structure finder algorithm which uses regular expression matching in Java XPATH parser with its methods to dynamically discover the most frequent structure (which is the most frequently repeated blocks in the html code represented as tags \u3c div class = “ ′′ \u3e) in the Dom tree. This approach eliminates the need for any supervised training or updating the wrapper for each new B2C web page making the approach simpler, more easily extendable and automated. Experiments show that the WebOMiner_S achieves a 100% precision and 100% recall in identifying the product records, 95.55% precision and 100% recall in identifying the data columns

Scholarship at UWindsor

An active, ontology-driven network service for Internet collaboration

Author: Courtenage Simon
Feeney Kevin
Lewis David
Tiropanis Thanassis
Publication venue: Sun SITE Central Europe
Publication date: 06/08/2004
Field of study

Web portals have emerged as an important means of collaboration on the WWW, and the integration of ontologies promises to make them more accurate in how they serve users’ collaboration and information location requirements. However, web portals are essentially a centralised architecture resulting in difficulties supporting seamless roaming between portals and collaboration between groups supported on different portals. This paper proposes an alternative approach to collaboration over the web using ontologies that is de-centralised and exploits content-based networking. We argue that this approach promises a user-centric, timely, secure and location-independent mechanism, which is potentially more scaleable and universal than existing centralised portals

Southampton (e-Prints Soton)

BlogForever D2.6: Data Extraction Methodology

Author: Banos V.
Davis R.
Gkotsis G.
Pincent E.
Stepanyan K.
Publication venue
Publication date: 25/10/2013
Field of study

This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Web Data Extraction, Applications and Techniques: A Survey

Author: Abel
Amalfitano
Balduzzi
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Baumgartner
Berger
Berthold
Bettencourt
Califf
Catanese
Chang
Chen
Chen
Chen
Collins
Conover
Crandall
Crescenzi
Crescenzi
Dalvi
Dalvi
De Meo
De Meo
Doan
Emilio Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Ferrara
Flesca
Freitag
Furche
Gatterbauer
Gatterbauer
Giacomo Fiumara
Gjoka
Gkotsis
Gottlob
Gottlob
Hammersley
Han
Hecht
Hsu
Irmak
Khare
Kim
Kinsella
Kleinberg
Kleinberg
Kohlschütter
Kokkoras
Kokkoras
Kokkoras
Krüpl
Kushmerick
Kwak
Laender
Liu
Manning
Masanès
Mathes
Meng
Mislove
Monge
Muslea
Oro
Pan
Pasquale De Meo
Perito
Phan
Plake
Rahm
Rahm
Reis
Robert Baumgartner
Sahuguet
Sarawagi
Schifanella
Selkow
Shi
Soderland
Szomszor
Turmo
Vosecky
Wang
Wang
Weikum
Wilson
Winograd
Yang
Ye
Zafarani
Zanasi
Zhai
Zhang
Zhang
Publication venue: 'Elsevier BV'
Publication date: 09/06/2014
Field of study

Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provided a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities to analyze human behavior at a very large scale. We discuss also the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.Comment: Knowledge-based System

arXiv.org e-Print Archive

Crossref

Design Features for the Social Web: The Architecture of Deme

Author: Davies Todd
Mintz Mike D.
Publication venue
Publication date: 19/02/2013
Field of study

We characterize the "social Web" and argue for several features that are desirable for users of socially oriented web applications. We describe the architecture of Deme, a web content management system (WCMS) and extensible framework, and show how it implements these desired features. We then compare Deme on our desiderata with other web technologies: traditional HTML, previous open source WCMSs (illustrated by Drupal), commercial Web 2.0 applications, and open-source, object-oriented web application frameworks. The analysis suggests that a WCMS can be well suited to building social websites if it makes more of the features of object-oriented programming, such as polymorphism, and class inheritance, available to non-programmers in an accessible vocabulary.Comment: Appeared in Luis Olsina, Oscar Pastor, Daniel Schwabe, Gustavo Rossi, and Marco Winckler (Editors), Proceedings of the 8th International Workshop on Web-Oriented Software Technologies (IWWOST 2009), CEUR Workshop Proceedings, Volume 493, August 2009, pp. 40-51; 12 pages, 2 figures, 1 tabl

arXiv.org e-Print Archive

CiteSeerX

An integrated ranking algorithm for efficient information computing in social networks

Author: Suri Pushpa R.
Taneja Harmunish
Publication venue: 'Academy and Industry Research Collaboration Center (AIRCC)'
Publication date: 06/04/2012
Field of study

Social networks have ensured the expanding disproportion between the face of WWW stored traditionally in search engine repositories and the actual ever changing face of Web. Exponential growth of web users and the ease with which they can upload contents on web highlights the need of content controls on material published on the web. As definition of search is changing, socially-enhanced interactive search methodologies are the need of the hour. Ranking is pivotal for efficient web search as the search performance mainly depends upon the ranking results. In this paper new integrated ranking model based on fused rank of web object based on popularity factor earned over only valid interlinks from multiple social forums is proposed. This model identifies relationships between web objects in separate social networks based on the object inheritance graph. Experimental study indicates the effectiveness of proposed Fusion based ranking algorithm in terms of better search results.Comment: 14 pages, International Journal on Web Service Computing (IJWSC), Vol.3, No.1, March 201

arXiv.org e-Print Archive

Crossref