161,845 research outputs found

    Comparative Mining of B2C Web Sites by Discovering Web Database Schemas

    Get PDF
    Discovering potentially useful and previously unknown historical knowledge from heterogeneous E-Commerce (B2C) web site contents to answer comparative queries such as “list all laptop prices from Walmart and Staples between 2013 and 2015 including make, type, screen size, CPU power, year of make”, would require the difficult task of finding the schema of web documents from different web pages, extracting target information and performing web content data integration, building their virtual or physical data warehouse and mining from it. Automatic data extractors (wrappers) such as the WebOMiner system use data extraction techniques based on parsing the web page html source code into a document object model (DOM) tree, traversing the DOM for pattern discovery to recognize and extract different web data types (e.g., text, image, links, and lists). Some limitations of the existing systems include using complicated matching techniques such as tree matching, non-deterministic finite state automata (NFA), domain ontology and inability to answer complex comparative historical and derived queries. This thesis proposes building the WebOMiner_S which uses web structure and content mining approaches on the DOM¬ tree html code to simplify and make more easily extendable the WebOMiner system data extraction process. We propose to replace the use of NFA in the WebOMiner with a frequent structure finder algorithm which uses regular expression matching in Java XPATH parser with its methods to dynamically discover the most frequent structure (which is the most frequently repeated blocks in the html code represented as tags \u3c div class = “ ′′ \u3e) in the Dom tree. This approach eliminates the need for any supervised training or updating the wrapper for each new B2C web page making the approach simpler, more easily extendable and automated. Experiments show that the WebOMiner_S achieves a 100% precision and 100% recall in identifying the product records, 95.55% precision and 100% recall in identifying the data columns

    An active, ontology-driven network service for Internet collaboration

    No full text
    Web portals have emerged as an important means of collaboration on the WWW, and the integration of ontologies promises to make them more accurate in how they serve users’ collaboration and information location requirements. However, web portals are essentially a centralised architecture resulting in difficulties supporting seamless roaming between portals and collaboration between groups supported on different portals. This paper proposes an alternative approach to collaboration over the web using ontologies that is de-centralised and exploits content-based networking. We argue that this approach promises a user-centric, timely, secure and location-independent mechanism, which is potentially more scaleable and universal than existing centralised portals

    BlogForever D2.6: Data Extraction Methodology

    Get PDF
    This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform

    Web Data Extraction, Applications and Techniques: A Survey

    Full text link
    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provided a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques allow to gather a large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users and this offers unprecedented opportunities to analyze human behavior at a very large scale. We discuss also the potential of cross-fertilization, i.e., on the possibility of re-using Web Data Extraction techniques originally designed to work in a given domain, in other domains.Comment: Knowledge-based System

    Design Features for the Social Web: The Architecture of Deme

    Full text link
    We characterize the "social Web" and argue for several features that are desirable for users of socially oriented web applications. We describe the architecture of Deme, a web content management system (WCMS) and extensible framework, and show how it implements these desired features. We then compare Deme on our desiderata with other web technologies: traditional HTML, previous open source WCMSs (illustrated by Drupal), commercial Web 2.0 applications, and open-source, object-oriented web application frameworks. The analysis suggests that a WCMS can be well suited to building social websites if it makes more of the features of object-oriented programming, such as polymorphism, and class inheritance, available to non-programmers in an accessible vocabulary.Comment: Appeared in Luis Olsina, Oscar Pastor, Daniel Schwabe, Gustavo Rossi, and Marco Winckler (Editors), Proceedings of the 8th International Workshop on Web-Oriented Software Technologies (IWWOST 2009), CEUR Workshop Proceedings, Volume 493, August 2009, pp. 40-51; 12 pages, 2 figures, 1 tabl

    An integrated ranking algorithm for efficient information computing in social networks

    Full text link
    Social networks have ensured the expanding disproportion between the face of WWW stored traditionally in search engine repositories and the actual ever changing face of Web. Exponential growth of web users and the ease with which they can upload contents on web highlights the need of content controls on material published on the web. As definition of search is changing, socially-enhanced interactive search methodologies are the need of the hour. Ranking is pivotal for efficient web search as the search performance mainly depends upon the ranking results. In this paper new integrated ranking model based on fused rank of web object based on popularity factor earned over only valid interlinks from multiple social forums is proposed. This model identifies relationships between web objects in separate social networks based on the object inheritance graph. Experimental study indicates the effectiveness of proposed Fusion based ranking algorithm in terms of better search results.Comment: 14 pages, International Journal on Web Service Computing (IJWSC), Vol.3, No.1, March 201
    • …
    corecore