446 research outputs found

    XML Schema Clustering with Semantic and Hierarchical Similarity Measures

    With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and to discover useful information in them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic properties and context of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.

    Finding structure and characteristic of web documents for classification.

    by Wong, Wai Ching.
    Thesis (M.Phil.)--Chinese University of Hong Kong, 2000.
    Includes bibliographical references (leaves 91-94).
    Abstracts in English and Chinese.

    Abstract
    Acknowledgments
    Chapter 1 --- Introduction
        1.1 --- Semistructured Data
        1.2 --- Problem Addressed in the Thesis
            1.2.1 --- Labels and Values
            1.2.2 --- Discover Labels for the Same Attribute
            1.2.3 --- Classifying A Web Page
        1.3 --- Organization of the Thesis
    Chapter 2 --- Background
        2.1 --- Related Work on Web Data
            2.1.1 --- Object Exchange Model (OEM)
            2.1.2 --- Schema Extraction
            2.1.3 --- Discovering Typical Structure
            2.1.4 --- Information Extraction of Web Data
        2.2 --- Automatic Text Processing
            2.2.1 --- Stopwords Elimination
            2.2.2 --- Stemming
    Chapter 3 --- Web Data Definition
        3.1 --- Web Page
        3.2 --- Problem Description
    Chapter 4 --- Hierarchical Structure
        4.1 --- Types of HTML Tags
        4.2 --- Tag-tree
        4.3 --- Hierarchical Structure Construction
        4.4 --- Hierarchical Structure Statistics
    Chapter 5 --- Similar Labels Discovery
        5.1 --- Expression of Hierarchical Structure
        5.2 --- Labels Discovery Algorithm
            5.2.1 --- Phase 1: Remove Non-label Nodes
            5.2.2 --- Phase 2: Identify Label Nodes
            5.2.3 --- Phase 3: Discover Similar Labels
        5.3 --- Performance Evaluation of Labels Discovery Algorithm
            5.3.1 --- Phase 1 Results
            5.3.2 --- Phase 2 Results
            5.3.3 --- Phase 3 Results
        5.4 --- Classifying a Web Page
            5.4.1 --- Similarity Measurement
            5.4.2 --- Performance Evaluation
    Chapter 6 --- Conclusion

    Integrating data warehouses with web data : a survey

    This paper surveys the most relevant research on combining Data Warehouse (DW) and Web data. It studies the XML technologies that are currently being used to integrate, store, query, and retrieve Web data and their application to DWs. The paper reviews different DW distributed architectures and the use of XML languages as an integration tool in these systems. It also introduces the problem of dealing with semistructured data in a DW. It studies Web data repositories, the design of multidimensional databases for XML data sources, and the XML extensions of OnLine Analytical Processing techniques. The paper addresses the application of information retrieval technology in a DW to exploit text-rich document collections. The authors hope that the paper will help to discover the main limitations and opportunities offered by combining the DW and Web fields, as well as to identify open research lines.

    A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity

    Since XML became popular for data representation and exchange over the Web, the number of XML documents has increased rapidly. It has become a challenge for researchers to turn these documents into a more useful information utility. In this paper, we introduce a novel clustering algorithm, PCXSS, that groups heterogeneous XML documents according to their similar structural and semantic representations. We develop a global criterion function, CPSim, that progressively measures the similarity between an XML document and existing clusters, avoiding the need to compute the similarity between two individual documents. The experimental analysis shows the method to be fast and accurate.
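
    The progressive idea in this abstract, comparing each incoming document against existing clusters rather than against every other document, can be illustrated with a minimal sketch. This is not PCXSS or CPSim: the Jaccard overlap of path sets, the `threshold` value, the `Cluster` class, and the sample documents are all assumptions chosen only to make the example runnable.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # Union of all element paths seen in member documents (assumed cluster summary).
    paths: set = field(default_factory=set)
    members: list = field(default_factory=list)


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap of two path sets; a stand-in measure, not CPSim."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def progressive_cluster(docs: dict, threshold: float = 0.5) -> list:
    """Assign each document to its best-matching cluster, or open a new one."""
    clusters = []
    for name, paths in docs.items():
        best, best_sim = None, 0.0
        # Compare the document with cluster summaries only, never with single documents.
        for cluster in clusters:
            sim = jaccard(paths, cluster.paths)
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best.paths |= paths
            best.members.append(name)
        else:
            clusters.append(Cluster(paths=set(paths), members=[name]))
    return clusters


# Hypothetical documents, each reduced to its set of root-to-leaf element paths.
docs = {
    "book1.xml": {"book/title", "book/author", "book/year"},
    "book2.xml": {"book/title", "book/author", "book/price"},
    "cd1.xml": {"cd/artist", "cd/track/title"},
}
clusters = progressive_cluster(docs)
for cluster in clusters:
    print(cluster.members)
```

    With one pass over the input, the number of comparisons grows with the number of clusters rather than with the number of document pairs, which is the property the abstract emphasises.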

    A Semantic DOM Approach For Webpage Information Extraction

    With the development of electronic technology and e-commerce, technology for Web pages has attracted considerable research effort and has recently become one of the hottest topics. This paper proposes a semantic DOM (SDOM) approach for information extraction from e-commerce Web pages. By combining content and structure information, the approach achieves good precision and recall, as shown in our experiments on listpage and tablepage data sets.

    Web page cleaning for web mining

    Ph.D. (Doctor of Philosophy)

    Implementation and Web Mounting of the WebOMiner_S Recommendation System

    Quickly extracting information from the large amount of heterogeneous data available on the web from various Business to Consumer (B2C) or e-commerce stores selling similar products (such as laptops), for comparative querying and knowledge discovery, remains a challenge because different web sites structure their web data differently and web data are unstructured. For example: find the best and cheapest deal for a Dell laptop by comparing BestBuy.ca and Amazon.com based on the following specification: Model: Inspiron 15 series, RAM: 16 GB, processor: i5, HDD: 1 TB. The “WebOMiner” and “WebOMiner_S” systems perform automatic extraction by first parsing web HTML source code into a document object model (DOM) tree and then using pattern mining techniques to discover heterogeneous data types (e.g. text, images, links, lists), so that product schemas are extracted and stored in a back-end data warehouse for querying and recommendation. However, a web interface application for this system still needs to be developed to make it accessible to all users on the web. This thesis proposes a Web Recommendation System with a Graphical User Interface, which is mounted readily on the web and is accessible to all users. It also integrates the web data retained from the extraction process, consisting of product features such as product model name, product description, and market price subject to the retailer. The implementation uses “Java server pages (JSP)” for the GUI, designed in HTML, CSS, and JavaScript, and the “Spring framework”, which forms a bridge between the GUI and the data warehouse. An SQL database stores the extracted product schemas for further integration, querying, and knowledge discovery. All the technologies used are compatible with UNIX systems for hosting the application.
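
    The first step named in this abstract, parsing HTML source into a DOM-like tree before mining it, can be sketched as follows. This is only an illustration using Python's standard `html.parser` module, not the WebOMiner or WebOMiner_S code; the `Node` and `TreeBuilder` classes, the `leaf_records` helper, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree from HTML source (a stand-in for a full DOM)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def leaf_records(node, prefix=""):
    """Return (tag path, text) pairs for every node that carries text."""
    path = f"{prefix}/{node.tag}"
    records = [(path, node.text)] if node.text else []
    for child in node.children:
        records.extend(leaf_records(child, path))
    return records


# Hypothetical product listing fragment.
html = "<ul><li><span>Inspiron 15</span><b>$799</b></li></ul>"
builder = TreeBuilder()
builder.feed(html)
records = [r for child in builder.root.children for r in leaf_records(child)]
print(records)
```

    A pattern mining stage could then group records that share the same tag path across pages, which is one plausible way to recover repeated product fields.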