
446 research outputs found

Discovery of Maximally Frequent Tag Tree Patterns in Semistructured Data (New Developments of Theory of Computation and Algorithms)

Author: Miyahara Tetsuhiro
Shoudai Takayoshi
Uchida Tomoyuki
Publication venue: 京都大学数理解析研究所 (Research Institute for Mathematical Sciences, Kyoto University)
Publication date: 01/05/2001

Kyoto University Research Information Repository

XML Schema Clustering with Semantic and Hierarchical Similarity Measures

Author: Iryadi Wina
Nayak Richi
Publication venue: Elsevier BV
Publication date: 01/01/2007

With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and to discover useful information in them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistics and the context of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.

Crossref

Queensland University of Technology ePrints Archive

Finding structure and characteristic of web documents for classification.

Author: Wong Wai Ching
Publication date: 01/01/2000

by Wong, Wai Ching.
Thesis (M.Phil.)--Chinese University of Hong Kong, 2000.
Includes bibliographical references (leaves 91-94).
Abstracts in English and Chinese.
Abstract --- p.ii
Acknowledgments --- p.v
Chapter 1 --- Introduction --- p.1
Chapter 1.1 --- Semistructured Data --- p.2
Chapter 1.2 --- Problem Addressed in the Thesis --- p.4
Chapter 1.2.1 --- Labels and Values --- p.4
Chapter 1.2.2 --- Discover Labels for the Same Attribute --- p.5
Chapter 1.2.3 --- Classifying A Web Page --- p.6
Chapter 1.3 --- Organization of the Thesis --- p.8
Chapter 2 --- Background --- p.8
Chapter 2.1 --- Related Work on Web Data --- p.8
Chapter 2.1.1 --- Object Exchange Model (OEM) --- p.9
Chapter 2.1.2 --- Schema Extraction --- p.11
Chapter 2.1.3 --- Discovering Typical Structure --- p.15
Chapter 2.1.4 --- Information Extraction of Web Data --- p.17
Chapter 2.2 --- Automatic Text Processing --- p.19
Chapter 2.2.1 --- Stopwords Elimination --- p.19
Chapter 2.2.2 --- Stemming --- p.20
Chapter 3 --- Web Data Definition --- p.22
Chapter 3.1 --- Web Page --- p.22
Chapter 3.2 --- Problem Description --- p.27
Chapter 4 --- Hierarchical Structure --- p.32
Chapter 4.1 --- Types of HTML Tags --- p.33
Chapter 4.2 --- Tag-tree --- p.36
Chapter 4.3 --- Hierarchical Structure Construction --- p.41
Chapter 4.4 --- Hierarchical Structure Statistics --- p.50
Chapter 5 --- Similar Labels Discovery --- p.53
Chapter 5.1 --- Expression of Hierarchical Structure --- p.53
Chapter 5.2 --- Labels Discovery Algorithm --- p.55
Chapter 5.2.1 --- Phase 1: Remove Non-label Nodes --- p.57
Chapter 5.2.2 --- Phase 2: Identify Label Nodes --- p.61
Chapter 5.2.3 --- Phase 3: Discover Similar Labels --- p.66
Chapter 5.3 --- Performance Evaluation of Labels Discovery Algorithm --- p.76
Chapter 5.3.1 --- Phase 1 Results --- p.75
Chapter 5.3.2 --- Phase 2 Results --- p.77
Chapter 5.3.3 --- Phase 3 Results --- p.81
Chapter 5.4 --- Classifying a Web Page --- p.83
Chapter 5.4.1 --- Similarity Measurement --- p.84
Chapter 5.4.2 --- Performance Evaluation --- p.86
Chapter 6 --- Conclusion --- p.8

CUHK Digital Repository

Integrating data warehouses with web data : a survey

Author: Aramburu Cabo María José
Berlanga Llavori Rafael
Pedersen Torben Bach
Pérez Martínez Juan Manuel
Publication venue: IEEE Computer Society
Publication date: 01/01/2008

This paper surveys the most relevant research on combining Data Warehouse (DW) and Web data. It studies the XML technologies that are currently being used to integrate, store, query, and retrieve Web data and their application to DWs. The paper reviews different distributed DW architectures and the use of XML languages as an integration tool in these systems. It also introduces the problem of dealing with semistructured data in a DW. It studies Web data repositories, the design of multidimensional databases for XML data sources, and the XML extensions of OnLine Analytical Processing techniques. The paper addresses the application of information retrieval technology in a DW to exploit text-rich document collections. The authors hope that the paper will help to discover the main limitations and opportunities offered by the combination of the DW and Web fields, as well as to identify open research lines.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Repositori Institucional de la Universitat Jaume I

VBN

A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity

Author: Nayak Richi
Tran Tien
Publication venue: World Scientific Pub Co Pte Lt
Publication date: 01/01/2007

Since XML became popular for data representation and exchange over the Web, the distribution of XML documents has rapidly increased. It has become a challenge for researchers to turn these documents into a more useful information utility. In this paper, we introduce a novel clustering algorithm, PCXSS, that groups heterogeneous XML documents according to their similar structural and semantic representations. We develop a global criterion function, CPSim, that progressively measures the similarity between an XML document and existing clusters, avoiding the need to compute the similarity between two individual documents. The experimental analysis shows the method to be fast and accurate.

CiteSeerX

Queensland University of Technology ePrints Archive

A Semantic DOM Approach For Webpage Information Extraction

Author: Fei Y
Luo Z
Xu Y
Zhang W
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2009

With the development of electronic technology and e-commerce, technology for Web pages has attracted a lot of research effort and has recently become one of the hottest topics. This paper proposes a semantic DOM (SDOM) approach for information extraction from e-commerce Web pages. By combining content and structure information, the approach achieves good precision and recall, as shown in our experiments on listpage and tablepage data sets.

HKU Scholars Hub

Web page cleaning for web mining

Author: YI LAN
Publication date: 01/02/2005

Ph.D. (Doctor of Philosophy)

ScholarBank@NUS

Implementation and Web Mounting of the WebOMiner_S Recommendation System

Author: Chachad Amrutraj Alhad
Publication venue: University of Windsor Leddy Library
Publication date: 01/01/2017

The ability to quickly extract information from the large amount of heterogeneous data available on the web from various Business to Consumer (B2C) or Ecommerce stores selling similar products (such as laptops) for comparative querying and knowledge discovery remains a challenge, because different web sites have different structures for their web data and web data are unstructured. For example: find the best and cheapest deal for a Dell laptop, comparing BestBuy.ca and Amazon.com, based on the following specification: Model: Inspiron 15 series, RAM: 16 GB, processor: i5, HDD: 1 TB. The “WebOMiner” and “WebOMiner_S” systems perform automatic extraction by first parsing web HTML source code into a document object model (DOM) tree and then using pattern mining techniques to discover heterogeneous data types (e.g. text, image, links, lists), so that product schemas are extracted and stored in a back-end data warehouse for querying and recommendation. However, a web interface application for this system still needs to be developed to make it accessible to all users on the web. This thesis proposes a Web Recommendation System with a Graphical User Interface, which is mounted readily on the web and is accessible to all users. It also integrates the web data consisting of all the product features, such as product model name, product description, market price subject to the retailer, etc., retained from the extraction process. The implementation uses “Java server pages (JSP)” for the GUI, designed in HTML, CSS and JavaScript, and the “Spring framework”, which forms a bridge between the GUI and the data warehouse. An SQL database is implemented to store the extracted product schemas for further integration, querying and knowledge discovery. All the technologies used are compatible with UNIX systems for hosting the required application.

Scholarship at UWindsor