4 research outputs found

Research of Web Information Extraction Based on Tree structure

Author: 任仲晟
Publication date: 30/05/2007

With the rapid development of the Internet, the Web has grown into a vast, distributed, and shared information resource. Most Web data is currently published as HTML pages. Because HTML describes semi-structured data, Web pages are suited to human browsing, but applications cannot directly parse and use the rich information they contain. Web information extraction emerged to make Web data more usable and to support more value-added services. By wrapping existing Web information sources (wrappers), it extracts the information on Web pages in structured form so that applications can use it; the technique therefore has broad prospects and is one of the current research hotspots in the database field. This thesis first briefly introduces the basic concepts of Web information extraction and outlines Web information extraction tech...

Degree: Master of Engineering. Department and major: School of Information Science and Technology, Department of Computer Science, Computer Software and Theory. Student ID: 20042801

Xiamen University Institutional Repository

Web Information Extraction Based on Tree Structure

Authors: 任仲晟, 薛永生
Publication date: 01/01/2009

This paper proposes a tree-structure-based algorithm for extracting structured Web data, motivated by the inadequacies of existing methods. The algorithm builds on the hierarchical tree structure of HTML and comprises four parts: HTML tree construction, data region mining, data record mining, and record schema generation. It cleans Web pages using information such as the layout position of page elements, mines data regions by hierarchical clustering, and generates record schemas through tree matching to complete data item extraction. Experimental results show that the algorithm can improve the accuracy and efficiency of Web data extraction.

Funding: National Natural Science Foundation of China, grant 50474033; Natural Science Foundation of Fujian Province, grant A0310008; Fujian Province Key Science and Technology Project, grant 2003H043.

Xiamen University Institutional Repository
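
As a rough illustration of the first two steps named in the abstract above (HTML tree construction and data region mining), the following sketch builds a tag tree with Python's standard `html.parser` and flags any node whose children repeat the same structural signature. It is not the authors' algorithm: the `Node` class, the `signature` heuristic, and the `min_repeats` threshold are all assumptions made for this example, and the page-cleaning and tree-matching steps are omitted.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for a sketch).
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Node:
    """A node of the tag tree (hypothetical helper, not from the paper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def signature(self):
        # Structural signature: the tag plus the signatures of its children.
        inner = ",".join(child.signature() for child in self.children)
        return f"{self.tag}({inner})"


class TreeBuilder(HTMLParser):
    """Builds a tag-only tree; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_data_regions(node, min_repeats=2):
    """Return (parent tag, repeated child signature, count) for each candidate region."""
    regions = []
    counts = Counter(child.signature() for child in node.children)
    if counts:
        sig, n = counts.most_common(1)[0]
        if n >= min_repeats:
            regions.append((node.tag, sig, n))
    for child in node.children:
        regions.extend(find_data_regions(child, min_repeats))
    return regions
```

Calling `find_data_regions(build_tree(page))` on a page whose `<ul>` holds several identically structured `<li>` items would report the `<ul>` as a candidate region. Exact signature equality is a deliberately strict stand-in for the approximate tree matching the abstract describes.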

Structured Data Extraction Based on Web Page Tags

Authors: 任仲晟, 薛永生
Publication date: 15/10/2007

This paper studies the problem of extracting structured data from data-intensive Web pages and proposes an extraction algorithm based on page tags. The method first determines the nesting relationships between tag elements from their display position and size on screen and constructs a simplified HTML tree (SimHTree for short), which reduces the time needed to identify data records. As a second step, a substring matching and adjustment algorithm identifies the data records and labels the data items. Experimental results show that the proposed method is effective and efficient.

Funding: National Natural Science Foundation of China, grant 50474033; Natural Science Foundation of Fujian Province, grant A0310008; Fujian Province Key Science and Technology Project, grant 2003H043.

Xiamen University Institutional Repository
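
The abstract above infers nesting between tags from their on-screen position and size. One plausible reading, sketched here under stated assumptions, is to treat each rendered element as a rectangle and take its parent to be the smallest rectangle that fully encloses it. The `Box` type, its fields, and `nesting_parents` are hypothetical names introduced for this example; the paper's actual SimHTree construction and its substring matching step are not reproduced.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Box:
    """Rendered bounding box of a tag (hypothetical type for this sketch)."""
    name: str
    x: int
    y: int
    width: int
    height: int

    @property
    def area(self) -> int:
        return self.width * self.height

    def contains(self, other: "Box") -> bool:
        # True when `other` lies entirely within this box (edges may touch).
        return (
            self.x <= other.x
            and self.y <= other.y
            and self.x + self.width >= other.x + other.width
            and self.y + self.height >= other.y + other.height
        )


def nesting_parents(boxes: List[Box]) -> Dict[str, Optional[str]]:
    """Map each box name to the name of its smallest strictly larger enclosing box."""
    parents: Dict[str, Optional[str]] = {}
    for box in boxes:
        enclosing = [b for b in boxes if b.contains(box) and b.area > box.area]
        parent = min(enclosing, key=lambda b: b.area, default=None)
        parents[box.name] = parent.name if parent else None
    return parents
```

Given a page box, a list box inside it, and two item boxes inside the list, `nesting_parents` maps both items to the list and the list to the page. The coordinates would in practice come from a rendering engine, which this sketch assumes is available.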

Automatic Web Information Extraction Based on Maximal and Frequent Equivalence Classes

Authors: 任仲晟, 张东站, 薛永生, 陈华昌
Publication date: 25/12/2006

Building on a definition of templates, this paper proposes a page creation model that describes how values from a back-end database are encoded into pages using a template. Based on this model, it presents EBMFEC, an extraction algorithm based on MFEC (Maximal and Frequent Equivalence Classes), to automatically extract data from data-intensive Web pages. EBMFEC takes a set of template-generated pages as input, analyzes the occurrences of page tokens to discover the MFEC, deduces the unknown template used to generate the pages, and then uses that template to extract the values encoded in them. Experiments on a large number of real HTML pages indicate that EBMFEC deduces the template and extracts the data correctly in most cases.

Funding: National Natural Science Foundation of China, grant 50474033; Natural Science Foundation of Fujian Province, grant A0310008; Fujian Province Key Science and Technology Project, grant 2003H043.

Xiamen University Institutional Repository
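
To make the idea of token equivalence classes in the abstract above more concrete, the sketch below groups tokens that occur the same number of times in every input page, on the intuition that template tokens behave this way while data values do not. This is only an approximation under assumptions: the abstract does not define "maximal" or "frequent" precisely, so the `min_size` threshold and the requirement that a class appear in every page are choices made for this example, and `tokenize` and `equivalence_classes` are hypothetical names rather than parts of EBMFEC.

```python
import re
from collections import defaultdict


def tokenize(page):
    # Treat each tag and each whitespace-delimited text fragment as one token.
    return re.findall(r"<[^>]+>|[^<\s]+", page)


def equivalence_classes(pages, min_size=2):
    """Group tokens by their per-page occurrence counts.

    Returns a dict mapping an occurrence vector (one count per page) to the
    sorted tokens sharing it. Only classes with at least `min_size` tokens
    that occur in every page are kept (an assumption of this sketch).
    """
    tokenized = [tokenize(page) for page in pages]
    vocabulary = set().union(*tokenized)
    groups = defaultdict(list)
    for token in vocabulary:
        vector = tuple(tokens.count(token) for tokens in tokenized)
        groups[vector].append(token)
    return {
        vector: sorted(tokens)
        for vector, tokens in groups.items()
        if len(tokens) >= min_size and min(vector) > 0
    }
```

On two pages generated from the same template, the surviving class would hold the shared tags and labels, while values that differ between pages fall outside it. Deducing the template and extracting the values from the gaps between class tokens are further steps that this sketch leaves out.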