2 research outputs found

    Extraction Model Based on Web Format Information Quantity in Blog Post and Comment Extraction

    Based on information theory, this paper presents a model for extracting blog posts and comments that is based on Web format information quantity. First, the visual position information of the blog Web page and the effective text information are combined to locate the main text, which represents the theme of the page. Second, the format information of the blog Web page is used as the information unit to calculate the information quantity of each block, and the minimal separating information quantity over the candidate split positions is used to detect the boundary between posts and comments in the main text. The model is language-independent and can therefore be applied to blogs written in different natural languages. Experimental results show that the method achieves high precision in locating the main text and in separating posts from comments. Supported by the National Basic Research Program of China (973 Program), Nos. 2004CB318109 and 2007CB311100, and the National High Technology Research and Development Program of China (863 Program), No. 2007AA01Z441.
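    The abstract does not give the model's formulas, so the following is only a rough sketch of what "format information quantity of a block" might look like. It assumes each block is represented by the list of HTML tag names it contains, and that the information carried by one tag is its Shannon self-information under the tag frequencies of the whole page. The function name `format_information` and this representation are hypothetical, not taken from the paper, and the paper's boundary-detection criterion is not reproduced here.

```python
import math
from collections import Counter


def format_information(blocks):
    """Return the information quantity (in bits) of each block.

    `blocks` is a list of blocks; each block is a list of tag names.
    Assumption: a tag's information is -log2(p), where p is the tag's
    relative frequency over all blocks of the page.
    """
    counts = Counter(tag for block in blocks for tag in block)
    total = sum(counts.values())
    return [
        sum(-math.log2(counts[tag] / total) for tag in block)
        for block in blocks
    ]


if __name__ == "__main__":
    # Two toy blocks: a post-like block and a comment-like block.
    page_blocks = [["div", "p", "p"], ["div", "li"]]
    for block, bits in zip(page_blocks, format_information(page_blocks)):
        print(block, round(bits, 3))
```

    Because the measure depends only on markup, not on the words in the text, a sketch like this is consistent with the language independence claimed in the abstract.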

    Using the Web Infrastructure for Real Time Recovery of Missing Web Pages

    Given the dynamic nature of the World Wide Web, missing web pages, or 404 Page not Found responses, are part of our web browsing experience. It is our intuition that information on the web is rarely completely lost; it is just missing. In whole or in part, content often moves from one URI to another and hence it just needs to be (re-)discovered. We evaluate several methods for a "just-in-time" approach to web page preservation. We investigate the suitability of lexical signatures and web page titles to rediscover missing content. It is understood that web pages change over time, which implies that the performance of these two methods depends on the age of the content. We therefore conduct a temporal study of the decay of lexical signatures and titles and estimate their half-life. We further propose the use of tags that users have created to annotate pages as well as the most salient terms derived from a page's link neighborhood. We utilize the Memento framework to discover previous versions of web pages and to execute the above methods. We provide a workflow including a set of parameters that is most promising for the (re-)discovery of missing web pages. We introduce Synchronicity, a web browser add-on that implements this workflow. It works while the user is browsing and detects the occurrence of 404 errors automatically. When activated by the user, Synchronicity offers a total of six methods to either rediscover the missing page at its new URI or discover an alternative page that satisfies the user's information need. Synchronicity depends on user interaction, which enables it to provide results in real time.
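    A lexical signature is commonly understood as a small set of a page's most characteristic terms, which can then be submitted to a search engine as a query. The abstract does not state how the signatures are computed, so the sketch below assumes a plain TF-IDF ranking over a small reference corpus. The function names, the tokenizer, and the smoothed IDF formula are illustrative assumptions, not the authors' implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(page, corpus, k=5):
    """Return the k terms of `page` with the highest TF-IDF score.

    Assumption: IDF is log((n + 1) / (df + 1)), where n is the number
    of corpus documents and df counts those containing the term.
    Ties are broken alphabetically so the result is deterministic.
    """
    docs = [set(tokenize(doc)) for doc in corpus]
    n = len(docs)
    scores = {}
    for term, count in Counter(tokenize(page)).items():
        df = sum(1 for doc in docs if term in doc)
        scores[term] = count * math.log((n + 1) / (df + 1))
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


if __name__ == "__main__":
    corpus = ["the web is large", "the cat sat", "the web archive"]
    page = "memento memento archive the the web"
    print(lexical_signature(page, corpus, k=2))
```

    Terms that appear in every corpus document receive a score of zero under this formula, so common words drop out of the signature without a separate stop-word list.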