Mainly significant Content Mining of Entire Web Page

By P. Sivakumar and Dr. R. M. S Parvathi

Abstract

Users search for the information they need with search engines, which crawl and index web pages according to their informative content. Users are interested only in the informative content, not in non-informative content blocks. Web pages often contain navigation sidebars, advertisements, search blocks, copyright notices, and similar elements that are not content blocks, and the information in these non-content blocks can harm web mining. An algorithm that extracts only the most important content could therefore improve the quality of web page indexing. Almost all previously proposed algorithms are tag dependent, meaning they can only look for primary content within specific tags such as <TABLE> or <DIV>. The proposed technique is tag free and performs the extraction in two phases. First, it transforms the DOM tree obtained from the input HTML page into a block tree, based on visual representation and DOM structure, so that every node carries a specification vector. Second, it traverses the resulting block tree to find the main block, the one whose computed value dominates those of the other block nodes according to its specification vector values. The technique requires no learning phase and can find the informative content of any arbitrary input web page.
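
The two phases described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say what the specification vector contains or how the dominant value is computed, so the features used here (total text length and link text length per node), the 0.6 dominance threshold, and the names Block, BlockTreeBuilder, find_main_block, and extract_main_block are all assumptions made for illustration. The sketch also ignores visual representation, which would require rendering the page, and works from the DOM structure alone.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, and elements whose text is not content.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TAGS = {"script", "style"}


class Block:
    """One node of the block tree, carrying a hypothetical feature vector."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_len = 0  # characters of text anywhere under this node
        self.link_len = 0  # characters of that text that sit inside <a>

    def content_score(self):
        # Assumed scoring rule: text that is not link text counts as content.
        return self.text_len - self.link_len


class BlockTreeBuilder(HTMLParser):
    """Phase 1 (assumed form): build the tree and fill in the vectors."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root
        self.link_depth = 0
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        node = Block(tag, attrs, self.current)
        self.current.children.append(node)
        self.current = node
        self.link_depth += tag == "a"
        self.skip_depth += tag in SKIP_TAGS

    def handle_endtag(self, tag):
        # Find the nearest open element with this tag; ignore stray end tags.
        target = self.current
        while target is not self.root and target.tag != tag:
            target = target.parent
        if target is self.root:
            return
        # Close every element from the current one up to the target.
        node = self.current
        while True:
            self.link_depth -= node.tag == "a"
            self.skip_depth -= node.tag in SKIP_TAGS
            if node is target:
                break
            node = node.parent
        self.current = target.parent

    def handle_data(self, data):
        n = len(data.strip())
        if n == 0 or self.skip_depth:
            return
        # Propagate the text length to the node and all of its ancestors.
        node = self.current
        while node is not None:
            node.text_len += n
            if self.link_depth:
                node.link_len += n
            node = node.parent


def find_main_block(root, threshold=0.6):
    """Phase 2 (assumed form): descend while one child dominates its parent."""
    node = root
    while node.children:
        best = max(node.children, key=Block.content_score)
        if best.content_score() == 0:
            break
        if best.content_score() < threshold * node.content_score():
            break
        node = best
    return node


def extract_main_block(html):
    builder = BlockTreeBuilder()
    builder.feed(html)
    builder.close()
    return find_main_block(builder.root)
```

Calling extract_main_block on a page returns the deepest block that still holds most of the page's non-link text. Because the descent depends only on text proportions and not on tag names, it is tag free in the sense the abstract describes, though a faithful implementation would need the feature set and scoring defined in the full paper.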

Topics: Noise elimination, Informative content
Year: 2014
OAI identifier: oai:CiteSeerX.psu:10.1.1.416.7709
Provided by: CiteSeerX