Using the DOM tree for content extraction

Author: David Insa
Francesco Tiezzi
Josep Silva
Sergio López
Publication venue: 'Open Publishing Association'
Publication date: 01/01/2012

The main information of a webpage is usually mixed with menus, advertisements, panels, and other not necessarily related information, and it is often difficult to isolate this information automatically. This is precisely the objective of content extraction, a research area of wide interest due to its many applications. Content extraction is useful not only for the final human user; it is also frequently used as a preprocessing stage by systems that need to extract the main content of a web document so that they avoid processing other, useless information. Another interesting application of content extraction is displaying webpages on small screens such as mobile phones or PDAs. In this work we present a new technique for content extraction that uses the DOM tree of the webpage to analyze the hierarchical relations of its elements. Thanks to this information, the technique achieves considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words and tags), but it also gives us very precise information about the related components in a block, thus producing very cohesive blocks.

López, S.; Silva Galiana, JF.; Insa Cabrera, D. (2012). Using the DOM tree for content extraction. Electronic Proceedings in Theoretical Computer Science. 98(Proceedings 8th International Workshop on Automated Specification and Verification of Web Systems):46-59. doi:10.4204/EPTCS.98

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

RiuNet

Archivio istituzionale della ricerca - Università di Camerino

Using the words/leafs ratio in the DOM tree for content extraction

Author: David Insa
Josep Silva
Salvador Tamarit
Publication venue: 'Elsevier BV'
Publication date: 01/11/2013

The main content of a webpage is usually centered and visible without the need to scroll. It is often surrounded by the navigation menus of the website, and it can include advertisements, panels, banners, and other not necessarily related information. The process of automatically extracting the main content of a webpage is called content extraction. Content extraction is a research area of wide interest due to its many applications. Concretely, it is useful not only for the final human user; it is also frequently used as a preprocessing stage by systems (e.g., robots, indexers, and crawlers) that need to extract the main content of a web document so that they avoid processing other, useless information. In this work we present a new technique for content extraction that is based on the information contained in the DOM tree. The technique analyzes the hierarchical relations of the elements in the webpage and the distribution of textual information in order to identify the main block of content. Thanks to the hierarchy imposed by the DOM tree, the technique achieves considerable recall and precision. Using the DOM structure for content extraction gives us the benefits of other approaches based on the syntax of the webpage (such as characters, words and tags), but it also gives us very precise information about the related components in a block (not necessarily textual ones, such as images or videos), thus producing very cohesive blocks. © 2013 Elsevier Inc. All rights reserved.

This work has been partially supported by the Spanish Ministerio de Economia y Competitividad (Secretaria de Estado de Investigacion, Desarrollo e Innovacion) under Grant TIN2008-06622-C03-02 and by the Generalitat Valenciana under Grant PROMETEO/2011/052. Salvador Tamarit was partially supported by the Spanish MICINN under FPI Grant BES-2009-015019. David Insa was partially supported by the Spanish Ministerio de Educacion under FPU Grant AP2010-4415.

Insa Cabrera, D.; Silva Galiana, JF.; Tamarit, S. (2013). Using the words/leafs ratio in the DOM tree for content extraction. Journal of Logic and Algebraic Programming. 82(8):311-325. https://doi.org/10.1016/j.jlap.2013.01.002
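
As a rough illustration of the idea named in this record's title, the sketch below computes a words-per-leaf figure for every element of a small HTML fragment. It is not the authors' algorithm: the abstract does not define the ratio, so the definition used here (words in a subtree divided by the number of childless elements in that subtree) is an assumption, as are the names `Node`, `TreeBuilder`, and `word_leaf_ratios`. Text is attributed to its enclosing element rather than modelled as separate text nodes, which is a further simplification.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}
# Elements whose text is not visible content.
SKIP = {"script", "style"}


class Node:
    """Hypothetical minimal element node (not a full DOM node)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.words = 0  # words in text directly inside this element


class TreeBuilder(HTMLParser):
    """Builds a simplified element tree from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP:
            self.current.words += len(data.split())


def word_leaf_ratios(html):
    """Return (tag, words, leaves, words / leaves) per element, post-order."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    rows = []

    def visit(node):
        words, leaves = node.words, 0
        for child in node.children:
            child_words, child_leaves = visit(child)
            words += child_words
            leaves += child_leaves
        leaves = leaves or 1  # a childless element counts as one leaf
        rows.append((node.tag, words, leaves, words / leaves))
        return words, leaves

    visit(builder.root)
    return rows


if __name__ == "__main__":
    page = ("<body><ul><li><a>Home</a></li><li><a>News</a></li></ul>"
            "<div><p>one two three four five six</p>"
            "<p>seven eight nine ten</p></div></body>")
    for tag, words, leaves, ratio in word_leaf_ratios(page):
        print(f"{tag:6} words={words:2} leaves={leaves} ratio={ratio:.1f}")
```

On this fragment the menu-like `ul` has a much lower ratio than the text-heavy `div`, which is the kind of signal such a ratio is meant to expose. How the published technique selects the final block from these values is not stated in the abstract and is not attempted here.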

Crossref

RiuNet