Knowledge extraction from webpages

Author: Napoli Amedeo
Polanco Xavier
Tenier Sylvain
Toussaint Yannick
Publication venue: HAL CCSD
Publication date: 07/11/2005

http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/

This article presents a system to extract knowledge from webpages by producing semantic annotations. Taking into account semantic information from the domain to annotate an element in a webpage implies solving two problems: (1) identifying the syntactic structure of this element in the webpage and (2) identifying the most specific concept (in terms of subsumption) of the ontology that will be used to annotate this element. Our approach relies on a wrapper-based machine learning algorithm combined with reasoning that makes use of the formal structure of the ontology.

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Knowledge extraction from webpages

Author: Amedeo Napoli
Sylvain Tenier
Xavier Polanco
Yannick Toussaint

Abstract. This article presents a system to extract knowledge from webpages by producing semantic annotations. Taking into account semantic information from the domain to annotate an element in a webpage implies solving two problems: (1) identifying the syntactic structure of this element in the webpage and (2) identifying the most specific concept (in terms of subsumption) of the ontology that will be used to annotate this element. Our approach relies on a wrapper-based machine learning algorithm combined with reasoning that makes use of the formal structure of the ontology.

1 Context of the research

Our system aims at using information provided by research teams on their websites to generate knowledge about the European Research Community. In order to make this information machine-processable, a formal representation of the content of the webpages is needed, encoded with a well-defined syntax and semantics. This is the purpose of semantic annotation [1]. The system is provided with:

– an ontology which represents the concepts of a domain and their relationships. The ontology, implemented in the Web Ontology Language (OWL), is based on Description Logics (DL), and thus reasoning mechanisms such as classification and subsumption are provided [2],
– webpages from which data are extracted according to the ontology.

For each data item in the document, the system generates an individual with the concept and roles it instantiates. Each individual is added to a Knowledge Base (KB). Two main tasks are dealt with. The first locates each data item in the provided documents and extracts it to generate a “raw” individual, which may not be specific enough. It is followed by a reasoning task that infers the most specific concept the individual is an instance of.
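The second task, finding the most specific concept for a "raw" individual, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the system described above relies on an OWL ontology and DL reasoning, whereas the sketch below assumes a toy taxonomy and treats a concept as satisfied when the individual fills a given set of roles. All concept names, role names, and the functions `ancestors` and `most_specific_concepts` are hypothetical.

```python
# Toy taxonomy (hypothetical names): concept -> direct super-concepts.
PARENTS = {
    "Person": [],
    "Researcher": ["Person"],
    "TeamLeader": ["Researcher"],
    "PhDStudent": ["Researcher"],
}

# Assumed sufficient conditions: the roles an individual must fill to be
# recognised as an instance of each concept. A real DL reasoner works from
# the ontology's axioms instead of a lookup table like this one.
REQUIRED_ROLES = {
    "Person": {"hasName"},
    "Researcher": {"hasName", "memberOf"},
    "TeamLeader": {"hasName", "memberOf", "leads"},
    "PhDStudent": {"hasName", "memberOf", "supervisedBy"},
}


def ancestors(concept):
    """Return every concept that strictly subsumes `concept`."""
    seen = set()
    stack = list(PARENTS[concept])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def most_specific_concepts(roles):
    """Return the satisfied concepts that subsume no other satisfied concept."""
    filled = set(roles)
    satisfied = {c for c, needed in REQUIRED_ROLES.items() if needed <= filled}
    subsumers = set()
    for concept in satisfied:
        subsumers |= ancestors(concept)
    return sorted(satisfied - subsumers)


# A "raw" individual as it might come out of the extraction step.
raw_individual = {"hasName": "A. Example", "memberOf": "TeamX", "leads": "TeamX"}
print(most_specific_concepts(raw_individual))
```

Here the individual satisfies Person, Researcher, and TeamLeader; the first two subsume TeamLeader, so only TeamLeader remains as the annotation concept.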

CiteSeerX