
24 research outputs found

Web mining for information extraction from the web using voting and stacked generalization

Author: Sigletos Georgios
Συγλέτος Γεώργιος
Publication venue: 'National Documentation Centre (EKT)'
Publication date: 01/01/2005

Hellenic National Archive of Doctoral Dissertations

Role identification from free text using hidden Markov models

Author: Georgios Paliouras
Georgios Sigletos
Vangelis Karkaletsis
Publication date: 01/01/2002

In this paper we explore the use of hidden Markov models on the task of role identification from free text. Role identification is an important stage of the information extraction process, assigning roles to particular types of entities with respect to a particular event. Hidden Markov models (HMMs) have been shown to achieve good performance when applied to information extraction tasks in both semistructured and free text. The main contribution of this work is the analysis of whether and how linguistic processing of textual data can improve the extraction performance of HMMs. The emphasis is on the minimal use of computationally expensive linguistic analysis. The overall conclusion is that the performance of HMMs is still worse than that of an equivalent manually constructed system. However, clear paths for improvement of the method are shown, aiming at a method that is easily adaptable to new domains.

CiteSeerX

Crossref

Mining Web sites using wrapper induction, named entities, and post-processing

Author: Sigletos, G.
Paliouras, G.
Spyropoulos, C.D.
Hatzopoulos, M.
Publication date: 01/01/2004

This paper presents a new framework for extracting information from collections of Web pages across different sites. In the proposed framework, a standard wrapper induction algorithm is used that exploits named entity information that has been previously identified. The idea of post-processing the extraction results is introduced for resolving ambiguous fields and improving the overall extraction performance. Post-processing involves the exploitation of two additional sources of information: field transition probabilities, based on a trained bigram model, and confidence scores, estimated for each field by the wrapper induction system. A multiplicative model that is based on the product of those two probabilities is also considered for post-processing. Experiments were conducted on pages describing laptop products, collected from many different sites and in four different languages. The results highlight the effectiveness of the new framework. © Springer-Verlag Berlin Heidelberg 2004
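The multiplicative post-processing idea in this abstract can be illustrated with a minimal sketch. It assumes a greedy left-to-right pass in which each ambiguous fragment takes the field that maximizes the product of the bigram transition probability and the wrapper's confidence score. The function name `resolve_fields`, the field labels, and all probabilities are hypothetical and are not taken from the paper.

```python
def resolve_fields(candidates, transitions, start="<s>"):
    """Pick one field per fragment by maximizing transition * confidence.

    candidates:  list of dicts mapping candidate field -> wrapper confidence
    transitions: dict mapping (previous_field, field) -> bigram probability
    """
    resolved = []
    prev = start
    for scores in candidates:
        # Product of the two information sources (the multiplicative model).
        best = max(scores, key=lambda f: transitions.get((prev, f), 0.0) * scores[f])
        resolved.append(best)
        prev = best
    return resolved


# Hypothetical bigram model over laptop-description fields.
transitions = {
    ("<s>", "model"): 0.9, ("<s>", "ram"): 0.1,
    ("model", "ram"): 0.6, ("model", "hdd"): 0.4,
    ("ram", "hdd"): 0.8, ("ram", "ram"): 0.2,
}

# Hypothetical wrapper output: the second and third fragments are ambiguous.
candidates = [
    {"model": 1.0},
    {"ram": 0.5, "hdd": 0.5},
    {"ram": 0.55, "hdd": 0.45},
]

print(resolve_fields(candidates, transitions))  # ['model', 'ram', 'hdd']
```

In the third fragment the wrapper slightly prefers `ram`, but the transition model makes `hdd` the more plausible successor of `ram`, so the product selects `hdd`. A greedy pass is only one possible decoding choice; the abstract does not say how the product is maximized over a page.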

Pergamos : Unified Institutional Repository / Digital Library Platform of the National and Kapodistrian University of Athens

Stacked generalization for information extraction

Author: Constantine D. Spyropoulos
Georgios Paliouras
Georgios Sigletos
Takis Stamatopoulos

This paper defines a new stacked generalization framework in the context of information extraction (IE) from online sources. The proposed setting removes the constraint of applying classifiers at the base-level. A set of IE systems is trained instead to identify relevant fragments within text documents, which differs significantly from the task of classifying candidate text fragments as relevant or not. The templates filled by the base-level IE systems are stacked, forming a set of feature vectors for training a meta-level classifier. Thus, base-level IE systems are combined with a common classifier at meta-level. The proposed framework was evaluated on three Web domains, using well-known IE approaches at base-level and a variety of classifiers at meta-level. Results demonstrate the added value obtained by combining the base-level IE systems in the new framework.

CiteSeerX

Combining Information Extraction Systems Using Voting and Stacked Generalization

Author: Constantine D. Spyropoulos
Georgios Paliouras
Georgios Sigletos
Michalis Hatzopoulos

This article investigates the effectiveness of voting and stacked generalization, also known as stacking, in the context of information extraction (IE). A new stacking framework is proposed that accommodates well-known approaches for IE. The key idea is to perform cross-validation on the base-level data set, which consists of text documents annotated with relevant information, in order to create a meta-level data set that consists of feature vectors. A classifier is then trained using the new vectors. Therefore, base-level IE systems are combined with a common classifier at the meta-level. Various voting schemes are presented for comparing against stacking in various IE domains. Well-known IE systems are employed at the base-level, together with a variety of classifiers at the meta-level. Results show that both voting and stacking work better when relying on probabilistic estimates by the base-level systems. Voting proved to be effective in most domains in the experiments. Stacking, on the other hand, proved to be consistently effective over all domains, doing comparably or better than voting and always better than the best base-level systems. Particular emphasis is also given to explaining the results obtained by voting and stacking at the meta-level, with respect to the varying degree of similarity in the output of the base-level systems.
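As a rough illustration of the difference between voting on hard decisions, voting on probabilistic estimates, and building a meta-level feature vector for stacking, consider the sketch below. The label set `FIELDS`, the three base-level outputs, and all function names are hypothetical; the article's actual voting schemes and feature encoding may differ.

```python
from collections import Counter

FIELDS = ["title", "date", "none"]  # hypothetical label set for one text fragment


def majority_vote(predictions):
    """Return the label predicted by the largest number of base-level systems."""
    return Counter(predictions).most_common(1)[0][0]


def probabilistic_vote(estimates):
    """Sum each label's probability across systems and return the top label."""
    totals = {f: sum(e.get(f, 0.0) for e in estimates) for f in FIELDS}
    return max(totals, key=totals.get)


def meta_vector(estimates):
    """Concatenate per-system probabilities into one meta-level feature vector."""
    return [e.get(f, 0.0) for e in estimates for f in FIELDS]


# Hypothetical probabilistic outputs of three base-level IE systems
# for the same candidate fragment.
estimates = [
    {"title": 0.9, "none": 0.1},
    {"date": 0.6, "title": 0.4},
    {"date": 0.55, "none": 0.45},
]
hard = [max(e, key=e.get) for e in estimates]  # each system's single best label

print(majority_vote(hard))            # date
print(probabilistic_vote(estimates))  # title
print(meta_vector(estimates))
```

Here two systems weakly prefer `date` while one strongly prefers `title`, so the two voting schemes disagree. In a stacking setup, vectors such as the one returned by `meta_vector` would be collected through cross-validation on the base-level data and used to train the meta-level classifier.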

CiteSeerX

Annotating Web pages for the needs of Web Information Extraction applications

Author: Dimitra Farmakiotou
Georgios Paliouras
Georgios Sigletos
Kostas Stamatakis
Vangelis Karkaletsis

This paper outlines our approach to the creation of annotated corpora for the purposes of Web Information Extraction, and presents the Web Annotation tool. This tool enables the annotation of Web pages from different domains and for different information extraction tasks providing a user-friendly interface to human annotators. Annotated information is stored in a representation format that can easily be exploited

CiteSeerX

Meta-learning beyond classification: A framework for information extraction from the Web

Author: Constantine D. Spyropoulos
Georgios Paliouras
Georgios Sigletos
Takis Stamatopoulos

This paper proposes a meta-learning framework in the context of information extraction from the Web. The proposed framework relies on learning a meta-level classifier, based on the output of base-level information extraction systems. Such systems are typically trained to recognize relevant information within documents, i.e., streams of lexical units, which differs significantly from the task of classifying feature vectors that is commonly assumed for meta-learning.

CiteSeerX

PNS: A Personalized News Aggregator on the Web

Author: D Pierrakos
G Paliouras
G Sigletos
K Bharat
L Ardissono
T Kamba
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2008

Crossref

DSpace at NTUA