5 research outputs found

    Integrating Deep-Web Information Sources

    Deep-web information sources are difficult to integrate into automated business processes when they only provide a search form. A wrapping agent is a piece of software that allows a developer to query such information sources without worrying about the details of interacting with those forms. Our goal is to help software engineers construct wrapping agents that interpret queries written in high-level structured languages. We expect this to reduce integration costs, because it relieves developers of the burden of transforming their queries into low-level interactions in an ad-hoc manner. In this paper, we report on our reference framework, review the related work, and highlight current research challenges, with the aim of guiding future research efforts in this area.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-
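
    The wrapping-agent idea described above can be pictured as a thin layer that accepts structured attribute/value conditions and translates them into a search-form submission. The following Python sketch is a minimal illustration under assumed names (FormWrapper, field_map, the example URL); it is not the authors' implementation, and a real agent would also extract structured records from the result pages.

        # Minimal sketch of a wrapping agent: it exposes a structured query
        # interface and translates it into a search-form submission.
        # All names (FormWrapper, field_map, the example URL) are hypothetical.

        import requests

        class FormWrapper:
            def __init__(self, form_url, field_map):
                # form_url:  the action URL of the site's search form
                # field_map: maps query attributes to the form's input names
                self.form_url = form_url
                self.field_map = field_map

            def query(self, **conditions):
                # Translate attribute=value conditions into form parameters.
                unknown = set(conditions) - set(self.field_map)
                if unknown:
                    raise ValueError(f"unsupported attributes: {unknown}")
                params = {self.field_map[attr]: value
                          for attr, value in conditions.items()}
                # Submit the form and return the raw result page; a real agent
                # would also parse structured records out of the response.
                response = requests.get(self.form_url, params=params, timeout=10)
                response.raise_for_status()
                return response.text

        # Example usage (hypothetical site and field names):
        # books = FormWrapper("https://example.org/search",
        #                     {"title": "q_title", "author": "q_author"})
        # page = books.query(title="data integration")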

    On using high-level structured queries for integrating deep-web information sources

    The actual value of the Deep Web comes from integrating the data its applications provide. Such applications offer human-oriented search forms as their entry points, and a number of tools exist to fill them in and retrieve the resulting pages programmatically. Solutions that rely on these tools are usually costly, which has motivated a number of researchers to work on virtual integration, also known as metasearch. Virtual integration abstracts away from the actual search forms by providing a unified search form: a programmer fills it in and the virtual integration system translates it into the application search forms. We argue that virtual integration costs might be reduced further if another abstraction level is provided by issuing structured queries in high-level languages such as SQL, XQuery or SPARQL; this abstracts away from search forms altogether. As far as we know, no proposal in the literature addresses this problem. In this paper, we propose a reference framework called IntegraWeb to address the problem of using high-level structured queries to perform deep-web data integration. Furthermore, we provide a comprehensive report on existing proposals from the database integration and Deep Web research fields, which can be used in combination to address our problem within the previous reference framework.
    Funding: Ministerio de Ciencia y Tecnología TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
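
    The extra abstraction level advocated above can be pictured as a thin mediator that accepts a high-level structured query and fans it out to the wrapped search forms of the underlying applications. The following Python sketch illustrates the idea under strong simplifications: the tiny SQL-like grammar, the Mediator class and the toy sources are hypothetical stand-ins, not the IntegraWeb framework.

        # Minimal sketch of a virtual-integration (metasearch) layer that accepts
        # a tiny SQL-like query and fans it out to several deep-web sources.
        # The query grammar, the Mediator class and the toy sources are all
        # illustrative assumptions.

        import re

        class Mediator:
            def __init__(self):
                self.sources = []   # callables: conditions dict -> list of records

            def register(self, source):
                self.sources.append(source)

            def query(self, sql_like):
                # Accepts queries of the form: SELECT * WHERE attr = 'value' [AND ...]
                conditions = dict(re.findall(r"(\w+)\s*=\s*'([^']*)'", sql_like))
                results = []
                for source in self.sources:
                    # Each source hides its own search form behind this call.
                    results.extend(source(conditions))
                return results

        # Toy sources standing in for wrapped search forms:
        def library_a(conditions):
            return [{"source": "A", **conditions}]

        def library_b(conditions):
            return [{"source": "B", **conditions}]

        if __name__ == "__main__":
            m = Mediator()
            m.register(library_a)
            m.register(library_b)
            print(m.query("SELECT * WHERE title = 'data integration'"))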

    CALA: An unsupervised URL-based web page classification system

    Unsupervised web page classification refers to the problem of clustering the pages in a web site so that each cluster contains pages that belong to a single class. The existing proposals for web page classification do not fulfill a number of requirements that would make them suitable for enterprise web information integration, namely: to rely on a lightweight crawl, so as to avoid interfering with the normal operation of the web site; to be unsupervised, which avoids the need for a training set of pre-classified pages; and to use features from outside the page to be classified, which avoids having to download it. In this article, we propose CALA, a new automated approach to generating URL-based web page classifiers. Our proposal builds a number of URL patterns that represent the different classes of pages in a web site, so that further pages can be classified by matching their URLs to the patterns. Its salient features are that it fulfills all of the previous requirements and that it has been validated by a number of experiments using real-world, top-visited web sites. Our validation shows that CALA is very effective and efficient in practice.
    Funding: Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-
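
    CALA, as described above, builds URL patterns for the classes of pages in a site and classifies further pages by matching their URLs against those patterns. The following Python sketch illustrates that idea under a simple assumption: segments on which the example URLs of a class disagree are generalized to a wildcard. CALA's actual pattern-building procedure is more elaborate.

        # Minimal sketch of URL-pattern-based page classification: example URLs of
        # each class are generalized segment by segment into a wildcard pattern,
        # and new URLs are classified by matching them against those patterns.
        # The generalization rule below is a simplification, not CALA's own.

        from urllib.parse import urlparse

        WILDCARD = "*"

        def build_pattern(urls):
            # Keep a path segment literally only if every example agrees on it;
            # otherwise replace it with a wildcard.
            split = [urlparse(u).path.strip("/").split("/") for u in urls]
            length = min(len(s) for s in split)
            pattern = []
            for i in range(length):
                segments = {s[i] for s in split}
                pattern.append(segments.pop() if len(segments) == 1 else WILDCARD)
            return pattern

        def matches(url, pattern):
            segments = urlparse(url).path.strip("/").split("/")
            return (len(segments) == len(pattern) and
                    all(p in (WILDCARD, s) for s, p in zip(segments, pattern)))

        def classify(url, patterns):
            # patterns: class label -> pattern; returns the first matching label.
            for label, pattern in patterns.items():
                if matches(url, pattern):
                    return label
            return None

        if __name__ == "__main__":
            patterns = {
                "product": build_pattern(["https://example.org/item/123",
                                          "https://example.org/item/456"]),
                "review":  build_pattern(["https://example.org/item/123/reviews",
                                          "https://example.org/item/456/reviews"]),
            }
            print(classify("https://example.org/item/789", patterns))          # product
            print(classify("https://example.org/item/789/reviews", patterns))  # review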

    KDC: a knowledge-based approach to document classification

    Master's dissertation - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2015.
    Document classification provides a way to organize information, allowing a better understanding and interpretation of the available data. The classification task is characterized by the association of class labels with documents, with the aim of creating semantic clusters. The exponential increase in the number of documents and digital data demands more precise, comprehensive and efficient ways to search for and organize information. In this context, improving document classification techniques with the use of semantic information is considered essential. Thus, this work proposes a knowledge-based approach to document classification. The technique extracts terms from documents and associates them with concepts of an open-domain knowledge base. Then, the concepts are generalized to a higher level of abstraction. Finally, a disparity value between the generalized concepts and the document is calculated, and the concept with the lowest disparity is taken as a class label applicable to the document. The proposed technique offers advantages over conventional methods, such as no need for training, the option to assign one or multiple classes to a document, and the capacity to classify documents on different subjects without changing the classifier.
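
    The pipeline above (extract terms, map them to knowledge-base concepts, generalize, and label with the least-disparate concept) can be sketched as follows in Python. The toy in-memory knowledge base and the frequency-based disparity score are hypothetical stand-ins for the open-domain knowledge base and the disparity measure used in the dissertation.

        # Minimal sketch of knowledge-based document classification:
        #   1. extract terms from the document,
        #   2. map terms to concepts in a knowledge base,
        #   3. generalize concepts to broader ones,
        #   4. score each generalized concept by its disparity with the document
        #      and pick the one with the lowest disparity as the class label.
        # The toy knowledge base and the disparity measure are illustrative only.

        from collections import Counter
        import re

        # Toy knowledge base: term -> concept, and concept -> broader concept.
        TERM_TO_CONCEPT = {
            "goalkeeper": "football", "midfielder": "football",
            "guitar": "music", "drummer": "music",
        }
        BROADER = {"football": "sport", "music": "art"}

        def extract_terms(text):
            return re.findall(r"[a-z]+", text.lower())

        def classify(text):
            terms = extract_terms(text)
            concepts = [TERM_TO_CONCEPT[t] for t in terms if t in TERM_TO_CONCEPT]
            if not concepts:
                return None
            generalized = [BROADER.get(c, c) for c in concepts]
            counts = Counter(generalized)
            # Disparity: fraction of mapped terms NOT covered by the candidate
            # concept; the candidate with the lowest disparity wins.
            disparity = {c: 1 - counts[c] / len(generalized) for c in counts}
            return min(disparity, key=disparity.get)

        if __name__ == "__main__":
            print(classify("The goalkeeper and the midfielder trained all week."))  # sport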

    Structure-Based Crawling in the Hidden Web

    The number of applications that need to crawl the Web to gather data is growing at an ever-increasing pace. In some cases, the criterion that determines which pages must be included in a collection is based on their contents; in others, it would be wiser to use a structure-based criterion. In this article, we present a proposal to build structure-based crawlers that requires just a few examples of the pages to be crawled and an entry point to the target web site. Our crawlers can deal with form-based web sites. Unlike other proposals, ours does not require a sample database to fill in the forms, nor does it require heavy user interaction. Our experiments show that our precision is 100% on seventeen real-world web sites, with both static and dynamic content, and that our recall is 95% on the eleven static web sites examined.
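
    The crawler above is driven by the structure of a few example pages rather than by their content. The Python sketch below captures one possible structural test a crawler could apply: a fetched page is kept when the set of its HTML tag paths is similar enough to those of the example pages. The similarity measure (Jaccard over tag paths), the threshold and all names are illustrative assumptions, not the paper's actual technique.

        # Minimal sketch of a structure-based page filter for a crawler: a page is
        # accepted when the set of its HTML tag paths is similar enough to the tag
        # paths of a few example pages. The measure and threshold are illustrative.

        from html.parser import HTMLParser

        class TagPathCollector(HTMLParser):
            # Records every root-to-tag path seen while parsing, e.g. "html/body/div".
            def __init__(self):
                super().__init__()
                self.stack, self.paths = [], set()

            def handle_starttag(self, tag, attrs):
                self.stack.append(tag)
                self.paths.add("/".join(self.stack))

            def handle_endtag(self, tag):
                if tag in self.stack:
                    del self.stack[self.stack.index(tag):]

        def tag_paths(html):
            collector = TagPathCollector()
            collector.feed(html)
            return collector.paths

        def structural_similarity(html_a, html_b):
            a, b = tag_paths(html_a), tag_paths(html_b)
            return len(a & b) / len(a | b) if a | b else 1.0

        def looks_like_examples(html, example_htmls, threshold=0.7):
            return any(structural_similarity(html, ex) >= threshold
                       for ex in example_htmls)

        if __name__ == "__main__":
            example = "<html><body><div><h1>Title</h1><p>text</p></div></body></html>"
            candidate = "<html><body><div><h1>Other</h1><p>more</p></div></body></html>"
            unrelated = "<html><body><form><input></form></body></html>"
            print(looks_like_examples(candidate, [example]))   # True
            print(looks_like_examples(unrelated, [example]))   # False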