43 research outputs found

    A Transducer Model for Web Information Extraction

    In recent years, many authors have paid attention to web information extractors. They usually build on an algorithm that interprets extraction rules that are inferred from examples. Several rule learning techniques are based on transducers, but none of them proposes a generic transducer model for web information extraction. In this paper, we propose a new transducer model that is specifically tailored to web information extraction. The model has proven quite flexible, since we have adapted three techniques in the literature to infer state transitions, and the results prove that it can achieve high precision and recall rates.
    Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
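To illustrate the general idea of transducer-based extraction (a minimal sketch only, not the paper's actual model; the states, predicates, and token stream below are made up), a finite-state transducer can scan an HTML token stream and emit a field whenever a transition fires:

```python
# Hypothetical sketch of a finite-state transducer for web extraction.
# Transition rules are (source state, predicate, target state, emit label).

def run_transducer(tokens, transitions, start="q0"):
    """Scan tokens; a matching transition may emit an output field."""
    state, output = start, []
    for tok in tokens:
        for src, pred, dst, emit in transitions:
            if src == state and pred(tok):
                if emit:
                    output.append((emit, tok))
                state = dst
                break
    return output

# Toy rules: capture the text between <b> and </b> as a "name" field.
transitions = [
    ("q0", lambda t: t == "<b>", "q1", None),
    ("q1", lambda t: not t.startswith("<"), "q2", "name"),
    ("q2", lambda t: t == "</b>", "q0", None),
]

tokens = ["<b>", "Alice", "</b>", "<b>", "Bob", "</b>"]
print(run_transducer(tokens, transitions))
# [('name', 'Alice'), ('name', 'Bob')]
```

In the paper's setting the transitions themselves are not handcrafted as above, but inferred from examples by a rule learning technique.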

    Cloud Configuration Modelling: a Literature Review from an Application Integration Deployment Perspective

    Enterprise Application Integration has played an important role in providing methodologies, techniques and tools to develop integration solutions, aiming at reusing current applications and supporting the new demands that arise from the evolution of business processes in companies. Cloud computing is part of a new reality in which companies have at their disposal a high-capacity IT infrastructure at low cost, on which integration solutions can be deployed and run. The charging model adopted by cloud-computing providers is based on the amount of computing resources consumed by clients. Such demand for resources can be computed either from the implemented integration solution, or from the conceptual model that describes it. It is desirable that cloud-computing providers supply detailed conceptual models describing the variability of their services and the restrictions between them; however, this is not the case, and providers do not supply the conceptual models of their services. The conceptual model of services is the basis to develop a process and provide supporting tools for the decision-making on the deployment of integration solutions to the cloud. In this paper, we review the literature on cloud configuration modelling, and compare current proposals based on a comparison framework that we have developed.

    MostoDEx: A tool to exchange RDF data using exchange samples

    The Web is evolving into a Web of Data in which RDF data are becoming pervasive, and they are organised into datasets that share a common purpose but have been developed in isolation. This motivates the need to devise complex integration tasks, which are usually performed using schema mappings; generating them automatically is appealing to relieve users from the burden of handcrafting them. Many tools are based on the data models to be integrated: classes, properties, and constraints. Unfortunately, many data models in the Web of Data comprise very few or no constraints at all, so relying on constraints to generate schema mappings is not appealing. Other tools rely on handcrafting the schema mappings, which is not appealing at all. A few other tools rely on exchange samples but require user intervention, or are hybrid and require constraints to be available. In this article, we present MostoDEx, a tool to generate schema mappings between two RDF datasets. It uses a single exchange sample and a set of correspondences, but does not require any constraints to be available or any user intervention. We validated and evaluated MostoDEx using many experiments that prove its effectiveness and efficiency in practice.
    Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
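As a rough illustration of correspondence-driven exchange (a sketch only, loosely inspired by the idea above; the predicate names, correspondence table, and triples are invented, and MostoDEx's actual mappings are far richer), source triples can be rewritten into the target vocabulary by a predicate-level correspondence table, without consulting any constraints:

```python
# Hypothetical sketch: rewriting RDF-like triples using predicate
# correspondences. Predicates without a target counterpart are dropped.

def exchange(triples, correspondences):
    """Rewrite each (s, p, o) whose predicate has a target counterpart."""
    return [(s, correspondences[p], o)
            for (s, p, o) in triples if p in correspondences]

source = [
    ("ex:alice", "src:fullName", "Alice"),
    ("ex:alice", "src:cell", "555-0100"),
    ("ex:alice", "src:internalId", "42"),  # no correspondence: dropped
]
correspondences = {"src:fullName": "tgt:name", "src:cell": "tgt:phone"}

print(exchange(source, correspondences))
# [('ex:alice', 'tgt:name', 'Alice'), ('ex:alice', 'tgt:phone', '555-0100')]
```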

    A Conceptual Framework for Efficient Web Crawling in Virtual Integration Contexts

    Virtual Integration systems require a crawling tool able to navigate and reach relevant pages in the Web in an efficient way. Existing proposals in the crawling area are aware of the efficiency problem, but most of them still need to download pages in order to classify them as relevant or not. In this paper, we present a conceptual framework for designing crawlers supported by a web page classifier that relies solely on URLs to determine page relevance. Such a crawler is able to choose in each step only the URLs that lead to relevant pages, and therefore reduces the number of unnecessary pages downloaded, optimising bandwidth and making it efficient and suitable for virtual integration systems. Our preliminary experiments show that such a classifier is able to distinguish between links leading to different kinds of pages, without previous intervention from the user.
    Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-

    CALA: Classifying Links Automatically based on their URL

    Web page classification refers to the problem of automatically assigning a web page to one or more classes after analysing its features. Automated web page classifiers have many applications, and many researchers have proposed techniques and tools to perform web page classification. Unfortunately, the existing tools have a number of drawbacks that make them unappealing for real-world scenarios, namely: they require a previous extensive crawling, they are supervised, they need to download a page before classifying it, or they are site-, language-, or domain-dependent. In this article, we propose CALA, a tool for URL-based web page classification. The strongest features of our tool are that it does not require a previous extensive crawling to achieve good classification results, it is unsupervised, it is based exclusively on URL features, which means that pages can be classified without downloading them, and it is site-, language-, and domain-independent, which makes it generally applicable. We have validated our tool with 22 real-world web sites from multiple domains and languages, and our conclusion is that CALA is very effective and efficient in practice.
    Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-E; Ministerio de Economía y Competitividad TIN2013-40848-
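The flavour of URL-based, unsupervised classification can be sketched as follows (an illustration only: CALA's actual URL patterns are much richer than this digit-wildcard abstraction, and the example URLs are invented). Each URL is reduced to a pattern, and URLs sharing a pattern are grouped into a candidate class, with no page downloads and no labelled training data:

```python
import re
from collections import defaultdict

# Hypothetical sketch of URL-based grouping: replace digit runs with a
# wildcard and cluster URLs that share the resulting pattern.

def url_pattern(url):
    """Abstract a URL into a pattern, e.g. /item/123 -> /item/{N}."""
    path = url.split("://", 1)[-1]
    return re.sub(r"\d+", "{N}", path)

def group_by_pattern(urls):
    groups = defaultdict(list)
    for u in urls:
        groups[url_pattern(u)].append(u)
    return dict(groups)

urls = [
    "http://example.com/item/123",
    "http://example.com/item/456",
    "http://example.com/user/7",
]
print(group_by_pattern(urls))
# {'example.com/item/{N}': ['http://example.com/item/123',
#  'http://example.com/item/456'],
#  'example.com/user/{N}': ['http://example.com/user/7']}
```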

    Towards Discovering Conceptual Models behind Web Sites

    Deep Web sites expose data from a database whose conceptual model remains hidden. Having access to that model is mandatory to perform several tasks, such as integrating different web sites, extracting information from the Web without supervision, or creating ontologies. In this paper, we propose a technique to discover the conceptual model behind a web site in the Deep Web, using a statistical approach to discover relationships between entities. Our proposal is unsupervised, so it does not require the user to have expert knowledge; and it does not focus on a single view of the database, but instead integrates all views containing entities and relationships that are exposed in the web site.
    Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-09988-

    Benchmarking Data Exchange among Semantic-Web Ontologies

    The increasing popularity of the Web of Data is motivating the need to integrate semantic-web ontologies. Data exchange is one integration approach that aims to populate a target ontology using data that come from one or more source ontologies. Currently, there exists a variety of systems that are suitable to perform data exchange among these ontologies; unfortunately, they have uneven performance, which makes it appealing to assess and rank them from an empirical point of view. In the literature, there exist a number of benchmarks, but they cannot be applied to this context because they are not suitable for testing semantic-web ontologies or they do not focus on data exchange problems. In this paper, we present MostoBM, a benchmark for testing data exchange systems in the context of such ontologies. It provides a catalogue of three real-world and seven synthetic data exchange patterns, which can be instantiated into a variety of scenarios using some parameters. These scenarios help to analyse how the performance of data exchange systems evolves as the exchanged ontologies are scaled in structure and/or data. Finally, we provide an evaluation methodology to compare data exchange systems side by side and to make informed and statistically sound decisions regarding: 1) which data exchange system performs better; and 2) how the performance of a system is influenced by the parameters of our benchmark.
    Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-E; Ministerio de Economía y Competitividad TIN2011-15497-

    Towards Discovering Ontological Models from Big RDF Data

    The Web of Data, which comprises web sources that provide their data in RDF, is gaining popularity day after day. Ontological models over RDF data are shared and developed with the consensus of one or more communities. In this context, there usually exists more than one ontological model with which to understand RDF data; therefore, there might be a gap between the models and the data, which is not negligible in practice. In this paper, we present a technique to automatically discover ontological models from raw RDF data. It relies on a set of SPARQL 1.1 structural queries that are generic and independent from the RDF data. The output of our technique is a model that is derived from these data and includes the types and properties, subtypes, domains and ranges of properties, and minimum cardinalities of these properties. Our technique is suitable to deal with Big RDF Data, since our experiments focus on millions of RDF triples, i.e., RDF data from DBpedia 3.2 and BBC. As far as we know, this is the first technique to discover such ontological models in the context of RDF data and the Web of Data.
    Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Ciencia e Innovación TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
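The paper's structural queries are expressed in SPARQL 1.1; purely as an illustration of the kind of summary they compute (types plus property domains; the triples and helper below are made up and much simpler than the actual technique), the same information can be derived directly from a triple list:

```python
from collections import defaultdict

# Illustrative only: derive a tiny schema summary (types and property
# domains) from raw RDF-like triples, mimicking what a generic
# structural query would return. Made-up data.

RDF_TYPE = "rdf:type"

def discover_model(triples):
    types, subject_type = set(), {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.add(o)
            subject_type[s] = o
    domains = defaultdict(set)
    for s, p, o in triples:
        if p != RDF_TYPE and s in subject_type:
            domains[p].add(subject_type[s])
    return types, dict(domains)

triples = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:alice", "ex:name", "Alice"),
    ("ex:b1", "rdf:type", "ex:Book"),
    ("ex:b1", "ex:author", "ex:alice"),
]
types, domains = discover_model(triples)
print(types)    # {'ex:Person', 'ex:Book'}
print(domains)  # {'ex:name': {'ex:Person'}, 'ex:author': {'ex:Book'}}
```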

    An Architecture for Efficient Web Crawling

    Virtual Integration systems require a crawling tool able to navigate and reach relevant pages in the Deep Web in an efficient way. Existing proposals in the crawling area fulfil some of these requirements, but most of them need to download pages in order to classify them as relevant or not. We propose a crawler supported by a web page classifier that uses solely a page's URL to determine page relevance. Such a crawler is able to choose in each step only the URLs that lead to relevant pages, and therefore reduces the number of unnecessary pages downloaded, minimising bandwidth and making it efficient and suitable for virtual integration systems.
    Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-
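The crawl loop behind such an architecture can be sketched as follows (a hypothetical sketch only: `fetch`, `extract_links`, and `is_relevant_url` are stand-ins for the architecture's components, and the link graph is invented). Every outgoing link is scored by the URL-only classifier before it is fetched, so irrelevant pages are never downloaded:

```python
from collections import deque

# Hypothetical sketch of a URL-filtering breadth-first crawl loop.

def crawl(seed, fetch, extract_links, is_relevant_url, limit=100):
    queue, seen, pages = deque([seed]), {seed}, []
    while queue and len(pages) < limit:
        url = queue.popleft()
        pages.append(fetch(url))  # only relevant URLs reach this point
        for link in extract_links(url):
            if link not in seen and is_relevant_url(link):
                seen.add(link)
                queue.append(link)
    return pages

# Toy link graph: pages "x" and "y" are irrelevant and never fetched.
graph = {"a": ["b", "x"], "b": ["c"], "x": ["y"], "c": [], "y": []}
visited = crawl("a",
                fetch=lambda u: u,
                extract_links=lambda u: graph[u],
                is_relevant_url=lambda u: u in {"a", "b", "c"})
print(visited)  # ['a', 'b', 'c']
```

Note that "x" is filtered out at enqueue time, so neither "x" nor "y" is ever passed to `fetch`, which is precisely the bandwidth saving the abstract describes.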

    A Tool for Link-Based Web Page Classification

    Virtual integration systems require a crawler to navigate through web sites automatically, looking for relevant information. This process is online, so whilst the system is looking for the required information, the user is waiting for a response. Therefore, downloading a minimum number of irrelevant pages is mandatory to improve the crawler's efficiency. Most crawlers need to download a page to determine its relevance, which results in a high number of irrelevant pages downloaded. In this paper, we propose a classifier that helps crawlers to efficiently navigate through web sites. This classifier is able to determine whether a web page is relevant by analysing exclusively its URL, minimising the number of irrelevant pages downloaded, improving crawling efficiency and reducing bandwidth usage, which makes it suitable for virtual integration systems.
    Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-E; Ministerio de Ciencia e Innovación TIN2010-21744; Ministerio de Economía, Industria y Competitividad TIN2010-09809-E; Ministerio de Ciencia e Innovación TIN2010-10811-E; Ministerio de Ciencia e Innovación TIN2010-09988-