181 research outputs found

    CALA: Classifying Links Automatically based on their URL

    Get PDF
    Web page classification refers to the problem of automatically assigning a web page to one or moreclasses after analysing its features. Automated web page classifiers have many applications, and many re- searchers have proposed techniques and tools to perform web page classification. Unfortunately, the ex- isting tools have a number of drawbacks that makes them unappealing for real-world scenarios, namely:they require a previous extensive crawling, they are supervised, they need to download a page beforeclassifying it, or they are site-, language-, or domain-dependent. In this article, we propose CALA, a toolfor URL-based web page classification. The strongest features of our tool are that it does not require aprevious extensive crawling to achieve good classification results, it is unsupervised, it is based exclu- sively on URL features, which means that pages can be classified without downloading them, and it issite-, language-, and domain-independent, which makes it generally applicable. We have validated ourtool with 22 real-world web sites from multiple domains and languages, and our conclusion is that CALAis very effective and efficient in practice.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08-TIC-4100Ministerio de Ciencia e Innovación TIN2008-04718-EMinisterio de Ciencia e Innovación TIN2010-21744Ministerio de Ciencia e Innovación TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-09988-EMinisterio de Economía y Competitividad TIN2011-15497-EMinisterio de Economía y Competitividad TIN2013-40848-

    Towards Discovering Conceptual Models behind Web Sites

    Get PDF
    Deep Web sites expose data from a database, whose conceptual model remains hidden. Having access to that model is mandatory to perform several tasks, such as integrating different web sites; extracting information from the web unsupervisedly; or creating ontologies. In this paper, we propose a technique to discover the conceptual model behind a web site in the Deep Web, using a statistical approach to discover relationships between entities. Our proposal is unsupervised, not requiring the user to have expert knowledge; and it does not focus on a single view on the database, instead it integrates all views containing entities and relationships that are exposed in the web site.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08-TIC-4100Ministerio de Ciencia e Innovación TIN2008-04718-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-21744Ministerio de Ciencia e Innovación TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-09988-

    MostoDEx: A tool to exchange RDF data using exchange samples

    Get PDF
    The Web is evolving into a Web of Data in which RDF data are becoming pervasive, and it is organised into datasets that share a common purpose but have been developed in isolation. This motivates the need to devise complex integration tasks, which are usually performed using schema mappings; generating them automatically is appealing to relieve users from the burden of handcrafting them. Many tools are based on the data models to be integrated: classes, properties, and constraints. Unfortunately, many data models in the Web of Data comprise very few or no constraints at all, so relying on constraints to generate schema mappings is not appealing. Other tools rely on handcrafting the schema mappings, which is not appealing at all. A few other tools rely on exchange samples but require user intervention, or are hybrid and require constraints to be available. In this article, we present MostoDEx, a tool to generate schema mappings between two RDF datasets. It uses a single exchange sample and a set of correspondences, but does not require any constraints to be available or any user intervention. We validated and evaluated MostoDEx using many experiments that prove its effectiveness and efficiency in practice.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08- TIC-4100Ministerio de Ciencia e Innovación TIN2008-04718-EMinisterio de Ciencia e Innovación TIN2010-21744Ministerio de Ciencia e Innovación TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-09988-EMinisterio de Economía y Competitividad TIN2011-15497-EMinisterio de Economía y Competitividad TIN2013-40848-

    A Tool for Link-Based Web Page Classification

    Get PDF
    Virtual integration systems require a crawler to navigate through web sites automatically, looking for relevant information. This process is online, so whilst the system is looking for the required information, the user is waiting for a response. Therefore, downloading a minimum number of irrelevant pages is mandatory to improve the crawler efficiency. Most crawlers need to download a page to determine its relevance, which results in a high number of irrelevant pages downloaded. In this paper, we propose a classifier that helps crawlers to efficiently navigate through web sites. This classifier is able to determine if a web page is relevant by analysing exclusively its URL, minimising the number of irrelevant pages downloaded, improving crawling efficiency and reducing used bandwidth, making it suitable for virtual integration systems.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08- TIC-4100Ministerio de Ciencia e Innovación TIN2008-04718-EMinisterio de Ciencia e Innovación TIN2010-21744Ministerio de Economía, Industria y Competitividad TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-09988-

    Benchmarking Data Exchange among Semantic-Web Ontologies

    Get PDF
    The increasing popularity of the Web of Data is motivating the need to integrate semantic-web ontologies. Data exchange is one integration approach that aims to populate a target ontology using data that come from one or more source ontologies. Currently, there exist a variety of systems that are suitable to perform data exchange among these ontologies; unfortunately, they have uneven performance, which makes it appealing assessing and ranking them from an empirical point of view. In the bibliography, there exist a number of benchmarks, but they cannot be applied to this context because they are not suitable for testing semantic-web ontologies or they do not focus on data exchange problems. In this paper, we present MostoBM, a benchmark for testing data exchange systems in the context of such ontologies. It provides a catalogue of three real-world and seven synthetic data exchange patterns, which can be instantiated into a variety of scenarios using some parameters. These scenarios help to analyze how the performance of data exchange systems evolves as the exchanging ontologies are scaled in structured and/or data. Finally, we provide an evaluation methodology to compare data exchange systems side by side and to make informed and statistically sound decisions regarding: 1) which data exchange system performs better; and 2) how the performance of a system is influenced by the parameters of our benchmark.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08- TIC-4100Ministerio de Ciencia e Innovación TIN2008-04718-EMinisterio de Ciencia e Innovación TIN2010-21744Ministerio de Ciencia e Innovación TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-09988-EMinisterio de Economía y Competitividad TIN2011-15497-

    An Architecture for Efficient Web Crawling

    Get PDF
    Virtual Integration systems require a crawling tool able to navigate and reach relevant pages in the Deep Web in an efficient way. Existing proposals in the crawling area fulfill some of these requirements, but most of them need to download pages in order to classify them as relevant or not. We propose a crawler supported by a web page classifier that uses solely a page URL to determine page relevance. Such a crawler is able to choose in each step only the URLs that lead to relevant pages, and therefore reduces the number of unnecessary pages downloaded, minimising bandwidth and making it efficient and suitable for virtual integration systems.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08- TIC-4100Ministerio de Ciencia e Innovación TIN2008-04718-EMinisterio de Ciencia e Innovación TIN2010-21744Ministerio de Economía, Industria y Competitividad TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-09988-

    Towards Discovering Ontological Models from Big RDF Data

    Get PDF
    The Web of Data, which comprises web sources that provide their data in RDF, is gaining popularity day after day. Ontological models over RDF data are shared and developed with the consensus of one or more communities. In this context, there usually exist more than one ontological model to understand RDF data, therefore, there might be a gap between the models and the data, which is not negligible in practice. In this paper, we present a technique to automatically discover ontological models from raw RDF data. It relies on a set of SPARQL 1.1 structural queries that are generic and independent from the RDF data. The output of our technique is a model that is derived from these data and includes the types and properties, subtypes, domains and ranges of properties, and minimum cardinalities of these properties. Our technique is suitable to deal with Big RDF Data since our experiments focus on millions of RDF triples, i.e., RDF data from DBpedia 3.2 and BBC. As far as we know, this is the first technique to discover such ontological models in the context of RDF data and the Web of Data.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08-TIC-4100Ministerio de Ciencia e Innovación TIN2010-21744Ministerio de Ciencia e Innovación TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-09988-

    Mapping RDF knowledge bases using exchange samples

    Get PDF
    Nowadays, the Web of Data is in its earliest stages; it is currently organised into a variety of linked knowledge bases that have been developed independently by different organisations. RDF is one of the most popular languages to represent data in this context, which motivates the need to perform complex integration tasks amongst RDF knowledge bases. These tasks are performed using schema mappings, which are declarative specifications of the relationships amongst a source and a target knowledge base. Generating schema mappings automatically is appealing because this relieves users from the burden of handcrafting them. In the literature, the vast majority of proposals are based on the data models of the knowledge bases to be integrated, that is, on classes, properties, and constraints. In the Web of Data, there exist many data models that comprise very few constraints or no constraints at all, which has motivated some researchers to work on an alternate paradigm that does not rely on constraints. Unfortunately, the current proposals that fit this paradigm are not completely automatic. In this article, we present our proposal to automatically generate schema mappings amongst RDF knowledge bases. Its salient features are that it uses a single input exchange sample and a set of input correspondences, but does not require any constraints to be available or any user intervention; it has been validated and evaluated using many experiments that prove that it is effective and efficient in practice; the schema mappings that it produces are GLAV. Other researchers can reproduce our experiments since all of our implementations and repositories are publicly available

    On Benchmarking Data Translation Systems for Semantic-Web Ontologies

    Get PDF
    Data translation, also known as data exchange, is an inte- gration task that aims at populating a target model using data from a source model. This task is gaining importance in the context of semantic-web ontologies due to the increasing interest in graph databases and semantic-web agents. Cur- rently, there are a variety of semantic-web technologies that can be used to implement data translation systems. This makes it di±cult to assess them from an empirical point of view. In this paper, we present a benchmark that provides a catalogue of seven data translation patterns that can be instantiated by means of seven parameters. This allows us to create a variety of synthetic, domain-independent scenar- ios one can use to test existing data translation systems. We also illustrate how to analyse three such systems using our benchmark. The main benefit of our benchmark is that it allows to compare data translation systems side by side within a homogeneous framework.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08-TIC-4100Ministerio de Ciencia e Innovación TIN2008-04718-EMinisterio de Ciencia e Innovación TIN2010-21744Ministerio de Ciencia e Innovación TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-09988-

    Mosto: Generating SPARQL Executable Mappings between Ontologies

    Get PDF
    Data translation is an integration task that aims at populating a target model with data of a source model, which is usually performed by means of mappings. To reduce costs, there are some techniques to automatically generate executable mappings in a given query language, which are executed using a query engine to perform the data translation task. Unfortunately, current approaches to automatically generate executable mappings are based on nested relational models, which cannot be straightforwardly applied to semantic-web ontologies due to some differences between both models. In this paper, we present Mosto, a tool to perform the data translation using automatically generated SPARQL executable mappings. In this demo, ER attendees will have an opportunity to test this automatic generation when performing the data translation task between two different versions of the DBpedia ontology.Ministerio de Educación y Ciencia TIN2007-64119Junta de Andalucía P07-TIC-2602Junta de Andalucía P08- TIC-4100Ministerio de Ciencia e Innovación TIN2008-04718-EMinisterio de Ciencia e Innovación TIN2010-21744Ministerio de Ciencia e Innovación TIN2010-09809-EMinisterio de Ciencia e Innovación TIN2010-10811-EMinisterio de Ciencia e Innovación TIN2010-09988-
    corecore