7,624 research outputs found

    Web Data Extraction, Applications and Techniques: A Survey

    Web Data Extraction is an important problem that has been studied by means of different scientific tools and in a broad range of applications. Many approaches to extracting data from the Web have been designed to solve specific problems and operate in ad-hoc domains. Other approaches, instead, heavily reuse techniques and algorithms developed in the field of Information Extraction. This survey aims at providing a structured and comprehensive overview of the literature in the field of Web Data Extraction. We provide a simple classification framework in which existing Web Data Extraction applications are grouped into two main classes, namely applications at the Enterprise level and at the Social Web level. At the Enterprise level, Web Data Extraction techniques emerge as a key tool to perform data analysis in Business and Competitive Intelligence systems as well as for business process re-engineering. At the Social Web level, Web Data Extraction techniques make it possible to gather the large amount of structured data continuously generated and disseminated by Web 2.0, Social Media and Online Social Network users, which offers unprecedented opportunities to analyze human behavior at a very large scale. We also discuss the potential of cross-fertilization, i.e., the possibility of re-using Web Data Extraction techniques originally designed for a given domain in other domains. Comment: Knowledge-based System

    Mapping Large Scale Research Metadata to Linked Data: A Performance Comparison of HBase, CSV and XML

    OpenAIRE, the Open Access Infrastructure for Research in Europe, comprises a database of all EC FP7 and H2020 funded research projects, including metadata of their results (publications and datasets). These data are stored in an HBase NoSQL database, post-processed, and exposed as HTML for human consumption, and as XML through a web service interface. As an intermediate format to facilitate statistical computations, CSV is generated internally. To interlink the OpenAIRE data with related data on the Web, we aim at exporting them as Linked Open Data (LOD). The LOD export is required to integrate into the overall data processing workflow, where derived data are regenerated from the base data every day. We thus faced the challenge of identifying the best-performing conversion approach. We evaluated the performance of creating LOD with a MapReduce job on top of HBase, by mapping the intermediate CSV files, and by mapping the XML output. Comment: Accepted in 0th Metadata and Semantics Research Conferenc

    Handling Confidential Data on the Untrusted Cloud: An Agent-based Approach

    Cloud computing allows shared computer and storage facilities to be used by a multitude of clients. While cloud management is centralized, the information resides in the cloud and information sharing can be implemented via off-the-shelf techniques for multiuser databases. Users, however, are wary of not having full control over their sensitive data. Untrusted database-as-a-server techniques are neither readily extendable to the cloud environment nor easily understandable by non-technical users. To solve this problem, we present an approach where agents share reserved data in a secure manner through simple grant-and-revoke permissions on shared data. Comment: 7 pages, 9 figures, Cloud Computing 201

    The aDORe federation architecture: digital repositories at scale


    CLOUD-BASED SOLUTIONS IMPROVING TRANSPARENCY, OPENNESS AND EFFICIENCY OF OPEN GOVERNMENT DATA

    A central pillar of open government programs is the disclosure of data held by public agencies using Information and Communication Technologies (ICT). This disclosure relies on the creation of open data portals (e.g. Data.gov) and has subsequently been associated with the expression Open Government Data (OGD). The overall goal of these governmental initiatives is not limited to enhancing the transparency of the public sector; it also aims to raise awareness of how released data can be put to use to enable the creation of new products and services by the private sector. Despite the use of technological platforms to facilitate access to government data, open data portals continue to be organized to serve the goals of public agencies without opening the doors to public accountability, information transparency, public scrutiny, etc. This thesis considers the basic aspects of OGD, including the definition of technical models for organizing such complex contexts, the identification of techniques for combining data from several portals, and the proposal of user interfaces that focus on citizen-centred usability. To deal with these issues, this thesis presents a holistic approach to OGD that aims to go beyond the problems inherent in simple disclosure by providing a tentative answer to the following questions: 1) To what extent do OGD-based applications contribute to the creation of innovative, value-added services? 2) What technical solutions could increase the strength of this contribution? 3) Can Web 2.0 and Cloud technologies favour the development of OGD apps? 4) How should a common framework be designed for developing OGD apps that rely on multiple OGD portals and external web resources? In particular, this thesis focuses on devising computational environments that leverage the content of OGD portals (supporting the initial phase of data disclosure) for the creation of new services that add value to the original data.
The thesis is organized as follows. To offer a general view of OGD, some important aspects of open data initiatives are presented, including the state of the art, the existing approaches for publishing and consuming OGD across web resources, and the factors shaping the value generated through government data portals. Then, an architectural framework is proposed that gathers OGD from multiple sites and supports the development of cloud-based apps that leverage these data along potentially different exploitation routes, ranging from traditional business to specialized support for citizens. The proposed framework is validated by two cloud-based apps, namely ODMap (Open Data Mapping) and NESSIE (A Network-based Environment Supporting Spatial Information Exploration). In particular, ODMap supports citizens in searching and accessing OGD from several web sites. NESSIE organizes data captured from real estate agencies and public agencies (i.e. municipalities, cadastral offices and chambers of commerce) in order to provide citizens with a geographic representation of real estate offers and relevant statistics about the price trend.