5 research outputs found

    The Internet of Things as a Privacy-Aware Database Machine

    Instead of using a computer cluster with homogeneous nodes and very fast, high-bandwidth connections, we present the vision of using the Internet of Things (IoT) as a database machine. This is, among other things, a key factor for smart (assistive) systems in apartments (AAL, ambient assisted living), offices (AAW, ambient assisted working), Smart Cities, and factories (IIoT, Industry 4.0). It is important to massively distribute the calculation of analysis results across sensor nodes and other low-resource appliances in the environment, not only for reasons of performance but also for reasons of privacy and protection of corporate knowledge. Thus, functions crucial for assistive systems, such as situation, activity, and intention recognition, are to be automatically transformed not only into database queries but also onto local nodes of lower performance. From a database-specific perspective, analysis operations on large quantities of distributed sensor data, currently based on classical big-data techniques and executed on large, homogeneously equipped parallel computers, have to be automatically transformed to run on billions of processors with energy and capacity restrictions. In this visionary paper, we focus on the database-specific perspective and the fundamental research questions in the underlying database theory.
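
As a rough illustration of the idea of distributing analysis onto low-resource nodes for privacy reasons, the following minimal sketch lets each node compute a partial aggregate locally so that only a (sum, count) pair leaves the node. The `SensorNode` class and the `global_average` function are hypothetical names introduced here for illustration; they are not taken from the paper, and the paper's actual query transformation may differ.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class SensorNode:
    """Hypothetical low-resource node that keeps its raw readings local."""
    readings: List[float]

    def partial_aggregate(self) -> Tuple[float, int]:
        # Only the pair (sum, count) leaves the node; raw readings never do.
        return (sum(self.readings), len(self.readings))


def global_average(nodes: Iterable[SensorNode]) -> float:
    """Combine per-node partial aggregates into one global average."""
    total, count = 0.0, 0
    for node in nodes:
        node_sum, node_count = node.partial_aggregate()
        total += node_sum
        count += node_count
    if count == 0:
        raise ValueError("no readings available")
    return total / count


# Illustrative data: three nodes, one of them without readings.
nodes = [SensorNode([20.0, 22.0]), SensorNode([18.0]), SensorNode([])]
avg = global_average(nodes)
```

Averages decompose cleanly into (sum, count) pairs, which is why this aggregate was chosen for the sketch; functions such as activity or intention recognition would need a more involved decomposition.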

    Data-independent space partitionings for summaries

    Histograms are a standard tool in data management for describing multidimensional data. It is often convenient or even necessary to define data-independent histograms, that is, to partition space in advance without observing the data itself. Specific motivations arise when it is not suitable to frequently change the boundaries between histogram cells, for example, when the data is subject to many insertions and deletions, when data is distributed across multiple systems, or when producing a privacy-preserving representation of the data. The baseline approach is to consider an equiwidth histogram, i.e., a regular grid over the space. However, this is not optimal for the objective of splitting the multidimensional space into (possibly overlapping) bins such that each box can be rebuilt using a set of non-overlapping bins with minimal excess (or deficit) of volume. Thus, we investigate how to split the space into bins and identify novel solutions that offer a good balance of desirable properties. As many data processing tools require a dataset as input, we propose efficient methods for obtaining synthetic point sets that match the histograms over the overlapping bins.
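
The equiwidth baseline mentioned in the abstract can be sketched as follows. This is only an illustration of a regular grid whose cell boundaries are fixed in advance, so insertions and deletions update counts without moving any boundary; it is not the paper's proposed partitioning, and the class name `EquiwidthHistogram` and its parameters are assumptions made for this example.

```python
from collections import defaultdict
from typing import DefaultDict, Sequence, Tuple

Cell = Tuple[int, ...]


class EquiwidthHistogram:
    """Regular grid over a fixed box; boundaries never depend on the data."""

    def __init__(self, lows: Sequence[float], highs: Sequence[float],
                 cells_per_dim: int) -> None:
        if len(lows) != len(highs):
            raise ValueError("lows and highs must have the same length")
        if cells_per_dim < 1:
            raise ValueError("cells_per_dim must be positive")
        if any(hi <= lo for lo, hi in zip(lows, highs)):
            raise ValueError("each high must exceed its low")
        self.lows = list(lows)
        self.highs = list(highs)
        self.k = cells_per_dim
        self.counts: DefaultDict[Cell, int] = defaultdict(int)

    def cell_of(self, point: Sequence[float]) -> Cell:
        """Map a point to its grid cell index along every dimension."""
        if len(point) != len(self.lows):
            raise ValueError("point has the wrong dimensionality")
        index = []
        for x, lo, hi in zip(point, self.lows, self.highs):
            if not lo <= x <= hi:
                raise ValueError("point lies outside the histogram domain")
            i = int((x - lo) / (hi - lo) * self.k)
            index.append(min(i, self.k - 1))  # upper boundary joins last cell
        return tuple(index)

    def insert(self, point: Sequence[float]) -> None:
        self.counts[self.cell_of(point)] += 1

    def delete(self, point: Sequence[float]) -> None:
        cell = self.cell_of(point)
        if self.counts[cell] == 0:
            raise ValueError("no point recorded in this cell")
        self.counts[cell] -= 1


# Illustrative usage on the unit square with a 4 x 4 grid.
hist = EquiwidthHistogram([0.0, 0.0], [1.0, 1.0], 4)
hist.insert((0.1, 0.1))
hist.insert((0.9, 0.5))
hist.insert((1.0, 1.0))
hist.delete((0.1, 0.1))
```

Because the grid is defined before any data arrives, two systems that agree on the domain and on `cells_per_dim` can merge their histograms by adding counts cell by cell.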

    Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake

    Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential to improve the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raises the need for effective data integration approaches able to process a large volume of data that is represented in different formats, schemas, and models, and which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources in their original format, which reduces the overhead of materialized data integration. Query processing over Data Lakes requires the semantic description of data collected from heterogeneous data sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques for enabling not only Big Data ingestion and curation into the Semantic Data Lake, but also efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to find relevant data sources and efficient execution plans that minimize the total execution time and maximize the completeness of answers. Existing federated query processing engines employ a coarse-grained description model in which the semantics encoded in data sources are ignored. Such descriptions may lead to the erroneous selection of data sources for a query and unnecessary retrieval of data, thus affecting the performance of the query processing engine. In this thesis, we address the problem of federated query processing against heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates, that describes the knowledge available in a Semantic Data Lake. RDF Molecule Templates (RDF-MTs) describe data sources in terms of an abstract description of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, and query planning and optimization techniques, Ontario, that exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide uniform access to heterogeneous data sources. We then address the challenge of enforcing privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MT-based source descriptions in order to express privacy and access control policies and to enforce them automatically during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the relevant entities to answer a query, but are also regulated by policies that allow access to these relevant entities. Finally, we tackle the problem of interest-based update propagation and co-evolution of data sources. We present a novel approach for interest-based RDF update propagation that consistently maintains a full or partial replication of large datasets and deals with co-evolution.
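
To make the idea of description-driven, policy-aware source selection more concrete, here is a heavily simplified sketch. It assumes that a source description records a semantic concept, the predicates available for it, and the source that holds it, and that a policy is a per-source set of denied predicates. The names `MoleculeTemplate` and `select_sources` and this data layout are hypothetical; they are not the thesis's actual RDF-MT format, nor the MULDER, Ontario, or BOUNCER implementations.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Set


@dataclass(frozen=True)
class MoleculeTemplate:
    """Hypothetical, simplified stand-in for a source description."""
    concept: str                 # semantic concept the entities belong to
    predicates: FrozenSet[str]   # predicates available for that concept
    source: str                  # identifier of the data source


def select_sources(templates: Iterable[MoleculeTemplate],
                   concept: str,
                   query_predicates: Set[str],
                   denied: Dict[str, Set[str]]) -> List[str]:
    """Return sources that can answer the query and are allowed to."""
    selected: List[str] = []
    for template in templates:
        if template.concept != concept:
            continue  # describes a different concept
        if not query_predicates <= template.predicates:
            continue  # source lacks at least one requested predicate
        if query_predicates & denied.get(template.source, set()):
            continue  # policy forbids a requested predicate
        selected.append(template.source)
    return selected


# Illustrative descriptions and policy (all identifiers are made up).
templates = [
    MoleculeTemplate("Patient", frozenset({"name", "diagnosis"}), "hospital_db"),
    MoleculeTemplate("Patient", frozenset({"name"}), "registry_csv"),
    MoleculeTemplate("Drug", frozenset({"name", "dose"}), "pharma_rdf"),
]
denied = {"hospital_db": {"diagnosis"}}

by_name = select_sources(templates, "Patient", {"name"}, denied)
by_diagnosis = select_sources(templates, "Patient", {"name", "diagnosis"}, denied)
```

In this toy setting, a query over `name` alone can use both patient sources, while a query that also asks for `diagnosis` finds no admissible source, because the only source offering that predicate denies access to it.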