7 research outputs found

    Multidimensional integration of RDF datasets

    Data providers have been uploading RDF datasets on the web to aid researchers and analysts in finding insights. These datasets, made available by different data providers, contain common characteristics that enable their integration. However, since each provider uses its own data dictionary, identifying common concepts is not trivial, and costly and complex entity resolution and transformation rules are required to perform such integration. In this paper, we propose a novel method that, given a set of independent RDF datasets, provides a multidimensional interpretation of these datasets and integrates them based on a common multidimensional space, if one can be identified. To do so, our method first identifies potential dimensional and factual data in the input datasets and performs entity resolution to merge common dimensional and factual concepts. As a result, we generate a common multidimensional space and identify each input dataset as a cuboid of the resulting lattice. With such output, we are able to exploit open data with OLAP operators in a richer fashion than by dealing with the datasets separately. This research has been funded by the European Commission through the Erasmus Mundus Joint Doctorate Information Technologies for Business Intelligence-Doctoral College (IT4BI-DC) program.
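To make the dimension/fact distinction concrete, the sketch below shows one plausible heuristic of the kind the abstract alludes to: predicates whose objects are numeric literals are treated as measure candidates, while low-cardinality predicates are treated as dimension candidates. This is not the paper's actual algorithm; rdflib, the file name dataset.ttl, and the cardinality threshold are assumptions made purely for illustration.

```python
# A minimal sketch (not the paper's actual method) of a heuristic split of RDF
# predicates into dimension-like and measure-like roles, so that each dataset
# could later be mapped onto a cuboid of a common multidimensional space.
# Assumes rdflib is installed; the input file and threshold are illustrative.
from rdflib import Graph, Literal
from rdflib.namespace import XSD

NUMERIC_TYPES = {XSD.integer, XSD.decimal, XSD.double, XSD.float}

def classify_predicates(graph: Graph, max_dim_cardinality: int = 50):
    """Split predicates into candidate dimensions and candidate measures."""
    values_per_predicate = {}
    for _, p, o in graph:
        values_per_predicate.setdefault(p, set()).add(o)

    dimensions, measures = [], []
    for predicate, values in values_per_predicate.items():
        numeric = all(
            isinstance(v, Literal) and v.datatype in NUMERIC_TYPES for v in values
        )
        if numeric:
            measures.append(predicate)      # numeric literals -> factual data
        elif len(values) <= max_dim_cardinality:
            dimensions.append(predicate)    # few distinct values -> dimension level
    return dimensions, measures

if __name__ == "__main__":
    g = Graph()
    g.parse("dataset.ttl")                  # hypothetical input dataset
    dims, meas = classify_predicates(g)
    print("candidate dimensions:", dims)
    print("candidate measures:", meas)
```

In the paper's terms, the measure candidates would anchor the factual data of a cuboid and the dimension candidates its dimensional concepts, before entity resolution merges equivalent concepts across datasets.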

    An Artificial Intelligence Framework for Supporting Coarse-Grained Workload Classification in Complex Virtual Environments

    This paper proposes an artificial intelligence framework supporting Cloud-based machine learning tools for enhanced Big Data applications, where the main idea is to predict the "next" workload occurring against the target Cloud infrastructure via an innovative ensemble-based approach that combines the effectiveness of different well-known classifiers in order to enhance the overall accuracy of the final classification, which is highly relevant in the current context of Big Data. The so-called workload categorization problem plays a critical role in improving the efficiency and reliability of Cloud-based Big Data applications. Implementation-wise, our method deploys the Cloud entities that participate in the distributed classification approach on top of virtual machines, which represent classical "commodity" settings for Cloud-based Big Data applications. Given a number of known reference workloads and an unknown workload, in this paper we deal with the problem of finding the reference workload that is most similar to the unknown one. The depicted scenario turns out to be useful in a plethora of modern information system applications. We name this problem coarse-grained workload classification because, instead of characterizing the unknown workload in terms of finer behaviors, such as CPU-, memory-, disk-, or network-intensive patterns, we classify the whole unknown workload as one of the (possible) reference workloads. Reference workloads represent a category of workloads that are relevant in a given applicative environment. In particular, we focus our attention on the classification problem described above in the special case of virtualized environments. Today, Virtual Machines (VMs) have become very popular because they offer important advantages to modern computing environments such as cloud computing or server farms. In virtualization frameworks, workload classification is very useful for accounting, security, or user profiling. Hence, our research is particularly relevant in such environments, and it turns out to be very useful in the emerging context of Cloud Computing. In this respect, our approach consists of running several machine learning-based classifiers on different workload models and then deriving the best classification via Dempster-Shafer fusion, in order to magnify the accuracy of the final result. Experimental assessment and analysis clearly confirm the benefits derived from our classification framework. The running programs that produce the unknown workloads to be classified are treated in a similar way. A fundamental aspect of this paper concerns the successful use of data fusion in workload classification. Different types of metrics are fused together using the Dempster-Shafer theory of evidence combination, giving a classification accuracy of slightly less than 80%. The acquisition of data from the running process, the pre-processing algorithms, and the workload classification are described in detail. Various classical algorithms have been used to classify the workloads, and the results are compared.
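The fusion step can be illustrated with a minimal sketch of Dempster's rule of combination, assuming each classifier's scores over the reference workloads are interpreted as a mass function over singleton hypotheses. The workload names and scores below are invented for illustration and do not reflect the paper's experiments.

```python
# A minimal sketch of Dempster's rule of combination applied to the outputs of
# several workload classifiers. Mass functions are dictionaries keyed by
# frozensets of hypotheses; all values here are illustrative assumptions.
from itertools import product

def dempster_combine(m1: dict, m2: dict) -> dict:
    """Combine two mass functions via Dempster's rule (normalized by conflict)."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    return {h: w / (1.0 - conflict) for h, w in combined.items()}

# Hypothetical outputs of three classifiers over two reference workloads.
classifier_masses = [
    {frozenset({"cpu-bound"}): 0.7, frozenset({"io-bound"}): 0.3},
    {frozenset({"cpu-bound"}): 0.6, frozenset({"io-bound"}): 0.4},
    {frozenset({"cpu-bound"}): 0.8, frozenset({"io-bound"}): 0.2},
]

fused = classifier_masses[0]
for m in classifier_masses[1:]:
    fused = dempster_combine(fused, m)
print(max(fused, key=fused.get))   # reference workload with the highest fused belief
```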

    Federated Query Processing over Heterogeneous Data Sources in a Semantic Data Lake

    Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for citizens. Big Data plays an important role in promoting both manufacturing and scientific development through industrial digitization and emerging interdisciplinary research. Open data initiatives have encouraged the publication of Big Data by exploiting the decentralized nature of the Web, allowing for the availability of heterogeneous data generated and maintained by autonomous data providers. Consequently, the growing volume of data consumed by different applications raises the need for effective data integration approaches able to process large volumes of data represented in different formats, schemas, and models, which may also include sensitive data, e.g., financial transactions, medical procedures, or personal data. Data Lakes are composed of heterogeneous data sources kept in their original format, which reduces the overhead of materialized data integration. Query processing over Data Lakes requires a semantic description of the data collected from heterogeneous data sources. A Data Lake with such semantic annotations is referred to as a Semantic Data Lake. Transforming Big Data into actionable knowledge demands novel and scalable techniques for enabling not only Big Data ingestion and curation into the Semantic Data Lake, but also efficient large-scale semantic data integration, exploration, and discovery. Federated query processing techniques utilize source descriptions to find relevant data sources and to produce efficient execution plans that minimize total execution time and maximize the completeness of answers. Existing federated query processing engines employ a coarse-grained description model in which the semantics encoded in the data sources are ignored. Such descriptions may lead to the erroneous selection of data sources for a query and to the unnecessary retrieval of data, thus affecting the performance of the query processing engine. In this thesis, we address the problem of federated query processing against heterogeneous data sources in a Semantic Data Lake. First, we tackle the challenge of knowledge representation and propose a novel source description model, RDF Molecule Templates (RDF-MTs), that describes the knowledge available in a Semantic Data Lake. RDF-MTs describe data sources in terms of an abstract description of entities belonging to the same semantic concept. Then, we propose a technique for data source selection and query decomposition, the MULDER approach, and query planning and optimization techniques, Ontario, that exploit the characteristics of heterogeneous data sources described using RDF-MTs and provide uniform access to these sources. We then address the challenge of enforcing the privacy and access control requirements imposed by data providers. We introduce a privacy-aware federated query technique, BOUNCER, able to enforce privacy and access control regulations during query processing over data sources in a Semantic Data Lake. In particular, BOUNCER exploits RDF-MT-based source descriptions in order to express privacy and access control policies as well as to enforce them automatically during source selection, query decomposition, and planning. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over data sources that not only contain the relevant entities to answer a query, but are also regulated by policies that allow these entities to be accessed. Finally, we tackle the problem of interest-based update propagation and co-evolution of data sources. We present a novel approach for interest-based RDF update propagation that consistently maintains full or partial replications of large datasets and deals with co-evolution.
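A minimal sketch of the source-selection idea behind RDF-MTs follows: each source is summarized by the predicates its molecule templates expose, and triple patterns are routed only to sources that cover them. The dictionaries and triple patterns are illustrative stand-ins, not the actual MULDER, Ontario, or BOUNCER data structures.

```python
# A minimal sketch of RDF-MT-style source selection: route each triple pattern
# only to the sources whose descriptions mention its predicate. All source
# names, predicates, and patterns below are hypothetical.
from collections import defaultdict

# Hypothetical RDF-MT descriptions: source -> set of predicates it can answer.
rdf_mts = {
    "drugbank_endpoint": {"ex:activeIngredient", "ex:interactsWith"},
    "clinical_csv_lake": {"ex:enrolledIn", "ex:diagnosis"},
}

def select_sources(triple_patterns, descriptions):
    """Group triple patterns by the sources whose RDF-MTs cover the predicate."""
    plan = defaultdict(list)
    for s, p, o in triple_patterns:
        for source, predicates in descriptions.items():
            if p in predicates:
                plan[source].append((s, p, o))
    return dict(plan)

query = [
    ("?drug", "ex:activeIngredient", "?compound"),
    ("?patient", "ex:diagnosis", "?disease"),
]
print(select_sources(query, rdf_mts))
# -> each source receives only the query fragment it can actually answer
```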

    Breaking data silos with Federated Learning

    Federated learning has been recognized as a promising technology with the potential to revolutionize the field of Artificial Intelligence (AI). By leveraging its decentralized nature, it has the potential to overcome known barriers to AI, such as data acquisition and privacy, paving the way for unprecedented advances in AI. This dissertation argues for the benefits of this technology as a catalyst for the irruption of AI in both the public and private sectors. Federated learning promotes cooperation among otherwise competitive entities by enabling cooperative efforts to achieve a common goal. In this dissertation, I investigate the goodness-of-fit of this technology in several contexts, with a focus on its application in power systems, financial institutions, and public administrations. The dissertation comprises five papers that investigate various aspects of federated learning in these contexts. In particular, the first two papers explore promising avenues in the energy sector, where federated learning offers a compelling solution for privately exploiting the vast amounts of data owned in a decentralized fashion by consumers. The third paper elaborates on another paradigmatic example, in which federated learning is used to foster cooperation among financial institutions to produce accurate credit risk models. The fourth paper contrasts with the previous ones, which centered on the private sector, by elaborating on use cases of federated learning that help public administrations reduce barriers to cooperation. Lastly, the fifth article acts as the finale of this dissertation: it compiles the earlier work and elaborates on the constraints and opportunities associated with adopting this technology, as well as a framework for doing so.
    R-AGR-3787 - EU 2020 - MDOT (01/07/2020 - 31/12/2023) - FRIDGEN Gilbert; R-AGR-3728 - PEARL/IS/13342933/DFS (01/01/2020 - 31/12/2024) - FRIDGEN Gilbert
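The cooperation pattern the dissertation builds on can be sketched with a minimal FedAvg-style loop in which each institution trains on its private data and only model parameters, never raw data, are shared and averaged. The linear model, the simulated client data, and the hyperparameters below are illustrative assumptions and do not reproduce the dissertation's experiments.

```python
# A minimal FedAvg-style sketch: clients run local gradient steps on private
# data and a server averages the resulting weights. Plain NumPy linear model;
# all data and hyperparameters are illustrative.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's gradient steps on its private data (least-squares loss)."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def federated_round(global_w, clients):
    """Average client updates weighted by local dataset size."""
    updates = [(local_update(global_w, X, y), len(y)) for X, y in clients]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):                      # three "institutions" with private data
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(20):                     # communication rounds
    w = federated_round(w, clients)
print(w)                                # approaches true_w without pooling the data
```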

    Exploring the impact of Big Data analytics capability on port performance: The mediating role of sustainability

    Many ports are redefining business processes and operations by adopting digital technologies. These can help them to provide efficient and competitive port operations and meet the growing demand for comprehensive port logistics services. Digital technologies generate immense amounts of data, known as Big Data. Port management needs the capability to store, process and analyse Big Data to derive meaningful information and thus to maximise organisational performance. Furthermore, one of the most important trends in port development is increased sustainability awareness among regulators and customers. Port managers can employ Big Data technology to reduce environmental pollution and use resources more efficiently, improving the sustainability of ports. Resource-based theory provides a useful theoretical framework to investigate these issues, including the impact on port performance. There is some evidence that port sustainability has a mediating role in the association between Big Data analytics capability (BDAC) and port performance. However, more research is needed to investigate the association between BDAC and port performance and to explore the mediating role of port sustainability. To address this research gap, this thesis employs a multi-phase approach to investigate the impact of BDAC on port performance and the role of port sustainability in this context. In phase one of the empirical study, a conceptual model for the structural relationship between BDAC, port sustainability and port performance was developed by examining the existing research literature. After a pilot survey to examine the validity and reliability of the survey instrument, a survey was conducted in the world’s top 50 ports, which provided 175 valid responses for assessing the model. The results from these questionnaires were analysed by Partial Least Squares Structural Equation Modelling (PLS-SEM). Analysis of the collected data revealed four main findings. Firstly, this study provides evidence that managerial skills and a data-driven culture play a significant role in developing the BDAC of ports. The second major finding provides empirical evidence that BDAC positively enhances port performance. Thirdly, the findings show that ports can improve sustainability by developing BDAC. Finally, the findings highlight that port sustainability mediates the relationship between BDAC and port performance. Ports that aim to improve performance should leverage BDAC to implement sustainable port strategies. The study makes several theoretical and practical contributions. The main contribution of this study is the development of a hierarchical model based on resource-based theory to evaluate the impact of BDAC on port performance, providing a better understanding of how ports build BDAC and of its significant role in port performance. Moreover, this study reveals the mechanism driving the impact of BDAC on port performance, providing a deeper understanding of the significance of sustainability. Furthermore, this study provides practical guidance for port managers to assist them in formulating clear strategies to build and utilise BDAC. Port managers should leverage port sustainability to catalyse the impact of BDAC on port performance.
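As a toy illustration of the mediation logic (not the thesis's PLS-SEM analysis), the sketch below bootstraps the indirect effect of BDAC on performance through sustainability from two ordinary least-squares fits on simulated data; the variable names, effect sizes, and sample size are assumptions.

```python
# A minimal sketch of checking a mediation effect "BDAC -> sustainability ->
# performance" by bootstrapping the indirect effect a*b. Simulated data only.
import numpy as np

rng = np.random.default_rng(42)
n = 175                                        # sample size mirroring the survey
bdac = rng.normal(size=n)
sustainability = 0.5 * bdac + rng.normal(scale=0.8, size=n)            # path a
performance = 0.3 * bdac + 0.6 * sustainability + rng.normal(size=n)   # paths c', b

def indirect_effect(x, m, y):
    a = np.polyfit(x, m, 1)[0]                 # slope of mediator ~ predictor
    b = np.linalg.lstsq(np.column_stack([x, m, np.ones_like(x)]), y, rcond=None)[0][1]
    return a * b                               # indirect effect via the mediator

boot = []
for _ in range(2000):
    idx = rng.integers(0, n, size=n)
    boot.append(indirect_effect(bdac[idx], sustainability[idx], performance[idx]))
low, high = np.percentile(boot, [2.5, 97.5])
print(f"indirect effect 95% CI: [{low:.3f}, {high:.3f}]")   # excluding 0 suggests mediation
```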

    Investigations into anomaly detection in automotive electronic control units using distributed observers, with a focus on the plausibility checking of communication signals

    The two major automotive trends of connectivity and highly automated driving offer many opportunities, but also pose risks, especially in combination. On the one hand, the vehicle is increasingly connected to its environment, which significantly enlarges the attack surface for unauthorized access. On the other hand, electronic control units (ECUs) are given control over safety-relevant functions. To keep the risk and the potential consequences of a successful attack as low as possible, protection should be implemented at several levels. The focus of this work is on the innermost protection level, and specifically on the monitoring of in-vehicle communication. For this purpose, academia and industry recommend, among other measures, the use of intrusion detection/intrusion prevention systems. The concept developed here takes up this proposal and, in refining it, accounts for ECU-specific constraints such as the comparatively static vehicle network and the limited resources. The result is a hybrid approach consisting of classical monitoring rules and self-learning algorithms. It is suitable not only for in-vehicle communication, but equally for ECU-internal information exchange, the interaction between application and basic software, and the monitoring of runtime and memory properties. The overarching goal is holistic ECU monitoring and thus improved protection in terms of security. Deviations from the expected behavior, so-called anomalies, are detected regardless of their cause, be it a deliberate attack or a malfunction. This approach can therefore also contribute to improving safety, especially when applications and algorithms that change or evolve during a vehicle's life cycle need to be protected. The second part of the work focuses on the plausibility checking of individual communication signals. Since their possible behavior is not formally specified, self-learning methods are used for this task. In addition to the analysis and selection of fundamentally suitable algorithms, performance evaluation is a central challenge: the anomalies to be detected are diverse, and usually only reference data of normal behavior is available in sufficient quantity. For this reason, different anomaly types are defined, which structure the synthesis of anomalies into normal data and thus allow an evaluation based on the detection rate. The evaluation results show that signal plausibility checking using artificial neural networks (autoencoders) is promising. Finally, this work therefore examines the challenges of realizing such networks on automotive ECUs and provides corresponding figures for the required runtime and memory consumption.
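A minimal sketch of the autoencoder-based signal plausibility check described above: a small network is trained only on windows of normal signal behavior, and windows whose reconstruction error exceeds a threshold derived from that normal data are flagged as anomalies. PyTorch, the sine-wave stand-in signal, the window size, and the 3-sigma threshold are assumptions, not the thesis's configuration.

```python
# A minimal autoencoder-based plausibility check for a single communication
# signal: train on normal windows only, then flag windows with unusually high
# reconstruction error. All signal data and parameters are illustrative.
import torch
import torch.nn as nn

WINDOW = 32

class SignalAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(WINDOW, 8), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(8, WINDOW))

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Normal behaviour: smooth sine-like windows as a stand-in for a bus signal.
t = torch.linspace(0, 100, 4000)
signal = torch.sin(t)
windows = signal.unfold(0, WINDOW, WINDOW)          # non-overlapping windows

model = SignalAutoencoder()
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
for _ in range(200):
    optim.zero_grad()
    loss = loss_fn(model(windows), windows)
    loss.backward()
    optim.step()

# Threshold derived from the reconstruction errors on normal data only.
with torch.no_grad():
    errors = ((model(windows) - windows) ** 2).mean(dim=1)
threshold = errors.mean() + 3 * errors.std()

# A window with an injected jump should exceed the threshold.
anomalous = windows[0].clone()
anomalous[10:15] += 5.0
with torch.no_grad():
    err = ((model(anomalous) - anomalous) ** 2).mean()
print(bool(err > threshold))                        # True -> flagged as anomaly
```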