    Distributed Holistic Clustering on Linked Data

    Link discovery is an active field of research to support data integration in the Web of Data. Due to the huge size and number of available data sources, efficient and effective link discovery is a very challenging task. Common pairwise link discovery approaches do not scale to many sources with very large entity sets. We here propose a distributed holistic approach to link many data sources based on a clustering of entities that represent the same real-world object. Our clustering approach provides a compact and fused representation of entities, and can identify errors in existing links as well as many new links. We support a distributed execution of the clustering approach to achieve faster execution times and scalability for large real-world data sets. We provide a novel gold standard for multi-source clustering, and evaluate our methods with respect to effectiveness and efficiency for large data sets from the geographic and music domains
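
    The idea of grouping linked entities into clusters can be illustrated with a minimal sketch. It is not the authors' distributed algorithm: it simply computes connected components over existing links with a union-find structure, then flags clusters that contain two entities from the same source. The source names, entity identifiers, and the assumption that each source is duplicate-free are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def cluster_links(links):
    """Group entities into clusters: connected components over links."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for entity in list(parent):
        clusters[find(entity)].add(entity)
    return list(clusters.values())


def is_suspicious(cluster):
    """Assumption (ours): sources are duplicate-free, so two entities
    from the same source in one cluster hint at a wrong link."""
    sources = [source for source, _ in cluster]
    return len(sources) != len(set(sources))


# Entities are (source, local id) pairs; all values are made up.
links = [
    (("src_a", "a1"), ("src_b", "b1")),
    (("src_b", "b1"), ("src_c", "c1")),
    (("src_a", "a2"), ("src_b", "b2")),
    (("src_b", "b2"), ("src_a", "a3")),  # pulls a second src_a entity in
]

clusters = cluster_links(links)
flagged = [c for c in clusters if is_suspicious(c)]
```

    Each resulting cluster stands for one presumed real-world object, and a flagged cluster is a candidate for link repair rather than proof of an error.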

    Survey: Models and Prototypes of Schema Matching

    Schema matching is a critical problem in many applications: data and information integration, achieving interoperability, and other settings affected by schematic heterogeneity. It has evolved from manual, domain-specific procedures to newer models and methods that are semi-automatic and more general, and that can more effectively guide the user in generating a mapping between the elements of two schemas or ontologies. This paper summarises a literature review of models and prototypes for schema matching from the last 25 years, describing the progress made as well as the research challenges and opportunities for new models, methods, and/or prototypes.

    BAC: A bagged associative classifier for big data frameworks

    Big Data frameworks allow powerful distributed computations extending the results achievable on a single machine. In this work, we present a novel distributed associative classifier, named BAC, based on ensemble techniques. Ensembles are a popular approach that builds several models on different subsets of the original dataset, eventually voting to provide a unique classification outcome. Experiments on Apache Spark and preliminary results showed the capability of the proposed ensemble classifier to obtain a quality comparable with the single-machine version on popular real-world datasets, and overcome their scalability limits on large synthetic datasets
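
    The bagging-and-voting scheme described above can be sketched on a single machine. This is a minimal illustration and not BAC itself: the base learner is a toy "attribute value -> most frequent label" rule model standing in for an associative classifier, and all function names and the data are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_rule_model(sample):
    """Toy base learner: one rule per attribute value, plus a default label."""
    labels_by_value = defaultdict(Counter)
    for value, label in sample:
        labels_by_value[value][label] += 1
    rules = {v: c.most_common(1)[0][0] for v, c in labels_by_value.items()}
    default = Counter(label for _, label in sample).most_common(1)[0][0]
    return rules, default


def predict(model, value):
    rules, default = model
    return rules.get(value, default)


def train_bagged(data, n_models=5, seed=0):
    """Train each model on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [
        train_rule_model([rng.choice(data) for _ in data])
        for _ in range(n_models)
    ]


def vote(models, value):
    """Majority vote over the predictions of all models."""
    return Counter(predict(m, value) for m in models).most_common(1)[0][0]


# Made-up training data: (attribute value, class label) pairs.
data = [("red", "stop")] * 10 + [("green", "go")] * 10
models = train_bagged(data)
prediction = vote(models, "red")
```

    In a distributed setting each bootstrap sample could be trained on a different worker, since the models only meet at voting time.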

    Data locality in Hadoop

    Current market trends show a need to store and process rapidly growing amounts of data, which creates demand for distributed storage and data processing systems. Apache Hadoop is an open-source framework for managing such computing clusters in an effective, fault-tolerant way. When dealing with large volumes of data, Hadoop and its storage system HDFS (Hadoop Distributed File System) face the challenge of staying efficient and completing computations in a reasonable time. The typical Hadoop implementation transfers computation to the data rather than shipping data across the cluster, because moving large quantities of data through the network could significantly delay data processing tasks. While a task is running, Hadoop favours local data access and chooses blocks from the nearest nodes; the necessary blocks are moved only when the task needs them. To support Hadoop's data locality preferences, in this thesis we propose adding a new functionality to its distributed file system (HDFS) that enables moving data blocks on request. Shipping data in advance makes it possible to forcibly redistribute data between nodes so that its placement can be adapted to the given processing tasks. The new functionality enables the instructed movement of data blocks within the cluster. Data can be moved either by a user running the proper HDFS shell command or programmatically by another module, such as an appropriate scheduler. To develop this functionality, a detailed analysis of the Apache Hadoop source code and its components (specifically HDFS) was conducted. This research resulted in a deep understanding of the internal architecture, which made it possible to compare the possible approaches to the desired solution and to develop the chosen one.

    Multiple logic-based reconstructions of conceptual data modelling languages such as EER, UML Class Diagrams, and ORM exist. They mainly cover various fragments of the languages and none are formalised such that the logic applies simultaneously for all three modelling language families as a unifying mechanism. This hampers interchangeability, interoperability, and tooling support. In addition, due to the lack of a systematic design process of the logic used for the formalisation, hidden choices permeate the formalisations that have rendered them incompatible. We aim to address these problems, first, by structuring the logic design process in a methodological way. We generalise and extend the DSL design process to apply to logic language design more generally and, in particular, by incorporating an ontological analysis of language features in the process. Second, availing of this extended process, of evidence gathered of language feature usage, and of computational complexity insights from Description Logics (DL), we specify logic profiles taking into account the ontological commitments embedded in the languages. The profiles characterise the minimum logic structure needed to handle the semantics of conceptual models, enabling the development of interoperability tools. There is no known DL language that matches exactly the features of those profiles and the common core is small (in the tractable ALNI). Although hardly any inconsistencies can be derived with the profiles, it is promising for scalable runtime use of conceptual data models

    Multiple logic-based reconstructions of conceptual data modelling languages such as EER, UML Class Diagrams, and ORM exist. They mainly cover various fragments of the languages and none are formalised such that the logic applies simultaneously for all three modelling language families as unifying mechanism. This hampers interchangeability, interoperability, and tooling support. In addition, due to the lack of a systematic design process of the logic used for the formalisation, hidden choices permeate the formalisations that have rendered them incompatible. We aim to address these problems, first, by structuring the logic design process in a methodological way. We generalise and extend the DSL design process to apply to logic language design more generally and, in particular, by incorporating an ontological analysis of language features in the process. Second, we specify minimal logic profiles availing of this extended process, including the ontological commitments embedded in the languages, of evidence gathered of language feature usage, and of computational complexity insights from Description Logics (DL). The profiles characterise the essential logic structure needed to handle the semantics of conceptual models, therewith enabling the development of interoperability tools. There is no known DL language that matches exactly the features of those profiles and the common core is small (in the tractable DL ALNI). Although hardly any inconsistencies can be derived with the profiles, it is promising for scalable runtime use of conceptual data models

    Lenguajes austeros de modelado conceptual de datos basados en evidencias

    Multiple logic-based reconstructions of UML class diagrams, Entity-Relationship diagrams, and Object-Role Model diagrams exist. They mainly cover various fragments of these Conceptual Data Modelling Languages and none are formalised such that the logic applies simultaneously for the three language families as a unifying mechanism. This hampers interchangeability, interoperability, and tooling support. In addition, due to the lack of a systematic design process of the logic used for the formalisation, hidden choices permeate the formalisations that have rendered them incompatible. We aim to address these problems, first, by structuring the logic design process in a methodological way. We generalise and extend the DSL design process to logic language design. In particular, a new phase of ontological analysis of language features is included. Second, we specify minimal logic profiles availing of this extended process, including the ontological commitments embedded in the languages, of evidence gathered of language feature usage, and of computational complexity insights from Description Logics (DL). The profiles characterise the essential logic structure needed to handle the semantics of conceptual models, therewith enabling the development of interoperability tools. No known DL language matches exactly the features of those profiles and the common core is in the tractable DL ALNI. Although hardly any inconsistencies can be derived with the profiles, it is promising for scalable runtime use of conceptual data models.
Several logic-based reconstructions of conceptual modelling languages such as EER, UML class diagrams, and ORM exist. They mainly cover fragments of these languages, and their formalisations are not designed to apply simultaneously to these three language families as a unifying mechanism. This undermines the interchange and interoperability of models and the development of supporting tools. Moreover, given the lack of a systematic design process, certain hidden decisions in the logical representation make the formalisations incompatible. In this work we set out to tackle this problem, first by proposing a logic design process that can be applied methodically. The DSL process is generalised and extended so that it can be applied to the design of logic languages in general, incorporating an ontological analysis of language features. Second, minimal logic profiles are specified that take advantage of this extended process, including the ontological commitments made, evidence of language feature usage, and the computational properties of Description Logics (DL). These profiles characterise the essential logic structure needed to handle the semantics of conceptual models, enabling the development of automated interoperability tools. There is no direct exact correspondence between these profiles and known fragments of DL languages, and the common core is small (the tractable logic ALNI). Although there is very little possibility of deriving inconsistencies within these profiles, their use in conceptual models is promising given their scalable time complexity.
Facultad de Informátic

    Anforderungsbasierte Modellierung und Ausführung von Datenflussmodellen

    Nowadays, the volume of data as well as its heterogeneity, rate of change, and complexity are increasing sharply. This is often referred to as the "Big Data problem". With the emergence of new paradigms such as the Internet of Things or Industry 4.0, this trend will intensify further in the future. Processing, analysing, and visualising data can deliver high added value, for example by detecting previously unknown patterns or by predicting events. However, the characteristics of Big Data, in particular the large data volume and its rapid change, pose a major challenge for data processing. Conventional techniques applied so far, such as analyses based on relational databases, often reach their limits here. Furthermore, the kind of users performing data processing is also changing, especially in companies. Instead of data processing being carried out exclusively by programming experts, the user group is growing to include domain users who have a strong interest in data analysis results but cannot implement them technically. To support domain users, the concept of mashups emerged around 2007 as part of the Web 2.0 movement; mashups are intended to provide a simple way of supporting users from different domains in combining programs, graphical user interfaces, and data. The focus was primarily on web data sources such as RSS feeds, HTML pages, or other XML-based formats. Although the resulting concepts provide good approaches for letting domain users process small amounts of data quickly and exploratively, they cannot cope with the Big Data challenges mentioned above. 
The basic idea of mashups serves as the inspiration for this dissertation and is extended to realise modern, complex, and data-intensive data processing and analysis scenarios. To this end, this dissertation develops a comprehensive concept that both enables domain experts to model data analyses easily, thereby putting the user at the centre, and enables an individualised, efficient execution of data analyses and data processing. Individualisation here means that the functional and non-functional requirements, which can vary from use case to use case, are taken into account during execution. This requires the execution environment to be set up dynamically. The problem described is addressed on several layers: 1) the modelling layer, which serves as the interface to domain users and allows data processing scenarios to be modelled abstractly; 2) the model transformation layer, on which the abstract model can be mapped to different executable representations; 3) the data processing layer, with which the data is processed efficiently in a distributed environment; and 4) the data storage layer, in which data from heterogeneous sources is extracted and data processing or analysis results are persisted. The concepts of the dissertation are supported by associated publications in conference proceedings and journals and validated by a prototypical implementation.

    Scalable Data Integration for Linked Data

    Linked Data describes an extensive set of structured but heterogeneous data sources where entities are connected by formal semantic descriptions. In the vision of the Semantic Web, these semantic links are extended towards the World Wide Web to provide as much machine-readable data as possible for search queries. The resulting connections allow an automatic evaluation to find new insights into the data. Identifying these semantic connections between two data sources with automatic approaches is called link discovery. We derive common requirements and a generic link discovery workflow based on similarities between entity properties and associated properties of ontology concepts. Most of the existing link discovery approaches disregard the fact that in times of Big Data, an increasing volume of data sources poses new demands on link discovery. In particular, the problem of complex and time-consuming link determination escalates with an increasing number of intersecting data sources. To overcome the restriction of pairwise linking of entities, holistic clustering approaches are needed to link equivalent entities of multiple data sources to construct integrated knowledge bases. In this context, the focus on efficiency and scalability is essential. For example, reusing existing links or background information can help to avoid redundant calculations. However, when dealing with multiple data sources, additional data quality problems must also be dealt with. This dissertation addresses these comprehensive challenges by designing holistic linking and clustering approaches that enable reuse of existing links. Unlike previous systems, we execute the complete data integration workflow via a distributed processing system. At first, the LinkLion portal will be introduced to provide existing links for new applications. These links act as a basis for a physical data integration process to create a unified representation for equivalent entities from many data sources. 
We then propose a holistic clustering approach to form consolidated clusters for same real-world entities from many different sources. At the same time, we exploit the semantic type of entities to improve the quality of the result. The process identifies errors in existing links and can find numerous additional links. Additionally, the entity clustering has to react to the high dynamics of the data. In particular, this requires scalable approaches for continuously growing data sources with many entities as well as additional new sources. Previous entity clustering approaches are mostly static, focusing on the one-time linking and clustering of entities from few sources. Therefore, we propose and evaluate new approaches for incremental entity clustering that supports the continuous addition of new entities and data sources. To cope with the ever-increasing number of Linked Data sources, efficient and scalable methods based on distributed processing systems are required. Thus we propose distributed holistic approaches to link many data sources based on a clustering of entities that represent the same real-world object. The implementation is realized on Apache Flink. In contrast to previous approaches, we utilize efficiency-enhancing optimizations for both distributed static and dynamic clustering. An extensive comparative evaluation of the proposed approaches with various distributed clustering strategies shows high effectiveness for datasets from multiple domains as well as scalability on a multi-machine Apache Flink cluster

    A foundation for ontology modularisation

    There has been great interest in realising the Semantic Web. Ontologies are used to define Semantic Web applications. Ontologies have grown to be large and complex to the point where they cause cognitive overload for humans, in understanding and maintaining, and for machines, in processing and reasoning. Furthermore, building ontologies from scratch is time-consuming and not always necessary. Prospective ontology developers could consider using existing ontologies that are of good quality. However, an entire large ontology is not always required for a particular application, but a subset of the knowledge may be relevant. Modularity deals with simplifying an ontology for a particular context or by structure into smaller ontologies, thereby preserving the contextual knowledge. There are a number of benefits in modularising an ontology including simplified maintenance and machine processing, as well as collaborative efforts whereby work can be shared among experts. Modularity has been successfully applied to a number of different ontologies to improve usability and assist with complexity. However, problems exist for modularity that have not been satisfactorily addressed. Currently, modularity tools generate large modules that do not exclusively represent the context. Partitioning tools, which ought to generate disjoint modules, sometimes create overlapping modules. These problems arise from a number of issues: different module types have not been clearly characterised, it is unclear what the properties of a 'good' module are, and it is unclear which evaluation criteria apply to specific module types. In order to successfully solve the problem, a number of theoretical aspects have to be investigated. It is important to determine which ontology module types are the most widely-used and to characterise each such type by distinguishing properties. One must identify properties that a 'good' or 'usable' module meets. 
In this thesis, we investigate these problems with modularity systematically. We begin by identifying dimensions for modularity to define its foundation: use-case, technique, type, property, and evaluation metric. Each dimension is populated with sub-dimensions as fine-grained values. The dimensions are used to create an empirically-based framework for modularity by classifying a set of ontologies with them, which results in dependencies among the dimensions. The formal framework can be used to guide the user in modularising an ontology and as a starting point in the modularisation process. To solve the problem with module quality, new and existing metrics were implemented into a novel tool TOMM, and an experimental evaluation with a set of modules was performed resulting in dependencies between the metrics and module types. These dependencies can be used to determine whether a module is of good quality. For the issue with existing modularity techniques, we created five new algorithms to improve the current tools and techniques and experimentally evaluated them. The algorithms of the tool, NOMSA, perform as well as other tools for most performance criteria. For NOMSA's generated modules, two of its algorithms' generated modules are of good quality when compared to the expected dependencies of the framework. The remaining three algorithms' modules correspond to some of the expected values for the metrics for the ontology set in question. 
The success of solving the problems with modularity resulted in a formal foundation for modularity which comprises: an exhaustive set of modularity dimensions with dependencies between them, a framework for guiding the modularisation process and annotating modules, a way to measure the quality of modules using the novel TOMM tool which has new and existing evaluation metrics, the SUGOI tool for module management that has been investigated for module interchangeability, and an implementation of new algorithms to fill in the gaps of insufficient tools and techniques.