10 research outputs found

    MinoanER: Schema-Agnostic, Non-Iterative, Massively Parallel Resolution of Web Entities

    Entity Resolution (ER) aims to identify different descriptions in various Knowledge Bases (KBs) that refer to the same entity. ER is challenged by the Variety, Volume and Veracity of entity descriptions published in the Web of Data. To address them, we propose the MinoanER framework, which simultaneously achieves full automation, support for highly heterogeneous entities, and massive parallelization of the ER process. MinoanER leverages a token-based similarity of entities to define a new metric that derives the similarity of neighboring entities from the most important relations, as indicated solely by statistics. A composite blocking method is employed to capture different sources of matching evidence from the content, neighbors, or names of entities. The search space of candidate pairs for comparison is compactly abstracted by a novel disjunctive blocking graph and processed by a non-iterative, massively parallel matching algorithm that consists of four generic, schema-agnostic matching rules that are quite robust with respect to their internal configuration. We demonstrate that the effectiveness of MinoanER is comparable to that of existing ER tools over real KBs exhibiting low Variety, but that it outperforms them significantly when matching KBs with high Variety. Comment: Presented at EDBT 2019.
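    A minimal sketch of the kind of token-based, schema-agnostic candidate generation the abstract refers to, assuming plain whitespace tokenisation over attribute values (the function and the toy data are illustrative, not MinoanER's actual implementation):

```python
from collections import defaultdict
from itertools import combinations

def token_blocking(descriptions):
    """Group entity descriptions into blocks keyed by the tokens of their
    attribute values, then emit every entity pair that shares a block."""
    blocks = defaultdict(set)
    for entity_id, values in descriptions.items():
        for value in values:
            for token in str(value).lower().split():
                blocks[token].add(entity_id)

    candidate_pairs = set()
    for entities in blocks.values():
        for a, b in combinations(sorted(entities), 2):
            candidate_pairs.add((a, b))
    return candidate_pairs

# Two KBs describing overlapping real-world entities (illustrative data).
descriptions = {
    "kb1/e1": ["Restaurant Plaka", "Heraklion Crete"],
    "kb2/e7": ["Plaka Restaurant", "Crete"],
    "kb2/e9": ["Knossos Hotel", "Heraklion"],
}
print(token_blocking(descriptions))
```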

    Automatic Table Extension with Open Data

    With thousands of data sources available on the web as well as within organisations, data scientists increasingly spend more time searching for data than analysing it. To ease the task of finding and integrating relevant data for data mining projects, this dissertation presents two new methods for automatic table extension. Automatic table extension systems take over the task of data discovery and data integration by adding new columns with new information (new attributes) to any table. The data values in the new columns are extracted from a given corpus of tables.
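    A minimal sketch of the table-extension idea under simplifying assumptions: tables are lists of dicts keyed by an entity label, and conflicting values found in different corpus tables are resolved by majority vote (the function and data are illustrative, not the dissertation's actual methods):

```python
from collections import Counter

def extend_table(query_rows, corpus_tables, new_attribute):
    """Add `new_attribute` to each query row by looking up its label in a corpus
    of tables and taking a majority vote over the values that are found."""
    for row in query_rows:
        votes = Counter()
        for table in corpus_tables:
            for corpus_row in table:
                if corpus_row.get("label") == row["label"] and new_attribute in corpus_row:
                    votes[corpus_row[new_attribute]] += 1
        row[new_attribute] = votes.most_common(1)[0][0] if votes else None
    return query_rows

query = [{"label": "Mannheim"}, {"label": "Karlsruhe"}]
corpus = [
    [{"label": "Mannheim", "population": 309370}],
    [{"label": "Karlsruhe", "population": 313092}, {"label": "Mannheim", "population": 309370}],
]
print(extend_table(query, corpus, "population"))
```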

    Explaining differences between unaligned table snapshots

    We study the problem of explaining differences between two snapshots of the same database table, including record insertions, deletions and, in particular, record updates. Unlike existing alternatives, our solution induces transformation functions and does not require knowledge of the correct alignment between the record sets. This allows profiling snapshots of tables with unspecified or modified primary keys. In such a problem setting, there are always multiple explanations for the differences. Our goal is to find the simplest explanation. We propose to measure the complexity of explanations on the basis of minimum description length in order to formulate the task as an optimization problem. We show that the problem is NP-hard and propose a heuristic search algorithm to solve practical problem instances. We implement a prototype called Affidavit to assess the explanatory qualities of our approach in experiments based on different real-world data sets. We show that it scales to large numbers of both records and attributes and is able to reliably provide correct explanations under practical levels of modification.
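    The minimum-description-length intuition can be illustrated with a toy cost model in which an explanation's complexity is simply the number of characters needed to write it down (this cost model and the example are illustrative, not Affidavit's actual measure):

```python
def description_length(explanation):
    """Toy MDL cost: total number of characters needed to state the explanation."""
    return sum(len(str(part)) for part in explanation)

old_snapshot = [("a", 10), ("b", 20), ("c", 30)]
new_snapshot = [("a", 11), ("b", 21), ("c", 31)]

# Explanation 1: enumerate every record update explicitly.
explicit = [f"set {key} from {old} to {new}"
            for (key, old), (_, new) in zip(old_snapshot, new_snapshot)]

# Explanation 2: a single transformation function covering all records.
rule = ["add 1 to the second attribute of every record"]

# The explanation with the smaller description length is preferred.
print(description_length(explicit), description_length(rule))
```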

    Monitor Newsletter January 12, 1998

    Official Publication of Bowling Green State University for Faculty and Staff
    https://scholarworks.bgsu.edu/monitor/1481/thumbnail.jp

    Automating Industrial Event Stream Analytics: Methods, Models, and Tools

    Industrial event streams are an important cornerstone of Industrial Internet of Things (IIoT) applications. For instance, in the manufacturing domain, such streams are typically produced by distributed industrial assets at high frequency on the shop floor. To add business value and extract the full potential of the data (e.g. through predictive quality assessment or maintenance), industrial event stream analytics is an essential building block. One major challenge is the distribution of the required technical and domain knowledge across several roles, which makes the realization of analytics projects time-consuming and error-prone. For instance, accessing industrial data sources requires a high level of technical skill due to the large heterogeneity of protocols and formats. To reduce the technical overhead of current approaches, several problems must be addressed. The goal is to enable so-called "citizen technologists" to evaluate event streams through a self-service approach. This requires new methods and models that cover the entire data analytics cycle. This thesis answers the research question of how citizen technologists can be enabled to perform industrial event stream analytics independently. The first step is to investigate how the technical complexity of modeling and connecting industrial data sources can be reduced. Subsequently, it is analyzed how event streams can be automatically adapted, directly at the edge, to meet the requirements of data consumers and the infrastructure. Finally, this thesis examines how machine learning models for industrial event streams can be trained in an automated way to evaluate previously integrated data. The main research contributions of this work are: 1. A semantics-based adapter model to describe industrial data sources and to automatically generate adapter instances on edge nodes. 2. An extension for publish-subscribe systems that dynamically reduces event streams while considering the requirements of downstream algorithms. 3. A novel AutoML approach that enables citizen data scientists to train and deploy supervised ML models for industrial event streams. The developed approaches are fully implemented in various high-quality software artifacts, which have been integrated into a large open-source project, enabling rapid adoption of the novel concepts in real-world environments. For the evaluation, two user studies investigating usability were conducted, along with performance and accuracy tests of the individual components.
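    The second contribution, dynamically reducing event streams in a publish-subscribe setting, can be sketched under the assumption that a consumer declares which fields it needs and the maximum event rate it can handle (the requirement format and function below are illustrative, not the thesis's actual model):

```python
def reduce_stream(events, requirements):
    """Reduce an event stream at the edge according to a consumer's requirements:
    drop events that arrive faster than the requested rate and project away
    fields the consumer does not need."""
    needed = set(requirements["fields"]) | {"timestamp"}
    min_interval = 1.0 / requirements["max_events_per_second"]
    last_forwarded = float("-inf")
    for event in events:
        if event["timestamp"] - last_forwarded >= min_interval:
            last_forwarded = event["timestamp"]
            yield {key: value for key, value in event.items() if key in needed}

# A sensor emitting at 10 Hz, a consumer that only needs temperature at 2 Hz.
raw = [{"timestamp": t * 0.1, "temperature": 20 + t, "vibration": 0.01 * t}
       for t in range(50)]
consumer = {"fields": ["temperature"], "max_events_per_second": 2}
print(list(reduce_stream(raw, consumer)))
```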

    Final Report of the Research Project "Broker für Dynamische Produktionsnetzwerke" (Broker for Dynamic Production Networks)

    The Broker für Dynamische Produktionsnetzwerke (DPNB) is a research project funded by the German Federal Ministry of Education and Research (BMBF) and supervised by the project management agency Projektträger Karlsruhe (PTKA), carried out by seven partners from academia and industry over a term from January 2019 through December 2021. By means of cloud manufacturing as well as hardware and software components deployed at the participating companies, capacity providers are to be connected with capacity seekers. The tradable capacities in this case are machine, transport and assembly capacities, so that supply chains can be modelled as comprehensively as possible based on the use case of the sheet-metal industry. This final report summarises the state of the art as well as the findings from the project. It also gives an overview of the project structure and the project partners.

    Web table integration and profiling for knowledge base augmentation

    HTML tables on web pages ("web tables") have been used successfully as a data source for several applications. They can be extracted from web pages at large scale, resulting in corpora of millions of web tables. However, until today little is known about the general distribution of topics and the specific types of data that are contained in the tables that can be found on the Web. Yet this knowledge is essential for understanding the potential application areas and topical coverage of web tables as a data source. Such knowledge can be obtained through the integration of web tables with a knowledge base, which enables the semantic interpretation of their content and allows for their topical profiling. In turn, the knowledge base can be augmented by adding new statements from the web tables. This is challenging, because the data volume and variety are much larger than in traditional data integration scenarios, in which only a small number of data sources is integrated. The contributions of this thesis are methods for the integration of web tables with a knowledge base and the profiling of large-scale web table corpora through the application of these methods. For this profiling, two corpora of 147 million and 233 million web tables, respectively, are created and made publicly available. These corpora are two of only three that are openly available for research on web tables. Their data profile reveals that most web tables have only very few rows, with a median of 6 rows per web table, and that between 35% and 52% of all columns contain non-textual values, such as numbers or dates. These two characteristics have been mostly ignored in the literature about web tables and are addressed by the methods presented in this thesis. The first method, T2K Match, is an algorithm for semantic table interpretation that annotates web tables with classes, properties, and entities from a knowledge base. Unlike most algorithms for these tasks, it is not limited to the annotation of columns that contain the names of entities. Its application to a large-scale web table corpus results in the most fine-grained topical data profile of web tables at the time of writing, but also reveals that small web tables cannot be processed with high quality. For such small web tables, a method that stitches them into larger tables is presented and shown to drastically improve the quality of the results. The data profile further shows that the majority of the columns in web tables for which classes and entities can be recognised have no corresponding properties in the knowledge base. This makes them candidates for new properties that can be added to the knowledge base. The current methods for this task, however, suffer from the oversimplified assumption that web tables only contain binary relations. This results in the extraction of incomplete relations from the web tables as new properties and makes their correct interpretation impossible. To increase completeness, a method is presented that generates additional data from the context of the web tables and synthesizes n-ary relations from all web tables of a web site. The application of this method to the second large-scale web table corpus shows that web tables contain a large number of n-ary relations. This means that the data contained in web tables is of higher complexity than previously assumed.
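    The stitching step mentioned above can be sketched under the simplifying assumption that web tables from the same web site may be combined whenever their column headers are identical (the representation and the matching criterion are simplified relative to the thesis):

```python
from collections import defaultdict

def stitch_tables(tables):
    """Combine web tables from the same web site that share identical headers,
    so that very small tables (the median web table has only a handful of rows)
    become large enough to be matched reliably."""
    stitched = defaultdict(list)
    for table in tables:
        stitched[(table["site"], tuple(table["header"]))].extend(table["rows"])
    return [{"site": site, "header": list(header), "rows": rows}
            for (site, header), rows in stitched.items()]

small_tables = [
    {"site": "example.org", "header": ["country", "capital"],
     "rows": [("France", "Paris"), ("Spain", "Madrid")]},
    {"site": "example.org", "header": ["country", "capital"],
     "rows": [("Italy", "Rome")]},
]
print(stitch_tables(small_tables))
```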

    An analysis of the education potential of sites in the Cape Peninsula for secondary school fieldwork in environmental studies

    In South African secondary schools much less fieldwork is undertaken than in a number of other countries, despite fieldwork being required by some school syllabuses and the fact that, in many areas, suitable sites are readily at hand. In an attempt to assess the nature of future demands for fieldwork sites, this study reviews the developments in education that have led to an increasing emphasis on teaching outside the classroom, and analyses the reasons why so little fieldwork is being done. A methodology for selecting fieldwork sites is developed that takes into account educational priorities and practical constraints. This is worked out in practice by drawing up a fieldwork syllabus for a particular school and selecting sites in the Cape Peninsula for field studies. Finally, the educational potential of a sample of these sites is indicated by means of exercises prepared for secondary school children.

    The nuclear-conventional nexus in Western military planning for European contingencies

    The nuclear-conventional nexus is central to many peacetime intra-Alliance debates, and it is a critical reference point for military planners. The linkage between nuclear and conventional military power also provides a distinctive dimension to the control of military operations during crisis and war. The management of this nexus depends on evolving political and operational factors such as: trans-Atlantic diplomacy and European political developments; the modernisation of theatre nuclear forces and doctrine; and the prospects for nuclear proliferation. It is argued that planning for nuclear and conventional military units in and around Europe should be reviewed within the context of a shift in doctrine that more clearly addresses the requirements of crisis management. For this to occur, strategic analysis should recognise how regional political factors both reflect, and help to mould, the juxtaposition of nuclear and conventional military power. Such analysis would show that, within Europe, nuclear and conventional forces have acquired overlapping but not coterminous roles. These ideas are developed within an analytical framework which brings together: a discussion of the nature of strategy; a history of the nuclear-conventional nexus; and an examination of factors affecting the character of the linkage between nuclear and conventional forces in Europe.

    WInte.r - a web data integration framework

    The Web provides a plethora of structured data, such as semantic annotations in web pages, data from HTML tables, datasets from open data portals, or linked data from the Linked Open Data Cloud. For many use cases, it is necessary to integrate such web data with existing local datasets. This integration entails schema matching, identity resolution, and data fusion. As an alternative to using a combination of partial or ad hoc solutions, this poster presents the Web Data Integration Framework (WInte.r), which supports end-to-end data integration by providing algorithms and building blocks for data pre-processing, schema matching, and identity resolution, as well as data fusion. While being fully usable out of the box, the framework is highly customisable and allows for the composition of sophisticated integration architectures such as T2K Match, which is used to match millions of web tables against DBpedia. A second use case for which WInte.r was employed is the task of stitching (combining) web tables from the same web site into larger tables as a preprocessing step before matching. The WInte.r framework is written in Java and is available as open source under the Apache 2.0 license.
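    The end-to-end pipeline described above (schema matching, identity resolution, data fusion) can be illustrated with a short conceptual sketch. Note that this is not WInte.r's actual Java API; all functions and data below are simplified, hypothetical illustrations of the three stages:

```python
def match_schemas(attributes_a, attributes_b):
    """Toy schema matching: align attributes whose lowercased names are equal."""
    return {a: b for a in attributes_a for b in attributes_b if a.lower() == b.lower()}

def resolve_identities(records_a, records_b, key_a, key_b):
    """Toy identity resolution: pair records with equal (case-insensitive) key values."""
    index = {r[key_b].lower(): r for r in records_b}
    return [(r, index[r[key_a].lower()]) for r in records_a if r[key_a].lower() in index]

def fuse(record_pairs, correspondences):
    """Toy data fusion: rename the second source's attributes via the schema
    correspondences, then prefer the first source's values on conflicts."""
    rename = {b: a for a, b in correspondences.items()}
    return [{**{rename.get(k, k): v for k, v in b.items()}, **a} for a, b in record_pairs]

local = [{"Name": "Mannheim", "Population": 309370}]
web = [{"name": "Mannheim", "country": "Germany"}]

correspondences = match_schemas(local[0].keys(), web[0].keys())  # {'Name': 'name'}
pairs = resolve_identities(local, web, "Name", "name")
print(fuse(pairs, correspondences))
# [{'Name': 'Mannheim', 'country': 'Germany', 'Population': 309370}]
```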