19 research outputs found

    On Pattern Mining in Graph Data to Support Decision-Making

    In recent years, graph data models have become increasingly important in both research and industry. Their core is a generic data structure of things (vertices) and connections among those things (edges). Rich graph models such as the property graph model promise extraordinary analytical power because relationships can be evaluated without knowledge of a domain-specific database schema. This dissertation studies the use of graph models for data integration and data mining of business data. Although a typical company's business data implicitly describes a graph, it is usually stored in multiple relational databases. We therefore propose the first semi-automated approach to transform data from multiple relational databases into a single graph whose vertices represent domain objects and whose edges represent their mutual relationships. This transformation is the basis of our conceptual framework BIIIG (Business Intelligence with Integrated Instance Graphs). We further propose a graph-based approach to data integration that is executed after the transformation. In established data mining approaches, interrelated input data is mostly represented by tuples of measure values and dimension values. In the context of graphs, these values must be attached to the graph structure, and aggregated measure values become graph attributes. Since the latter was not supported by any existing model, we propose the use of collections of property graphs as the data structure of the novel Extended Property Graph Model (EPGM). The model supports vertices and edges that may appear in different graphs as well as graph properties. We further propose operators that benefit from this data structure, for example, graph-based aggregation of measure values. A primitive operation of graph pattern mining is frequent subgraph mining (FSM); however, existing algorithms provided no support for directed multigraphs, so we extended the popular gSpan algorithm to overcome this limitation. Some patterns might not be frequent while their generalizations are. Generalized graph patterns can be mined by attaching vertices to taxonomies. We propose a novel approach to Generalized Multidimensional Frequent Subgraph Mining (GM-FSM), in particular the first solution to generalized FSM that supports not only directed multigraphs but also multiple dimensional taxonomies. In scenarios that compare patterns of different categories, e.g., fraud or not, FSM is not sufficient since pattern frequencies may differ by category. Moreover, determining all pattern frequencies without frequency pruning is not an option due to the computational complexity of FSM. We therefore developed Characteristic Subgraph Mining (CSM), an FSM extension that extracts patterns which are characteristic of a specific category according to a user-defined interestingness function. Parts of this work were done in the context of GRADOOP, a framework for distributed graph analytics. To make the primitive operation of frequent subgraph mining available to this framework, we developed Distributed In-Memory gSpan (DIMSpan), a frequent subgraph miner that is tailored to the characteristics of shared-nothing clusters and distributed dataflow systems. Finally, we present the results of use case evaluations in cooperation with a large-scale enterprise, including a report of practical experiences gained in the implementation and application of the proposed algorithms.
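To make the data model concrete, here is a minimal, hypothetical Python sketch of the EPGM idea described above: elements carry properties and memberships in several logical graphs, and a graph-based aggregation folds a vertex measure into a graph property. All class and function names are illustrative assumptions, not the GRADOOP/EPGM implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Iterable, Set

@dataclass
class Element:
    id: int
    label: str
    properties: Dict[str, Any] = field(default_factory=dict)
    graphs: Set[int] = field(default_factory=set)   # logical-graph memberships

@dataclass
class Edge(Element):
    source: int = 0
    target: int = 0

@dataclass
class LogicalGraph:
    id: int
    properties: Dict[str, Any] = field(default_factory=dict)

def aggregate(graph: LogicalGraph, vertices: Iterable[Element], key: str,
              fn: Callable = sum, prop_name: str = "agg") -> None:
    """Graph-based aggregation: fold a vertex measure into a graph property."""
    values = [v.properties[key] for v in vertices
              if graph.id in v.graphs and key in v.properties]
    graph.properties[prop_name] = fn(values)

# Usage: total revenue over all vertices that belong to logical graph 1.
g = LogicalGraph(id=1)
vs = [Element(1, "Invoice", {"revenue": 120.0}, {1}),
      Element(2, "Invoice", {"revenue": 80.0}, {1, 2})]
aggregate(g, vs, key="revenue", prop_name="total_revenue")
print(g.properties)   # {'total_revenue': 200.0}
```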

    Entities with quantities: extraction, search, and ranking

    Quantities are more than numeric values. They denote measures of the world's entities such as heights of buildings, running times of athletes, energy efficiency of car models or energy production of power plants, all expressed in numbers with associated units. Entity-centric search and question answering (QA) are well supported by modern search engines. However, they do not work well when the queries involve quantity filters, such as searching for athletes who ran 200m under 20 seconds or companies with quarterly revenue above $2 billion. State-of-the-art systems fail to understand the quantities, including the condition (less than, above, etc.), the unit of interest (seconds, dollars, etc.), and the context of the quantity (200m race, quarterly revenue, etc.). QA systems based on structured knowledge bases (KBs) also fail, as quantities are poorly covered by state-of-the-art KBs. In this dissertation, we developed new methods to advance the state of the art on quantity knowledge extraction and search. Our main contributions are the following:
    • First, we present Qsearch [Ho et al., 2019, Ho et al., 2020], a system that handles advanced questions with quantity filters by exploiting cues present in both the question and the text sources. Qsearch comprises two main contributions: a deep neural network model designed to extract quantity-centric tuples from text sources, and a novel query-matching model for finding and ranking matching tuples.
    • Second, to additionally tap into heterogeneous tables, we present QuTE [Ho et al., 2021a, Ho et al., 2021b], a system for extracting quantity information from web sources, in particular ad-hoc web tables in HTML pages. QuTE contributes a method for linking quantity and entity columns that leverages external text sources. For question answering, we contextualize the extracted entity-quantity pairs with informative cues from the table and present a new method for consolidating and improving the ranking of answer candidates via inter-fact consistency.
    • Third, we present QL [Ho et al., 2022], a recall-oriented method for enriching knowledge bases with quantity facts. Modern KBs such as Wikidata or YAGO cover many entities and their relevant information, but often miss important quantitative properties. QL is query-driven and based on iterative learning, with two main contributions for improving KB coverage: a query expansion method that captures a larger pool of candidate facts, and a self-consistency technique that takes the value distributions of quantities into account.
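The quantity filters discussed in this entry (a condition, a unit, and a value, as in "ran 200m under 20 seconds") can be illustrated with a minimal, hypothetical sketch. The unit table and class names below are assumptions for illustration only, not part of Qsearch or QuTE.

```python
from dataclasses import dataclass

# Toy conversion table to a base unit; real systems handle far richer unit logic.
_TO_BASE = {"seconds": 1.0, "minutes": 60.0, "dollar": 1.0, "billion dollar": 1e9}

@dataclass
class Quantity:
    value: float
    unit: str

    def in_base(self) -> float:
        return self.value * _TO_BASE[self.unit]

@dataclass
class QuantityFilter:
    comparator: str       # "less", "greater", or "equal"
    threshold: Quantity

    def matches(self, q: Quantity) -> bool:
        a, b = q.in_base(), self.threshold.in_base()
        if self.comparator == "less":
            return a < b
        if self.comparator == "greater":
            return a > b
        return abs(a - b) < 1e-9

# "under 20 seconds" applied to extracted quantities:
flt = QuantityFilter("less", Quantity(20, "seconds"))
print(flt.matches(Quantity(19.72, "seconds")))   # True
print(flt.matches(Quantity(0.4, "minutes")))     # False (24 seconds)
```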

    Novel Approaches to Preserving Utility in Privacy Enhancing Technologies

    Significant amounts of individual information are being collected and analyzed today through a wide variety of applications across different industries. While better utility is pursued by discovering knowledge from the data, an individual's privacy may be compromised during an analysis: corporate networks monitor their online behavior, advertising companies collect and share their private information, and cybercriminals cause financial damage through security breaches. To this end, the data typically undergoes certain anonymization techniques, e.g., CryptoPAn [Computer Networks'04], which replaces real IP addresses with prefix-preserving pseudonyms, or Differentially Private (DP) [ICALP'06] techniques, which modify the answer to a query by adding zero-mean noise distributed according to, e.g., a Laplace distribution. Unfortunately, most such techniques are either vulnerable to adversaries with prior knowledge, e.g., some network flows in the data, or require heavy data sanitization or perturbation, both of which may result in a significant loss of data utility. Therefore, the fundamental trade-off between privacy and utility (i.e., analysis accuracy) has attracted significant attention in various settings [ICALP'06, ACM CCS'14]. In line with this track of research, in this dissertation we aim to build utility-maximized and privacy-preserving tools for Internet communications. Such tools can be employed not only by dissidents and whistleblowers, but also by ordinary Internet users on a daily basis. To this end, we combine the development of practical systems with rigorous theoretical analysis, and incorporate techniques from various disciplines such as computer networking, cryptography, and statistical analysis. During this research, we proposed three frameworks in well-known settings, outlined in the following: first, the Multi-view approach, which preserves both privacy and utility in network trace anonymization; second, the R2DP approach, a novel technique for differentially private mechanism design with maximized utility; and third, the DPOD approach, a novel framework for privacy-preserving anomaly detection in the outsourcing setting.
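The DP technique referenced above is the standard Laplace mechanism; a minimal sketch follows. It illustrates plain epsilon-DP noise addition only, not the R2DP or DPOD mechanisms proposed in the dissertation.

```python
import numpy as np

def laplace_mechanism(true_answer: float, sensitivity: float,
                      epsilon: float, rng=None) -> float:
    """Standard epsilon-DP Laplace mechanism: add zero-mean Laplace noise
    with scale sensitivity/epsilon to the exact query answer."""
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query (sensitivity 1) answered with epsilon = 0.5.
print(round(laplace_mechanism(true_answer=1234, sensitivity=1.0, epsilon=0.5), 2))
```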

    High-Fidelity Provenance: Exploring the Intersection of Provenance and Security

    In the past 25 years, the World Wide Web has disrupted the way news is disseminated and consumed. However, the euphoria over the democratization of news publishing was soon followed by scepticism, as a new phenomenon emerged: fake news. With no gatekeepers to vouch for it, the veracity of information served over the World Wide Web became a major public concern. The Reuters Digital News Report 2020 cites that in at least half of the EU member countries, 50% or more of the population is concerned about online fake news. To help address the problem of trust in information communicated over the World Wide Web, it has been proposed to also make available the provenance metadata of the information. Similar to artwork provenance, this would include a detailed track of how the information was created, updated and propagated to produce the result we read, as well as what agents, human or software, were involved in the process. However, keeping track of provenance information is a non-trivial task. Current approaches are often of limited scope and may require modifying existing applications to also generate provenance information along with their regular output. This thesis explores how provenance can be automatically tracked in an application-agnostic manner, without having to modify the individual applications. We frame provenance capture as a data flow analysis problem and explore the use of dynamic taint analysis in this context. Our work shows that this approach improves on the quality of provenance captured compared to traditional approaches, yielding what we term high-fidelity provenance. We explore the performance cost of this approach and use deterministic record and replay to bring it down to a more practical level. Furthermore, we create and present the tooling necessary for expanding the use of deterministic record and replay for provenance analysis. The thesis concludes with an application of high-fidelity provenance as a tool for state-of-the-art offensive security analysis, based on the intuition that software, too, can be misguided by "fake news". This demonstrates that the potential uses of high-fidelity provenance for security extend beyond traditional forensics analysis.
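The core idea of dynamic taint analysis for provenance can be conveyed with a toy sketch: values carry provenance labels, and anything derived from them inherits the union of those labels. This is only a conceptual illustration; the thesis tracks data flow application-agnostically at a much lower level.

```python
class Tainted:
    """Toy tainted value: tracks which sources a value was derived from."""

    def __init__(self, value, labels):
        self.value = value
        self.labels = set(labels)

    def __add__(self, other):
        # Derived data inherits the union of its inputs' provenance labels.
        if isinstance(other, Tainted):
            return Tainted(self.value + other.value, self.labels | other.labels)
        return Tainted(self.value + other, self.labels)

    def __repr__(self):
        return f"{self.value!r} <- {sorted(self.labels)}"

headline = Tainted("Quake hits city", {"wire:reuters"})
edit = Tainted(" (updated)", {"editor:alice"})
print(headline + edit)  # 'Quake hits city (updated)' <- ['editor:alice', 'wire:reuters']
```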

    A new Nested Graph Model for Data Integration

    Although graph data has gained increasing interest in several fields, no data model suitable for both querying and integrating differently structured graph and (semi)structured data has been conceived so far. Possible reasons are the lack of operators allowing combinations of (multiple) graphs in current graph query languages (graph joins), and of graph data structures allowing data integration or nested multidimensional representations (graph nesting). In order to make such data integration possible, this thesis proposes a novel model, the General Semistructured data Model, allowing the representation of both graphs and arbitrarily nested contents (e.g., one node can be contained in more than one parent node), and thus the definition of a nested graph model in which both vertices and edges may include (overlapping) graphs. We provide two graph join algorithms (Graph Conjunctive Equijoin Algorithm and Graph Conjunctive Less-equal Algorithm) and one graph nesting algorithm (Two HOp Separated Patterns). Their evaluation on top of our secondary memory representation showed the inefficiency of existing query languages' query plans on top of their respective data models (relational, graph and document-oriented). In all three algorithms, the enhancement was achieved by using an adjacency list graph representation, thus reducing the cost of joining the vertices with their respective outgoing (or incoming) edges, and by associating hash values with both vertices and edges. As a secondary outcome of this thesis, a general data integration scenario is provided in which both graph data and other semistructured and structured data can be represented and integrated into the General Semistructured data Model. A new query language, the General Semistructured Query Language, outlines the feasibility of this approach over the former data model and also allows expressing both graph joins and graph nestings. This language is also capable of representing both traversal and data manipulation operators.
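A minimal sketch of the join strategy described above (adjacency lists plus hashing on the join attribute) is given below. It is a simplified stand-in, not the Graph Conjunctive Equijoin Algorithm itself, and all names and inputs are illustrative.

```python
from collections import defaultdict

def graph_equijoin(vertices_a, adj_a, vertices_b, adj_b, key):
    """Hash-join vertices of two graphs on an equal attribute value and
    concatenate their adjacency lists, so edges are not re-scanned per match."""
    buckets = defaultdict(list)                 # attribute value -> vertices of B
    for vid, props in vertices_b.items():
        if key in props:
            buckets[props[key]].append(vid)

    joined = []
    for vid, props in vertices_a.items():
        for wid in buckets.get(props.get(key), []):
            joined.append(((vid, wid), adj_a.get(vid, []) + adj_b.get(wid, [])))
    return joined

va = {1: {"ssn": "A1"}, 2: {"ssn": "B2"}}
vb = {10: {"ssn": "A1"}}
print(graph_equijoin(va, {1: [(1, 2)]}, vb, {10: [(10, 11)]}, key="ssn"))
# [((1, 10), [(1, 2), (10, 11)])]
```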

    Ethical Evidence and Policymaking

    EPDF and EPUB available Open Access under CC-BY-NC-ND licence. This important book offers practical advice for using evidence and research in policymaking. The book has two aims. First, it builds a case for ethics and global values in research and knowledge exchange, and second, it examines specific policy areas and how evidence can guide practice. The book covers important policy areas including the GM debate, the environment, Black Lives Matter and COVID-19. Each chapter assesses the ethical challenges, the status of evidence in explaining or describing the issue and possible solutions to the problem. The book will enable policymakers and their advisors to seek evidence for their decisions from research that has been conducted ethically and with integrity.

    Ethical research in public policy.

    Public policy research is research for a purpose, guided by a distinctive range of normative considerations. The values are the values of public service; the work is generally done in the public domain; and the research is an intrinsic part of the democratic process, which depends on deliberation and accountability. Conventional representations of ethical research typically focus on 'human subjects' research, which raises different kinds of ethical issues to public policy research. Existing research ethics advice does not address the issues surrounding public policy research. Such research is typically concerned with collective action and the work of institutions, and the central guiding principles are not about responsibility to research participants, but duties to the public, as seen in principles of beneficence, citizenship, empowerment and the democratic process.

    Leveraging Semantic Annotations for Event-focused Search & Summarization

    In today's Big Data era, overwhelming amounts of textual information spread across different sources with a high degree of redundancy make it hard for a consumer to look back on past events. A plausible solution is to link semantically similar information contained across the different sources to enforce a structure, thereby providing multiple access paths to relevant information. Keeping this larger goal in view, this work uses Wikipedia and online news articles as two prominent yet disparate information sources to address the following three problems:
    • We address a linking problem to connect Wikipedia excerpts to news articles by casting it as an IR task. Our novel approach integrates time, geolocations, and entities with text to identify relevant documents that can be linked to a given excerpt.
    • We address an unsupervised extractive multi-document summarization task to generate a fixed-length event digest that facilitates efficient consumption of the information contained within a large set of documents. Our novel approach proposes an ILP for global inference across text, time, geolocations, and entities associated with the event (a toy sketch follows this entry).
    • To estimate the temporal focus of short event descriptions, we present a semi-supervised approach that leverages redundancy within a longitudinal news collection to estimate accurate probabilistic time models.
    Extensive experimental evaluations demonstrate the effectiveness and viability of our proposed approaches towards achieving the larger goal.
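The ILP mentioned in the second bullet can be illustrated with a deliberately simplified sketch: select sentences that maximize a relevance score under a length budget, using the PuLP solver library. The scores, lengths, and budget below are invented, and the real formulation additionally reasons jointly over time, geolocations, and entities.

```python
import pulp

scores  = {0: 3.2, 1: 1.4, 2: 2.7, 3: 0.9}       # per-sentence relevance (made up)
lengths = {0: 28,  1: 12,  2: 25,  3: 10}        # per-sentence token counts (made up)
budget  = 50                                     # digest length limit

prob = pulp.LpProblem("event_digest", pulp.LpMaximize)
x = {i: pulp.LpVariable(f"x_{i}", cat="Binary") for i in scores}

prob += pulp.lpSum(scores[i] * x[i] for i in scores)             # maximize relevance
prob += pulp.lpSum(lengths[i] * x[i] for i in scores) <= budget  # respect the budget

prob.solve()
print(sorted(i for i in scores if x[i].value() == 1))   # [0, 1, 3]: 50 tokens, score 5.5
```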