128 research outputs found

    Graph based Anomaly Detection and Description: A Survey

    Get PDF
    Detecting anomalies in data is a vital task, with numerous high-impact applications in areas such as security, finance, health care, and law enforcement. While numerous techniques have been developed in past years for spotting outliers and anomalies in unstructured collections of multi-dimensional points, with graph data becoming ubiquitous, techniques for structured graph data have been of focus recently. As objects in graphs have long-range correlations, a suite of novel technology has been developed for anomaly detection in graph data. This survey aims to provide a general, comprehensive, and structured overview of the state-of-the-art methods for anomaly detection in data represented as graphs. As a key contribution, we give a general framework for the algorithms categorized under various settings: unsupervised vs. (semi-)supervised approaches, for static vs. dynamic graphs, for attributed vs. plain graphs. We highlight the effectiveness, scalability, generality, and robustness aspects of the methods. What is more, we stress the importance of anomaly attribution and highlight the major techniques that facilitate digging out the root cause, or the ‘why’, of the detected anomalies for further analysis and sense-making. Finally, we present several real-world applications of graph-based anomaly detection in diverse domains, including financial, auction, computer traffic, and social networks. We conclude our survey with a discussion on open theoretical and practical challenges in the field

    A parameter-less algorithm for tensor co-clustering

    Get PDF

    Searching and mining in enriched geo-spatial data

    Get PDF
    The emergence of new data collection mechanisms in geo-spatial applications paired with a heightened tendency of users to volunteer information provides an ever-increasing flow of data of high volume, complex nature, and often associated with inherent uncertainty. Such mechanisms include crowdsourcing, automated knowledge inference, tracking, and social media data repositories. Such data bearing additional information from multiple sources like probability distributions, text or numerical attributes, social context, or multimedia content can be called multi-enriched. Searching and mining this abundance of information holds many challenges, if all of the data's potential is to be released. This thesis addresses several major issues arising in that field, namely path queries using multi-enriched data, trend mining in social media data, and handling uncertainty in geo-spatial data. In all cases, the developed methods have made significant contributions and have appeared in or were accepted into various renowned international peer-reviewed venues. A common use of geo-spatial data is path queries in road networks where traditional methods optimise results based on absolute and ofttimes singular metrics, i.e., finding the shortest paths based on distance or the best trade-off between distance and travel time. Integrating additional aspects like qualitative or social data by enriching the data model with knowledge derived from sources as mentioned above allows for queries that can be issued to fit a broader scope of needs or preferences. This thesis presents two implementations of incorporating multi-enriched data into road networks. In one case, a range of qualitative data sources is evaluated to gain knowledge about user preferences which is subsequently matched with locations represented in a road network and integrated into its components. Several methods are presented for highly customisable path queries that incorporate a wide spectrum of data. In a second case, a framework is described for resource distribution with reappearance in road networks to serve one or more clients, resulting in paths that provide maximum gain based on a probabilistic evaluation of available resources. Applications for this include finding parking spots. Social media trends are an emerging research area giving insight in user sentiment and important topics. Such trends consist of bursts of messages concerning a certain topic within a time frame, significantly deviating from the average appearance frequency of the same topic. By investigating the dissemination of such trends in space and time, this thesis presents methods to classify trend archetypes to predict future dissemination of a trend. Processing and querying uncertain data is particularly demanding given the additional knowledge required to yield results with probabilistic guarantees. Since such knowledge is not always available and queries are not easily scaled to larger datasets due to the #P-complete nature of the problem, many existing approaches reduce the data to a deterministic representation of its underlying model to eliminate uncertainty. However, data uncertainty can also provide valuable insight into the nature of the data that cannot be represented in a deterministic manner. This thesis presents techniques for clustering uncertain data as well as query processing, that take the additional information from uncertainty models into account while preserving scalability using a sampling-based approach, while previous approaches could only provide one of the two. The given solutions enable the application of various existing clustering techniques or query types to a framework that manages the uncertainty.Das Erscheinen neuer Methoden zur Datenerhebung in räumlichen Applikationen gepaart mit einer erhöhten Bereitschaft der Nutzer, Daten über sich preiszugeben, generiert einen stetig steigenden Fluss von Daten in großer Menge, komplexer Natur, und oft gepaart mit inhärenter Unsicherheit. Beispiele für solche Mechanismen sind Crowdsourcing, automatisierte Wissensinferenz, Tracking, und Daten aus sozialen Medien. Derartige Daten, angereichert mit mit zusätzlichen Informationen aus verschiedenen Quellen wie Wahrscheinlichkeitsverteilungen, Text- oder numerische Attribute, sozialem Kontext, oder Multimediainhalten, werden als multi-enriched bezeichnet. Suche und Datamining in dieser weiten Datenmenge hält viele Herausforderungen bereit, wenn das gesamte Potenzial der Daten genutzt werden soll. Diese Arbeit geht auf mehrere große Fragestellungen in diesem Feld ein, insbesondere Pfadanfragen in multi-enriched Daten, Trend-mining in Daten aus sozialen Netzwerken, und die Beherrschung von Unsicherheit in räumlichen Daten. In all diesen Fällen haben die entwickelten Methoden signifikante Forschungsbeiträge geleistet und wurden veröffentlicht oder angenommen zu diversen renommierten internationalen, von Experten begutachteten Konferenzen und Journals. Ein gängiges Anwendungsgebiet räumlicher Daten sind Pfadanfragen in Straßennetzwerken, wo traditionelle Methoden die Resultate anhand absoluter und oft auch singulärer Maße optimieren, d.h., der kürzeste Pfad in Bezug auf die Distanz oder der beste Kompromiss zwischen Distanz und Reisezeit. Durch die Integration zusätzlicher Aspekte wie qualitativer Daten oder Daten aus sozialen Netzwerken als Anreicherung des Datenmodells mit aus diesen Quellen abgeleitetem Wissen werden Anfragen möglich, die ein breiteres Spektrum an Anforderungen oder Präferenzen erfüllen. Diese Arbeit präsentiert zwei Ansätze, solche multi-enriched Daten in Straßennetze einzufügen. Zum einen wird eine Reihe qualitativer Datenquellen ausgewertet, um Wissen über Nutzerpräferenzen zu generieren, welches darauf mit Örtlichkeiten im Straßennetz abgeglichen und in das Netz integriert wird. Diverse Methoden werden präsentiert, die stark personalisierbare Pfadanfragen ermöglichen, die ein weites Spektrum an Daten mit einbeziehen. Im zweiten Fall wird ein Framework präsentiert, das eine Ressourcenverteilung im Straßennetzwerk modelliert, bei der einmal verbrauchte Ressourcen erneut auftauchen können. Resultierende Pfade ergeben einen maximalen Ertrag basieren auf einer probabilistischen Evaluation der verfügbaren Ressourcen. Eine Anwendung ist die Suche nach Parkplätzen. Trends in sozialen Medien sind ein entstehendes Forscchungsgebiet, das Einblicke in Benutzerverhalten und wichtige Themen zulässt. Solche Trends bestehen aus großen Mengen an Nachrichten zu einem bestimmten Thema innerhalb eines Zeitfensters, so dass die Auftrittsfrequenz signifikant über den durchschnittlichen Level liegt. Durch die Untersuchung der Fortpflanzung solcher Trends in Raum und Zeit präsentiert diese Arbeit Methoden, um Trends nach Archetypen zu klassifizieren und ihren zukünftigen Weg vorherzusagen. Die Anfragebearbeitung und Datamining in unsicheren Daten ist besonders herausfordernd, insbesondere im Hinblick auf das notwendige Zusatzwissen, um Resultate mit probabilistischen Garantien zu erzielen. Solches Wissen ist nicht immer verfügbar und Anfragen lassen sich aufgrund der \P-Vollständigkeit des Problems nicht ohne Weiteres auf größere Datensätze skalieren. Dennoch kann Datenunsicherheit wertvollen Einblick in die Struktur der Daten liefern, der mit deterministischen Methoden nicht erreichbar wäre. Diese Arbeit präsentiert Techniken zum Clustering unsicherer Daten sowie zur Anfragebearbeitung, die die Zusatzinformation aus dem Unsicherheitsmodell in Betracht ziehen, jedoch gleichzeitig die Skalierbarkeit des Ansatzes auf große Datenmengen sicherstellen

    Classification of Explainable Artificial Intelligence Methods through Their Output Formats

    Get PDF
    Machine and deep learning have proven their utility to generate data-driven models with high accuracy and precision. However, their non-linear, complex structures are often difficult to interpret. Consequently, many scholars have developed a plethora of methods to explain their functioning and the logic of their inferences. This systematic review aimed to organise these methods into a hierarchical classification system that builds upon and extends existing taxonomies by adding a significant dimension—the output formats. The reviewed scientific papers were retrieved by conducting an initial search on Google Scholar with the keywords “explainable artificial intelligence”; “explainable machine learning”; and “interpretable machine learning”. A subsequent iterative search was carried out by checking the bibliography of these articles. The addition of the dimension of the explanation format makes the proposed classification system a practical tool for scholars, supporting them to select the most suitable type of explanation format for the problem at hand. Given the wide variety of challenges faced by researchers, the existing XAI methods provide several solutions to meet the requirements that differ considerably between the users, problems and application fields of artificial intelligence (AI). The task of identifying the most appropriate explanation can be daunting, thus the need for a classification system that helps with the selection of methods. This work concludes by critically identifying the limitations of the formats of explanations and by providing recommendations and possible future research directions on how to build a more generally applicable XAI method. Future work should be flexible enough to meet the many requirements posed by the widespread use of AI in several fields, and the new regulation

    Tensor factorization for relational learning

    Get PDF
    Relational learning is concerned with learning from data where information is primarily represented in form of relations between entities. In recent years, this branch of machine learning has become increasingly important, as relational data is generated in an unprecedented amount and has become ubiquitous in many fields of application such as bioinformatics, artificial intelligence and social network analysis. However, relational learning is a very challenging task, due to the network structure and the high dimensionality of relational data. In this thesis we propose that tensor factorization can be the basis for scalable solutions for learning from relational data and present novel tensor factorization algorithms that are particularly suited for this task. In the first part of the thesis, we present the RESCAL model -- a novel tensor factorization for relational learning -- and discuss its capabilities for exploiting the idiosyncratic properties of relational data. In particular, we show that, unlike existing tensor factorizations, our proposed method is capable of exploiting contextual information that is more distant in the relational graph. Furthermore, we present an efficient algorithm for computing the factorization. We show that our method achieves better or on-par results on common benchmark data sets, when compared to current state-of-the-art relational learning methods, while being significantly faster to compute. In the second part of the thesis, we focus on large-scale relational learning and its applications to Linked Data. By exploiting the inherent sparsity of relational data, an efficient computation of RESCAL can scale up to the size of large knowledge bases, consisting of millions of entities, hundreds of relations and billions of known facts. We show this analytically via a thorough analysis of the runtime and memory complexity of the algorithm as well as experimentally via the factorization of the YAGO2 core ontology and the prediction of relationships in this large knowledge base on a single desktop computer. Furthermore, we derive a new procedure to reduce the runtime complexity for regularized factorizations from O(r^5) to O(r^3) -- where r denotes the number of latent components of the factorization -- by exploiting special properties of the factorization. We also present an efficient method for including attributes of entities in the factorization through a novel coupled tensor-matrix factorization. Experimentally, we show that RESCAL allows us to approach several relational learning tasks that are important to Linked Data. In the third part of this thesis, we focus on the theoretical analysis of learning with tensor factorizations. Although tensor factorizations have become increasingly popular for solving machine learning tasks on various forms of structured data, there exist only very few theoretical results on the generalization abilities of these methods. Here, we present the first known generalization error bounds for tensor factorizations. To derive these bounds, we extend known bounds for matrix factorizations to the tensor case. Furthermore, we analyze how these bounds behave for learning on over- and understructured representations, for instance, when matrix factorizations are applied to tensor data. In the course of deriving generalization bounds, we also discuss the tensor product as a principled way to represent structured data in vector spaces for machine learning tasks. In addition, we evaluate our theoretical discussion with experiments on synthetic data, which support our analysis

    9th International Workshop "What can FCA do for Artificial Intelligence?" (FCA4AI 2021)

    Get PDF
    International audienceFormal Concept Analysis (FCA) is a mathematically well-founded theory aimed at classification and knowledge discovery that can be used for many purposes in Artificial Intelligence (AI). The objective of the ninth edition of the FCA4AI workshop (see http://www.fca4ai.hse.ru/) is to investigate several issues such as: how can FCA support various AI activities (knowledge discovery, knowledge engineering, machine learning, data mining, information retrieval, recommendation...), how can FCA be extended in order to help AI researchers to solve new and complex problems in their domains, and how FCA can play a role in current trends in AI such as explainable AI and fairness of algorithms in decision making.The workshop was held in co-location with IJCAI 2021, Montréal, Canada, August, 28 2021

    Machine learning for managing structured and semi-structured data

    Get PDF
    As the digitalization of private, commercial, and public sectors advances rapidly, an increasing amount of data is becoming available. In order to gain insights or knowledge from these enormous amounts of raw data, a deep analysis is essential. The immense volume requires highly automated processes with minimal manual interaction. In recent years, machine learning methods have taken on a central role in this task. In addition to the individual data points, their interrelationships often play a decisive role, e.g. whether two patients are related to each other or whether they are treated by the same physician. Hence, relational learning is an important branch of research, which studies how to harness this explicitly available structural information between different data points. Recently, graph neural networks have gained importance. These can be considered an extension of convolutional neural networks from regular grids to general (irregular) graphs. Knowledge graphs play an essential role in representing facts about entities in a machine-readable way. While great efforts are made to store as many facts as possible in these graphs, they often remain incomplete, i.e., true facts are missing. Manual verification and expansion of the graphs is becoming increasingly difficult due to the large volume of data and must therefore be assisted or substituted by automated procedures which predict missing facts. The field of knowledge graph completion can be roughly divided into two categories: Link Prediction and Entity Alignment. In Link Prediction, machine learning models are trained to predict unknown facts between entities based on the known facts. Entity Alignment aims at identifying shared entities between graphs in order to link several such knowledge graphs based on some provided seed alignment pairs. In this thesis, we present important advances in the field of knowledge graph completion. For Entity Alignment, we show how to reduce the number of required seed alignments while maintaining performance by novel active learning techniques. We also discuss the power of textual features and show that graph-neural-network-based methods have difficulties with noisy alignment data. For Link Prediction, we demonstrate how to improve the prediction for unknown entities at training time by exploiting additional metadata on individual statements, often available in modern graphs. Supported with results from a large-scale experimental study, we present an analysis of the effect of individual components of machine learning models, e.g., the interaction function or loss criterion, on the task of link prediction. We also introduce a software library that simplifies the implementation and study of such components and makes them accessible to a wide research community, ranging from relational learning researchers to applied fields, such as life sciences. Finally, we propose a novel metric for evaluating ranking results, as used for both completion tasks. It allows for easier interpretation and comparison, especially in cases with different numbers of ranking candidates, as encountered in the de-facto standard evaluation protocols for both tasks.Mit der rasant fortschreitenden Digitalisierung des privaten, kommerziellen und öffentlichen Sektors werden immer größere Datenmengen verfügbar. Um aus diesen enormen Mengen an Rohdaten Erkenntnisse oder Wissen zu gewinnen, ist eine tiefgehende Analyse unerlässlich. Das immense Volumen erfordert hochautomatisierte Prozesse mit minimaler manueller Interaktion. In den letzten Jahren haben Methoden des maschinellen Lernens eine zentrale Rolle bei dieser Aufgabe eingenommen. Neben den einzelnen Datenpunkten spielen oft auch deren Zusammenhänge eine entscheidende Rolle, z.B. ob zwei Patienten miteinander verwandt sind oder ob sie vom selben Arzt behandelt werden. Daher ist das relationale Lernen ein wichtiger Forschungszweig, der untersucht, wie diese explizit verfügbaren strukturellen Informationen zwischen verschiedenen Datenpunkten nutzbar gemacht werden können. In letzter Zeit haben Graph Neural Networks an Bedeutung gewonnen. Diese können als eine Erweiterung von CNNs von regelmäßigen Gittern auf allgemeine (unregelmäßige) Graphen betrachtet werden. Wissensgraphen spielen eine wesentliche Rolle bei der Darstellung von Fakten über Entitäten in maschinenlesbaren Form. Obwohl große Anstrengungen unternommen werden, so viele Fakten wie möglich in diesen Graphen zu speichern, bleiben sie oft unvollständig, d. h. es fehlen Fakten. Die manuelle Überprüfung und Erweiterung der Graphen wird aufgrund der großen Datenmengen immer schwieriger und muss daher durch automatisierte Verfahren unterstützt oder ersetzt werden, die fehlende Fakten vorhersagen. Das Gebiet der Wissensgraphenvervollständigung lässt sich grob in zwei Kategorien einteilen: Link Prediction und Entity Alignment. Bei der Link Prediction werden maschinelle Lernmodelle trainiert, um unbekannte Fakten zwischen Entitäten auf der Grundlage der bekannten Fakten vorherzusagen. Entity Alignment zielt darauf ab, gemeinsame Entitäten zwischen Graphen zu identifizieren, um mehrere solcher Wissensgraphen auf der Grundlage einiger vorgegebener Paare zu verknüpfen. In dieser Arbeit stellen wir wichtige Fortschritte auf dem Gebiet der Vervollständigung von Wissensgraphen vor. Für das Entity Alignment zeigen wir, wie die Anzahl der benötigten Paare reduziert werden kann, während die Leistung durch neuartige aktive Lerntechniken erhalten bleibt. Wir erörtern auch die Leistungsfähigkeit von Textmerkmalen und zeigen, dass auf Graph-Neural-Networks basierende Methoden Schwierigkeiten mit verrauschten Paar-Daten haben. Für die Link Prediction demonstrieren wir, wie die Vorhersage für unbekannte Entitäten zur Trainingszeit verbessert werden kann, indem zusätzliche Metadaten zu einzelnen Aussagen genutzt werden, die oft in modernen Graphen verfügbar sind. Gestützt auf Ergebnisse einer groß angelegten experimentellen Studie präsentieren wir eine Analyse der Auswirkungen einzelner Komponenten von Modellen des maschinellen Lernens, z. B. der Interaktionsfunktion oder des Verlustkriteriums, auf die Aufgabe der Link Prediction. Außerdem stellen wir eine Softwarebibliothek vor, die die Implementierung und Untersuchung solcher Komponenten vereinfacht und sie einer breiten Forschungsgemeinschaft zugänglich macht, die von Forschern im Bereich des relationalen Lernens bis hin zu angewandten Bereichen wie den Biowissenschaften reicht. Schließlich schlagen wir eine neuartige Metrik für die Bewertung von Ranking-Ergebnissen vor, wie sie für beide Aufgaben verwendet wird. Sie ermöglicht eine einfachere Interpretation und einen leichteren Vergleich, insbesondere in Fällen mit einer unterschiedlichen Anzahl von Kandidaten, wie sie in den de-facto Standardbewertungsprotokollen für beide Aufgaben vorkommen

    Tensor factorization for relational learning

    Get PDF
    Relational learning is concerned with learning from data where information is primarily represented in form of relations between entities. In recent years, this branch of machine learning has become increasingly important, as relational data is generated in an unprecedented amount and has become ubiquitous in many fields of application such as bioinformatics, artificial intelligence and social network analysis. However, relational learning is a very challenging task, due to the network structure and the high dimensionality of relational data. In this thesis we propose that tensor factorization can be the basis for scalable solutions for learning from relational data and present novel tensor factorization algorithms that are particularly suited for this task. In the first part of the thesis, we present the RESCAL model -- a novel tensor factorization for relational learning -- and discuss its capabilities for exploiting the idiosyncratic properties of relational data. In particular, we show that, unlike existing tensor factorizations, our proposed method is capable of exploiting contextual information that is more distant in the relational graph. Furthermore, we present an efficient algorithm for computing the factorization. We show that our method achieves better or on-par results on common benchmark data sets, when compared to current state-of-the-art relational learning methods, while being significantly faster to compute. In the second part of the thesis, we focus on large-scale relational learning and its applications to Linked Data. By exploiting the inherent sparsity of relational data, an efficient computation of RESCAL can scale up to the size of large knowledge bases, consisting of millions of entities, hundreds of relations and billions of known facts. We show this analytically via a thorough analysis of the runtime and memory complexity of the algorithm as well as experimentally via the factorization of the YAGO2 core ontology and the prediction of relationships in this large knowledge base on a single desktop computer. Furthermore, we derive a new procedure to reduce the runtime complexity for regularized factorizations from O(r^5) to O(r^3) -- where r denotes the number of latent components of the factorization -- by exploiting special properties of the factorization. We also present an efficient method for including attributes of entities in the factorization through a novel coupled tensor-matrix factorization. Experimentally, we show that RESCAL allows us to approach several relational learning tasks that are important to Linked Data. In the third part of this thesis, we focus on the theoretical analysis of learning with tensor factorizations. Although tensor factorizations have become increasingly popular for solving machine learning tasks on various forms of structured data, there exist only very few theoretical results on the generalization abilities of these methods. Here, we present the first known generalization error bounds for tensor factorizations. To derive these bounds, we extend known bounds for matrix factorizations to the tensor case. Furthermore, we analyze how these bounds behave for learning on over- and understructured representations, for instance, when matrix factorizations are applied to tensor data. In the course of deriving generalization bounds, we also discuss the tensor product as a principled way to represent structured data in vector spaces for machine learning tasks. In addition, we evaluate our theoretical discussion with experiments on synthetic data, which support our analysis
    corecore