27 research outputs found

    A Multi-view Context-aware Approach to Android Malware Detection and Malicious Code Localization

    Existing Android malware detection approaches use a variety of features such as security-sensitive APIs, system calls, control-flow structures and information flows in conjunction with Machine Learning classifiers to achieve accurate detection. Each of these feature sets provides a unique semantic perspective (or view) of apps' behaviours, with inherent strengths and limitations. That is, some views are better suited to detecting certain attacks but may not be suitable for characterising others. Most of the existing malware detection approaches use only one (or a selected few) of the aforementioned feature sets, which prevents them from detecting a vast majority of attacks. Addressing this limitation, we propose MKLDroid, a unified framework that systematically integrates multiple views of apps for performing comprehensive malware detection and malicious code localisation. The rationale is that, while a malware app can disguise itself in some views, disguising itself in every view while maintaining malicious intent will be much harder. MKLDroid uses a graph kernel to capture structural and contextual information from apps' dependency graphs and identify malicious code patterns in each view. Subsequently, it employs Multiple Kernel Learning (MKL) to find a weighted combination of the views which yields the best detection accuracy. Besides multi-view learning, MKLDroid's unique and salient trait is its ability to locate fine-grained malicious code portions in dependency graphs (e.g., methods/classes). Through our large-scale experiments on several datasets (incl. wild apps), we demonstrate that MKLDroid consistently outperforms three state-of-the-art techniques in terms of accuracy while maintaining comparable efficiency. In our malicious code localisation experiments on a dataset of repackaged malware, MKLDroid was able to identify all the malicious classes with 94% average recall.
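
The idea of a weighted combination of per-view kernels can be illustrated with a minimal sketch. It is not MKLDroid's implementation: the weighting rule used here (kernel-target alignment) is a simple stand-in heuristic, and the function names `alignment`, `alignment_weights` and `combine_kernels` are hypothetical. Each view is assumed to be given as a square kernel (Gram) matrix over the same apps, with labels in {+1, -1}.

```python
import math


def frobenius_inner(a, b):
    """Frobenius inner product of two equally sized matrices (lists of rows)."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def alignment(kernel, labels):
    """Kernel-target alignment between a kernel matrix and the label matrix y y^T."""
    target = [[yi * yj for yj in labels] for yi in labels]
    num = frobenius_inner(kernel, target)
    den = math.sqrt(frobenius_inner(kernel, kernel) * frobenius_inner(target, target))
    return num / den if den > 0 else 0.0


def alignment_weights(kernels, labels):
    """One non-negative weight per view (assumed heuristic, not a learned MKL solution)."""
    return [max(alignment(k, labels), 0.0) for k in kernels]


def combine_kernels(kernels, weights):
    """Weighted sum of kernel matrices, with weights normalised to sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    n = len(kernels[0])
    return [
        [sum(w / total * k[i][j] for w, k in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]


# Toy data: two views over four apps (two malicious, two benign).
labels = [1, 1, -1, -1]
view_a = [[yi * yj for yj in labels] for yi in labels]                 # agrees with labels
view_b = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]  # identity kernel
weights = alignment_weights([view_a, view_b], labels)
combined = combine_kernels([view_a, view_b], weights)
```

The combined matrix could then be passed to any kernel classifier that accepts a precomputed kernel. A view that agrees with the labels receives a larger weight than an uninformative one.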

    A Comprehensive Survey on Deep Graph Representation Learning

    Graph representation learning aims to effectively encode high-dimensional sparse graph-structured data into low-dimensional dense vectors, which is a fundamental task that has been widely studied in a range of fields, including machine learning and data mining. Classic graph embedding methods follow the basic idea that the embedding vectors of interconnected nodes in the graph should remain relatively close to each other, thereby preserving the structural information between the nodes in the graph. However, this is sub-optimal for three reasons: (i) traditional methods have limited model capacity, which limits learning performance; (ii) existing techniques typically rely on unsupervised learning strategies and fail to couple with the latest learning paradigms; (iii) representation learning and downstream tasks depend on each other and should be jointly enhanced. With the remarkable success of deep learning, deep graph representation learning has shown great potential and advantages over shallow (traditional) methods, and a large number of deep graph representation learning techniques, especially graph neural networks, have been proposed in the past decade. In this work, we conduct a comprehensive survey of current deep graph representation learning algorithms by proposing a new taxonomy of the existing state-of-the-art literature. Specifically, we systematically summarize the essential components of graph representation learning and categorize existing approaches by graph neural network architecture and by the most recent advanced learning paradigms. Moreover, this survey also presents practical and promising applications of deep graph representation learning. Last but not least, we state new perspectives and suggest challenging directions that deserve further investigation.

    Semantic-guided predictive modeling and relational learning within industrial knowledge graphs

    The ubiquitous availability of data in today’s manufacturing environments, mainly driven by the extended usage of software and built-in sensing capabilities in automation systems, enables companies to embrace more advanced predictive modeling and analysis in order to optimize processes and usage of equipment. While the potential insight gained from such analysis is high, it often remains untapped, since integration and analysis of data silos from different production domains requires high manual effort and is therefore not economical. Addressing these challenges, digital representations of production equipment, so-called digital twins, have emerged, leading the way to semantic interoperability across systems in different domains. From a data modeling point of view, digital twins can be seen as industrial knowledge graphs, which are used as the semantic backbone of manufacturing software systems and data analytics. Because the prevalent manufacturing software landscape has grown historically, is scattered, and comprises numerous proprietary information models, data sources are highly heterogeneous. Therefore, there is an increasing need for semi-automatic support in data modeling, enabling end-user engineers to model their domain and maintain a unified semantic knowledge graph across the company. Once the data modeling and integration is done, further challenges arise, since there has been little research on how knowledge graphs can contribute to the simplification and abstraction of statistical analysis and predictive modeling, especially in manufacturing. In this thesis, new approaches for modeling and maintaining industrial knowledge graphs with a focus on the application of statistical models are presented. 
First, concerning data modeling, we discuss requirements from several existing standard information models and analytic use cases in the manufacturing and automation system domains and derive a fragment of the OWL 2 language that is expressive enough to cover the required semantics for a broad range of use cases. The prototypical implementation enables domain end-users, i.e. engineers, to extend the basis ontology model with intuitive semantics. Furthermore, it supports efficient reasoning and constraint checking via translation to rule-based representations. Based on these models, we propose an architecture for the end-user facilitated application of statistical models using ontological concepts and ontology-based data access paradigms. In addition, we present an approach for domain knowledge-driven preparation of predictive models in terms of feature selection and show how schema-level reasoning in the OWL 2 language can be employed for this task within knowledge graphs of industrial automation systems. A production cycle time prediction model in an example application scenario serves as a proof of concept and demonstrates that axiomatized domain knowledge about features can give performance competitive with purely data-driven approaches. In the case of high-dimensional data with small sample size, we show that graph kernels of domain ontologies can provide additional information on the degree of variable dependence. Furthermore, a special application of feature selection in graph-structured data is presented, and we develop a method that allows domain constraints derived from meta-paths in knowledge graphs to be incorporated into a branch-and-bound pattern enumeration algorithm. Lastly, we discuss maintenance of facts in large-scale industrial knowledge graphs, focused on latent variable models for the automated population and completion of missing facts. 
State-of-the-art approaches cannot deal with time-series data in the form of events that naturally occur in industrial applications. Therefore, we present an extension of learning knowledge graph embeddings in conjunction with data in the form of event logs. Finally, we design several use case scenarios of missing information and evaluate our embedding approach on data coming from a real-world factory environment. We draw the conclusion that industrial knowledge graphs are a powerful tool that can be used by end-users in the manufacturing domain for data modeling and model validation. They are especially suitable in terms of the facilitated application of statistical models in conjunction with background domain knowledge by providing information about features upfront. Furthermore, relational learning approaches showed great potential to semi-automatically infer missing facts and provide recommendations to production operators on how to keep stored facts in sync with the real world.

    Understanding Graph Data Through Deep Learning Lens

    Deep neural network models have established themselves as an unparalleled force in the domains of vision, speech and text processing applications in recent years. However, graphs have formed a significant component of data analytics, including applications in the Internet of Things, social networks, pharmaceuticals and bioinformatics. An important characteristic of these deep learning techniques is their ability to learn the important features which are necessary to excel at a given task, unlike traditional machine learning algorithms, which are dependent on handcrafted features. However, there have been comparatively fewer efforts in deep learning to directly work on graph inputs. Various real-world problems can be easily solved by posing them as a graph analysis problem. Considering the direct impact of the success of graph analysis on business outcomes, the importance of studying these complex graph data has increased exponentially over the years. In this thesis, we make three contributions towards understanding graph data: (i) the first contribution seeks to find anomalies in graphs using graphical models; (ii) the second contribution uses deep learning with spatio-temporal random walks to learn representations of graph trajectories (paths) and shows great promise on standard graph datasets; and (iii) the third contribution proposes a novel deep neural network that implicitly models attention to allow for interpretation of graph classification.

    Exploiting semantic web knowledge graphs in data mining

    Data Mining and Knowledge Discovery in Databases (KDD) is a research field concerned with deriving higher-level insights from data. The tasks performed in that field are knowledge intensive and can often benefit from using additional knowledge from various sources. Therefore, many approaches have been proposed in this area that combine Semantic Web data with the data mining and knowledge discovery process. Semantic Web knowledge graphs are a backbone of many information systems that require access to structured knowledge. Such knowledge graphs contain factual knowledge about real-world entities and the relations between them, which can be utilized in various natural language processing, information retrieval, and data mining applications. Following the principles of the Semantic Web, Semantic Web knowledge graphs are publicly available as Linked Open Data. Linked Open Data is an open, interlinked collection of datasets in machine-interpretable form, covering most real-world domains. In this thesis, we investigate the hypothesis that Semantic Web knowledge graphs can be exploited as background knowledge in different steps of the knowledge discovery process and in different data mining tasks. More precisely, we aim to show that Semantic Web knowledge graphs can be utilized for generating valuable data mining features that can be used in various data mining tasks. Identifying, collecting and integrating useful background knowledge for a given data mining application can be a tedious and time-consuming task. Furthermore, most data mining tools require features in propositional form, i.e., binary, nominal or numerical features associated with an instance, while Linked Open Data sources are usually graphs by nature. Therefore, in Part I, we evaluate unsupervised feature generation strategies from types and relations in knowledge graphs, which are used in different data mining tasks, i.e., classification, regression, and outlier detection. 
As the number of generated features grows rapidly with the number of instances in the dataset, we provide a strategy for feature selection in hierarchical feature space, in order to select only the most informative and most representative features for a given dataset. Furthermore, we provide an end-to-end tool for mining the Web of Linked Data, which provides functionalities for each step of the knowledge discovery process, i.e., linking local data to a Semantic Web knowledge graph, integrating features from multiple knowledge graphs, feature generation and selection, and building machine learning models. However, we show that such feature generation strategies often lead to high dimensional feature vectors even after dimensionality reduction, and also, the reusability of such feature vectors across different datasets is limited. In Part II, we propose an approach that circumvents the shortcomings introduced with the approaches in Part I. More precisely, we develop an approach that is able to embed complete Semantic Web knowledge graphs in a low dimensional feature space, where each entity and relation in the knowledge graph is represented as a numerical vector. Projecting such latent representations of entities into a lower dimensional feature space shows that semantically similar entities appear closer to each other. We use several Semantic Web knowledge graphs to show that such latent representation of entities have high relevance for different data mining tasks. Furthermore, we show that such features can be easily reused for different datasets and different tasks. In Part III, we describe a list of applications that exploit Semantic Web knowledge graphs, besides the standard data mining tasks, like classification and regression. We show that the approaches developed in Part I and Part II can be used in applications in various domains. 
More precisely, we show that Semantic Web graphs can be exploited for analyzing statistics, building recommender systems, entity and document modeling, and taxonomy induction. In Part III, we focus on semantic annotations in HTML pages, which are another realization of the Semantic Web vision. Semantic annotations are integrated into the code of HTML pages using markup languages, like Microformats, RDFa, and Microdata. While such data covers various domains and topics, and can be useful for developing various data mining applications, additional steps of cleaning and integrating the data need to be performed. In this thesis, we describe a set of approaches for processing long literals and images extracted from semantic annotations in HTML pages. We showcase the approaches in the e-commerce domain. Such approaches contribute to building and consuming Semantic Web knowledge graphs.

    Representation learning on complex data

    Machine learning has enabled remarkable progress in various fields of research and application in recent years. The primary objective of machine learning consists of developing algorithms that can learn and improve through observation and experience. Machine learning algorithms learn from data, which may exhibit various forms of complexity, which pose fundamental challenges. In this thesis, we address two major types of data complexity: First, data is often inherently connected and can be modeled by a single or multiple graphs. Machine learning methods could potentially exploit these connections, for instance, to find groups of similar users in a social network for targeted marketing or to predict functional properties of proteins for drug design. Secondly, data is often high-dimensional, for instance, due to a large number of recorded features or induced by a quadratic pixel grid on images. Classical machine learning methods perennially fail when exposed to high-dimensional data as several key assumptions cease to be satisfied. Therefore, a major challenge associated with machine learning on graphs and high-dimensional data is to derive meaningful representations of this data, which allow models to learn effectively. In contrast to conventional manual feature engineering methods, representation learning aims at automatically learning data representations that are particularly suitable for a specific task at hand. Driven by a rapidly increasing availability of data, these methods have celebrated tremendous success for tasks such as object detection in images and speech recognition. However, there is still a considerable amount of research work to be done to fully leverage such techniques for learning on graphs and high-dimensional data. In this thesis, we address the problem of learning meaningful representations for highly-effective machine learning on complex data, in particular, graph data and high-dimensional data. 
Additionally, most of our proposed methods are highly scalable, allowing them to learn from massive amounts of data. While we address a wide range of general learning problems with different modes of supervision, ranging from unsupervised problems on unlabeled data to (semi-)-supervised learning on annotated data sets, we evaluate our models on specific tasks from fields such as social network analysis, information security, and computer vision. The first part of this thesis addresses representation learning on graphs. While existing graph neural network models commonly perform synchronous message passing between nodes and thus struggle with long-range dependencies and efficiency issues, our first proposed method performs fast asynchronous message passing and, therefore, supports adaptive and efficient learning and additionally scales to large graphs. Another contribution consists of a novel graph-based approach to malware detection and classification based on network traffic. While existing methods classify individual network flows between two endpoints, our algorithm collects all traffic in a monitored network within a specific time frame and builds a communication graph, which is then classified using a novel graph neural network model. The developed model can be generally applied to further graph classification or anomaly detection tasks. Two further contributions challenge a common assumption made by graph learning methods, termed homophily, which states that nodes with similar properties are usually closely connected in the graph. To this end, we develop a method that predicts node-level properties leveraging the distribution of class labels appearing in the neighborhood of the respective node. That allows our model to learn general relations between a node and its neighbors, which are not limited to homophily. 
Another proposed method specifically models structural similarity between nodes to model different roles, for instance, influencers and followers in a social network. In particular, we develop an unsupervised algorithm for deriving node descriptors based on how nodes spread probability mass to their neighbors and aggregate these descriptors to represent entire graphs. The second part of this thesis addresses representation learning on high-dimensional data. Specifically, we consider the problem of clustering high-dimensional data, such as images, texts, or gene expression profiles. Classical clustering algorithms struggle with this type of data since it can usually not be assumed that data objects will be similar w.r.t. all attributes, but only within a particular subspace of the full-dimensional ambient space. Subspace clustering is an approach to clustering high-dimensional data based on this assumption. While there already exist powerful neural network-based subspace clustering methods, these methods commonly suffer from scalability issues and lack a theoretical foundation. To this end, we propose a novel metric learning approach to subspace clustering, which can provably recover linear subspaces under suitable assumptions and, at the same time, tremendously reduces the required number of model parameters and memory compared to existing algorithms.

    A Survey on Graph Representation Learning Methods

    Graph representation learning has been a very active research area in recent years. The goal of graph representation learning is to generate graph representation vectors that accurately capture the structure and features of large graphs. This is especially important because the quality of the graph representation vectors affects their performance in downstream tasks such as node classification, link prediction and anomaly detection. Many techniques have been proposed for generating effective graph representation vectors. Two of the most prevalent categories of graph representation learning are graph embedding methods that do not use graph neural nets (GNN), which we denote as non-GNN based graph embedding methods, and GNN-based methods. Non-GNN graph embedding methods are based on techniques such as random walks, temporal point processes and neural network learning methods. GNN-based methods, on the other hand, apply deep learning to graph data. In this survey, we provide an overview of these two categories and cover the current state-of-the-art methods for both static and dynamic graphs. Finally, we explore some open and ongoing research directions for future work.
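
As a rough illustration of the random-walk ingredient mentioned above for non-GNN embedding methods, the sketch below generates uniform random walks over a small graph. It is not taken from the survey: the function name `random_walks`, its parameters and the toy graph are assumptions, and the walks would normally be fed to a separate embedding model, which is not shown.

```python
import random


def random_walks(adjacency, walk_length, walks_per_node, seed=0):
    """Generate uniform random walks starting from every node.

    adjacency maps each node to a list of its neighbours (assumed input format).
    A walk stops early if it reaches a node without neighbours.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    walks = []
    for start in sorted(adjacency):
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks


# Toy undirected path graph a - b - c.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
walks = random_walks(graph, walk_length=4, walks_per_node=2)
```

Each walk is a sequence of node identifiers in which consecutive nodes are connected by an edge. Treating such sequences like sentences is one common way to train node embeddings.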

    Learning with Graphs using Kernels from Propagated Information

    Traditional machine learning approaches are designed to learn from independent vector-valued data points. The assumption that instances are independent, however, is not always true. On the contrary, there are numerous domains where data points are cross-linked, for example social networks, where persons are linked by friendship relations. These relations among data points make traditional machine learning difficult and often insufficient. Furthermore, data points themselves can have complex structure, for example molecules or proteins constructed from various bindings of different atoms. Networked and structured data are naturally represented by graphs, and for learning we aim to exploit their structure to improve upon non-graph-based methods. However, graphs encountered in real-world applications often come with rich additional information. This naturally implies many challenges for representation and learning: node information is likely to be incomplete, leading to partially labeled graphs; information can be aggregated from multiple sources and can therefore be uncertain; or additional information on nodes and edges can be derived from complex sensor measurements, thus being naturally continuous. Learning with graphs is an active research area, but gaps remain. Learning with structured data, which substantially models structural similarities of graphs, mostly assumes fully labeled graphs of reasonable sizes with discrete and certain node and edge information. Learning with networked data, which naturally deals with missing information and huge graphs, mostly assumes homophily and neglects structural similarity. To close these gaps, we present a novel paradigm for learning with graphs that exploits the intermediate results of iterative information propagation schemes on graphs. 
Originally developed for within-network relational and semi-supervised learning, these propagation schemes have two desirable properties: they capture structural information and they can naturally adapt to the aforementioned issues of real-world graph data. Additionally, information propagation can be efficiently realized by random walks leading to fast, flexible, and scalable feature and kernel computations. Further, by considering intermediate random walk distributions, we can model structural similarity for learning with structured and networked data. We develop several approaches based on this paradigm. In particular, we introduce propagation kernels for learning on the graph level and coinciding walk kernels and Markov logic sets for learning on the node level. Finally, we present two application domains where kernels from propagated information successfully tackle real-world problems
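
The idea of building a graph kernel from the intermediate distributions of an information propagation scheme can be sketched as follows. This is a simplified illustration and not the thesis's implementation: node label distributions are propagated by averaging over neighbours (one random-walk step), and distributions are grouped by plain rounding, which is an assumed stand-in for whatever binning the actual method uses. The names `propagate`, `signature_counts` and `propagation_kernel` are hypothetical.

```python
from collections import Counter


def propagate(adjacency, dist):
    """One propagation step: each node takes the mean of its neighbours' distributions."""
    new = {}
    for node, neighbours in adjacency.items():
        if not neighbours:
            new[node] = dist[node]  # isolated nodes keep their distribution
            continue
        size = len(dist[node])
        new[node] = tuple(
            sum(dist[n][c] for n in neighbours) / len(neighbours) for c in range(size)
        )
    return new


def signature_counts(dist, decimals=1):
    """Count nodes per rounded distribution (rounding is an assumed, simple binning)."""
    return Counter(tuple(round(p, decimals) for p in d) for d in dist.values())


def propagation_kernel(graph_a, graph_b, steps=2):
    """Sum, over all iterations, of the number of node pairs sharing a bin."""
    (adj_a, dist_a), (adj_b, dist_b) = graph_a, graph_b
    value = 0
    for _ in range(steps + 1):
        counts_a, counts_b = signature_counts(dist_a), signature_counts(dist_b)
        value += sum(counts_a[s] * counts_b[s] for s in counts_a)
        dist_a, dist_b = propagate(adj_a, dist_a), propagate(adj_b, dist_b)
    return value


# Two toy graphs, each a single edge, with one-hot label distributions per node.
edge = {0: [1], 1: [0]}
g = (edge, {0: (1.0, 0.0), 1: (0.0, 1.0)})  # endpoints carry different labels
h = (edge, {0: (1.0, 0.0), 1: (1.0, 0.0)})  # endpoints carry the same label
k_gh = propagation_kernel(g, h)
```

Because each iteration contributes a dot product of count vectors, the resulting value is symmetric in its two arguments, and the per-iteration counts are cheap to compute even for large graphs.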

    Fundamentals

    Volume 1 establishes the foundations of this new field. It goes through all the steps from data collection, their summary and clustering, to different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are inspected with respect to resource requirements and how to enhance scalability on diverse computing architectures ranging from embedded systems to large computing clusters