    Extending low-rank matrix factorizations for emerging applications

    Low-rank matrix factorizations have become increasingly popular for projecting high-dimensional data into low-dimensional latent spaces in order to obtain a better understanding of the data and thus more accurate predictions. In particular, they have been widely applied to important applications such as collaborative filtering and social network analysis. In this thesis, I investigate applications and extensions of the ideas of low-rank matrix factorization to solve several practically important problems arising from collaborative filtering and social network analysis. A key challenge in recommendation system research is how to effectively profile new users, a problem generally known as cold-start recommendation. In the first part of this work, we extend the low-rank matrix factorization by allowing the latent factors to have more complex structures, namely decision trees, to solve the problem of cold-start recommendation. In particular, we present functional matrix factorization (fMF), a novel cold-start recommendation method that solves the problem of adaptive interview construction based on low-rank matrix factorizations. The second part of this work considers the efficiency of making recommendations in the context of large user and item spaces. Specifically, we address the problem by learning binary codes for collaborative filtering, which can be viewed as restricting the latent factors in low-rank matrix factorizations to be binary vectors that represent the binary codes for both users and items. In the third part of this work, we investigate applications of low-rank matrix factorizations in the context of social network analysis. Specifically, we propose a convex optimization approach to discover the hidden network of social influence with low-rank and sparse structure by modeling the recurrent events at different individuals as multi-dimensional Hawkes processes, emphasizing the mutually exciting nature of the dynamics of event occurrences. The proposed framework combines the estimation of mutually exciting processes and the low-rank matrix factorization in a principled manner. In the fourth part of this work, we estimate the triggering kernels for the Hawkes process. In particular, we focus on estimating the triggering kernels from an infinite-dimensional functional space with the Euler-Lagrange equation, which can be viewed as applying the idea of low-rank factorizations in the functional space.
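
    The baseline technique this thesis builds on can be illustrated with a minimal sketch: approximating a partially observed rating matrix R by a rank-r product U V^T fitted with alternating least squares. This is a generic illustration, not the thesis's fMF, binary-code, or Hawkes-process extensions; the function name als_factorize and the parameters r and lam are assumptions made for the example.

    # Minimal low-rank matrix factorization for collaborative filtering (illustrative sketch).
    import numpy as np

    def als_factorize(R, mask, r=10, lam=0.1, n_iters=20, seed=0):
        """R: (n_users, n_items) ratings; mask: same shape, 1 where a rating is observed."""
        rng = np.random.default_rng(seed)
        n_users, n_items = R.shape
        U = rng.normal(scale=0.1, size=(n_users, r))
        V = rng.normal(scale=0.1, size=(n_items, r))
        reg = lam * np.eye(r)
        for _ in range(n_iters):
            # Fix V, solve a small ridge-regression problem per user over the items they rated.
            for u in range(n_users):
                rated = mask[u] > 0
                Vu = V[rated]
                U[u] = np.linalg.solve(Vu.T @ Vu + reg, Vu.T @ R[u, rated])
            # Fix U, solve per item over the users who rated it.
            for i in range(n_items):
                raters = mask[:, i] > 0
                Ui = U[raters]
                V[i] = np.linalg.solve(Ui.T @ Ui + reg, Ui.T @ R[raters, i])
        return U, V

    # A predicted rating for user u and item i is U[u] @ V[i].

    In the thesis, the first part replaces the free user factors with functions (decision trees over interview answers), and the second part constrains the factors to binary vectors; the sketch above only shows the unconstrained starting point.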

    Networked Data Analytics: Network Comparison And Applied Graph Signal Processing

    Networked data structures have become large, ubiquitous, and pervasive. As our day-to-day activities become more incorporated with and influenced by the digital world, we rely more on our intuition to provide us with a high-level idea and subconscious understanding of the encountered data. This thesis aims at translating the qualitative intuitions we have about networked data into quantitative and formal tools by designing rigorous yet reasonable algorithms. In a nutshell, this thesis constructs models to compare and cluster networked data, to simplify a complicated networked structure, and to formalize the notion of smoothness and variation for domain-specific signals on a network. The thesis consists of two interrelated thrusts: one explores scenarios where networks have intrinsic value and are themselves the object of study, and the other addresses scenarios where the interest lies in signals defined on top of the networks, so that the information in the network can be leveraged to analyze the signals. Our results suggest that the intuition we have in analyzing large data can be transformed into rigorous algorithms, and often the intuition results in superior performance, new observations, better complexity, and/or bridging two commonly implemented methods. Even though they differ in the principles they investigate, both thrusts are built on what we see as a contemporary shift in data analytics: from building an algorithm and then understanding it, to having an intuition and then building an algorithm around it. We show that, in order to formalize the intuitive idea of measuring the difference between a pair of networks of arbitrary sizes, we can design two algorithms based on the intuition of finding mappings between the node sets or of mapping one network onto a subset of another network. Such methods also lead to a clustering algorithm for categorizing networked data structures. In addition, we can define the notion of frequencies of a given network by ordering features in the network according to how important they are to the overall information conveyed by the network. The proposed algorithms succeed in comparing collaboration histories of researchers, clustering research communities via their publication patterns, categorizing moving objects from uncertain measurements, and separating networks constructed from different processes. In the context of data analytics on top of networks, we design domain-specific tools by leveraging recent advances in graph signal processing, which formalizes the intuitive notions of smoothness and variation of signals defined on top of networked structures and generalizes conventional Fourier analysis to the graph domain. Specifically, we show how these tools can be used to better classify cancer subtypes by considering genetic profiles as signals on top of gene-to-gene interaction networks, to gain new insights into differences between individuals in learning new tasks and switching attention by considering brain activities as signals on top of brain connectivity networks, and to demonstrate that common methods in rating prediction are special graph filters and, building on this observation, to design novel recommendation system algorithms.
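
    The graph-signal-processing notions the abstract relies on (smoothness, variation, and a graph Fourier transform) can be sketched in a few lines: the eigenvectors of the graph Laplacian act as Fourier modes, and the Laplacian quadratic form measures a signal's total variation. The toy graph and signal below are illustrative placeholders, not data from the thesis.

    # Graph Fourier transform and smoothness of a signal on a small toy graph (sketch).
    import numpy as np

    A = np.array([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)   # adjacency matrix of a 4-node graph
    L = np.diag(A.sum(axis=1)) - A               # combinatorial Laplacian L = D - A

    # Eigenvectors of L serve as graph Fourier modes; eigenvalues order them by variation.
    eigvals, eigvecs = np.linalg.eigh(L)

    x = np.array([0.9, 1.0, 1.1, 3.0])           # a signal, one value per node
    x_hat = eigvecs.T @ x                        # graph Fourier transform of x
    smoothness = x @ L @ x                       # sum of squared differences over edges

    print(eigvals)      # "frequencies" of the graph
    print(x_hat)        # spectral coefficients of the signal
    print(smoothness)   # small value = smooth signal on this graph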

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    Modeling Visit Potential of Geographic Locations Based on Mobility Data

    Every day, people interact with the environment by passing or visiting geographic locations. Information about such entity-location interactions can be used in a number of applications, and its value has been recognized by companies and public institutions. However, although the necessary tracking technologies such as GPS, GSM or RFID have long found their way into everyday life, the practical usage of visit information is still limited. Besides economic and ethical reasons for the restricted usage of entity-location interactions, there are also two very basic problems. First, no formal definition of entity-location interaction quantities exists. Second, at the current state of technology, no tracking technology guarantees complete observations, and the treatment of missing data in mobility applications has so far been neglected in trajectory data mining. This thesis therefore focuses on the definition and estimation of quantities describing the visiting behavior between mobile entities and geographic locations from incomplete mobility data. In a first step, we provide an application-independent language to evaluate entity-location interactions. Based on a uniform notation, we define a family of quantities called visit potential, which contains the most basic interaction quantities and can be extended as needed. By identifying the common background of all quantities, we are able to analyze relationships between different quantities and to infer consistency requirements between related parameterizations of the quantities. We demonstrate the general applicability of visit potential using two real-world applications for which we give a precise definition of the employed entity-location interaction quantities in terms of visit potential. Second, this thesis provides the first systematic analysis of methods for handling missing data in mobility mining. We select four promising methods that take different approaches to handling missing data and test their robustness with respect to different scenarios. Our analyses consider different mechanisms and intensities of missing data under artificial censoring as well as varying visit intensities. We hereby analyze not only the applicability of the selected methods but also provide a systematic approach for parameterization and testing that can also be applied to the analysis of other mobility data sets. Our experiments show that only two of the four tested methods supply unbiased estimates of visit potential quantities and are applicable to the domain. In addition, both methods supply unbiased estimates of only a single quantity. Therefore, it remains a future challenge to design estimation methods for the entire collection of visit potential quantities. The topic of this thesis is motivated by applied research at the Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS for business applications in outdoor advertisement. We use the outdoor advertisement scenario throughout this thesis for demonstration and experimentation.
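
    As an illustration of the kind of entity-location interaction data the thesis works with, the following sketch extracts visits by spatially intersecting a trajectory of timestamped GPS points with a circular location and counting maximal runs of points inside it. This is a hypothetical simplification for exposition, not the thesis's formal definition of visit potential; names such as extract_visits and the 50 m radius are assumptions made for the example.

    # Extracting visits from a GPS trajectory by spatial intersection (illustrative sketch).
    from math import radians, sin, cos, asin, sqrt

    def haversine_m(lat1, lon1, lat2, lon2):
        """Great-circle distance in metres between two WGS84 points."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371000 * asin(sqrt(a))

    def extract_visits(trajectory, loc, radius_m=50.0):
        """trajectory: list of (timestamp, lat, lon); loc: (lat, lon).
        Counts visits, where a visit is a maximal run of consecutive points
        that fall within radius_m of the location."""
        visits, inside = 0, False
        for _, lat, lon in trajectory:
            hit = haversine_m(lat, lon, loc[0], loc[1]) <= radius_m
            if hit and not inside:
                visits += 1          # a new visit starts
            inside = hit
        return visits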

    Tensor factorization for relational learning

    Relational learning is concerned with learning from data where information is primarily represented in the form of relations between entities. In recent years, this branch of machine learning has become increasingly important, as relational data is generated in unprecedented amounts and has become ubiquitous in many fields of application such as bioinformatics, artificial intelligence and social network analysis. However, relational learning is a very challenging task, due to the network structure and the high dimensionality of relational data. In this thesis we propose that tensor factorization can be the basis for scalable solutions for learning from relational data and present novel tensor factorization algorithms that are particularly suited for this task. In the first part of the thesis, we present the RESCAL model -- a novel tensor factorization for relational learning -- and discuss its capabilities for exploiting the idiosyncratic properties of relational data. In particular, we show that, unlike existing tensor factorizations, our proposed method is capable of exploiting contextual information that is more distant in the relational graph. Furthermore, we present an efficient algorithm for computing the factorization. We show that our method achieves better or on-par results on common benchmark data sets when compared to current state-of-the-art relational learning methods, while being significantly faster to compute. In the second part of the thesis, we focus on large-scale relational learning and its applications to Linked Data. By exploiting the inherent sparsity of relational data, an efficient computation of RESCAL can scale up to the size of large knowledge bases, consisting of millions of entities, hundreds of relations and billions of known facts. We show this analytically via a thorough analysis of the runtime and memory complexity of the algorithm, as well as experimentally via the factorization of the YAGO2 core ontology and the prediction of relationships in this large knowledge base on a single desktop computer. Furthermore, we derive a new procedure to reduce the runtime complexity for regularized factorizations from O(r^5) to O(r^3) -- where r denotes the number of latent components of the factorization -- by exploiting special properties of the factorization. We also present an efficient method for including attributes of entities in the factorization through a novel coupled tensor-matrix factorization. Experimentally, we show that RESCAL allows us to approach several relational learning tasks that are important to Linked Data. In the third part of this thesis, we focus on the theoretical analysis of learning with tensor factorizations. Although tensor factorizations have become increasingly popular for solving machine learning tasks on various forms of structured data, very few theoretical results on the generalization abilities of these methods exist. Here, we present the first known generalization error bounds for tensor factorizations. To derive these bounds, we extend known bounds for matrix factorizations to the tensor case. Furthermore, we analyze how these bounds behave for learning on over- and under-structured representations, for instance, when matrix factorizations are applied to tensor data. In the course of deriving generalization bounds, we also discuss the tensor product as a principled way to represent structured data in vector spaces for machine learning tasks. In addition, we evaluate our theoretical discussion with experiments on synthetic data, which support our analysis.
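
    The factorization form at the heart of RESCAL, X_k ≈ A R_k A^T with a shared entity matrix A, can be sketched with a compact alternating-least-squares loop. The dense updates below follow the standard RESCAL-ALS scheme in simplified form and ignore the sparsity exploitation and the O(r^5) to O(r^3) improvements described in the thesis; variable names, dimensions and the regularization weight lam are illustrative assumptions.

    # Simplified dense RESCAL-ALS: factorize each relation slice X_k as A @ R_k @ A.T (sketch).
    import numpy as np

    def rescal_als(X, r=5, lam=0.1, n_iters=30, seed=0):
        """X: list of (n, n) relation slices. Returns A (n, r) and a list of cores R_k (r, r)."""
        rng = np.random.default_rng(seed)
        n = X[0].shape[0]
        A = rng.normal(scale=0.1, size=(n, r))
        R = [rng.normal(scale=0.1, size=(r, r)) for _ in X]
        for _ in range(n_iters):
            # Update the shared entity factors A with all slices and cores fixed.
            F = sum(Xk @ A @ Rk.T + Xk.T @ A @ Rk for Xk, Rk in zip(X, R))
            E = sum(Rk @ (A.T @ A) @ Rk.T + Rk.T @ (A.T @ A) @ Rk for Rk in R)
            A = F @ np.linalg.inv(E + lam * np.eye(r))
            # Update each core R_k as a ridge-regularized least-squares problem in vec(R_k).
            Z = np.kron(A, A)                                  # (n*n, r*r), dense and naive
            G = np.linalg.inv(Z.T @ Z + lam * np.eye(r * r)) @ Z.T
            R = [(G @ Xk.reshape(-1)).reshape(r, r) for Xk in X]
        return A, R

    # A candidate triple (subject i, relation k, object j) is scored by A[i] @ R[k] @ A[j].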

    Foundations of Software Science and Computation Structures

    This open access book constitutes the proceedings of the 22nd International Conference on Foundations of Software Science and Computation Structures, FOSSACS 2019, which took place in Prague, Czech Republic, in April 2019, held as part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2019. The 29 papers presented in this volume were carefully reviewed and selected from 85 submissions. They deal with foundational research with a clear significance for software science.