5 research outputs found

    Information-theoretic graph mining

    Get PDF
    Real world data from various application domains can be modeled as a graph, e.g. social networks and biomedical networks like protein interaction networks or co-activation networks of brain regions. A graph is a powerful concept to model arbitrary (structural) relationships among objects. In recent years, the prevalence of social networks has made graph mining an important center of attention in the data mining field. There are many important tasks in graph mining, such as graph clustering, outlier detection, and link prediction. Many algorithms have been proposed in the literature to solve these tasks. However, normally these issues are solved separately, although they are closely related. Detecting and exploiting the relationship among them is a new challenge in graph mining. Moreover, with data explosion, more information has already been integrated into graph structure. For example, bipartite graphs contain two types of node and graphs with node attributes offer additional non-structural information. Therefore, more challenges arise from the increasing graph complexity. This thesis aims to solve these challenges in order to gain new knowledge from graph data. An important paradigm of data mining used in this thesis is the principle of Minimum Description Length (MDL). It follows the assumption: the more knowledge we have learned from the data, the better we are able to compress the data. The MDL principle balances the complexity of the selected model and the goodness of fit between model and data. Thus, it naturally avoids over-fitting. This thesis proposes several algorithms based on the MDL principle to acquire knowledge from various types of graphs: Info-spot (Automatically Spotting Information-rich Nodes in Graphs) proposes a parameter-free and efficient algorithm for the fully automatic detection of interesting nodes which is a novel outlier notion in graph. Then in contrast to traditional graph mining approaches that focus on discovering dense subgraphs, a novel graph mining technique CXprime (Compression-based eXploiting Primitives) is proposed. It models the transitivity and the hubness of a graph using structure primitives (all possible three-node substructures). Under the coding scheme of CXprime, clusters with structural information can be discovered, dominating substructures of a graph can be distinguished, and a new link prediction score based on substructures is proposed. The next algorithm SCMiner (Summarization-Compression Miner) integrates tasks such as graph summarization, graph clustering, link prediction, and the discovery of the hidden structure of a bipartite graph on the basis of data compression. Finally, a method for non-redundant graph clustering called IROC (Information-theoretic non-Redundant Overlapping Clustering) is proposed to smartly combine structural information with non-structural information based on MDL. IROC is able to detect overlapping communities within subspaces of the attributes. To sum up, algorithms to unify different learning tasks for various types of graphs are proposed. Additionally, these algorithms are based on the MDL principle, which facilitates the unification of different graph learning tasks, the integration of different graph types, and the automatic selection of input parameters that are otherwise difficult to estimate

    Constructing hypergraphs from temporal data

    Full text link
    A wide range of systems across the social and natural sciences produce temporal data consisting of interaction events among nodes in disjoint sets. Online shopping, for example, generates purchasing events of the form (user, product, time of purchase), and mutualistic interactions in plant-pollinator systems generate pollination events of the form (insect, plant, time of pollination). These data sets can be meaningfully modeled as temporal hypergraph snapshots in which multiple nodes within one set (i.e. online shoppers) share a hyperedge if they interacted with a common node in the opposite set (i.e. purchased the same product) within a given time window, allowing for the application of a range of hypergraph analysis techniques. However, it is often unclear how to choose the number and duration of these temporal snapshots, which have a strong influence on the final hypergraph representations. Here we propose a principled, efficient, nonparametric solution to this longstanding problem by extracting temporal hypergraph snapshots that optimally capture structural regularities in temporal event data according to the minimum description length principle. We demonstrate our methods on real and synthetic datasets, finding that they can recover planted artificial hypergraph structure in the presence of considerable noise and reveal meaningful activity fluctuations in human mobility data

    The Minimum Description Length Principle for Pattern Mining: A Survey

    Full text link
    This is about the Minimum Description Length (MDL) principle applied to pattern mining. The length of this description is kept to the minimum. Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The MDL principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim to obtain compact high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, as well as of work on the theory behind the MDL and similar principles, we review MDL-based methods for mining various types of data and patterns. Finally, we open a discussion on some issues regarding these methods, and highlight currently active related data analysis problems

    Summarization-based mining bipartite graphs

    No full text
    corecore