1,120 research outputs found

    DSL: Discriminative Subgraph Learning via Sparse Self-Representation

    Full text link
    The goal in network state prediction (NSP) is to classify the global state (label) associated with features embedded in a graph. This graph structure encoding feature relationships is the key distinctive aspect of NSP compared to classical supervised learning. NSP arises in various applications: gene expression samples embedded in a protein-protein interaction (PPI) network, temporal snapshots of infrastructure or sensor networks, and fMRI coherence network samples from multiple subjects to name a few. Instances from these domains are typically ``wide'' (more features than samples), and thus, feature sub-selection is required for robust and generalizable prediction. How to best employ the network structure in order to learn succinct connected subgraphs encompassing the most discriminative features becomes a central challenge in NSP. Prior work employs connected subgraph sampling or graph smoothing within optimization frameworks, resulting in either large variance of quality or weak control over the connectivity of selected subgraphs. In this work we propose an optimization framework for discriminative subgraph learning (DSL) which simultaneously enforces (i) sparsity, (ii) connectivity and (iii) high discriminative power of the resulting subgraphs of features. Our optimization algorithm is a single-step solution for the NSP and the associated feature selection problem. It is rooted in the rich literature on maximal-margin optimization, spectral graph methods and sparse subspace self-representation. DSL simultaneously ensures solution interpretability and superior predictive power (up to 16% improvement in challenging instances compared to baselines), with execution times up to an hour for large instances.Comment: 9 page

    Dynamic load balancing for the distributed mining of molecular structures

    Get PDF
    In molecular biology, it is often desirable to find common properties in large numbers of drug candidates. One family of methods stems from the data mining community, where algorithms to find frequent graphs have received increasing attention over the past years. However, the computational complexity of the underlying problem and the large amount of data to be explored essentially render sequential algorithms useless. In this paper, we present a distributed approach to the frequent subgraph mining problem to discover interesting patterns in molecular compounds. This problem is characterized by a highly irregular search tree, whereby no reliable workload prediction is available. We describe the three main aspects of the proposed distributed algorithm, namely, a dynamic partitioning of the search space, a distribution process based on a peer-to-peer communication framework, and a novel receiverinitiated load balancing algorithm. The effectiveness of the distributed method has been evaluated on the well-known National Cancer Institute’s HIV-screening data set, where we were able to show close-to linear speedup in a network of workstations. The proposed approach also allows for dynamic resource aggregation in a non dedicated computational environment. These features make it suitable for large-scale, multi-domain, heterogeneous environments, such as computational grids

    Efficient mining of discriminative molecular fragments

    Get PDF
    Frequent pattern discovery in structured data is receiving an increasing attention in many application areas of sciences. However, the computational complexity and the large amount of data to be explored often make the sequential algorithms unsuitable. In this context high performance distributed computing becomes a very interesting and promising approach. In this paper we present a parallel formulation of the frequent subgraph mining problem to discover interesting patterns in molecular compounds. The application is characterized by a highly irregular tree-structured computation. No estimation is available for task workloads, which show a power-law distribution in a wide range. The proposed approach allows dynamic resource aggregation and provides fault and latency tolerance. These features make the distributed application suitable for multi-domain heterogeneous environments, such as computational Grids. The distributed application has been evaluated on the well known National Cancer Institute’s HIV-screening dataset

    Outlier Detection from Network Data with Subnetwork Interpretation

    Full text link
    Detecting a small number of outliers from a set of data observations is always challenging. This problem is more difficult in the setting of multiple network samples, where computing the anomalous degree of a network sample is generally not sufficient. In fact, explaining why the network is exceptional, expressed in the form of subnetwork, is also equally important. In this paper, we develop a novel algorithm to address these two key problems. We treat each network sample as a potential outlier and identify subnetworks that mostly discriminate it from nearby regular samples. The algorithm is developed in the framework of network regression combined with the constraints on both network topology and L1-norm shrinkage to perform subnetwork discovery. Our method thus goes beyond subspace/subgraph discovery and we show that it converges to a global optimum. Evaluation on various real-world network datasets demonstrates that our algorithm not only outperforms baselines in both network and high dimensional setting, but also discovers highly relevant and interpretable local subnetworks, further enhancing our understanding of anomalous networks

    Mining Representative Unsubstituted Graph Patterns Using Prior Similarity Matrix

    Full text link
    One of the most powerful techniques to study protein structures is to look for recurrent fragments (also called substructures or spatial motifs), then use them as patterns to characterize the proteins under study. An emergent trend consists in parsing proteins three-dimensional (3D) structures into graphs of amino acids. Hence, the search of recurrent spatial motifs is formulated as a process of frequent subgraph discovery where each subgraph represents a spatial motif. In this scope, several efficient approaches for frequent subgraph discovery have been proposed in the literature. However, the set of discovered frequent subgraphs is too large to be efficiently analyzed and explored in any further process. In this paper, we propose a novel pattern selection approach that shrinks the large number of discovered frequent subgraphs by selecting the representative ones. Existing pattern selection approaches do not exploit the domain knowledge. Yet, in our approach we incorporate the evolutionary information of amino acids defined in the substitution matrices in order to select the representative subgraphs. We show the effectiveness of our approach on a number of real datasets. The results issued from our experiments show that our approach is able to considerably decrease the number of motifs while enhancing their interestingness

    kLog: A Language for Logical and Relational Learning with Kernels

    Full text link
    We introduce kLog, a novel approach to statistical relational learning. Unlike standard approaches, kLog does not represent a probability distribution directly. It is rather a language to perform kernel-based learning on expressive logical and relational representations. kLog allows users to specify learning problems declaratively. It builds on simple but powerful concepts: learning from interpretations, entity/relationship data modeling, logic programming, and deductive databases. Access by the kernel to the rich representation is mediated by a technique we call graphicalization: the relational representation is first transformed into a graph --- in particular, a grounded entity/relationship diagram. Subsequently, a choice of graph kernel defines the feature space. kLog supports mixed numerical and symbolic data, as well as background knowledge in the form of Prolog or Datalog programs as in inductive logic programming systems. The kLog framework can be applied to tackle the same range of tasks that has made statistical relational learning so popular, including classification, regression, multitask learning, and collective classification. We also report about empirical comparisons, showing that kLog can be either more accurate, or much faster at the same level of accuracy, than Tilde and Alchemy. kLog is GPLv3 licensed and is available at http://klog.dinfo.unifi.it along with tutorials

    Predictive analysis of real-time strategy games using graph mining

    Get PDF
    Machine learning and computational intelligence have facilitated the development of recommendation systems for a broad range of domains. Such recommendations are based on contextual information that is explicitly provided or pervasively collected. Recommendation systems often improve decision-making or increase the efficacy of a task. Real-Time Strategy (RTS) video games are not only a popular entertainment medium, they also are an abstraction of many real-world applications where the aim is to increase your resources and decrease those of your opponent. Using predictive analytics, which examines past examples of success and failure, we can learn how to predict positive outcomes for such scenarios. To do this, one way to represent this type of data in order to model relationships between entities is by using graphs. The vast amount of data has resulting in complex and large graphs that are difficult to process. Hence, researchers frequently employ parallelized or distributed processing. But first, the graph data must be partitioned and assigned to multiple processors in such a way that the workload will be balanced, and inter-processor communication will be minimized. The latter problem may be complicated by the existence of edges between vertices in a graph that have been assigned to different processors. One objective of this research is to develop an accurate predictive recommendation system for multiplayer strategic games to determine recommendations for moves that a player should, and should not, make which can provide a competitive advantage. Another objective is to determine how to partition a single undirected graph in order to optimize multiprocessor load balancing and reduce the number of edges between split subgraphs --Abstract, page iv
    • …
    corecore