9 research outputs found

    Multi-graph-view subgraph mining for graph classification

    Full text link
    © 2015, Springer-Verlag London. In this paper, we formulate a new multi-graph-view learning task, where each object to be classified contains graphs from multiple graph-views. This problem setting is essentially different from traditional single-graph-view graph classification, where graphs are collected from one single-feature view. To solve the problem, we propose a cross graph-view subgraph feature-based learning algorithm that explores an optimal set of subgraphs, across multiple graph-views, as features to represent graphs. Specifically, we derive an evaluation criterion to estimate the discriminative power and redundancy of subgraph features across all views, with a branch-and-bound algorithm being proposed to prune subgraph search space. Because graph-views may complement each other and play different roles in a learning task, we assign each view with a weight value indicating its importance to the learning task and further use an optimization process to find optimal weight values for each graph-view. The iteration between cross graph-view subgraph scoring and graph-view weight updating forms a closed loop to find optimal subgraphs to represent graphs for multi-graph-view learning. Experiments and comparisons on real-world tasks demonstrate the algorithm’s superior performance

    GLSVM: Integrating Structured Feature Selection and Large Margin Classification

    Full text link
    Abstract—High dimensional data challenges current feature selection methods. For many real world problems we often have prior knowledge about the relationship of features. For example in microarray data analysis, genes from the same biological pathways are expected to have similar relationship to the outcome that we target to predict. Recent regularization methods on Support Vector Machine (SVM) have achieved great success to perform feature selection and model selection simultaneously for high dimensional data, but neglect such re-lationship among features. To build interpretable SVM models, the structure information of features should be incorporated. In this paper, we propose an algorithm GLSVM that automatically perform model selection and feature selection in SVMs. To incorporate the prior knowledge of feature relationship, we extend standard 2 norm SVM and use a penalty function that employs a L2 norm regularization term including the normalized Laplacian of the graph and L1 penalty. We have demonstrated the effectiveness of our methods and compare them to the state-of-the-art using two real-world benchmarks. I

    Learning from Structured Data with High Dimensional Structured Input and Output Domain

    Get PDF
    Structured data is accumulated rapidly in many applications, e.g. Bioinformatics, Cheminformatics, social network analysis, natural language processing and text mining. Designing and analyzing algorithms for handling these large collections of structured data has received significant interests in data mining and machine learning communities, both in the input and output domain. However, it is nontrivial to adopt traditional machine learning algorithms, e.g. SVM, linear regression to structured data. For one thing, the structural information in the input domain and output domain is ignored if applying the normal algorithms to structured data. For another, the major challenge in learning from many high-dimensional structured data is that input/output domain can contain tens of thousands even larger number of features and labels. With the high dimensional structured input space and/or structured output space, learning a low dimensional and consistent structured predictive function is important for both robustness and interpretability of the model. In this dissertation, we will present a few machine learning models that learn from the data with structured input features and structured output tasks. For learning from the data with structured input features, I have developed structured sparse boosting for graph classification, structured joint sparse PCA for anomaly detection and localization. Besides learning from structured input, I also investigated the interplay between structured input and output under the context of multi-task learning. In particular, I designed a multi-task learning algorithms that performs structured feature selection & task relationship Inference. We will demonstrate the applications of these structured models on subgraph based graph classification, networked data stream anomaly detection/localization, multiple cancer type prediction, neuron activity prediction and social behavior prediction. Finally, through my intern work at IBM T.J. Watson Research, I will demonstrate how to leverage structural information from mobile data (e.g. call detail record and GPS data) to derive important places from people's daily life for transit optimization and urban planning

    Discriminative Subgraph Pattern Mining and Its Applications

    Get PDF
    My dissertation concentrates on two problems in mining discriminative subgraphs: how to efficiently identify subgraph patterns that discriminate two sets of graphs and how to improve discrimination power of subgraph patterns by allowing flexibility. To achieve high efficiency, I adapted evolutionary computation to subgraph mining and proposed to learn how to prune search space from search history. To allow flexibility, I proposed to loosely assemble small rigid graphs for structural flexibility and I proposed a label relaxation technique for label flexibility. I evaluated how applications of discriminative subgraphs can benefit from more efficient and effective mining algorithms. Experimental results showed that the proposed algorithms outperform other algorithms in terms of speed. In addition, using discriminative subgraph patterns found by the proposed algorithms leads to competitive or higher classification accuracy than other methods. Allowing structural flexibility enables users to identify subgraph patterns with even higher discrimination power.Doctor of Philosoph

    Graph Pattern Mining Techniques to Identify Potential Model Organisms

    Get PDF
    Recent advances in high throughput technologies have led to an increasing amount of rich and diverse biological data and related literature. Model organisms are classically selected as subjects for studying human disease based on their genotypic and phenotypic features. A significant problem with model organism identification is the determination of characteristic features related to biological processes that can provide insights into the mechanisms underlying diseases. These insights could have a positive impact on the diagnosis and management of diseases and the development of therapeutic drugs. The increased availability of biological data presents an opportunity to develop data mining methods that can address these challenges and help scientists formulate and test data-driven hypotheses. In this dissertation, data mining methods were developed to provide a quantitative approach for the identification of potential model organisms based on underlying features that may be correlated with disease manifestation in humans. The work encompassed three major types of contributions that aimed to address challenges related to inferring information from biological data available from a range of sources. First, new statistical models and algorithms for graph pattern mining were developed and tested on diverse genres of data (biological networks, drug chemical compounds, and text documents). Second, data mining techniques were developed and shown to identify characteristic disease patterns (disease fingerprints), predict potentially new genetic pathways, and facilitate the assessment of organisms as potential disease models. Third, a methodology was developed that combined the application of graph-based models with information derived from natural language processing methods to identify statistically significant patterns in biomedical text. Together, the approaches developed for this dissertation show promise for summarizing the information about biological processes and phenomena associated with organisms broadly and for the potential assessment of their suitability to study human diseases

    Structure Feature Selection For Graph Classification

    No full text
    With the development of highly efficient graph data collection technology in many application fields, classification of graph data emerges as an important topic in the data mining and machine learning community. Towards building highly accurate classification models for graph data, here we present an efficient graph feature selection method. In our method, we use frequent subgraphs as features for graph classification. Different from existing methods, we consider the spatial distribution of the subgraph features in the graph data and select those ones that have consistent spatial location. We have applied our feature selection methods to several cheminformatics benchmarks. Our method demonstrates a significant improvement of prediction as compared to the state-of-the-art methods
    corecore