76 research outputs found

    Significant Subgraph Mining with Multiple Testing Correction

    The problem of finding itemsets that are statistically significantly enriched in a class of transactions is complicated by the need to correct for multiple hypothesis testing. Pruning untestable hypotheses was recently proposed as a strategy for this task of significant itemset mining. On real-world datasets, it was shown to yield greater statistical power, that is, the discovery of more truly significant itemsets, than the standard Bonferroni correction. An open question, however, is whether this strategy of excluding untestable hypotheses also leads to greater statistical power in subgraph mining, in which the number of hypotheses is much larger than in itemset mining. Here we answer this question by an empirical investigation on eight popular graph benchmark datasets. We propose a new efficient search strategy, which always returns the same solution as the state-of-the-art approach and is approximately two orders of magnitude faster. Moreover, we exploit the dependence between subgraphs by considering the effective number of tests and thereby further increase the statistical power. Comment: 18 pages, 5 figures, accepted to the 2015 SIAM International Conference on Data Mining (SDM15)
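The idea of pruning untestable hypotheses can be illustrated with a minimal sketch. It is not the search strategy proposed in the paper. It assumes a one-sided Fisher's exact test and the usual Tarone-style correction, neither of which the abstract names; the function names `min_pvalue` and `tarone_threshold` and the toy numbers are hypothetical. A pattern that occurs in too few (or too many) samples cannot reach a small p-value whatever its class distribution, so it need not count toward the correction factor.

```python
from math import comb


def min_pvalue(support, n_pos, n_total):
    """Smallest one-sided Fisher p-value a pattern with this support can attain.

    The most extreme table puts as many occurrences as possible in the
    positive class (assumption: one-sided test for enrichment).
    """
    if support <= n_pos:
        return comb(n_pos, support) / comb(n_total, support)
    # More occurrences than positives: every positive sample contains the pattern.
    return comb(n_total - n_pos, support - n_pos) / comb(n_total, support)


def tarone_threshold(supports, n_pos, n_total, alpha=0.05):
    """Find the smallest k such that at most k hypotheses are testable at alpha / k."""
    min_ps = [min_pvalue(s, n_pos, n_total) for s in supports]
    k = 1
    while sum(p <= alpha / k for p in min_ps) > k:
        k += 1
    return alpha / k, k


# Toy example: five candidate patterns, 10 positive samples out of 20.
supports = [1, 1, 1, 10, 10]
threshold, k = tarone_threshold(supports, n_pos=10, n_total=20)
bonferroni = 0.05 / len(supports)
```

In this toy setting the three patterns with support 1 have a minimum attainable p-value of 0.5 and are untestable, so the correction factor is 2 rather than 5 and the per-test threshold is less strict than the Bonferroni one.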

    Task Sensitive Feature Exploration and Learning for Multitask Graph Classification

    © 2016 IEEE. Multitask learning (MTL) is commonly used for jointly optimizing multiple learning tasks. To date, all existing MTL methods have been designed for tasks whose instances are represented as feature vectors, and cannot be applied to structured data such as graphs. More importantly, when carrying out MTL, existing methods mainly focus on exploring overall commonality or disparity between tasks, but cannot explicitly capture task relationships in the feature space. They are therefore unable to answer important questions such as what exactly is shared between tasks and what makes one task unique. In this paper, we formulate a new multitask graph learning problem and propose a task sensitive feature exploration and learning algorithm for multitask graph classification. Because graphs do not have features available, we advocate a task sensitive feature exploration and learning paradigm to jointly discover discriminative subgraph features across different tasks. In addition, a feature learning process is carried out to categorize each subgraph feature into one of three categories: 1) common feature; 2) task auxiliary feature; and 3) task specific feature, indicating whether the feature is shared by all tasks, by a subset of tasks, or by only one specific task, respectively. The feature learning and the multiple task learning are iteratively optimized to form a multitask graph classification model with a global optimization goal. Experiments on real-world functional brain analysis and chemical compound categorization demonstrate the algorithm's performance. Results confirm that our method can explicitly capture task correlations and uniqueness in the feature space, and explicitly answer what is shared between tasks and what is unique to a specific task.

    Mining and analysis of real-world graphs

    Networked systems are everywhere: the Internet, social networks, biological networks, transportation networks, power grid networks, etc. They can be very large and enormously complex. They can contain a lot of information, either open and transparent or hidden and encoded. Such real-world systems can be modeled using graphs and be mined and analyzed through the lens of network analysis. Network analysis can be applied to recognize frequent patterns among the connected components of a large graph, such as a social network, where visual analysis is almost impossible. Frequent patterns illuminate statistically important subgraphs that are usually small enough to analyze visually. Graph mining has practical applications in fraud detection, outlier detection, chemical molecules, etc., depending on the information that needs to be extracted and understood. Network analysis can also be used to quantitatively evaluate and improve the resilience of infrastructure networks such as the Internet or power grids. Infrastructure networks directly affect the quality of people's lives. However, a disastrous incident in these networks may lead to a cascading breakdown of the whole network and serious economic consequences. In essence, network analysis can help us gain actionable insights and make better data-driven decisions based on the networks. On that note, the objective of this dissertation is to improve upon existing tools for more accurate mining and analysis of real-world networks --Abstract, page iv

    Ontology change management and identification of change patterns

    Ontologies can support a variety of purposes, ranging from capturing the conceptual knowledge to the organisation of digital content and information. However, information systems are always subject to change and ontology change management can pose challenges. In this sense, the application and representation of ontology changes in terms of higher-level change operations can describe more meaningful semantics behind the applied change. In this paper, we propose a four-phase process that covers the operationalization, representation and detection of higher-level changes in ontology evolution life cycle. We present different levels of change operators based on the granularity and domain-specificity of changes. The first layer is based on generic atomic level change operators, whereas the next two layers are user-defined (generic/domain-specific) change patterns. We introduce layered change logs for the explicit operational representation of ontology changes. We formalised the change log using a graph-based approach. We introduce a technique to identify composite changes that not only assists in formulating ontology change log data in a more concise manner, but also helps in realizing the semantics and intent behind any applied change. Furthermore, we identify frequent change sequences that are applied as a reference in order to discover reusable, often domain-specific and usage-driven change patterns. We describe the pattern identification algorithms and evaluate their performance.

    Graph based pattern discovery in protein structures

    The rapidly growing body of 3D protein structure data provides new opportunities to study the relation between protein structure and protein function. Local structure patterns of proteins have been the focus of recent efforts to link structural features found in proteins to protein function. In addition, structure patterns have demonstrated value in applications such as predicting protein-protein interaction, engineering proteins, and designing novel medicines. My thesis introduces graph-based representations of protein structure and new subgraph mining algorithms to identify recurring structure patterns common to a set of proteins. These techniques enable families of proteins exhibiting similar function to be analyzed for structural similarity. Previous approaches to protein local structure pattern discovery operate in a pairwise fashion and have prohibitive computational cost when scaled to families of proteins. The graph mining strategy is robust in the face of errors in the structure and errors in the set of proteins thought to share a function. Two collaborations with domain experts at the UNC School of Pharmacy and the UNC Medical School demonstrate the utility of these techniques. The first is to predict the function of several newly characterized protein structures. The second is to identify conserved structural features in evolutionarily related proteins.

    Predicting drug effectiveness in Cancer Cell Lines using Machine Learning and Graph Mining

    Cancer is a heterogeneous disease, with a considerable level of diversity between tumours. Biomarkers, in the context of an oncological disease, allow the identification of a patient's capacity to respond to a given drug. These specific treatments have produced results that are on average superior to those of broader use. However, the link between the response to treatment and the value of a given biomarker is in many cases still unknown. The goal of this project is, based on previous results and on the characterization of both the drugs and the cell tissues, to predict the efficacy of a drug on a tumour. Cancer is a heterogeneous disease, with a high degree of diversity between tumours. Biomarkers, in the context of an oncological disease, allow the identification of a patient's response to a given drug. These specific treatments have been producing results that are superior on average to broader ones. However, the relationship between the response to a drug and a biomarker's value is in many cases still unknown. Some models to predict this relationship have already been built using machine learning methods. The inputs are characterizations of both the drug and the tissue, along with the result of the drug's use on a given tissue. The goal of this thesis is to improve on previous models and on the characterization of both the drug and the tissue through the introduction of graph mining and other machine learning methods.

    Near-optimal supervised feature selection among frequent subgraphs
