923 research outputs found

    Unsupervised feature selection for outlier detection by modelling hierarchical value-feature couplings

    © 2016 IEEE. Proper feature selection for unsupervised outlier detection can improve detection performance but is very challenging due to complex feature interactions, the mixture of relevant features with noisy/redundant features in imbalanced data, and the unavailability of class labels. Little work has been done on this challenge. This paper proposes a novel Coupled Unsupervised Feature Selection framework (CUFS for short) to filter out noisy or redundant features for subsequent outlier detection in categorical data. CUFS quantifies the outlierness (or relevance) of features by learning and integrating both the feature value couplings and feature couplings. Such value-to-feature couplings capture intrinsic data characteristics and distinguish relevant features from noisy/redundant features. CUFS is further instantiated into a parameter-free Dense Subgraph-based Feature Selection method, called DSFS. We prove that DSFS retains a 2-approximation feature subset to the optimal subset. Extensive evaluation results on 15 real-world data sets show that DSFS obtains an average 48% feature reduction rate, and enables three different types of pattern-based outlier detection methods to achieve substantially better AUC and/or perform orders of magnitude faster than on the original feature set. Compared to its feature selection contender, on average, all three DSFS-based detectors achieve more than 20% AUC improvement.
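The abstract above mentions dense-subgraph-based selection with a 2-approximation guarantee but does not spell out the procedure. As a rough illustration only, the sketch below shows the generic greedy "peeling" heuristic for the densest-subgraph problem, which is known to give a 2-approximation when density is defined as total edge weight divided by node count. It is not claimed to be the authors' DSFS; the function name `greedy_densest_subgraph`, the dict-of-dicts graph format, and the toy feature graph are all assumptions made for this example.

```python
def greedy_densest_subgraph(weights):
    """Greedy peeling for the densest subgraph (illustrative sketch).

    weights: symmetric dict-of-dicts {u: {v: w}} with no self-loops.
    Density of a node set S = (sum of edge weights inside S) / |S|.
    Returns (best_node_set, best_density).
    """
    nodes = set(weights)
    # Weighted degree of each node within the current node set.
    degree = {u: sum(weights[u].values()) for u in nodes}
    total = sum(degree.values()) / 2.0  # each edge is counted twice
    best_nodes, best_density = set(nodes), total / len(nodes)

    while len(nodes) > 1:
        # Peel the node with the smallest weighted degree (ties by name).
        u = min(nodes, key=lambda n: (degree[n], n))
        nodes.remove(u)
        total -= degree[u]
        for v, w in weights[u].items():
            if v in nodes:
                degree[v] -= w
        density = total / len(nodes)
        if density > best_density:
            best_nodes, best_density = set(nodes), density
    return best_nodes, best_density


# Hypothetical feature graph: four strongly coupled features plus one
# weakly attached feature "e" (all edge weights are made up).
clique = ["a", "b", "c", "d"]
graph = {u: {v: 1 for v in clique if v != u} for u in clique}
graph["a"]["e"] = 1
graph["e"] = {"a": 1}

selected, density = greedy_densest_subgraph(graph)
print(sorted(selected), density)
```

In this toy graph the weakly attached feature is peeled first, and the remaining four-node clique is returned as the densest subset.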

    Machine Learning and Rule Mining Techniques in the Study of Gene Inactivation and RNA Interference

    RNA interference (RNAi) and gene inactivation are extensively used biological terms in biomedical research. Two categories of small ribonucleic acid (RNA) molecules, viz., microRNA (miRNA) and small interfering RNA (siRNA), are central to RNAi. Various algorithms have been developed for RNAi and gene silencing. In this book chapter, we provide a comprehensive review of machine learning and association rule mining algorithms developed to handle different biological problems such as detection of gene signatures, biomarkers, gene modules, potentially disordered proteins, differentially methylated regions and many more. We also provide a comparative study of well-known classifiers alongside other commonly used methods. In addition, we briefly present biological background on the major challenges of gene activation, together with their advantages, disadvantages and possible therapeutic strategies. Finally, our study helps bioinformaticians understand the overall picture across different research dimensions, including several learning algorithms, for the benefit of disease discovery.

    Homophily Outlier Detection in Non-IID Categorical Data

    Most existing outlier detection methods assume that the outlier factors (i.e., outlierness scoring measures) of data entities (e.g., feature values and data objects) are Independent and Identically Distributed (IID). This assumption does not hold in real-world applications where the outlierness of different entities is dependent on each other and/or taken from different probability distributions (non-IID). This may lead to the failure of detecting important outliers that are too subtle to be identified without considering the non-IID nature. The issue is intensified further in more challenging contexts, e.g., high-dimensional data with many noisy features. This work introduces a novel outlier detection framework and its two instances to identify outliers in categorical data by capturing non-IID outlier factors. Our approach first defines and incorporates distribution-sensitive outlier factors and their interdependence into a value-value graph-based representation. It then models an outlierness propagation process in the value graph to learn the outlierness of feature values. The learned value outlierness allows for either direct outlier detection or outlying feature selection. The graph representation and mining approach is employed here to capture the rich non-IID characteristics. Our empirical results on 15 real-world data sets with different levels of data complexity show that (i) the proposed outlier detection methods significantly outperform five state-of-the-art methods at the 95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most complex data sets; and (ii) the proposed feature selection methods significantly outperform three competing methods in enabling subsequent outlier detection of two different existing detectors. Comment: To appear in Data Mining and Knowledge Discovery Journa

    Deep Learning for Community Detection: Progress, Challenges and Opportunities

    As communities represent similar opinions, similar functions, similar purposes, etc., community detection is an important and extremely useful tool in both scientific inquiry and data analytics. However, the classic methods of community detection, such as spectral clustering and statistical inference, are falling by the wayside as deep learning techniques demonstrate an increasing capacity to handle high-dimensional graph data with impressive performance. Thus, a survey of current progress in community detection through deep learning is timely. Structured into three broad research streams in this domain (deep neural networks, deep graph embedding, and graph neural networks), this article summarizes the contributions of the various frameworks, models, and algorithms in each stream along with the current challenges that remain unsolved and the future research opportunities yet to be explored. Comment: Accepted Paper in the 29th International Joint Conference on Artificial Intelligence (IJCAI 20), Survey Trac

    Visual analytics for relationships in scientific data

    Domain scientists hope to address grand scientific challenges by exploring the abundance of data generated and made available through modern high-throughput techniques. Typical scientific investigations can make use of novel visualization tools that enable dynamic formulation and fine-tuning of hypotheses to aid the process of evaluating sensitivity of key parameters. These general tools should be applicable to many disciplines: allowing biologists to develop an intuitive understanding of the structure of coexpression networks and discover genes that reside in critical positions of biological pathways, intelligence analysts to decompose social networks, and climate scientists to model and extrapolate future climate conditions. By using a graph as a universal data representation of correlation, our novel visualization tool employs several techniques that, when used in an integrated manner, provide innovative analytical capabilities. Our tool integrates techniques such as graph layout, qualitative subgraph extraction through a novel 2D user interface, quantitative subgraph extraction using graph-theoretic algorithms or by querying an optimized B-tree, dynamic level-of-detail graph abstraction, and template-based fuzzy classification using neural networks. We demonstrate our system using real-world workflows from several large-scale studies. Parallel coordinates has proven to be a scalable visualization and navigation framework for multivariate data. However, when data with thousands of variables are at hand, we do not have a comprehensive solution to select the right set of variables and order them to uncover important or potentially insightful patterns. We present algorithms to rank axes based upon the importance of bivariate relationships among the variables and showcase the efficacy of the proposed system by demonstrating autonomous detection of patterns in a modern large-scale dataset of time-varying climate simulation.
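The abstract above describes ranking parallel-coordinates axes by the importance of bivariate relationships without stating the measure used. One simple way to picture the idea, assuming absolute Pearson correlation as the importance measure (an assumption for this sketch, not the authors' stated method), is to score each variable by the sum of its absolute correlations with every other variable. The names `pearson` and `rank_axes` and the toy columns are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def rank_axes(columns):
    """Order variable names by total absolute correlation with the others.

    columns: dict mapping variable name -> list of numeric values.
    Returns names sorted from most to least related (ties broken by name).
    """
    names = list(columns)
    score = {
        a: sum(abs(pearson(columns[a], columns[b])) for b in names if b != a)
        for a in names
    }
    return sorted(names, key=lambda name: (-score[name], name))


# Made-up example: "t" and "p" move together, "noise" is unrelated to both.
columns = {
    "t": [1, 2, 3, 4],
    "p": [2, 4, 6, 8],
    "noise": [1, -1, -1, 1],
}
order = rank_axes(columns)
print(order)
```

Variables that participate in strong pairwise relationships end up first in the ordering, so the unrelated column lands last in this example.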