923 research outputs found
Unsupervised feature selection for outlier detection by modelling hierarchical value-feature couplings
© 2016 IEEE. Proper feature selection for unsupervised outlier detection can improve detection performance but is very challenging due to complex feature interactions, the mixture of relevant features with noisy/redundant features in imbalanced data, and the unavailability of class labels. Little work has been done on this challenge. This paper proposes a novel Coupled Unsupervised Feature Selection framework (CUFS for short) to filter out noisy or redundant features for subsequent outlier detection in categorical data. CUFS quantifies the outlierness (or relevance) of features by learning and integrating both the feature value couplings and feature couplings. Such value-To-feature couplings capture intrinsic data characteristics and distinguish relevant features from those noisy/redundant features. CUFS is further instantiated into a parameter-free Dense Subgraph-based Feature Selection method, called DSFS. We prove that DSFS retains a 2-Approximation feature subset to the optimal subset. Extensive evaluation results on 15 real-world data sets show that DSFS obtains an average 48% feature reduction rate, and enables three different types of pattern-based outlier detection methods to achieve substantially better AUC improvements and/or perform orders of magnitude faster than on the original feature set. Compared to its feature selection contender, on average, all three DSFS-based detectors achieve more than 20% AUC improvement
Supervised maximum-likelihood weighting of composite protein networks for complex prediction
10.1186/1752-0509-6-S2-S13BMC Systems Biology6SUPPL.2
Machine Learning and Rule Mining Techniques in the Study of Gene Inactivation and RNA Interference
RNA interference (RNAi) and gene inactivation are extensively used biological terms in biomedical research. Two categories of small ribonucleic acid (RNA) molecules, viz., microRNA (miRNA) and small interfering RNA (siRNA) are central to the RNAi. There are various kinds of algorithms developed related to RNAi and gene silencing. In this book chapter, we provided a comprehensive review of various machine learning and association rule mining algorithms developed to handle different biological problems such as detection of gene signature, biomarker, gene module, potentially disordered protein, differentially methylated region and many more. We also provided a comparative study of different well-known classifiers along with other used methods. In addition, we demonstrated the brief biological information regarding the immense biological challenges for gene activation as well as their advantages, disadvantages and possible therapeutic strategies. Finally, our study helps the bioinformaticians to understand the overall immense idea in different research dimensions including several learning algorithms for the benevolent of the disease discovery
Homophily Outlier Detection in Non-IID Categorical Data
Most of existing outlier detection methods assume that the outlier factors
(i.e., outlierness scoring measures) of data entities (e.g., feature values and
data objects) are Independent and Identically Distributed (IID). This
assumption does not hold in real-world applications where the outlierness of
different entities is dependent on each other and/or taken from different
probability distributions (non-IID). This may lead to the failure of detecting
important outliers that are too subtle to be identified without considering the
non-IID nature. The issue is even intensified in more challenging contexts,
e.g., high-dimensional data with many noisy features. This work introduces a
novel outlier detection framework and its two instances to identify outliers in
categorical data by capturing non-IID outlier factors. Our approach first
defines and incorporates distribution-sensitive outlier factors and their
interdependence into a value-value graph-based representation. It then models
an outlierness propagation process in the value graph to learn the outlierness
of feature values. The learned value outlierness allows for either direct
outlier detection or outlying feature selection. The graph representation and
mining approach is employed here to well capture the rich non-IID
characteristics. Our empirical results on 15 real-world data sets with
different levels of data complexities show that (i) the proposed outlier
detection methods significantly outperform five state-of-the-art methods at the
95%/99% confidence level, achieving 10%-28% AUC improvement on the 10 most
complex data sets; and (ii) the proposed feature selection methods
significantly outperform three competing methods in enabling subsequent outlier
detection of two different existing detectors.Comment: To appear in Data Ming and Knowledge Discovery Journa
Deep Learning for Community Detection: Progress, Challenges and Opportunities
As communities represent similar opinions, similar functions, similar
purposes, etc., community detection is an important and extremely useful tool
in both scientific inquiry and data analytics. However, the classic methods of
community detection, such as spectral clustering and statistical inference, are
falling by the wayside as deep learning techniques demonstrate an increasing
capacity to handle high-dimensional graph data with impressive performance.
Thus, a survey of current progress in community detection through deep learning
is timely. Structured into three broad research streams in this domain - deep
neural networks, deep graph embedding, and graph neural networks, this article
summarizes the contributions of the various frameworks, models, and algorithms
in each stream along with the current challenges that remain unsolved and the
future research opportunities yet to be explored.Comment: Accepted Paper in the 29th International Joint Conference on
Artificial Intelligence (IJCAI 20), Survey Trac
Visual analytics for relationships in scientific data
Domain scientists hope to address grand scientific challenges by exploring the abundance of data generated and made available through modern high-throughput techniques. Typical scientific investigations can make use of novel visualization tools that enable dynamic formulation and fine-tuning of hypotheses to aid the process of evaluating sensitivity of key parameters. These general tools should be applicable to many disciplines: allowing biologists to develop an intuitive understanding of the structure of coexpression networks and discover genes that reside in critical positions of biological pathways, intelligence analysts to decompose social networks, and climate scientists to model extrapolate future climate conditions. By using a graph as a universal data representation of correlation, our novel visualization tool employs several techniques that when used in an integrated manner provide innovative analytical capabilities. Our tool integrates techniques such as graph layout, qualitative subgraph extraction through a novel 2D user interface, quantitative subgraph extraction using graph-theoretic algorithms or by querying an optimized B-tree, dynamic level-of-detail graph abstraction, and template-based fuzzy classification using neural networks. We demonstrate our system using real-world workflows from several large-scale studies. Parallel coordinates has proven to be a scalable visualization and navigation framework for multivariate data. However, when data with thousands of variables are at hand, we do not have a comprehensive solution to select the right set of variables and order them to uncover important or potentially insightful patterns. We present algorithms to rank axes based upon the importance of bivariate relationships among the variables and showcase the efficacy of the proposed system by demonstrating autonomous detection of patterns in a modern large-scale dataset of time-varying climate simulation
- …