63,145 research outputs found
Localized Sparse Incomplete Multi-view Clustering
Incomplete multi-view clustering, which aims to solve the clustering problem
on the incomplete multi-view data with partial view missing, has received more
and more attention in recent years. Although numerous methods have been
developed, most of the methods either cannot flexibly handle the incomplete
multi-view data with arbitrary missing views or do not consider the negative
factor of information imbalance among views. Moreover, some methods do not
fully explore the local structure of all incomplete views. To tackle these
problems, this paper proposes a simple but effective method, named localized
sparse incomplete multi-view clustering (LSIMVC). Different from the existing
methods, LSIMVC intends to learn a sparse and structured consensus latent
representation from the incomplete multi-view data by optimizing a sparse
regularized and novel graph embedded multi-view matrix factorization model.
Specifically, in this model based on matrix factorization, an l1-norm-based
sparse constraint is introduced to obtain the sparse low-dimensional
individual representations and the sparse consensus representation. Moreover, a
novel local graph embedding term is introduced to learn the structured
consensus representation. Different from the existing works, our local graph
embedding term aggregates the graph embedding task and consensus representation
learning task into a concise term. Furthermore, to reduce the imbalance factor
of incomplete multi-view learning, an adaptive weighted learning scheme is
introduced to LSIMVC. Finally, an efficient optimization strategy is given to
solve the optimization problem of our proposed model. Comprehensive
experimental results performed on six incomplete multi-view databases verify
that the performance of our LSIMVC is superior to the state-of-the-art IMC
approaches. The code is available at https://github.com/justsmart/LSIMVC.
Comment: Published in IEEE Transactions on Multimedia (TMM).
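As an illustration of the l1-norm sparsity idea mentioned in the abstract above, the sketch below shows entrywise soft-thresholding, the proximal operator of the l1 norm and a common building block for sparse matrix factorization. It is not taken from the LSIMVC code; the function names, the matrix `P`, and the threshold `lam` are assumptions chosen for the example.

```python
def soft_threshold(x, lam):
    """Proximal operator of lam * |x|: shrink x toward zero by lam."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def soft_threshold_matrix(rows, lam):
    """Apply soft-thresholding entrywise to a matrix stored as a list of rows."""
    return [[soft_threshold(v, lam) for v in row] for row in rows]


# Hypothetical latent representation; small entries are set exactly to zero.
P = [[0.9, -0.05, 0.2], [-0.6, 0.1, 0.0]]
P_sparse = soft_threshold_matrix(P, lam=0.15)
```

Entries whose magnitude is at most `lam` become exactly zero, which is what makes the resulting representation sparse; larger entries are shrunk by `lam`.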
Machine Learning and Integrative Analysis of Biomedical Big Data.
Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues
Multi-view constrained clustering with an incomplete mapping between views
Multi-view learning algorithms typically assume a complete bipartite mapping
between the different views in order to exchange information during the
learning process. However, many applications provide only a partial mapping
between the views, creating a challenge for current methods. To address this
problem, we propose a multi-view algorithm based on constrained clustering that
can operate with an incomplete mapping. Given a set of pairwise constraints in
each view, our approach propagates these constraints using a local similarity
measure to those instances that can be mapped to the other views, allowing the
propagated constraints to be transferred across views via the partial mapping.
It uses co-EM to iteratively estimate the propagation within each view based on
the current clustering model, transfer the constraints across views, and then
update the clustering model. By alternating the learning process between views,
this approach produces a unified clustering model that is consistent with all
views. We show that this approach significantly improves clustering performance
over several other methods for transferring constraints and allows multi-view
clustering to be reliably applied when given a limited mapping between the
views. Our evaluation reveals that the propagated constraints have high
precision with respect to the true clusters in the data, explaining their
benefit to clustering performance in both single- and multi-view learning
scenarios.
Attributed Network Embedding for Learning in a Dynamic Environment
Network embedding leverages the node proximity manifested to learn a
low-dimensional node vector representation for each node in the network. The
learned embeddings could advance various learning tasks such as node
classification, network clustering, and link prediction. Most, if not all, of
the existing works, are overwhelmingly performed in the context of plain and
static networks. Nonetheless, in reality, network structure often evolves over
time with addition/deletion of links and nodes. Also, a vast majority of
real-world networks are associated with a rich set of node attributes, and
their attribute values are also naturally changing, with the emerging of new
content patterns and the fading of old content patterns. These changing
characteristics motivate us to seek an effective embedding representation to
capture network and attribute evolving patterns, which is of fundamental
importance for learning in a dynamic environment. To the best of our knowledge, we are
the first to tackle this problem, which poses the following two challenges: (1) the
inherently correlated network and node attributes could be noisy and
incomplete, which necessitates a robust consensus representation to capture their
individual properties and correlations; (2) the embedding learning needs to be
performed in an online fashion to adapt to the changes accordingly. In this
paper, we tackle this problem by proposing a novel dynamic attributed network
embedding framework - DANE. In particular, DANE first provides an offline
method for a consensus embedding and then leverages matrix perturbation theory
to maintain the freshness of the end embedding results in an online manner. We
perform extensive experiments on both synthetic and real attributed networks to
corroborate the effectiveness and efficiency of the proposed framework.
Comment: 10 page
An efficient k-means-type algorithm for clustering datasets with incomplete records
The k-means algorithm is arguably the most popular nonparametric clustering
method but cannot generally be applied to datasets with incomplete records. The
usual practice then is to either impute missing values under an assumed
missing-completely-at-random mechanism or to ignore the incomplete records, and
apply the algorithm on the resulting dataset. We develop an efficient version
of the k-means algorithm that allows for clustering in the presence of
incomplete records. Our extension is called k_m-means and reduces to the
k-means algorithm when all records are complete. We also provide
initialization strategies for our algorithm and methods to estimate the number
of groups in the dataset. Illustrations and simulations demonstrate the
efficacy of our approach in a variety of settings and patterns of missing data.
Our methods are also applied to the analysis of activation images obtained from
a functional Magnetic Resonance Imaging experiment.
Comment: 21 pages, 12 figures, 3 tables, in press, Statistical Analysis and
Data Mining -- The ASA Data Science Journal, 201
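To make the idea of clustering without imputation concrete, the following is a minimal sketch of a k-means-style loop that uses only the observed features of each record. It is an assumption-laden illustration, not the authors' algorithm: missing values are encoded as `None`, distances are summed over observed coordinates only, and all function names and the toy data are hypothetical.

```python
def partial_sq_dist(record, center):
    """Squared distance over observed (non-None) features only."""
    return sum((x - c) ** 2 for x, c in zip(record, center) if x is not None)


def assign(records, centers):
    """Assign each record to the nearest center under the partial distance."""
    return [min(range(len(centers)), key=lambda k: partial_sq_dist(r, centers[k]))
            for r in records]


def update(records, labels, centers):
    """Recompute each center coordinate as the mean of the observed values in its cluster."""
    new_centers = []
    for k, old in enumerate(centers):
        center = []
        for j in range(len(old)):
            vals = [r[j] for r, l in zip(records, labels) if l == k and r[j] is not None]
            # Keep the old coordinate if no record in the cluster observes feature j.
            center.append(sum(vals) / len(vals) if vals else old[j])
        new_centers.append(center)
    return new_centers


def kmeans_incomplete(records, centers, n_iter=20):
    """Alternate assignment and update steps until the labels stop changing."""
    labels = assign(records, centers)
    for _ in range(n_iter):
        centers = update(records, labels, centers)
        new_labels = assign(records, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Toy data with missing entries (None) and hypothetical starting centers.
data = [[0.0, 0.1], [0.2, None], [None, 0.0], [5.0, 5.1], [None, 4.9], [5.2, None]]
labels, centers = kmeans_incomplete(data, centers=[[0.0, 0.0], [5.0, 5.0]])
```

When every record is complete, the partial distance equals the usual squared Euclidean distance, so the loop reduces to ordinary k-means, which mirrors the reduction property stated in the abstract.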
Towards an online mitigation strategy for N2O emissions through principal components analysis and clustering techniques
Emission of N2O represents an increasing concern in wastewater treatment, in particular for its large contribution to the plant's carbon footprint (CFP). In view of the potential introduction of more stringent regulations regarding wastewater treatment plants' CFP, there is a growing need for advanced monitoring with online implementation of mitigation strategies for N2O emissions. Mechanistic kinetic models in full-scale applications are often characterized by a very detailed representation of the biological mechanisms, which results in elevated uncertainty in the many parameters used, while being limited by a poor representation of hydrodynamics. This is particularly true for current N2O kinetic models. In this paper, a possible full-scale implementation of a data mining approach linking plant-specific dynamics to N2O production is proposed. The approach was tested on full-scale data along with different clustering techniques to identify process criticalities. The algorithm was designed to provide an applicable solution for full-scale plants' control logics aimed at online N2O emission mitigation. Results show the ability of the algorithm to isolate specific N2O emission pathways and highlight possible solutions towards emission control.
Online Unsupervised Multi-view Feature Selection
In the era of big data, it is becoming common to have data with multiple
modalities or coming from multiple sources, known as "multi-view data".
Because multi-view data are usually unlabeled and come from high-dimensional spaces
(such as language vocabularies), unsupervised multi-view feature selection is
crucial to many applications. However, it is nontrivial due to the following
challenges. First, there may be too many instances or the feature dimensionality
may be too large, so the data may not fit in memory. How can we select useful
features with limited memory space? Second, how can we select features from
streaming data and handle concept drift? Third, how can we leverage the
consistent and complementary information from different views to improve the
feature selection in the situation when the data are too big or come in as
streams? To the best of our knowledge, none of the previous works can solve all
the challenges simultaneously. In this paper, we propose an Online unsupervised
Multi-View Feature Selection, OMVFS, which deals with large-scale/streaming
multi-view data in an online fashion. OMVFS embeds unsupervised feature
selection into a clustering algorithm via NMF with sparse learning. It further
incorporates the graph regularization to preserve the local structure
information and help select discriminative features. Instead of storing all the
historical data, OMVFS processes the multi-view data chunk by chunk and
aggregates all the necessary information into several small matrices. By using
the buffering technique, the proposed OMVFS can reduce the computational and
storage cost while taking advantage of the structure information. Furthermore,
OMVFS can capture the concept drifts in the data streams. Extensive experiments
on four real-world datasets show the effectiveness and efficiency of the
proposed OMVFS method. More importantly, OMVFS is about 100 times faster than
the off-line methods.
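The chunk-by-chunk aggregation described in the abstract above can be illustrated with a small sketch. Assuming a factorization of the form X ≈ V U^T, an update of U needs the data only through the products X^T V and V^T V, and both can be accumulated one chunk at a time. This is a generic illustration of that bookkeeping, not the OMVFS implementation; the helper names, the matrices `X` and `V`, and the chunk split are all hypothetical.

```python
def matmul_t(a, b):
    """Return a^T b for matrices stored as lists of rows (same number of rows)."""
    cols_a, cols_b = len(a[0]), len(b[0])
    out = [[0.0] * cols_b for _ in range(cols_a)]
    for row_a, row_b in zip(a, b):
        for i, x in enumerate(row_a):
            for j, y in enumerate(row_b):
                out[i][j] += x * y
    return out


def add_inplace(acc, m):
    """Add matrix m to the accumulator acc, entry by entry."""
    for i, row in enumerate(m):
        for j, v in enumerate(row):
            acc[i][j] += v


def stream_statistics(chunks, d, k):
    """Accumulate A = sum X_c^T V_c (d x k) and B = sum V_c^T V_c (k x k)."""
    A = [[0.0] * k for _ in range(d)]
    B = [[0.0] * k for _ in range(k)]
    for X_c, V_c in chunks:
        add_inplace(A, matmul_t(X_c, V_c))
        add_inplace(B, matmul_t(V_c, V_c))
    return A, B


# Hypothetical data (4 instances, 3 features) and cluster representation (k = 2).
X = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0], [3.0, 1.0, 0.0], [2.0, 2.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
chunks = [(X[:2], V[:2]), (X[2:], V[2:])]
A, B = stream_statistics(chunks, d=3, k=2)
```

The accumulators have sizes d x k and k x k regardless of how many instances have been seen, so memory use does not grow with the length of the stream. In this toy example the streamed `A` and `B` match the values computed from the full `X` and `V` at once.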
Structure fusion based on graph convolutional networks for semi-supervised classification
Faced with the diversity and complexity of multi-view data in
semi-supervised classification, most existing graph convolutional networks
focus on network architecture construction or salient graph structure
preservation, and ignore the contribution of the complete graph structure to
semi-supervised classification. To mine the more complete distribution structure
from multi-view data with the consideration of the specificity and the
commonality, we propose structure fusion based on graph convolutional networks
(SF-GCN) for improving the performance of semi-supervised classification.
SF-GCN can not only retain the special characteristic of each view data by
spectral embedding, but also capture the common style of multi-view data by
distance metric between multi-graph structures. Assuming a linear relationship
between multi-graph structures, we can construct the optimization function of
the structure fusion model by balancing the specificity loss and the commonality
loss. By solving this function, we can simultaneously obtain the fusion
spectral embedding from the multi-view data and the fusion structure, which serves as
the adjacency matrix input to graph convolutional networks for semi-supervised
classification. Experiments demonstrate that SF-GCN
outperforms state-of-the-art methods on three challenging citation-network
datasets: Cora, Citeseer and Pubmed.
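For readers unfamiliar with how an adjacency matrix is typically prepared before it enters a graph convolutional network, the sketch below applies the symmetric normalization D^{-1/2}(A + I)D^{-1/2} that is standard in the GCN literature. Whether SF-GCN preprocesses its fused structure exactly this way is not stated in the abstract, so treat this as an assumption; the function name and the toy graph are hypothetical.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a symmetric, nonnegative adjacency matrix."""
    n = len(adj)
    # Add self-loops so each node keeps its own features during propagation.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


# Toy path graph 0 - 1 - 2 standing in for a fused graph structure.
A = [[0.0, 1.0, 0.0],
     [1.0, 0.0, 1.0],
     [0.0, 1.0, 0.0]]
A_hat = normalize_adjacency(A)
```

The normalized matrix stays symmetric, and each entry is scaled by the degrees of the two nodes it connects, which keeps repeated propagation steps from inflating the features of high-degree nodes.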