Adapting to Change: Robust Counterfactual Explanations in Dynamic Data Landscapes
We introduce a novel semi-supervised Graph Counterfactual Explainer (GCE)
methodology, Dynamic GRAph Counterfactual Explainer (DyGRACE). It leverages
initial knowledge about the data distribution to search for valid
counterfactuals while avoiding the use of information from potentially outdated
decision functions in subsequent time steps. Employing two graph autoencoders
(GAEs), DyGRACE learns the representation of each class in a binary
classification scenario. The GAEs minimise the reconstruction error between the
original graph and its learned representation during training. The method
involves (i) optimising a parametric density function (implemented as a
logistic regression function) to identify counterfactuals by maximising the
factual autoencoder's reconstruction error, (ii) minimising the counterfactual
autoencoder's error, and (iii) maximising the similarity between the factual
and counterfactual graphs. This semi-supervised approach is independent of an
underlying black-box oracle. A logistic regression model is trained on a set of
graph pairs to learn weights that aid in finding counterfactuals. At inference,
for each unseen graph, the logistic regressor identifies the best
counterfactual candidate using these learned weights, while the GAEs can be
iteratively updated to represent the continual adaptation of the learned graph
representation over iterations. DyGRACE is effective in practice and can also act as a drift detector, identifying distributional drift from differences in reconstruction errors across iterations. By avoiding reliance on the oracle's predictions in successive iterations, it makes counterfactual discovery more efficient. With its capacity for contrastive learning and drift detection, DyGRACE opens new avenues for semi-supervised learning and explanation generation.
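The three-term scoring described in the abstract can be illustrated with a minimal sketch. Everything below is a hypothetical stand-in: the "autoencoders" are simple linear maps rather than trained GAEs, and the names (`recon_error`, `dygrace_score`) and fixed weights are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

# Hypothetical linear stand-ins for DyGRACE's two graph autoencoders (GAEs):
# reconstruction error is the squared distance between a graph's feature
# vector and its "reconstruction".
def recon_error(ae_matrix, x):
    return float(np.sum((ae_matrix @ x - x) ** 2))

def dygrace_score(weights, factual_ae, cf_ae, x_factual, x_candidate):
    """Combine the three optimisation terms from the abstract:
    (i) maximise the factual autoencoder's reconstruction error,
    (ii) minimise the counterfactual autoencoder's error,
    (iii) maximise similarity to the factual graph,
    with weights assumed to come from the trained logistic regressor."""
    e_f = recon_error(factual_ae, x_candidate)            # want this high
    e_cf = recon_error(cf_ae, x_candidate)                # want this low
    sim = -float(np.sum((x_factual - x_candidate) ** 2))  # want this high
    z = weights[0] * e_f - weights[1] * e_cf + weights[2] * sim
    return 1.0 / (1.0 + np.exp(-z))                       # logistic score

rng = np.random.default_rng(0)
factual_ae = 0.2 * np.eye(4)   # reconstructs candidates poorly (high error)
cf_ae = np.eye(4)              # reconstructs candidates perfectly (zero error)
x = rng.normal(size=4)
candidates = [x + rng.normal(scale=0.1, size=4) for _ in range(5)]
weights = np.array([1.0, 1.0, 0.5])
best = max(candidates,
           key=lambda c: dygrace_score(weights, factual_ae, cf_ae, x, c))
```

At inference, DyGRACE would rank unseen candidates in this fashion and iteratively update the GAEs between iterations; that update loop is omitted here.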
Similarity modeling for machine learning
Similarity is the extent to which two objects resemble each other. Modeling similarity is an important topic for both machine learning and computer vision. In this dissertation, we first propose a discriminative similarity learning method, then introduce two novel sparse similarity modeling methods for high-dimensional data from the perspectives of manifold learning and subspace learning. Our sparse similarity modeling methods learn sparse similarity and consequently generate a sparse graph over the data. The generated sparse graph leads to superior performance in clustering and semi-supervised learning, compared to existing sparse-graph-based methods such as the ℓ1-graph and Sparse Subspace Clustering (SSC).
More concretely, our discriminative similarity learning method adopts a novel pairwise clustering framework by bridging the gap between clustering and multi-class classification. This pairwise clustering framework learns an unsupervised nonparametric classifier from each data partition, and searches for the optimal partition of the data by minimizing the generalization error of the learned classifiers associated with the data partitions.
Regarding our sparse similarity modeling methods, we propose a novel Regularized ℓ1-Graph (Rℓ1-Graph) to improve the ℓ1-graph from the perspective of manifold learning. Our Rℓ1-Graph generates a sparse graph that is aligned to the manifold structure of the data for better clustering performance. From the perspective of learning the subspace structures of high-dimensional data, we propose the ℓ0-graph, which generates a subspace-consistent sparse graph for clustering and semi-supervised learning. A subspace-consistent sparse graph is a sparse graph in which a data point is connected only to other data points that lie in the same subspace; the representative method Sparse Subspace Clustering (SSC) is proved to generate a subspace-consistent sparse graph under certain assumptions on the subspaces and the data, e.g. independent/disjoint subspaces and subspace incoherence/affinity. In contrast, our ℓ0-graph can generate a subspace-consistent sparse graph for arbitrary distinct underlying subspaces under far less restrictive assumptions, i.e. only i.i.d. random data generation according to arbitrary continuous distributions. Extensive experimental results on various data sets demonstrate the superiority of the ℓ0-graph compared to other methods, including SSC, for both clustering and semi-supervised learning.
The proposed sparse similarity modeling methods require sparse coding using the entire data set as the dictionary, which can be inefficient, especially in the case of large-scale data. To overcome this challenge, we propose Support Regularized Sparse Coding (SRSC), in which a compact dictionary is learned. The data similarity induced by the support regularized sparse codes leads to compelling clustering performance. Moreover, a feed-forward neural network, termed Deep-SRSC, is designed as a fast encoder to approximate the codes generated by SRSC, further improving the efficiency of SRSC.
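The basic construction the dissertation builds on, sparse-coding each point over the remaining data as a dictionary and symmetrising the coefficients into a sparse similarity graph, can be sketched as follows. The plain ISTA solver, the penalty `lam`, and the toy data are assumptions for illustration, not the dissertation's algorithms.

```python
import numpy as np

def sparse_codes(X, lam=0.1, n_iter=200):
    """For each point x_i, solve a Lasso problem with all OTHER points as
    the dictionary (the SSC-style construction), via plain ISTA.
    Returns the (n, n) coefficient matrix with a zero diagonal."""
    n, _ = X.shape
    C = np.zeros((n, n))
    for i in range(n):
        D = np.delete(X, i, axis=0).T          # dictionary: d x (n - 1)
        L = np.linalg.norm(D, 2) ** 2 + 1e-12  # Lipschitz constant of the gradient
        c = np.zeros(n - 1)
        for _ in range(n_iter):
            c = c - (D.T @ (D @ c - X[i])) / L                      # gradient step
            c = np.sign(c) * np.maximum(np.abs(c) - lam / L, 0.0)   # soft-threshold
        C[i, np.arange(n) != i] = c
    return C

def similarity_graph(C):
    """Symmetrise the sparse codes into a similarity graph W."""
    A = np.abs(C)
    return (A + A.T) / 2.0

rng = np.random.default_rng(1)
# Two well-separated point clouds; within-cloud similarities should dominate.
X = np.vstack([rng.normal(0.0, 0.1, (5, 3)), rng.normal(5.0, 0.1, (5, 3))])
W = similarity_graph(sparse_codes(X))
```

The resulting sparse W can then be fed to spectral clustering or graph-based semi-supervised learning; the dissertation's contribution is in designing better regularizers and dictionaries for this step, not reproduced here.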
Semi-supervised Learning with the Regularized Laplacian
We study a semi-supervised learning method based on the similarity graph and the Regularized Laplacian. We give a convenient optimization formulation of the Regularized Laplacian method and establish its various properties. In particular, we show that the kernel of the method can be interpreted in terms of discrete- and continuous-time random walks and possesses several important properties of proximity measures. Both optimization and linear algebra methods can be used for efficient computation of the classification functions. We demonstrate on numerical examples that the Regularized Laplacian method is competitive with other state-of-the-art semi-supervised learning methods.
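A minimal sketch of this kind of classification, assuming the standard Regularized Laplacian kernel (I + βL)^(-1) applied to a seed-label matrix; the toy graph, the value of β, and the function names are illustrative assumptions, not the paper's experiments.

```python
import numpy as np

def regularized_laplacian_ssl(W, y_labeled, labeled_idx, beta=1.0):
    """Classify nodes with the Regularized Laplacian kernel (I + beta*L)^(-1),
    the kernel the abstract interprets via random walks."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W                 # combinatorial Laplacian
    Y = np.zeros((n, y_labeled.shape[1]))
    Y[labeled_idx] = y_labeled                     # one-hot seed labels
    F = np.linalg.solve(np.eye(n) + beta * L, Y)   # classification functions
    return F.argmax(axis=1)

# Two triangles joined by a weak bridge edge (2 -- 3).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
labels = regularized_laplacian_ssl(W, np.eye(2), [0, 3])
```

With one labeled node per triangle, the diffusion stays mostly within each triangle because the bridge weight is small, so each unlabeled node inherits its cluster's label.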
Predicting the hosts of prokaryotic viruses using GCN-based semi-supervised learning
Background: Prokaryotic viruses, which infect bacteria and archaea, are the
most abundant and diverse biological entities in the biosphere. To understand
their regulatory roles in various ecosystems and to harness the potential of
bacteriophages for use in therapy, more knowledge of viral-host relationships
is required. High-throughput sequencing and its application to the microbiome
have offered new opportunities for computational approaches for predicting
which hosts particular viruses can infect. However, there are two main
challenges for computational host prediction. First, the empirically known
virus-host relationships are very limited. Second, although sequence similarity
between viruses and their prokaryotic hosts has been used as a major feature
for host prediction, the alignment is either missing or ambiguous in many
cases. Thus, there is still a need to improve the accuracy of host prediction.
Results: In this work, we present a semi-supervised learning model, named
HostG, to conduct host prediction for novel viruses. We construct a knowledge
graph by utilizing both virus-virus protein similarity and virus-host DNA
sequence similarity. Then a graph convolutional network (GCN) is adopted to
exploit viruses with or without known hosts in training to enhance the learning
ability. During the GCN training, we minimize the expected calibration error
(ECE) to ensure the confidence of the predictions. We tested HostG on both
simulated and real sequencing data and compared its performance with other
state-of-the-art methods specifically designed for virus host classification
(VHM-net, WIsH, PHP, HoPhage, RaFAH, vHULK, and VPF-Class). Conclusion: HostG
outperforms other popular methods, demonstrating the efficacy of using a
GCN-based semi-supervised learning approach. A particular advantage of HostG is
its ability to predict hosts from new taxa.
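The expected calibration error used as a training objective here can be computed with the standard equal-width binning scheme. This is a generic ECE sketch with a hypothetical toy example, not HostG's code.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Expected Calibration Error (ECE): partition predictions into
    equal-width confidence bins and average |accuracy - confidence|,
    weighted by the fraction of predictions in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = correct[mask].mean()    # empirical accuracy in the bin
            conf = confidences[mask].mean()  # mean predicted confidence
            ece += mask.sum() / n * abs(acc - conf)
    return ece

# Perfectly calibrated toy example: 80% confidence, 80% accurate -> ECE = 0.
conf = np.full(10, 0.8)
corr = np.array([1] * 8 + [0] * 2)
```

Minimising ECE during training pushes the model's predicted probabilities toward matching its empirical accuracy, so that the confidence attached to each host prediction is trustworthy.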
GenPR: Generative PageRank Framework for Semi-supervised Learning on Citation Graphs
Nowadays, Semi-Supervised Learning (SSL) on citation-graph data sets is a rapidly growing area of research. However, recently proposed graph-based SSL algorithms use a default adjacency matrix with binary weights on edges (citations), which loses the similarity information between nodes (papers). In this work, we therefore propose a framework that embeds PageRank SSL in a generative model. This framework allows joint training of the nodes' latent-space representation and of label spreading through an adjacency matrix reweighted by node similarities in the latent space. We explain how the generative model can improve accuracy and reduce the number of iteration steps for PageRank SSL. Moreover, we show that our framework outperforms the best graph-based SSL algorithms on four public citation-graph data sets and improves the interpretability of classification results.
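The PageRank SSL step that GenPR builds on admits a closed-form sketch. The toy graph, the damping value `alpha`, and the function name are assumptions for illustration; GenPR's actual contribution, learning the reweighted adjacency from latent-space similarities, is not reproduced here.

```python
import numpy as np

def pagerank_ssl(W, Y, alpha=0.9):
    """PageRank-style label spreading: F = (1 - alpha) * (I - alpha*P)^(-1) @ Y,
    where P is the row-normalised adjacency matrix and Y holds one-hot seed
    labels. GenPR would replace the fixed W with a learned, similarity-
    reweighted adjacency."""
    P = W / W.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    n = W.shape[0]
    F = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * P, Y)
    return F.argmax(axis=1)

# Toy citation graph: two triangles joined by one weak edge (2 -- 3).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1
Y = np.zeros((6, 2))
Y[0, 0] = 1.0   # node 0 labeled class 0
Y[5, 1] = 1.0   # node 5 labeled class 1
labels = pagerank_ssl(W, Y)
```

Because random walks from each unlabeled node reach the labeled node in its own triangle far more often than the one across the weak bridge, each node is assigned its cluster's label.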