Active learning of constraints for semi-supervised clustering
Semi-supervised clustering aims to improve clustering performance by considering user supervision in the form of pairwise constraints. In this paper, we study the active learning problem of selecting pairwise must-link and cannot-link constraints for semi-supervised clustering. We consider active learning in an iterative manner, where in each iteration queries are selected based on the current clustering solution and the existing constraint set. We apply a general framework that builds on the concept of neighborhood, where neighborhoods contain "labeled examples" of different clusters according to the pairwise constraints. Our active learning method expands the neighborhoods by selecting informative points and querying their relationship with the neighborhoods. Under this framework, we build on the classic uncertainty-based principle and present a novel approach for computing the uncertainty associated with each data point. We further introduce a selection criterion that trades off the amount of uncertainty of each data point against the expected number of queries (the cost) required to resolve this uncertainty. This allows us to select queries that have the highest information rate. We evaluate the proposed method on benchmark datasets, and the results demonstrate consistent and substantial improvements over the current state of the art.
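The selection criterion described above (uncertainty divided by expected query cost) can be sketched as follows. The abstract does not give the formulas, so this is an illustration under stated assumptions: uncertainty is taken to be the Shannon entropy of a point's membership probabilities over the current neighborhoods, and the expected cost assumes neighborhoods are queried in decreasing order of probability until a must-link is found. The function names and the `membership` table are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (bits) of a membership distribution (assumed uncertainty measure)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_queries(probs):
    """Expected number of queries, assuming neighborhoods are asked in
    decreasing order of probability and the i-th query succeeds with p_i."""
    ordered = sorted(probs, reverse=True)
    return sum(rank * p for rank, p in enumerate(ordered, start=1))


def select_query_point(membership):
    """Return the point with the highest information rate:
    uncertainty divided by expected query cost."""
    def rate(item):
        _, probs = item
        return entropy(probs) / expected_queries(probs)

    best, _ = max(membership.items(), key=rate)
    return best


# Hypothetical membership probabilities over three neighborhoods.
membership = {
    "a": [0.98, 0.01, 0.01],  # nearly certain: little to learn
    "b": [0.50, 0.50, 0.00],  # uncertain, cheap to resolve
    "c": [0.34, 0.33, 0.33],  # most uncertain, most expensive
}
chosen = select_query_point(membership)
```

With these made-up numbers the point `"c"` is chosen: its extra uncertainty outweighs its higher expected cost. A different probability model could reverse that ordering.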
Active Semi-supervised Clustering
Clustering is a challenging problem since there usually exist multiple possible clusterings of a given data set. This makes it strongly problem-dependent and subjective. Active semi-supervised clustering methods are designed to actively ask for background knowledge in order to provide the best clustering for a given problem. This thesis reviews several state-of-the-art semi-supervised clustering methods, with emphasis on methods utilizing pairwise constraints, and three schemes for active learning of pairwise constraints. Experiments are conducted to empirically evaluate all reviewed methods on various data sets. Results of the experiments show that active semi-supervised clustering significantly outperforms unsupervised clustering in terms of agreement with a reference clustering. However, none of the methods is superior to the other reviewed methods on all data sets. In the thesis, further directions for extending the current methods are proposed.
Graph-based Semi-Supervised & Active Learning for Edge Flows
We present a graph-based semi-supervised learning (SSL) method for learning
edge flows defined on a graph. Specifically, given flow measurements on a
subset of edges, we want to predict the flows on the remaining edges. To this
end, we develop a computational framework that imposes certain constraints on
the overall flows, such as (approximate) flow conservation. These constraints
render our approach different from classical graph-based SSL for vertex labels,
which posits that tightly connected nodes share similar labels and leverages
the graph structure accordingly to extrapolate from a few vertex labels to the
unlabeled vertices. We derive bounds for our method's reconstruction error and
demonstrate its strong performance on synthetic and real-world flow networks
from transportation, physical infrastructure, and the Web. Furthermore, we
provide two active learning algorithms for selecting informative edges on which
to measure flow, which has applications for optimal sensor deployment. The
first strategy selects edges to minimize the reconstruction error bound and
works well on flows that are approximately divergence-free. The second approach
clusters the graph and selects bottleneck edges that cross cluster boundaries,
which works well on flows with global trends.
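The idea of predicting unmeasured edge flows under approximate flow conservation can be illustrated with a small sketch. The abstract does not state the objective, so the one used here is an assumption: minimize the sum of squared node divergences (net inflow) plus an optional ridge penalty `lam` on the unknown flows, with measured flows held fixed. The solver is plain coordinate descent, chosen only to keep the example dependency-free. `predict_flows` and its parameters are hypothetical names, and the sketch assumes no self-loops.

```python
def predict_flows(num_nodes, edges, measured, lam=0.0, sweeps=200):
    """Fill in unmeasured edge flows by coordinate descent.

    edges    : list of (u, v) pairs; positive flow goes from u to v
    measured : dict {edge index: known flow}, kept fixed
    lam      : ridge penalty on unknown flows (assumed regularizer)
    """
    flows = [measured.get(i, 0.0) for i in range(len(edges))]
    unknown = [i for i in range(len(edges)) if i not in measured]

    for _ in range(sweeps):
        # Net inflow (divergence) at every node for the current flows.
        div = [0.0] * num_nodes
        for (u, v), f in zip(edges, flows):
            div[u] -= f
            div[v] += f

        for i in unknown:
            u, v = edges[i]
            # Divergence at the endpoints with edge i taken out.
            r_u = div[u] + flows[i]
            r_v = div[v] - flows[i]
            # Minimizer of (r_u - f)^2 + (r_v + f)^2 + lam * f^2.
            new_flow = (r_u - r_v) / (2.0 + lam)
            div[u] = r_u - new_flow
            div[v] = r_v + new_flow
            flows[i] = new_flow
    return flows


# A directed 4-cycle with one measured edge carrying 2 units of flow.
cycle = [(0, 1), (1, 2), (2, 3), (3, 0)]
predicted = predict_flows(4, cycle, measured={0: 2.0})
```

On a single cycle, the only divergence-free flow consistent with the measurement is a constant circulation, so every predicted value converges to 2. A positive `lam` would shrink the unknown flows slightly toward zero.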
Multi-view constrained clustering with an incomplete mapping between views
Multi-view learning algorithms typically assume a complete bipartite mapping
between the different views in order to exchange information during the
learning process. However, many applications provide only a partial mapping
between the views, creating a challenge for current methods. To address this
problem, we propose a multi-view algorithm based on constrained clustering that
can operate with an incomplete mapping. Given a set of pairwise constraints in
each view, our approach propagates these constraints using a local similarity
measure to those instances that can be mapped to the other views, allowing the
propagated constraints to be transferred across views via the partial mapping.
It uses co-EM to iteratively estimate the propagation within each view based on
the current clustering model, transfer the constraints across views, and then
update the clustering model. By alternating the learning process between views,
this approach produces a unified clustering model that is consistent with all
views. We show that this approach significantly improves clustering performance
over several other methods for transferring constraints and allows multi-view
clustering to be reliably applied when given a limited mapping between the
views. Our evaluation reveals that the propagated constraints have high
precision with respect to the true clusters in the data, explaining their
benefit to clustering performance in both single- and multi-view learning
scenarios.
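As a rough illustration of propagating a pairwise constraint with a local similarity measure, the sketch below assigns a weight to a candidate pair based on how close its two points lie to the endpoints of a given constraint. The abstract does not specify the similarity measure or how weights are combined, so the Gaussian kernel, the product of endpoint similarities, and the names `rbf` and `propagated_weight` are all assumptions made for this example. The co-EM estimation and the cross-view transfer are not shown.

```python
import math


def rbf(x, y, sigma):
    """Gaussian similarity between two points (assumed local similarity measure)."""
    return math.exp(-math.dist(x, y) ** 2 / (2 * sigma ** 2))


def propagated_weight(a, b, constraint, sigma=1.0):
    """Weight with which a constraint on (u, v) is propagated to the pair (a, b).

    The pair receives a high weight only if both of its points lie close to
    the constraint's endpoints, in either orientation.
    """
    u, v = constraint
    direct = rbf(a, u, sigma) * rbf(b, v, sigma)
    swapped = rbf(a, v, sigma) * rbf(b, u, sigma)
    return max(direct, swapped)


# Hypothetical constraint between two 2-D points.
constraint = ((0.0, 0.0), (4.0, 0.0))
near = propagated_weight((0.1, 0.0), (4.0, 0.1), constraint)
far = propagated_weight((2.0, 0.0), (2.0, 0.0), constraint)
```

Here `near` is close to 1 and `far` is close to 0, so the constraint is propagated mainly to pairs in the immediate vicinity of its endpoints. The bandwidth `sigma` controls how far that vicinity extends.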