983 research outputs found

    Taxonomy and clustering in collaborative systems: the case of the on-line encyclopedia Wikipedia

    In this paper we investigate the nature and structure of the relation between imposed classifications and real clustering in a particular case of a scale-free network, the on-line encyclopedia Wikipedia. We find a statistical similarity in the distributions of community sizes obtained both by the top-down approach of the category division present in the archive and by the bottom-up procedure of community detection given by an algorithm based on the spectral properties of the graph. Regardless of the statistically similar behaviour, the two methods provide rather different divisions of the articles, thereby signaling that the nature and presence of power laws is a general feature of these systems and cannot be used as a benchmark to evaluate the suitability of a clustering method. Comment: 5 pages, 3 figures, epl2 styl

    Approaches for enriching and improving textual knowledge bases

    [no abstract]

    Exhaustive and Efficient Constraint Propagation: A Semi-Supervised Learning Perspective and Its Applications

    This paper presents a novel pairwise constraint propagation approach that decomposes the challenging constraint propagation problem into a set of independent semi-supervised learning subproblems, which can be solved in quadratic time using label propagation based on k-nearest neighbor graphs. Considering that this time cost is proportional to the number of all possible pairwise constraints, our approach provides an efficient solution for exhaustively propagating pairwise constraints throughout the entire dataset. The resulting exhaustive set of propagated pairwise constraints is further used to adjust the similarity matrix for constrained spectral clustering. Beyond traditional constraint propagation on single-source data, our approach is also extended to the more challenging problem of constraint propagation on multi-source data, where each pairwise constraint is defined over a pair of data points from different sources. This multi-source constraint propagation has an important application to cross-modal multimedia retrieval. Extensive results have shown the superior performance of our approach. Comment: The short version of this paper appears as an oral paper in ECCV 201
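    The decomposition described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the update rule `F <- alpha * L F + (1 - alpha) * Z`, the symmetric normalization, the dense matrices (the abstract uses k-nearest neighbor graphs), and all names (`normalize`, `propagate`, `propagate_constraints`, `alpha`, `iters`) are assumptions chosen for illustration. Each column of the constraint matrix `Z` is treated as one independent label-propagation problem (vertical pass), and the result is then propagated along the rows (horizontal pass).

```python
def normalize(W):
    """Symmetric normalization L = D^{-1/2} W D^{-1/2} (assumes positive degrees)."""
    d = [sum(row) for row in W]
    n = len(W)
    return [[W[i][j] / (d[i] * d[j]) ** 0.5 for j in range(n)] for i in range(n)]


def transpose(M):
    return [list(col) for col in zip(*M)]


def propagate(L, Z, alpha=0.8, iters=300):
    """Iterate F <- alpha * L F + (1 - alpha) * Z.

    Every column of Z is an independent label-propagation subproblem.
    """
    n, m = len(Z), len(Z[0])
    F = [row[:] for row in Z]
    for _ in range(iters):
        F = [[alpha * sum(L[i][j] * F[j][c] for j in range(n))
              + (1 - alpha) * Z[i][c]
              for c in range(m)]
             for i in range(n)]
    return F


def propagate_constraints(W, Z, alpha=0.8, iters=300):
    """Vertical pass over columns of Z, then horizontal pass over rows."""
    L = normalize(W)
    Fv = propagate(L, Z, alpha, iters)              # column-wise propagation
    Fh = propagate(L, transpose(Fv), alpha, iters)  # row-wise, via transpose
    return transpose(Fh)


# Toy data (hypothetical): two tight pairs {0, 1} and {2, 3}, weakly connected.
W = [[0.0, 1.0, 0.1, 0.1],
     [1.0, 0.0, 0.1, 0.1],
     [0.1, 0.1, 0.0, 1.0],
     [0.1, 0.1, 1.0, 0.0]]
# One cannot-link constraint between points 0 and 2, encoded as -1.
Z = [[0.0] * 4 for _ in range(4)]
Z[0][2] = Z[2][0] = -1.0

F = propagate_constraints(W, Z)
print(F[1][3])  # negative: the cannot-link has spread to the pair (1, 3)
```

    In this sketch the single cannot-link between points 0 and 2 yields a negative score for every other pair as well, which is the sense in which the propagation is exhaustive. How such scores are then used to adjust the similarity matrix is left out here, since the abstract does not give the rule.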

    Generalized Optimization Framework for Graph-based Semi-supervised Learning

    We develop a generalized optimization framework for graph-based semi-supervised learning. The framework gives as particular cases the Standard Laplacian, Normalized Laplacian and PageRank based methods. We also provide a new probabilistic interpretation based on random walks and characterize the limiting behaviour of the methods. The random walk based interpretation allows us to explain differences between the performances of methods with different smoothing kernels. It appears that the PageRank based method is robust with respect to the choice of the regularization parameter and the labelled data. We illustrate our theoretical results with two realistic datasets, characterizing different challenges: the Les Miserables characters social network and the Wikipedia hyper-link graph. The graph-based semi-supervised learning classifies the Wikipedia articles with very good precision and perfect recall, employing only the information about the hyper-text links.
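    A minimal sketch of how one parameter can select among the three methods named in this abstract. The abstract itself gives no formula, so everything here is an assumption for illustration: the fixed-point iteration `F <- alpha * S F + (1 - alpha) * Y` with `S = D^{-sigma} W D^{sigma-1}`, the reading of `sigma = 1` as the Standard Laplacian method, `sigma = 0.5` as the Normalized Laplacian method and `sigma = 0` as the PageRank based method, and the names `generalized_ssl`, `predict`, `sigma`, `alpha` and `iters`.

```python
def generalized_ssl(W, Y, sigma, alpha=0.9, iters=500):
    """Iterate F <- alpha * S F + (1 - alpha) * Y with S = D^{-sigma} W D^{sigma-1}.

    W: symmetric nonnegative weight matrix (list of lists, positive degrees).
    Y: n x k label matrix, Y[i][c] = 1.0 if node i is labelled with class c.
    """
    n, k = len(W), len(Y[0])
    d = [sum(row) for row in W]
    S = [[d[i] ** (-sigma) * W[i][j] * d[j] ** (sigma - 1.0) for j in range(n)]
         for i in range(n)]
    F = [row[:] for row in Y]
    for _ in range(iters):
        F = [[alpha * sum(S[i][j] * F[j][c] for j in range(n))
              + (1 - alpha) * Y[i][c]
              for c in range(k)]
             for i in range(n)]
    return F


def predict(F):
    """Assign each node to the class with the largest classification score."""
    return [max(range(len(row)), key=lambda c: row[c]) for row in F]


# Toy graph (hypothetical): two triangles {0, 1, 2} and {3, 4, 5} joined by edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
W = [[0.0] * 6 for _ in range(6)]
for i, j in edges:
    W[i][j] = W[j][i] = 1.0

# One labelled node per class: node 0 is class 0, node 5 is class 1.
Y = [[0.0, 0.0] for _ in range(6)]
Y[0][0] = 1.0
Y[5][1] = 1.0

for sigma in (1.0, 0.5, 0.0):
    print(sigma, predict(generalized_ssl(W, Y, sigma)))
```

    On this symmetric toy graph all three settings split the nodes along the bridge edge, so it does not reproduce the performance differences the abstract discusses; it only shows where the choice of normalization enters the computation.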