7,722 research outputs found

    Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch

    Full text link
    Graph-based Semi-supervised learning (SSL) algorithms have been successfully used in a large number of applications. These methods classify initially unlabeled nodes by propagating label information over the structure of graph starting from seed nodes. Graph-based SSL algorithms usually scale linearly with the number of distinct labels (m), and require O(m) space on each node. Unfortunately, there exist many applications of practical significance with very large m over large graphs, demanding better space and time complexity. In this paper, we propose MAD-SKETCH, a novel graph-based SSL algorithm which compactly stores label distribution on each node using Count-min Sketch, a randomized data structure. We present theoretical analysis showing that under mild conditions, MAD-SKETCH can reduce space complexity at each node from O(m) to O(log m), and achieve similar savings in time complexity as well. We support our analysis through experiments on multiple real world datasets. We observe that MAD-SKETCH achieves similar performance as existing state-of-the-art graph- based SSL algorithms, while requiring smaller memory footprint and at the same time achieving up to 10x speedup. We find that MAD-SKETCH is able to scale to datasets with one million labels, which is beyond the scope of existing graph- based SSL algorithms.Comment: 9 page

    Semi-supervised Embedding in Attributed Networks with Outliers

    Full text link
    In this paper, we propose a novel framework, called Semi-supervised Embedding in Attributed Networks with Outliers (SEANO), to learn a low-dimensional vector representation that systematically captures the topological proximity, attribute affinity and label similarity of vertices in a partially labeled attributed network (PLAN). Our method is designed to work in both transductive and inductive settings while explicitly alleviating noise effects from outliers. Experimental results on various datasets drawn from the web, text and image domains demonstrate the advantages of SEANO over state-of-the-art methods in semi-supervised classification under transductive as well as inductive settings. We also show that a subset of parameters in SEANO is interpretable as outlier score and can significantly outperform baseline methods when applied for detecting network outliers. Finally, we present the use of SEANO in a challenging real-world setting -- flood mapping of satellite images and show that it is able to outperform modern remote sensing algorithms for this task.Comment: in Proceedings of SIAM International Conference on Data Mining (SDM'18

    Active Semi-Supervised Learning Using Sampling Theory for Graph Signals

    Full text link
    We consider the problem of offline, pool-based active semi-supervised learning on graphs. This problem is important when the labeled data is scarce and expensive whereas unlabeled data is easily available. The data points are represented by the vertices of an undirected graph with the similarity between them captured by the edge weights. Given a target number of nodes to label, the goal is to choose those nodes that are most informative and then predict the unknown labels. We propose a novel framework for this problem based on our recent results on sampling theory for graph signals. A graph signal is a real-valued function defined on each node of the graph. A notion of frequency for such signals can be defined using the spectrum of the graph Laplacian matrix. The sampling theory for graph signals aims to extend the traditional Nyquist-Shannon sampling theory by allowing us to identify the class of graph signals that can be reconstructed from their values on a subset of vertices. This approach allows us to define a criterion for active learning based on sampling set selection which aims at maximizing the frequency of the signals that can be reconstructed from their samples on the set. Experiments show the effectiveness of our method.Comment: 10 pages, 6 figures, To appear in KDD'1

    A reduced labeled samples (RLS) framework for classification of imbalanced concept-drifting streaming data.

    Get PDF
    Stream processing frameworks are designed to process the streaming data that arrives in time. An example of such data is stream of emails that a user receives every day. Most of the real world data streams are also imbalanced as is in the stream of emails, which contains few spam emails compared to a lot of legitimate emails. The classification of the imbalanced data stream is challenging due to the several reasons: First of all, data streams are huge and they can not be stored in the memory for one time processing. Second, if the data is imbalanced, the accuracy of the majority class mostly dominates the results. Third, data streams are changing over time, and that causes degradation in the model performance. Hence the model should get updated when such changes are detected. Finally, the true labels of the all samples are not available immediately after classification, and only a fraction of the data is possible to get labeled in real world applications. That is because the labeling is expensive and time consuming. In this thesis, a framework for modeling the streaming data when the classes of the data samples are imbalanced is proposed. This framework is called Reduced Labeled Samples (RLS). RLS is a chunk based learning framework that builds a model using partially labeled data stream, when the characteristics of the data change. In RLS, a fraction of the samples are labeled and are used in modeling, and the performance is not significantly different from that of the 100% labeling. RLS maintains an ensemble of classifiers to boost the performance. RLS uses the information from labeled data in a supervised fashion, and also is extended to use the information from unlabeled data in a semi supervised fashion. RLS addresses both binary and multi class partially labeled data stream and the results show the basis of RLS is effective even in the context of multi class classification problems. Overall, the RLS is shown to be an effective framework for processing imbalanced and partially labeled data streams

    A submodular optimization framework for never-ending learning : semi-supervised, online, and active learning.

    Get PDF
    The revolution in information technology and the explosion in the use of computing devices in people\u27s everyday activities has forever changed the perspective of the data mining and machine learning fields. The enormous amounts of easily accessible, information rich data is pushing the data analysis community in general towards a shift of paradigm. In the new paradigm, data comes in the form a stream of billions of records received everyday. The dynamic nature of the data and its sheer size makes it impossible to use the traditional notion of offline learning where the whole data is accessible at any time point. Moreover, no amount of human resources is enough to get expert feedback on the data. In this work we have developed a unified optimization based learning framework that approaches many of the challenges mentioned earlier. Specifically, we developed a Never-Ending Learning framework which combines incremental/online, semi-supervised, and active learning under a unified optimization framework. The established framework is based on the class of submodular optimization methods. At the core of this work we provide a novel formulation of the Semi-Supervised Support Vector Machines (S3VM) in terms of submodular set functions. The new formulation overcomes the non-convexity issues of the S3VM and provides a state of the art solution that is orders of magnitude faster than the cutting edge algorithms in the literature. Next, we provide a stream summarization technique via exemplar selection. This technique makes it possible to keep a fixed size exemplar representation of a data stream that can be used by any label propagation based semi-supervised learning technique. The compact data steam representation allows a wide range of algorithms to be extended to incremental/online learning scenario. Under the same optimization framework, we provide an active learning algorithm that constitute the feedback between the learning machine and an oracle. Finally, the developed Never-Ending Learning framework is essentially transductive in nature. Therefore, our last contribution is an inductive incremental learning technique for incremental training of SVM using the properties of local kernels. We demonstrated through this work the importance and wide applicability of the proposed methodologies
    corecore