Safe Semi-Supervised Learning with Sparse Graphs
There has been substantial interest from both computer science and statistics in developing methods for graph-based semi-supervised learning. The attraction of the area stems from several challenging applications in academia and industry where few data come with training responses while plenty of data are available overall. Ample evidence has demonstrated the value of several of these methods in real data applications, but it should be kept in mind that they rely heavily on smoothness assumptions. The general framework for graph-based semi-supervised learning is to optimize a smooth function over the nodes of the proximity graph constructed from the feature data, which is extremely time consuming because conventional methods for graph construction generally create a dense graph. Lately the interest has shifted to developing faster and more efficient graph-based techniques for larger data, but this comes at the cost of reduced prediction accuracy and a narrower range of application. The focus of this research is to generate a graph-based semi-supervised model that attains fast convergence without losing performance and with wider applicability. The key feature of the semi-supervised model is that it does not fully rely on the smoothness assumptions and performs adequately on real data. Another model is proposed for the case where multiple views are available. Empirical analysis with real and simulated data showed the competitive performance of the methods against other machine learning algorithms.
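The general framework described in this abstract (a smooth function over the nodes of a proximity graph) can be illustrated with a minimal sketch. It assumes a symmetrized k-nearest-neighbour graph as the sparse proximity graph and plain iterative label propagation as the smoothing step; neither is claimed to be the authors' model, and the names `knn_graph` and `propagate` are hypothetical.

```python
import math


def knn_graph(points, k):
    """Build a symmetrized k-nearest-neighbour graph as a list of neighbour sets.

    Assumption: Euclidean distance on the feature vectors. Keeping only k
    neighbours per node is what makes the graph sparse.
    """
    n = len(points)
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        ranked = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )
        for j in ranked[:k]:
            neighbours[i].add(j)
            neighbours[j].add(i)
    return neighbours


def propagate(neighbours, labels, n_iter=100):
    """Propagate binary labels over the graph.

    labels maps a labeled node index to 0.0 or 1.0. Labeled nodes stay
    clamped; every unlabeled node repeatedly takes the mean score of its
    neighbours, which encodes the smoothness assumption.
    """
    n = len(neighbours)
    scores = [labels.get(i, 0.5) for i in range(n)]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            if i in labels:
                updated.append(labels[i])
            elif neighbours[i]:
                total = sum(scores[j] for j in neighbours[i])
                updated.append(total / len(neighbours[i]))
            else:
                updated.append(scores[i])
        scores = updated
    return scores


# Two well-separated clusters with one labeled point each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
graph = knn_graph(points, k=2)
scores = propagate(graph, labels={0: 0.0, 3: 1.0})
```

Because each unlabeled score is an average of neighbouring scores, the result depends entirely on the smoothness assumption that the abstract cautions against relying on fully.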
Weakly Supervised-Based Oversampling for High Imbalance and High Dimensionality Data Classification
With the abundance of industrial datasets, imbalanced classification has
become a common problem in several application domains. Oversampling is an
effective method to solve imbalanced classification. One of the main challenges
of the existing oversampling methods is to accurately label the new synthetic
samples. Inaccurate labels of the synthetic samples would distort the
distribution of the dataset and possibly worsen the classification performance.
This paper introduces the idea of weakly supervised learning to handle the
inaccurate labeling of synthetic samples caused by traditional oversampling
methods. Graph semi-supervised SMOTE is developed to improve the credibility of
the synthetic samples' labels. In addition, we propose cost-sensitive
neighborhood components analysis for high dimensional datasets and bootstrap
based ensemble framework for highly imbalanced datasets. The proposed method
has achieved good classification performance on 8 synthetic datasets and 3
real-world datasets, especially for high imbalance and high dimensionality
problems. Its average performance and robustness are better than those of the
benchmark methods.
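For context, the traditional oversampling step that this abstract refers to can be sketched as follows. This is plain SMOTE-style interpolation under stated assumptions (Euclidean distance, a hypothetical helper named `smote`, illustrative parameter values), not the graph semi-supervised variant the paper develops.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours. Plain
    SMOTE simply assumes the new point belongs to the minority class.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        nearest = sorted(others, key=lambda p: math.dist(x, p))[:k]
        neighbour = rng.choice(nearest)
        u = rng.random()  # interpolation weight in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, neighbour)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_new=5)
```

The label of each synthetic point is assumed rather than verified, which is the source of the inaccurate labels the abstract describes.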
Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch
Graph-based semi-supervised learning (SSL) algorithms have been successfully
used in a large number of applications. These methods classify initially
unlabeled nodes by propagating label information over the structure of the
graph, starting from seed nodes. Graph-based SSL algorithms usually scale linearly
with the number of distinct labels (m), and require O(m) space on each node.
Unfortunately, there exist many applications of practical significance with
very large m over large graphs, demanding better space and time complexity. In
this paper, we propose MAD-SKETCH, a novel graph-based SSL algorithm which
compactly stores label distribution on each node using Count-min Sketch, a
randomized data structure. We present theoretical analysis showing that under
mild conditions, MAD-SKETCH can reduce space complexity at each node from O(m)
to O(log m), and achieve similar savings in time complexity as well. We support
our analysis through experiments on multiple real world datasets. We observe
that MAD-SKETCH achieves performance similar to that of existing state-of-the-art
graph-based SSL algorithms, while requiring a smaller memory footprint and at
the same time achieving up to 10x speedup. We find that MAD-SKETCH is able to
scale to datasets with one million labels, which is beyond the scope of
existing graph-based SSL algorithms.
Comment: 9 pages
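The Count-min Sketch mentioned in this abstract can be illustrated with a minimal, generic sketch. It is not the MAD-SKETCH implementation: the class name, the BLAKE2b-based hashing, and the `width` and `depth` values are all assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate store of non-negative scores keyed by label name.

    Memory is width * depth counters regardless of how many distinct
    labels are added, which is why it can replace an O(m) label vector.
    """

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0.0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, derived by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            key.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, value=1.0):
        """Add a non-negative value to the key's counter in every row."""
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += value

    def estimate(self, key):
        """Return the minimum counter across rows.

        With non-negative updates, collisions can only inflate a counter,
        so the estimate never falls below the true total.
        """
        return min(
            self.table[row][self._index(row, key)] for row in range(self.depth)
        )


# Label scores for a single node; label names are illustrative.
sketch = CountMinSketch(width=64, depth=4)
sketch.add("label_7", 0.6)
sketch.add("label_7", 0.2)
sketch.add("label_42", 0.1)
score = sketch.estimate("label_7")
```

Taking the minimum over rows is the standard way to limit overestimation from hash collisions; larger `width` and `depth` trade memory for accuracy.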
Data-Driven Shape Analysis and Processing
Data-driven methods play an increasingly important role in discovering
geometric, structural, and semantic relationships between 3D shapes in
collections, and applying this analysis to support intelligent modeling,
editing, and visualization of geometric data. In contrast to traditional
approaches, a key feature of data-driven approaches is that they aggregate
information from a collection of shapes to improve the analysis and processing
of individual shapes. In addition, they are able to learn models that reason
about properties and relationships of shapes without relying on hard-coded
rules or explicitly programmed instructions. We provide an overview of the main
concepts and components of these techniques, and discuss their application to
shape classification, segmentation, matching, reconstruction, modeling and
exploration, as well as scene analysis and synthesis, through reviewing the
literature and relating the existing works with both qualitative and numerical
comparisons. We conclude our report with ideas that can inspire future research
in data-driven shape analysis and processing.
Comment: 10 pages, 19 figures