Incremental Spectral Sparsification for Large-Scale Graph-Based Semi-Supervised Learning
While the harmonic function solution performs well in many semi-supervised
learning (SSL) tasks, it is known to scale poorly with the number of samples.
Recent successful and scalable methods, such as the eigenfunction method, focus
on efficiently approximating the whole spectrum of the graph Laplacian
constructed from the data. This is in contrast to various subsampling and
quantization methods proposed in the past, which may fail to preserve the
graph spectra. However, the impact of the approximation of the spectrum on the
final generalization error is either unknown, or requires strong assumptions on
the data. In this paper, we introduce Sparse-HFS, an efficient
edge-sparsification algorithm for SSL. By constructing an edge-sparse and
spectrally similar graph, we are able to leverage the approximation guarantees
of spectral sparsification methods to bound the generalization error of
Sparse-HFS. As a result, we obtain a theoretically grounded approximation
scheme for graph-based SSL that also empirically matches the performance of
known large-scale methods.
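
For readers unfamiliar with the harmonic function solution mentioned in this abstract, the following is a minimal sketch of the baseline idea only, not of Sparse-HFS itself. It relies on the standard property that each unlabeled node's score equals the weighted average of its neighbors' scores, and approximates that fixed point by simple iteration. The function name `harmonic_solution`, the dictionary-based graph representation, and the toy path graph are all assumptions made for illustration; every unlabeled node is assumed to have at least one neighbor.

```python
def harmonic_solution(weights, labels, iterations=200):
    """Approximate the harmonic function on a weighted graph by Jacobi iteration.

    weights: dict mapping node -> dict of neighbor -> edge weight (symmetric).
    labels:  dict mapping labeled node -> known value (e.g. 0.0 or 1.0).
    Both argument layouts are hypothetical, chosen for readability.
    """
    # Labeled nodes keep their given value; unlabeled nodes start at 0.0.
    f = {node: labels.get(node, 0.0) for node in weights}
    unlabeled = [node for node in weights if node not in labels]

    for _ in range(iterations):
        updated = dict(f)
        for node in unlabeled:
            degree = sum(weights[node].values())
            # Harmonic property: score = weighted average of neighbor scores.
            updated[node] = sum(w * f[nbr] for nbr, w in weights[node].items()) / degree
        f = updated
    return f


# Toy example: path graph 0 - 1 - 2 - 3 with unit weights,
# node 0 labeled 1.0 and node 3 labeled 0.0.
path = {
    0: {1: 1.0},
    1: {0: 1.0, 2: 1.0},
    2: {1: 1.0, 3: 1.0},
    3: {2: 1.0},
}
scores = harmonic_solution(path, {0: 1.0, 3: 0.0})
```

On this toy graph the unlabeled scores interpolate between the two labels. Each sweep of this sketch touches every edge incident to an unlabeled node, which hints at why the number of edges matters for cost and why an edge-sparse, spectrally similar graph is attractive.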
Putting Self-Supervised Token Embedding on the Tables
Information distribution by electronic messages is a privileged means of
transmission for many businesses and individuals, often in the form of
plain-text tables. As their number grows, it becomes necessary to extract
text and numbers algorithmically rather than by hand. Common methods rely
on regular expressions or on a strict structure in the data, but they are
ineffective when there are many variations, fuzzy structure, or implicit labels.
In this paper, we introduce SC2T, a fully self-supervised model for
constructing vector representations of tokens in semi-structured messages
using both character and context levels, which addresses these issues. It can then be
used for an unsupervised labeling of tokens, or be the basis for a
semi-supervised information extraction system.