Low-rank Label Propagation for Semi-supervised Learning with 100 Millions Samples
The success of semi-supervised learning crucially relies on scalability to
the huge amounts of unlabelled data needed to capture the underlying
manifold structure for better classification. Since computing pairwise
similarities between training data is prohibitively expensive for most kinds
of input data, there is currently no general, ready-to-use semi-supervised
learning method or tool available for learning with tens of millions or more
data points. In this paper, we adopt two low-rank label propagation
algorithms, GLNP (Global Linear Neighborhood Propagation) and Kernel
Nystr\"om Approximation, and implement parallelized versions of both,
accelerated with Nesterov's accelerated projected gradient descent, for
Big-data Label Propagation (BigLP).
The parallel algorithms were tested on five real datasets ranging from 7,000
to 10,000,000 samples in size and on a simulated dataset of 100,000,000
samples. In the experiments, the implementation scaled to datasets with
100,000,000 samples and hundreds of features, and the algorithms
significantly improved prediction accuracy when only a very small percentage
of the data was labeled. The results demonstrate that the BigLP
implementation is highly scalable to big data and effective in utilizing
unlabeled data for semi-supervised learning.
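The low-rank idea behind BigLP can be illustrated with a minimal NumPy sketch (not the authors' parallel implementation, and omitting the Nesterov acceleration): a Nystr\"om approximation replaces the full n x n similarity matrix with an n x m landmark block, so each propagation step costs O(nm). Parameter names and defaults here are illustrative assumptions.

```python
import numpy as np

def nystrom_label_propagation(X, y, labeled_mask, m=50, gamma=1.0,
                              alpha=0.9, n_iter=100, seed=0):
    """Label propagation with a Nystrom low-rank kernel approximation.

    Instead of materialising the full n x n RBF similarity matrix, only
    the n x m block against m random landmarks is computed, so each
    propagation step costs O(n * m) instead of O(n^2).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=min(m, n), replace=False)
    # RBF similarities between every point and the landmarks (n x m)
    sq = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1)
    C = np.exp(-gamma * sq)
    W_mm_inv = np.linalg.pinv(C[idx])              # landmark-landmark block
    # One-hot matrix of the known labels; unlabeled rows stay zero
    classes = np.unique(y[labeled_mask])
    Y = np.zeros((n, classes.size))
    for k, c in enumerate(classes):
        Y[labeled_mask & (y == c), k] = 1.0
    # Approximate row sums of W = C @ W_mm_inv @ C.T for normalisation
    row_sums = (C @ (W_mm_inv @ C.sum(0))).clip(min=1e-12)
    F = Y.copy()
    for _ in range(n_iter):
        WF = C @ (W_mm_inv @ (C.T @ F))            # low-rank W @ F
        F = alpha * (WF / row_sums[:, None]) + (1 - alpha) * Y
    return classes[F.argmax(1)]
```

On two well-separated point clouds with a single labeled point per class, the unlabeled points inherit the label of their cluster's seed.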
Morpho-syntactic Lexicon Generation Using Graph-based Semi-supervised Learning
Morpho-syntactic lexicons provide information about the morphological and
syntactic roles of words in a language. Such lexicons are not available for all
languages and even when available, their coverage can be limited. We present a
graph-based semi-supervised learning method that uses the morphological,
syntactic and semantic relations between words to automatically construct wide
coverage lexicons from small seed sets. Our method is language-independent, and
we show that we can expand a 1000-word seed lexicon to more than 100 times its
size with high quality for 11 languages. In addition, the automatically created
lexicons provide features that improve performance in two downstream tasks:
morphological tagging and dependency parsing.
Comment: Transactions of the Association for Computational Linguistics (TACL) 201
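The propagation step can be sketched with a toy, dictionary-based version (illustrative only; the paper's actual graph construction and update rule are richer): each non-seed word repeatedly averages its neighbours' attribute scores, and attributes ending above a threshold enter the expanded lexicon.

```python
def expand_lexicon(edges, seed, n_iter=20, threshold=0.5):
    """Propagate attributes from a small seed lexicon over a word graph.

    `edges` maps each word to a list of its neighbours (related by
    morphological, syntactic or semantic links); `seed` maps a few words
    to their known attribute sets.  Seed entries stay clamped while every
    other word averages its neighbours' per-attribute scores.
    """
    attrs = {a for s in seed.values() for a in s}
    scores = {w: {a: float(a in seed.get(w, ())) for a in attrs}
              for w in edges}
    for _ in range(n_iter):
        new = {}
        for w, nbrs in edges.items():
            if w in seed or not nbrs:          # clamp seeds, skip isolated words
                new[w] = scores[w]
            else:
                new[w] = {a: sum(scores[v][a] for v in nbrs) / len(nbrs)
                          for a in attrs}
        scores = new
    return {w: {a for a, s in sc.items() if s >= threshold}
            for w, sc in scores.items()}
```

For example, seeding only one word of a three-word chain with an attribute eventually labels its neighbours with that attribute as well.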
Semi-Supervised Affective Meaning Lexicon Expansion Using Semantic and Distributed Word Representations
In this paper, we propose an extension to graph-based sentiment lexicon
induction methods by incorporating distributed and semantic word
representations in building the similarity graph to expand a three-dimensional
sentiment lexicon. We also implemented and evaluated the label propagation
using four different word representations and similarity metrics. Our
comprehensive evaluation of the four approaches was performed on a single data
set, demonstrating that all four methods can generate a significant number of
new sentiment assignments with high accuracy. The highest correlation
(Kendall's tau = 0.51) and the lowest error (mean absolute error < 1.1%),
obtained by combining the semantic and the distributional features,
outperformed the purely distributional and purely semantic label-propagation
models and approached the performance of a supervised algorithm.
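A minimal version of such label propagation for continuous three-dimensional sentiment scores might look like the following (a sketch assuming a precomputed symmetric word-similarity matrix; not the paper's exact update rule):

```python
import numpy as np

def propagate_scores(W, scores, labeled, alpha=0.85, n_iter=200):
    """Expand a three-dimensional sentiment lexicon by label propagation.

    W       : (n, n) symmetric word-similarity matrix
    scores  : (n, 3) sentiment dimensions; only rows where `labeled`
              is True are trusted seed values
    labeled : (n,) boolean mask of seed words
    """
    row_sums = W.sum(1).clip(min=1e-12)
    P = W / row_sums[:, None]                  # row-normalised transitions
    F = np.zeros_like(scores, dtype=float)
    F[labeled] = scores[labeled]
    Y = F.copy()
    for _ in range(n_iter):
        F = alpha * (P @ F) + (1 - alpha) * Y  # diffuse, with pull to seeds
        F[labeled] = scores[labeled]           # clamp seed words
    return F
```

On a four-word chain with opposite-polarity seeds at the ends, the two interior words receive intermediate scores of matching sign.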
Imitation Learning with Recurrent Neural Networks
We present a novel view that unifies two frameworks that aim to solve
sequential prediction problems: learning to search (L2S) and recurrent neural
networks (RNN). We point out equivalences between elements of the two
frameworks. By complementing each framework with what it lacks relative to
the other, we introduce a more advanced imitation learning framework that,
on one hand, augments L2S's notion of search space and, on the other,
enhances RNN training to be more robust to compounding errors arising from
training on highly correlated examples.
Comment: 5 pages
Recommender Systems with Random Walks: A Survey
Recommender engines have become an integral component in today's e-commerce
systems. From recommending books in Amazon to finding friends in social
networks such as Facebook, they have become omnipresent.
Generally, recommender systems can be classified into two main categories:
content based and collaborative filtering based models. Both these models build
relationships between users and items to provide recommendations. Content based
systems achieve this task by utilizing features extracted from the context
available, whereas collaborative systems use shared interests between user-item
subsets.
There is another relatively unexplored approach for providing recommendations
that utilizes a stochastic process named random walks. This study is a survey
exploring use cases of random walks in recommender systems and an attempt at
classifying them.
Comment: 15 pages, a survey paper
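As a concrete instance of the random-walk family, a personalised-PageRank recommender on the bipartite user-item graph can be sketched as follows (illustrative NumPy code; the parameter names and restart probability are assumptions, not taken from any surveyed system):

```python
import numpy as np

def random_walk_recommend(interactions, user, restart=0.15, n_iter=100):
    """Recommend one item via personalised PageRank on a bipartite graph.

    interactions : (n_users, n_items) binary user-item matrix
    user         : index of the user to recommend for
    """
    n_users, n_items = interactions.shape
    n = n_users + n_items
    # Bipartite adjacency: user nodes first, then item nodes
    A = np.zeros((n, n))
    A[:n_users, n_users:] = interactions
    A[n_users:, :n_users] = interactions.T
    P = A / A.sum(1, keepdims=True).clip(min=1e-12)
    r = np.zeros(n)
    r[user] = 1.0                              # restart distribution
    p = r.copy()
    for _ in range(n_iter):
        p = (1 - restart) * (P.T @ p) + restart * r
    item_scores = p[n_users:].copy()
    item_scores[interactions[user] > 0] = -np.inf  # hide seen items
    return int(item_scores.argmax())
```

Items the user has already interacted with are masked out, so the function returns the top-scoring remaining item, i.e. one reachable through users with shared interests.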
Search for carbon stars and DZ white dwarfs in SDSS spectra survey through machine learning
Carbon stars and DZ white dwarfs are two types of rare objects in the Galaxy.
In this paper, we have applied the label propagation algorithm to search for
these two types of stars from Data Release Eight (DR8) of the Sloan Digital Sky
Survey (SDSS), which is verified to be efficient by calculating precision and
recall. From nearly two million spectra including stars, galaxies and QSOs, we
have found 260 new carbon stars, of which 96 have been identified as dwarfs
and 7 as giants, along with 11 composite-spectrum systems (each consisting
of a white dwarf and a carbon star). Similarly, using the label
propagation method, we have obtained 29 new DZ white dwarfs from SDSS DR8.
Compared with PCA reconstructed spectra, the 29 findings are typical DZ white
dwarfs. We have also investigated their proper motions by comparing them
with the proper-motion distribution of 9,374 white dwarfs, and found that,
like the white dwarfs currently observed by SDSS, they generally have large
proper motions.
In addition, we have estimated their effective temperatures by fitting a
polynomial relation between effective temperature and the g-r color of known
DZ white dwarfs, and found that 12 of the 29 new DZ white dwarfs are cool:
nine are between 6000 K and 6600 K, and three are below 6000 K.
Comment: 5 figures, 6 tables
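The temperature-estimation step, fitting a polynomial relation between effective temperature and g-r colour from known DZ white dwarfs and evaluating it on new candidates, can be sketched like this (the calibration values in the usage example are synthetic and made up for illustration; the paper fits real SDSS photometry):

```python
import numpy as np

def estimate_teff(g_r_known, teff_known, g_r_new, degree=2):
    """Fit Teff as a polynomial in g-r colour and evaluate new candidates."""
    coeffs = np.polyfit(g_r_known, teff_known, degree)
    return np.polyval(coeffs, g_r_new)
```

Given a calibration sample, `estimate_teff` returns one temperature per candidate colour, which can then be binned to identify cool objects.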
Trends in Combating Image Spam E-mails
With the rapid adoption of Internet as an easy way to communicate, the amount
of unsolicited e-mails, known as spam e-mails, has been growing rapidly. The
major problem of spam e-mails is the loss of productivity and a drain on IT
resources. Today, spam arrives more frequently than legitimate e-mail.
Initially, spam e-mails contained only textual messages which were easily
detected by the text-based spam filters. To evade such detection, spammers came
up with a new, sophisticated technique called image spam: embedding the
advertisement text in images rather than in the body of the e-mail, so that
most spam filters fail to detect the contents. In this
paper, we examine the motivations and the challenges in image spam filtering
research, and we review the recent trends in combating image spam e-mails. The
review indicates that spamming is a business model and that spammers are
becoming more sophisticated in adapting to every challenge, thereby
defeating conventional spam filtering technologies. Image spam detection
techniques should therefore be scalable and adaptable enough to counter the
spammers' future tactics.
Pairwise Constraint Propagation on Multi-View Data
This paper presents a graph-based learning approach to pairwise constraint
propagation on multi-view data. Although pairwise constraint propagation has
been studied extensively, pairwise constraints are usually defined over pairs
of data points from a single view, i.e., only intra-view constraint propagation
is considered for multi-view tasks. In fact, very little attention has been
paid to inter-view constraint propagation, which is more challenging since
pairwise constraints are now defined over pairs of data points from different
views. In this paper, we propose to decompose the challenging inter-view
constraint propagation problem into semi-supervised learning subproblems so
that they can be efficiently solved based on graph-based label propagation. To
the best of our knowledge, this is the first attempt to give an efficient
solution to inter-view constraint propagation from a semi-supervised learning
viewpoint. Moreover, since graph-based label propagation has been adopted for
basic optimization, we develop two constrained graph construction methods
for inter-view constraint propagation, which differ only in how the
intra-view pairwise constraints are exploited. Experimental results on
cross-view retrieval show the promising performance of our inter-view
constraint propagation.
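The decomposition can be illustrated with a small sketch (an assumption-laden toy, not the paper's exact formulation): an inter-view constraint matrix Z (+1 must-link, -1 cannot-link, 0 unknown) is propagated first along the view-u graph and then along the view-v graph, each pass being an ordinary graph-based label propagation.

```python
import numpy as np

def interview_constraint_propagation(Z, Wu, Wv, alpha=0.5, n_iter=50):
    """Two-pass inter-view constraint propagation.

    Z  : (nu, nv) initial constraints between view-u and view-v points
    Wu : (nu, nu) affinity graph over view u
    Wv : (nv, nv) affinity graph over view v
    """
    def sym_norm(W):
        d = W.sum(1).clip(min=1e-12)
        return W / np.sqrt(np.outer(d, d))     # D^{-1/2} W D^{-1/2}
    Su, Sv = sym_norm(Wu), sym_norm(Wv)
    F = Z.astype(float)
    for _ in range(n_iter):                    # vertical pass over view u
        F = alpha * (Su @ F) + (1 - alpha) * Z
    G = F.copy()
    for _ in range(n_iter):                    # horizontal pass over view v
        G = alpha * (G @ Sv) + (1 - alpha) * F
    return G
```

With a single must-link seed and one neighbour in each view, the constraint spreads to the neighbouring pair with a smaller weight, as one would expect of diffusion.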
Cross-Graph Learning of Multi-Relational Associations
Cross-graph Relational Learning (CGRL) refers to the problem of predicting
the strengths or labels of multi-relational tuples of heterogeneous object
types, through the joint inference over multiple graphs which specify the
internal connections among each type of objects. CGRL is an open challenge in
machine learning due to the daunting number of possible tuples to deal with
when the numbers of nodes in the multiple graphs are large, and because
labeled training instances are typically extremely sparse. Existing methods
such as
tensor factorization or tensor-kernel machines do not work well because of the
lack of convex formulation for the optimization of CGRL models, the poor
scalability of the algorithms in handling combinatorial numbers of tuples,
and/or the non-transductive nature of the learning methods which limits their
ability to leverage unlabeled data in training. This paper proposes a novel
framework which formulates CGRL as a convex optimization problem, enables
transductive learning using both labeled and unlabeled tuples, and offers a
scalable algorithm that guarantees the optimal solution and enjoys a linear
time complexity with respect to the sizes of input graphs. In our experiments
with a subset of DBLP publication records and an Enzyme multi-source dataset,
the proposed method successfully scaled to the large cross-graph inference
problem and significantly outperformed other representative approaches.
SCSP: Spectral Clustering Filter Pruning with Soft Self-adaption Manners
Deep Convolutional Neural Networks (CNNs) have achieved significant success
in the computer vision field. However, the high computational cost of deep,
complex models prevents their deployment on edge devices with limited memory
and computational resources. In this paper, we propose a novel filter
pruning method for
convolutional neural networks compression, namely spectral clustering filter
pruning with soft self-adaption manners (SCSP). We first apply spectral
clustering to the filters layer by layer to explore their intrinsic
connections and keep only the efficient groups. Through soft self-adaption,
the pruning operations can be completed in a few epochs, letting the network
gradually choose meaningful groups. With this strategy, we not only achieve
model compression while maintaining considerable performance, but also find
a novel angle from which to interpret the model compression process.
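The first step, grouping a layer's filters by spectral clustering and keeping one representative per group, might be sketched as follows (NumPy-only, with two groups split via the sign of the Fiedler vector; the paper's actual clustering and soft-pruning schedule are more involved):

```python
import numpy as np

def prune_filters(filters, keep_per_group=1):
    """Group one layer's filters by spectral clustering and keep
    only the filter closest to each group's centroid.

    filters : (n_filters, in_channels, kh, kw) weight tensor
    """
    F = filters.reshape(filters.shape[0], -1)
    # Cosine-similarity graph between flattened filters
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True).clip(min=1e-12)
    W = Fn @ Fn.T
    np.fill_diagonal(W, 0.0)
    W = np.clip(W, 0.0, None)
    d = W.sum(1).clip(min=1e-12)
    L = np.diag(d) - W                     # unnormalised graph Laplacian
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                   # second-smallest eigenvector
    keep = []
    for grp in (fiedler >= 0, fiedler < 0):
        idx = np.where(grp)[0]
        if idx.size == 0:
            continue
        centroid = F[idx].mean(0)
        dist = np.linalg.norm(F[idx] - centroid, axis=1)
        keep.extend(idx[np.argsort(dist)[:keep_per_group]])
    return sorted(keep)
```

Real usage would zero or remove the non-kept filters and fine-tune for a few epochs, in the spirit of the soft self-adaption schedule.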