Low-rank Label Propagation for Semi-supervised Learning with 100 Millions Samples
The success of semi-supervised learning crucially relies on scalability to
the huge amounts of unlabelled data needed to capture the underlying
manifold structure for better classification. Since computing pairwise
similarities between training data is prohibitively expensive for most kinds
of input data, there is currently no general, ready-to-use semi-supervised
learning method or tool available for learning with tens of millions or more
data points. In this paper, we adopt two low-rank label propagation
algorithms, GLNP (Global Linear Neighborhood Propagation) and Kernel
Nystr\"om Approximation, and implement parallelized versions of both,
accelerated with Nesterov's accelerated projected gradient descent, for
Big-data Label Propagation (BigLP).
The parallel algorithms were tested on five real datasets ranging from 7,000
to 10,000,000 samples in size and on a simulated dataset of 100,000,000
samples. In the experiments, the implementation scaled to datasets with
100,000,000 samples and hundreds of features, and the algorithms
significantly improved prediction accuracy when only a very small percentage
of the data was labeled. The results demonstrate that the BigLP
implementation is highly scalable to big data and effective in utilizing
unlabeled data for semi-supervised learning.
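The low-rank idea behind BigLP can be illustrated with a minimal NumPy sketch (not the authors' parallel implementation, and omitting the Nesterov acceleration): a Nystr\"om approximation replaces the full n x n similarity matrix with an n x m landmark block, so each propagation step costs O(nm). Parameter names and defaults here are illustrative assumptions.

```python
import numpy as np

def nystrom_label_propagation(X, y, labeled_mask, m=50, gamma=1.0,
                              alpha=0.9, n_iter=100, seed=0):
    """Label propagation with a Nystrom low-rank kernel approximation.

    Instead of materialising the full n x n RBF similarity matrix, only
    the n x m block against m random landmarks is computed, so each
    propagation step costs O(n * m) instead of O(n^2).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=min(m, n), replace=False)
    # RBF similarities between every point and the landmarks (n x m)
    sq = ((X[:, None, :] - X[idx][None, :, :]) ** 2).sum(-1)
    C = np.exp(-gamma * sq)
    W_mm_inv = np.linalg.pinv(C[idx])              # landmark-landmark block
    # One-hot matrix of the known labels; unlabeled rows stay zero
    classes = np.unique(y[labeled_mask])
    Y = np.zeros((n, classes.size))
    for k, c in enumerate(classes):
        Y[labeled_mask & (y == c), k] = 1.0
    # Approximate row sums of W = C @ W_mm_inv @ C.T for normalisation
    row_sums = (C @ (W_mm_inv @ C.sum(0))).clip(min=1e-12)
    F = Y.copy()
    for _ in range(n_iter):
        WF = C @ (W_mm_inv @ (C.T @ F))            # low-rank W @ F
        F = alpha * (WF / row_sums[:, None]) + (1 - alpha) * Y
    return classes[F.argmax(1)]
```

On two well-separated point clouds with a single labeled point per class, the unlabeled points inherit the label of their cluster's seed.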
Morpho-syntactic Lexicon Generation Using Graph-based Semi-supervised Learning
Morpho-syntactic lexicons provide information about the morphological and
syntactic roles of words in a language. Such lexicons are not available for all
languages and even when available, their coverage can be limited. We present a
graph-based semi-supervised learning method that uses the morphological,
syntactic and semantic relations between words to automatically construct wide
coverage lexicons from small seed sets. Our method is language-independent, and
we show that we can expand a 1000-word seed lexicon to more than 100 times its
size with high quality for 11 languages. In addition, the automatically created
lexicons provide features that improve performance in two downstream tasks:
morphological tagging and dependency parsing.
Comment: Transactions of the Association for Computational Linguistics (TACL) 201
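The propagation step can be sketched with a toy, dictionary-based version (illustrative only; the paper's actual graph construction and update rule are richer): each non-seed word repeatedly averages its neighbours' attribute scores, and attributes ending above a threshold enter the expanded lexicon.

```python
def expand_lexicon(edges, seed, n_iter=20, threshold=0.5):
    """Propagate attributes from a small seed lexicon over a word graph.

    `edges` maps each word to a list of its neighbours (related by
    morphological, syntactic or semantic links); `seed` maps a few words
    to their known attribute sets.  Seed entries stay clamped while every
    other word averages its neighbours' per-attribute scores.
    """
    attrs = {a for s in seed.values() for a in s}
    scores = {w: {a: float(a in seed.get(w, ())) for a in attrs}
              for w in edges}
    for _ in range(n_iter):
        new = {}
        for w, nbrs in edges.items():
            if w in seed or not nbrs:          # clamp seeds, skip isolated words
                new[w] = scores[w]
            else:
                new[w] = {a: sum(scores[v][a] for v in nbrs) / len(nbrs)
                          for a in attrs}
        scores = new
    return {w: {a for a, s in sc.items() if s >= threshold}
            for w, sc in scores.items()}
```

For example, seeding only one word of a three-word chain with an attribute eventually labels its neighbours with that attribute as well.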
Semi-Supervised Affective Meaning Lexicon Expansion Using Semantic and Distributed Word Representations
In this paper, we propose an extension to graph-based sentiment lexicon
induction methods by incorporating distributed and semantic word
representations in building the similarity graph to expand a three-dimensional
sentiment lexicon. We also implemented and evaluated the label propagation
using four different word representations and similarity metrics. Our
comprehensive evaluation of the four approaches was performed on a single data
set, demonstrating that all four methods can generate a significant number of
new sentiment assignments with high accuracy. The highest correlation
(Kendall's tau = 0.51) and the lowest error (mean absolute error < 1.1%),
obtained by combining the semantic and the distributional features,
outperformed the purely distributional and purely semantic label-propagation
models and approached the performance of a supervised algorithm.
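A minimal version of such label propagation for continuous three-dimensional sentiment scores might look like the following (a sketch assuming a precomputed symmetric word-similarity matrix; not the paper's exact update rule):

```python
import numpy as np

def propagate_scores(W, scores, labeled, alpha=0.85, n_iter=200):
    """Expand a three-dimensional sentiment lexicon by label propagation.

    W       : (n, n) symmetric word-similarity matrix
    scores  : (n, 3) sentiment dimensions; only rows where `labeled`
              is True are trusted seed values
    labeled : (n,) boolean mask of seed words
    """
    row_sums = W.sum(1).clip(min=1e-12)
    P = W / row_sums[:, None]                  # row-normalised transitions
    F = np.zeros_like(scores, dtype=float)
    F[labeled] = scores[labeled]
    Y = F.copy()
    for _ in range(n_iter):
        F = alpha * (P @ F) + (1 - alpha) * Y  # diffuse, with pull to seeds
        F[labeled] = scores[labeled]           # clamp seed words
    return F
```

On a four-word chain with opposite-polarity seeds at the ends, the two interior words receive intermediate scores of matching sign.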
Imitation Learning with Recurrent Neural Networks
We present a novel view that unifies two frameworks that aim to solve
sequential prediction problems: learning to search (L2S) and recurrent neural
networks (RNN). We point out equivalences between elements of the two
frameworks. By complementing each framework with what it lacks relative to
the other, we introduce a more advanced imitation learning framework that,
on one hand, augments L2S's notion of search space and, on the other,
enhances RNN training to be more robust to compounding errors arising from
training on highly correlated examples.
Comment: 5 pages
Recommender Systems with Random Walks: A Survey
Recommender engines have become an integral component in today's e-commerce
systems. From recommending books in Amazon to finding friends in social
networks such as Facebook, they have become omnipresent.
Generally, recommender systems can be classified into two main categories:
content based and collaborative filtering based models. Both these models build
relationships between users and items to provide recommendations. Content based
systems achieve this task by utilizing features extracted from the context
available, whereas collaborative systems use shared interests between user-item
subsets.
There is another relatively unexplored approach for providing recommendations
that utilizes a stochastic process named random walks. This study is a survey
exploring use cases of random walks in recommender systems and an attempt at
classifying them.
Comment: 15 pages, a survey paper
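As a concrete instance of the random-walk family, a personalised-PageRank recommender on the bipartite user-item graph can be sketched as follows (illustrative NumPy code; the parameter names and restart probability are assumptions, not taken from any surveyed system):

```python
import numpy as np

def random_walk_recommend(interactions, user, restart=0.15, n_iter=100):
    """Recommend one item via personalised PageRank on a bipartite graph.

    interactions : (n_users, n_items) binary user-item matrix
    user         : index of the user to recommend for
    """
    n_users, n_items = interactions.shape
    n = n_users + n_items
    # Bipartite adjacency: user nodes first, then item nodes
    A = np.zeros((n, n))
    A[:n_users, n_users:] = interactions
    A[n_users:, :n_users] = interactions.T
    P = A / A.sum(1, keepdims=True).clip(min=1e-12)
    r = np.zeros(n)
    r[user] = 1.0                              # restart distribution
    p = r.copy()
    for _ in range(n_iter):
        p = (1 - restart) * (P.T @ p) + restart * r
    item_scores = p[n_users:].copy()
    item_scores[interactions[user] > 0] = -np.inf  # hide seen items
    return int(item_scores.argmax())
```

Items the user has already interacted with are masked out, so the function returns the top-scoring remaining item, i.e. one reachable through users with shared interests.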
Search for carbon stars and DZ white dwarfs in SDSS spectra survey through machine learning
Carbon stars and DZ white dwarfs are two types of rare objects in the Galaxy.
In this paper, we have applied the label propagation algorithm to search for
these two types of stars from Data Release Eight (DR8) of the Sloan Digital Sky
Survey (SDSS), which is verified to be efficient by calculating precision and
recall. From nearly two million spectra including stars, galaxies and QSOs, we
have found 260 new carbon stars, of which 96 have been identified as dwarfs
and 7 as giants, along with 11 composite-spectrum systems (each consisting
of a white dwarf and a carbon star). Similarly, using the label
propagation method, we have obtained 29 new DZ white dwarfs from SDSS DR8.
Compared with PCA reconstructed spectra, the 29 findings are typical DZ white
dwarfs. We have also investigated their proper motions by comparing them
with the proper-motion distribution of 9,374 white dwarfs, and found that,
like the white dwarfs currently observed by SDSS, they generally have large
proper motions.
In addition, we have estimated their effective temperatures by fitting a
polynomial relation between effective temperature and the g-r color of known
DZ white dwarfs, and found that 12 of the 29 new DZ white dwarfs are cool:
nine are between 6000 K and 6600 K, and three are below 6000 K.
Comment: 5 figures, 6 tables
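The temperature-estimation step, fitting a polynomial relation between effective temperature and g-r colour from known DZ white dwarfs and evaluating it on new candidates, can be sketched like this (the calibration values in the usage example are synthetic and made up for illustration; the paper fits real SDSS photometry):

```python
import numpy as np

def estimate_teff(g_r_known, teff_known, g_r_new, degree=2):
    """Fit Teff as a polynomial in g-r colour and evaluate new candidates."""
    coeffs = np.polyfit(g_r_known, teff_known, degree)
    return np.polyval(coeffs, g_r_new)
```

Given a calibration sample, `estimate_teff` returns one temperature per candidate colour, which can then be binned to identify cool objects.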
Trends in Combating Image Spam E-mails
With the rapid adoption of Internet as an easy way to communicate, the amount
of unsolicited e-mails, known as spam e-mails, has been growing rapidly. The
major problem of spam e-mails is the loss of productivity and a drain on IT
resources. Today, spam arrives more frequently than legitimate e-mail.
Initially, spam e-mails contained only textual messages which were easily
detected by the text-based spam filters. To evade such detection, spammers came
up with a new, sophisticated technique called image spam: embedding the
advertisement text in images rather than in the body of the e-mail, so that
most spam filters fail to detect the contents. In this
paper, we examine the motivations and the challenges in image spam filtering
research, and we review the recent trends in combating image spam e-mails. The
review indicates that spamming is a business model and that spammers are
becoming more sophisticated in adapting to every challenge, thereby
defeating conventional spam filtering technologies. Image spam detection
techniques should therefore be scalable and adaptable enough to counter the
spammers' future tactics.
Pairwise Constraint Propagation on Multi-View Data
This paper presents a graph-based learning approach to pairwise constraint
propagation on multi-view data. Although pairwise constraint propagation has
been studied extensively, pairwise constraints are usually defined over pairs
of data points from a single view, i.e., only intra-view constraint propagation
is considered for multi-view tasks. In fact, very little attention has been
paid to inter-view constraint propagation, which is more challenging since
pairwise constraints are now defined over pairs of data points from different
views. In this paper, we propose to decompose the challenging inter-view
constraint propagation problem into semi-supervised learning subproblems so
that they can be efficiently solved based on graph-based label propagation. To
the best of our knowledge, this is the first attempt to give an efficient
solution to inter-view constraint propagation from a semi-supervised learning
viewpoint. Moreover, since graph-based label propagation has been adopted for
basic optimization, we develop two constrained graph construction methods
for inter-view constraint propagation, which differ only in how the
intra-view pairwise constraints are exploited. Experimental results on
cross-view retrieval show the promising performance of our inter-view
constraint propagation.
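The decomposition can be illustrated with a small sketch (an assumption-laden toy, not the paper's exact formulation): an inter-view constraint matrix Z (+1 must-link, -1 cannot-link, 0 unknown) is propagated first along the view-u graph and then along the view-v graph, each pass being an ordinary graph-based label propagation.

```python
import numpy as np

def interview_constraint_propagation(Z, Wu, Wv, alpha=0.5, n_iter=50):
    """Two-pass inter-view constraint propagation.

    Z  : (nu, nv) initial constraints between view-u and view-v points
    Wu : (nu, nu) affinity graph over view u
    Wv : (nv, nv) affinity graph over view v
    """
    def sym_norm(W):
        d = W.sum(1).clip(min=1e-12)
        return W / np.sqrt(np.outer(d, d))     # D^{-1/2} W D^{-1/2}
    Su, Sv = sym_norm(Wu), sym_norm(Wv)
    F = Z.astype(float)
    for _ in range(n_iter):                    # vertical pass over view u
        F = alpha * (Su @ F) + (1 - alpha) * Z
    G = F.copy()
    for _ in range(n_iter):                    # horizontal pass over view v
        G = alpha * (G @ Sv) + (1 - alpha) * F
    return G
```

With a single must-link seed and one neighbour in each view, the constraint spreads to the neighbouring pair with a smaller weight, as one would expect of diffusion.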
Cross-Graph Learning of Multi-Relational Associations
Cross-graph Relational Learning (CGRL) refers to the problem of predicting
the strengths or labels of multi-relational tuples of heterogeneous object
types, through the joint inference over multiple graphs which specify the
internal connections among each type of objects. CGRL is an open challenge in
machine learning due to the daunting number of possible tuples to deal with
when the numbers of nodes in the multiple graphs are large, and because
labeled training instances are typically extremely sparse. Existing methods
such as
tensor factorization or tensor-kernel machines do not work well because of the
lack of convex formulation for the optimization of CGRL models, the poor
scalability of the algorithms in handling combinatorial numbers of tuples,
and/or the non-transductive nature of the learning methods which limits their
ability to leverage unlabeled data in training. This paper proposes a novel
framework which formulates CGRL as a convex optimization problem, enables
transductive learning using both labeled and unlabeled tuples, and offers a
scalable algorithm that guarantees the optimal solution and enjoys a linear
time complexity with respect to the sizes of input graphs. In our experiments
with a subset of DBLP publication records and an Enzyme multi-source dataset,
the proposed method successfully scaled to the large cross-graph inference
problem and significantly outperformed other representative approaches.
SCSP: Spectral Clustering Filter Pruning with Soft Self-adaption Manners
Deep Convolutional Neural Networks (CNNs) have achieved significant success
in the computer vision field. However, the high computational cost of deep,
complex models prevents their deployment on edge devices with limited memory
and computational resources. In this paper, we propose a novel filter
pruning method for
convolutional neural networks compression, namely spectral clustering filter
pruning with soft self-adaption manners (SCSP). We first apply spectral
clustering to the filters layer by layer to explore their intrinsic
connections and keep only the efficient groups. Through soft self-adaption,
the pruning operations can be completed in a few epochs, letting the network
gradually choose meaningful groups. With this strategy, we not only achieve
model compression while maintaining considerable performance, but also find
a novel angle from which to interpret the model compression process.
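The first step, grouping a layer's filters by spectral clustering and keeping one representative per group, might be sketched as follows (NumPy-only, with two groups split via the sign of the Fiedler vector; the paper's actual clustering and soft-pruning schedule are more involved):

```python
import numpy as np

def prune_filters(filters, keep_per_group=1):
    """Group one layer's filters by spectral clustering and keep
    only the filter closest to each group's centroid.

    filters : (n_filters, in_channels, kh, kw) weight tensor
    """
    F = filters.reshape(filters.shape[0], -1)
    # Cosine-similarity graph between flattened filters
    Fn = F / np.linalg.norm(F, axis=1, keepdims=True).clip(min=1e-12)
    W = Fn @ Fn.T
    np.fill_diagonal(W, 0.0)
    W = np.clip(W, 0.0, None)
    d = W.sum(1).clip(min=1e-12)
    L = np.diag(d) - W                     # unnormalised graph Laplacian
    vals, vecs = np.linalg.eigh(L)
    fiedler = vecs[:, 1]                   # second-smallest eigenvector
    keep = []
    for grp in (fiedler >= 0, fiedler < 0):
        idx = np.where(grp)[0]
        if idx.size == 0:
            continue
        centroid = F[idx].mean(0)
        dist = np.linalg.norm(F[idx] - centroid, axis=1)
        keep.extend(idx[np.argsort(dist)[:keep_per_group]])
    return sorted(keep)
```

Real usage would zero or remove the non-kept filters and fine-tune for a few epochs, in the spirit of the soft self-adaption schedule.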