6,448 research outputs found
An ontology enhanced parallel SVM for scalable spam filter training
This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2013 Elsevier B.V.Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart
Hash Embeddings for Efficient Word Representations
We present hash embeddings, an efficient method for representing words in a
continuous vector form. A hash embedding may be seen as an interpolation
between a standard word embedding and a word embedding created using a random
hash function (the hashing trick). In hash embeddings each token is represented
by -dimensional embeddings vectors and one dimensional weight
vector. The final dimensional representation of the token is the product of
the two. Rather than fitting the embedding vectors for each token these are
selected by the hashing trick from a shared pool of embedding vectors. Our
experiments show that hash embeddings can easily deal with huge vocabularies
consisting of millions of tokens. When using a hash embedding there is no need
to create a dictionary before training nor to perform any kind of vocabulary
pruning after training. We show that models trained using hash embeddings
exhibit at least the same level of performance as models trained using regular
embeddings across a wide range of tasks. Furthermore, the number of parameters
needed by such an embedding is only a fraction of what is required by a regular
embedding. Since standard embeddings and embeddings constructed using the
hashing trick are actually just special cases of a hash embedding, hash
embeddings can be considered an extension and improvement over the existing
regular embedding types
Parallel Processing of Large Graphs
More and more large data collections are gathered worldwide in various IT
systems. Many of them possess the networked nature and need to be processed and
analysed as graph structures. Due to their size they require very often usage
of parallel paradigm for efficient computation. Three parallel techniques have
been compared in the paper: MapReduce, its map-side join extension and Bulk
Synchronous Parallel (BSP). They are implemented for two different graph
problems: calculation of single source shortest paths (SSSP) and collective
classification of graph nodes by means of relational influence propagation
(RIP). The methods and algorithms are applied to several network datasets
differing in size and structural profile, originating from three domains:
telecommunication, multimedia and microblog. The results revealed that
iterative graph processing with the BSP implementation always and
significantly, even up to 10 times outperforms MapReduce, especially for
algorithms with many iterations and sparse communication. Also MapReduce
extension based on map-side join usually noticeably presents better efficiency,
although not as much as BSP. Nevertheless, MapReduce still remains the good
alternative for enormous networks, whose data structures do not fit in local
memories.Comment: Preprint submitted to Future Generation Computer System
- …