8,596 research outputs found
An ontology enhanced parallel SVM for scalable spam filter training
This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2013 Elsevier B.V.Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart
Short Text Classification using Contextual Analysis
Peer reviewedPublisher PD
GENERATING KNOWLEDGE STRUCTURES FROM OPEN DATASETS' TAGS - AN APPROACH BASED ON FORMAL CONCEPT ANALYSIS
Under influence of data transparency initiatives, a variety of institutions have published a significant number of datasets. In most cases, data publishers take advantage of open data portals (ODPs) for making their datasets publicly available. To improve the datasets' discoverability, open data portals (ODPs) group open datasets into categories using various criteria like publishers, institutions, formats, and descriptions. For these purposes, portals take advantage of metadata accompanying datasets. However, a part of metadata may be missing, or may be incomplete or redundant. Each of these situations makes it difficult for users to find appropriate datasets and obtain the desired information. As the number of available datasets grows, this problem becomes easy to notice. This paper is focused on the first step towards decreasing this problem by implementing knowledge structures to be used in situations where a part of datasets' metadata is missing. In particular, we focus on developing knowledge structures capable of suggesting the best match for the category where an uncategorized dataset should belong to. Our approach relies on dataset descriptions provided by users within dataset tags. We take advantage of a formal concept analysis to reveal the shared conceptualization originating from the tags' usage by developing a concept lattice per each category of open datasets. Since tags represent free text metadata entered by users, in this paper we will present a method of optimizing their usage through means of semantic similarity measures based on natural language processing mechanisms. Finally, we will demonstrate the advantage of our proposal by comparing concept lattices generated using formal the concept analysis before and after the optimization process. The main experimental research results will show that our approach is capable of reducing the number of nodes within a lattice more than 40%
Recommended from our members
A framework for evaluating automatic indexing or classification in the context of retrieval
Tools for automatic subject assignment help deal with scale and sustainability in creating and enriching metadata, establishing more connections across and between resources and enhancing consistency. While some software vendors and experimental researchers claim the tools can replace manual subject indexing, hard scientific evidence of their performance in operating information environments is scarce. A major reason for this is that research is usually conducted in laboratory conditions, excluding the complexities of real-life systems and situations. The paper reviews and discusses issues with existing evaluation approaches such as problems of aboutness and relevance assessments, implying the need to use more than a single “gold standard” method when evaluating indexing and retrieval and proposes a comprehensive evaluation framework. The framework is informed by a systematic review of the literature on indexing, classification and approaches: evaluating indexing quality directly through assessment by an evaluator or through comparison with a gold standard; evaluating the quality of computer-assisted indexing directly in the context of an indexing workflow, and evaluating indexing quality indirectly through analyzing retrieval performance
- …