4,235 research outputs found
An ontology enhanced parallel SVM for scalable spam filter training
This is the post-print version of the final paper published in Neurocomputing. The published article is available from the link below. Changes resulting from the publishing process, such as peer review, editing, corrections, structural formatting, and other quality control mechanisms may not be reflected in this document. Changes may have been made to this work since it was submitted for publication. Copyright @ 2013 Elsevier B.V.Spam, under a variety of shapes and forms, continues to inflict increased damage. Varying approaches including Support Vector Machine (SVM) techniques have been proposed for spam filter training and classification. However, SVM training is a computationally intensive process. This paper presents a MapReduce based parallel SVM algorithm for scalable spam filter training. By distributing, processing and optimizing the subsets of the training data across multiple participating computer nodes, the parallel SVM reduces the training time significantly. Ontology semantics are employed to minimize the impact of accuracy degradation when distributing the training data among a number of SVM classifiers. Experimental results show that ontology based augmentation improves the accuracy level of the parallel SVM beyond the original sequential counterpart
Content based image retrieval using unclean positive examples
Conventional content-based image retrieval (CBIR) schemes employing relevance feedback may suffer from some problems in the practical applications. First, most ordinary users would like to complete their search in a single interaction especially on the web. Second, it is time consuming and difficult to label a lot of negative examples with sufficient variety. Third, ordinary users may introduce some noisy examples into the query. This correspondence explores solutions to a new issue that image retrieval using unclean positive examples. In the proposed scheme, multiple feature distances are combined to obtain image similarity using classification technology. To handle the noisy positive examples, a new two-step strategy is proposed by incorporating the methods of data cleaning and noise tolerant classifier. The extensive experiments carried out on two different real image collections validate the effectiveness of the proposed scheme.<br /
Dynamic Data Mining: Methodology and Algorithms
Supervised data stream mining has become an important and challenging data mining task in modern
organizations. The key challenges are threefold: (1) a possibly infinite number of streaming examples
and time-critical analysis constraints; (2) concept drift; and (3) skewed data distributions.
To address these three challenges, this thesis proposes the novel dynamic data mining (DDM)
methodology by effectively applying supervised ensemble models to data stream mining. DDM can be
loosely defined as categorization-organization-selection of supervised ensemble models. It is inspired
by the idea that although the underlying concepts in a data stream are time-varying, their distinctions
can be identified. Therefore, the models trained on the distinct concepts can be dynamically selected in
order to classify incoming examples of similar concepts.
First, following the general paradigm of DDM, we examine the different concept-drifting stream
mining scenarios and propose corresponding effective and efficient data mining algorithms.
• To address concept drift caused merely by changes of variable distributions, which we term
pseudo concept drift, base models built on categorized streaming data are organized and
selected in line with their corresponding variable distribution characteristics.
• To address concept drift caused by changes of variable and class joint distributions, which we
term true concept drift, an effective data categorization scheme is introduced. A group of
working models is dynamically organized and selected for reacting to the drifting concept.
Secondly, we introduce an integration stream mining framework, enabling the paradigm advocated by
DDM to be widely applicable for other stream mining problems. Therefore, we are able to introduce
easily six effective algorithms for mining data streams with skewed class distributions.
In addition, we also introduce a new ensemble model approach for batch learning, following the same
methodology. Both theoretical and empirical studies demonstrate its effectiveness.
Future work would be targeted at improving the effectiveness and efficiency of the proposed
algorithms. Meantime, we would explore the possibilities of using the integration framework to solve
other open stream mining research problems
A Comparative Study of the Application of Different Learning Techniques to Natural Language Interfaces
In this paper we present first results from a comparative study. Its aim is
to test the feasibility of different inductive learning techniques to perform
the automatic acquisition of linguistic knowledge within a natural language
database interface. In our interface architecture the machine learning module
replaces an elaborate semantic analysis component. The learning module learns
the correct mapping of a user's input to the corresponding database command
based on a collection of past input data. We use an existing interface to a
production planning and control system as evaluation and compare the results
achieved by different instance-based and model-based learning algorithms.Comment: 10 pages, to appear CoNLL9
- …