1,807 research outputs found
A traffic classification method using machine learning algorithm
Applying concepts of attack investigation from the IT industry, this work designs a traffic
classification method that uses data mining techniques together with machine learning
algorithms to separate normal from malicious traffic. This classification will help in
learning about the unknown attacks faced by the IT industry. Traffic classification
is not a new concept; plenty of work has been done to classify network traffic for
heterogeneous applications. Existing techniques (payload-based, port-based,
and statistical) have their own pros and cons, which are discussed later in this
work, but classification using machine learning techniques is still an open field to explore and has provided very promising results so far.
Analysing similarity assessment in feature-vector case representations
Case-Based Reasoning (CBR) is a good technique for solving new problems based on previous experience. The main assumption in CBR is the hypothesis that similar problems should have similar solutions. CBR systems retrieve the most similar cases or experiences among those stored in the Case Base. The solutions previously given to these most similar past cases can then be adapted to fit new cases or problems in a particular domain, instead of deriving them from scratch. Thus, similarity measures are key elements in obtaining reliable similar cases, which will be used to derive solutions for new cases. This paper describes a comparative analysis of several commonly used similarity measures, including a measure previously developed by the authors, and a study of their performance in the CBR retrieval step for feature-vector case representations. The testing was done using sixteen data sets from the UCI Machine Learning Database Repository, plus two complex environmental databases. Postprint (published version)
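The retrieval step described in this abstract can be illustrated with a minimal sketch. It assumes numeric feature vectors and a similarity defined as one minus a range-normalized Euclidean distance; this is one common choice, not necessarily any of the measures compared in the paper. The names `similarity`, `retrieve`, `case_base`, and `ranges` are hypothetical.

```python
import math

def similarity(a, b, ranges):
    """Similarity in [0, 1]: one minus the range-normalized Euclidean distance.

    Assumes each attribute difference is at most its range, so every
    normalized difference lies in [-1, 1].
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        d = (x - y) / r if r else 0.0
        total += d * d
    return 1.0 - math.sqrt(total / len(a))

def retrieve(case_base, query, ranges, k=1):
    """Return the k stored (features, solution) cases most similar to the query."""
    ranked = sorted(
        case_base,
        key=lambda case: similarity(case[0], query, ranges),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical case base: each case pairs a feature vector with a stored solution.
case_base = [
    ((1.0, 10.0), "solution A"),
    ((5.0, 50.0), "solution B"),
    ((9.0, 90.0), "solution C"),
]
ranges = (8.0, 80.0)  # max - min of each attribute over the case base

best = retrieve(case_base, (4.0, 45.0), ranges, k=1)
print(best[0][1])
```

The retrieved solution would then be adapted to the new case; swapping in a different `similarity` function is the only change needed to compare measures in this setup.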
Towards Intelligent Assistance for a Data Mining Process
A data mining (DM) process involves multiple stages. A simple, but typical, process might include
preprocessing data, applying a data-mining algorithm, and postprocessing the mining results.
There are many possible choices for each stage, and only some combinations are valid.
Because of the large space and non-trivial interactions, both novices and data-mining specialists
need assistance in composing and selecting DM processes. Extending notions developed for
statistical expert systems we present a prototype Intelligent Discovery Assistant (IDA), which
provides users with (i) systematic enumerations of valid DM processes, in order that important,
potentially fruitful options are not overlooked, and (ii) effective rankings of these valid processes
by different criteria, to facilitate the choice of DM processes to execute. We use the prototype to
show that an IDA can indeed provide useful enumerations and effective rankings in the context
of simple classification processes. We discuss how an IDA could be an important tool for
knowledge sharing among a team of data miners. Finally, we illustrate the claims with a comprehensive
demonstration of cost-sensitive classification using a more involved process and data
from the 1998 KDDCUP competition. NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
A Decision tree-based attribute weighting filter for naive Bayes
The naive Bayes classifier continues to be a popular learning algorithm for data mining applications due to its simplicity and linear run-time. Many enhancements to the basic algorithm have been proposed to help mitigate its primary weakness: the assumption that attributes are independent given the class. All of them improve the performance of naive Bayes at the expense (to a greater or lesser degree) of execution time and/or simplicity of the final model. In this paper we present a simple filter method for setting attribute weights for use with naive Bayes. Experimental results show that naive Bayes with attribute weights rarely degrades the quality of the model compared to standard naive Bayes and, in many cases, improves it dramatically. The main advantages of this method compared to other approaches for improving naive Bayes are its
run-time complexity and the fact that it maintains the simplicity of the final model.
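To make the role of attribute weights concrete, here is a minimal sketch of a categorical naive Bayes in which each attribute's log-likelihood is multiplied by a weight. It assumes Laplace smoothing, and the weights are supplied by hand; the abstract's decision-tree-based procedure for deriving them is not reproduced here. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    values = defaultdict(set)            # attribute index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, label)][v] += 1
            values[i].add(v)
    return class_counts, value_counts, values

def predict(model, row, weights):
    """Pick the class maximizing log P(c) + sum_i w_i * log P(a_i | c)."""
    class_counts, value_counts, values = model
    n = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / n)
        for i, v in enumerate(row):
            # Laplace-smoothed conditional probability (an assumption of this sketch).
            p = (value_counts[(i, label)][v] + 1) / (count + len(values[i]))
            score += weights[i] * math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data: attribute 0 determines the class, attribute 1 is noise.
rows = [("sunny", "x"), ("sunny", "y"), ("rainy", "x"), ("rainy", "y")]
labels = ["play", "play", "stay", "stay"]
model = train(rows, labels)

# Hand-picked weights: keep attribute 0, switch attribute 1 off entirely.
print(predict(model, ("sunny", "x"), weights=(1.0, 0.0)))
```

Setting every weight to 1.0 recovers standard naive Bayes, and a weight of 0.0 removes an attribute, so the final model keeps the same simple form.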
Toward Intelligent Assistance for a Data Mining Process: An Ontology-Based Approach for Cost-Sensitive Classification
A data mining (DM) process involves multiple stages. A simple, but typical, process might include preprocessing data, applying a data mining algorithm, and postprocessing the mining results. There are many possible choices for each stage, and only some combinations are valid. Because of the large space and nontrivial interactions, both novices and data mining specialists need assistance in composing and selecting DM processes. Extending notions developed for statistical expert systems, we present a prototype intelligent discovery assistant (IDA), which provides users with 1) systematic enumerations of valid DM processes, in order that important, potentially fruitful options are not overlooked, and 2) effective rankings of these valid processes by different criteria, to facilitate the choice of DM processes to execute. We use the prototype to show that an IDA can indeed provide useful enumerations and effective rankings in the context of simple classification processes. We discuss how an IDA could be an important tool for knowledge sharing among a team of data miners. Finally, we illustrate the claims with a demonstration of cost-sensitive classification using a more complicated process and data from the 1998 KDDCUP competition.
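The two services named in this abstract, enumerating valid processes and ranking them by different criteria, can be sketched roughly as follows. This is not the IDA prototype: the operator catalogue, the validity rules, and all cost and accuracy figures are hypothetical placeholders chosen only to show the mechanics.

```python
from itertools import product

# Hypothetical operator catalogue. Names, costs, and accuracy scores are
# illustrative placeholders, not values from the paper.
PREPROCESS = [
    {"name": "none", "output": "mixed", "cost": 0, "acc": 0.00},
    {"name": "discretize", "output": "categorical", "cost": 1, "acc": -0.01},
]
ALGORITHMS = [
    {"name": "naive_bayes", "accepts": {"categorical"}, "cost": 1, "acc": 0.80},
    {"name": "decision_tree", "accepts": {"mixed", "categorical"}, "cost": 3, "acc": 0.85},
]
POSTPROCESS = [
    {"name": "none", "applies_to": None, "cost": 0, "acc": 0.00},
    {"name": "prune", "applies_to": {"decision_tree"}, "cost": 1, "acc": 0.02},
]

def is_valid(pre, algo, post):
    """A process is valid if each stage's preconditions are met by the previous one."""
    if pre["output"] not in algo["accepts"]:
        return False
    return post["applies_to"] is None or algo["name"] in post["applies_to"]

def enumerate_processes():
    """Systematically enumerate every valid three-stage process."""
    return [p for p in product(PREPROCESS, ALGORITHMS, POSTPROCESS) if is_valid(*p)]

def rank(processes, criterion, reverse=False):
    """Order processes by the summed per-step estimate for one criterion."""
    return sorted(
        processes,
        key=lambda p: sum(step[criterion] for step in p),
        reverse=reverse,
    )

def names(process):
    return tuple(step["name"] for step in process)

processes = enumerate_processes()
cheapest = rank(processes, "cost")[0]
most_accurate = rank(processes, "acc", reverse=True)[0]
print(len(processes), names(cheapest), names(most_accurate))
```

Of the eight combinations in this toy catalogue, five pass the validity check, and the two rankings pick different processes, which is the kind of trade-off a user would weigh before choosing what to execute.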
Intelligent Assistance for the Data Mining Process: An Ontology-based Approach
A data mining (DM) process involves multiple stages. A simple, but typical, process might include
preprocessing data, applying a data-mining algorithm, and postprocessing the mining results. There
are many possible choices for each stage, and only some combinations are valid. Because of the
large space and non-trivial interactions, both novices and data-mining specialists need assistance in
composing and selecting DM processes. We present the concept of Intelligent Discovery Assistants
(IDAs), which provide users with (i) systematic enumerations of valid DM processes, in order that
important, potentially fruitful options are not overlooked, and (ii) effective rankings of these valid
processes by different criteria, to facilitate the choice of DM processes to execute. We use a prototype
to show that an IDA can indeed provide useful enumerations and effective rankings. We discuss
how an IDA is an important tool for knowledge sharing among a team of data miners. Finally,
we illustrate all the claims with a comprehensive demonstration using a more involved process and
data from the 1998 KDDCUP competition. Information Systems Working Papers Series
Meta-Analysis of Vaterite Secondary Data Revealed the Synthesis Conditions for Polymorphic Control
Acknowledgements: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. Peer reviewed. Postprint.
Data Masking, Encryption, and their Effect on Classification Performance: Trade-offs Between Data Security and Utility
As data mining increasingly shapes organizational decision-making, the quality of its results must be questioned to ensure trust in the technology. Inaccuracies can mislead decision-makers and cause costly mistakes. With more data collected for analytical purposes, privacy is also a major concern. Data security policies and regulations are increasingly put in place to manage risks, but these policies and regulations often employ technologies that substitute and/or suppress sensitive details contained in the data sets being mined. Data masking and substitution and/or data encryption and suppression of sensitive attributes from data sets can limit access to important details. It is believed that the use of data masking and encryption can impact the quality of data mining results. This dissertation investigated and compared the causal effects of data masking and encryption on classification performance as a measure of the quality of knowledge discovery. A review of the literature found a gap in the body of knowledge, indicating that this problem had not been studied before in an experimental setting. The objective of this dissertation was to gain an understanding of the trade-offs between data security and utility in the field of analytics and data mining. The research used a nationally recognized cancer incidence database to show how masking and encryption of potentially sensitive demographic attributes, such as patients' marital status, race/ethnicity, origin, and year of birth, could have a statistically significant impact on the patients' predicted survival. Performance parameters measured by four different classifiers delivered sizable variations in the range of 9% to 10% between a control group, where the select attributes were untouched, and two experimental groups, where the attributes were substituted or suppressed to simulate the effects of the data protection techniques.
In practice, this corroborated the potential risk involved in basing medical treatment decisions on data mining applications where attributes in the data sets are masked or encrypted for patient privacy and security reasons.