Dominance-based Rough Set Approach, basic ideas and main trends
Dominance-based Rough Set Approach (DRSA) has been proposed as a machine learning
and knowledge discovery methodology to handle Multiple Criteria Decision Aiding
(MCDA). Due to its capacity of asking the decision maker (DM) for simple
preference information and supplying easily understandable and explainable
recommendations, DRSA has gained much interest over the years and is now one
of the most appreciated MCDA approaches. In fact, it has also been applied
beyond the MCDA domain, as a general knowledge discovery and data mining
methodology for the analysis of monotonic (and also non-monotonic) data. In
this contribution, we recall the basic principles and the main concepts of
DRSA, with a general overview of its developments and software. We also present
a historical reconstruction of the genesis of the methodology, with a specific
focus on the contribution of Roman Słowiński.
Comment: This research was partially supported by TAILOR, a project funded by
the European Union (EU) Horizon 2020 research and innovation programme under GA
No 952215. This submission is a preprint of a book chapter accepted by
Springer, with very few minor differences of a purely technical nature.
An overview of decision table literature 1982-1995.
This report gives an overview of the literature on decision tables over the past 15 years. As far as possible, each reference is accompanied by an author-supplied abstract, a number of keywords and a classification. In some cases, our own comments are added. The purpose of these comments is to show where, how and why decision tables are used. The literature is classified according to application area, theoretical versus practical character, year of publication, country of origin (not necessarily country of publication) and the language of the document. After a description of the scope of the overview, classification results and the classification by topic are presented. The main body of the paper is the ordered list of publications with abstract, classification and comments.
Separate and conquer heuristic allows robust mining of contrast sets from various types of data
Identifying differences between groups is one of the most important knowledge
discovery problems. The procedure, also known as contrast sets mining, is
applied in a wide range of areas like medicine, industry, or economics. In the
paper we present RuleKit-CS, an algorithm for contrast set mining based on
sequential covering, a well-established heuristic for decision rule induction.
Multiple passes accompanied by an attribute penalization scheme allow
generating contrast sets that describe the same examples with different attributes,
unlike standard sequential covering. The ability to identify contrast sets
in regression and survival data sets, a feature not provided by existing
algorithms, further extends the usability of RuleKit-CS. Experiments on a wide
range of data sets confirmed RuleKit-CS to be a useful tool for discovering
differences between defined groups. The algorithm is part of the RuleKit
suite available at GitHub under GNU AGPL 3 licence
(https://github.com/adaa-polsl/RuleKit).
Keywords: Contrast sets, Sequential covering, Rule induction, Regression,
Survival, Knowledge discovery
Rule Induction on Data Sets with Set-Value Attributes
Data sets may contain instances for which several values of an attribute are possible; such attributes are described as set-value attributes. The established LEM2 algorithm does not handle data sets with set-value attributes. To solve this problem, a parallel approach was used during LEM2's execution to avoid preprocessing the data. Changing the creation of characteristic sets and attribute-value blocks to include all values for each case allows LEM2 to induce rules on data sets with set-value attributes. The ability to create a single local covering for set-value data sets increases the variety of data LEM2 can process.
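The change to attribute-value blocks described above can be illustrated with a minimal sketch. It assumes each case is stored as a mapping from attribute name to a set of values; the function name `attribute_value_blocks` and the toy data are hypothetical and are not taken from the paper, whose exact definitions of blocks and characteristic sets may differ.

```python
from collections import defaultdict

def attribute_value_blocks(cases):
    """Map each (attribute, value) pair to the set of case indices containing it.

    Each case is a dict from attribute name to a set of values. A case with
    several possible values for an attribute joins the block of every one of them.
    """
    blocks = defaultdict(set)
    for index, case in enumerate(cases):
        for attribute, values in case.items():
            for value in values:
                blocks[(attribute, value)].add(index)
    return dict(blocks)

# Hypothetical toy data: case 1 has a set-valued "colour" attribute.
cases = [
    {"colour": {"red"}, "size": {"small"}},
    {"colour": {"red", "blue"}, "size": {"large"}},
    {"colour": {"blue"}, "size": {"large"}},
]
blocks = attribute_value_blocks(cases)
```

In this sketch, case 1 appears in both the `("colour", "red")` and the `("colour", "blue")` block, so no preprocessing step is needed to split it into single-valued copies.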
Rough set and rule-based multicriteria decision aiding
The aim of multicriteria decision aiding is to give the decision maker a recommendation concerning a set of objects evaluated from multiple points of view called criteria. Since a rational decision maker acts with respect to his/her value system, in order to recommend the most-preferred decision, one must identify decision maker's preferences. In this paper, we focus on preference discovery from data concerning some past decisions of the decision maker. We consider the preference model in the form of a set of "if..., then..." decision rules discovered from the data by inductive learning. To structure the data prior to induction of rules, we use the Dominance-based Rough Set Approach (DRSA). DRSA is a methodology for reasoning about data, which handles ordinal evaluations of objects on considered criteria and monotonic relationships between these evaluations and the decision. We review applications of DRSA to a large variety of multicriteria decision problems
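As a rough illustration of how DRSA structures ordinal data before rule induction, the sketch below computes the lower approximation of an upward union of classes: the objects that certainly belong to class t or better, because every object dominating them is also in class t or better. It assumes all criteria are gain-type (larger is better) and classes are encoded as ordered integers; the function names and the data are hypothetical.

```python
def dominates(x, y):
    """True if x is at least as good as y on every criterion (gain-type criteria)."""
    return all(a >= b for a, b in zip(x, y))

def lower_approximation_upward(objects, classes, t):
    """Indices of objects that certainly belong to class t or better.

    An object qualifies if every object dominating it (itself included)
    is assigned to class t or better.
    """
    result = set()
    for i, x in enumerate(objects):
        dominating = [j for j, y in enumerate(objects) if dominates(y, x)]
        if all(classes[j] >= t for j in dominating):
            result.add(i)
    return result

# Hypothetical data: two criteria, classes 0 (bad) < 1 (medium) < 2 (good).
objects = [(3, 3), (2, 2), (2, 3), (1, 1)]
classes = [2, 2, 1, 0]
certain = lower_approximation_upward(objects, classes, 2)
```

Here object 1 is assigned to class 2 but is dominated by object 2, which is assigned to the worse class 1. This violates monotonicity, so object 1 is left out of the lower approximation and only object 0 remains.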
Dynamic Rule Covering Classification in Data Mining with Cyber Security Phishing Application
Data mining is the process of discovering useful patterns from datasets using intelligent techniques to help users make certain decisions. A typical data mining task is classification, which involves predicting a target variable, known as the class, in previously unseen data based on models learnt from an input dataset. Covering is a well-known classification approach that derives models with If-Then rules. Covering methods, such as PRISM, offer predictive performance that is competitive with other classical classification techniques such as greedy, decision tree and associative classification. Therefore, Covering models are appropriate decision-making tools and users favour them when carrying out decisions.
Despite the use of the Covering approach in data processing for different classification applications, it is also acknowledged that this approach suffers from the noticeable drawback of inducing massive numbers of rules, making the resulting model large and unmanageable by users. This issue is attributed to the way Covering techniques induce the rules: they keep adding items to the rule's body, despite the limited data coverage (the number of training instances that the rule classifies), until the rule has zero error. This excessive learning overfits the training dataset and also limits the applicability of Covering models in decision making, because managers normally prefer a summarised set of knowledge that they are able to control and comprehend rather than high-maintenance models. In practice, there should be a trade-off between the number of rules offered by a classification model and its predictive performance. Another issue associated with Covering models is the overlapping of training data among the rules, which happens when a rule's classified data are discarded during the rule discovery phase. Unfortunately, the impact of a rule's removed data on other potential rules is not considered by this approach. However, when training data linked with a rule are removed, both the frequency and the rank of other rules' items that appeared in the removed data change. The impacted rules should maintain their true rank and frequency in a dynamic manner during the rule discovery phase, rather than just keeping the initial frequency computed from the original input dataset.
In response to the aforementioned issues, a new dynamic learning technique based on Covering and rule induction, which we call Enhanced Dynamic Rule Induction (eDRI), is developed. eDRI has been implemented in Java and embedded in the WEKA machine learning tool. The developed algorithm incrementally discovers the rules using primarily frequency and rule strength thresholds. In practice, these thresholds limit the search space for both items and potential rules by discarding any with insufficient data representation as early as possible, resulting in an efficient training phase. More importantly, eDRI substantially cuts down the number of scans over the training examples by continuously updating potential rules' frequency and strength parameters in a dynamic manner whenever a rule is inserted into the classifier. In particular, for each derived rule, eDRI adjusts on the fly the frequencies and ranks of the remaining potential rules' items, specifically for those that appeared within the deleted training instances of the derived rule. This gives a more realistic model with minimal rule redundancy, and makes the process of rule induction efficient and dynamic rather than static. Moreover, the proposed technique minimises the classifier's number of rules at preliminary stages by stopping learning when any rule does not meet the rule strength threshold, thereby minimising overfitting and ensuring a manageable classifier. Lastly, the eDRI prediction procedure not only prioritises using the best ranked rule for class forecasting of test data but also restricts the use of the default class rule, and thus reduces the number of misclassifications.
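The general idea of covering with frequency and strength thresholds, with item counts kept current as covered instances are removed, can be sketched as follows. This is a simplified illustration and not the eDRI implementation: it assumes rule bodies of a single item, recounts items from scratch after each rule for brevity (whereas the abstract describes updating the counts dynamically to avoid repeated scans), and uses hypothetical names, thresholds and data throughout.

```python
from collections import Counter

def learn_rules(rows, labels, min_freq=2, min_strength=0.8):
    """Greedy single-item covering with frequency and strength thresholds.

    rows: list of dicts (attribute -> value); labels: parallel list of classes.
    Returns a list of ((attribute, value), predicted_class, strength) rules.
    """
    remaining = list(range(len(rows)))
    rules = []
    while remaining:
        # Count items on the remaining data only, so frequencies stay current.
        item_freq = Counter()
        item_class_freq = Counter()
        for i in remaining:
            for item in rows[i].items():
                item_freq[item] += 1
                item_class_freq[(item, labels[i])] += 1
        # Keep only candidates that meet both thresholds.
        candidates = [
            (count / item_freq[item], count, item, cls)
            for (item, cls), count in item_class_freq.items()
            if count >= min_freq and count / item_freq[item] >= min_strength
        ]
        if not candidates:
            break  # stop early instead of growing weak rules
        # Prefer the strongest rule; break ties by higher frequency.
        strength, _, item, cls = max(candidates, key=lambda c: (c[0], c[1]))
        rules.append((item, cls, strength))
        # Discard every instance covered by the new rule.
        remaining = [i for i in remaining if item not in rows[i].items()]
    return rules

# Hypothetical toy data loosely inspired by phishing features.
rows = [
    {"https": "no", "long_url": "yes"},
    {"https": "no", "long_url": "yes"},
    {"https": "no", "long_url": "no"},
    {"https": "no", "long_url": "yes"},
    {"https": "yes", "long_url": "no"},
    {"https": "yes", "long_url": "no"},
    {"https": "yes", "long_url": "yes"},
]
labels = ["phishing"] * 4 + ["legit"] * 3
rules = learn_rules(rows, labels)
```

On this toy data the loop stops after two rules, one per class, because every remaining instance has been covered. Raising `min_freq` or `min_strength` makes the loop stop earlier and yields a smaller rule set.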
The aforementioned improvements guarantee classification models of smaller size that do not overfit the training dataset, while maintaining their predictive performance. The eDRI-derived models particularly benefit users taking key business decisions, since they can provide a rich knowledge base to support their decision making. This is because these models are highly accurate, easy to understand, controllable and robust, i.e. flexible enough to be amended without drastic change. eDRI's applicability has been evaluated on the hard problem of phishing detection. Phishing normally involves creating a fake, well-designed website that closely resembles an existing trusted business website, aiming to trick users and illegally obtain their credentials, such as login information, in order to access their financial assets. The experimental results against large phishing datasets revealed that eDRI is highly useful as an anti-phishing tool, since it derived models of manageable size when compared with other traditional techniques without hindering the classification performance. Further evaluation results using several other classification datasets from different domains, obtained from the University of California Data Repository, have corroborated eDRI's competitive performance with respect to accuracy, size of the knowledge representation, training time and items space reduction. This makes the proposed technique not only efficient in inducing rules but also effective