1,521 research outputs found

    Incremental Perspective for Feature Selection Based on Fuzzy Rough Sets

    Active Sample Selection Based Incremental Algorithm for Attribute Reduction with Rough Sets

    Attribute reduction with rough sets is an effective technique for obtaining a compact and informative attribute set from a given dataset. However, traditional algorithms make no explicit provision for dynamic datasets in which data arrive as successive samples. Incremental algorithms for attribute reduction with rough sets have recently been introduced to handle dynamic datasets with large samples, but they have high time and space complexity. To address this complexity issue, this paper presents a novel incremental algorithm for attribute reduction with rough sets, based on an active sample selection process and an insight into the attribute reduction process. The algorithm first uses the active sample selection process to decide whether each incoming sample is useful with respect to the current dataset. A useless sample is discarded, while a useful sample is selected to update the reduct. On the arrival of a useful sample, the attribute reduction process then guides how attributes are added to and/or deleted from the current reduct. These two processes constitute the theoretical framework of the algorithm. Finally, experiments show that the proposed algorithm is efficient in time and space.
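The abstract above does not define when a sample counts as "useful", so the following Python sketch rests on an assumption: a sample is treated as useful when, once it is added, the current reduct no longer yields the same positive region as the full attribute set. The names `positive_region_size` and `is_useful` are hypothetical and are not taken from the paper, and the test recomputes everything from scratch, so it illustrates the selection idea rather than the paper's efficient procedure.

```python
from collections import defaultdict

def positive_region_size(samples, attrs):
    """Count samples whose equivalence class under `attrs` carries one decision.

    `samples` is a list of (condition_dict, decision) pairs.
    """
    classes = defaultdict(list)
    for conditions, decision in samples:
        key = tuple(conditions[a] for a in attrs)
        classes[key].append(decision)
    return sum(len(ds) for ds in classes.values() if len(set(ds)) == 1)

def is_useful(samples, reduct, all_attrs, new_sample):
    """Assumed usefulness test (not the paper's definition).

    The sample is useful if, after adding it, the reduct stops preserving
    the positive region obtained with all condition attributes.
    """
    extended = samples + [new_sample]
    return (positive_region_size(extended, reduct)
            != positive_region_size(extended, all_attrs))

# Toy data: attribute "a" alone separates the two decisions.
data = [({"a": 0, "b": 0}, "no"), ({"a": 1, "b": 0}, "yes")]
redundant = ({"a": 0, "b": 0}, "no")      # agrees with what the reduct predicts
conflicting = ({"a": 0, "b": 1}, "yes")   # needs "b" to stay consistent
print(is_useful(data, ["a"], ["a", "b"], redundant))
print(is_useful(data, ["a"], ["a", "b"], conflicting))
```

Under this assumption, only the second sample would trigger an update of the reduct; the first would be discarded.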

    An Intelligent Decision Support System for Business IT Security Strategy

    Cyber threat intelligence (CTI) is an emerging approach to improving the cyber security of business IT environments. It carries information about an affected business IT context. CTI sharing tools are available for subscribers, and CTI feeds are increasingly available. If another business IT context is similar to a CTI feed context, the threat described in the CTI feed might also take place in that business IT context. Businesses can take proactive defensive actions if relevant CTI is identified. However, a challenge is how to develop an effective connection strategy for CTI onto business IT contexts. Businesses are still insufficiently using CTI because not all of them have sufficient knowledge from domain experts. Moreover, business IT contexts vary over time. When the business IT contextual states have changed, the relevant CTI might no longer be appropriate and applicable. Another challenge is how a connection strategy can adapt to the business IT contextual changes. To fill the gap, in this Ph.D project, a dynamic connection strategy for CTI onto business IT contexts is proposed and the strategy is instantiated as a dynamic connection rule assembly system. The system can identify relevant CTI for a business IT context and can modify its internal configurations and structures to adapt to the business IT contextual changes. This thesis introduces the system development phases from design to delivery, and the contributions to knowledge are explained as follows. A hybrid representation of the dynamic connection strategy is proposed to generalise and interpret the problem domain and the system development. The representation uses selected computational intelligence models and software development models. In terms of the computational intelligence models, a CTI feed context and a business IT context are generalised to be the same type, i.e., context object. A grey number model is selected to represent the attribute values of context objects. 
Fuzzy sets are used to represent the context objects, and linguistic densities of the attribute values of context objects are reasoned. To assemble applicable connection knowledge, the system constructs a set of connection objects based on the context objects and uses rough set operations to extract applicable connection objects that contain the connection knowledge. Furthermore, to adapt to contextual changes, a rough set based incremental updating approach with multiple operations is developed to incrementally update the approximations. A set of propositions is put forward to describe how the system changes based on its previous states and internal structures, and their complexities and efficiencies are analysed. In terms of the software development models, some unified modelling language (UML) models are selected to represent the system in the design phase. An activity diagram is used to represent the business process of the system. A use case diagram is used to represent the human interactions with the system. A class diagram is used to represent the internal components and the relationships between them. Using the representation, developers can develop a prototype of the system rapidly. Using the representation, an application of the system is developed with mainstream software development techniques. A RESTful software architecture is used to communicate the business IT contextual information and the CTI-based analysis results between the server and the clients. A script-based method is deployed in the clients to collect the contextual information. The observer pattern and a timer are used for the design and development of the monitor-trigger mechanism. In summary, the representation generalises real-world cases in the problem domain and interprets the system data. 
A specific business can initialise an instance of the representation as a specific system based on its IT context and CTI feeds, and the knowledge assembled by the system can be used to identify relevant CTI feeds. From the relevant CTI data, the system locates and retrieves the useful information that can inform security decisions and then sends it to the client users. When the system needs to modify itself to adapt to the business IT contextual changes, it can invoke the corresponding incremental updating functions and avoid a time-consuming re-computation. With this updating strategy, the application can provide its client-side users with timely support and useful information that can inform security decisions using CTI.
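The rough set approximations that the thesis updates incrementally can be illustrated with the standard textbook definitions. The Python sketch below computes lower and upper approximations from scratch; it does not reproduce the thesis's incremental operations, and the function name `approximations`, the attribute names and the toy context objects are all hypothetical.

```python
from collections import defaultdict

def approximations(objects, attrs, target):
    """Return (lower, upper) rough approximations of `target`.

    `objects` maps an object name to its attribute dict. Objects with equal
    values on `attrs` are indiscernible and form one equivalence class.
    """
    classes = defaultdict(set)
    for name, values in objects.items():
        classes[tuple(values[a] for a in attrs)].add(name)

    lower, upper = set(), set()
    for block in classes.values():
        if block <= target:      # class lies entirely inside the target
            lower |= block
        if block & target:       # class overlaps the target
            upper |= block
    return lower, upper

# Hypothetical context objects described by two attributes.
contexts = {
    "c1": {"os": "linux", "port": 22},
    "c2": {"os": "linux", "port": 22},
    "c3": {"os": "windows", "port": 80},
}
relevant = {"c1", "c3"}  # objects assumed to match some CTI feed
lower, upper = approximations(contexts, ["os", "port"], relevant)
print(sorted(lower), sorted(upper))
```

Here `c1` and `c2` are indiscernible, so `c1` is only possibly relevant (upper approximation), while `c3` is certainly relevant (lower approximation). An incremental method would adjust these sets when objects or attribute values change instead of rebuilding the classes.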

    Dynamic Rule Covering Classification in Data Mining with Cyber Security Phishing Application

    Data mining is the process of discovering useful patterns from datasets using intelligent techniques to help users make certain decisions. A typical data mining task is classification, which involves predicting a target variable, known as the class, in previously unseen data based on models learnt from an input dataset. Covering is a well-known classification approach that derives models with If-Then rules. Covering methods, such as PRISM, have predictive performance competitive with other classical classification techniques such as greedy, decision tree and associative classification. Covering models are therefore appropriate decision-making tools, and users favour them when making decisions. Although the Covering approach is used in data processing for different classification applications, it is also acknowledged to suffer from the noticeable drawback of inducing massive numbers of rules, making the resulting model large and unmanageable by users. This issue is attributed to the way Covering techniques induce rules: they keep adding items to the rule’s body, despite the limited data coverage (the number of training instances that the rule classifies), until the rule has zero error. This excessive learning overfits the training dataset and also limits the applicability of Covering models in decision making, because managers normally prefer a summarised set of knowledge that they are able to control and comprehend rather than high-maintenance models. In practice, there should be a trade-off between the number of rules offered by a classification model and its predictive performance. Another issue associated with Covering models is the overlapping of training data among the rules, which happens when a rule’s classified data are discarded during the rule discovery phase. Unfortunately, the impact of a rule’s removed data on other potential rules is not considered by this approach. 
However, when training data linked with a rule are removed, both the frequency and the rank of other rules’ items that appeared in the removed data change. The impacted rules should maintain their true rank and frequency dynamically during the rule discovery phase rather than just keeping the initial frequency computed from the original input dataset. In response to these issues, a new dynamic learning technique based on Covering and rule induction, which we call Enhanced Dynamic Rule Induction (eDRI), is developed. eDRI has been implemented in Java and embedded in the WEKA machine learning tool. The developed algorithm incrementally discovers the rules using primarily frequency and rule strength thresholds. In practice, these thresholds limit the search space for both items and potential rules by discarding any with insufficient data representation as early as possible, resulting in an efficient training phase. More importantly, eDRI substantially cuts down the number of scans of the training examples by continuously updating potential rules’ frequency and strength parameters whenever a rule is inserted into the classifier. In particular, for each derived rule, eDRI adjusts on the fly the frequencies and ranks of the remaining potential rules’ items, specifically those that appeared within the deleted training instances of the derived rule. This gives a more realistic model with minimal rule redundancy, and makes the process of rule induction efficient and dynamic rather than static. Moreover, the proposed technique minimises the classifier’s number of rules at preliminary stages by stopping learning when any rule does not meet the rule strength threshold, thereby minimising overfitting and ensuring a manageable classifier. 
Lastly, the eDRI prediction procedure not only prioritises using the best ranked rule for class forecasting of test data but also restricts the use of the default class rule, thus reducing the number of misclassifications. The aforementioned improvements guarantee classification models of smaller size that do not overfit the training dataset, while maintaining their predictive performance. The models derived by eDRI particularly benefit users taking key business decisions, since they can provide a rich knowledge base to support their decision making. This is because these models are highly accurate, easy to understand, and controllable as well as robust, i.e. flexible enough to be amended without drastic change. eDRI’s applicability has been evaluated on the hard problem of phishing detection. Phishing normally involves creating a fake, well-designed website that closely resembles an existing trusted business website, aiming to trick users and illegally obtain their credentials, such as login information, in order to access their financial assets. The experimental results against large phishing datasets revealed that eDRI is highly useful as an anti-phishing tool, since it derived models of manageable size when compared with other traditional techniques without hindering the classification performance. Further evaluation results using several other classification datasets from different domains, obtained from the University of California Data Repository, have corroborated eDRI’s competitive performance with respect to accuracy, number of knowledge representation, training time and items space reduction. This makes the proposed technique not only efficient in inducing rules but also effective.
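The covering loop described in this abstract can be sketched in Python under some assumptions. This is not eDRI itself (which the abstract says is implemented in Java inside WEKA): the function names, the parameters `min_freq` and `min_strength`, the tie-breaking and the toy data are all hypothetical. What the sketch does show is the general idea of growing a rule until it meets a strength threshold, removing the rows it covers, and recounting item frequencies on the remaining rows before the next rule.

```python
from collections import Counter

def matches(body, features):
    """True if every (attribute, value) item in the rule body holds."""
    return all(features.get(a) == v for a, v in body.items())

def grow_rule(rows, min_freq, min_strength):
    """Greedily add items until the rule reaches `min_strength`, else None."""
    body, label, covered = {}, None, rows
    while True:
        item_counts, hit_counts = Counter(), Counter()
        for features, cls in covered:
            for attr, value in features.items():
                if attr in body:
                    continue
                item_counts[(attr, value)] += 1
                if label is None or cls == label:
                    hit_counts[(attr, value, cls)] += 1
        # Keep only items that are frequent enough for their class.
        candidates = [
            (hits / item_counts[(attr, value)], hits, attr, value, cls)
            for (attr, value, cls), hits in hit_counts.items()
            if hits >= min_freq
        ]
        if not candidates:
            return None
        strength, _, attr, value, label = max(
            candidates, key=lambda c: (c[0], c[1]))
        body[attr] = value
        covered = [row for row in covered if row[0].get(attr) == value]
        if strength >= min_strength:
            return body, label

def induce_rules(rows, min_freq=1, min_strength=0.8):
    """Covering loop: counts are recomputed after each rule's rows are removed."""
    rules, remaining = [], list(rows)
    while remaining:
        grown = grow_rule(remaining, min_freq, min_strength)
        if grown is None:  # no rule meets the thresholds: stop learning
            break
        body, label = grown
        rules.append((body, label))
        remaining = [row for row in remaining if not matches(body, row[0])]
    return rules

# Toy phishing-style data with made-up features.
rows = [
    ({"https": "no", "ip_in_url": "yes"}, "phish"),
    ({"https": "no", "ip_in_url": "no"}, "phish"),
    ({"https": "no", "ip_in_url": "yes"}, "phish"),
    ({"https": "yes", "ip_in_url": "no"}, "legit"),
    ({"https": "yes", "ip_in_url": "no"}, "legit"),
]
for body, label in induce_rules(rows, min_freq=2, min_strength=0.9):
    print(body, "->", label)
```

Because each call to `grow_rule` counts items only on the rows still remaining, frequencies reflect the data left after earlier rules were derived rather than the original dataset. The sketch rescans the remaining rows each time, whereas the abstract states that eDRI reduces such scans by updating the counts in place.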