1,521 research outputs found
Active Sample Selection Based Incremental Algorithm for Attribute Reduction with Rough Sets
Attribute reduction with rough sets is an effective technique for obtaining a compact and informative attribute set from a given dataset. However, traditional algorithms make no explicit provision for dynamic datasets in which data arrive as successive samples. Incremental algorithms for attribute reduction with rough sets have recently been introduced to handle dynamic datasets with large numbers of samples, though they have high time and space complexity. To address this issue, this paper presents a novel incremental algorithm for attribute reduction with rough sets based on an active sample selection process and an insight into the attribute reduction process. The algorithm first uses the active sample selection process to decide whether each incoming sample is useful with respect to the current dataset: a useless sample is discarded, while a useful sample is selected to update a reduct. On the arrival of a useful sample, the attribute reduction process then guides how attributes are added to and/or deleted from the current reduct. The two processes together constitute the theoretical framework of the algorithm. The proposed algorithm is finally shown experimentally to be efficient in time and space.
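The sample-selection step described above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the usefulness test (a sample is treated as useless when an existing sample already carries the same condition-attribute values and the same decision value) and the reduct-update hook are assumptions made for the sketch.

```python
def is_useful(sample, dataset, condition_attrs, decision_attr):
    """Illustrative usefulness test: a sample adds nothing if an existing
    sample has the same condition values and the same decision value."""
    key = tuple(sample[a] for a in condition_attrs)
    for s in dataset:
        if tuple(s[a] for a in condition_attrs) == key and \
           s[decision_attr] == sample[decision_attr]:
            return False  # duplicate information: discard
    return True  # new or conflicting information: use it to update the reduct

def process_stream(stream, dataset, condition_attrs, decision_attr, update_reduct):
    """Active-selection loop: only useful samples trigger a reduct update."""
    for sample in stream:
        if is_useful(sample, dataset, condition_attrs, decision_attr):
            dataset.append(sample)
            update_reduct(dataset)  # incrementally add/delete attributes
```

Because useless samples never reach the (expensive) reduct-update step, the stream-processing cost is dominated by the cheap usefulness check, which is the source of the time/space savings the abstract claims.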
An Intelligent Decision Support System for Business IT Security Strategy
Cyber threat intelligence (CTI) is an emerging approach to improving the cyber security of
business IT environments. A CTI feed contains information about an affected business IT context.
CTI sharing tools are available to subscribers, and CTI feeds are increasingly available.
If another business IT context is similar to a CTI feed's context, the threat described
in the CTI feed might also occur in that business IT context. Businesses can
take proactive defensive actions if relevant CTI is identified. However, one challenge is
how to develop an effective strategy for connecting CTI onto business IT contexts.
Businesses still make insufficient use of CTI because not all of them have sufficient
knowledge from domain experts. Moreover, business IT contexts vary over time.
When the business IT contextual state changes, previously relevant CTI might no
longer be appropriate or applicable. Another challenge is how a connection strategy
can adapt to these business IT contextual changes.
To fill this gap, this Ph.D. project proposes a dynamic connection strategy for CTI onto
business IT contexts and instantiates the strategy as a dynamic
connection rule assembly system. The system can identify relevant CTI for a business
IT context and can modify its internal configurations and structures to adapt
to business IT contextual changes.
This thesis introduces the system development phases from design to delivery,
and the contributions to knowledge are explained as follows.
A hybrid representation of the dynamic connection strategy is proposed to generalise
and interpret the problem domain and the system development. The representation
uses selected computational intelligence models and software development
models.
In terms of the computational intelligence models, a CTI feed context and a
business IT context are generalised to the same type, i.e., a context object. A grey
number model is selected to represent the attribute values of context objects. Fuzzy
sets are used to represent the context objects, and the linguistic densities of the attribute
values of context objects are inferred. To assemble applicable connection
knowledge, the system constructs a set of connection objects based on the context
objects and uses rough set operations to extract the applicable connection objects that
contain the connection knowledge.
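The rough set operations mentioned above rest on the classical lower and upper approximations of a target set under an indiscernibility relation. The following is a minimal sketch of that standard computation; the function and variable names are illustrative, not the thesis's actual API:

```python
from collections import defaultdict

def approximations(universe, attrs, target):
    """Classical rough-set lower/upper approximation of `target`
    (a set of object indices) under the indiscernibility relation
    induced by the attributes in `attrs`."""
    # Partition the universe into equivalence classes (indiscernible blocks).
    blocks = defaultdict(set)
    for i, obj in enumerate(universe):
        blocks[tuple(obj[a] for a in attrs)].add(i)
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:      # block entirely inside target: certainly in
            lower |= block
        if block & target:       # block overlaps target: possibly in
            upper |= block
    return lower, upper
```

Objects in the lower approximation certainly satisfy the target concept given the chosen attributes; objects between the lower and upper approximations form the boundary region, which is what the incremental updating approach described next must maintain as the context changes.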
Furthermore, to adapt to contextual changes, a rough set based incremental
updating approach with multiple operations is developed to incrementally update
the approximations. A set of propositions is proposed to describe how the system
changes based on its previous states and internal structures, and their
complexities and efficiencies are analysed.
In terms of the software development models, selected unified modelling language
(UML) models represent the system in the design phase. An activity diagram
represents the business process of the system, a use case diagram represents
the human interactions with the system, and a class diagram represents
the internal components and the relationships between them. Using this representation,
developers can rapidly develop a prototype of the system.
Using the representation, an application of the system is developed with mainstream
software development techniques. A RESTful software architecture handles
the communication between the server and the clients, carrying the business IT
contextual information and the CTI-based analysis results. A script based method is
deployed on the clients to collect the contextual information. The observer pattern and
a timer are used to design and develop the monitor-trigger mechanism.
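A monitor-trigger mechanism built from the observer pattern plus a timer might look like the following sketch. The class and method names here are assumptions for illustration, not the system's actual code: a timer periodically polls the client's context, and subscribed observers are triggered only when the polled snapshot changes.

```python
import threading

class ContextMonitor:
    """Observer-pattern monitor: notifies subscribers when the
    periodically polled context snapshot changes."""
    def __init__(self, poll_context, interval=60.0):
        self._poll = poll_context     # callable returning the current context
        self._interval = interval     # polling period in seconds
        self._observers = []
        self._last = None
        self._timer = None

    def subscribe(self, callback):
        self._observers.append(callback)

    def _tick(self):
        current = self._poll()
        if current != self._last:     # contextual change detected
            self._last = current
            for cb in self._observers:
                cb(current)           # trigger downstream analysis/updating
        self.start()                  # re-arm the timer for the next poll

    def start(self):
        self._timer = threading.Timer(self._interval, self._tick)
        self._timer.daemon = True
        self._timer.start()

    def stop(self):
        if self._timer:
            self._timer.cancel()
```

In this arrangement the trigger step is exactly where the incremental updating functions described earlier would be invoked, so re-computation happens only on genuine contextual change rather than on every poll.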
In summary, the representation generalises real-world cases in the problem domain
and interprets the system data. A specific business can initialise an instance of
the representation as a specific system based on its IT context and CTI feeds, and
the knowledge assembled by the system can be used to identify relevant CTI feeds.
From the relevant CTI data, the system locates and retrieves the useful information
that can inform security decisions and then sends it to the client users. When the
system needs to modify itself to adapt to business IT contextual changes, it
can invoke the corresponding incremental updating functions and avoid a
time-consuming re-computation. With this updating strategy, the application can
provide its client-side users with timely support and useful, CTI-informed
information for security decisions.
Dynamic Rule Covering Classification in Data Mining with Cyber Security Phishing Application
Data mining is the process of discovering useful patterns from datasets using intelligent techniques to help users make decisions. A typical data mining task is classification, which involves predicting a target variable, known as the class, in previously unseen data based on models learnt from an input dataset. Covering is a well-known classification approach that derives models consisting of If-Then rules. Covering methods, such as PRISM, have a predictive performance competitive with other classical classification techniques such as greedy, decision tree and associative classification. Covering models are therefore appropriate decision-making tools, and users favour them when making decisions.
Despite the use of the Covering approach in data processing for different classification applications, it is also acknowledged to suffer from the noticeable drawback of inducing massive numbers of rules, making the resulting model large and unmanageable by users. This issue stems from the way Covering techniques induce rules: they keep adding items to a rule's body, despite the rule's limited data coverage (the number of training instances it classifies), until the rule attains zero error. This excessive learning overfits the training dataset and also limits the applicability of Covering models in decision making, because managers normally prefer a summarised set of knowledge that they can control and comprehend rather than a high-maintenance model. In practice, there should be a trade-off between the number of rules a classification model offers and its predictive performance. Another issue associated with Covering models is the overlapping of training data among the rules, which happens when a rule's classified data are discarded during the rule discovery phase. Unfortunately, this approach does not consider the impact of a rule's removed data on other potential rules. When the training data linked with a rule are removed, both the frequency and the rank of other rules' items that appeared in the removed data should be updated. The impacted rules should maintain their true rank and frequency dynamically during the rule discovery phase rather than just keeping the frequency initially computed from the original input dataset.
In response to the aforementioned issues, a new dynamic learning technique based on Covering and rule induction, called Enhanced Dynamic Rule Induction (eDRI), is developed. eDRI has been implemented in Java and embedded in the WEKA machine learning tool. The algorithm incrementally discovers rules using primarily frequency and rule strength thresholds. In practice, these thresholds limit the search space for both items and potential rules by discarding, as early as possible, any with insufficient data representation, resulting in an efficient training phase. More importantly, eDRI substantially cuts down the number of training-example scans by continuously updating potential rules' frequency and strength parameters whenever a rule is inserted into the classifier. In particular, for each derived rule, eDRI adjusts on the fly the remaining potential rules' item frequencies and ranks, specifically for those items that appeared within the deleted training instances of the derived rule. This yields a more realistic model with minimal rule redundancy and makes the rule induction process dynamic and efficient rather than static. Moreover, the proposed technique minimises the classifier's number of rules at a preliminary stage by stopping learning whenever a rule does not meet the rule strength threshold, thereby reducing overfitting and ensuring a manageable classifier. Lastly, the eDRI prediction procedure not only prioritises the best ranked rule for class forecasting of test data but also restricts the use of the default class rule, thus reducing the number of misclassifications.
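The covering-with-thresholds idea behind eDRI can be sketched as follows. This is a simplified, PRISM-style illustration with a frequency threshold and per-pass frequency recounting, not the published eDRI algorithm; the function names, tie-breaking and stopping rules are assumptions made for the sketch.

```python
from collections import Counter

def learn_rules(data, attrs, cls, min_freq=2):
    """Covering sketch: greedily grow a rule for the current majority
    class, drop the rows it covers, then recount item frequencies on the
    remaining data; items below `min_freq` are pruned early."""
    rules, rows = [], list(data)
    while rows:
        target = Counter(r[cls] for r in rows).most_common(1)[0][0]
        body, candidates = {}, list(rows)
        while True:
            best, best_acc, best_cov = None, -1.0, 0
            for a in attrs:
                if a in body:
                    continue
                for v in {r[a] for r in candidates}:
                    cov = [r for r in candidates if r[a] == v]
                    if len(cov) < min_freq:   # frequency threshold: prune
                        continue
                    acc = sum(r[cls] == target for r in cov) / len(cov)
                    if acc > best_acc or (acc == best_acc and len(cov) > best_cov):
                        best, best_acc, best_cov = (a, v), acc, len(cov)
            if best is None:                  # nothing meets the threshold
                break
            body[best[0]] = best[1]
            candidates = [r for r in candidates if r[best[0]] == best[1]]
            if best_acc == 1.0:               # rule is already pure
                break
        if not body:                          # stop: remaining data too sparse
            break
        rules.append((dict(body), target))
        # Remove covered rows; the next pass recounts frequencies, which is
        # the "dynamic" update the static Covering approach lacks.
        rows = [r for r in rows if not all(r[a] == v for a, v in body.items())]
    return rules
```

The `min_freq` cut is what keeps the rule set small and prevents the zero-error overfitting described above, while recounting on the reduced dataset each pass stands in for eDRI's on-the-fly frequency and rank adjustment.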
The aforementioned improvements guarantee classification models of smaller size that do not overfit the training dataset while maintaining their predictive performance. The models derived by eDRI particularly benefit users taking key business decisions, since they provide a rich knowledge base to support decision making: they are accurate, easy to understand, controllable and robust, i.e., flexible enough to be amended without drastic change. eDRI's applicability has been evaluated on the hard problem of phishing detection. Phishing normally involves creating a fake, well-designed website with near-identical similarity to an existing, trusted business website, aiming to trick users and illegally obtain their credentials, such as login information, in order to access their financial assets. The experimental results on large phishing datasets revealed that eDRI is highly useful as an anti-phishing tool, since it derived models of manageable size compared with other traditional techniques without hindering classification performance. Further evaluation on several other classification datasets from different domains, obtained from the University of California Data Repository, corroborated eDRI's competitive performance with respect to accuracy, size of the knowledge representation, training time and item space reduction. This makes the proposed technique not only efficient in inducing rules but also effective.