255 research outputs found

    Artificial intelligence in the cyber domain: Offense and defense

    Get PDF
    Artificial intelligence techniques have grown rapidly in recent years, and their applications in practice can be seen in many fields, ranging from facial recognition to image analysis. In the cybersecurity domain, AI-based techniques can provide better cyber defense tools and help adversaries improve methods of attack. However, malicious actors are aware of the new prospects too and will probably attempt to use them for nefarious purposes. This survey paper aims at providing an overview of how artificial intelligence can be used in the context of cybersecurity in both offense and defense.Web of Science123art. no. 41

    Phishing Web Page Detection using Optimised Machine Learning

    Get PDF
    Phishing is a type of social engineering attack that can affect any company or anyone. This paper explores the effect that different features and optimisation techniques have on the accuracy of intelligent phishing detection using machine learning algorithms. This paper explores both hyperparameter optimisation as well as feature selection optimisation. For hyperparameter tuning, both TPE (Tree-structured Parzen Estimator) and GA (Genetic Algorithm) were tested, with the best option being model dependent. For feature selection, GA, MFO (Moth Flame Optimisation) and PSO (Particle Swarm Optimisation) were used with PSO working best with a Random Forest model. This work used URL (Uniform Resource Locator), DOM (Document Object Model) structure, page rank and page information related features. This research found that the best combination was Random Forest using PSO for feature selection and TPE for hyperparameter optimisation, giving an accuracy of 99.33%

    A New Incremental Decision Tree Learning for Cyber Security based on ILDA and Mahalanobis Distance

    Get PDF
    A cyber-attack detection is currently essential for computer network protection. The fundamentals of protection are to detect cyber-attack effectively with the ability to combat it in various ways and with constant data learning such as internet traffic. With these functions, each cyber-attack can be memorized and protected effectively any time. This research will present procedures for a cyber-attack detection system Incremental Decision Tree Learning (IDTL) that use the principle through Incremental Linear Discriminant Analysis (ILDA) together with Mahalanobis distance for classification of the hierarchical tree by reducing data features that enhance classification of a variety of malicious data. The proposed model can learn a new incoming datum without involving the previous learned data and discard this datum after being learned. The results of the experiments revealed that the proposed method can improve classification accuracy as compare with other methods. They showed the highest accuracy when compared to other methods. If comparing with the effectiveness of each class, it was found that the proposed method can classify both intrusion datasets and other datasets efficiently

    Dynamic Rule Covering Classification in Data Mining with Cyber Security Phishing Application

    Get PDF
    Data mining is the process of discovering useful patterns from datasets using intelligent techniques to help users make certain decisions. A typical data mining task is classification, which involves predicting a target variable known as the class in previously unseen data based on models learnt from an input dataset. Covering is a well-known classification approach that derives models with If-Then rules. Covering methods, such as PRISM, have a competitive predictive performance to other classical classification techniques such as greedy, decision tree and associative classification. Therefore, Covering models are appropriate decision-making tools and users favour them carrying out decisions. Despite the use of Covering approach in data processing for different classification applications, it is also acknowledged that this approach suffers from the noticeable drawback of inducing massive numbers of rules making the resulting model large and unmanageable by users. This issue is attributed to the way Covering techniques induce the rules as they keep adding items to the rule’s body, despite the limited data coverage (number of training instances that the rule classifies), until the rule becomes with zero error. This excessive learning overfits the training dataset and also limits the applicability of Covering models in decision making, because managers normally prefer a summarised set of knowledge that they are able to control and comprehend rather a high maintenance models. In practice, there should be a trade-off between the number of rules offered by a classification model and its predictive performance. Another issue associated with the Covering models is the overlapping of training data among the rules, which happens when a rule’s classified data are discarded during the rule discovery phase. Unfortunately, the impact of a rule’s removed data on other potential rules is not considered by this approach. However, When removing training data linked with a rule, both frequency and rank of other rules’ items which have appeared in the removed data are updated. The impacted rules should maintain their true rank and frequency in a dynamic manner during the rule discovery phase rather just keeping the initial computed frequency from the original input dataset. In response to the aforementioned issues, a new dynamic learning technique based on Covering and rule induction, that we call Enhanced Dynamic Rule Induction (eDRI), is developed. eDRI has been implemented in Java and it has been embedded in WEKA machine learning tool. The developed algorithm incrementally discovers the rules using primarily frequency and rule strength thresholds. These thresholds in practice limit the search space for both items as well as potential rules by discarding any with insufficient data representation as early as possible resulting in an efficient training phase. More importantly, eDRI substantially cuts down the number of training examples scans by continuously updating potential rules’ frequency and strength parameters in a dynamic manner whenever a rule gets inserted into the classifier. In particular, and for each derived rule, eDRI adjusts on the fly the remaining potential rules’ items frequencies as well as ranks specifically for those that appeared within the deleted training instances of the derived rule. This gives a more realistic model with minimal rules redundancy, and makes the process of rule induction efficient and dynamic and not static. Moreover, the proposed technique minimises the classifier’s number of rules at preliminary stages by stopping learning when any rule does not meet the rule’s strength threshold therefore minimising overfitting and ensuring a manageable classifier. Lastly, eDRI prediction procedure not only priorities using the best ranked rule for class forecasting of test data but also restricts the use of the default class rule thus reduces the number of misclassifications. The aforementioned improvements guarantee classification models with smaller size that do not overfit the training dataset, while maintaining their predictive performance. The eDRI derived models particularly benefit greatly users taking key business decisions since they can provide a rich knowledge base to support their decision making. This is because these models’ predictive accuracies are high, easy to understand, and controllable as well as robust, i.e. flexible to be amended without drastic change. eDRI applicability has been evaluated on the hard problem of phishing detection. Phishing normally involves creating a fake well-designed website that has identical similarity to an existing business trustful website aiming to trick users and illegally obtain their credentials such as login information in order to access their financial assets. The experimental results against large phishing datasets revealed that eDRI is highly useful as an anti-phishing tool since it derived manageable size models when compared with other traditional techniques without hindering the classification performance. Further evaluation results using other several classification datasets from different domains obtained from University of California Data Repository have corroborated eDRI’s competitive performance with respect to accuracy, number of knowledge representation, training time and items space reduction. This makes the proposed technique not only efficient in inducing rules but also effective

    Feature Selection to Enhance Phishing Website Detection Based On URL Using Machine Learning Techniques

    Get PDF
    The detection of phishing websites based on machine learning has gained much attention due to its ability to detect newly generated phishing URLs. To detect phishing websites, most techniques combine URLs, web page content, and external features. However, the content of the web page and external features are time-consuming, require large computing power, and are not suitable for resource-constrained devices. To overcome this problem, this study applies feature selection techniques based on the URL to improve the detection process. The methodology for this study consists of seven stages, including data preparation, preprocessing, splitting the dataset into training and validation, feature selection, 10-fold cross-validation, validating the model, and finally performance evaluation. Two public datasets were used to validate the method. TreeSHAP and Information Gain were used to rank features and select the top 10, 15, and 20. These features are fed into three machine learning classifiers which are Naïve Bayes, Random Forest, and XGBoost. Their performance is evaluated based on accuracy, precision, and recall. As a result, the features ranked by TreeSHAP contributed most to improving detection accuracy. The highest accuracy of 98.59 percent was achieved by XGBoost for the first dataset with 15 features. For the second dataset, the highest accuracy is 90.21 percent using 20 features and Random Forest. As for Naïve Bayes, the highest accuracy recorded is 98.49 percent using the first dataset

    Phishing website detection using genetic algorithm-based feature selection and parameter hypertuning

    Get PDF
    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Business AnalyticsFalse webpages are created by cyber attackers who seek to mislead users into revealing sensitive and personal information, from credit card details to passwords. Phishing is a class of cyber attacks that mislead users into clicking on false websites, logging into related accounts, and subsequently stealing funds. This cyberattack increases annually given the exponential increase of e-commerce customers, which causes difficulty to distinguish between harmless and false websites. The conventional methods to detect phishing websites are focused on a database of blacklisted and whitelisted. Such methods are not capable to detect new phishing websites. To solve this problem, researchers are developing machine learning (ML) and deep learning-based methods. In this dissertation, a hybrid-based solution, which uses genetic algorithms and ML algorithms for phishing detection based on the URL of the website is proposed. Regarding evaluation, comparisons between conventional ML and DL models are performed using various feature sets resulting from commonly used feature selection methods, such as mutual information and recursive feature elimination. This dissertation proposes a final model with an accuracy of 95.34% on the test set

    Improved techniques for phishing email detection based on random forest and firefly-based support vector machine learning algorithms.

    Get PDF
    Master of Science in Computer Science. University of KwaZulu-Natal, Durban, 2014.Electronic fraud is one of the major challenges faced by the vast majority of online internet users today. Curbing this menace is not an easy task, primarily because of the rapid rate at which fraudsters change their mode of attack. Many techniques have been proposed in the academic literature to handle e-fraud. Some of them include: blacklist, whitelist, and machine learning (ML) based techniques. Among all these techniques, ML-based techniques have proven to be the most efficient, because of their ability to detect new fraudulent attacks as they appear.There are three commonly perpetrated electronic frauds, namely: email spam, phishing and network intrusion. Among these three, more financial loss has been incurred owing to phishing attacks. This research investigates and reports the use of MLand Nature Inspired technique in the domain of phishing detection, with the foremost objective of developing a dynamic and robust phishing email classifier with improved classification accuracy and reduced processing time.Two approaches to phishing email detection are proposed, and two email classifiers are developed based on the proposed approaches. In the first approach, a random forest algorithm is used to construct decision trees,which are,in turn,used for email classification. The second approach introduced a novel MLmethod that hybridizes firefly algorithm (FFA) and support vector machine (SVM). The hybridized method consists of three major stages: feature extraction phase, hyper-parameter selection phase and email classification phase. In the feature extraction phase, the feature vectors of all the features described in Section 3.6 are extracted and saved in a file for easy access.In the second stage, a novel hyper-parameter search algorithm, developed in this research, is used to generate exponentially growing sequence of paired C and Gamma (γ) values. FFA is then used to optimize the generated SVM hyper-parameters and to also find the best hyper-parameter pair. Finally, in the third phase, SVM is used to carry out the classification. This new approach addresses the problem of hyper-parameter optimization in SVM, and in turn, improves the classification speed and accuracy of SVM. Using two publicly available email datasets, some experiments are performed to evaluate the performance of the two proposed phishing email detection techniques. During the evaluation of each approach, a set of features (well suited for phishing detection) are extracted from the training dataset and used to constructthe classifiers. Thereafter, the trained classifiers are evaluated on the test dataset. The evaluations produced very good results. The RF-based classifier yielded a classification accuracy of 99.70%, a FP rate of 0.06% and a FN rate of 2.50%. Also, the hybridized classifier (known as FFA_SVM) produced a classification accuracy of 99.99%, a FP rate of 0.01% and a FN rate of 0.00%

    Security in Data Mining- A Comprehensive Survey

    Get PDF
    Data mining techniques, while allowing the individuals to extract hidden knowledge on one hand, introduce a number of privacy threats on the other hand. In this paper, we study some of these issues along with a detailed discussion on the applications of various data mining techniques for providing security. An efficient classification technique when used properly, would allow an user to differentiate between a phishing website and a normal website, to classify the users as normal users and criminals based on their activities on Social networks (Crime Profiling) and to prevent users from executing malicious codes by labelling them as malicious. The most important applications of Data mining is the detection of intrusions, where different Data mining techniques can be applied to effectively detect an intrusion and report in real time so that necessary actions are taken to thwart the attempts of the intruder. Privacy Preservation, Outlier Detection, Anomaly Detection and PhishingWebsite Classification are discussed in this paper

    A Survey on Attacks and Advances of Intrusion Detection Systems

    Get PDF
    Now day’s information of an organization floating over the internet that increases the traffic on the network as well as threats from attackers. To protect these sensitive material Intrusion Detection System (IDS) is situated in the scheme. It is an application software program or hardware mechanism that compacts with assaults by assembling information from a mixture of systems and network resources, then analyzing indications of defense dilemmas. Network Intrusion Detection (NID) is a method that efforts to determine unauthorized entrance to a network through analyzing traffic on the network. There are a variety of advances of intrusion detection, for instance Data Mining, Pattern Matching, Machine Learning and Measure Based Methods. This survey paper aims towards the proper learning of intrusion detection system with the intention that researchers could create employ of it and discover the new methods towards intrusions. Keywords: Intrusion Detection System, Data Mining, Pattern Matching, Anomaly detection, misuse detection, Machine Learning
    • …
    corecore