
    Unbiased phishing detection using domain name based features

    2018 Summer. Includes bibliographical references. Internet users are coming under a barrage of phishing attacks of increasing frequency and sophistication. While these attacks have been remarkably resilient against the vast range of defenses proposed by academia, industry, and research organizations, machine learning approaches appear promising for distinguishing between phishing and legitimate websites. There are three main concerns with existing machine learning approaches for phishing detection. The first concern is that there is neither a framework, preferably open-source, for extracting features and keeping the dataset updated, nor an up-to-date dataset of phishing and legitimate websites. The second concern is the large number of features used and the lack of validating arguments for the choice of features selected to train the machine learning classifier. The last concern is that the datasets used in the literature seem to be inadvertently biased with respect to features based on URL or content. In this thesis, we describe the implementation of our open-source and extensible framework to extract features and create an up-to-date phishing dataset. With this framework, named Fresh-Phish, we implemented 29 different features that we used to detect whether a given website is legitimate or phishing. We used 26 features that were reported in related work and added 3 new features. We created a dataset of 6,000 websites with these features, of which 3,000 were malicious and 3,000 were genuine, and tested our approach. Using 6 different classifiers, we achieved an accuracy of 93%, which is reasonably high in this field. To address the second and third concerns, we put forward the intuition that the domain name of a phishing website is the tell-tale sign of phishing and holds the key to successful phishing detection.
We focus on this aspect of phishing websites and design features that explore the relationship of the domain name to the key elements of the website. Our work differs from the existing state of the art in that our feature set ensures minimal or no bias with respect to a dataset. Our learning model trains with only seven features and achieves a true positive rate of 98% and a classification accuracy of 97% on a sample dataset. Compared to the state-of-the-art work, our per-instance processing and classification is 4 times faster for legitimate websites and 10 times faster for phishing websites. Importantly, we demonstrate the shortcomings of using features based on URLs, as they are likely to be biased towards dataset collection and usage. We show the robustness of our learning algorithm by testing our classifiers on unknown live phishing URLs, achieving a detection accuracy of 99.7%, compared to the earlier known best result of a 95% detection rate.
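
To illustrate what a feature relating the domain name to the key elements of a website might look like, here is a minimal Python sketch. It is not the thesis's feature set: the two features, the function names `naive_domain_label` and `domain_features`, and the simplified domain extraction (which ignores multi-part suffixes such as `.co.uk`) are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class _LinkAndTitleParser(HTMLParser):
    """Collects <a href> targets and the <title> text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def naive_domain_label(url):
    """Hypothetical helper: second-to-last label of the hostname.

    Assumption: a single-label public suffix (e.g. .com); a real system
    would consult the Public Suffix List instead.
    """
    host = (urlparse(url).hostname or "").lower()
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host


def domain_features(url, html):
    """Two illustrative features tying the domain name to page content."""
    parser = _LinkAndTitleParser()
    parser.feed(html)
    domain = naive_domain_label(url)
    # Only absolute links carry a hostname that can be compared.
    absolute = [h for h in parser.hrefs if urlparse(h).hostname]
    same = [h for h in absolute if naive_domain_label(h) == domain]
    return {
        "domain_in_title": domain != "" and domain in parser.title.lower(),
        "same_domain_link_ratio": len(same) / len(absolute) if absolute else 1.0,
    }
```

Features of this kind depend on how the page relates to its own domain rather than on the shape of the URL string, which is one plausible way to avoid the dataset bias the abstract attributes to URL-based features.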

    Fast Detection of Zero-Day Phishing Websites Using Machine Learning

    The recent global growth in the number of internet users and online applications has led to a massive volume of personal data transactions taking place over the internet. To gain access to this valuable data and the services involved, and to use them for malicious activities, attackers lure users to phishing websites that steal user credentials and other personal data required to impersonate their victims. Sophisticated phishing toolkits and flux networks are increasingly being used by attackers to create and host phishing websites, respectively, in order to increase the number of phishing attacks and evade detection. This has resulted in an increase in the number of new (zero-day) phishing websites. Anti-malware software and web browsers' anti-phishing filters are widely used to detect phishing websites, thus preventing users from falling victim to phishing. However, these solutions mostly rely on blacklists of known phishing websites. With these techniques, the time lag between the creation of a new phishing website and its being reported as malicious leaves a window during which users are exposed to zero-day phishing websites. This has contributed to a global increase in the number of successful phishing attacks in recent years. To address this shortcoming, this research proposes three Machine Learning (ML)-based approaches for fast and highly accurate prediction of zero-day phishing websites using novel sets of prediction features. The first approach uses a novel set of 26 features based on URL structure and on webpage structure and contents to predict zero-day phishing webpages that collect users' personal data. The other two approaches detect, through their hostnames, zero-day phishing webpages that are hosted in Fast Flux Service Networks (FFSNs) and Name Server IP Flux Networks (NSIFNs). These networks consist of frequently changing machines that host malicious websites and their authoritative name servers, respectively.
The machines provide a layer of protection to the actual service hosts against blacklisting in order to prolong the active life span of the services. Consequently, the websites in these networks become more harmful than those hosted in normal networks. To address them, our second proposed approach predicts zero-day phishing hostnames hosted in FFSNs using a novel set of 56 features based on DNS, network, and host characteristics of the hosting networks. Our last approach predicts zero-day phishing hostnames hosted in NSIFNs using a novel set of 11 features based on DNS and host characteristics of the hosting networks. The feature set in each approach is evaluated using 11 ML algorithms, achieving high prediction performance with most of the algorithms. This indicates the relevance and robustness of the feature sets for their respective detection tasks. The feature sets also perform well against data collected over a later time period without retraining, indicating their long-term effectiveness in detecting the websites. The approaches use highly diversified feature sets, which are expected to enhance resistance to various detection evasion tactics. The measured prediction times of the first and third approaches are sufficiently low for potential use in real-time protection of users. This thesis also introduces a multi-class classification technique for evaluating the feature sets in the second and third approaches. The technique predicts each of the hostname types as an independent outcome, thus enabling experts to use type-specific measures in taking down the phishing websites. Lastly, highly accurate methods are proposed for labelling hostnames based on the number of changes in the IP addresses of their authoritative name servers, monitored over a specific period of time.
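
The idea of labelling a hostname by the number of changes in its authoritative name servers' IP addresses can be sketched as follows. This is a minimal illustration, not the thesis's labelling method: the function names, the `"flux"` and `"stable"` labels, and the `min_changes` threshold are hypothetical, and the observations are assumed to have been collected already as time-ordered sets of IP address strings.

```python
def count_ip_changes(observations):
    """Count how many consecutive observations differ in their IP set.

    `observations` is a time-ordered list of sets of IP address strings
    seen for a hostname's authoritative name servers.
    """
    return sum(
        1 for prev, cur in zip(observations, observations[1:]) if prev != cur
    )


def label_hostname(observations, min_changes=3):
    """Label a hostname from its monitored name-server IP sets.

    Assumption: `min_changes` is an illustrative threshold, not a value
    taken from the thesis.
    """
    if count_ip_changes(observations) >= min_changes:
        return "flux"
    return "stable"
```

For example, a hostname whose name-server IP set changes on three of four consecutive checks would be labelled `"flux"` under the default threshold, while one that never changes would be labelled `"stable"`.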