7 research outputs found

    Enhanced Spam Detection System for Twitter Social Networking Platform

    Twitter is one of the most popular online social networking (OSN) sites, used by prominent people such as ministers, businessmen, large companies, and actors to share information. Around 500 million tweets are posted monthly by its 313 million active users. Twitter's wide reach has drawn the interest of spammers. These malicious actors exploit the platform for various purposes, including monitoring authentic users, disseminating harmful software, and promoting their agendas through URLs embedded in tweets. They use tactics such as secretly following and unfollowing legitimate users, with the intent of gathering sensitive information. To address this problem, a secure spam detection approach based on machine learning is designed. The design uses stop-word removal and a word-to-vector model to refine the data and reduce its dimensionality. To enhance data quality, cosine similarity is also applied to measure the similarity score among tweets, and an Artificial Neural Network (ANN) is trained on that basis. The trained network is then evaluated in terms of precision, recall, and F-measure. A comparative analysis is also performed to demonstrate the efficiency of the work. The proposed spam detection model obtains an average precision, recall, and F-measure of 0.9252, 0.6107, and 0.734, respectively.
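The preprocessing and similarity step described in this abstract can be illustrated with a minimal sketch. It assumes plain term-count vectors in place of the paper's word-to-vector model, and the stop-word list and the function names `tokenize` and `cosine_similarity` are hypothetical choices made for illustration, not taken from the paper.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the paper's list).
STOP_WORDS = {"a", "an", "the", "is", "to", "and", "of", "in", "for", "on"}


def tokenize(tweet):
    """Lowercase the tweet, split on non-alphanumerics, drop stop words."""
    words = re.findall(r"[a-z0-9]+", tweet.lower())
    return [w for w in words if w not in STOP_WORDS]


def cosine_similarity(tweet_a, tweet_b):
    """Cosine similarity between the term-count vectors of two tweets."""
    va, vb = Counter(tokenize(tweet_a)), Counter(tokenize(tweet_b))
    # Only terms present in both tweets contribute to the dot product.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a tweet with no remaining tokens is similar to nothing
    return dot / (norm_a * norm_b)
```

Scores near 1 indicate near-duplicate tweets, and scores near 0 indicate tweets with no shared vocabulary after stop-word removal. How the paper feeds these scores into the ANN is not specified in the abstract.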

    Comparison of machine learning techniques for detecting phishing websites

    Phishing is the theft of personal data through fake web pages. The victim is directed to the fake page, where they are asked to enter their data to validate their identity. The theft takes place at that moment: once entered, the data are stored and used by the hacker responsible for the attack, either to sell them or to access the institutions involved and commit theft or fraud. This work investigates different methods for detecting phishing web pages using machine learning techniques. Its purpose is to compare the techniques that have proven most effective at detecting phishing websites. The results show that the tree-based classifiers, Decision Tree and Random Forest, achieved the highest precision and effectiveness rates, with values between 97% and 99% in detecting this type of page.

    Unbiased phishing detection using domain name based features

    2018 Summer. Includes bibliographical references. Internet users are coming under a barrage of phishing attacks of increasing frequency and sophistication. While these attacks have been remarkably resilient against the vast range of defenses proposed by academia, industry, and research organizations, machine learning approaches appear promising for distinguishing between phishing and legitimate websites. There are three main concerns with existing machine learning approaches for phishing detection. The first concern is that there is neither a framework, preferably open-source, for extracting features and keeping the dataset updated, nor an up-to-date dataset of phishing and legitimate websites. The second concern is the large number of features used and the lack of validating arguments for the choice of features selected to train the machine learning classifier. The last concern relates to the datasets used in the literature, which seem to be inadvertently biased with respect to features based on the URL or content. In this thesis, we describe the implementation of our open-source and extensible framework to extract features and create an up-to-date phishing dataset. With this framework, named Fresh-Phish, we implemented 29 different features that we used to detect whether a given website is legitimate or phishing. We used 26 features that were reported in related work, added 3 new features, created a dataset of 6,000 websites with these features, of which 3,000 were malicious and 3,000 were genuine, and tested our approach. Using 6 different classifiers, we achieved an accuracy of 93%, which is reasonably high in this field. To address the second and third concerns, we put forward the intuition that the domain name of a phishing website is the tell-tale sign of phishing and holds the key to successful phishing detection.
We focus on this aspect of phishing websites and design features that explore the relationship of the domain name to the key elements of the website. Our work differs from the existing state of the art in that our feature set ensures that there is minimal or no bias with respect to a dataset. Our learning model trains with only seven features and achieves a true positive rate of 98% and a classification accuracy of 97% on a sample dataset. Compared to state-of-the-art work, our per-instance processing and classification is 4 times faster for legitimate websites and 10 times faster for phishing websites. Importantly, we demonstrate the shortcomings of using features based on URLs, as they are likely to be biased towards dataset collection and usage. We show the robustness of our learning algorithm by testing our classifiers on unknown live phishing URLs, achieving a detection accuracy of 99.7% compared to the earlier known best result of a 95% detection rate.
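The idea of relating the domain name to key elements of a website can be sketched as follows. The two features shown, `domain_in_title` and `same_host_link_ratio`, are hypothetical examples chosen for illustration. The abstract does not list the thesis's seven features, and the helper `main_label` is a deliberately naive assumption that mishandles multi-part suffixes such as `co.uk`.

```python
from urllib.parse import urlparse


def main_label(url):
    """Crude 'main' domain label, e.g. 'example' for www.example.com.

    Naive assumption: the second-to-last dot-separated label is the name.
    """
    host = urlparse(url).hostname or ""
    parts = host.split(".")
    return parts[-2] if len(parts) >= 2 else host


def domain_features(page_url, title, links):
    """Relate the page's domain name to its title and outgoing links."""
    label = main_label(page_url)
    page_host = urlparse(page_url).hostname or ""
    # Relative links carry no hostname, so only absolute links are compared.
    absolute = [link for link in links if urlparse(link).hostname]
    same_host = [link for link in absolute
                 if urlparse(link).hostname == page_host]
    return {
        "domain_in_title": bool(label) and label in title.lower(),
        "same_host_link_ratio":
            len(same_host) / len(absolute) if absolute else 0.0,
    }
```

For a page at `http://secure-login.xyz/paypal` titled "PayPal" whose links all point to `www.paypal.com`, both features come out low, which matches the intuition that a phishing page's domain name is unrelated to the content it imitates.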

    Detection of suspicious URLs in online social networks using supervised machine learning algorithms

    This thesis proposes the use of several supervised machine learning classification models that were built to detect the distribution of malicious content in OSNs. The main focus was on ensemble learning algorithms such as Random Forest, gradient boosting trees, extra trees, and XGBoost. The features used to identify social network posts that contain malicious URLs were derived from several sources, such as the domain WHOIS record, web page content, URL lexical and redirection data, and Twitter metadata. The thesis describes a systematic analysis of the hyper-parameters of tree-based models. The impact of key parameters, such as the number of trees, the depth of trees, and the minimum size of leaf nodes, on classification performance was assessed. The results show that controlling the complexity of Random Forest classifiers applied to social media spam is essential to avoid overfitting and optimise performance. Model complexity could be reduced by removing uninformative features, as the complexity they add to the model outweighs the benefit they bring to its decisions. Moreover, two model-combining methods, voting and stacking, were tested. Both show advantages and disadvantages; however, in general, they appear to provide a statistically significant improvement over the best single model. The critical benefit of applying the stacking method to automate the model selection process is that it gives more weight to top-performing models and is less affected by weak ones. Finally, 'SuspectRate', an online malicious URL detection system, was built to offer a service that gives the probability that a tweet with an attached URL is suspicious. A key feature of this system is that it can dynamically retrain and expand current models.
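The voting method mentioned in this abstract can be sketched in a few lines. This is a generic illustration of hard and soft voting over per-model probabilities, not the thesis's implementation. The function names and the 0.5 threshold are assumptions. Stacking differs in that a second-level model learns how to combine the base outputs, and it is not shown here.

```python
def soft_vote(probabilities, threshold=0.5):
    """Average the models' malicious-probabilities, then threshold the mean."""
    mean = sum(probabilities) / len(probabilities)
    return mean, mean >= threshold


def hard_vote(probabilities, threshold=0.5):
    """Each model casts one vote; a strict majority flags the post."""
    votes = sum(1 for p in probabilities if p >= threshold)
    return votes * 2 > len(probabilities)
```

The two schemes can disagree. With base-model outputs of 0.9, 0.4, and 0.45, soft voting flags the post because one confident model pulls the mean above the threshold, while hard voting does not because only one of the three models votes "malicious".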

    A Risk management framework for the BYOD environment

    Computer networks in organisations today have different layers of connections, which are either domain connections or external connections. The hybrid network contains the standard domain connections, cloud-based connections, and "bring your own device" (BYOD) connections, together with the devices and network connections of the Internet of Things (IoT). All these technologies will need to be incorporated in the Oman Vision 2040 strategy, which will involve changing several cities to smart cities. To implement this strategy, artificial intelligence, cloud computing, BYOD, and IoT will be adopted. This research focuses on the adoption of BYOD in the Oman context. BYOD has advantages for organisations, such as increasing productivity and reducing costs. However, these benefits come with security risks and privacy concerns, with users being the main contributors to these risks. The aim of this research is to develop a risk management and security framework for the BYOD environment to minimise these risks. The proposed framework is designed to detect and predict risks by using MDM event logs and function logs. The chosen methodology combines qualitative and quantitative approaches, known as a mixed-methods approach. The approach adopted in this research will identify the latest threats and risks experienced in BYOD environments. This research also investigates the level of user awareness of BYOD security methods. The proposed framework will enhance current risk management techniques by improving risk detection and the prediction of threats, as well as enabling BYOD risk management systems to generate notifications and recommendations of possible preventive or mitigating actions to deal with them.

    Effective Features and Machine Learning Methods for Document Classification

    Document classification is used in a variety of applications, such as phishing and fraud detection, news categorisation, and information retrieval. This thesis aims to provide novel solutions to several important problems presented by document classification. First, an improved Principal Components Analysis (PCA), based on similarity and correlation criteria instead of covariance, is proposed, which aims to capture a low-dimensional feature subset that facilitates improved performance in text classification. The experimental results have demonstrated the advantages and usefulness of the proposed method for text classification in high-dimensional feature space in terms of the number of features required to achieve the best classification accuracy. Second, two hybrid feature-subset selection methods are proposed based on the combination (via either union or intersection) of the results of both supervised (in one method) and unsupervised (in the other method) filter approaches prior to the use of a wrapper, leading to a low-dimensional feature subset that can achieve both high classification accuracy and good interpretability, and that takes less processing time than most current methods. The experimental results have demonstrated the effectiveness of the proposed methods for feature subset selection in high-dimensional feature space in terms of the number of selected features and the processing time spent to achieve the best classification accuracy. Third, a class-specific (supervised) pre-trained approach based on a sparse autoencoder is proposed for acquiring a low-dimensional interesting structure of relevant features, which can be used for high-performance document classification. The experimental results have demonstrated the merit of this proposed method for document classification in high-dimensional feature space, in terms of the limited number of features required to achieve good classification accuracy.
Finally, deep classifier structures associated with a stacked autoencoder (SAE) for higher-level feature extraction are investigated, aiming to overcome the difficulties experienced in training deep neural networks with limited training data in high-dimensional feature space, such as overfitting and vanishing/exploding gradients. This investigation has resulted in a three-stage learning algorithm for training deep neural networks. In comparison with support vector machines (SVMs) combined with SAE and a Deep Multilayer Perceptron (DMLP) with random weight initialisation, the experimental results have shown the advantages and effectiveness of the proposed three-stage learning algorithm.
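The combination of filter results by union or intersection, described in this abstract as the step before the wrapper, can be sketched as below. The scoring dictionaries, the cut-off `k`, and the function names are hypothetical. The abstract does not say which filter criteria the thesis uses, and the wrapper stage is omitted.

```python
def top_k(scores, k):
    """Return the k features with the highest filter score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def combine_filters(scores_a, scores_b, k, mode="union"):
    """Merge the top-k picks of two filter rankings before a wrapper runs."""
    picked_a, picked_b = top_k(scores_a, k), top_k(scores_b, k)
    if mode == "union":
        return picked_a | picked_b  # keep a feature if either filter picks it
    if mode == "intersection":
        return picked_a & picked_b  # keep a feature only if both pick it
    raise ValueError(f"unknown mode: {mode!r}")
```

Union yields a larger candidate set for the wrapper to search, while intersection yields a smaller and stricter one, which is the trade-off between coverage and processing time.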