12 research outputs found

    ANALYZING TEMPORAL PATTERNS IN PHISHING EMAIL TOPICS

    In 2020, the Federal Bureau of Investigation (FBI) found phishing to be the most common cybercrime, with a record number of complaints from Americans reporting losses exceeding $4.1 billion. Various phishing prevention methods exist; however, these methods are usually reactionary in nature, as they activate only after a phishing campaign has been launched. Priming people ahead of time with knowledge of which phishing topics are more likely to occur could be an effective proactive prevention strategy. The volume of phishing emails has been noted to increase around key calendar dates and during times of uncertainty. This thesis aimed to create a classifier to predict which phishing topics have an increased likelihood of occurring in connection with an external event. After distilling around 1.2 million phishing emails until only meaningful words remained, a Latent Dirichlet Allocation (LDA) topic model uncovered 90 latent phishing topics. On average, human evaluators agreed with the composition of a topic 74% of the time in one of the topic evaluation tasks, showing that human judgment accorded with the topics produced by the LDA model. Each topic was turned into a time series by creating a frequency count over the dataset's two-year timespan. This time series was then converted into an intensity count to highlight days of increased phishing activity. All phishing topics were analyzed and reviewed for influencing events; ten topics were identified whose intensities could plausibly have been influenced by external events. After the intervention analysis, however, none of the selected topics were found to correlate with the identified external events. The analysis stopped there, and no predictive classifiers were pursued. With this dataset, temporal patterns coupled with external events were not able to predict the likelihood of a phishing attack.
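The pipeline described above (topic model over distilled e-mail text, then a per-day frequency count per topic) can be sketched on toy data. This is a minimal illustration with scikit-learn, not the thesis code; the e-mail bodies and dates below are hypothetical.

```python
# Sketch: fit an LDA topic model on toy "distilled" phishing texts, then
# count the dominant topic per day to build a frequency time series.
from collections import Counter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical distilled e-mail bodies with their (fake) arrival dates.
emails = [
    ("2020-03-01", "urgent invoice payment overdue wire transfer"),
    ("2020-03-01", "invoice payment account wire transfer urgent"),
    ("2020-03-02", "package delivery failed click tracking link"),
    ("2020-03-02", "delivery package tracking link failed attempt"),
    ("2020-03-03", "password reset account verify login credentials"),
]

dates, texts = zip(*emails)
dtm = CountVectorizer().fit_transform(texts)          # document-term matrix
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(dtm)

# Dominant topic per e-mail, then a per-(day, topic) frequency count.
dominant = lda.transform(dtm).argmax(axis=1)
series = Counter()
for day, topic in zip(dates, dominant):
    series[(day, int(topic))] += 1
print(dict(series))
```

The thesis used 90 topics over two years of data; here three topics over three days suffice to show the shape of the time series.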

    Automatic generation of meta classifiers with large levels for distributed computing and networking

    This paper is devoted to a case study of a new construction of classifiers, called automatically generated multi-level meta classifiers (AGMLMC). The construction combines diverse meta classifiers in a new way to create a unified system, and it can be generated automatically, producing classifiers with many levels: different meta classifiers are incorporated as low-level integral parts of another meta classifier at the top level. It is intended for distributed computing and networking, since AGMLMC classifiers are unified classifiers with many parts that can operate in parallel, which makes them easy to adopt in distributed applications. This paper introduces the new construction and undertakes an experimental study of its performance, with a case study on its effectiveness in the detection and filtering of phishing emails, a possible important application area for such large and distributed classification systems. Our experiments investigate the effectiveness of combining diverse meta classifiers into one AGMLMC classifier in this case study. The results show that the new classifiers with many levels achieved better performance than the base classifiers and simple meta classifiers, demonstrating that the new technique can be applied to increase performance when diverse meta classifiers are included in the system.
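The core idea of meta classifiers nested inside a higher-level meta classifier can be illustrated with scikit-learn's stacking ensemble. This is a hedged sketch of the general pattern, not the paper's AGMLMC construction: two level-1 stacked classifiers become base parts of a level-2 combiner.

```python
# Sketch: two simple meta classifiers (stacked ensembles of diverse base
# learners) are used as low-level parts of a top-level meta classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

def level1():
    # A simple meta classifier: diverse base learners + a combiner.
    return StackingClassifier(
        estimators=[("rf", RandomForestClassifier(n_estimators=25, random_state=0)),
                    ("nb", GaussianNB())],
        final_estimator=LogisticRegression())

# Level 2: two level-1 meta classifiers feed a top-level combiner.
top = StackingClassifier(
    estimators=[("m1", level1()), ("m2", level1())],
    final_estimator=LogisticRegression())
top.fit(X_tr, y_tr)
print("accuracy:", top.score(X_te, y_te))
```

Because every part of a stacked ensemble can be trained and queried independently, such constructions lend themselves to the parallel, distributed deployment the paper targets.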

    Clustering and Topic Modelling: A New Approach for Analysis of National Cyber security Strategies

    The consequences of cybersecurity attacks can be severe for nation states and their people. Recently, many nations have revisited their national cybersecurity strategies (NCSs) to ensure that their cybersecurity capabilities are sufficient to protect their citizens and cyberspace. This study is an initial attempt to compare NCSs by using clustering and topic modelling methods to investigate the similarities and differences between them. We also aimed to identify the underlying topics that appear in NCSs. We collected and examined 60 NCSs developed during 2003-2016. Drawing on institutional theory, we found that membership in international institutions could be a determinant factor for harmonization and integration between NCSs. By applying a hierarchical clustering method, we observed stronger similarities between NCSs developed by EU or NATO members. We also found that public-private partnerships, protection of critical infrastructure, and defending citizens and public IT systems are among the topics that have received considerable attention in the majority of NCSs. Finally, we argue that the topic modelling method LDA can be used by policy makers and governments as an automated technique for analyzing and understanding textual documents during the development and review of national strategies and policies.
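The study's clustering step can be sketched as follows: vectorize each strategy document and apply hierarchical (agglomerative) clustering on document similarity. The excerpts below are hypothetical placeholders, not actual NCS text.

```python
# Sketch: TF-IDF vectors for toy strategy excerpts, then average-linkage
# hierarchical clustering on cosine distance with SciPy.
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

docs = [
    "critical infrastructure protection public private partnership",
    "protection of critical infrastructure and public private cooperation",
    "military cyber defence command national armed forces",
    "armed forces cyber command and military defence doctrine",
]
X = TfidfVectorizer().fit_transform(docs).toarray()
Z = linkage(X, method="average", metric="cosine")     # dendrogram structure
labels = fcluster(Z, t=2, criterion="maxclust")        # cut into 2 clusters
print(labels)  # similar strategies share a cluster label
```

Cutting the dendrogram at different heights lets an analyst inspect groupings at several granularities, which is how membership-driven similarity (e.g. among EU or NATO strategies) can surface.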

    Informing, simulating experience, or both: A field experiment on phishing risks

    Cybersecurity cannot be ensured with mere technical solutions. Hackers often use fraudulent emails to simply ask people for their password in order to breach organizations. This technique, called phishing, is a major threat to many organizations. A typical prevention measure is to inform employees, but is there a better way to reduce phishing risks? Experience and feedback have often been claimed to be effective in helping people make better decisions. In a large field experiment involving more than 10,000 employees of a Dutch ministry, we tested the effect of information provision, simulated experience, and their combination on reducing the risk of falling for a phishing attack. Both approaches substantially reduced the proportion of employees giving away their password. Combining both interventions did not have a larger impact.

    Avoiding the Phishing Bait: The Need for Conventional Countermeasures for Mobile Users

    According to the international Anti-Phishing Working Group (APWG), phishing activities have risen significantly over the last few years, and users are becoming more susceptible to online and mobile fraud. While Machine Learning (ML) techniques have the potential for building technical anti-phishing models, a majority of them have yet to be applied in a real-time environment, and ML models also require domain experts to interpret their results. This gives conventional techniques a vital role as supportive tools for a wider audience, especially novice users, in order to reduce the rate of phishing attacks. Our paper aims at raising awareness and educating users on phishing in general, and mobile phishing in particular, from a conventional perspective, unlike existing reviews that are based on data mining and machine learning. This will equip individuals with knowledge and skills that may prevent phishing in the wider mobile-user community.

    Phishing detection : methods based on natural language processing

    Doctoral thesis, Universidade de Brasília, Faculdade de Tecnologia, Departamento de Engenharia Elétrica, 2020.
    In phishing attempts, the attacker pretends to be a trusted person or entity and, through this false impersonation, tries to obtain sensitive information from a target. A typical example is one in which a scammer poses as a well-known institution, claiming that a registration must be updated or that immediate action is required on the client side, and to that end requests personal and financial data. A variety of resources, such as fake web pages, the installation of malicious code, or form filling, are employed along with the e-mail itself to carry out this type of action. A phishing campaign usually starts with an e-mail, so the detection of this type of e-mail is critical. Since phishing aims to look like a legitimate message, detection techniques based only on filtering rules, such as blacklists and heuristics, have limited effectiveness and can potentially be evaded. Through data-driven techniques, mainly those focused on text processing, features can be extracted from the e-mail body and header that capture the similarity and significance relations among the words in a specific e-mail, as well as across the entire set of message samples. The most common approach to this type of feature engineering is based on Vector Space Models (VSM). However, since a VSM derived from the Document-Term Matrix (DTM) has as many dimensions as there are terms in the corpus, and since not all terms are present in every e-mail, the feature engineering step of the phishing e-mail detection process has to address issues related to the "Curse of Dimensionality", sparsity, and the information that can be obtained from the textual context (how to improve it and reveal its latent features).
    This thesis proposes an approach to phishing detection consisting of four methods. They use combined techniques to obtain more representative features from e-mail texts, which are fed as input attributes to classification algorithms to detect phishing e-mails correctly. The methods are based on natural language processing (NLP) and machine learning (ML), with feature engineering strategies that increase the precision, recall, and accuracy of the adopted algorithms' predictions and that address the VSM/DTM problems. Method 1 uses all the features obtained from the DTM in the classification algorithms, while the other methods use different dimensionality reduction strategies to deal with the issues above. Method 2 uses feature selection through the chi-square and mutual information measures. Method 3 implements feature extraction through Principal Component Analysis (PCA), Latent Semantic Analysis (LSA), and Latent Dirichlet Allocation (LDA). Method 4 is based on word embeddings, with representations obtained from the Word2Vec, FastText, and Doc2Vec techniques.
    Three datasets were employed (Dataset 1, the main dataset, plus Dataset 2 and Dataset 3). Using Dataset 1, at their respective best results, Method 1 achieved an F1 score of 99.74%, while the other three methods attained a remarkable 100% on all main utility measures, which is, to the best of our knowledge, the highest result in phishing detection research for an accredited dataset based only on the body of the e-mails. The methods/perspectives that reached 100% on Dataset 1 (the chi-square perspective of Method 2, using one hundred features; the LSA perspective of Method 3, using twenty-five features; and the Word2Vec and FastText perspectives of Method 4) were evaluated in two further contexts. Considering both the e-mail bodies and headers, on the first additional dataset (Dataset 2), the Word2Vec perspective obtained a 99.854% F1 score at its best mark, surpassing the current best result for this dataset. Using only the e-mail bodies, as was done for Dataset 1, the evaluation on Dataset 3 also reached the best marks for that collection: all four perspectives outperformed the state-of-the-art results, with the FastText perspective achieving the best mark, an F1 score of 98.43%. To the best of our knowledge, these results are therefore the highest in phishing detection research for both additional accredited datasets.
    These results are due not only to the excellent performance of the classification algorithms, but also to the proposed combination of techniques: feature engineering processes such as text processing procedures (for instance, the lemmatization step), improved learning techniques for re-sampling and cross-validation, and hyper-parameter configuration estimation. Thus, the proposed methods, their perspectives, and the complete plan of action demonstrated relevant performance in distinguishing between ham and phishing e-mails. They also constitute a substantial contribution to this area and to other NLP research that needs to address or avoid problems related to the VSM/DTM representation, since they generate a dense, low-dimensional representation of the evaluated texts.

    StoryNet: A 5W1H-based knowledge graph to connect stories

    Title from PDF of title page, viewed January 19, 2022. Thesis advisor: Yugyung Lee. Vita. Includes bibliographical references (pages 149-164). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2021.
    Stories are a powerful medium through which the human community has exchanged information since the dawn of the information age. They have taken multiple forms, such as articles, movies, books, plays, short films, magazines, and mythologies. With the ever-growing complexity of information representation, exchange, and interaction, it has become highly important to find ways to convey stories more effectively. In a world that is diverging more and more, it is harder to draw parallels and connect information from around the globe. Even though there have been efforts to consolidate information on a large scale, such as Wikipedia and Wikidata, they are devoid of real-time happenings. Building on recent advances in Natural Language Processing (NLP), we propose a framework to connect stories together, making it easier to find the links between them and thereby helping us understand and explore those links and the possibilities that revolve around them. Our framework is based on the 5W + 1H (What, Who, Where, When, Why, and How) format, which represents stories in a form that is both easily understandable by humans and accurately generated by deep learning models. We used 311-call and cybersecurity datasets as case studies, for which NLP techniques such as classification, topic modelling, question answering, and question generation were used along with the 5W1H framework to segregate the stories into clusters. The framework is generic and can be applied to any field. We evaluated two approaches for generating results: training-based and rule-based.
    For the rule-based approach, we used Stanford NLP parsers to identify patterns for the 5W + 1H terms; for the training-based approach, BERT embeddings were used. The two were compared using an ensemble score (the average of CoLA, SST-2, MRPC, QQP, STS-B, MNLI, QNLI, and RTE) along with BLEU and ROUGE scores. For the training-based analysis, several models were studied - BERT, RoBERTa, XLNet, ALBERT, ELECTRA, and the AllenNLP Transformer QA - with the CVE, NVD, SQuAD v1.1, and SQuAD v2.0 datasets, and compared against custom annotations for identifying the 5W + 1H. We present the performance and accuracy of both approaches in the results section. Our method boosted the score from 30% (baseline) to 91% when trained on the 5W+1H annotations.
    Contents: Introduction -- Related work -- The 5W1H Framework and the models included -- StoryNet Application: Evaluation and Results -- Conclusion and Future Work.
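The 5W + 1H slotting can be made concrete with a deliberately tiny rule. The thesis used Stanford NLP parsers and BERT for this; the sketch below only maps a question's leading interrogative to its slot, purely to illustrate the framing.

```python
# Toy rule-based 5W+1H slotting: classify a question by its first word.
SLOTS = {"what": "What", "who": "Who", "where": "Where",
         "when": "When", "why": "Why", "how": "How"}

def slot_of(question: str) -> str:
    """Return the 5W1H slot suggested by the question's leading word."""
    first = question.strip().split()[0].lower().rstrip("?,")
    return SLOTS.get(first, "Unknown")

print(slot_of("Who reported the vulnerability?"))    # Who
print(slot_of("When was the flaw disclosed?"))       # When
```

Real stories, of course, rarely lead with an interrogative, which is why the thesis relies on dependency parses and trained QA models rather than surface patterns like this one.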

    Real-Time Client-Side Phishing Prevention

    In the last decades, researchers and companies have been working to deploy effective solutions to steer users away from phishing websites. These solutions are typically based on servers or blacklisting systems. Such approaches have several drawbacks: they compromise user privacy, rely on off-line analysis, are not robust against adaptive attacks, and do not provide much guidance to users in their warnings. To address these limitations, we developed fast, real-time, client-side phishing prevention software that implements a phishing detection technique recently developed by Marchal et al. It extracts information from the visited webpage and detects whether it is a phish, warning the user if so. It is also able to detect the website that the phish is trying to mimic and to propose a redirection to the legitimate domain. Furthermore, to attest to the validity of our solution, we performed two user studies to evaluate the usability of the interface and the program's impact on user experience.
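Client-side detection typically starts from lexical features of the visited page's URL. The sketch below is a simplified, hypothetical scoring heuristic using only the standard library; it is not Marchal et al.'s technique, whose feature set and model are far richer.

```python
# Hypothetical client-side check: score a URL with a few lexical features
# commonly cited in phishing-detection literature.
from urllib.parse import urlparse

def suspicion_score(url: str) -> int:
    host = urlparse(url).hostname or ""
    score = 0
    score += host.count(".") > 3               # many subdomain levels
    score += "-" in host                       # hyphenated look-alike domain
    score += host.replace(".", "").isdigit()   # raw IP address as host
    score += len(url) > 75                     # unusually long URL
    return score

print(suspicion_score("https://login-secure.example.bank.account.verify.com/x"))
print(suspicion_score("https://example.com/about"))
```

Because everything runs locally on the URL and page the user already has, no browsing data leaves the machine, which is the privacy advantage the abstract claims over server-side and blacklist approaches.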

    Misperceptions of Uncertainty and Their Applications to Prevention

    This thesis studies how people misperceive risk and uncertainty, and how this cognitive bias affects individuals' preventive actions. Chapter 1 shows, in a lab experiment, that the way rare events are presented affects how large people perceive them to be: people perceive rare events as larger than they actually are when those events are presented to them separately rather than all together. Chapter 2 shows theoretically that the same phenomenon, probability weighting, makes people both over-insure and under-invest in prevention. Chapter 3, with an application to cybersecurity, analyses in a field experiment an intervention aiming at increasing prevention at the organizational level: I test whether communicating information more effectively, or letting employees experience a simulated phishing attack, helps reduce the rate of falling for phishing attacks. Chapter 4 deals with the issue that people's judgements of risk might differ across contexts; in a lab experiment, it shows that sexual context has an impact on ambiguity attitudes.
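The probability-weighting mechanism that Chapter 2 invokes can be made concrete with the standard weighting function from prospect theory. This sketch uses the Tversky-Kahneman (1992) form with their estimated parameter; the thesis's own model may differ.

```python
# Tversky-Kahneman probability weighting: small probabilities are
# overweighted, moderate ones underweighted.
def w(p: float, gamma: float = 0.61) -> float:
    """w(p) = p^g / (p^g + (1-p)^g)^(1/g), with g = 0.61 as in TK (1992)."""
    return p**gamma / (p**gamma + (1 - p) ** gamma) ** (1 / gamma)

print(w(0.01))  # > 0.01: a rare event feels more likely than it is
print(w(0.50))  # < 0.50: a moderate probability feels less likely
```

Overweighting w(p) > p for small p makes insuring against rare losses attractive, while the same distortion flattens the perceived benefit of marginal prevention effort, which is the thesis's explanation for the over-insure/under-prevent pattern.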