1,313 research outputs found
Spam Filtering using Support Vector Machine
The traditional anti-spam techniques like Black and White List is not up to the mark in current scenario. The goal of Spam Classification is to distinguish between spam and legitimate mail message. But with the popularization of the Internet, it is challenging to develop spam filters that can effectively eliminate the increasing volumes of unwanted mails automatically before they enter a user\u27s mailbox. Many researchers have been trying to separate spam from legitimate emails using machine learning algorithms based on statistical learning methods. In this paper, we evaluate the performance of Non Linear SVM based classifiers with various kernel functions over Enron Dataset
Personalized question-based cybersecurity recommendation systems
En ces temps de pandémie Covid19, une énorme quantité de l’activité humaine est modifiée pour se faire à distance, notamment par des moyens électroniques. Cela rend plusieurs personnes et services vulnérables aux cyberattaques, d’où le besoin d’une éducation généralisée ou du moins accessible sur la cybersécurité. De nombreux efforts sont entrepris par les chercheurs, le gouvernement et les entreprises pour protéger et assurer la sécurité des individus contre les pirates et les cybercriminels. En raison du rôle important joué par les systèmes de recommandation dans la vie quotidienne de l'utilisateur, il est intéressant de voir comment nous pouvons combiner les systèmes de cybersécurité et de recommandation en tant que solutions alternatives pour aider les utilisateurs à comprendre les cyberattaques auxquelles ils peuvent être confrontés. Les systèmes de recommandation sont couramment utilisés par le commerce électronique, les réseaux sociaux et les plateformes de voyage, et ils sont basés sur des techniques de systèmes de recommandation traditionnels.
Au vu des faits mentionnés ci-dessus, et le besoin de protéger les internautes, il devient important de fournir un système personnalisé, qui permet de partager les problèmes, d'interagir avec un système et de trouver des recommandations.
Pour cela, ce travail propose « Cyberhelper », un système de recommandation de cybersécurité personnalisé basé sur des questions pour la sensibilisation à la cybersécurité.
De plus, la plateforme proposée est équipée d'un algorithme hybride associé à trois différents algorithmes basés sur la connaissance, les utilisateurs et le contenu qui garantit une recommandation personnalisée optimale en fonction du modèle utilisateur et du contexte. Les résultats expérimentaux montrent que la précision obtenue en appliquant l'algorithme proposé est bien supérieure à la précision obtenue en utilisant d'autres mécanismes de système de recommandation traditionnels. Les résultats suggèrent également qu'en adoptant l'approche proposée, chaque utilisateur peut avoir une expérience utilisateur unique, ce qui peut l'aider à comprendre l'environnement de cybersécurité.With the proliferation of the virtual universe and the multitude of services provided by the World Wide Web, a major concern arises: Security and privacy have never been more in jeopardy. Nowadays, with the Covid 19 pandemic, the world faces a new reality that pushed the majority of the workforce to telecommute. This thereby creates new vulnerabilities for cyber attackers to exploit. It’s important now more than ever, to educate and offer guidance towards good cybersecurity hygiene. In this context, a major effort has been dedicated by researchers, governments, and businesses alike to protect people online against hackers and cybercriminals.
With a focus on strengthening the weakest link in the cybersecurity chain which is the human being, educational and awareness-raising tools have been put to use. However, most researchers focus on the “one size fits all” solutions which do not focus on the intricacies of individuals. This work aims to overcome that by contributing a personalized question-based recommender system. Named “Cyberhelper”, this work benefits from an existing mature body of research on recommender system algorithms along with recent research on non-user-specific question-based recommenders.
The reported proof of concept holds potential for future work in adapting Cyberhelper as an everyday assistant for different types of users and different contexts
Clustering and Classification of Email Contents
Information users depend heavily on emails\u27 system as one of the major sources of communication. Its importance and usage are continuously growing despite the evolution of mobile applications, social networks, etc. Emails are used on both the personal and professional levels. They can be considered as official documents in communication among users. Emails\u27 data mining and analysis can be conducted for several purposes such as: Spam detection and classification, subject classification, etc. In this paper, a large set of personal emails is used for the purpose of folder and subject classifications. Algorithms are developed to perform clustering and classification for this large text collection. Classification based on NGram is shown to be the best for such large text collection especially as text is Bi-language (i.e. with English and Arabic content)
Knowledge-Augmented Large Language Models for Personalized Contextual Query Suggestion
Large Language Models (LLMs) excel at tackling various natural language
tasks. However, due to the significant costs involved in re-training or
fine-tuning them, they remain largely static and difficult to personalize.
Nevertheless, a variety of applications could benefit from generations that are
tailored to users' preferences, goals, and knowledge. Among them is web search,
where knowing what a user is trying to accomplish, what they care about, and
what they know can lead to improved search experiences. In this work, we
propose a novel and general approach that augments an LLM with relevant context
from users' interaction histories with a search engine in order to personalize
its outputs. Specifically, we construct an entity-centric knowledge store for
each user based on their search and browsing activities on the web, which is
then leveraged to provide contextually relevant LLM prompt augmentations. This
knowledge store is light-weight, since it only produces user-specific aggregate
projections of interests and knowledge onto public knowledge graphs, and
leverages existing search log infrastructure, thereby mitigating the privacy,
compliance, and scalability concerns associated with building deep user
profiles for personalization. We then validate our approach on the task of
contextual query suggestion, which requires understanding not only the user's
current search context but also what they historically know and care about.
Through a number of experiments based on human evaluation, we show that our
approach is significantly better than several other LLM-powered baselines,
generating query suggestions that are contextually more relevant, personalized,
and useful
Analyzing Social and Stylometric Features to Identify Spear phishing Emails
Spear phishing is a complex targeted attack in which, an attacker harvests
information about the victim prior to the attack. This information is then used
to create sophisticated, genuine-looking attack vectors, drawing the victim to
compromise confidential information. What makes spear phishing different, and
more powerful than normal phishing, is this contextual information about the
victim. Online social media services can be one such source for gathering vital
information about an individual. In this paper, we characterize and examine a
true positive dataset of spear phishing, spam, and normal phishing emails from
Symantec's enterprise email scanning service. We then present a model to detect
spear phishing emails sent to employees of 14 international organizations, by
using social features extracted from LinkedIn. Our dataset consists of 4,742
targeted attack emails sent to 2,434 victims, and 9,353 non targeted attack
emails sent to 5,912 non victims; and publicly available information from their
LinkedIn profiles. We applied various machine learning algorithms to this
labeled data, and achieved an overall maximum accuracy of 97.76% in identifying
spear phishing emails. We used a combination of social features from LinkedIn
profiles, and stylometric features extracted from email subjects, bodies, and
attachments. However, we achieved a slightly better accuracy of 98.28% without
the social features. Our analysis revealed that social features extracted from
LinkedIn do not help in identifying spear phishing emails. To the best of our
knowledge, this is one of the first attempts to make use of a combination of
stylometric features extracted from emails, and social features extracted from
an online social network to detect targeted spear phishing emails.Comment: Detection of spear phishing using social media feature
Combating User Misbehavior on Social Media
Social media encourages user participation and facilitates user’s self-expression like never before. While enriching user behavior in a spectrum of means, many social media platforms have become breeding grounds for user misbehavior. In this dissertation we focus on understanding and combating three specific threads of user misbehaviors that widely exist on social media — spamming, manipulation, and distortion.
First, we address the challenge of detecting spam links. Rather than rely on traditional blacklist-based or content-based methods, we examine the behavioral factors of both who is posting the link and who is clicking on the link. The core intuition is that these behavioral signals may be more difficult to manipulate than traditional signals. We find that this purely behavioral approach can achieve good performance for robust behavior-based spam link detection.
Next, we deal with uncovering manipulated behavior of link sharing. We propose a four-phase approach to model, identify, characterize, and classify organic and organized groups who engage in link sharing. The key motivating insight is that group-level behavioral signals can distinguish manipulated user groups. We find that levels of organized behavior vary by link type and that the proposed approach achieves good performance measured by commonly-used metrics.
Finally, we investigate a particular distortion behavior: making bullshit (BS) statements on social media. We explore the factors impacting the perception of BS and what leads users to ultimately perceive and call a post BS. We begin by preparing a crowdsourced collection of real social media posts that have been called BS. We then build a classification model that can determine what posts are more likely to be called BS. Our experiments suggest our classifier has the potential of leveraging linguistic cues for detecting social media posts that are likely to be called BS.
We complement these three studies with a cross-cutting investigation of learning user topical profiles, which can shed light into what subjects each user is associated with, which can benefit the understanding of the connection between user and misbehavior. Concretely, we propose a unified model for learning user topical profiles that simultaneously considers multiple footprints and we show how these footprints can be embedded in a generalized optimization framework.
Through extensive experiments on millions of real social media posts, we find our proposed models can effectively combat user misbehavior on social media
Syntactic and Semantic Analysis and Visualization of Unstructured English Texts
People have complex thoughts, and they often express their thoughts with complex sentences using natural languages. This complexity may facilitate efficient communications among the audience with the same knowledge base. But on the other hand, for a different or new audience this composition becomes cumbersome to understand and analyze. Analysis of such compositions using syntactic or semantic measures is a challenging job and defines the base step for natural language processing.
In this dissertation I explore and propose a number of new techniques to analyze and visualize the syntactic and semantic patterns of unstructured English texts.
The syntactic analysis is done through a proposed visualization technique which categorizes and compares different English compositions based on their different reading complexity metrics. For the semantic analysis I use Latent Semantic Analysis (LSA) to analyze the hidden patterns in complex compositions. I have used this technique to analyze comments from a social visualization web site for detecting the irrelevant ones (e.g., spam). The patterns of collaborations are also studied through statistical analysis.
Word sense disambiguation is used to figure out the correct sense of a word in a sentence or composition. Using textual similarity measure, based on the different word similarity measures and word sense disambiguation on collaborative text snippets from social collaborative environment, reveals a direction to untie the knots of complex hidden patterns of collaboration
- …