2,045 research outputs found

    A Fake Profile Detection Model Using Multistage Stacked Ensemble Classification

    Fake profile identification on social media platforms is essential for preserving a reliable online community. Previous studies have primarily used conventional classifiers for fake account identification on social networking sites, neglecting feature selection and class balancing to enhance performance. This study introduces a novel multistage stacked ensemble classification model to enhance fake profile detection accuracy, especially in imbalanced datasets. The model comprises three phases: feature selection, base learning, and meta-learning for classification. The novelty of the work lies in utilizing chi-squared feature-class association-based feature selection, combining stacked ensemble and cost-sensitive learning. The research findings indicate that the proposed model significantly enhances fake profile detection efficiency. Employing cost-sensitive learning enhances accuracy on the Facebook, Instagram, and Twitter spam datasets with 95%, 98.20%, and 81% precision, outperforming conventional and advanced classifiers. It is demonstrated that the proposed model has the potential to enhance the security and reliability of online social networks, compared with existing models
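The chi-squared feature-class association step named in this abstract can be illustrated with a minimal sketch. It assumes binary features and binary class labels; the toy columns and the function names `chi2_score` and `select_top_k` are hypothetical and are not taken from the paper, whose stacked ensemble and cost-sensitive stages are not reproduced here.

```python
from collections import Counter


def chi2_score(feature, labels):
    """Pearson chi-squared statistic of the feature-value vs. class contingency table."""
    n = len(labels)
    observed = Counter(zip(feature, labels))
    feature_totals = Counter(feature)
    class_totals = Counter(labels)
    score = 0.0
    for f_value, f_count in feature_totals.items():
        for c_value, c_count in class_totals.items():
            # Expected cell count if feature and class were independent.
            expected = f_count * c_count / n
            score += (observed[(f_value, c_value)] - expected) ** 2 / expected
    return score


def select_top_k(columns, labels, k):
    """Rank feature columns by chi-squared score and keep the k strongest."""
    scores = {name: chi2_score(col, labels) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy data: 1 = fake profile, 0 = genuine profile.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
columns = {
    "has_default_avatar": [1, 1, 1, 1, 0, 0, 0, 0],  # perfectly associated with the class
    "has_bio": [1, 0, 1, 0, 1, 0, 1, 0],             # independent of the class
}

print(select_top_k(columns, labels, k=1))
```

A feature that is independent of the class scores zero, so it is the first to be dropped when only the top-ranked features are kept.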

    Causal Strategic Classification: A Tale of Two Shifts

    When users can benefit from certain predictive outcomes, they may be prone to act to achieve those outcomes, e.g., by strategically modifying their features. The goal in strategic classification is therefore to train predictive models that are robust to such behavior. However, the conventional framework assumes that changing features does not change actual outcomes, which depicts users as "gaming" the system. Here we remove this assumption, and study learning in a causal strategic setting where true outcomes do change. Focusing on accuracy as our primary objective, we show how strategic behavior and causal effects underlie two complementing forms of distribution shift. We characterize these shifts, and propose a learning algorithm that balances between these two forces over time and permits end-to-end training. Experiments on synthetic and semi-synthetic data demonstrate the utility of our approach.

    Teksto analizės įrankio „Voyant Tools“ panaudojimas mokslinės informacijos analizei [Using the text analysis tool "Voyant Tools" for the analysis of scientific information]

    This article describes the use of “Voyant Tools”, an open access text analysis application, to examine a corpus of articles from open access journals dealing with the topic of digital humanities. The corpus consisted of 404 articles recorded in the “Clarivate Analytics Web of Science” and “Scopus ScienceDirect” databases. The authors discuss how “Voyant Tools” helps to identify the dominant fields of research through quantitative methods and to reveal the main discourse themes using distant reading and interactive reading capabilities. They also identify some problems encountered during the analyses and discuss the usefulness of data visualization for research and interpretation. Computer tools can be useful for experienced researchers who are interested in quantitative text analysis, as well as for beginners, as they provide an opportunity to acquire basic knowledge that will lead to a deeper interest in textual analysis methods.

    The Looming Threat of Fake and LLM-generated LinkedIn Profiles: Challenges and Opportunities for Detection and Prevention

    In this paper, we present a novel method for detecting fake and Large Language Model (LLM)-generated profiles in the LinkedIn Online Social Network immediately upon registration and before establishing connections. Early fake profile identification is crucial to maintaining the platform's integrity since it prevents imposters from acquiring the private and sensitive information of legitimate users and from gaining an opportunity to increase their credibility for future phishing and scamming activities. This work uses textual information provided in LinkedIn profiles and introduces the Section and Subsection Tag Embedding (SSTE) method to enhance the discriminative characteristics of these data for distinguishing between legitimate profiles and those created by imposters manually or by using an LLM. Additionally, the dearth of a large publicly available LinkedIn dataset motivated us to collect 3600 LinkedIn profiles for our research. We will release our dataset publicly for research purposes. This is, to the best of our knowledge, the first large publicly available LinkedIn dataset for fake LinkedIn account detection. Within our paradigm, we assess static and contextualized word embeddings, including GloVe, Flair, BERT, and RoBERTa. We show that the suggested method can distinguish between legitimate and fake profiles with an accuracy of about 95% across all word embeddings. In addition, we show that SSTE has a promising accuracy for identifying LLM-generated profiles, despite the fact that no LLM-generated profiles were employed during the training phase, and can achieve an accuracy of approximately 90% when only 20 LLM-generated profiles are added to the training set. It is a significant finding since the proliferation of several LLMs in the near future makes it extremely challenging to design a single system that can identify profiles created with various LLMs. Comment: 33rd ACM Conference on Hypertext and Social Media (HT '23)

    Mining Butterflies in Streaming Graphs

    This thesis introduces two main-memory systems sGrapp and sGradd for performing the fundamental analytic tasks of biclique counting and concept drift detection over a streaming graph. A data-driven heuristic is used to architect the systems. To this end, initially, the growth patterns of bipartite streaming graphs are mined and the emergence principles of streaming motifs are discovered. Next, the discovered principles are (a) explained by a graph generator called sGrow; and (b) utilized to establish the requirements for efficient, effective, explainable, and interpretable management and processing of streams. sGrow is used to benchmark stream analytics, particularly in the case of concept drift detection. sGrow displays robust realization of streaming growth patterns independent of initial conditions, scale and temporal characteristics, and model configurations. Extensive evaluations confirm the simultaneous effectiveness and efficiency of sGrapp and sGradd. sGrapp achieves mean absolute percentage error up to 0.05/0.14 for the cumulative butterfly count in streaming graphs with uniform/non-uniform temporal distribution and a processing throughput of 1.5 million data records per second. The throughput and estimation error of sGrapp are 160x higher and 0.02x lower than baselines. sGradd demonstrates an improving performance over time, achieves zero false detection rates when there is not any drift and when drift is already detected, and detects sequential drifts in zero to a few seconds after their occurrence regardless of drift intervals
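For readers unfamiliar with the terms, a butterfly is a 2x2 biclique in a bipartite graph: two vertices on one side that share two neighbors on the other. The sketch below is an exact, offline counter plus the mean absolute percentage error (MAPE) metric quoted in the abstract. It is not sGrapp's streaming approximation, and the function names `count_butterflies` and `mape` and the toy data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def count_butterflies(edges):
    """Exact butterfly (2x2 biclique) count for a bipartite edge list of (left, right) pairs."""
    lefts_of = defaultdict(set)  # right vertex -> set of left vertices attached to it
    for left, right in edges:
        lefts_of[right].add(left)
    shared = defaultdict(int)  # pair of left vertices -> number of common right neighbors
    for lefts in lefts_of.values():
        for pair in combinations(sorted(lefts), 2):
            shared[pair] += 1
    # Two left vertices with c common neighbors form C(c, 2) butterflies.
    return sum(c * (c - 1) // 2 for c in shared.values())


def mape(estimates, truths):
    """Mean absolute percentage error, expressed as a fraction (0.05 means 5%)."""
    return sum(abs(e - t) / t for e, t in zip(estimates, truths)) / len(truths)


# Hypothetical toy graph: users u1 and u2 both interact with items a, b, and c.
edges = [("u1", "a"), ("u1", "b"), ("u1", "c"),
         ("u2", "a"), ("u2", "b"), ("u2", "c")]
print(count_butterflies(edges))
print(mape([95, 210], [100, 200]))
```

The exact counter enumerates every pair of left vertices under each right vertex, which is the cost a streaming estimator tries to avoid; the MAPE function is how an estimate would be compared against this ground truth.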

    Predictive model for detecting fake reviews: Exploring the possible enhancements of using word embeddings

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics, specialization in Data Science. Fake data contaminates the insights that can be obtained about a product or service and ultimately hurts both businesses and consumers. Being able to correctly identify the truthful reviews will ensure consumers are able to more effectively find products that suit their needs. The following paper aims to develop a predictive model for detecting fake hotel reviews using Natural Language Processing techniques and applying various Machine Learning models. The current research in this area has primarily focused on sentiment analysis and the detection of fake reviews using various text mining methods including bag of words, tokenization, POS tagging and TF-IDF. The research mostly looks at some combination of quantitative and qualitative information. The text component is only analyzed with regard to which words appear in the review, while the semantic relationship is ignored. This research attempts to develop a higher level of performance by implementing pretrained word embeddings during the preprocessing of the text data. The goal is to introduce some context to the text data and see how each model’s performance changes. Traditional text mining models were applied to the dataset to provide a benchmark. Subsequently, GloVe, Word2Vec and BERT word embeddings were implemented and the performance of 8 models was reviewed. The analysis shows a somewhat lower performance obtained by the word embeddings. It seems that in texts of short length, the appearance of words is more indicative of a fake review than the semantic meaning of those words.
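The TF-IDF baseline mentioned in this abstract can be sketched in a few lines. This is one common variant (term frequency normalized by document length, inverse document frequency as log(N/df)), chosen here as an assumption; the dissertation's exact weighting and tokenization are not stated in the abstract, and the function name `tfidf` and the sample reviews are hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Per-document TF-IDF weights using whitespace tokens, tf = count/len, idf = log(N/df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy reviews.
reviews = ["great hotel great staff", "terrible hotel", "hotel near beach"]
weights = tfidf(reviews)
print(weights[0])
```

Note that a word appearing in every review ("hotel" here) gets weight zero, which reflects the abstract's point that this family of methods scores word occurrence rather than meaning.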

    An improved dandelion optimizer algorithm for spam detection next-generation email filtering system

    Spam emails have become a pervasive issue in recent years, as internet users receive increasing amounts of unwanted or fake emails. To combat this issue, automatic spam detection methods have been proposed, which aim to classify emails into spam and non-spam categories. Machine learning techniques have been utilized for this task with considerable success. In this paper, we introduce a novel approach to spam email detection by presenting significant advancements to the Dandelion Optimizer (DO) algorithm. DO is a relatively new nature-inspired optimization algorithm inspired by the flight of dandelion seeds. While DO shows promise, it faces challenges, especially in high-dimensional problems such as feature selection for spam detection. Our primary contributions focus on enhancing the DO algorithm. Firstly, we introduce a new local search algorithm based on flipping (LSAF), designed to improve DO's ability to find the best solutions. Secondly, we propose a reduction equation that streamlines the population size during algorithm execution, reducing computational complexity. To showcase the effectiveness of our modified DO algorithm, which we refer to as Improved DO (IDO), we conduct a comprehensive evaluation using the Spam base dataset from the UCI repository. However, we emphasize that our primary objective is to advance the DO algorithm, with spam email detection serving as a case study application. Comparative analysis against several popular algorithms, including Particle Swarm Optimization (PSO), Genetic Algorithm (GA), Generalized Normal Distribution Optimization (GNDO), Chimp Optimization Algorithm (ChOA), Grasshopper Optimization Algorithm (GOA), Ant Lion Optimizer (ALO), and Dragonfly Algorithm (DA), demonstrates the superior performance of our proposed IDO algorithm. It excels in accuracy, fitness, and the number of selected features, among other metrics. 
Our results clearly indicate that IDO overcomes the local optima problem commonly associated with the standard DO algorithm, owing to the incorporation of LSAF and the reduction equation methods. In summary, our paper underscores the significant advancement made in the form of the IDO algorithm, which represents a promising approach for solving high-dimensional optimization problems, with a keen focus on practical applications in real-world systems. While we employ spam email detection as a case study, our primary contribution lies in the improved DO algorithm, which is efficient, accurate, and outperforms several state-of-the-art algorithms in various metrics. This work opens avenues for enhancing optimization techniques and their applications in machine learning.
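The abstract does not spell out how the flipping-based local search (LSAF) works, so the following is only a generic bit-flip hill-climbing sketch over a binary feature mask, offered to show the general idea of a flip-based local search in feature selection. The function names `flip_local_search` and `toy_fitness` and the toy objective are hypothetical and should not be read as the paper's method.

```python
def flip_local_search(mask, fitness):
    """Greedy bit-flip hill climbing: flip one bit at a time and keep any flip that lowers fitness."""
    best = list(mask)
    best_fit = fitness(best)
    improved = True
    while improved:
        improved = False
        for i in range(len(best)):
            candidate = best.copy()
            candidate[i] = 1 - candidate[i]  # toggle feature i in or out of the subset
            candidate_fit = fitness(candidate)
            if candidate_fit < best_fit:
                best, best_fit = candidate, candidate_fit
                improved = True
    return best, best_fit


# Hypothetical toy objective: features 0 and 2 are informative.
USEFUL = {0, 2}


def toy_fitness(mask):
    """Lower is better: 1 per missed informative feature plus 0.1 per selected feature."""
    missed = sum(1 for i in USEFUL if mask[i] == 0)
    return missed + 0.1 * sum(mask)


best_mask, best_value = flip_local_search([0, 1, 0, 1], toy_fitness)
print(best_mask, best_value)
```

In a real wrapper-style setup the fitness would combine a classifier's error with the number of selected features, which matches the metrics (accuracy, fitness, number of selected features) listed in the abstract.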

    A Longitudinal Study of Factors that Affect User Interactions with Social Media and Email Spam

    Given the rapid growth of social media and the increasing prevalence of spam, it is crucial to understand users’ interactions with unsolicited content to develop effective countermeasures against spam. This thesis focuses on exploring the factors that influence users’ decisions to interact with spam on social media and email. It builds upon prior work, which serves as a foundation for further research and for conducting a longitudinal analysis. Our results are based on the analysis of 221 responses collected through an online survey. The survey not only gathered demographic information such as age, gender, and race but also collected data on education, spam training, interaction with spam, and experiences of being a victim of spam. With about 87% of respondents stating they sometimes, often, or always encounter spam on social media, only 23% interact with it sometimes, often, or always before knowing it was spam, and 10% sometimes, often, or always interact with social media spam after knowing it was spam. Of the 75% of the respondents who stated that they sometimes, often, or always encounter email spam, approximately 13% stated that they sometimes, often, or always interact with email spam before knowing it is spam, and 6% stated that they sometimes, often, or always interact with email spam after knowing it is spam. Only 38% of the users stated that they may have been victims of social media spam, and 21% stated that they may have been victims of email spam. Among the factors analyzed, only age had an effect on reporting email spam, but not social media spam. A STEM education was found to reduce the likelihood of being a victim of both social media and email spam, as well as reduce the likelihood of interacting with both email and social media spam, but only before users knew they were interacting with spam. 
Interestingly, formal spam training did not show any statistical significance in determining how users interact with, report, or become victims of social media spam, although there was an effect when observing the identification of email spam. To quantify the effect of different factors on individuals falling victim to spam on social media and email, a logistic regression analysis was performed. The research findings suggest that individuals with a higher attained degree and a STEM background are the least likely to be victims of spam

    Emotional Tendency Analysis of Twitter Data Streams

    The web now seems to be an alive and dynamic arena in which billions of people across the globe connect, share, publish, and engage in a broad range of everyday activities. Using social media, individuals may connect and communicate with each other at any time and from any location. More than 500 million individuals across the globe post their thoughts and opinions on the internet every day. There is a huge amount of information created from a variety of social media platforms in a variety of formats and languages throughout the globe. Individuals define emotions as powerful feelings directed toward something or someone as a result of internal or external events that have a personal meaning. Emotion recognition in text has several applications in human-computer interfaces and natural language processing (NLP). Emotion classification has previously been studied using bag-of-words classifiers or deep learning methods on static Twitter data. For real-time textual emotion identification, the proposed model combines a mix of keyword-based and learning-based models, as well as a real-time Emotional Tendency Analysi