97 research outputs found

    Spam Detection Using Machine Learning and Deep Learning

    Get PDF
    Text messages are essential these days; however, spam texts have contributed negatively to the success of this communication mode. The compromised authenticity of such messages has given rise to several security breaches. Using spam messages, malicious links have been sent to either harm the system or obtain information detrimental to the user. Spam SMS messages as well as emails have been used as media for attacks such as masquerading and smishing ( a phishing attack through text messaging), and this has threatened both the user and service providers. Therefore, given the waves of attacks, the need to identify and remove these spam messages is important. This dissertation explores the process of text classification from data input to embedded representation of the words in vector form and finally the classification process. Therefore, we have applied different embedding methods to capture both the linguistic and semantic meanings of words. Static embedding methods that are used include Word to Vector (Word2Vec) and Global Vectors (GloVe), while for dynamic embedding the transfer learning of the Bidirectional Encoder Representations from Transformers (BERT) was employed. For classification, both machine learning and deep learning techniques were used to build an efficient and sensitive classification model with good accuracy and low false positive rate. Our result established that the combination of BERT for embedding and machine learning for classification produced better classification results than other combinations. With these results, we developed models that combined the self-feature extraction advantage of deep learning and the effective classification of machine learning. These models were tested on four different datasets, namely: SMS Spam dataset, Ling dataset, Spam Assassin dataset and Enron dataset. BERT+SVC (hybrid model) produced the result with highest accuracy and lowest false positive rate

    Machine Learning for Mental Health Detection

    Get PDF
    Our project goal was to develop a depression sensing application that leverages multi-modal data sources collected from a smartphone, focusing on features extracted from audio, text messages, social media data, as well as GPS modalities. We conducted extensive experiments to study the effectiveness of these features to improve our machine learning model. We deployed our EMU app on Amazon Mechanical Turk for crowd-sourced data collection and incorporated feature extraction techniques and machine learning algorithms to reliably predict levels of depression

    An Overview on Application of Machine Learning Techniques in Optical Networks

    Get PDF
    Today's telecommunication networks have become sources of enormous amounts of widely heterogeneous data. This information can be retrieved from network traffic traces, network alarms, signal quality indicators, users' behavioral data, etc. Advanced mathematical tools are required to extract meaningful information from these data and take decisions pertaining to the proper functioning of the networks from the network-generated data. Among these mathematical tools, Machine Learning (ML) is regarded as one of the most promising methodological approaches to perform network-data analysis and enable automated network self-configuration and fault management. The adoption of ML techniques in the field of optical communication networks is motivated by the unprecedented growth of network complexity faced by optical networks in the last few years. Such complexity increase is due to the introduction of a huge number of adjustable and interdependent system parameters (e.g., routing configurations, modulation format, symbol rate, coding schemes, etc.) that are enabled by the usage of coherent transmission/reception technologies, advanced digital signal processing and compensation of nonlinear effects in optical fiber propagation. In this paper we provide an overview of the application of ML to optical communications and networking. We classify and survey relevant literature dealing with the topic, and we also provide an introductory tutorial on ML for researchers and practitioners interested in this field. Although a good number of research papers have recently appeared, the application of ML to optical networks is still in its infancy: to stimulate further work in this area, we conclude the paper proposing new possible research directions

    Understanding Bots on Social Media - An Application in Disaster Response

    Get PDF
    abstract: Social media has become a primary platform for real-time information sharing among users. News on social media spreads faster than traditional outlets and millions of users turn to this platform to receive the latest updates on major events especially disasters. Social media bridges the gap between the people who are affected by disasters, volunteers who offer contributions, and first responders. On the other hand, social media is a fertile ground for malicious users who purposefully disturb the relief processes facilitated on social media. These malicious users take advantage of social bots to overrun social media posts with fake images, rumors, and false information. This process causes distress and prevents actionable information from reaching the affected people. Social bots are automated accounts that are controlled by a malicious user and these bots have become prevalent on social media in recent years. In spite of existing efforts towards understanding and removing bots on social media, there are at least two drawbacks associated with the current bot detection algorithms: general-purpose bot detection methods are designed to be conservative and not label a user as a bot unless the algorithm is highly confident and they overlook the effect of users who are manipulated by bots and (unintentionally) spread their content. This study is trifold. First, I design a Machine Learning model that uses content and context of social media posts to detect actionable ones among them; it specifically focuses on tweets in which people ask for help after major disasters. Second, I focus on bots who can be a facilitator of malicious content spreading during disasters. I propose two methods for detecting bots on social media with a focus on the recall of the detection. Third, I study the characteristics of users who spread the content of malicious actors. These features have the potential to improve methods that detect malicious content such as fake news.Dissertation/ThesisDoctoral Dissertation Computer Science 201

    Novel graph analytics for enhancing data insight

    No full text
    Graph analytics is a fast growing and significant field in the visualization and data mining community, which is applied on numerous high-impact applications such as, network security, finance, and health care, providing users with adequate knowledge across various patterns within a given system. Although a series of methods have been developed in the past years for the analysis of unstructured collections of multi-dimensional points, graph analytics has only recently been explored. Despite the significant progress that has been achieved recently, there are still many open issues in the area, concerning not only the performance of the graph mining algorithms, but also producing effective graph visualizations in order to enhance human perception. The current thesis deals with the investigation of novel methods for graph analytics, in order to enhance data insight. Towards this direction, the current thesis proposes two methods so as to perform graph mining and visualization. Based on previous works related to graph mining, the current thesis suggests a set of novel graph features that are particularly efficient in identifying the behavioral patterns of the nodes on the graph. The specific features proposed, are able to capture the interaction of the neighborhoods with other nodes on the graph. Moreover, unlike previous approaches, the graph features introduced herein, include information from multiple node neighborhood sizes, thus capture long-range correlations between the nodes, and are able to depict the behavioral aspects of each node with high accuracy. Experimental evaluation on multiple datasets, shows that the use of the proposed graph features for the graph mining procedure, provides better results than the use of other state-of-the-art graph features. Thereafter, the focus is laid on the improvement of graph visualization methods towards enhanced human insight. In order to achieve this, the current thesis uses non-linear deformations so as to reduce visual clutter. Non-linear deformations have been previously used to magnify significant/cluttered regions in data or images for reducing clutter and enhancing the perception of patterns. Extending previous approaches, this work introduces a hierarchical approach for non-linear deformation that aims to reduce visual clutter by magnifying significant regions, and leading to enhanced visualizations of one/two/three-dimensional datasets. In this context, an energy function is utilized, which aims to determine the optimal deformation for every local region in the data, taking the information from multiple single-layer significance maps into consideration. The problem is subsequently transformed into an optimization problem for the minimization of the energy function under specific spatial constraints. Extended experimental evaluation provides evidence that the proposed hierarchical approach for the generation of the significance map surpasses current methods, and manages to effectively identify significant regions and deliver better results. The thesis is concluded with a discussion outlining the major achievements of the current work, as well as some possible drawbacks and other open issues of the proposed approaches that could be addressed in future works.Open Acces

    Advanced Machine Learning Techniques and Meta-Heuristic Optimization for the Detection of Masquerading Attacks in Social Networks

    Get PDF
    According to the report published by the online protection firm Iovation in 2012, cyber fraud ranged from 1 percent of the Internet transactions in North America Africa to a 7 percent in Africa, most of them involving credit card fraud, identity theft, and account takeover or h¼acking attempts. This kind of crime is still growing due to the advantages offered by a non face-to-face channel where a increasing number of unsuspecting victims divulges sensitive information. Interpol classifies these illegal activities into 3 types: • Attacks against computer hardware and software. • Financial crimes and corruption. • Abuse, in the form of grooming or “sexploitation”. Most research efforts have been focused on the target of the crime developing different strategies depending on the casuistic. Thus, for the well-known phising, stored blacklist or crime signals through the text are employed eventually designing adhoc detectors hardly conveyed to other scenarios even if the background is widely shared. Identity theft or masquerading can be described as a criminal activity oriented towards the misuse of those stolen credentials to obtain goods or services by deception. On March 4, 2005, a million of personal and sensitive information such as credit card and social security numbers was collected by White Hat hackers at Seattle University who just surfed the Web for less than 60 minutes by means of the Google search engine. As a consequence they proved the vulnerability and lack of protection with a mere group of sophisticated search terms typed in the engine whose large data warehouse still allowed showing company or government websites data temporarily cached. As aforementioned, platforms to connect distant people in which the interaction is undirected pose a forcible entry for unauthorized thirds who impersonate the licit user in a attempt to go unnoticed with some malicious, not necessarily economic, interests. In fact, the last point in the list above regarding abuses has become a major and a terrible risk along with the bullying being both by means of threats, harassment or even self-incrimination likely to drive someone to suicide, depression or helplessness. California Penal Code Section 528.5 states: “Notwithstanding any other provision of law, any person who knowingly and without consent credibly impersonates another actual person through or on an Internet Web site or by other electronic means for purposes of harming, intimidating, threatening, or defrauding another person is guilty of a public offense punishable pursuant to subdivision [...]”. IV Therefore, impersonation consists of any criminal activity in which someone assumes a false identity and acts as his or her assumed character with intent to get a pecuniary benefit or cause some harm. User profiling, in turn, is the process of harvesting user information in order to construct a rich template with all the advantageous attributes in the field at hand and with specific purposes. User profiling is often employed as a mechanism for recommendation of items or useful information which has not yet considered by the client. Nevertheless, deriving user tendency or preferences can be also exploited to define the inherent behavior and address the problem of impersonation by detecting outliers or strange deviations prone to entail a potential attack. This dissertation is meant to elaborate on impersonation attacks from a profiling perspective, eventually developing a 2-stage environment which consequently embraces 2 levels of privacy intrusion, thus providing the following contributions: • The inference of behavioral patterns from the connection time traces aiming at avoiding the usurpation of more confidential information. When compared to previous approaches, this procedure abstains from impinging on the user privacy by taking over the messages content, since it only relies on time statistics of the user sessions rather than on their content. • The application and subsequent discussion of two selected algorithms for the previous point resolution: – A commonly employed supervised algorithm executed as a binary classifier which thereafter has forced us to figure out a method to deal with the absence of labeled instances representing an identity theft. – And a meta-heuristic algorithm in the search for the most convenient parameters to array the instances within a high dimensional space into properly delimited clusters so as to finally apply an unsupervised clustering algorithm. • The analysis of message content encroaching on more private information but easing the user identification by mining discriminative features by Natural Language Processing (NLP) techniques. As a consequence, the development of a new feature extraction algorithm based on linguistic theories motivated by the massive quantity of features often gathered when it comes to texts. In summary, this dissertation means to go beyond typical, ad-hoc approaches adopted by previous identity theft and authorship attribution research. Specifically it proposes tailored solutions to this particular and extensively studied paradigm with the aim at introducing a generic approach from a profiling view, not tightly bound to a unique application field. In addition technical contributions have been made in the course of the solution formulation intending to optimize familiar methods for a better versatility towards the problem at hand. In summary: this Thesis establishes an encouraging research basis towards unveiling subtle impersonation attacks in Social Networks by means of intelligent learning techniques

    Explainable NLP for Human-AI Collaboration

    Get PDF
    With more data and computing resources available these days, we have seen many novel Natural Language Processing (NLP) models breaking one performance record after another. Some of them even outperform human performance in some specific tasks. Meanwhile, many researchers have revealed weaknesses and irrationality of such models, e.g., having biases against some sub-populations, producing inconsistent predictions, and failing to work effectively in the wild due to overfitting. Therefore, in real applications, especially in high-stakes domains, humans cannot rely carelessly on predictions of NLP models, but they need to work closely with the models to ensure that every final decision made is accurate and benevolent. In this thesis, we devise and utilize explainable NLP techniques to support human-AI collaboration using text classification as a target task. Overall, our contributions can be divided into three main parts. First, we study how useful explanations are for humans according to three different purposes: revealing model behavior, justifying model predictions, and helping humans investigate uncertain predictions. Second, we propose a framework that enables humans to debug simple deep text classifiers informed by model explanations. Third, leveraging on computational argumentation, we develop a novel local explanation method for pattern-based logistic regression models that align better with human judgements and effectively assist humans to perform an unfamiliar task in real-time. Altogether, our contributions are paving the way towards the synergy of profound knowledge of human users and the tireless power of AI machines.Open Acces
    • …
    corecore