    A deep learning method for automatic SMS spam classification: Performance of learning algorithms on indigenous dataset

    SMS, one of the most popular and fastest-growing GSM value-added services worldwide, has attracted unwanted messages, known as SMS spam. The effects of SMS spam are significant, as it affects both users and service providers, eroding trust between the two parties. This article presents a deep learning model based on BiLSTM and compares its results with several state-of-the-art machine learning (ML) algorithms on two datasets: our newly collected dataset and the popular UCI SMS dataset. This study aims to evaluate the performance of diverse learning models and compare results on the newly expanded dataset (ExAIS_SMS) using the following metrics: true positives (TP), false positives (FP), F-measure, recall, precision, and overall accuracy. The average accuracy of the BiLSTM model was moderately better than that of some of the ML classifiers, and the experimental results improved significantly over the ground-truth results after effective fine-tuning of some of the parameters. The BiLSTM model attained an accuracy of 93.4% on the ExAIS_SMS dataset and 98.6% on the UCI dataset. Further comparison of the two datasets on state-of-the-art ML classifiers gave accuracies on the ExAIS_SMS dataset of 89.64% for Naive Bayes, 91.11% for BayesNet, 88.24% for SOM, 75.76% for decision tree, 80.24% for C4.5, and 79.2% for J48. In conclusion, our proposed BiLSTM model showed significant improvement over traditional ML classifiers. To further validate the robustness of our model, we applied it to the UCI dataset, and our results showed optimal performance in classifying SMS spam messages on several metrics: accuracy, precision, recall, and F-measure.
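
    As a rough illustration of the kind of model this abstract describes, the sketch below builds a small BiLSTM text classifier in Keras. The vocabulary size, sequence length, layer widths, and the commented-out loader are illustrative assumptions, not the authors' tuned configuration.

```python
# Minimal BiLSTM SMS spam classifier sketch (Keras / TensorFlow).
# MAX_WORDS, MAX_LEN, layer sizes, and epochs are illustrative
# assumptions, not the paper's tuned configuration.
import tensorflow as tf
from tensorflow.keras import layers

MAX_WORDS, MAX_LEN = 10_000, 50  # assumed vocabulary cap and SMS length cap

def build_model() -> tf.keras.Model:
    model = tf.keras.Sequential([
        layers.Embedding(MAX_WORDS, 64),          # word embeddings
        layers.Bidirectional(layers.LSTM(64)),    # the BiLSTM encoder
        layers.Dropout(0.5),                      # regularization
        layers.Dense(1, activation="sigmoid"),    # P(spam)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical usage (the loader and arrays are assumptions):
# texts, labels = load_sms_dataset()
# tok = tf.keras.preprocessing.text.Tokenizer(num_words=MAX_WORDS)
# tok.fit_on_texts(texts)
# X = tf.keras.preprocessing.sequence.pad_sequences(
#         tok.texts_to_sequences(texts), maxlen=MAX_LEN)
# build_model().fit(X, labels, validation_split=0.2, epochs=5)
```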

    A discrete hidden Markov model for SMS spam detection

    Many machine learning methods have been applied to short messaging service (SMS) spam detection, including traditional methods such as naive Bayes (NB), the vector space model (VSM), and the support vector machine (SVM), and newer methods such as long short-term memory (LSTM) and the convolutional neural network (CNN). These methods are based on the well-known bag-of-words (BoW) model, which assumes documents are unordered collections of words. This assumption overlooks an important piece of information, namely word order. Moreover, term frequency, which counts the number of occurrences of each word in an SMS, is unable to distinguish the importance of words, due to the length limitation of SMS. This paper proposes a new method based on the discrete hidden Markov model (HMM) to use word order information and to address the low-term-frequency issue in SMS spam detection. The widely adopted SMS spam dataset from the UCI machine learning repository is used for performance analysis of the proposed HMM method. The overall performance is comparable to that of deep learning approaches employing CNN and LSTM models. A Chinese SMS spam dataset with 2,000 messages is used for further performance evaluation. Experiments show that the proposed HMM method is not language-sensitive and can identify spam with high accuracy on both datasets.
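
    The sketch below illustrates the core idea of scoring word order by sequence likelihood: one first-order Markov chain per class with Laplace smoothing, classifying by the higher log-likelihood. It is a deliberately reduced stand-in for the paper's discrete HMM, which additionally infers hidden states; the toy training data is invented for illustration.

```python
# Per-class sequence scorer: one first-order Markov chain per class
# (spam/ham) over words, with Laplace smoothing. A reduced stand-in
# for the paper's discrete HMM, shown only to illustrate using word
# order rather than a bag of words.
import math
from collections import defaultdict

class MarkovChainClassifier:
    def __init__(self):
        self.trans = {}   # class -> {prev_word: {word: count}}
        self.vocab = set()

    def fit(self, sequences, labels):
        for seq, y in zip(sequences, labels):
            table = self.trans.setdefault(y, defaultdict(lambda: defaultdict(int)))
            prev = "<s>"  # sentence-start sentinel
            for w in seq:
                table[prev][w] += 1
                self.vocab.add(w)
                prev = w

    def log_prob(self, seq, y):
        table, V = self.trans[y], len(self.vocab) + 1
        lp, prev = 0.0, "<s>"
        for w in seq:
            row = table[prev]
            lp += math.log((row[w] + 1) / (sum(row.values()) + V))  # Laplace
            prev = w
        return lp

    def predict(self, seq):
        return max(self.trans, key=lambda y: self.log_prob(seq, y))

clf = MarkovChainClassifier()
clf.fit([["win", "cash", "now"], ["see", "you", "soon"]], ["spam", "ham"])
print(clf.predict(["win", "cash"]))  # -> 'spam'
```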

    E-mail Marketing System Adoption In E-commerce Startups

    More and more leading-edge information technology has penetrated the market with the arrival of the information age. However, there is still reluctance to adopt some of these best practices, and the rate of diffusion has remained below expectations. Consider, for example, the explosive growth in the adoption of e-mail marketing systems among Chinese business-to-consumer e-commerce startups whose target markets are overseas, alongside the relatively sluggish response to this commercial communication technology among e-commerce startups that focus on local consumers and business-to-business segments. This disparity reveals that the acceptance and diffusion of a communication innovation are subject to many negative external factors and contexts. For e-mail marketing system providers in the Chinese market, it is essential to understand the adoption process and to identify and contrast the distinct barriers perceived by marketers across the types of e-commerce mentioned above, in order to explain these adoption inconsistencies. The research in this thesis is conducted through a multi-method interpretive approach. The author draws on a combination of prior theories to build a new framework. A series of interviews with e-mail advertising system users is conducted; based on the results, the new framework is validated and external negative variables emerge. The nature of these negative factors is then discussed by a group of experts to account for their existence. Finally, the different hindrances impacting adoption across clusters of e-commerce marketers are identified. This thesis posits a theoretical framework combining the technology acceptance model with media choice factors. The key factors considered in this new framework include perceived usefulness, perceived ease of use, critical mass, perceived accessibility, and social influences. This study contributes to the existing literature by creating a new streamlined technology acceptance model that can be used as a theoretical framework to analyze the adoption of e-mail marketing systems. Moreover, it explicitly addresses the distinct barriers that may affect e-commerce startups' adoption of this technology.

    Artificial immune system for the Internet

    We investigate the usability of the Artificial Immune Systems (AIS) approach for solving selected problems in computer networks. Artificial immune systems are created by using the concepts and algorithms inspired by the theory of how the Human Immune System (HIS) works. We consider two applications: detection of routing misbehavior in mobile ad hoc networks, and email spam filtering. In mobile ad hoc networks, multi-hop connectivity is provided by the collaboration of independent nodes. The nodes follow a common protocol in order to build their routing tables and forward the packets of other nodes. As there is no central control, some nodes may defect from the common protocol, which has a negative impact on the overall connectivity of the network. We build an AIS for the detection of routing misbehavior by directly mapping the standard concepts and algorithms used to explain how the HIS works. The implementation and evaluation in a simulator show that the AIS mimics well most of the effects observed in the HIS, e.g. the faster secondary reaction to already encountered misbehavior. However, its effectiveness and practical usability are very constrained, because some particularities of the problem cannot be accounted for by the approach, and because of the computational constraints (reported also in the AIS literature) of the negative selection algorithm used. For the spam filtering problem, we apply the AIS concepts and algorithms much more selectively and in a less standard way, and we obtain much better results. We build the AIS for antispam on top of a standard technique for digest-based collaborative email spam filtering. We notice an advantageous and underemphasized technological difference between AISs and the HIS, and we exploit this difference to incorporate negative selection in an innovative and computationally efficient way. We also improve the representation of the email digests used by the standard collaborative spam filtering scheme. We show that this new representation and the negative selection, when used together, significantly improve the filtering performance of the standard scheme on top of which we build our AIS. Our complete AIS for antispam integrates various innate and adaptive AIS mechanisms, including the mentioned specific use of negative selection and the use of innate signalling mechanisms (PAMP and danger signals). In this way the AIS takes into account users' profiles, implicit or explicit feedback from the users, and the bulkiness of spam. We show by simulations that the overall AIS is very good both at detecting spam and at avoiding misdetection of good emails. Interestingly, both the innate and adaptive mechanisms prove to be crucial for achieving the good overall performance. We develop and test (within a simulator) our AIS for collaborative spam filtering in the case of email communications. The solution, however, seems to be applicable to other types of Internet communications as well: Internet telephony, chat/sms, forums, news, blogs, or the web. In all these cases, the aim is to allow the wanted communications (content) and prevent the unwanted ones from reaching the end users and occupying their time and communication resources. The filtering problems faced, or likely to be faced in the near future, by these applications have, or are likely to have, settings similar to those of the email case: the need for openness to unknown senders (creators of content, initiators of the communication), bulkiness in receiving spam (many recipients are usually affected by the same spam content), tolerance of the system to a small amount of damage (to small amounts of unfiltered spam), the possibility to implicitly or explicitly and cheaply obtain feedback from the recipients about the damage (about the spam that they receive), and the need for strong tolerance to wanted (non-spam) content. Our experiments with email spam filtering show that our AIS, i.e. the way we build it, is well fitted to such problem settings.
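
    The sketch below illustrates negative selection over binary email digests: randomly generated detectors that match any "self" (legitimate) digest within a Hamming radius are discarded, and the survivors flag digests that resemble spam. The digest length, radius, detector count, and hash-based digest function are illustrative assumptions; a real deployment, like the thesis's scheme, would use a similarity-preserving digest rather than a cryptographic hash.

```python
# Negative-selection sketch over 64-bit email digests. Detectors that
# match any "self" (legitimate) digest are discarded during training;
# surviving detectors flag incoming digests as spam-like. All constants
# and the digest function are illustrative assumptions.
import hashlib
import random

L, RADIUS, N_DETECTORS = 64, 12, 500  # assumed bit length, radius, detector count

def digest(text: str) -> int:
    # Stand-in only: a real system needs a similarity-preserving digest,
    # not a cryptographic hash like this one.
    return int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")

def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")

def train_detectors(self_digests):
    detectors = []
    while len(detectors) < N_DETECTORS:
        d = random.getrandbits(L)
        # Negative selection: keep only detectors that match no self digest.
        if all(hamming(d, s) > RADIUS for s in self_digests):
            detectors.append(d)
    return detectors

def is_spam_like(msg_digest: int, detectors) -> bool:
    return any(hamming(msg_digest, d) <= RADIUS for d in detectors)

good = [digest(t) for t in ["meeting at noon", "quarterly report attached"]]
det = train_detectors(good)
print(is_spam_like(digest("meeting at noon"), det))  # False: it is a self digest
```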

    Examining Application Components to Reveal Android Malware

    Smartphones are becoming ubiquitous in everyday life, and malware is exploiting these devices. Therefore, a means to identify the threats of malicious applications is necessary. This paper presents a method to classify and analyze Android malware through application component analysis. The experiment parses selected portions of Android packages to collect features using byte sequences and the permissions of the application. Multiple machine learning algorithms classify the malware samples based on these features. The experiment utilizes an instance-based learner, naive Bayes, decision trees, sequential minimal optimization, boosted naive Bayes, and boosted decision trees to identify the components that best reveal malware characteristics. The best case classifies malicious applications with an accuracy of 99.24% and an area under the curve of 0.9890, utilizing boosted decision trees. This method does not require scanning the entire application and provides high true positive rates. This thesis investigates application components to provide malware classification.
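
    As a sketch of the classification stage, the snippet below encodes requested permissions as binary features and trains boosted decision trees (scikit-learn's AdaBoost, whose default base learner is a decision stump). Parsing byte sequences and manifests from real APKs is elided; the permission list and toy samples are illustrative assumptions.

```python
# Permission-presence features fed to boosted decision trees.
# The permission list and labeled samples below are invented for
# illustration; real features come from parsed APK components.
from sklearn.ensemble import AdaBoostClassifier

PERMISSIONS = ["SEND_SMS", "READ_CONTACTS", "INTERNET", "RECEIVE_BOOT_COMPLETED"]

def featurize(requested: set) -> list:
    # Binary indicator per known permission.
    return [1 if p in requested else 0 for p in PERMISSIONS]

X = [featurize({"SEND_SMS", "INTERNET"}),               # hypothetical malware
     featurize({"INTERNET"}),                            # hypothetical benign
     featurize({"SEND_SMS", "RECEIVE_BOOT_COMPLETED"}),  # hypothetical malware
     featurize({"READ_CONTACTS"})]                       # hypothetical benign
y = [1, 0, 1, 0]  # 1 = malicious

clf = AdaBoostClassifier(n_estimators=50)  # boosted decision stumps
clf.fit(X, y)
print(clf.predict([featurize({"SEND_SMS"})]))  # e.g. [1]
```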

    Program Similarity Analysis for Malware Classification and its Pitfalls

    Malware classification, specifically the task of grouping malware samples into families according to their behaviour, is vital in order to understand the threat they pose and how to protect against them. Recognizing whether one program shares behaviors with another is a task that requires semantic reasoning, meaning that it needs to consider what a program actually does. This is a famously uncomputable problem, due to Rice's theorem. As there is no one-size-fits-all solution, determining program similarity in the context of malware classification requires different tools and methods depending on what is available to the malware defender. When the malware source code is readily available (or at least easy to retrieve), most approaches employ semantic "abstractions", which are computable approximations of the semantics of the program. We consider this the first scenario for this thesis: malware classification using semantic abstractions extracted from the source code in an open system. Structural features, such as the control flow graphs of programs, can be used to classify malware reasonably well. To demonstrate this, we build a tool for malware analysis, R.E.H.A., which targets the Android system and leverages its openness to extract a structural feature from the source code of malware samples. This tool is first successfully evaluated against a state-of-the-art malware dataset and then on a newly collected dataset. We show that R.E.H.A. is able to classify the new samples into their respective families, often outperforming commercial antivirus software. However, abstractions have limitations by virtue of being approximations. We show that by increasing the granularity of the abstractions to produce more fine-grained features, we can improve the accuracy of the results, as in our second tool, StranDroid, which generates fewer false positives on the same datasets. The source code of malware samples is often unavailable or hard to retrieve. For this reason, we introduce a second scenario in which the classification must be carried out with only the compiled binaries of malware samples on hand. Program similarity in this context cannot be determined using semantic abstractions as before, since it is difficult to create meaningful abstractions from zeros and ones. Instead, by treating the compiled programs as raw data, we transform them into images and build upon common image classification algorithms using machine learning. This led us to develop novel deep learning models, a convolutional neural network and a long short-term memory network, to classify the samples into their respective families. To overcome the usual deep learning obstacle of lacking sufficiently large and balanced datasets, we utilize obfuscations as a data augmentation tool to generate semantically equivalent variants of existing samples and expand the dataset as needed. Finally, to lower the computational cost of the training process, we use transfer learning and show that a model trained on one dataset can be used to successfully classify samples in different malware datasets. The third scenario explored in this thesis assumes that even the binary itself cannot be accessed for analysis, but it can be executed, and the execution traces can then be used to extract semantic properties. However, dynamic analysis lacks the formal tools and frameworks that exist in static analysis for proving the effectiveness of obfuscations. For this reason, the focus shifts to building a novel formal framework able to assess the potency of obfuscations against dynamic analysis. We validate the new framework by using it to encode known analyses and obfuscations, and we show how these obfuscations actually hinder the dynamic analysis process.
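
    The sketch below illustrates the binaries-as-images idea from the second scenario: raw bytes are zero-padded or truncated to a fixed length, reshaped into a grayscale image, and classified by a small CNN. The image size, architecture, and family count are illustrative assumptions, not the thesis's models.

```python
# Binaries-as-images sketch: bytes -> grayscale image -> small CNN.
# SIDE, N_FAMILIES, and the architecture are illustrative assumptions.
import numpy as np
import tensorflow as tf

SIDE, N_FAMILIES = 64, 10  # assumed image size and number of families

def binary_to_image(path: str) -> np.ndarray:
    raw = np.fromfile(path, dtype=np.uint8)
    buf = np.zeros(SIDE * SIDE, dtype=np.uint8)
    n = min(raw.size, buf.size)
    buf[:n] = raw[:n]                          # truncate or zero-pad
    return buf.reshape(SIDE, SIDE, 1) / 255.0  # grayscale in [0, 1]

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu",
                           input_shape=(SIDE, SIDE, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(N_FAMILIES, activation="softmax"),  # family label
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Hypothetical usage (paths and labels are assumptions):
# images = np.stack([binary_to_image(p) for p in sample_paths])
# model.fit(images, family_labels, epochs=10)
```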

    Crowdsource Annotation and Automatic Reconstruction of Online Discussion Threads

    Modern communication relies on electronic messages organized in the form of discussion threads. Emails, IMs, SMS, website comments, and forums are all composed of threads, which consist of individual user messages connected by metadata and discourse coherence to messages from other users. Threads are used to display user messages effectively in a GUI such as an email client, providing a background context for understanding a single message. Many messages are meaningless without the context provided by their thread. However, a number of factors may result in missing thread structure, ranging from user mistakes (replying to the wrong message), to missing metadata (some email clients do not produce or save headers that fully encapsulate thread structure, and conversion of archived threads from one repository to another may also result in lost metadata), to covert use (users may avoid metadata to render discussions difficult for third parties to understand). In the field of security, law enforcement agencies may obtain vast collections of discussion turns that require automatic thread reconstruction to understand. For example, the Enron Email Corpus, obtained by the Federal Energy Regulatory Commission during its investigation of the Enron Corporation, has no inherent thread structure. In this thesis, we use natural language processing approaches to reconstruct threads from message content. Reconstruction based on message content sidesteps the problem of missing metadata, permitting post hoc reorganization and discussion understanding. We investigate corpora of email threads and Wikipedia discussions. However, there is a scarcity of annotated corpora for this task. Therefore, we also investigate issues faced when creating crowdsourced datasets and learning statistical models from them. Several of our findings are applicable to other natural language machine classification tasks beyond thread reconstruction. We divide our investigation of discussion thread reconstruction into two parts. First, we explore techniques needed to create a corpus for our thread reconstruction research. Like other NLP pairwise classification tasks, such as Wikipedia discussion turn/edit alignment and sentence pair text similarity rating, email thread disentanglement is a heavily class-imbalanced problem, and although the advent of crowdsourcing has reduced annotation costs, the common practice of crowdsourcing redundancy is too expensive for class-imbalanced tasks. As the first contribution of this thesis, we evaluate alternative strategies for reducing crowdsourcing annotation redundancy for class-imbalanced NLP tasks. We also examine techniques to learn the best machine classifier from our crowdsourced labels. In order to reduce noise in training data, most natural language crowdsourcing annotation tasks gather redundant labels and aggregate them into an integrated label, which is provided to the classifier. However, aggregation discards potentially useful information from linguistically ambiguous instances. For the second contribution of this thesis, we show that, for four of five natural language tasks, filtering the training dataset based on crowdsource annotation item agreement improves task performance, while soft labeling based on crowdsource annotations does not. Second, we investigate thread reconstruction as divided into the tasks of thread disentanglement and adjacency recognition. We present the Enron Threads Corpus, a newly extracted corpus of 70,178 multi-email threads with emails from the Enron Email Corpus. In the original Enron Email Corpus, emails are not sorted by thread. To disentangle these threads, and as the third contribution of this thesis, we perform pairwise classification, using text similarity measures on the non-quoted texts in emails. We show that (i) content text similarity metrics outperform style and structure text similarity metrics in both class-balanced and class-imbalanced settings, and (ii) although feature performance depends on the semantic similarity of the corpus, content features are still effective even when controlling for semantic similarity. To reconstruct threads, it is also necessary to identify adjacency relations among pairs. For the forum of Wikipedia discussions, metadata is not available, and dialogue act typologies, helpful in other domains, are inapplicable. As our fourth contribution, we show via our experiments that adjacency pair recognition can be performed using lexical pair features, without a dialogue act typology or metadata, and that this is robust to controlling for topic bias of the discussions. Yet lexical pair features do not effectively model the lexical semantic relations between adjacency pairs. To model lexical semantic relations, and as our fifth contribution, we perform adjacency recognition using extracted keyphrases enhanced with semantically related terms. While this technique outperforms a most-frequent-class baseline, it fails to outperform lexical pair features or tf-idf weighted cosine similarity. Our investigation shows that this is the result of poor word sense disambiguation and poor keyphrase extraction causing spurious false positive semantic connections. In concluding this thesis, we also reflect on open issues and unanswered questions remaining after our research contributions, discuss applications for thread reconstruction, and suggest some directions for future work.
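
    As a minimal illustration of the pairwise disentanglement step, the sketch below strips quoted lines from two emails and predicts "same thread" when the TF-IDF cosine similarity of the remaining text exceeds a threshold. The threshold and the quote-stripping rule are illustrative assumptions; the thesis trains classifiers over a richer set of content, style, and structure similarity features.

```python
# Pairwise thread disentanglement sketch: TF-IDF cosine similarity of
# non-quoted email text, thresholded into a same-thread decision.
# THRESHOLD and the ">"-prefix quote rule are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

THRESHOLD = 0.3  # assumed; a trained classifier would replace this

def strip_quoted(body: str) -> str:
    # Drop lines quoted from earlier messages.
    return "\n".join(l for l in body.splitlines()
                     if not l.lstrip().startswith(">"))

def same_thread(email_a: str, email_b: str, vectorizer) -> bool:
    X = vectorizer.transform([strip_quoted(email_a), strip_quoted(email_b)])
    return cosine_similarity(X[0], X[1])[0, 0] >= THRESHOLD

corpus = ["budget meeting moved to Friday",
          "re: budget meeting\n> moved to Friday\nworks for me",
          "lunch tomorrow?"]
vec = TfidfVectorizer().fit(strip_quoted(t) for t in corpus)
print(same_thread(corpus[0], corpus[1], vec))  # likely True
print(same_thread(corpus[0], corpus[2], vec))  # likely False
```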

    Quantitative Assessment of Factors in Sentiment Analysis

    Sentiment can be defined as a tendency to experience certain emotions in relation to a particular object or person. Sentiment may be expressed in writing, in which case determining that sentiment algorithmically is known as sentiment analysis. Sentiment analysis is often applied to Internet texts such as product reviews, websites, blogs, or tweets, where automatically determining published feelings towards a product or service is very useful to marketers and opinion analysts. The main goal of sentiment analysis is to identify the polarity of natural language text. This thesis sets out to examine quantitatively the factors that have an effect on sentiment analysis. The factors commonly used in sentiment analysis are text features, sentiment lexica or resources, and the machine learning algorithms employed. The main aim of this thesis is to investigate systematically the interaction between sentiment analysis factors and machine learning algorithms in order to improve sentiment analysis performance as compared to the opinions of human assessors. A software system known as TJP was designed and developed to support this investigation. The research reported here has three main parts. Firstly, the role of data pre-processing was investigated with TJP using a combination of features together with publicly available datasets. This considers the relationship and relative importance of superficial text features such as emoticons, n-grams, negations, hashtags, repeated letters, special characters, slang, and stopwords. The resulting statistical analysis suggests that a combination of all of these features achieved better accuracy on the dataset and had a considerable effect on system performance. Secondly, the effect of human-marked-up training data was considered, since this is required by supervised machine learning algorithms. The results gained from TJP suggest that training data greatly augments sentiment analysis performance; however, the combination of training data and sentiment lexica seems to provide optimal performance. Nevertheless, one particular sentiment lexicon, AFINN, contributed more than the others in the absence of training data, and would therefore be appropriate for unsupervised approaches to sentiment analysis. Finally, the performance of two sophisticated ensemble machine learning algorithms was investigated. Both the Arbiter Tree and the Combiner Tree were chosen, since neither had previously been used for sentiment analysis. The objective here was to demonstrate their applicability and effectiveness compared to that of the leading single machine learning algorithms, Naïve Bayes and Support Vector Machines. The results showed that, whilst either can be applied to sentiment analysis, the Arbiter Tree ensemble algorithm achieved better accuracy than either the Combiner Tree or any single machine learning algorithm.
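
    The sketch below illustrates the kind of surface-feature pre-processing the thesis quantifies: normalizing emoticons, hashtags, and repeated letters, removing stopwords, and marking negation scope before classification. The token lists and replacement tags are illustrative assumptions, not TJP's actual rules.

```python
# Surface-feature pre-processing sketch for sentiment analysis.
# Emoticon map, negation list, and stopword list are tiny illustrative
# assumptions; a real pipeline would use fuller resources.
import re

EMOTICONS = {":)": "EMO_POS", ":(": "EMO_NEG"}   # assumed mapping
NEGATIONS = {"not", "no", "never"}
STOPWORDS = {"the", "a", "an", "is", "at"}        # tiny illustrative list

def preprocess(text: str) -> list:
    text = text.lower()
    for emo, tag in EMOTICONS.items():
        text = text.replace(emo, f" {tag} ")      # emoticons -> tags
    text = re.sub(r"#(\w+)", r"HASH_\1", text)    # hashtags -> tags
    text = re.sub(r"(\w)\1{2,}", r"\1\1", text)   # "goooood" -> "good"
    tokens, negate = [], False
    for tok in text.split():
        if tok in NEGATIONS:
            negate = True                          # open negation scope
            continue
        if tok in STOPWORDS:
            continue
        tokens.append("NOT_" + tok if negate else tok)
        negate = False                             # scope covers next token only
    return tokens

print(preprocess("This is not goooood #fail :("))
# -> ['this', 'NOT_good', 'HASH_fail', 'EMO_NEG']
```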

    From Bugs to Decision Support – Leveraging Historical Issue Reports in Software Evolution

    Software developers in large projects work in complex information landscapes, and staying on top of all relevant software artifacts is an acknowledged challenge. As software systems often evolve over many years, a large number of issue reports is typically managed during the lifetime of a system, representing the units of work needed for its improvement, e.g., defects to fix, requested features, or missing documentation. Efficient management of incoming issue reports requires the successful navigation of the information landscape of a project. In this thesis, we address two tasks involved in issue management: Issue Assignment (IA) and Change Impact Analysis (CIA). IA is the early task of allocating an issue report to a development team, and CIA is the subsequent activity of identifying how source code changes affect the existing software artifacts. While IA is fundamental in all large software projects, CIA is particularly important to safety-critical development. Our solution approach, grounded in surveys of industry practice as well as the scientific literature, is to support navigation by combining information retrieval and machine learning into Recommendation Systems for Software Engineering (RSSEs). While the sheer number of incoming issue reports might challenge the overview of a human developer, our techniques instead benefit from the availability of ever-growing training data. We leverage the volume of issue reports to develop accurate decision support for software evolution. We evaluate our proposals both by deploying an RSSE in two development teams and by simulation scenarios, i.e., we assess the correctness of the RSSEs' output when replaying the historical inflow of issue reports. In total, more than 60,000 historical issue reports are involved in our studies, originating from the evolution of five proprietary systems at two companies. Our results show that RSSEs for both IA and CIA can help developers navigate large software projects, in terms of locating development teams and software artifacts. Finally, we discuss how to support the transfer of our results to industry, focusing on addressing the context dependency of our tool support by systematically tuning parameters to a specific operational setting.
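
    As a minimal illustration of IR-plus-ML issue assignment, the sketch below vectorizes issue text with TF-IDF and routes each report to a team with a linear classifier. The toy issues and team names are illustrative assumptions, standing in for the proprietary data and the deployed RSSE.

```python
# Issue Assignment sketch: TF-IDF text features plus a linear
# classifier routing issue reports to development teams. The toy
# issues and team labels below are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

issues = ["NullPointerException in payment gateway on timeout",
          "UI button misaligned on settings page",
          "Database connection pool exhausted under load",
          "Dark mode colors unreadable in dialog"]
teams = ["backend", "frontend", "backend", "frontend"]

router = make_pipeline(TfidfVectorizer(), LogisticRegression())
router.fit(issues, teams)  # train on the historical inflow
print(router.predict(["Settings page button overlaps dialog"]))  # e.g. ['frontend']
```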