A corpus-based investigation of junk emails
Almost everyone who has an email account receives unwanted emails from time to time. These emails can be jokes from friends or commercial product offers from unknown people. In this paper we focus on unwanted messages that try to promote a product or service, or to offer some “hot” business opportunities. These messages are called junk emails. Several methods to filter junk emails have been proposed, but none considers the linguistic characteristics of junk emails. In this paper, we investigate the linguistic features of a corpus of junk emails and try to decide whether they constitute a distinct genre. Our corpus of junk emails was built from the messages received by the authors over a period of time. Initially, the corpus consisted of 1563 files, but after eliminating duplicates automatically we kept only 673 files, totalling just over 373,000 tokens. In order to decide whether junk emails constitute a different genre, a comparison with a corpus of leaflets extracted from the BNC and with the whole BNC is carried out. Several characteristics at the lexical and grammatical levels were identified
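The abstract does not say how duplicates were detected, so the following is only a minimal sketch of one plausible approach: hash a whitespace- and case-normalized copy of each message and keep the first occurrence. The function names `dedupe` and `count_tokens`, the normalization rule, and the whitespace tokenizer are all assumptions made for illustration, not the authors' procedure.

```python
import hashlib
from typing import Iterable, List


def dedupe(messages: Iterable[str]) -> List[str]:
    """Keep the first copy of each message, comparing normalized text.

    Normalization (an assumption): collapse whitespace and lowercase.
    """
    seen = set()
    unique = []
    for text in messages:
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def count_tokens(messages: Iterable[str]) -> int:
    """Count whitespace-separated tokens (a deliberately crude tokenizer)."""
    return sum(len(text.split()) for text in messages)
```

Exact hashing only catches messages that are identical after normalization; near-duplicates with small edits would need a fuzzier comparison.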
Using online linear classifiers to filter spam Emails
The performance of two online linear classifiers, the Perceptron and Littlestone’s Winnow, is explored on two anti-spam filtering benchmark corpora, PU1 and Ling-Spam. We study the performance for varying numbers of features, along with three different feature selection methods: Information Gain (IG), Document Frequency (DF) and Odds Ratio. The size of the training set and the number of training iterations are also investigated for both classifiers. The experimental results show that both the Perceptron and Winnow perform much better when using IG or DF than when using Odds Ratio. It is further demonstrated that when using IG or DF, the classifiers are insensitive to the number of features and the number of training iterations, and not greatly sensitive to the size of the training set. Winnow is shown to slightly outperform the Perceptron. It is also demonstrated that both of these online classifiers perform much better than a standard Naïve Bayes method. The computational complexity of these two classifiers, both in theory and in implementation, is very low, and they can easily be updated adaptively. They outperform most of the published results, while being significantly easier to train and adapt. The analysis and promising experimental results indicate that the Perceptron and Winnow are two very competitive classifiers for anti-spam filtering
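To make the contrast between the two online learners concrete, here is a minimal sketch of both update rules on binary feature vectors. It assumes labels in {+1, -1}, a textbook additive Perceptron update, and a simple Winnow variant with multiplicative promotion and demotion and a threshold equal to the number of features. The exact Winnow variant and parameters used in the paper are not stated in the abstract, and the toy data is invented.

```python
from typing import List, Sequence, Tuple


def perceptron_train(X: Sequence[Sequence[int]], y: Sequence[int],
                     epochs: int = 5) -> Tuple[List[float], float]:
    """Additive updates: on a mistake, add label * x to the weights."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if score > 0 else -1
            if pred != label:
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
    return w, b


def winnow_train(X: Sequence[Sequence[int]], y: Sequence[int],
                 epochs: int = 5, alpha: float = 2.0) -> Tuple[List[float], float]:
    """Multiplicative updates: on a mistake, scale the weights of active features."""
    n = len(X[0])
    w = [1.0] * n
    theta = float(n)  # assumed threshold choice
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x))
            pred = 1 if score > theta else -1
            if pred != label:
                factor = alpha if label == 1 else 1.0 / alpha
                w = [wi * factor if xi else wi for wi, xi in zip(w, x)]
    return w, theta


def predict(w: Sequence[float], x: Sequence[int],
            bias: float = 0.0, theta: float = 0.0) -> int:
    """Return +1 (spam) if the biased score exceeds theta, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + bias
    return 1 if score > theta else -1


# Invented toy data: features = [has "free", has "meeting", has "offer"].
X = [[1, 0, 1], [1, 0, 0], [0, 1, 0], [0, 1, 1], [0, 0, 1]]
y = [1, 1, -1, -1, -1]
pw, pb = perceptron_train(X, y)
ww, wtheta = winnow_train(X, y)
```

Both learners touch each example once per pass and update only on mistakes, which is why they are cheap to train and easy to update as new mail arrives.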
The Discourse of Digital Deceptions and ‘419’ Emails
This study applies a computer-mediated discourse analysis (CMDA) to the study of discourse structures and functions of ‘419’ emails, the Nigerian term for online/financial fraud. The hoax mails are in the form of online lottery winning announcements, and email ‘business proposals’ involving money transfers/claims of dormant bank accounts overseas. Data comprise 68 email samples collected from the researcher’s inboxes and colleagues’ and students’ mail boxes between January 2008 and March 2009 in Ota, Nigeria. The study reveals that the writers of the mails apply discourse/pragmatic strategies such as socio-cultural greeting formulas, self-identification, reassurance/confidence building, narrativity and action prompting strategies to sustain the interest of the receivers. The study also shows that this genre of computer-mediated communication (CMC) has become a regular part of our Internet experience, and is not likely to be extinct in the near future as previous studies of email hoaxes have predicted. It is believed that as the global economy witnesses a recession, chances are that more creative and complex ways of combating the situation will arise. Economic hardship has been blamed for fraud/online scams, inadvertently prompting youths to engage in various anti-social activities.
KEYWORDS: computer-mediated communication, deceptions, discourse, email, ‘419’, fraud, hoax
Machine Learning Approaches for Modeling Spammer Behavior
Spam is commonly known as unsolicited or unwanted email messages on the Internet that pose a potential threat to Internet security. Users spend a valuable amount of time deleting spam emails. More importantly, ever-increasing spam emails occupy server storage space and consume network bandwidth. Keyword-based spam email filtering strategies will eventually be less successful at modeling spammer behavior, as spammers constantly change their tricks to circumvent these filters. The evasive tactics that spammers use follow patterns, and these patterns can be modeled to combat spam. This paper investigates the possibilities of modeling spammer behavioral patterns with well-known classification algorithms such as the Naïve Bayesian classifier (Naïve Bayes), Decision Tree Induction (DTI) and Support Vector Machines (SVMs). Preliminary experimental results demonstrate a promising detection rate of around 92%, which is a considerable enhancement of performance compared to similar spammer behavior modeling research.
Comment: 12 pages, 3 figures, 5 tables, Submitted to AIRS 201
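Of the classifiers named above, Naïve Bayes is the simplest to illustrate. The sketch below is a generic multinomial Naïve Bayes with add-one (Laplace) smoothing over word tokens. The behavioral features the paper actually uses are not listed in the abstract, so the token features, the function names `train_nb` and `predict_nb`, and the toy data are all illustrative assumptions.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Set, Tuple

Model = Tuple[Dict[str, Counter], Counter, Set[str]]


def train_nb(docs: Sequence[Sequence[str]], labels: Sequence[str]) -> Model:
    """Collect per-class token counts, class document counts, and the vocabulary."""
    word_counts: Dict[str, Counter] = {c: Counter() for c in set(labels)}
    doc_counts = Counter(labels)
    vocab: Set[str] = set()
    for tokens, c in zip(docs, labels):
        word_counts[c].update(tokens)
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict_nb(model: Model, tokens: Sequence[str]) -> str:
    """Return the class with the highest log-posterior (up to a constant)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_class, best_lp = "", -math.inf
    for c in doc_counts:
        lp = math.log(doc_counts[c] / total_docs)  # log prior
        total = sum(word_counts[c].values())
        for t in tokens:
            if t not in vocab:
                continue  # ignore tokens never seen in training
            # Add-one smoothing keeps unseen (class, token) pairs non-zero.
            lp += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best_class, best_lp = c, lp
    return best_class


# Invented toy training set.
docs = [["free", "offer", "now"], ["win", "free", "prize"],
        ["meeting", "agenda", "now"], ["project", "meeting", "notes"]]
labels = ["spam", "spam", "ham", "ham"]
model = train_nb(docs, labels)
```

Working in log space avoids underflow when many small probabilities are multiplied together.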
EMail Data Mining: An Approach to Construct an Organization Position-wise Structure While Performing EMail Analysis
In this age of social networking, it is necessary to define the relationships among the members of a social network. Various techniques are already available to define user-to-user relationships across the network. Over time, many algorithms and machine learning techniques have been applied to find relationships over social networks, yet very few techniques and little information are available for defining relationships directly over raw email data. A few educational societies have developed a way to mine email log files and have found the inter-relations between users by means of clusters. Still, there is no solid technique available that can accurately predict the ranking of each user within an organization by mining their email transaction logs. The author in this report presents a technique to mine email data log files in order to figure out the position-wise structure of an organization. The author also discusses send-receive analysis, statistical analysis, semantic analysis and temporal analysis over the data, and has applied them to test cases. Throughout the research the author has used the Enron employees' email log files, which were made public in 2001
CANELC: constructing an e-language corpus
This paper reports on the construction of CANELC: the Cambridge and Nottingham e-language Corpus.³ CANELC is a one million word corpus of digital communication in English, taken from online discussion boards, blogs, tweets, emails and SMS messages. The paper outlines the approaches used when planning the corpus: obtaining consent, collecting the data and compiling the corpus database.
This is followed by a detailed analysis of some of the patterns of language used in the corpus. The analysis includes a discussion of the key words and phrases used as well as the common themes and semantic associations connected with the data. These discussions form the basis of an investigation of how e-language operates in both similar and different ways to spoken and written records of communication (as evidenced by the BNC - British National Corpus).
³ CANELC stands for Cambridge and Nottingham e-language Corpus. This corpus has been built as part of a collaborative project between The University of Nottingham and Cambridge University Press with whom sole copyright of the annotated corpus resides. CANELC comprises one million words of digital English taken from SMS messages, blogs, tweets, discussion board content and private/business emails. Plans to extend the corpus are under discussion. The legal dimension to corpus ‘ownership’ of some forms of unannotated data is a complex one and is under constant review. At the present time the annotated corpus is only available to authors and researchers working for CUP and is not more generally available
SPAM detection: Naïve bayesian classification and RPN expression-based LGP approaches compared
An investigation is performed of a machine learning algorithm and the Bayesian classifier in the spam-filtering context. The paper shows the advantage of the use of Reverse Polish Notation (RPN) expressions with feature extraction compared to the traditional Naïve Bayesian classifier used for spam detection assuming the same features. The performance of the two is investigated using a public corpus and a recent private spam collection, concluding that the system based on RPN LGP (Linear Genetic Programming) gave better results compared to two popularly used open source Bayesian spam filters. © Springer International Publishing Switzerland 2016
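As a rough illustration of what scoring a message with an RPN expression might look like, the sketch below evaluates a postfix expression over named feature values with a stack and thresholds the result. It shows only the evaluation step, not the genetic programming search that evolves such expressions. The operator set, the feature names (`n_links`, `n_caps`), and the threshold are hypothetical and not taken from the paper.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical binary operators available to an expression.
OPS: Dict[str, Callable[[float, float], float]] = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "max": max,
    "min": min,
}


def eval_rpn(expr: Sequence[str], features: Dict[str, float]) -> float:
    """Evaluate a postfix (RPN) expression; tokens are operators,
    feature names, or numeric literals."""
    stack: List[float] = []
    for tok in expr:
        if tok in OPS:
            if len(stack) < 2:
                raise ValueError("operator needs two operands")
            b = stack.pop()
            a = stack.pop()
            stack.append(OPS[tok](a, b))
        elif tok in features:
            stack.append(float(features[tok]))
        else:
            stack.append(float(tok))  # numeric constant
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]


def is_spam(expr: Sequence[str], features: Dict[str, float],
            threshold: float = 5.0) -> bool:
    """Flag a message when the expression's score exceeds the threshold."""
    return eval_rpn(expr, features) > threshold


# Hypothetical expression: n_links * 2 + n_caps
expr = ["n_links", "2", "*", "n_caps", "+"]
feats = {"n_links": 3, "n_caps": 1}
```

Postfix form needs no parentheses or precedence rules, so any token sequence that leaves exactly one value on the stack is a valid program, which makes it convenient for evolutionary search.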
Modeling Spammer Behavior: Artificial Neural Network vs. Naïve Bayesian Classifier
The exponential growth of spam emails in recent years is a fact of life. Internet subscribers world-wide are unwittingly paying an estimated €10 billion a year in connection costs just to receive “junk” emails, according to a study undertaken for the European Commission. Though there is no universal definition of spam, unwanted and unsolicited commercial email sent as a mass mailing to a large number of recipients is generally known to the internet community as junk email or spam. Spam is considered to be a potential threat to Internet security. Spam's direct effects include the consumption of computer and network resources and the cost in human time and attention of dismissing unwanted messages. More importantly, this ever-increasing spam is taking various forms and finding a home not only in mailboxes but also in newsgroups, discussion forums, etc., without the consent of the recipients. Overflowing mailboxes are overwhelming users, and newsgroups and discussion forums are being flooded with irrelevant or inappropriate messages. As a consequence, users are being discouraged from using these systems, even though they can provide numerous benefits.