A COMPARISON OF MACHINE LEARNING TECHNIQUES: E-MAIL SPAM FILTERING FROM COMBINED SWAHILI AND ENGLISH EMAIL MESSAGES
Technology is changing faster now than it did ten to fifteen years ago. It changes the way people live and forces them to adopt the latest devices to keep pace. In communication, the use of electronic mail (e-mail) is now unavoidable for people who want to communicate with friends, companies or universities. This makes e-mail a prime target for spammers, hackers and others who seek to profit by sending spam. Reported figures show that more than 10 billion e-mails can be sent over the internet in a day, of which 45% are spam. The amount is not constant and sometimes exceeds this figure. This clearly indicates the magnitude of the problem and the need for more effort to reduce the volume of spam and minimize its effects.
Various measures have been taken to eliminate this problem. People once relied on social methods, that is, legislative means of control; they now use technological methods, which are more effective and timely in catching spam because they work by analyzing message content. In this paper we compare the performance of machine learning algorithms in experiments on an English-language dataset, a Swahili-language dataset, and a combined dataset formed by merging the two; the results on the combined dataset are also compared with the Gmail classifier. The classifiers used are Naïve Bayes (NB), Sequential Minimal Optimization (SMO) and k-Nearest Neighbour (k-NN).
The results for the combined dataset show that the SMO classifier leads the others, achieving 98.60% accuracy, followed by the k-NN classifier with 97.20% and the Naïve Bayes classifier with 92.89%. From this result the researcher concludes that the SMO classifier works better on a dataset that combines English and Swahili. On the English dataset, SMO also leads the other algorithms with 97.51% accuracy, followed by k-NN with an average accuracy of 93.52% and Naïve Bayes with 87.78%, which is still a good result. On the Swahili dataset, Naïve Bayes leads with 99.12% accuracy, followed by SMO with 98.69% and k-NN with 98.47%.
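As a rough illustration of one of the compared classifiers, the sketch below implements a multinomial Naïve Bayes spam filter with add-one smoothing in plain Python. The eight training messages are invented toy data mixing English and Swahili words, not the paper's dataset, and the function names (`train_naive_bayes`, `predict`, `accuracy`) are hypothetical; the paper's own experiments and tooling are not reproduced here.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(messages, labels):
    """Count words per class on whitespace-tokenised, lower-cased messages."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)

def accuracy(model, messages, labels):
    """Fraction of messages whose predicted label matches the true label."""
    hits = sum(predict(model, m) == y for m, y in zip(messages, labels))
    return hits / len(labels)

# Invented toy corpus (assumption): four spam and four ham messages.
train_messages = [
    "win free money now",
    "shinda pesa bure sasa",
    "free prize claim now",
    "pesa bure bonyeza hapa",
    "meeting at noon tomorrow",
    "habari za asubuhi rafiki",
    "please send the report",
    "asante kwa msaada wako",
]
train_labels = ["spam"] * 4 + ["ham"] * 4

model = train_naive_bayes(train_messages, train_labels)
print(predict(model, "free pesa now"))
print(predict(model, "asante rafiki"))
```

Accuracy figures such as those quoted above would be obtained by calling `accuracy` on a held-out test split rather than on the training messages.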
Application of Big Data Technology, Text Classification, and Azure Machine Learning for Financial Risk Management Using Data Science Methodology
Data science plays a crucial role in enabling organizations to optimize data-driven opportunities within financial risk management. It involves identifying, assessing, and mitigating risks, ultimately safeguarding investments, reducing uncertainty, ensuring regulatory compliance, enhancing decision-making, and fostering long-term sustainability. This thesis explores three facets of Data Science projects: enhancing customer understanding, fraud prevention, and predictive analysis, with the goal of improving existing tools and enabling more informed decision-making. The first project leveraged big data technologies, such as Hadoop and Spark, to enhance financial risk management by accurately predicting loan defaulters and their repayment likelihood. In the second project, we investigated risk assessment and fraud prevention within the financial sector, where Natural Language Processing and machine learning techniques were applied to classify emails into categories like spam, ham, and phishing. After training various models, their performance was rigorously evaluated. In the third project, we explored the use of Azure machine learning to identify loan defaulters, emphasizing the comparison of different machine learning algorithms for predictive analysis. The results aimed to determine the best-performing model by evaluating various performance metrics for the dataset. This study is important because it offers a strategy for enhancing risk management, preventing fraud, and encouraging innovation in the financial industry, ultimately resulting in better financial outcomes and enhanced customer protection.
Security Evaluation of Support Vector Machines in Adversarial Environments
Support Vector Machines (SVMs) are among the most popular classification
techniques adopted in security applications like malware detection, intrusion
detection, and spam filtering. However, if SVMs are to be incorporated in
real-world security systems, they must be able to cope with attack patterns
that can either mislead the learning algorithm (poisoning), evade detection
(evasion), or gain information about their internal parameters (privacy
breaches). The main contributions of this chapter are twofold. First, we
introduce a formal general framework for the empirical evaluation of the
security of machine-learning systems. Second, according to our framework, we
demonstrate the feasibility of evasion, poisoning and privacy attacks against
SVMs in real-world security problems. For each attack technique, we evaluate
its impact and discuss whether (and how) it can be countered through an
adversary-aware design of SVMs. Our experiments are easily reproducible thanks
to open-source code that we have made available, together with all the employed
datasets, on a public repository.
Comment: 47 pages, 9 figures; chapter accepted into book 'Support Vector
Machine Applications'
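To make the idea of an evasion attack concrete, the following is a deliberately simplified sketch against a linear decision function f(x) = w·x + b, where a positive score means "malicious". The attacker moves the sample against the weight vector in small steps until the score turns negative. The weights, bias, step size and function names are illustrative assumptions; this is not the chapter's framework or its released code, which cover more general attacks.

```python
def decision(w, b, x):
    """Linear decision function: positive means 'malicious', negative 'benign'."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def evade(w, b, x, step=0.1, max_steps=100):
    """Shift x against the weight direction until it is classified as benign.

    Returns the modified sample and the number of steps taken.
    """
    norm = sum(wi * wi for wi in w) ** 0.5
    x = list(x)
    for n in range(max_steps):
        if decision(w, b, x) < 0:
            return x, n
        # Steepest descent on a linear score: step along -w / ||w||.
        x = [xi - step * wi / norm for xi, wi in zip(x, w)]
    return x, max_steps

# Hypothetical classifier parameters and attack sample (assumptions).
w, b = [3.0, 4.0], -0.75
x_attack = [1.0, 1.0]

x_evaded, steps = evade(w, b, x_attack)
print(decision(w, b, x_attack), decision(w, b, x_evaded), steps)
```

Each step lowers the score by `step * ||w||`, so a defender who knows the classifier's margin can estimate how much an attacker must change a sample to slip past it.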
Review of steganalysis of digital images
Steganography is the science and art of embedding hidden messages into cover multimedia such as text, image, audio and video. Steganalysis is the counterpart of steganography and aims to determine whether data is hidden inside a digital medium. In this study, specific steganographic schemes such as HUGO and LSB are studied, together with the steganalytic schemes developed to detect the hidden message. Furthermore, some new approaches such as deep learning and game theory, which have seldom been used in steganalysis before, are studied. In the rest of the thesis, some steganalytic schemes using textural features, including LDP and LTP, are implemented.
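For readers unfamiliar with LSB embedding, the sketch below shows the basic least-significant-bit replacement idea on a flat list of 8-bit pixel values. It is a minimal illustration with hypothetical function names and a made-up cover signal, not the specific LSB variant or the steganalytic schemes examined in the thesis.

```python
def embed_lsb(pixels, message):
    """Write the bits of `message` (bytes) into the lowest bit of each pixel."""
    bits = [(byte >> i) & 1 for byte in message for i in range(7, -1, -1)]
    if len(bits) > len(pixels):
        raise ValueError("cover is too small for the message")
    stego = list(pixels)
    for i, bit in enumerate(bits):
        stego[i] = (stego[i] & ~1) | bit  # clear the lowest bit, then set it
    return stego

def extract_lsb(pixels, n_bytes):
    """Read `n_bytes` back from the lowest bits, most significant bit first."""
    out = bytearray()
    for k in range(n_bytes):
        byte = 0
        for p in pixels[8 * k: 8 * k + 8]:
            byte = (byte << 1) | (p & 1)
        out.append(byte)
    return bytes(out)

# Made-up cover signal (assumption): 40 grey-level values.
cover = list(range(100, 140))
stego = embed_lsb(cover, b"hi")
print(extract_lsb(stego, 2))
```

Because each pixel changes by at most one grey level, the modification is visually negligible, which is why steganalysis has to rely on statistical or textural features rather than on visible artefacts.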
Spam Classification Using Machine Learning Techniques - Sinespam
Most e-mail readers spend a non-trivial amount of time regularly deleting junk e-mail (spam)
messages, even as an expanding volume of such e-mail occupies server storage space and
consumes network bandwidth. An ongoing challenge, therefore, rests within the development
and refinement of automatic classifiers that can distinguish legitimate e-mail from spam. Some
published studies have examined spam detectors using Naïve Bayesian approaches and large
feature sets of binary attributes that determine the existence of common keywords in spam,
and many commercial applications also use Naïve Bayesian techniques.
Spammers recognize these attempts to block their messages and have developed tactics to
circumvent these filters, but these evasive tactics are themselves patterns that human readers
can often identify quickly. This work had the objective of developing an alternative approach
using a neural network (NN) classifier trained on a corpus of e-mail messages from several
users. The feature selection used in this work is one of the major improvements, because the
feature set uses descriptive characteristics of words and messages similar to those that a
human reader would use to identify spam, and the model used to select the best feature set was
based on forward feature selection. Another objective of this work was to raise spam
detection accuracy to nearly 95% using Artificial Neural Networks; at present, nobody has reached
more than 89% accuracy using ANN.
Feature generation for optimization of marketing campaign
Utilizing gaming data to optimize the entire gaming paradigm has changed the thinking of developers and gamers alike. The significance of gaming data can be judged from the fact that marketing agencies use it to develop algorithms that predict the behavior of a given gamer and the reaction to updates. The core idea behind the solution proposed and implemented in this thesis is to make marketing campaigns more impactful. According to Statista.com, business-to-business (B2B) organizations spent over $12.3 billion on marketing campaigns. Since one of the major aims of a marketing campaign is customer acquisition, also referred to as demand generation, measuring the success rate of a campaign is of great importance. However, conventional Customer Relationship Management (CRM) systems lack features with which businesses can monitor the effectiveness of their marketing campaigns. The system proposed in this thesis analyzes gaming data in order to extract features for refined marketing campaigns. To analyze and precisely classify the gaming data, the thesis proposes an algorithm that runs behind a full-fledged marketing campaign, can yield optimal results, and can be further refined to predict the future purchase behavior of users in such campaigns. To accomplish this task, the thesis proposes and implements a Random Forest classifier to optimize feature selection and thereby increase business revenue.
The promising results of empirical research and studies have shown the capability of the Random Forest classifier, and after employing it in this research, it has been established that the classifier is capable of extracting significant features from the gaming data sets that were provided. More importantly, this study indicates that the Random Forest classifier gives better results in predicting purchase likelihood, which is an essential milestone for our project. The proposed solution does not only serve to predict purchase likelihood; it can also be used for other objectives related to optimizing marketing campaigns.
Learning Concept Drift Using Adaptive Training Set Formation Strategy
We live in a dynamic world, where change is a part of everyday life. When there is a shift in data, classification or prediction models need to adapt to the change. In data mining, the phenomenon of change in data distribution over time is known as concept drift. In this research, we propose an adaptive supervised learning methodology with delayed labeling. As part of this methodology, we introduce an adaptive training set formation algorithm called SFDL, which is based on selective training set formation. Our proposed solution is considered the first systematic training set formation approach that takes the delayed labeling problem into account. It can be used with any base classifier without changing the implementation or settings of that classifier. We test our implementation on synthetic and real datasets from various domains, which may exhibit different drift types (sudden, gradual, incremental recurrences) with different speeds of change. The experimental results confirm an improvement in classification accuracy over an ordinary classifier for all drift types. Our approach increases classification accuracy by 20% on average and by 56% in the best cases of our experiments, and it was never worse than the ordinary classifiers. Finally, we perform a comparison with four other related methods that deal with changes in user interest over time and handle recurrence drift. The results indicate the effectiveness of the proposed method over the other methods in terms of classification accuracy.
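The effect of sudden concept drift on a static model can be illustrated with a generic sliding-window baseline, sketched below. This is not the SFDL algorithm, whose details are not given in the abstract; it also assumes labels arrive immediately, whereas the work above targets delayed labeling. The synthetic stream, the window size and all function names are assumptions made for illustration.

```python
from collections import Counter, deque

def majority_model(examples):
    """Map each feature value to its most frequent label in `examples`."""
    votes = {}
    for x, y in examples:
        votes.setdefault(x, Counter())[y] += 1
    return {x: c.most_common(1)[0][0] for x, c in votes.items()}

def make_stream(n=200, drift_at=100):
    """Synthetic stream: the label equals x before the drift, 1 - x after it."""
    for t in range(n):
        x = t % 2
        yield x, (x if t < drift_at else 1 - x)

def evaluate(window_size=None, warmup=20):
    """Test-then-train accuracy; window_size=None keeps the warm-up model fixed."""
    stream = list(make_stream())
    static = majority_model(stream[:warmup])
    window = deque(stream[:warmup], maxlen=window_size)
    correct = 0
    for x, y in stream[warmup:]:
        model = majority_model(window) if window_size else static
        correct += model.get(x, 0) == y
        window.append((x, y))  # oldest example drops out once the window is full
    return correct / (len(stream) - warmup)

static_acc = evaluate(window_size=None)
window_acc = evaluate(window_size=20)
print(static_acc, window_acc)
```

The static model keeps applying the pre-drift concept and fails on every example after the change, while the windowed model recovers once post-drift examples outnumber the old ones in its window. Selective training set formation refines this idea by choosing which past examples to keep instead of simply taking the most recent ones.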