7 research outputs found

    Using online linear classifiers to filter spam Emails

    The performance of two online linear classifiers, the Perceptron and Littlestone's Winnow, is explored on two anti-spam filtering benchmark corpora, PU1 and Ling-Spam. We study the performance for varying numbers of features, along with three different feature selection methods: Information Gain (IG), Document Frequency (DF) and Odds Ratio. The size of the training set and the number of training iterations are also investigated for both classifiers. The experimental results show that both the Perceptron and Winnow perform much better when using IG or DF than when using Odds Ratio. It is further demonstrated that when using IG or DF, the classifiers are insensitive to the number of features and the number of training iterations, and not greatly sensitive to the size of the training set. Winnow is shown to slightly outperform the Perceptron. It is also demonstrated that both of these online classifiers perform much better than a standard Naïve Bayes method. The theoretical and implementation computational complexity of these two classifiers is very low, and they are very easily adaptively updated. They outperform most of the published results, while being significantly easier to train and adapt. The analysis and promising experimental results indicate that the Perceptron and Winnow are two very competitive classifiers for anti-spam filtering.
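
    The two mistake-driven update rules named in this abstract can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the function names, the toy binary feature vectors, the threshold of n/2 and the promotion factor alpha = 2 are all assumptions made here, and the Winnow shown is the basic promotion/demotion form, which may differ from the variant the authors evaluated.

```python
from typing import List, Sequence, Tuple


def train_perceptron(X: Sequence[Sequence[int]], y: Sequence[int],
                     epochs: int = 5) -> Tuple[List[float], float]:
    """Perceptron with additive updates; labels are +1 (spam) or -1 (legitimate)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            pred = 1 if score > 0 else -1
            if pred != label:
                # On a mistake, move the weights toward the correct label.
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
    return w, b


def predict_perceptron(w: Sequence[float], b: float, x: Sequence[int]) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


def train_winnow(X: Sequence[Sequence[int]], y: Sequence[int],
                 epochs: int = 5, alpha: float = 2.0) -> Tuple[List[float], float]:
    """Basic Winnow with multiplicative updates; binary features, labels 1 (spam) or 0."""
    n = len(X[0])
    w = [1.0] * n
    theta = n / 2.0  # assumed threshold; other choices are possible
    for _ in range(epochs):
        for x, label in zip(X, y):
            pred = 1 if sum(wi for wi, xi in zip(w, x) if xi) > theta else 0
            if pred == 0 and label == 1:
                # Promotion: scale up the weights of the active features.
                w = [wi * alpha if xi else wi for wi, xi in zip(w, x)]
            elif pred == 1 and label == 0:
                # Demotion: scale down the weights of the active features.
                w = [wi / alpha if xi else wi for wi, xi in zip(w, x)]
    return w, theta


def predict_winnow(w: Sequence[float], theta: float, x: Sequence[int]) -> int:
    return 1 if sum(wi for wi, xi in zip(w, x) if xi) > theta else 0


# Hypothetical toy data: columns stand for the presence of four made-up terms.
X_toy = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 0, 0], [0, 0, 1, 0]]
y_pm = [1, -1, 1, -1]   # Perceptron labels
y_01 = [1, 0, 1, 0]     # Winnow labels

w_p, b_p = train_perceptron(X_toy, y_pm)
w_w, theta_w = train_winnow(X_toy, y_01)
perceptron_preds = [predict_perceptron(w_p, b_p, x) for x in X_toy]
winnow_preds = [predict_winnow(w_w, theta_w, x) for x in X_toy]
```

    Both learners change their weights only when they misclassify a message, which is consistent with the abstract's remark that they are cheap to train and easy to update adaptively.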

    A Survey of Existing E-mail Spam Filtering Methods Considering Machine Learning Techniques

    E-mail is one of the most secure media for online communication and for transferring data or messages over the web. As its popularity has grown, the volume of unsolicited messages has also increased rapidly. Different filtering approaches exist that automatically detect and remove these unwanted messages. Email spam filtering techniques include knowledge-based techniques, clustering techniques, learning-based techniques, heuristic processes and so on. This paper surveys existing email spam filtering systems based on Machine Learning Techniques (MLT) such as Naive Bayes, SVM, K-Nearest Neighbor, Bayes Additive Regression, KNN Tree, and rules. We present the classification, evaluation and comparison of different email spam filtering systems and summarize the overall picture regarding the accuracy of the existing approaches.

    Evaluating Cost-Sensitive Unsolicited Bulk Email Categorization

    In this paper, we discuss cost-sensitive Text Categorization methods for UBE filtering. Specifically, we have evaluated a range of Machine Learning methods for the task (C4.5, Naive Bayes, PART, Support Vector Machines and Rocchio), made cost-sensitive through several methods (Threshold optimization, Weighting, and MetaCost). For the evaluation, we have used the Receiver Operating Characteristic Convex Hull method, which best suits classification problems in which target conditions are not known, as is the case here. Our results do not show a dominant algorithm or method for making algorithms cost-sensitive, but they are the best reported on the test collection used, and they approach the accuracy of real-world manual classifiers.

    Determinants of ICT adoption by small and medium enterprises in Pietermaritzburg.

    Masters Degree. University of KwaZulu-Natal, Pietermaritzburg. Information and Communication Technology (ICT) has been a major contributor to world economic growth. ICT plays a vital role in the growth of Small and Medium Enterprises (SMEs). In developed countries, SMEs are making use of ICTs to support their business functions, although this has not been the case in most developing countries. The Global Entrepreneurship Monitor (GEM) argues that the survival rate of start-up businesses is generally poor, with SMEs in developing countries performing even worse than the standard survival rates. ICT can be used as a tool to improve the performance and survival rate of SMEs in developing countries. SMEs in developing countries are lagging behind when it comes to the adoption of ICT. This study aims to investigate the determinants that influence the intention to adopt ICT by SMEs in Pietermaritzburg, South Africa. The study made use of quantitative methods as its fundamental research approach. 227 SME owners in Pietermaritzburg were surveyed using a closed-ended questionnaire. The Technology, Organisation and Environment (TOE) framework was used as a lens through which to understand the study. The TOE theoretical framework is widely used to study the adoption of innovation at firm level. A Structural Equation Modelling (SEM) approach was applied to analyse the data from the respondents. The study revealed that Technology Context and Organisation Context (-0.221) are significant determinants that influence the intention to adopt ICT amongst SMEs. Technology Context is the most influential determinant with a regression weight of 0.938, and the Environment Context is an insignificant determinant due to the lack of government support. The study contributes towards the understanding of the important determinants that influence the adoption of ICTs in Pietermaritzburg. The results of this study can assist service providers and government in helping to uplift SMEs. It further sheds light on the lack of government support for SMEs.

    Spam Filtering: How The Dimensionality Reduction Affects The Accuracy Of Naive Bayes Classifiers

    E-mail spam has become an increasingly important problem with a big economic impact in society. Fortunately, there are different approaches allowing to automatically detect and remove most of those messages, and the best-known techniques are based on Bayesian decision theory. However, such probabilistic approaches often suffer from a well-known difficulty: the high dimensionality of the feature space. Many term-selection methods have been proposed for avoiding the curse of dimensionality. Nevertheless, it is still unclear how the performance of Naive Bayes spam filters depends on the scheme applied for reducing the dimensionality of the feature space. In this paper, we study the performance of many term-selection techniques with several different models of Naive Bayes spam filters. Our experiments were diligently designed to ensure statistically sound results. Moreover, we perform an analysis concerning the measurements usually employed to evaluate the quality of spam filters. Finally, we also investigate the benefits of using the Matthews correlation coefficient as a measure of performance. © The Brazilian Computer Society 2010.
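
    For reference, the Matthews correlation coefficient mentioned in this abstract is computed from the four confusion-matrix counts as (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below is only an illustration: the function names and the counts are invented here, and returning 0 when the denominator is zero is a common convention assumed for this example, not something stated in the abstract.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an undefined MCC is reported as 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts: a filter that labels every message as legitimate
# on a collection of 10 spam and 90 legitimate messages.
always_legit = dict(tp=0, tn=90, fp=0, fn=10)
acc_always_legit = accuracy(**always_legit)
mcc_always_legit = matthews_corrcoef(**always_legit)
```

    In this made-up case accuracy looks high even though no spam is caught, while the coefficient stays at zero, which illustrates why a single balanced measure can be attractive on skewed collections.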