9 research outputs found

    Feature extraction and classification of spam emails

    Get PDF

    Content based hybrid sms spam filtering system

    Get PDF
    World has changed. Everybody is connected. Almost each and everyone have a mobile phone. Millions of SMSs are going around the world over mobile networks in every second. But about 113 of them are spam. SMS spam has become a crucial problem with the increase of mobile penetration around the world. SMS spam filtering is a relatively new task which inherits many issues and solutions from email spam filtering. However it poses its own specific challenges. Server based approaches and Mobile application based approaches are accommodate content based and content less mechanism to do the SMS spam filtering. Though there are approaches, still there is a lack of a hybrid solution which can do general filtering at server level while user specific filtering can be done on mobile level. This paper presents a hybrid solution for SMS spam filtering where both feature phone users as well as smart phone users get benefited. Feature phone users can experience the general filter while smart phone users can configure and filter SMSs based on their own preferences rather than sticking in to a general filter. Server level solution consists of a neural network along with a Bayesian filter and device level filter consists of a Bayesian filter. We have evaluated the accuracy of neural network using spam huge dataset along with some randomly used personal SMSs

    Comparison of spam detection methods

    Get PDF
    Spämm on tänapäeval üks suurimatest probleemidest, millega kasutajad iga päev kokku puutuvad, kuid mitte kõik ei tea, kuidas rämpspostiga õigesti võidelda. Käesolevas bakalaureusetöös võrreldakse erinevaid rämpsposti tõrjumise meetodeid ja otsustatakse, milline neist on kõige efektiivsem ja millist on kõige mõistlikum kasutada. Seejärel viiakse läbi mitu erinevat katset, kusjuures iga meetodit hinnatakse konkreetsete kriteeriumite järgi ning nende põhjal tehakse järeldused. Lõpuks otsustatakse, millist meetodit on kõige mõistlikum kasutada ning milline neist on efektiivsem.Nowadays, spam is one the biggest problems that users face in their everyday life, but not all of them know how to handle it. The purpose of this paper is to compare different spam detection methods, decide which is the best one and when to use them. After that, each method is assessed according to the specific criteria to make conclusion. Finnaly, it is decided which of those methods is the most rational for specific purposes and which is the most effective

    Bibliometric Survey on Incremental Learning in Text Classification Algorithms for False Information Detection

    Get PDF
    The false information or misinformation over the web has severe effects on people, business and society as a whole. Therefore, detection of misinformation has become a topic of research among many researchers. Detecting misinformation of textual articles is directly connected to text classification problem. With the massive and dynamic generation of unstructured textual documents over the web, incremental learning in text classification has gained more popularity. This survey explores recent advancements in incremental learning in text classification and review the research publications of the area from Scopus, Web of Science, Google Scholar, and IEEE databases and perform quantitative analysis by using methods such as publication statistics, collaboration degree, research network analysis, and citation analysis. The contribution of this study in incremental learning in text classification provides researchers insights on the latest status of the research through literature survey, and helps the researchers to know the various applications and the techniques used recently in the field

    Using skipgrams and POS-based feature selection for patent classification

    Get PDF
    Contains fulltext : 116289.pdf (publisher's version ) (Open Access)19 p

    Combining winnow and orthogonal sparse bigrams for incremental spam filtering

    No full text
    Abstract. Spam filtering is a text categorization task that has attracted significant attention due to the increasingly huge amounts of junk email on the Internet. While current best-practice systems use Naive Bayes filtering and other probabilistic methods, we propose using a statistical, but non-probabilistic classifier based on the Winnow algorithm. The feature space considered by most current methods is either limited in expressivity or imposes a large computational cost. We introduce orthogonal sparse bigrams (OSB) as a feature combination technique that overcomes both these weaknesses. By combining Winnow and OSB with refined preprocessing and tokenization techniques we are able to reach an accuracy of 99.68 % on a difficult test corpus, compared to 98.88% previously reported by the CRM114 classifier on the same test corpus

    A real-time system for abusive network traffic detection

    Get PDF
    Abusive network traffic--to include unsolicited e-mail, malware propagation, and denial-of-service attacks--remains a constant problem in the Internet. Despite extensive research in, and subsequent deployment of, abusive-traffic detection infrastructure, none of the available techniques addresses the problem effectively or completely. The fundamental failing of existing methods is that spammers and attack perpetrators rapidly adapt to and circumvent new mitigation techniques. Analyzing network traffic by exploiting transport-layer characteristics can help remedy this and provide effective detection of abusive traffic. Within this framework, we develop a real-time, online system that integrates transport layer characteristics into the existing SpamAssasin tool for detecting unsolicited commercial e-mail (spam). Specifically, we implement the previously proposed, but undeveloped, SpamFlow technique. We determine appropriate algorithms based on classification performance, training required, adaptability, and computational load. We evaluate system performance in a virtual test bed and live environment and present analytical results. Finally, we evaluate our system in the context of Spam Assassin's auto-learning mode, providing an effective method to train the system without explicit user interaction or feedback.http://archive.org/details/arealtimesystemf109455754Outstanding ThesisApproved for public release; distribution is unlimited

    Finding Microblog Posts of User Interest

    Get PDF
    Microblogging is an increasingly popular form of social media. One of the most popular microblogging services is Twitter. The number of messages posted to Twitter on a daily basis is extremely large. Accordingly, it becomes hard for users to sort through these messages and find ones that interest them. Twitter offers search mechanisms but they are relatively simple and accordingly the results can be lacklustre. Through participation in the 2011 Text Retrieval Conference's Microblog Track, this thesis examines real-time ad hoc search using standard information retrieval approaches without microblog or Twitter specific modifications. It was found that using pseudo-relevance feedback based upon a language model derived from Twitter posts, called tweets, in conjunction with standard ranking methods is able to perform competitively with advanced retrieval systems as well as microblog and Twitter specific retrieval systems. Furthermore, possible modifications both Twitter specific and otherwise are discussed that would potentially increase retrieval performance. Twitter has also spawned an interesting phenomenon called hashtags. Hashtags are used by Twitter users to denote that their message belongs to a particular topic or conversation. Unfortunately, tweets have a 140 characters limit and accordingly all relevant hashtags cannot always be present in tweet. Thus, Twitter users cannot easily find tweets that do not contain hashtags they are interested in but should contain them. This problem is investigated in this thesis in three ways using learning methods. First, learning methods are used to determine if it is possible to discriminate between two topically different sets of a tweets. This thesis then investigates whether or not it is possible for tweets without a particular hashtag, but discusses the same topic as the hashtag, to be separated from random tweets. This case mimics the real world scenario of users having to sift through random tweets to find tweets that are related to a topic they are interested in. This investigation is performed by removing hashtags from tweets and attempting to distinguish those tweets from random tweets. Finally, this thesis investigates whether or not topically similar tweets can also be distinguished based upon a sub-topic. This was investigated in almost an identical manner to the second case. This thesis finds that topically distinct tweets can be distinguished but more importantly that standard learning methods are able to determine that a tweet with a hashtag removed should have that hashtag. In addition, this hashtag reconstruction can be performed well with very few examples of what a tweet with and without the particular hashtag should look like. This provides evidence that it may be possible to separate tweets a user may be interested from random tweets only using hashtags they are interested in. Furthermore, the success of the hashtag reconstruction also provides evidence that users do not misuse or abuse hashtags since hashtag presence was taken to be the ground truth in all experiments. Finally, the applicability of the hashtag reconstruction results to the TREC Microblog Track and a mobile application is presented

    Occam's Razor-based Spam Filter

    Get PDF
    Nowadays e-mail spam is not a novelty, but it is still an important rising problem with a big economic impact in society. Spammers manage to circumvent current spam filters and harm the communication system by consuming several resources, damaging the reliability of e-mail as a communication instrument and tricking recipients to react to spam messages. Consequently, spam filtering poses a special problem in text categorization, of which the defining characteristic is that filters face an active adversary, which constantly attempts to evade filtering. In this paper, we present a novel approach to spam filtering based on theminimum description length principle. Furthermore, we have conducted an empirical experiment on six public and real non-encoded datasets. The results indicate that the proposed filter is fast to construct, incrementally updateable and clearly outperforms the state-of-the-art spam filters. © The Brazilian Computer Society 2012.33245253Almeida, T., Yamakami, A., Content-based spam filtering (2010) Proceedings of the 23rd IEEE International Joint Conference On Neural Networks, pp. 1-7. , Barcelona, SpainAlmeida, T., Yamakami, A., Redução de Dimensionalidade Aplicada na Classificação de Spams Usando Filtros Bayesianos (2011) Revista Brasileira De Computação Aplicada, 3 (1), pp. 16-29Almeida, T., Yamakami, A., Almeida, J., Evaluation of approaches for dimensionality reduction applied with Naive Bayes anti-spam filters (2009) Proceedings of the 8th IEEE International Conference On Machine Learning and Applications, pp. 517-522. , Miami, FL, USAAlmeida, T., Yamakami, A., Almeida, J., Filtering spams using the minimum description length principle (2010) Proceedings of the 25th ACM Symposium On Applied Computing, pp. 1856-1860. , Sierre, SwitzerlandAlmeida, T., Yamakami, A., Almeida, J., Probabilistic antispam filtering with dimensionality reduction (2010) Proceedings of the 25th ACM Symposium On Applied Computing, pp. 1804-1808. , Sierre, SwitzerlandAlmeida, T., Hidalgo, J.G., Yamakami, A., Contributions to the study of SMS spam filtering: New collection and results (2011) Proceedings of the 2011 ACM Symposium On Document Engineering, pp. 259-262. , Mountain View, CA, USAAlmeida, T., Almeida, J., Yamakami, A., Spam filtering: How the dimensionality reduction affects the accuracy of Naive Bayes classifiers (2011) J Internet Serv Appl, 1 (3), pp. 183-200Almeida, T.A., Yamakami, A., Advances in spam filtering techniques (2012) Com Putational Intelligence For Privacy and Security. Studies In Computational Intelligence, 394, pp. 199-214. , In: Elizondo D, Solanas A,Martinez-Balleste A (eds), Springer, BerlinAlmeida, T.A., Yamakami, A., Facing the spammers: A very effective approach to avoid junk e-mails (2012) Expert Syst Appl, pp. 1-5Anagnostopoulos, A., Broder, A., Punera, K., Effective and efficient classification on a search-engine model (2008) Knowl Inf Syst, 16 (2), pp. 129-154Androutsopoulos, I., Koutsias, J., Chandrinos, K., Paliouras, G., Spyropoulos C (2000a) An evalutation of Naive Bayesian anti-spam filtering Proceedings of the 11th European Conference On Machine Learning, pp. 9-17. , Barcelona, SpainAndroutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C., Stamatopoulos, P., Learning to filter spam e-mail: A comparison of a Naive Bayesian and a memory-based approach (2000) Proceedings of the 4th European Conference On Principles and Practice of Knowledge Discovery In Databases, pp. 1-13. , Lyon, FranceAndroutsopoulos, I., Paliouras, G., Michelakis, E., (2004) Learning to Filter Unsolicited Commercial E-mail, , Technical Report 2004/2, National Centre for Scientific Research "Demokritos", Athens, GreeceBaldi, P., Brunak, S., Chauvin, Y., Andersen, C., Nielsen, H., Assessing the accuracy of prediction algorithms for classification: An overview (2000) Bioinformatics, 16 (5), pp. 412-424Barron, A., Rissanen, J., Yu, B., The minimum description length principle in coding and modeling (1998) IEEE Trans Inf Theory, 44 (6), pp. 2743-2760Blanzieri, E., Bryl, A., A survey of learning-based techniques of email spam filtering (2008) Artif Intell Rev, 29 (1), pp. 335-455Bordes, A., Ertekin, S., Weston, J., Bottou, L., Fast kernel classifiers with online and active learning (2005) J Mach Learn Res, 6, pp. 1579-1619Bratko, A., Cormack, G., Filipic, B., Lynam, T., Zupan, B., Spam filtering using statistical data compression models (2006) J Mach Learn Res, 7, pp. 2673-2698Carreras, X., Marquez, L., Boosting trees for anti-spam email filtering (2001) Proceedings of the 4th International Conference On Recent Advances In Natural Language Processing, pp. 58-64. , Tzigov Chark, BulgariaCohen, W., Fast effective rule induction (1995) Proceedings of 12th International Conference On Machine Learning, pp. 115-123. , Tahoe City, CA, USACohen, W., Learning rules that classify e-mail (1996) Proceedings of the AAAI Spring Symposium On Machine Learning In Information Access, pp. 18-25. , CA, USA, StanfordCormack, G., Email spam filtering: A systematic review (2008) Found Trends Inf Retr, 1 (4), pp. 335-455Cormack, G., Lynam, T., Online supervised spam filter evaluation (2007) ACM Trans Inf Syst, 25 (3), pp. 1-11Czarnowski, I., Cluster-based instance selection for machine classification (2011) Knowl Inf SystDrucker, H., Wu, D., Vapnik, V., Support vector machines for spam categorization (1999) IEEE Trans Neural Netw, 10 (5), pp. 1048-1054Forman, G., Scholz, M., Rajaram, S., Feature shaping for linear SVM classifiers (2009) Proceedings of the 15th ACM SIGKDD International Conference On Knowledge Discovery and Data Mining, pp. 299-308. , France, ParisFrank, E., Chui, C., Witten, I., Text categorization using compression models (2000) Proceedings of the 10th Data Compression Conference, pp. 555-565. , Snowbird, UT, USAGrünwald, P., Atutorial introduction to theminimum description length principle (2005) Advances In Minimum Description Length: Theory and Applications, pp. 3-81. , In: Grünwald P, Myung I, Pitt M (eds), MIT Press, CambridgeGuzella, T., Caminhas, W., A review of machine learning approaches to spam filtering (2009) Expert Syst Appl, 36 (7), pp. 10206-10222Hidalgo, J., Evaluating cost-sensitive unsolicited bulk mail categorization (2002) Proceedings of the 17th ACM Symposium On Applied Computing, pp. 615-620. , Madrid, SpainJoachims, T., A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization (1997) Proceedings of 14th International Conference On Machine Learning, pp. 143-151. , Nashville, TN, USAJohn, G., Langley, P., Estimating continuous distributions in Bayesian classifiers (1995) Proceedings of the 11th International Conference OnUncertainty In Artificial Intelligence, pp. 338-345. , Montreal,CanadaKatakis, I., Tsoumakas, G., Vlahavas, I., Tracking recurring contexts using ensemble classifiers: An application to email filtering (2009) Knowl Inf Syst, 22 (3), pp. 371-391Kolcz, A., Alspector, J., SVM-based filtering of e-mail spam with content-specific misclassification costs (2001) Proceedings of the 1st International Conference On Data Mining, pp. 1-14. , San Jose, CA, USALosada, D., Azzopardi, L., Assessing multivariate Bernoulli models for information retrieval (2008) ACM Trans Inf Syst, 26 (3), pp. 1-46Matthews, B., Comparison of the predicted and observed secondary structure of T4 phage lysozyme (1975) Biochimica Et Biophysica Acta, 405 (2), pp. 442-451McCallum, A., Nigam, K., A comparison of event models for Naive Bayes text classication (1998) Proceedings of the 15th AAAI Workshop On Learning For Text Categorization, pp. 41-48. , Menlo Park, CA, USAMetsis, V., Androutsopoulos, I., Paliouras, G., Spam filtering with Naive Bayes-which Naive Bayes? (2006) Proceedings of the 3rd International Conference On Email and Anti-Spam, pp. 1-5. , Mountain View, CA, USAPeng, T., Zuo, W., He, F., SVM based adaptive learning method for text classification from positive and unlabeled documents (2008) Knowl Inf Syst, 16 (3), pp. 281-301Reddy, C., Park, J.-H., Multi-resolution boosting for classification and regression problems (2010) Knowl Inf SystRissanen, J., Modeling by shortest data description (1978) Automatica, 14, pp. 465-471Sahami, M., Dumais, S., Hecherman, D., Horvitz, E., A Bayesian approach to filtering junk e-mail (1998) Proceedings of the 15th NationalConference On Artificial Intelligence, pp. 55-62. , Madison, WI,USASchapire, R., Singer, Y., Singhal, A., Boosting and Rocchio applied to text filtering (1998) Proceedings of the 21st Annual International Conference On Information Retrieval, pp. 215-223. , Melbourne, AustraliaSchneider, K., On word frequency information and negative evidence in Naive Bayes text classification (2004) Proceedings of the 4th International Conference On Advances In Natural Language Processing, pp. 474-485. , Alicante, SpainSiefkes, C., Assis, F., Chhabra, S., Yerazunis, W., Combining winnow and orthogonal sparse bigrams for incremental spam filtering (2004) Proceedings of the 8th European Conference On Principles and Practice of Knowledge Discovery In Databases, pp. 410-421. , Pisa, ItalySong, Y., Kolcz, A., Gilez, C., Better Naive Bayes classification for high-precision spam detection (2009) Softw Pract Experience, 39 (11), pp. 1003-1024Teahan, W., Harper, D., Using compression-based language models for text categorization (2001) Proceedings of the 2001 Workshop On Language Modeling and Information Retrieval, pp. 1-5. , Pittsburgh, PA, USAWozniak, M., A hybrid decision tree training method using data streams (2010) Knowl Inf SystWu, X., Kumar, V., Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G., Steinberg, D., Top 10 algorithms in data mining (2008) Knowl Inf Syst, 14 (1), pp. 1-37Zhang, J., Kang, D., Silvescu, A., Honavar, V., Learning accurate and concise Naive Bayes classifiers from attribute value taxonomies and data (2006) Knowl Inf Syst, 9 (2), pp. 157-179Zhang, L., Zhu, J., Yao, T., An evaluation of statistical spam filtering techniques (2004) ACMTrans Asian Lang Inf Process, 3 (4), pp. 243-26
    corecore