
    Support Vector Machines for Image Spam Analysis

    Email is one of the most common forms of digital communication. Spam is unsolicited bulk email, while image spam consists of spam text embedded inside an image. Image spam is used to evade text-based spam filters and hence poses a threat to email-based communication. In this research, we analyze image spam detection using support vector machines (SVMs), which we train on a wide variety of image features. We use a linear SVM to quantify the relative importance of the features under consideration. We also develop and analyze a realistic “challenge” dataset that illustrates the limitations of current image spam detection techniques.
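
    As a rough illustration of the linear-SVM step described above, the sketch below trains a linear SVM on a feature matrix and ranks features by the magnitude of their learned weights. The data, feature count, and parameter values are placeholders, not the features or settings used in the paper.

```python
# Hedged sketch of the linear-SVM idea above (not the authors' exact pipeline):
# train on precomputed per-image feature vectors and read the learned weights
# as a rough measure of relative feature importance. X and y are synthetic
# placeholders standing in for real image features and spam/ham labels.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))      # stand-in for image features (entropy, edges, ...)
y = rng.integers(0, 2, size=500)    # 1 = image spam, 0 = legitimate image

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

model = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
model.fit(X_tr, y_tr)
print("test accuracy:", model.score(X_te, y_te))

# With a linear kernel, |w_i| gives a rough ranking of feature importance.
weights = np.abs(model.named_steps["linearsvc"].coef_).ravel()
print("most influential feature indices:", np.argsort(weights)[::-1][:5])
```

    Because the data here is random, the reported accuracy will hover around chance; the point is only the mechanics of reading feature weights from a linear SVM.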

    Spam Classification Using Machine Learning Techniques - Sinespam

    Most e-mail readers spend a non-trivial amount of time regularly deleting junk e-mail (spam) messages, even as an expanding volume of such e-mail occupies server storage space and consumes network bandwidth. An ongoing challenge, therefore, rests within the development and refinement of automatic classifiers that can distinguish legitimate e-mail from spam. Some published studies have examined spam detectors using Naïve Bayesian approaches and large feature sets of binary attributes that determine the existence of common keywords in spam, and many commercial applications also use Naïve Bayesian techniques. Spammers recognize these attempts to block their messages and have developed tactics to circumvent these filters, but these evasive tactics are themselves patterns that human readers can often identify quickly. This work had the objective of developing an alternative approach: a neural network (NN) classifier trained on a corpus of e-mail messages from several users. The feature selection used in this work is one of its major improvements, because the feature set uses descriptive characteristics of words and messages similar to those a human reader would use to identify spam, and the best feature set is selected by forward feature selection. Another objective of this work was to bring spam detection accuracy to nearly 95% using artificial neural networks; to date, no reported ANN-based approach has exceeded 89% accuracy.
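
    The forward feature selection step mentioned above can be sketched with off-the-shelf tools, as shown below. The data, feature count, network size, and subset size are illustrative assumptions, not the corpus or settings of this work.

```python
# Hedged sketch of forward (greedy) feature selection wrapped around a small
# neural network, in the spirit of the approach described above. X and y are
# synthetic placeholders for the descriptive word/message features and spam labels.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 15))      # stand-in for per-message features
y = rng.integers(0, 2, size=300)    # 1 = spam, 0 = legitimate mail

nn = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=1)

# Forward selection: start from the empty set and repeatedly add the single
# feature that most improves the NN's cross-validated accuracy.
selector = SequentialFeatureSelector(nn, n_features_to_select=5,
                                     direction="forward", cv=3)
selector.fit(X, y)
selected = np.flatnonzero(selector.get_support())
print("selected feature indices:", selected)
print("CV accuracy on the subset:",
      cross_val_score(nn, X[:, selected], y, cv=3).mean().round(3))
```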

    Improving Feature Selection Techniques for Machine Learning

    As a commonly used technique in data preprocessing for machine learning, feature selection identifies important features and removes irrelevant, redundant or noisy features to reduce the dimensionality of the feature space. It improves the efficiency, accuracy and comprehensibility of the models built by learning algorithms. Feature selection techniques have been widely employed in a variety of applications, such as genomic analysis, information retrieval, and text categorization. Researchers have introduced many feature selection algorithms with different selection criteria. However, it has been discovered that no single criterion is best for all applications. We propose a hybrid feature selection framework based on genetic algorithms (GAs), a wrapper method that employs the target learning algorithm to evaluate features; we call it the hybrid genetic feature selection (HGFS) framework. The advantages of this approach include the ability to accommodate multiple feature selection criteria and to find small subsets of features that perform well for the target algorithm. Experiments on genomic data demonstrate that it is a robust and effective approach that can find subsets of features with higher classification accuracy and/or smaller size compared to each individual feature selection algorithm. A common characteristic of text categorization tasks is multi-label classification with a great number of features, which makes wrapper methods time-consuming and impractical. We therefore also propose a simple filter (non-wrapper) approach called the Relation Strength and Frequency Variance (RSFV) measure. The basic idea is that informative features are those that are highly correlated with the class and distributed most differently among all classes. The approach is compared with two well-known feature selection methods in experiments on two standard text corpora. The experiments show that RSFV generates performance equal to or better than the others in many cases.
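
    A minimal sketch of the GA-wrapper idea is given below, assuming a decision tree as the target learner and synthetic data; the population size, number of generations, and mutation rate are arbitrary illustrative choices, and this is not the HGFS implementation itself.

```python
# Simplified sketch of a GA-based wrapper for feature selection. Each chromosome
# is a bit mask over features; fitness is the cross-validated accuracy of a target
# learner (here a decision tree, a stand-in choice) on the selected feature subset.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 12))          # synthetic placeholder data
y = rng.integers(0, 2, size=200)
n_features = X.shape[1]

def fitness(mask):
    """Wrapper fitness: CV accuracy of the target learner on the masked features."""
    if not mask.any():
        return 0.0
    clf = DecisionTreeClassifier(random_state=0)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.integers(0, 2, size=(20, n_features)).astype(bool)    # initial population
for generation in range(15):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]                # truncation selection
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(0, 10, size=2)]
        cut = rng.integers(1, n_features)                       # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < 0.05                    # bit-flip mutation
        children.append(child ^ flip)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best), "accuracy:", round(fitness(best), 3))
```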

    Cyberlaw 2.0

    This Article outlines two versions of cyberlaw. The first, characteristic of the scholarship of the late 1990s, is typified by a borderless Internet and national laws that cease to have effect at their real-space borders, the regulatory power of code, and the virtue of self-regulatory solutions to Internet and e-commerce issues. In Cyberlaw 2.0, the borderless Internet becomes bordered, bordered laws become borderless, the regulation of code becomes regulated code, and self-regulation becomes industry consultation, as government shifts toward a more traditional regulatory approach. The Article assesses each of these changes, calling attention to recent developments in copyright law, domain name dispute resolution, privacy, and Internet governance. At the heart of each is the question of the appropriate governmental role in Internet regulation and the need for cyberlaw to reconcile how government and regulation fit within the tensions of ever-changing technologies.

    A Survey to Fix the Threshold and Implementation for Detecting Duplicate Web Documents

    The drastic growth of the information accessible on the World Wide Web has made automated tools for locating information resources of interest, and for tracking and analyzing them, a necessity. Web mining is the branch of data mining that deals with the analysis of the World Wide Web. Concepts from areas such as data mining, Internet technology, the World Wide Web and, more recently, the Semantic Web can be regarded as the origins of web mining. Web mining can be defined as the procedure of discovering hidden yet potentially beneficial knowledge from the data accessible on the web. Web mining comprises three sub-areas: web content mining, web structure mining, and web usage mining. Web content mining is the process of mining knowledge from web pages and other web objects. Web structure mining is the process of mining knowledge about the link structure connecting web pages and other web objects. Web usage mining is the process of mining the usage patterns created by users accessing web pages. Search engine technology has developed along with the World Wide Web, and search engines are the chief gateways for accessing information on the web. The ability to locate content of particular interest amid a huge heap of data has made businesses more efficient and productive. Search engines respond to queries by employing web crawling, which populates an indexed repository of web pages: crawler programs navigate the web graph, retrieve pages, and build a confined repository of the segment of the web they visit. There are two main types of crawling, namely generic and focused crawling. Generic crawlers crawl documents and links on diverse topics, while focused crawlers limit the pages crawled with the aid of previously obtained specialized knowledge. Systems that index, mine, and otherwise analyze pages (such as search engines) take their input from the repositories of web pages built by web crawlers. The drastic growth of the Internet and the increasing need to integrate heterogeneous data are accompanied by the problem of near-duplicate data. Even when near-duplicate pages are not bit-wise identical, they are remarkably similar. Duplicate and near-duplicate web pages increase index storage space, slow down serving, and increase serving costs, which annoys users and causes serious problems for web search engines. Hence it is essential to design algorithms to detect such pages.
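
    One common way to operationalize the near-duplicate test discussed above is to compare the word-shingle sets of two pages under a Jaccard-similarity threshold. The sketch below illustrates only the mechanics; the shingle length and threshold value are arbitrary assumptions, not values recommended by the survey.

```python
# Hedged sketch of near-duplicate detection via word shingling and a
# Jaccard-similarity threshold. Shingle length k and the 0.8 threshold
# are illustrative choices, not the survey's recommendations.
def shingles(text, k=4):
    """Return the set of k-word shingles of a document."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicate(doc1, doc2, threshold=0.8):
    """Flag two documents as near duplicates when their similarity exceeds the threshold."""
    return jaccard(shingles(doc1), shingles(doc2)) >= threshold

page_a = "web mining extracts useful knowledge from data available on the web"
page_b = "web mining extracts useful knowledge from the data available on the web"
print("similarity:", round(jaccard(shingles(page_a), shingles(page_b)), 3))
print("near duplicate:", near_duplicate(page_a, page_b))
```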