6 research outputs found

    Model for optimising the execution of anti-spam filters

    The establishment of the first interconnection between two remote hosts in 1969 marked the beginning of one of the most important technological phenomena in human history: the Internet. The Internet has become an essential part of life for many people in the most industrialized nations, reaching a penetration of 40% of the world population in 2014. One of the reasons for its massive proliferation is e-mail, a service that allows easy and fast (nearly instantaneous) communication between users through messages and that has therefore become remarkably popular. However, the uncontrolled nature of the Internet has turned e-mail into the best framework for promoting illegal advertisements (such as those for drug sales), delivering phishing e-mails, propagating viruses and carrying out other forms of electronic scam (also called spam). Although the volume of spam deliveries fluctuates continuously, current statistics show that more than 60% of the e-mail transferred through the Internet is spam. This spam ratio is supported by the newest communication advances, such as 4G networks, which ensure a quick and easy Internet connection almost everywhere. Under these circumstances, spam filtering services and products are the most effective mechanism to fight spam. However, the massive number of e-mail deliveries per day (an average of 125 billion in 2015) has made it necessary to improve spam filtering services to adapt them to current needs. This research work introduces a new filtering model able to enhance speed and accuracy while maintaining the same philosophy and anti-spam techniques used in the most popular anti-spam filtering systems. This goal has been achieved by improving several aspects: (i) the design and development of small technical improvements to enhance overall filter throughput, (ii) the application of genetic algorithms to enhance filter accuracy and, finally, (iii) the use of scheduling algorithms to increase filtering speed.

    During the last decade, the Internet became an essential tool for communication between people. The advantages it introduced were quickly exploited by millions of network users to make services such as e-commerce, online banking and social networks a reality. This environment was also exploited by those who wish to use the new technologies to market illegal products or products of dubious reputation, or to publish or send content that annoys network users. Thus appeared spammers and spam content, which now extend across social networks, e-mail, forums, blogs and so on. Filtering and eliminating spam content requires software or services that allow its detection. Currently, spam removal is distributed as a service. It is common and effective to contract anti-spam filtering services that consist of specific filtering software or hardware and of a service that updates the filter behaviour, allowing it to adapt to variations in the e-mails being distributed. At present, these filtering services are based on the SpamAssassin software which, owing to its characteristics, allows the filter behaviour to be modelled dynamically and the resulting filters to be distributed to the filtering software installed on clients. The possibility of modelling content filters was undoubtedly the most valued feature of SpamAssassin and led to this solution being adopted even by large companies such as Symantec (Symantec Brightmail) and McAfee (McAfee SpamKiller).

    Xunta de Galicia | Ref. 08TIC041E
    Xunta de Galicia | Ref. 09TIC028

    Wirebrush4SPAM Trend Filters

    No full text
    Contains two filter definitions (a positive and a negative trend filter) for the Wirebrush4SPAM filtering platform.

    Corpus 200 Emails

    No full text
    Corpus of 200 multilingual emails (Spanish, English and Portuguese), equally distributed (100 ham and 100 spam).

    Corpus 200 Emails

    No full text
    Corpus containing 200 multilingual emails (Spanish, English and Portuguese) structured according to the RFC2822 specification.

    A new semantic-based feature selection method for spam filtering

    The Internet emerged as a powerful infrastructure for worldwide communication and interaction between people. Some unethical uses of this technology (for instance spam or viruses) created challenges in the development of mechanisms to guarantee an affordable and secure user experience. This study deals with the massive delivery of unwanted content or advertising campaigns without the consent of the target users (also known as spam). Currently, words (tokens) are selected by using feature selection schemes; they are then used to create feature vectors for training different Machine Learning (ML) approaches. This study introduces a new feature selection method able to take advantage of a semantic ontology to group words into topics and use them to build feature vectors. To this end, we have compared the performance of nine well-known ML approaches in conjunction with (i) Information Gain, the most popular feature selection method in the spam-filtering domain, (ii) Latent Dirichlet Allocation, a generative statistical model that allows sets of observations to be explained by unobserved groups that describe why some parts of the data are similar, and (iii) our semantic-based feature selection proposal. Results have shown the suitability and additional benefits of topic-driven methods to develop and deploy high-performance spam filters.

    Xunta de Galicia | Ref. ED481B 2017/018
    Xunta de Galicia | Ref. ED431C2016-040
    Agencia Estatal de Investigación | Ref. MTM2017-89422-P
    SMEIC/SRA/ERDF | Ref. TIN2017-84658-C2-1-
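
    The token-level baseline mentioned in the abstract, Information Gain, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the binary "token occurs in the message" feature and the four-message toy corpus are all assumptions made for illustration only.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, token):
    """Information Gain of the binary feature 'token occurs in the message'.

    docs is a list of token sets, labels the matching list of classes.
    """
    with_token = [y for d, y in zip(docs, labels) if token in d]
    without_token = [y for d, y in zip(docs, labels) if token not in d]
    n = len(labels)
    conditional = (
        (len(with_token) / n) * entropy(with_token)
        + (len(without_token) / n) * entropy(without_token)
    )
    return entropy(labels) - conditional


def select_top_tokens(docs, labels, k):
    """Return the k tokens with the highest Information Gain."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(
        vocabulary,
        key=lambda t: information_gain(docs, labels, t),
        reverse=True,
    )
    return ranked[:k]


# Toy corpus, invented for this example (not taken from the listed datasets).
docs = [
    {"cheap", "pills", "now"},
    {"cheap", "offer", "now"},
    {"meeting", "agenda", "now"},
    {"project", "meeting", "notes"},
]
labels = ["spam", "spam", "ham", "ham"]

top_tokens = select_top_tokens(docs, labels, 2)
```

    In this toy corpus, "cheap" and "meeting" each separate the two classes perfectly, so they rank above a token such as "now", which appears in both spam and ham. The selected tokens would then become the columns of the feature vectors used to train a classifier.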

    Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources

    No full text
    In recent years, big data analysis has become a popular means of taking advantage of multiple (initially valueless) sources to find relevant knowledge about real domains. However, a large number of big data sources provide unstructured textual data. A proper analysis requires tools able to adequately combine big data and text-analysis techniques. Keeping this in mind, we combined a pipelining framework, BDP4J (Big Data Pipelining For Java), with the implementation of a set of text preprocessing techniques in order to create NLPA (Natural Language Preprocessing Architecture), an extendable open-source plugin implementing preprocessing steps that can be easily combined to create a pipeline. Additionally, NLPA incorporates the possibility of generating datasets using either a classical token-based representation of data or newer synset-based datasets that can be further processed using semantic information (i.e., using ontologies). This work presents a case study of NLPA operation covering the transformation of raw heterogeneous big data into different dataset representations (synsets and tokens) and using the Weka application programming interface (API) to launch two well-known classifiers.
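
    The idea of chaining small preprocessing steps into a pipeline that ends in a token-based dataset can be sketched as follows. This is a Python illustration under stated assumptions, not the BDP4J or NLPA API (both are Java projects): every function name, the stop-word list and the sample texts are hypothetical, and the synset-based representation is not covered.

```python
import re
from collections import Counter


def strip_html(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)


def to_lowercase(text):
    return text.lower()


def tokenize(text):
    """Split lower-cased text into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text)


def drop_stopwords(tokens, stopwords=frozenset({"the", "a", "an", "of", "and", "to"})):
    # The stop-word list is a tiny illustrative assumption.
    return [t for t in tokens if t not in stopwords]


def make_pipeline(*steps):
    """Compose steps so that each one receives the previous step's output."""
    def run(value):
        for step in steps:
            value = step(value)
        return value
    return run


def to_token_dataset(texts, pipeline):
    """Build a token-count matrix: one row per text, one column per token."""
    bags = [Counter(pipeline(t)) for t in texts]
    vocabulary = sorted(set().union(*bags))
    rows = [[bag[token] for token in vocabulary] for bag in bags]
    return vocabulary, rows


pipeline = make_pipeline(strip_html, to_lowercase, tokenize, drop_stopwords)

# Sample texts invented for this example.
texts = ["<p>Buy the CHEAP pills</p>", "Agenda of the meeting"]
vocabulary, rows = to_token_dataset(texts, pipeline)
```

    Because each step is an ordinary function, steps can be reordered, removed or replaced without touching the others, which is the property the abstract attributes to pipeline-based preprocessing.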