7 research outputs found

An Efficient feature selection algorithm for the spam email classification

Author: Saleh Hadeel M.
Publication venue: 'International University of Sarajevo'
Publication date: 11/08/2021

Existing spam email classification systems suffer from low accuracy because of the high dimensionality of the associated feature selection (FS) process. As a global optimization process in machine learning, FS mainly aims to reduce redundancy in the dataset so that results remain acceptable and accurate. This study combines the Chaotic Particle Swarm Optimization (PSO) algorithm with the Artificial Bee Colony (ABC) algorithm to reduce feature dimensionality and thereby improve spam email classification accuracy. The features for each particle were represented in binary form, obtained by applying a sigmoid function. Feature selection was driven by a fitness function based on the accuracy obtained with an SVM. The proposed system was evaluated on the Spam Base dataset by considering both the performance of the classifier and the dimension of the selected feature vectors that served as its input. According to the results, the PSO-ABC classifier performed well in terms of FS even with a small set of selected features.
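
The binary particle representation mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly used binary-PSO rule, in which each bit is set to 1 with probability equal to the sigmoid of the corresponding velocity; the abstract does not state the exact rule, so this is an assumption. The function names (`sigmoid`, `binarize`, `select_features`) and the sample values are hypothetical and not taken from the paper. The SVM-based fitness function is not shown.

```python
import math
import random


def sigmoid(v: float) -> float:
    """Logistic function mapping a real-valued velocity to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def binarize(velocities: list[float], rng: random.Random) -> list[int]:
    """Turn a particle's velocities into a 0/1 feature mask.

    Assumed rule: bit i is set to 1 (feature i selected) with
    probability sigmoid(v_i).
    """
    return [1 if rng.random() < sigmoid(v) else 0 for v in velocities]


def select_features(row: list[float], mask: list[int]) -> list[float]:
    """Keep only the values of one sample whose mask bit is 1."""
    return [x for x, keep in zip(row, mask) if keep]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical velocities: strongly negative, strongly positive, neutral.
mask = binarize([-50.0, 50.0, 0.0, 50.0], rng)
reduced = select_features([0.1, 0.2, 0.3, 0.4], mask)
```

In a full feature-selection loop, a mask like this would be scored by training a classifier on the reduced columns, and the resulting accuracy would serve as the particle's fitness.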

Periodicals of Engineering and Natural Sciences (PEN - International University of Sarajevo)

An Improved Transformer-based Model for Detecting Phishing, Spam, and Ham: A Large Language Model Approach

Author: Jamal Suhaima; Wimmer Hayden
Publication date: 12/11/2023

Phishing and spam detection is a long-standing challenge that has been the subject of much academic research. Large Language Models (LLMs) have vast potential to transform society and provide new and innovative approaches to well-established challenges. Phishing and spam have caused financial hardship and cost email users all over the world time and resources, and they frequently serve as an entry point for ransomware threat actors. While detection approaches exist, especially heuristic-based ones, LLMs offer the potential to venture into a new, unexplored area for understanding and solving this challenge. LLMs have rapidly altered the landscape for business, consumers, and academia, and they demonstrate transformational potential for society. Applying these approaches to email detection is therefore a rational next step in academic research. In this work, we present IPSDM, our model based on fine-tuning the BERT family of models specifically to detect phishing and spam email. We demonstrate that our fine-tuned version, IPSDM, is able to better classify emails in both unbalanced and balanced datasets. This work serves as an important first step towards employing LLMs to improve the security of our information systems.

arXiv.org e-Print Archive

Information Retrieval using applied Supervised Learning for Personalized E-Commerce

Author: Hellum Kjell Arne
Publication venue: University of Stavanger, Norway
Publication date: 15/06/2017

Master's thesis in Computer Science. Personalized E-Commerce Search Challenge issued by the International Conference on Information and Knowledge Management. By analyzing historical data containing browsing logs, queries, user interactions, and static data in the domain of an online retail service, we attempt to extract patterns and derive features from the data collection that will subsequently improve prediction of relevant products. A selection of supervised learning models is trained on combinations of these features to predict the test data. Prediction is performed on the queries given by the data collection, paired with each product item originally appearing in the query. We experiment with the possible combinations of features and models and compare the results to achieve maximum prediction power. Lastly, the quality of the predictions is evaluated against a ground truth to yield scores.

UiS Brage

Mustahattuhakukoneoptimointi (Black Hat Search Engine Optimization)

Author: Piili Tommi
Publication date: 02/05/2023

The purpose of search engine optimization is to increase a website's visibility on search engine results pages. Black hat search engine optimization is the use of search engine optimization methods that violate the guidelines laid down by search engine companies. The aim of this thesis is to determine, on the basis of a literature review, which black hat search engine optimization methods and countermeasures have been studied in the scientific literature. A literature review was chosen as the research strategy so that the information examined would be based on scientific material. The systematic search for the literature review was carried out with specific search criteria in four computer science databases, and supplementary searches were carried out in two further databases. The thesis aimed to include at least forty sources in the literature review in order to reach a suitable scope. The material was selected on the basis of relevance, which was assessed by examining each article's abstract and skimming its content. Black hat methods are divided into on-page methods, link-based methods, and other methods, as well as into boosting methods and hiding methods. In addition, black hat methods can be used to deliberately lower a website's search result ranking. Excessive keyword use and the cloaking method appear frequently in the literature. Black hat search engine optimization consumes search engines' resources, degrades the quality of web search results, and can promote the spread of harmful web content. Search engine companies can issue a warning to a website that uses black hat methods, lower the website's search result ranking, or remove the website from the search index. Attempts can also be made to shut down a website that uses black hat search engine optimization through legal action. Automated methods make the detection of websites that use black hat search engine optimization more efficient. As black hat methods evolve, countermeasures must evolve as well so that the effects of black hat search engine optimization can be reduced.

Trepo - Institutional Repository of Tampere University

Structural investigations with high pressure techniques and multicomponent systems

Author: Connor Lauren Evelyn.
Publication date: 01/01/2018

This thesis illustrates the use of high pressure crystallography techniques for the discovery and investigation of solid-state forms and probes the relationship between molecular structure and compression of both single and multicomponent systems. It also investigates a data-driven approach to directing experimental co-crystallisation attempts. Single crystal X-ray diffraction techniques feature in all areas of this study, as do computational approaches, which were used to evaluate the interactions of small molecule systems. Data-mining of the Cambridge Structural Database enriched the comparison of the compression studies. The pharmaceutical co-crystal of indomethacin and saccharin was analysed with respect to increasing pressure. The system is an example of a homomolecular synthon co-crystal, allowing investigation of the component dimers free of strong interaction with surrounding molecules. The ambient pressure structure remains stable, but investigation showed that the saccharin dimer sits in a pocket made by indomethacin, allowing the dimer to lie further apart than in the pure compound. Next, a structural compression study of single-component saccharin using synchrotron radiation led to the structural characterisation of the first new polymorph of saccharin. The hydrogen bonding pattern of the new phase remains consistent; however, Pixel calculations revealed that the biggest difference in packing arises from the reduction of an interlayer distance. To further explore multicomponent systems, two stoichiometric ratios of benzoic acid and isonicotinamide (2:1 and 1:1) were investigated. The rate of compression in these systems is almost identical despite the different molecular packing in each of the stoichiometric ratios. Through the investigation of materials in these initial chapters, the rate of compression in particular supramolecular synthons, e.g. amide dimers, is demonstrated to be consistent despite the differences in the molecular make-up of the materials under study and their packing arrangements. Lastly, a data-driven approach was applied in directing the discovery of a new solid-state entity. Following previous failed attempts, machine learning was employed to direct experimental co-crystallisations, which led to a new co-crystal of artemisinin and 1-naphthol. Pixel calculations revealed that the largest contribution to crystal stabilisation comes from dispersion energy and enabled the identification of dominant intermolecular interactions in the crystal structures.

STAX (Strathclyde Repository)

From the Occam's Razor to a simple, efficient and robust text categorization approach

Author: Silva Renato Moraes, 1988-
Publication venue: [s.n.]
Publication date: 01/09/2018

Advisors: Akebo Yamakami, Tiago Agostinho de Almeida
Thesis (doctorate) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação
Abstract: Text categorization has received much attention in recent years because of the ever-increasing volume of text information. For a large number of documents, manual classification is tiresome, tedious, time-consuming, and often impractical, which creates strong demand for computational methods that perform the task automatically. Although several methods have been proposed, many suffer from the curse of dimensionality or from a high computational cost, which undermines their applicability in real scenarios. To overcome this limitation, this thesis proposes a text categorization method based on the minimum description length principle, named MDLText, which is efficient, fast, scalable, and multiclass. Its learning is fast and incremental, and it is robust enough to avoid overfitting, which is highly desirable for real, dynamic, online, and large-scale problems. Experiments performed on real, public, and large-scale datasets, followed by statistical analyses of the results, indicate that MDLText provides an excellent trade-off between predictive capability and computational cost. Motivated by these results, an initial generalization of the method was proposed to handle non-textual problems as well, resulting in a classification method named MDLClass. Like MDLText, this extension is simple and fast, and it can be applied to both binary and multiclass classification problems. Statistical analyses show that MDLClass is equivalent to most of the state-of-the-art classification methods.
Degree: Doutorado, Automação, Doutor em Engenharia Elétrica. Grant: 141089/2013-0, CNP

Repositorio da Producao Cientifica e Intelectual da Unicamp