    An Efficient feature selection algorithm for the spam email classification

    Existing spam email classification systems suffer from low accuracy because of the high dimensionality that the associated feature selection (FS) process must handle. As a global optimization problem in machine learning, FS mainly aims to reduce redundancy in the dataset so that results remain acceptable and accurate. This study combines the Chaotic Particle Swarm Optimization (PSO) algorithm with Artificial Bee Colony (ABC) to reduce feature dimensionality and thereby improve spam email classification accuracy. The features for each particle were represented in binary form: they were transformed into binary values using a sigmoid function. Feature selection was based on a fitness function that depended on the accuracy obtained with an SVM. The proposed system was evaluated on the Spam Base dataset by considering both the performance of the classifier and the dimension of the selected feature vectors that served as its input. The results show that the PSO-ABC classifier performed well in terms of FS even with a small set of selected features.
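
    The binary representation described above can be illustrated with a minimal sketch. It assumes the common binary-PSO convention in which each velocity component is passed through a sigmoid and compared with a uniform random number to decide whether a feature is kept. The names `sigmoid`, `binarize`, and `fitness`, and the `accuracy_fn` callback standing in for an SVM accuracy estimate, are hypothetical and not taken from the paper. The chaotic PSO update and the ABC phase are not shown.

```python
import math
import random


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def binarize(velocity, rng):
    """Turn real-valued velocities into a 0/1 feature mask.

    Feature i is selected when a uniform draw falls below sigmoid(v_i)
    (assumed transfer rule, standard in binary PSO).
    """
    return [1 if rng.random() < sigmoid(v) else 0 for v in velocity]


def fitness(mask, accuracy_fn):
    """Score a mask by classifier accuracy on the selected features.

    accuracy_fn is a placeholder for, e.g., cross-validated SVM accuracy.
    An empty mask cannot be classified on, so it scores 0.
    """
    if not any(mask):
        return 0.0
    return accuracy_fn(mask)


if __name__ == "__main__":
    rng = random.Random(0)
    mask = binarize([2.0, -2.0, 0.0, 1.5], rng)
    # Toy stand-in for SVM accuracy: pretend features 0 and 3 are informative.
    toy_accuracy = lambda m: 0.5 + 0.2 * m[0] + 0.2 * m[3]
    print(mask, fitness(mask, toy_accuracy))
```

    In a full optimizer, `fitness` would be called once per particle per iteration, so the cost of the accuracy estimate dominates the run time.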

    An Improved Transformer-based Model for Detecting Phishing, Spam, and Ham: A Large Language Model Approach

    Phishing and spam detection is a long-standing challenge that has been the subject of much academic research. Large Language Models (LLMs) have vast potential to transform society and provide new and innovative approaches to well-established challenges. Phishing and spam have caused financial hardship and lost time and resources for email users all over the world, and frequently serve as an entry point for ransomware threat actors. While detection approaches exist, especially heuristic-based ones, LLMs offer the potential to venture into a new, unexplored area for understanding and solving this challenge. LLMs have rapidly altered the landscape across business, consumers, and academia, and demonstrate transformational potential for society. Applying these new approaches to email detection is therefore a rational next step in academic research. In this work, we present IPSDM, our model based on fine-tuning the BERT family of models specifically to detect phishing and spam email. We demonstrate that our fine-tuned version, IPSDM, is able to classify emails better in both unbalanced and balanced datasets. This work serves as an important first step towards employing LLMs to improve the security of our information systems.

    Information Retrieval using applied Supervised Learning for Personalized E-Commerce

    Master's thesis in Computer Science. Personalized E-Commerce Search Challenge issued by the International Conference on Information and Knowledge Management. By analyzing historical data containing browsing logs, queries, user interactions, and static data in the domain of an online retail service, we attempt to extract patterns and derive features from the data collection that will subsequently improve prediction of relevant products. A selection of supervised learning models is trained on an assembly of these features to predict test data. Prediction is performed on the queries given by the data collection, paired with each product item originally appearing in the query. We experiment with the possible assemblies of features along with the models and compare the results to achieve maximum prediction power. Lastly, the quality of the predictions is evaluated against a ground truth to yield scores.


    The purpose of search engine optimization is to increase a website's visibility on search engine results pages. Black hat search engine optimization is the use of search engine optimization methods that violate the guidelines set by search engine companies. The aim of this thesis is to determine, on the basis of a literature review, which black hat search engine optimization methods and countermeasures have been studied in the scientific literature. A literature review was chosen as the research strategy so that the information studied would be based on scientific material. The systematic search for the literature review was carried out with specific search criteria in four computer science databases, and supplementary searches were carried out in two other databases. The aim was to include at least forty sources in the literature review to achieve a suitable scope. The material was selected on the basis of relevance, which was assessed by examining the abstracts of the articles and taking an overview of their content. Black hat methods are divided into on-page methods, link-based methods, and other methods, as well as into boosting methods and hiding methods. In addition, black hat methods can be used to deliberately lower a website's search result ranking. Excessive keyword use and the cloaking method appear frequently in the literature. Black hat search engine optimization consumes search engines' resources, degrades the quality of web search results, and can promote the spread of harmful web content. Search engine companies can issue a warning to a website that uses black hat methods, lower the website's search result ranking, or remove the website from the search index. Attempts can also be made to stop the operation of a website that uses black hat search engine optimization through legal action. Automatic methods make the detection of websites that use black hat search engine optimization more efficient. As black hat methods evolve, countermeasures must evolve too, so that the effects of black hat search engine optimization can be reduced.

    Structural investigations with high pressure techniques and multicomponent systems

    This thesis illustrates the use of high-pressure crystallography techniques for the discovery and investigation of solid-state forms and probes the relationship between molecular structure and compression in both single and multicomponent systems. It also investigates a data-driven approach to directing experimental co-crystallisation attempts. Single-crystal X-ray diffraction techniques feature in all areas of this study, alongside computational approaches that were used to evaluate the interactions in small-molecule systems. Data mining of the Cambridge Structural Database enriched the comparison of the compression studies. The pharmaceutical co-crystal of indomethacin and saccharin was analysed with respect to increasing pressure. The system is an example of a homomolecular synthon co-crystal, allowing investigation of the component dimers free of strong interaction with surrounding molecules. The ambient pressure structure remains stable, but the investigation showed that the saccharin dimer sits in a pocket made by indomethacin, allowing the dimer to lie further apart than in the pure compound. Next, a structural compression study of single-component saccharin using synchrotron radiation led to the structural characterisation of the first new polymorph of saccharin. The hydrogen bonding pattern of the new phase remains consistent; however, Pixel calculations revealed that the biggest difference in packing arises from the reduction of an interlayer distance. To further explore multicomponent systems, two stoichiometric ratios of benzoic acid and isonicotinamide (2:1 and 1:1) were investigated. The rate of compression in these systems is almost identical despite the different molecular packing in each of the stoichiometric ratios. Through the investigation of materials in these initial chapters, the rate of compression in particular supramolecular synthons, e.g. amide dimers, is demonstrated to be consistent despite the differences in the molecular make-up of the materials under study and their packing arrangements. Lastly, a data-driven approach was applied to direct the discovery of a new solid-state entity. Following previous failed attempts, machine learning was employed to direct experimental co-crystallisations, which led to a new co-crystal of artemisinin and 1-naphthol.

    From the Occam's Razor to a simple, efficient and robust text categorization approach

    Advisors: Akebo Yamakami, Tiago Agostinho de Almeida. Thesis (doctorate) - Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação. Abstract: Text categorization has received much attention in recent years because of the ever-increasing volume of text information. For a large number of documents, manual classification is tiresome, tedious, time-consuming, and often impractical, so there is great demand for computational methods that perform this task automatically. Although several methods have been proposed, many suffer from the curse of dimensionality or from a high computational cost, which undermines their applicability in real scenarios. To overcome this limitation, this thesis proposes a text categorization method based on the minimum description length principle, named MDLText, that is efficient, fast, scalable, and multiclass. Its learning is fast and incremental, and it is robust enough to avoid overfitting, which is highly desirable for real, dynamic, online, and large-scale problems. Experiments performed on real, public, and large-scale datasets, followed by statistical analyses of the results, indicate that MDLText provides an excellent trade-off between predictive capability and computational cost. Motivated by these results, an initial generalization of the method, named MDLClass, is proposed to handle non-textual problems as well. Like MDLText, it is simple and fast and can be applied to binary and multiclass classification problems. Statistical analyses show that MDLClass is equivalent to most of the classification methods considered state of the art. Doctorate. Automation. Doctor in Electrical Engineering. 141089/2013-0 CNP
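
    To make the minimum description length idea concrete, the following is a deliberately simplified sketch and not the MDLText method itself. It assumes that the cost of a document under a class is the sum of -log2 of Laplace-smoothed token frequencies for that class, and that the predicted class is the one with the shortest code. The class name `DescriptionLengthSketch` and its methods are hypothetical. Training only increments counters, which illustrates why learning of this kind can be incremental.

```python
import math
from collections import defaultdict


class DescriptionLengthSketch:
    """Toy classifier: pick the class that encodes a document in the fewest bits."""

    def __init__(self):
        self.term_counts = defaultdict(lambda: defaultdict(int))
        self.total_counts = defaultdict(int)
        self.vocabulary = set()

    def learn(self, tokens, label):
        """Incremental update: one labeled document at a time."""
        for token in tokens:
            self.term_counts[label][token] += 1
            self.total_counts[label] += 1
            self.vocabulary.add(token)

    def description_length(self, tokens, label):
        """Bits needed to encode tokens under the class's smoothed frequencies."""
        vocab_size = len(self.vocabulary) + 1  # extra slot for unseen tokens
        total = self.total_counts[label]
        bits = 0.0
        for token in tokens:
            count = self.term_counts[label].get(token, 0)
            bits += -math.log2((count + 1) / (total + vocab_size))
        return bits

    def predict(self, tokens):
        """Return the learned label with the shortest description length."""
        if not self.total_counts:
            raise ValueError("no training documents seen yet")
        labels = list(self.total_counts)
        return min(labels, key=lambda lab: self.description_length(tokens, lab))


if __name__ == "__main__":
    model = DescriptionLengthSketch()
    model.learn(["free", "money", "win"], "spam")
    model.learn(["meeting", "agenda", "monday"], "ham")
    print(model.predict(["free", "win"]))
```

    This simplification ignores class priors and any term weighting, so it should be read only as an illustration of choosing the class with the shortest encoding.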