241 research outputs found

    A Survey on Text Classification Algorithms: From Text to Predictions

    In recent years, the exponential growth of digital documents has been met by rapid progress in text classification techniques. Newly proposed machine learning algorithms leverage the latest advancements in deep learning methods, allowing for the automatic extraction of expressive features. The swift development of these methods has led to a plethora of strategies to encode natural language into machine-interpretable data. The latest language modelling algorithms are used in conjunction with ad hoc preprocessing procedures, whose description is often omitted in favour of a more detailed explanation of the classification step. This paper offers a concise review of recent text classification models, with emphasis on the flow of data, from raw text to output labels. We highlight the differences between earlier methods and more recent, deep learning-based methods, both in how they function and in how they transform input data. To give a better perspective on the text classification landscape, we provide an overview of datasets for the English language, and we supply instructions for the synthesis of two new multilabel datasets, a type of dataset we found to be particularly scarce in this setting. Finally, we provide an outline of new experimental results and discuss the open research challenges posed by deep learning-based language models.

    Data discovery through profile-based similarity metrics

    The most essential step in a data integration process is to find the datasets whose combined information provides relevant insights. This task, defined as data discovery, is highly dependent on the definition of the similarity between the candidate attributes to join, which commonly involves assessing the closeness of the semantic concepts that the two attributes represent. Most of the state-of-the-art approaches to this issue rely on syntactic methodologies, that is, procedures in which the instances of the two columns are compared to determine whether they are similar or not. These approaches suffice when the two sets of instances share the same syntactic representation but fail to detect cases in which the same semantic idea is represented by different sets of values. This latter case is increasingly common, given the characteristics of big-data environments and the lack of standardization of the data. The aim of this project is to develop a system that can solve this issue and facilitate the establishment of relationships between related data that do not share a syntactic relationship. The approach presented in this work combines the extensively studied syntactic methodologies for data discovery with a new formulation of semantic similarity: the resemblance of probability distributions. Additionally, this system will be made scalable and able to handle vast quantities of data.

    A Dependable Hybrid Machine Learning Model for Network Intrusion Detection

    Network intrusion detection systems (NIDSs) play an important role in computer network security. Among the several detection mechanisms, anomaly-based automated detection significantly outperforms the others. Amid the sophistication and growing number of attacks, dealing with large amounts of data is a recognized issue in the development of anomaly-based NIDSs. However, do current models meet the needs of today's networks in terms of required accuracy and dependability? In this research, we propose a new hybrid model that combines machine learning and deep learning to increase detection rates while securing dependability. Our proposed method ensures efficient pre-processing by combining SMOTE for data balancing and XGBoost for feature selection. We compared our developed method to various machine learning and deep learning algorithms to find a more efficient algorithm to implement in the pipeline. Furthermore, we chose the most effective model for network intrusion based on a set of benchmarked performance analysis criteria. Our method produces excellent results when tested on two datasets, KDDCUP'99 and CIC-MalMem-2022, with accuracies of 99.99% and 100%, respectively, and no overfitting or Type-1 and Type-2 issues.
    Comment: Accepted in the Journal of Information Security and Applications (Scopus, Web of Science (SCIE) Journal, Quartile: Q1, Site Score: 7.6, Impact Factor: 4.96) on 7 December 202