Combined supervised and unsupervised learning to identify subclasses of disease for better prediction
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London. Disease subtyping, which aids in the development of personalised treatments, remains a challenge in data analysis because of the many different ways to group patients based upon their data. However, if I can identify subclasses of disease, this will help to develop better models that are more specific to individuals and should therefore improve prediction and understanding of the underlying characteristics of the disease in question. In addition, patients might suffer from multiple disease complications. Models that are tailored to individuals could improve both prediction of multiple complications and understanding of underlying disease characteristics. However, AI models can become outdated over time due to either sudden changes in the underlying data, such as those caused by new measurement methods, or incremental changes, such as the ageing of the study population. This thesis proposes a new algorithm that integrates consensus clustering methods with classification in order to overcome issues with sample bias. The method was tested on a freely available dataset of real-world breast cancer cases and data from a London hospital on systemic sclerosis, a rare and potentially fatal condition. The results show that nearest consensus clustering classification improves accuracy and prediction significantly when this algorithm is compared with competitive similar methods. In addition, this thesis proposes a new algorithm that integrates latent class models with classification. The new algorithm uses latent class models to cluster patients within groups; this results in improved classification and aids in the understanding of the underlying differences of the discovered groups. The method was tested on data from patients with systemic sclerosis (SSc), a rare and potentially fatal condition, and coronary heart disease.
Results show that the latent class multi-label classification (MLC) model improves accuracy when compared with competitive similar methods. Finally, this thesis implemented the updated concept drift method (DDM) to monitor AI models over time and detect drifts when they occur. The method was tested on data from patients with SSc and patients with coronavirus disease (COVID)
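As a rough illustration of the drift monitoring mentioned in the abstract above, the following is a minimal sketch of the classic Drift Detection Method (Gama et al.), which tracks a model's running error rate and flags drift when it rises well above its historical minimum. The abstract does not describe how the thesis's "updated" variant differs, so this is the textbook version under assumed defaults; the class name `DDM`, the parameters `min_samples`, `warn_level`, and `drift_level`, and the synthetic error stream are all illustrative assumptions rather than the author's implementation or data.

```python
import math


class DDM:
    """Classic Drift Detection Method over a stream of prediction errors.

    Assumed defaults (not taken from the thesis): 30 warm-up samples,
    warning at 2 standard deviations, drift at 3.
    """

    def __init__(self, min_samples=30, warn_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warn_level = warn_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        # Forget all statistics, e.g. after a drift has been flagged.
        self.n = 0
        self.errors = 0
        self.p_min = math.inf
        self.s_min = math.inf

    def update(self, is_error):
        """Feed one outcome (True if the model was wrong); return a state."""
        self.n += 1
        self.errors += int(bool(is_error))
        p = self.errors / self.n               # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)  # its binomial std. deviation
        if self.n < self.min_samples:
            return "stable"
        # Remember the best (lowest) p + s seen so far.
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return "drift"
        if p + s > self.p_min + self.warn_level * self.s_min:
            return "warning"
        return "stable"


# Synthetic stream: 10% errors for 200 samples, then 50% errors.
stream = [i % 10 == 0 for i in range(1, 201)]
stream += [i % 2 == 0 for i in range(201, 401)]

detector = DDM()
drift_points = [
    i for i, err in enumerate(stream, start=1)
    if detector.update(err) == "drift"
]
print(drift_points)
```

On this synthetic stream the first flagged index falls after sample 200, where the error rate changes. In a clinical setting the outcomes fed to `update` would come from comparing a deployed model's predictions with later-confirmed labels.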
A Survey on Text Classification Algorithms: From Text to Predictions
In recent years, the exponential growth of digital documents has been met by rapid progress in text classification techniques. Newly proposed machine learning algorithms leverage the latest advancements in deep learning methods, allowing for the automatic extraction of expressive features. The swift development of these methods has led to a plethora of strategies to encode natural language into machine-interpretable data. The latest language modelling algorithms are used in conjunction with ad hoc preprocessing procedures, of which the description is often omitted in favour of a more detailed explanation of the classification step. This paper offers a concise review of recent text classification models, with emphasis on the flow of data, from raw text to output labels. We highlight the differences between earlier methods and more recent, deep learning-based methods in both their functioning and in how they transform input data. To give a better perspective on the text classification landscape, we provide an overview of datasets for the English language, as well as supplying instructions for the synthesis of two new multilabel datasets, which we found to be particularly scarce in this setting. Finally, we provide an outline of new experimental results and discuss the open research challenges posed by deep learning-based language models
Data discovery through profile-based similarity metrics
The most essential step in a data integration process is to find the datasets whose combined information provides relevant insights. This task, defined as data discovery, is highly dependent on the definition of the similarity between the candidate attributes to join, which commonly involves assessing the closeness of the semantic concepts that the two attributes represent. Most of the state-of-the-art approaches to this issue rely on syntactic methodologies, that is, procedures in which the instances of the two columns are compared to determine whether they are similar or not. These approaches suffice when the two sets of instances share the same syntactic representation but fail to detect cases in which the same semantic idea is represented by different sets of values. This latter case is ever-increasing in proportion, given the characteristics of big-data environments and the lack of standardization of the data. The aim of this project is to develop a system that can solve this issue and facilitate the establishment of relationships between related data that do not share a syntactic relationship. The approach presented in this work leverages the extensively studied syntactic methodologies to data discovery in conjunction with a new formulation for semantic similarity: the resemblance of probability distributions. Additionally, this system will be made scalable and able to handle vast quantities of data
A Dependable Hybrid Machine Learning Model for Network Intrusion Detection
Network intrusion detection systems (NIDSs) play an important role in
computer network security. There are several detection mechanisms where
anomaly-based automated detection outperforms others significantly. Amid the
sophistication and growing number of attacks, dealing with large amounts of
data is a recognized issue in the development of anomaly-based NIDS. However,
do current models meet the needs of today's networks in terms of required
accuracy and dependability? In this research, we propose a new hybrid model
that combines machine learning and deep learning to increase detection rates
while securing dependability. Our proposed method ensures efficient
pre-processing by combining SMOTE for data balancing and XGBoost for feature
selection. We compared our developed method to various machine learning and
deep learning algorithms to find a more efficient algorithm to implement in the
pipeline. Furthermore, we chose the most effective model for network intrusion
based on a set of benchmarked performance analysis criteria. Our method
produces excellent results when tested on two datasets, KDDCUP'99 and
CIC-MalMem-2022, with an accuracy of 99.99% and 100% for KDDCUP'99 and
CIC-MalMem-2022, respectively, and no overfitting or Type-1 and Type-2 issues.
Comment: Accepted in the Journal of Information Security and Applications
(Scopus, Web of Science (SCIE) Journal, Quartile: Q1, Site Score: 7.6, Impact
Factor: 4.96) on 7 December 202