640 research outputs found

    Multiple-Domain Sentiment Classification for Cantonese Using a Combined Approach

    Get PDF
    In this study, we proposed a combined approach, which amalgamates machine learning and lexicon- based approaches for multiple-domain sentiment classification that supports Cantonese-based social media analysis. Our study contributes to the existing literature not only by investigating the effectiveness of the proposed combined approach for supporting social media analysis in the Cantonese context but also by verifying that the proposed method outperforms the baseline approaches, which are commonly used in the literature. We demonstrated that social media network-based classifiers can be general classifiers that support multiple-domain sentiment classification

    Improving spam email classification accuracy using ensemble techniques: a stacking approach

    Get PDF
    Spam emails pose a substantial cybersecurity danger, necessitating accurate classification to reduce unwanted messages and mitigate risks. This study focuses on enhancing spam email classification accuracy using stacking ensemble machine learning techniques.We trained and tested five classifiers: logistic regression, decision tree, K-nearest neighbors (KNN), Gaussian naive Bayes and AdaBoost. To address overfitting, two distinct datasets of spam emails were aggregated and balanced. Evaluating individual classifiers based on recall, precision and F1 score metrics revealed AdaBoost as the top performer. Considering evolving spam technology and new message types challenging traditional approaches, we propose a stacking method. By combining predictions from multiple base models, the stacking method aims to improve classification accuracy. The results demonstrate superior performance of the stacking method with the highest accuracy (98.8%), recall (98.8%) and F1 score (98.9%) among tested methods. Additional experiments validated our approach by varying dataset sizes and testing different classifier combinations. Our study presents an innovative combination of classifiers that significantly improves accuracy, contributing to the growing body of research on stacking techniques. Moreover, we compare classifier performances using a unique combination of two datasets, highlighting the potential of ensemble techniques, specifically stacking, in enhancing spam email classification accuracy. The implications extend beyond spam classification systems, offering insights applicable to other classification tasks. Continued research on emerging spam techniques is vital to ensure long-term effectiveness

    Development and evaluation of ensemble-based classification models for predicting unplanned hospital readmissions after hysterectomy

    Get PDF
    Unplanned hospital readmissions are a key indicator of quality in healthcare and can lead to high unnecessary costs for the hospital due to additional required resources or reduced payments by insurers or governments. Predictive analytics can support the identification of patients at high-risk for readmission early on to enable timely interventions. In Australia, hysterectomies present the 2nd highest observed readmission rates of all surgical procedures in public hospitals. Prior research so far only focuses on developing explanatory models to identify associated risk factors for past patients. In this study, we develop and compare 24 prediction models using state-of-the-art sampling and ensemble methods to counter common problems in readmission prediction, such as imbalanced data and poor performance of individual classifiers. The application and evaluation of these models are presented, resulting in an excellent predictive power with under- and oversampling and an additional slight increase in performance when combined with ensemble methods

    Unsupervised Intrusion Detection with Cross-Domain Artificial Intelligence Methods

    Get PDF
    Cybercrime is a major concern for corporations, business owners, governments and citizens, and it continues to grow in spite of increasing investments in security and fraud prevention. The main challenges in this research field are: being able to detect unknown attacks, and reducing the false positive ratio. The aim of this research work was to target both problems by leveraging four artificial intelligence techniques. The first technique is a novel unsupervised learning method based on skip-gram modeling. It was designed, developed and tested against a public dataset with popular intrusion patterns. A high accuracy and a low false positive rate were achieved without prior knowledge of attack patterns. The second technique is a novel unsupervised learning method based on topic modeling. It was applied to three related domains (network attacks, payments fraud, IoT malware traffic). A high accuracy was achieved in the three scenarios, even though the malicious activity significantly differs from one domain to the other. The third technique is a novel unsupervised learning method based on deep autoencoders, with feature selection performed by a supervised method, random forest. Obtained results showed that this technique can outperform other similar techniques. The fourth technique is based on an MLP neural network, and is applied to alert reduction in fraud prevention. This method automates manual reviews previously done by human experts, without significantly impacting accuracy

    Computer vision and machine learning for medical image analysis: recent advances, challenges, and way forward.

    Get PDF
    The recent development in the areas of deep learning and deep convolutional neural networks has significantly progressed and advanced the field of computer vision (CV) and image analysis and understanding. Complex tasks such as classifying and segmenting medical images and localising and recognising objects of interest have become much less challenging. This progress has the potential of accelerating research and deployment of multitudes of medical applications that utilise CV. However, in reality, there are limited practical examples being physically deployed into front-line health facilities. In this paper, we examine the current state of the art in CV as applied to the medical domain. We discuss the main challenges in CV and intelligent data-driven medical applications and suggest future directions to accelerate research, development, and deployment of CV applications in health practices. First, we critically review existing literature in the CV domain that addresses complex vision tasks, including: medical image classification; shape and object recognition from images; and medical segmentation. Second, we present an in-depth discussion of the various challenges that are considered barriers to accelerating research, development, and deployment of intelligent CV methods in real-life medical applications and hospitals. Finally, we conclude by discussing future directions

    Machine learning for network based intrusion detection: an investigation into discrepancies in findings with the KDD cup '99 data set and multi-objective evolution of neural network classifier ensembles from imbalanced data.

    Get PDF
    For the last decade it has become commonplace to evaluate machine learning techniques for network based intrusion detection on the KDD Cup '99 data set. This data set has served well to demonstrate that machine learning can be useful in intrusion detection. However, it has undergone some criticism in the literature, and it is out of date. Therefore, some researchers question the validity of the findings reported based on this data set. Furthermore, as identified in this thesis, there are also discrepancies in the findings reported in the literature. In some cases the results are contradictory. Consequently, it is difficult to analyse the current body of research to determine the value in the findings. This thesis reports on an empirical investigation to determine the underlying causes of the discrepancies. Several methodological factors, such as choice of data subset, validation method and data preprocessing, are identified and are found to affect the results significantly. These findings have also enabled a better interpretation of the current body of research. Furthermore, the criticisms in the literature are addressed and future use of the data set is discussed, which is important since researchers continue to use it due to a lack of better publicly available alternatives. Due to the nature of the intrusion detection domain, there is an extreme imbalance among the classes in the KDD Cup '99 data set, which poses a significant challenge to machine learning. In other domains, researchers have demonstrated that well known techniques such as Artificial Neural Networks (ANNs) and Decision Trees (DTs) often fail to learn the minor class(es) due to class imbalance. However, this has not been recognized as an issue in intrusion detection previously. This thesis reports on an empirical investigation that demonstrates that it is the class imbalance that causes the poor detection of some classes of intrusion reported in the literature. An alternative approach to training ANNs is proposed in this thesis, using Genetic Algorithms (GAs) to evolve the weights of the ANNs, referred to as an Evolutionary Neural Network (ENN). When employing evaluation functions that calculate the fitness proportionally to the instances of each class, thereby avoiding a bias towards the major class(es) in the data set, significantly improved true positive rates are obtained whilst maintaining a low false positive rate. These findings demonstrate that the issues of learning from imbalanced data are not due to limitations of the ANNs; rather the training algorithm. Moreover, the ENN is capable of detecting a class of intrusion that has been reported in the literature to be undetectable by ANNs. One limitation of the ENN is a lack of control of the classification trade-off the ANNs obtain. This is identified as a general issue with current approaches to creating classifiers. Striving to create a single best classifier that obtains the highest accuracy may give an unfruitful classification trade-off, which is demonstrated clearly in this thesis. Therefore, an extension of the ENN is proposed, using a Multi-Objective GA (MOGA), which treats the classification rate on each class as a separate objective. This approach produces a Pareto front of non-dominated solutions that exhibit different classification trade-offs, from which the user can select one with the desired properties. The multi-objective approach is also utilised to evolve classifier ensembles, which yields an improved Pareto front of solutions. Furthermore, the selection of classifier members for the ensembles is investigated, demonstrating how this affects the performance of the resultant ensembles. This is a key to explaining why some classifier combinations fail to give fruitful solutions

    Self-organizing maps in computer security

    Get PDF

    Sentiment analysis of noisy Malay text: State of art, challenges and future work

    Get PDF
    Sentiment analysis (SA) is a study where people's opinions and emotions are automatically extracted in the form of sentiments from the natural language text. In social media monitoring, it is very useful because it allows user to gain an overall picture of the extensive public opinion behind many topics. Most works on SA are for the English text. Only a few works focus on the Malay language. Currently, a review on SA for the Malay language only focus on the SA approaches and the dataset. Some major issues such as the pre-processing techniques used to normalize the noisy text, the most employed performance measures for Malay SA, and the challenges for Malay SA has not been reviewed. Malaysians tend not to fully follow any abbreviations rules when writing on social media. Thus, a lot of noisy text can be found in social media sites like Facebook and Twitter which create some issues to SA process. Hence, the aim of this study is to investigate the state of the art, challenges and future works of SA for Malay social media text. This study provides a review on various approaches, datasets, performance measures, and pre-processing techniques used in the previous works on SA of the Malay text. More than 700 articles from journals and conference proceedings have been identified using the search keywords, however, only 17 relevant articles published from year 2013 to 2018 were reviewed. The findings from this review focus on three commonly used SA approaches which are lexicon-based, machine learning, and hybrid
    corecore