
    A comprehensive study on disease risk predictions in machine learning

    Over recent years, multiple disease risk prediction models have been developed. These models use various patient characteristics to estimate the probability of an outcome over a certain period of time, and they hold the potential to improve decision making and individualize care. Discovering hidden patterns and interactions in medical databases, together with sound evaluation of the resulting prediction models, has therefore become crucial; traditional clinical approaches require many trials, which complicates disease prediction. This paper presents a comprehensive survey of the different strategies used to predict disease. Applying these techniques to healthcare data has improved risk prediction models and helps to identify the patients who would benefit from disease management programs, reducing hospital readmissions and healthcare costs, although the results of these endeavours have been mixed.
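
    A minimal sketch, not taken from the surveyed work, of the kind of risk prediction model described above: a logistic regression that maps patient characteristics to an outcome probability. The feature names and the synthetic data are illustrative assumptions.

```python
# Minimal sketch (not from the surveyed papers): a disease risk prediction model
# mapping patient characteristics to an outcome probability.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical patient characteristics: age, BMI, systolic blood pressure, smoker flag
X = np.column_stack([
    rng.integers(30, 80, 500),   # age
    rng.normal(27, 4, 500),      # BMI
    rng.normal(130, 15, 500),    # systolic blood pressure
    rng.integers(0, 2, 500),     # smoker (0/1)
])
y = rng.integers(0, 2, 500)      # synthetic outcome label (disease within follow-up period)

model = LogisticRegression(max_iter=1000).fit(X, y)
new_patient = np.array([[62, 31.0, 145, 1]])
print("Estimated outcome probability:", model.predict_proba(new_patient)[0, 1])
```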

    A modified multi-class association rule for text mining

    Classification and association rule mining are significant tasks in data mining. Integrating association rule discovery and classification brings us an approach known as associative classification. One common shortcoming of existing Association Classifiers is the huge number of rules produced in order to obtain high classification accuracy. This study proposes a Modified Multi-class Association Rule Mining (mMCAR) algorithm that consists of three procedures: rule discovery, rule pruning and group-based class assignment. The rule discovery and rule pruning procedures are designed to reduce the number of classification rules, while the group-based class assignment procedure contributes to improving the classification accuracy. Experiments on structured and unstructured text datasets obtained from the UCI and Reuters repositories are performed in order to evaluate the proposed Association Classifier. The proposed mMCAR classifier is benchmarked against traditional classifiers and existing Association Classifiers. Experimental results indicate that the proposed Association Classifier, mMCAR, produces high accuracy with a smaller number of classification rules. For the structured dataset, mMCAR produces an average accuracy of 84.24% compared to MCAR's 84.23%. Even though the difference in classification accuracy is small, the proposed mMCAR uses only 50 rules for the classification while its benchmark method involves 60 rules. On the other hand, mMCAR is on par with MCAR when the unstructured dataset is utilized: both classifiers produce 89% accuracy, but mMCAR uses fewer rules for the classification. This study contributes to the text mining domain, as automatic classification of huge and widely distributed textual data could facilitate the text representation and retrieval processes.
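
    A simplified sketch, under stated assumptions, of the three mMCAR procedures named above (rule discovery, rule pruning, group-based class assignment). The thresholds, the tiny dataset and the scoring scheme are illustrative; the paper's exact algorithm is not reproduced here.

```python
# Simplified associative-classification sketch: rule discovery, rule pruning,
# and group-based class assignment. Thresholds and data are assumptions.
from collections import Counter, defaultdict
from itertools import combinations

def discover_rules(transactions, min_support=0.1, min_confidence=0.6, max_len=2):
    """Mine class association rules (itemset -> class) from (items, label) pairs."""
    n = len(transactions)
    itemset_counts, rule_counts = Counter(), Counter()
    for items, label in transactions:
        for k in range(1, max_len + 1):
            for combo in combinations(sorted(items), k):
                itemset_counts[combo] += 1
                rule_counts[(combo, label)] += 1
    rules = []
    for (combo, label), cnt in rule_counts.items():
        support, confidence = cnt / n, cnt / itemset_counts[combo]
        if support >= min_support and confidence >= min_confidence:
            rules.append((combo, label, confidence))
    return rules

def prune_rules(rules, max_rules=50):
    """Keep only the strongest rules to reduce classifier size."""
    return sorted(rules, key=lambda r: -r[2])[:max_rules]

def group_based_class(rules, items):
    """Group matching rules by class and pick the class with the highest total confidence."""
    scores = defaultdict(float)
    for combo, label, confidence in rules:
        if set(combo) <= set(items):
            scores[label] += confidence
    return max(scores, key=scores.get) if scores else None

# Tiny illustrative dataset: (word set, class label)
data = [({"ball", "goal"}, "sport"), ({"goal", "team"}, "sport"),
        ({"vote", "party"}, "politics"), ({"party", "election"}, "politics")]
classifier = prune_rules(discover_rules(data))
print(group_based_class(classifier, {"goal", "team", "ball"}))
```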

    LC: an effective classification based association rule mining algorithm

    Classification using association rules is a research field in data mining that primarily uses association rule discovery techniques in classification benchmarks. Many research studies in the literature have confirmed that classification using association tends to generate more predictive classification systems than traditional classification data mining techniques such as probabilistic, statistical and decision tree methods. In this thesis, we introduce a novel data mining algorithm based on classification using association called “Looking at the Class” (LC), which can be used for mining a range of classification data sets. Unlike known algorithms in the classification-using-association approach, such as the Classification based on Association rules (CBA) system and the Classification based on Predictive Association Rules (CPAR) system, which merge disjoint items in the rule learning step without considering class label similarity, the proposed algorithm merges only items with identical class labels. This avoids many unnecessary item combinations during the rule learning step and consequently results in large savings in computational time and memory. Furthermore, the LC algorithm uses a novel prediction procedure that employs multiple rules to make the prediction decision instead of a single rule. The proposed algorithm has been evaluated thoroughly on real-world security data sets collected using an automated tool developed at Huddersfield University. The security application considered in this thesis is categorising websites as legitimate or fake based on their features, which is a typical binary classification problem. Experiments on a number of UCI data sets have also been conducted, with classification accuracy, memory usage and other measures used for evaluation. The results show that the LC algorithm outperformed traditional classification algorithms such as C4.5, PART and Naïve Bayes, as well as known classification based association algorithms like CBA, with respect to classification accuracy, memory usage and execution time on most of the data sets considered.
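
    A small illustrative sketch of the class-aware merging idea attributed to LC above: 2-item candidates are formed only from items whose (majority) class labels agree, rather than merging disjoint items regardless of class. The item names, the toy phishing-style data and the majority-label heuristic are assumptions, not the thesis's full algorithm.

```python
# Class-aware candidate generation: combine items only when their majority
# class labels are identical. Data and heuristic are illustrative assumptions.
from collections import Counter
from itertools import combinations

def item_class_labels(transactions):
    """Assign each item the class label it most frequently co-occurs with."""
    counts = {}
    for items, label in transactions:
        for item in items:
            counts.setdefault(item, Counter())[label] += 1
    return {item: c.most_common(1)[0][0] for item, c in counts.items()}

def candidate_pairs(transactions):
    """Generate 2-item candidates, merging only items whose class labels match."""
    labels = item_class_labels(transactions)
    items = sorted(labels)
    return [(a, b) for a, b in combinations(items, 2) if labels[a] == labels[b]]

data = [({"https", "padlock"}, "legitimate"), ({"https", "long_domain"}, "legitimate"),
        ({"ip_url", "at_symbol"}, "fake"), ({"ip_url", "redirect"}, "fake")]
print(candidate_pairs(data))  # only pairs of items sharing a class label survive
```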

    Partial rule match for filtering rules in associative classification

    In this study, we propose a new method to enhance the accuracy of the Modified Multi-class Classification based on Association Rule (MMCAR) classifier. We introduce a Partial Rule Match Filtering (PRMF) method that allows a minimal match of the items in a rule's body in order for the rule to be added to the classifier. Experiments on the Reuters-21578 data sets are performed in order to evaluate the effectiveness of PRMF in MMCAR. Results show that the MMCAR classifier performs better than the chosen competitors.
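
    A rough sketch of the partial-match idea behind PRMF as described above: a rule is kept in the classifier if at least a minimum fraction of its body items appears in a training document, instead of requiring a full match. The 0.5 threshold and the toy Reuters-style data are assumptions.

```python
# Partial rule match filtering sketch: keep a rule if enough of its body
# items appear in at least one training document. Threshold is an assumption.
def partial_match_ratio(rule_body, document_items):
    """Fraction of the rule's body items present in the document."""
    return len(set(rule_body) & set(document_items)) / len(rule_body)

def filter_rules(rules, training_docs, min_ratio=0.5):
    """Keep a rule if it partially matches at least one training document."""
    kept = []
    for body, label in rules:
        if any(partial_match_ratio(body, items) >= min_ratio for items, _ in training_docs):
            kept.append((body, label))
    return kept

rules = [(("oil", "barrel", "opec"), "crude"), (("wheat", "harvest", "tonnes"), "grain")]
docs = [({"oil", "opec", "price"}, "crude")]
print(filter_rules(rules, docs))  # the 'crude' rule matches 2/3 of its body and is kept
```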

    Dynamic Rule Covering Classification in Data Mining with Cyber Security Phishing Application

    Data mining is the process of discovering useful patterns from datasets using intelligent techniques to help users make certain decisions. A typical data mining task is classification, which involves predicting a target variable known as the class in previously unseen data based on models learnt from an input dataset. Covering is a well-known classification approach that derives models with If-Then rules. Covering methods, such as PRISM, have a predictive performance competitive with other classical classification techniques such as greedy, decision tree and associative classification. Covering models are therefore appropriate decision-making tools, and users favour them when carrying out decisions. Despite the use of the Covering approach in data processing for different classification applications, it is also acknowledged that this approach suffers from the noticeable drawback of inducing massive numbers of rules, making the resulting model large and unmanageable by users. This issue is attributed to the way Covering techniques induce the rules: they keep adding items to the rule’s body, despite the limited data coverage (the number of training instances that the rule classifies), until the rule reaches zero error. This excessive learning overfits the training dataset and also limits the applicability of Covering models in decision making, because managers normally prefer a summarised set of knowledge that they are able to control and comprehend rather than a high-maintenance model. In practice, there should be a trade-off between the number of rules offered by a classification model and its predictive performance. Another issue associated with Covering models is the overlapping of training data among the rules, which happens when a rule’s classified data are discarded during the rule discovery phase. Unfortunately, the impact of a rule’s removed data on other potential rules is not considered by this approach. When removing the training data linked with a rule, both the frequency and the rank of other rules’ items that appeared in the removed data should be updated: the impacted rules should maintain their true rank and frequency dynamically during the rule discovery phase, rather than just keeping the frequency initially computed from the original input dataset. In response to the aforementioned issues, a new dynamic learning technique based on Covering and rule induction, which we call Enhanced Dynamic Rule Induction (eDRI), is developed. eDRI has been implemented in Java and embedded in the WEKA machine learning tool. The developed algorithm incrementally discovers the rules, primarily using frequency and rule strength thresholds. In practice, these thresholds limit the search space for both items and potential rules by discarding any with insufficient data representation as early as possible, resulting in an efficient training phase. More importantly, eDRI substantially cuts down the number of training-example scans by continuously updating potential rules’ frequency and strength parameters in a dynamic manner whenever a rule is inserted into the classifier. In particular, for each derived rule, eDRI adjusts on the fly the frequencies and ranks of the remaining potential rules’ items, specifically for those that appeared within the deleted training instances of the derived rule. This gives a more realistic model with minimal rule redundancy, and makes the process of rule induction dynamic and efficient rather than static.
Moreover, the proposed technique minimises the classifier’s number of rules at preliminary stages by stopping learning when a rule does not meet the rule strength threshold, thereby minimising overfitting and ensuring a manageable classifier. Lastly, the eDRI prediction procedure not only prioritises using the best-ranked rule for class forecasting of test data but also restricts the use of the default class rule, thus reducing the number of misclassifications. The aforementioned improvements guarantee classification models of smaller size that do not overfit the training dataset, while maintaining their predictive performance. The eDRI-derived models particularly benefit users taking key business decisions, since they provide a rich knowledge base to support decision making: their predictive accuracies are high, and they are easy to understand, controllable and robust, i.e. flexible enough to be amended without drastic change. eDRI’s applicability has been evaluated on the hard problem of phishing detection. Phishing normally involves creating a well-designed fake website that closely resembles an existing, trusted business website, aiming to trick users and illegally obtain their credentials, such as login information, in order to access their financial assets. The experimental results against large phishing datasets revealed that eDRI is highly useful as an anti-phishing tool, since it derived models of manageable size compared with other traditional techniques without hindering classification performance. Further evaluation results using several other classification datasets from different domains, obtained from the University of California Data Repository, have corroborated eDRI’s competitive performance with respect to accuracy, size of the knowledge representation, training time and item space reduction. This makes the proposed technique not only efficient in inducing rules but also effective.
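
    A simplified sketch of the dynamic covering loop described for eDRI: rules are induced one at a time, covered instances are removed, and the remaining item frequencies and strengths are recomputed before the next rule is chosen. The single-item rules, the thresholds and the toy data are assumptions; this is not the Java/WEKA implementation mentioned in the abstract.

```python
# Dynamic covering sketch: induce a rule, remove the instances it covers,
# and recompute candidate frequencies/strengths on the reduced data.
from collections import Counter

def best_single_item_rule(instances, min_freq=2, min_strength=0.6):
    """Pick the (item -> class) rule with the highest strength that meets both thresholds."""
    item_counts, pair_counts = Counter(), Counter()
    for items, label in instances:
        for item in items:
            item_counts[item] += 1
            pair_counts[(item, label)] += 1
    best = None
    for (item, label), cnt in pair_counts.items():
        strength = cnt / item_counts[item]
        if cnt >= min_freq and strength >= min_strength:
            if best is None or strength > best[2]:
                best = (item, label, strength)
    return best

def edri_like_induction(instances):
    rules, remaining = [], list(instances)
    while remaining:
        rule = best_single_item_rule(remaining)
        if rule is None:        # stop early: no candidate meets the thresholds
            break
        item, label, _ = rule
        rules.append((item, label))
        # Remove covered instances; counts are recomputed from the reduced data on
        # the next iteration, which mimics the "dynamic" update described above.
        remaining = [(i, l) for i, l in remaining if item not in i]
    return rules

data = [({"long_url", "ip"}, "phish"), ({"ip", "at"}, "phish"),
        ({"https", "padlock"}, "legit"), ({"https", "age"}, "legit"), ({"long_url"}, "phish")]
print(edri_like_induction(data))
```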

    Natural language processing based advanced method of unnecessary video detection

    In this study, we describe the process of identifying unnecessary video using an advanced combined method of natural language processing and machine learning. The system also includes a framework containing analytics databases, which helps to measure statistical accuracy and can detect, accept or reject unnecessary and unethical video content. In our video detection system, we extract text data from video content in two steps: first from video to MPEG-1 Audio Layer 3 (MP3), and then from MP3 to WAV format. We use the text-processing part of natural language processing to analyse and prepare the data set. We use both the Naive Bayes and logistic regression classification algorithms in this detection system to determine the best accuracy for our system. In our research, the MP4 video data is converted to plain text data using Python library functions. This brief study discusses the identification of unauthorized, unsocial, unnecessary, unfinished and malicious videos from spoken video recordings. By analysing our data sets through this advanced model, we can decide which videos should be accepted or rejected for further action.
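
    A minimal sketch of the classification stage described above, assuming the MP4 → MP3 → WAV → text conversion has already been done with external tools: Naive Bayes and logistic regression are compared on the extracted transcripts. The transcripts and labels below are invented placeholders.

```python
# Compare Naive Bayes and logistic regression on video transcripts that are
# assumed to have been produced by an earlier speech-to-text step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

transcripts = [
    "educational lecture about mathematics and science",
    "tutorial explaining how to cook a healthy meal",
    "abusive speech with threats and harassment",
    "misleading clip promoting an illegal scam",
] * 10                                   # repeated so cross-validation has enough samples
labels = ["accept", "accept", "reject", "reject"] * 10

X = TfidfVectorizer().fit_transform(transcripts)
for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    print(name, cross_val_score(clf, X, labels, cv=5).mean())
```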

    Urinary Tract Infection Analysis using Machine Learning based Classification and ANN: A Study of Prediction

    Urinary tract infection is the most frequently diagnosed infection among humans. A urinary tract infection (UTI) affects parts of the urinary system, which includes the ureters, bladder, kidneys and urethra. The primarily infected area is the lower tract, i.e. the bladder and urethra. A bladder infection is painful and uncomfortable, but if it spreads to the kidneys it can have severe consequences. Women are more susceptible to urinary infection than men due to their physiology. This paper aims to study and assess the impact and causes of urinary tract infection in human beings and to evaluate machine learning approaches for urinary disease forecasting. The paper also proposes a machine learning based methodology for predicting urinary infection, estimating the outcomes of the designed procedures over real-time data and validating them. The paper focuses on obtaining high UTI prediction accuracy, assessed via the confusion matrix, using machine-learning-based classification and an ANN technique. Specific parameters have been selected with the help of the analysis of variance (ANOVA) technique. The Naive Bayes classifier, the J48 decision tree algorithm and an artificial neural network have been used to predict the presence of urinary infection. The accuracy achieved by the proposed model is approximately 95.5%.
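
    A hedged sketch of the prediction pipeline outlined above: ANOVA-based feature selection followed by Naive Bayes, a decision tree (used here as a stand-in for J48/C4.5) and an ANN, evaluated with accuracy and a confusion matrix. The data is synthetic because the clinical dataset is not available.

```python
# ANOVA feature selection + three classifiers, evaluated with a confusion matrix.
# Synthetic data stands in for the paper's real-time clinical dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = make_classification(n_samples=400, n_features=12, n_informative=6, random_state=42)
X = SelectKBest(f_classif, k=6).fit_transform(X, y)          # ANOVA F-test feature selection
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

models = {"Naive Bayes": GaussianNB(),
          "Decision tree": DecisionTreeClassifier(random_state=42),
          "ANN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=42)}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    print(name, accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
```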

    Feature Selection for Multi-label Document Based on Wrapper Approach through Class Association Rules

    Each document in a multi-label classification task is associated with a subset of labels. These documents usually include a large number of features, which can hamper the performance of learning algorithms; feature selection therefore helps to isolate the redundant and irrelevant features that hold performance back. The current study proposes a Naive Bayesian (NB) multi-label classification algorithm that incorporates a wrapper approach for feature selection, aiming to determine the best minimum confidence threshold. This paper also suggests transforming the multi-label documents prior to applying the standard feature selection algorithm: each document is copied into every label it belongs to, with all of its assigned features retained for each label. The study then evaluated seven minimum confidence thresholds, with Class Association Rules (CARs) serving as the wrapper approach for this evaluation. The experiments carried out with benchmark datasets revealed that the Naïve Bayes Multi-label (NBML) classifier scored an average precision of 87.9% on the business dataset when using a 0.1% minimum confidence threshold.
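
    A small sketch of the document transformation described above: each multi-label document is copied once per label it belongs to, keeping all of its features, so that a standard single-label feature selection or classification step can then be applied. The toy documents and labels are assumptions.

```python
# Multi-label to single-label transformation: copy each document once per label,
# keeping all of its features. Documents and labels are illustrative placeholders.
def transform_multilabel(docs):
    """docs: list of (feature_set, label_list) -> list of (feature_set, single_label)."""
    single_label = []
    for features, labels in docs:
        for label in labels:
            single_label.append((set(features), label))
    return single_label

docs = [({"earn", "dividend", "profit"}, ["earnings", "acquisitions"]),
        ({"wheat", "export"}, ["grain"])]
for features, label in transform_multilabel(docs):
    print(label, sorted(features))
```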