682 research outputs found
A review of associative classification mining
Associative classification mining is a promising approach in data mining that utilizes the
association rule discovery techniques to construct classification systems, also known as
associative classifiers. In the last few years, a number of associative classification algorithms
have been proposed, e.g. CPAR, CMAR, MCAR, MMAC and others. These algorithms
employ several different rule discovery, rule ranking, rule pruning, rule prediction and rule
evaluation methods. This paper focuses on surveying and comparing the state-of-the-art associative
classification techniques with regard to the above criteria. Finally, future directions in associative
classification, such as incremental learning and mining low-quality data sets, are also
highlighted in this paper.
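As a rough illustration of the stages named in this abstract (rule discovery, rule ranking and rule prediction), the following is a minimal sketch of a confidence-ranked rule list. It is not an implementation of CPAR, CMAR, MCAR or MMAC; the function names, the toy data and the thresholds are assumptions made purely for illustration.

```python
from collections import Counter
from itertools import combinations


def mine_rules(rows, labels, min_support=0.3, min_confidence=0.6, max_len=2):
    """Discover class association rules (itemset -> class) by brute-force counting."""
    n = len(rows)
    itemset_counts = Counter()   # how often each antecedent occurs
    rule_counts = Counter()      # how often antecedent and class occur together
    for items, label in zip(rows, labels):
        for k in range(1, max_len + 1):
            for subset in combinations(sorted(items), k):
                itemset_counts[subset] += 1
                rule_counts[(subset, label)] += 1

    rules = []
    for (subset, label), count in rule_counts.items():
        support = count / n
        confidence = count / itemset_counts[subset]
        if support >= min_support and confidence >= min_confidence:
            rules.append((subset, label, support, confidence))

    # Rule ranking (assumed order): confidence, then support, then shorter antecedent.
    rules.sort(key=lambda r: (-r[3], -r[2], len(r[0])))
    return rules


def predict(rules, items, default):
    """Rule prediction: return the class of the first ranked rule that matches."""
    for subset, label, _support, _confidence in rules:
        if set(subset) <= set(items):
            return label
    return default


# Hypothetical toy data: each row is a set of attribute=value items.
rows = [
    {"outlook=sunny", "windy=no"},
    {"outlook=sunny", "windy=yes"},
    {"outlook=rain", "windy=yes"},
    {"outlook=rain", "windy=no"},
]
labels = ["play", "play", "stay", "stay"]
rules = mine_rules(rows, labels)
```

Here `predict(rules, {"outlook=sunny", "windy=yes"}, default="stay")` falls through the ranked list until a rule's antecedent is contained in the new instance. Published algorithms differ mainly in how each of these stages is carried out, including the pruning and evaluation steps that this sketch omits.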
Associative classifier coupled with unsupervised feature reduction for dengue fever classification using gene expression data
Recent studies have established the potential of classifiers designed using association rule mining methods. The current study proposes such an associative classifier to efficiently detect dengue fever using gene expression data. Labelled gene expression data has been preprocessed and discretized to mine association rules using well-established rule mining methods. Thereafter, unsupervised clustering methods have been applied to the discretized gene expression data to reduce and select the most promising features. The final feature-reduced discretized gene expression data is subsequently used to mine rules in order to classify subjects into 'Dengue Fever' or 'Healthy'. Two well-known association rule mining methods, viz., Apriori and FP-Growth, have been used here along with various types of well-established clustering methods. Extensive analysis has been reported with performance parameters in terms of accuracy, precision, recall and false positive rate using 5-fold cross-validation. Furthermore, a separate investigation has been conducted to find the most suitable number of features and confidence of association rule mining methods. The experimental results obtained indicate accurate detection of dengue fever patients at an early stage using the proposed associative classification method.
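The four performance parameters named in this abstract can be computed from the counts of a binary confusion matrix. The sketch below assumes 'Dengue Fever' is treated as the positive class; the function name and the toy labels are hypothetical and do not come from the study.

```python
def binary_metrics(y_true, y_pred, positive="Dengue Fever"):
    """Compute accuracy, precision, recall and false positive rate for one fold."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(numerator, denominator):
        # Guard against empty denominators (e.g. no positive predictions).
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "false_positive_rate": ratio(fp, fp + tn),
    }


# Hypothetical predictions for a single fold (D = Dengue Fever, H = Healthy).
D, H = "Dengue Fever", "Healthy"
y_true = [D, D, D, H, H, H, H, H]
y_pred = [D, D, H, D, H, H, H, H]
metrics = binary_metrics(y_true, y_pred)
```

Under 5-fold cross-validation, one common choice is to compute these values on each held-out fold and report their mean; whether the study averages per fold or pools the predictions is not stated in the abstract.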
Integrating Network Analysis and Data Mining Techniques into Effective Framework for Web Mining and Recommendation. A Framework for Web Mining and Recommendation
The main motivation for the study described in this dissertation is to benefit from the development in technology and the huge amount of available data which can be easily captured, stored and maintained electronically. We concentrate on Web usage (i.e., log) mining and Web structure mining. Analysing Web log data reveals valuable feedback on how effective the current structure of a web site is and helps the owner of a web site understand the behaviour of the web site visitors. We developed a framework that integrates statistical analysis, frequent pattern mining, clustering, classification and network construction and analysis. We concentrated on the statistical data related to the visitors and how they surf and pass through the various pages of a given web site to land at some target pages. Further, the frequent pattern mining technique was used to study the relationship between the various pages constituting a given web site. Clustering is used to study the similarity of users and pages. Classification suggests a target class for a given new entity by comparing the characteristics of the new entity to those of the known classes. Network construction and analysis is also employed to identify and investigate the links between the various pages constituting a Web site: a network is constructed based on the frequency of access to the Web pages, such that pages are linked if the frequent pattern mining process identifies them as frequently accessed together. The knowledge discovered by analysing a web site and its related data should be considered valuable for online shoppers and commercial web site owners. Benefitting from the outcome of the study, a recommendation system was developed to suggest pages to visitors based on their profiles as compared to similar profiles of other visitors.
The conducted experiments using popular datasets demonstrate the applicability and effectiveness of the proposed framework for Web mining and recommendation. As a by-product of the proposed method, we demonstrate how it is effective in another domain for feature reduction by concentrating on gene expression data analysis as an application, with some interesting results reported in Chapter 5.
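The network construction step described in this abstract, in which pages are linked when they are frequently accessed together, can be sketched as follows. This is a simplified illustration restricted to page pairs; the function name, the session data and the `min_support` threshold are assumptions and are not taken from the dissertation.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_page_network(sessions, min_support=2):
    """Link two pages when they co-occur in at least `min_support` sessions."""
    pair_counts = Counter()
    for pages in sessions:
        # Count each unordered pair of distinct pages once per session.
        for a, b in combinations(sorted(set(pages)), 2):
            pair_counts[(a, b)] += 1

    graph = defaultdict(dict)
    for (a, b), count in pair_counts.items():
        if count >= min_support:
            # Undirected weighted edge; the weight is the co-access count.
            graph[a][b] = count
            graph[b][a] = count
    return dict(graph)


# Hypothetical sessions extracted from a Web log.
sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
network = build_page_network(sessions, min_support=2)
```

In this toy example `/products` ends up linked to both `/home` and `/cart`, while `/about` stays out of the network because it never reaches the threshold. A full frequent pattern miner would also consider itemsets larger than pairs.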
Evolutionary Decomposition of Complex Design Spaces
This dissertation investigates the support of conceptual engineering design through the
decomposition of multi-dimensional search spaces into regions of high performance. Such
decomposition helps the designer identify optimal design directions by the elimination of
infeasible or undesirable regions within the search space. Moreover, high levels of
interaction between the designer and the model increase overall domain knowledge and
significantly reduce uncertainty relating to the design task at hand.
The aim of the research is to develop the archetypal Cluster Oriented Genetic Algorithm
(COGA) which achieves search space decomposition by using variable mutation
(vmCOGA) to promote diverse search and an Adaptive Filter (AF) to extract solutions of
high performance [Parmee 1996a, 1996b]. Since COGAs are primarily used to decompose
design domains of unknown nature within a real-time environment, the elimination of
a priori knowledge, speed and robustness are paramount. Furthermore, COGA should
promote the in-depth exploration of the entire search space, sampling all optima and the
surrounding areas. Finally, any proposed system should allow for trouble-free integration
within a Graphical User Interface environment.
The replacement of the variable mutation strategy with a number of algorithms which
increase search space sampling is investigated. Utility is then increased by incorporating
a control mechanism that maintains optimal performance by adapting each algorithm
throughout search by means of a feedback measure based upon population convergence.
Robustness is greatly improved by modifying the Adaptive Filter through the introduction
of a process that ensures more accurate modelling of the evolving population.
The performance of each prospective algorithm is assessed upon a suite of two-dimensional
test functions using a set of novel performance metrics. A six-dimensional
test function is also developed where the areas of high performance are explicitly known,
thus allowing for evaluation under conditions of increased dimensionality. Further
complexity is introduced by two real-world models described by both continuous and
discrete parameters. These relate to the design of conceptual airframes and cooling hole
geometries within a gas turbine.
Results are promising and indicate significant improvement over the vmCOGA in terms of
all desired criteria. This further supports the utilisation of COGA as a decision support
tool during the conceptual phase of design.
British Aerospace plc, Warton and Rolls Royce plc, Filto
Machine Learning Techniques for Screening and Diagnosis of Diabetes: a Survey
Diabetes has become one of the major causes of disease and death in most countries. By 2015, diabetes had affected more than 415 million people worldwide. According to the International Diabetes Federation report, this figure is expected to rise to more than 642 million by 2040, so early screening and diagnosis of diabetes patients are of great significance for detecting and treating diabetes in time. Diabetes is a multifactorial metabolic disease, and its diagnostic criteria can hardly cover all of the aetiology, degree of damage, pathogenesis and other factors, so uncertainty and imprecision arise in various aspects of the medical diagnosis process. With the development of data mining, researchers have found that machine learning is playing an increasingly important role in diabetes research. Machine learning techniques can find the risk factors of diabetes and reasonable thresholds for physiological parameters, unearthing hidden knowledge from a huge amount of diabetes-related data, which is of great significance for the diagnosis and treatment of diabetes. This paper therefore provides a survey of machine learning techniques that have been applied to the screening and diagnosis of diabetes. Conventional machine learning techniques for early screening and diagnosis of diabetes are described, as are deep learning techniques that have shown a significant effect in biomedical applications.
Dimensionality reduction methods for microarray cancer data using prior knowledge
Microarray studies are currently a very popular source of biological information. They allow the simultaneous measurement of hundreds of thousands of genes, drastically increasing the amount of data that can be gathered in a small amount of time and also decreasing the cost of producing such results. Large numbers of high dimensional data sets are currently being generated and there is an ongoing need to find ways to analyse them to obtain meaningful interpretations. Many microarray experiments are concerned with answering specific biological or medical questions regarding diseases and treatments. Cancer is one of the most popular research areas and there is a plethora of data available requiring in depth analysis. Although the analysis of microarray data has been thoroughly researched over the past ten years, new approaches still appear regularly, and may lead to a better understanding of the available information. The size of the modern data sets presents considerable difficulties to traditional methodologies based on hypothesis testing, and there is a new move towards the use of machine learning in microarray data analysis.
Two new methods of using prior genetic knowledge in machine learning algorithms have been developed and their results are compared with existing methods. The prior knowledge consists of biological pathway data that can be found in on-line databases, and gene ontology terms. The first method, called "a priori manifold learning", uses the prior knowledge when constructing a manifold for non-linear feature extraction. It was found to perform better than both linear principal components analysis (PCA) and the non-linear Isomap algorithm (without prior knowledge) in both classification accuracy and quality of the clusters. Both pathway and GO terms were used as prior knowledge, and results showed that using GO terms can make the models over-fit the data. In the cases where the use of GO terms does not over-fit, the results are better than PCA, Isomap and a priori manifold learning using pathways.
The second method, called "the feature selection over pathway segmentation algorithm", uses the pathway information to split a big dataset into smaller ones. Then, using AdaBoost, decision trees are constructed for each of the smaller sets and the sets that achieve higher classification accuracy are identified. The individual genes in these subsets are assessed to determine their role in the classification process. Using data sets concerning chronic myeloid leukaemia (CML), two subsets based on pathways were found to be strongly associated with the response to treatment. Using a different data set from measurements on lower grade glioma (LGG) tumours, four informative gene sets were discovered. Further analysis based on the Gini importance measure identified a set of genes for each cancer type (CML, LGG) that could predict the response to treatment very accurately (> 90%). Moreover, a single gene that can predict the response to CML treatment accurately was identified.
Weibull regression with Bayesian variable selection to identify prognostic tumour markers of breast cancer survival.
As data-rich medical datasets are becoming routinely collected, there is a growing demand for regression methodology that facilitates variable selection over a large number of predictors. Bayesian variable selection algorithms offer an attractive solution, whereby a sparsity inducing prior allows inclusion of sets of predictors simultaneously, leading to adjusted effect estimates and inference of which covariates are most important. We present a new implementation of Bayesian variable selection, based on a Reversible Jump MCMC algorithm, for survival analysis under the Weibull regression model. A realistic simulation study is presented comparing against an alternative LASSO-based variable selection strategy in datasets of up to 20,000 covariates. Across half the scenarios, our new method achieved identical sensitivity and specificity to the LASSO strategy, and a marginal improvement otherwise. Runtimes were comparable for both approaches, taking approximately a day for 20,000 covariates. Subsequently, we present a real data application in which 119 protein-based markers are explored for association with breast cancer survival in a case cohort of 2287 patients with oestrogen receptor-positive disease. Evidence was found for three independent prognostic tumour markers of survival, one of which is novel. Our new approach demonstrated the best specificity.
PJN and SR were funded by the Medical Research Council. PJN also acknowledges partial support from the NIHR Cambridge Biomedical Research Centre.
This is the accepted manuscript. The final version is available from SAGE at http://dx.doi.org/10.1177/096228021454874
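For readers unfamiliar with Weibull regression, the sketch below evaluates a right-censored Weibull log-likelihood under one common proportional hazards parameterization, with survival function S(t | x) = exp(-lambda * t^k) and lambda = exp(intercept + x . beta). This parameterization, the function name and the argument names are assumptions; the abstract does not state which form the authors use, and neither the sparsity inducing prior nor the Reversible Jump MCMC sampler is shown here.

```python
import math


def weibull_log_likelihood(times, events, covariates, shape, intercept, coefficients):
    """Right-censored Weibull log-likelihood (assumed PH parameterization).

    times:        positive follow-up times
    events:       1 if the event was observed, 0 if right-censored
    covariates:   one list of covariate values per subject
    shape:        Weibull shape parameter k > 0
    intercept:    baseline log-scale term
    coefficients: regression coefficients beta, one per covariate
    """
    log_lik = 0.0
    for t, d, x in zip(times, events, covariates):
        linear = intercept + sum(b * v for b, v in zip(coefficients, x))
        if d:
            # Log-hazard contribution: log(k) + log(lambda) + (k - 1) * log(t).
            log_lik += math.log(shape) + linear + (shape - 1.0) * math.log(t)
        # Cumulative hazard contribution, applied to all subjects: lambda * t^k.
        log_lik -= math.exp(linear) * t ** shape
    return log_lik
```

In a variable selection setting, a coefficient excluded from the model corresponds to fixing its entry in `coefficients` at zero, so the same function can score any candidate subset of covariates.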