2,352 research outputs found
An Interval-based Multiobjective Approach to Feature Subset Selection Using Joint Modeling of Objectives and Variables
This paper studies feature subset selection in classification using a multiobjective estimation of distribution algorithm. We consider six functions, namely area under ROC curve, sensitivity, specificity, precision, F1 measure and Brier score, for evaluation of feature subsets and as the objectives of the problem. One of the characteristics of these objective functions is the existence of noise in their values that should be appropriately handled during optimization. Our proposed
algorithm consists of two major techniques which are specially designed for the feature subset selection problem. The first one is a solution ranking method based on interval values to handle the noise in the objectives of this problem. The second one is a model estimation method for learning a joint probabilistic model of objectives and variables which is used to generate new solutions and advance through the search space. To simplify model estimation, l1 regularized regression is used to select a
subset of problem variables before model learning. The proposed algorithm is compared with a well-known ranking method for interval-valued objectives and a standard multiobjective genetic algorithm. Particularly, the effects of the two new techniques are experimentally investigated. The experimental results show that the proposed algorithm is able to obtain comparable or better
performance on the tested datasets
Multi-dimensional clustering in user profiling
User profiling has attracted an enormous number of technological methods and
applications. With the increasing amount of products and services, user profiling
has created opportunities to catch the attention of the user as well as achieving
high user satisfaction. To provide the user what she/he wants, when and how,
depends largely on understanding them. The user profile is the representation of
the user and holds the information about the user. These profiles are the
outcome of the user profiling.
Personalization is the adaptation of the services to meet the user’s needs and
expectations. Therefore, the knowledge about the user leads to a personalized
user experience. In user profiling applications the major challenge is to build and
handle user profiles. In the literature there are two main user profiling methods,
collaborative and the content-based. Apart from these traditional profiling
methods, a number of classification and clustering algorithms have been used
to classify user related information to create user profiles. However, the profiling,
achieved through these works, is lacking in terms of accuracy. This is because,
all information within the profile has the same influence during the profiling even
though some are irrelevant user information.
In this thesis, a primary aim is to provide an insight into the concept of user
profiling. For this purpose a comprehensive background study of the literature
was conducted and summarized in this thesis. Furthermore, existing user
profiling methods as well as the classification and clustering algorithms were investigated. Being one of the objectives of this study, the use of these
algorithms for user profiling was examined. A number of classification and
clustering algorithms, such as Bayesian Networks (BN) and Decision Trees
(DTs) have been simulated using user profiles and their classification accuracy
performances were evaluated. Additionally, a novel clustering algorithm for the
user profiling, namely Multi-Dimensional Clustering (MDC), has been proposed.
The MDC is a modified version of the Instance Based Learner (IBL) algorithm.
In IBL every feature has an equal effect on the classification regardless of their
relevance. MDC differs from the IBL by assigning weights to feature values to
distinguish the effect of the features on clustering. Existing feature weighing
methods, for instance Cross Category Feature (CCF), has also been
investigated. In this thesis, three feature value weighting methods have been
proposed for the MDC. These methods are; MDC weight method by Cross
Clustering (MDC-CC), MDC weight method by Balanced Clustering (MDC-BC)
and MDC weight method by changing the Lower-limit to Zero (MDC-LZ). All of
these weighted MDC algorithms have been tested and evaluated. Additional
simulations were carried out with existing weighted and non-weighted IBL
algorithms (i.e. K-Star and Locally Weighted Learning (LWL)) in order to
demonstrate the performance of the proposed methods. Furthermore, a real life scenario is implemented to show how the MDC can be used for the user
profiling to improve personalized service provisioning in mobile environments.
The experiments presented in this thesis were conducted by using user profile
datasets that reflect the user’s personal information, preferences and interests.
The simulations with existing classification and clustering algorithms (e.g. Bayesian Networks (BN), NaĂŻve Bayesian (NB), Lazy learning of Bayesian
Rules (LBR), Iterative Dichotomister 3 (Id3)) were performed on the WEKA
(version 3.5.7) machine learning platform. WEKA serves as a workbench to
work with a collection of popular learning schemes implemented in JAVA. In
addition, the MDC-CC, MDC-BC and MDC-LZ have been implemented on
NetBeans IDE 6.1 Beta as a JAVA application and MATLAB. Finally, the real life
scenario is implemented as a Java Mobile Application (Java ME) on NetBeans
IDE 7.1. All simulation results were evaluated based on the error rate and
accuracy
Performance Analysis of a new Filter and Wrapper Sequence for the Survivability Prediction of Breast Cancer Patients
Feature selection is an essential preprocessing step for removing redundant or irrelevant features from multidimensional data to improve predictive performance. Currently, medical clinical datasets are increasingly large and multidimensional and not every feature helps in the necessary predictions. So, feature selection techniques are used to determine relevant feature set that can improve the performance of a learning algorithm. This study presents a performance analysis of a new filter and wrapper sequence involving the intersection of filter methods, Mutual Information and Chi-Square followed by one of the wrapper methods: Sequential Forward Selection and Sequential Backward Selection to obtain a more informative feature set for improved prediction of the survivability of breast cancer patients from the clinical breast cancer dataset, SEER. The improvement in performance due to this filter and wrapper sequence in terms of Accuracy, False Positive Rate, False Negative Rate and Area under the Receiver Operating Characteristics curve is tested using the Machine learning algorithms: Logistic Regression, K-Nearest Neighbour, Decision Tree, Random Forest, Support Vector Machine and Multilayer Perceptron. The performance analysis supports the Sequential Backward Selection of the new filter and wrapper sequence over Sequential Forward Selection for the SEER dataset
Longitudinal clustering analysis and prediction of Parkinson\u27s disease progression using radiomics and hybrid machine learning
Background: We employed machine learning approaches to (I) determine distinct progression trajectories in Parkinson\u27s disease (PD) (unsupervised clustering task), and (II) predict progression trajectories (supervised prediction task), from early (years 0 and 1) data, making use of clinical and imaging features.
Methods: We studied PD-subjects derived from longitudinal datasets (years 0, 1, 2 & 4; Parkinson\u27s Progressive Marker Initiative). We extracted and analyzed 981 features, including motor, non-motor, and radiomics features extracted for each region-of-interest (ROIs: left/right caudate and putamen) using our standardized standardized environment for radiomics analysis (SERA) radiomics software. Segmentation of ROIs on dopamine transposer - single photon emission computed tomography (DAT SPECT) images were performed via magnetic resonance images (MRI). After performing cross-sectional clustering on 885 subjects (original dataset) to identify disease subtypes, we identified optimal longitudinal trajectories using hybrid machine learning systems (HMLS), including principal component analysis (PCA) + K-Means algorithms (KMA) followed by Bayesian information criterion (BIC), Calinski-Harabatz criterion (CHC), and elbow criterion (EC). Subsequently, prediction of the identified trajectories from early year data was performed using multiple HMLSs including 16 Dimension Reduction Algorithms (DRA) and 10 classification algorithms.
Results: We identified 3 distinct progression trajectories. Hotelling\u27s t squared test (HTST) showed that the identified trajectories were distinct. The trajectories included those with (I, II) disease escalation (2 trajectories, 27% and 38% of patients) and (III) stable disease (1 trajectory, 35% of patients). For trajectory prediction from early year data, HMLSs including the stochastic neighbor embedding algorithm (SNEA, as a DRA) as well as locally linear embedding algorithm (LLEA, as a DRA), linked with the new probabilistic neural network classifier (NPNNC, as a classifier), resulted in accuracies of 78.4% and 79.2% respectively, while other HMLSs such as SNEA + Lib_SVM (library for support vector machines) and t_SNE (t-distributed stochastic neighbor embedding) + NPNNC resulted in 76.5% and 76.1% respectively.
Conclusions: This study moves beyond cross-sectional PD subtyping to clustering of longitudinal disease trajectories. We conclude that combining medical information with SPECT-based radiomics features, and optimal utilization of HMLSs, can identify distinct disease trajectories in PD patients, and enable effective prediction of disease trajectories from early year data
- …