Infinite Latent Feature Selection: A Probabilistic Latent Graph-Based Ranking Approach
Feature selection plays an increasingly significant role in many computer vision
applications, spanning from object recognition to visual object tracking.
However, most recent feature selection solutions are not robust across
different and heterogeneous sets of data. In this paper, we address this issue
by proposing a robust probabilistic latent graph-based feature selection
algorithm that performs the ranking step while considering all possible subsets
of features, as paths on a graph, bypassing the combinatorial problem
analytically. An appealing characteristic of the approach is that it aims to
discover an abstraction behind low-level sensory data, namely relevancy.
Relevancy is modelled as a latent variable in a PLSA-inspired generative
process that allows investigating the importance of a feature when injected
into an arbitrary set of cues. The proposed method has been tested on ten
diverse benchmarks and compared against eleven state-of-the-art feature
selection methods. Results show that the proposed approach attains the highest
performance levels across many different scenarios and difficulties, confirming
its strong robustness and setting a new state of the art in the feature
selection domain.
Comment: Accepted at the IEEE International Conference on Computer Vision
(ICCV), 2017, Venice. Preprint copy
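The "all subsets of features as paths on a graph" idea can be sketched numerically: summing the geometric series of a weighted feature-adjacency matrix aggregates the contributions of paths of every length in closed form. The sketch below is illustrative only; the adjacency weights, the damping factor `gamma`, and the random data are assumptions, and the published method additionally learns the graph weights through its PLSA-inspired latent model.

```python
import numpy as np

def infinite_paths_ranking(A, gamma=0.9):
    """Rank features by aggregating paths of all lengths on a weighted
    feature graph A (n_features x n_features, nonnegative weights)."""
    n = A.shape[0]
    rho = max(abs(np.linalg.eigvals(A)))      # spectral radius
    A = (0.99 * gamma / rho) * A              # rescale so the series converges
    # sum over l >= 1 of A^l equals (I - A)^{-1} - I
    S = np.linalg.inv(np.eye(n) - A) - np.eye(n)
    scores = S.sum(axis=1)                    # per-feature "energy" score
    return np.argsort(scores)[::-1]           # most relevant features first

rng = np.random.default_rng(0)
A = rng.random((5, 5))                        # hypothetical pairwise weights
ranking = infinite_paths_ranking(A)
```

The matrix inverse replaces an explicit (and infeasible) enumeration of all feature subsets, which is the analytical bypass of the combinatorial problem that the abstract refers to.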
Data mining of many-attribute data : investigating the interaction between feature selection strategy and statistical features of datasets
In many datasets, there is a very large number of attributes (e.g. many thousands).
Such datasets can cause many problems for machine learning methods. Various
feature selection (FS) strategies have been developed to address these problems. The
idea of an FS strategy is to reduce the number of features in a dataset (e.g. from many
thousands to a few hundred) so that machine learning and/or statistical analysis can be
done much more quickly and effectively. Obviously, FS strategies attempt to select
the features that are most important, considering the machine learning task to be done.
The work presented in this dissertation concerns the comparison between several
popular feature selection strategies, and, in particular, investigation of the interaction
between feature selection strategy and simple statistical features of the dataset. The
basic hypothesis, not investigated before, is that the correct choice of FS strategy for a
particular dataset should be based on an (at least simple) statistical analysis of that
dataset.
First, we examined the performance of several strategies on a selection of datasets.
Strategies examined were: four widely used FS strategies (Correlation, ReliefF,
Evolutionary Algorithm, and no feature selection), several feature bias (FB) strategies (in
which the machine learning method considers all features, but makes use of bias
values suggested by the FB strategy), and combinations of FS and FB strategies.
The results showed that FB methods displayed strong capability on some datasets
and that combined strategies were also often successful.
Examining these results, we noted that patterns of performance were not immediately
understandable. This led to the above hypothesis (one of the main contributions of
the thesis) that statistical features of the dataset are an important consideration when
choosing an FS strategy. We then investigated this hypothesis with several further
experiments. Analysis of the results revealed that a simple statistical feature of a
dataset, one that can easily be pre-calculated, has a clear relationship with the
performance of certain FS methods, and a similar relationship with differences in
performance between certain pairs of FS strategies. [Silang Luo, PhD thesis, 2009]
In particular, correlation-based feature selection (CFS) is a very widely used FS
technique built on the hypothesis that good feature sets contain features that are
highly correlated with the class, yet uncorrelated with each other. By analysing the
outcome of several FS strategies on different artificial datasets, the experiments
suggest that CFS is never the best choice for poorly correlated data.
Finally, considering several methods, we suggest tentative guidelines for choosing an
FS strategy based on simply calculated measures of the dataset.
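The CFS hypothesis described above, which rewards correlation with the class while penalizing correlation among the chosen features, can be made concrete with Hall's merit heuristic, merit = k·r_cf / sqrt(k + k(k−1)·r_ff). The sketch below uses toy data and is an illustration of the heuristic only, not of the thesis experiments:

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset: mean feature-class correlation r_cf
    rewarded, mean feature-feature correlation r_ff penalized."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# toy data: feature 0 tracks the class, feature 1 is its near-duplicate,
# feature 2 is pure noise
rng = np.random.default_rng(1)
y = rng.integers(0, 2, 200).astype(float)
X = np.column_stack([y + 0.1 * rng.standard_normal(200),
                     y + 0.1 * rng.standard_normal(200),
                     rng.standard_normal(200)])

m_single = cfs_merit(X, y, [0])
m_with_noise = cfs_merit(X, y, [0, 2])  # the noise feature drags merit down
```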
Exploring EEG Features in Cross-Subject Emotion Recognition
Recognizing cross-subject emotions from brain imaging data such as EEG has always been difficult because features generalize poorly across subjects. Systematically exploring the ability of different EEG features to identify emotional information across subjects is therefore crucial. Prior related work has explored this question based on only one or two kinds of features, and different findings and conclusions have been presented. In this work, we aim at a more comprehensive investigation of this question with a wider range of feature types, including 18 kinds of linear and non-linear EEG features. The effectiveness of these features was examined on two publicly accessible datasets, namely, the dataset for emotion analysis using physiological signals (DEAP) and the SJTU emotion EEG dataset (SEED). We adopted the support vector machine (SVM) approach and the "leave-one-subject-out" verification strategy to evaluate recognition performance. Using automatic feature selection methods, the highest mean recognition accuracies of 59.06% (AUC = 0.605) on the DEAP dataset and 83.33% (AUC = 0.904) on the SEED dataset were reached. Furthermore, using manually operated feature selection on the SEED dataset, we explored the importance of different EEG features in cross-subject emotion recognition from multiple perspectives, including different channels, brain regions, rhythms, and feature types. For example, we found that the Hjorth parameter of mobility in the beta rhythm achieved the best mean recognition accuracy compared to the other features. Through a pilot correlation analysis, we further examined the highly correlated features, for a better understanding of the implications hidden in the features that allow for differentiating cross-subject emotions. Various remarkable observations were made. The results of this paper validate the possibility of finding robust EEG features for cross-subject emotion recognition.
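The "leave-one-subject-out" protocol with an SVM can be reproduced in a few lines with scikit-learn's `LeaveOneGroupOut`. The data below are random stand-ins; the subject count, feature count, and SVM settings are assumptions for illustration, not the paper's configuration:

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# synthetic stand-in: 4 subjects x 30 trials, 18 features, binary emotion label
rng = np.random.default_rng(0)
X = rng.standard_normal((120, 18))
y = rng.integers(0, 2, 120)
subjects = np.repeat(np.arange(4), 30)

# each fold trains on 3 subjects and tests on the held-out one
accs = []
for train, test in LeaveOneGroupOut().split(X, y, groups=subjects):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    clf.fit(X[train], y[train])
    accs.append(clf.score(X[test], y[test]))
mean_acc = float(np.mean(accs))
```

Because every test trial comes from a subject the model never saw, the mean fold accuracy directly measures cross-subject generalization rather than within-subject fit.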
Pattern Classification Using an Olfactory Model with PCA Feature Selection in Electronic Noses: Study and Application
Biologically inspired models and algorithms are considered promising sensor array signal processing methods for electronic noses. Feature selection is one of the most important issues in developing robust pattern recognition models in machine learning. This paper describes an investigation into the classification performance of a bionic olfactory model as the dimensions of the input feature vector increase (outer factor) and as its parallel channels increase (inner factor). The principal component analysis technique was applied for feature selection and dimension reduction. Two data sets, one of three classes of wine derived from different cultivars and one of five classes of green tea derived from five different provinces of China, were used in the experiments. In the former case, the results showed that the average correct classification rate increased as more principal components were put into the feature vector. In the latter case, the results showed that sufficient parallel channels should be reserved in the model to avoid pattern-space crowding. We conclude that 6∼8 channels of the model, with principal component feature vector values covering at least 90% cumulative variance, are adequate for a classification task of 3∼5 pattern classes, considering the trade-off between time consumption and classification rate.
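The "at least 90% cumulative variance" criterion for the principal component feature vector maps directly onto scikit-learn's `PCA`, which accepts a variance fraction as `n_components` and keeps the smallest number of components meeting it. The sensor data below are synthetic stand-ins, not the wine or tea measurements:

```python
import numpy as np
from sklearn.decomposition import PCA

# synthetic stand-in for sensor-array responses: 60 samples x 16 sensors,
# driven by 3 underlying odor factors plus a little noise
rng = np.random.default_rng(0)
latent = rng.standard_normal((60, 3))
X = latent @ rng.standard_normal((3, 16)) + 0.05 * rng.standard_normal((60, 16))

# keep the fewest principal components whose cumulative explained
# variance reaches 90%
pca = PCA(n_components=0.90)
X_reduced = pca.fit_transform(X)
kept = X_reduced.shape[1]
covered = float(pca.explained_variance_ratio_.sum())
```

`X_reduced` is then the compact feature vector that would be fed into the olfactory model's parallel channels.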
Feature selection in detection of adverse drug reactions from the Health Improvement Network (THIN) database
Adverse drug reactions (ADRs) are a widely recognized public health issue and
one of the most common reasons for withdrawing drugs from the market.
Prescription event monitoring (PEM) is an important approach to detecting
adverse drug reactions. The main challenge with this method is how to
automatically extract the medical events or side effects from high-throughput
medical event records, which are collected from day-to-day clinical practice.
In this study we propose a novel concept of a feature matrix to detect ADRs.
The feature matrix, extracted from the big medical data in The Health
Improvement Network (THIN) database, is created to characterize the medical
events for patients who take drugs, and it builds the foundation for handling
this irregular and big medical data. Feature selection methods are then
performed on the feature matrix to detect the significant features, and
finally the ADRs are located based on those significant features. Experiments
were carried out on three drugs: atorvastatin, alendronate, and metoclopramide.
Major side effects for each drug were detected, and better performance was
achieved compared to other computerized methods. Since the detected ADRs are
based on computerized methods, further investigation is needed.
Comment: International Journal of Information Technology and Computer Science
(IJITCS), in print, 201
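One way to read "feature selection on the feature matrix" is a per-event significance test between an exposed cohort and a control cohort. The sketch below is a generic illustration with fabricated event frequencies; the event columns, cohort sizes, and the chi-square test with Bonferroni correction are assumptions, not the paper's actual pipeline:

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical feature matrix: rows = patients, columns = medical events
# (True = event recorded after prescription); a control cohort is encoded
# the same way. Event 0 is simulated as genuinely elevated in the exposed group.
rng = np.random.default_rng(0)
n_events = 6
exposed = rng.random((500, n_events)) < [0.30, 0.05, 0.05, 0.04, 0.06, 0.05]
control = rng.random((500, n_events)) < [0.05, 0.05, 0.05, 0.04, 0.06, 0.05]

significant = []
for j in range(n_events):
    table = [[exposed[:, j].sum(), (~exposed[:, j]).sum()],
             [control[:, j].sum(), (~control[:, j]).sum()]]
    _, p, _, _ = chi2_contingency(table)
    if p < 0.05 / n_events:  # Bonferroni-corrected threshold
        significant.append(j)
```

Columns surviving the corrected threshold are the "significant features" from which candidate ADRs would be located for further clinical review.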
A review on feature extraction and feature selection for handwritten character recognition
The development of handwritten character recognition (HCR) is an interesting area in pattern recognition. An HCR system consists of a number of stages: preprocessing, feature extraction, classification, and finally the actual recognition. It is generally agreed that one of the main factors influencing performance in HCR is the selection of an appropriate set of features for representing the input samples. This paper provides a review of these advances. In HCR, the choice of the feature set is a central issue, as the procedure must select the relevant features that yield minimum classification error. To address this issue and maximize classification performance, many techniques have been proposed for reducing the dimensionality of the feature space in which data have to be processed. These techniques, generally denoted as feature reduction, may be divided into two main categories, called feature extraction and feature selection. A large number of research papers and reports have already been published on this topic. In this paper we provide an overview of some of the methods and approaches to feature extraction and selection, investigating and analysing these approaches in order to identify the current trend. A review of the metaheuristic harmony search algorithm (HSA) is also provided.
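The split the review draws between the two feature-reduction families can be illustrated concisely on scikit-learn's built-in 8x8 handwritten digits, a small HCR-style dataset; the component and feature counts here are arbitrary choices for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_digits(return_X_y=True)   # 1797 handwritten digits, 64 pixel features

# feature extraction: build 10 new features as linear combinations of all 64
X_extracted = PCA(n_components=10).fit_transform(X)

# feature selection: keep 10 of the original 64 pixels, ranked by chi-square
# relevance to the digit label
X_selected = SelectKBest(chi2, k=10).fit_transform(X, y)
```

Both routes end at 10 dimensions, but extraction transforms the pixels into new axes while selection discards pixels and leaves the survivors interpretable as image positions.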
Radiomics in the characterization of lipid-poor adrenal adenomas at unenhanced CT: time to look beyond usual density metrics
Objectives: In this study, we developed a radiomic signature for the classification of benign lipid-poor adenomas, which may help clinicians limit the number of unnecessary investigations in clinical practice. Indeterminate adrenal lesions of benign and malignant nature may exhibit different values of key radiomics features. Methods: Patients who had available histopathology reports and a non-contrast-enhanced CT scan were included in the study. Radiomics feature extraction was performed after the adrenal lesions were contoured. The primary feature selection and prediction performance scores were calculated using the least absolute shrinkage and selection operator (LASSO). To eliminate redundancy, the best-performing features were further examined using the Pearson correlation coefficient, and new predictive models were created. Results: This investigation covered 50 lesions in 48 patients. After LASSO-based radiomics feature selection, 30 iterations of logistic regression models produced an average AUC of 0.72 on the test dataset. The best-performing model, made up of 13 radiomics features, had an AUC of 0.99 in the training phase and 1.00 in the test phase. The number of features was lowered to 5 after applying Pearson correlation analysis to prevent overfitting. The final radiomic signature was used to train a number of machine learning classifiers, which reached an average AUC of 0.93. Conclusions: Including more radiomics features in the identification of adenomas may improve the accuracy of NECT and reduce the need for additional imaging procedures and clinical workup, according to this and other recent radiomics studies that have clear points of contact with current clinical practice. Clinical relevance statement: The study developed a radiomic signature from unenhanced CT scans for classifying lipid-poor adenomas, potentially reducing unnecessary investigations, with a final accuracy of 93%.
Key Points: • Radiomics has potential for differentiating lipid-poor adenomas and avoiding unnecessary further investigations. • Quadratic mean, strength, maximum 3D diameter, volume density, and area density are promising predictors for adenomas. • Radiomics models reach high performance, with an average AUC of 0.95 in the training phase and 0.72 in the test phase.
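The two-stage selection the abstract describes, LASSO-penalized logistic regression followed by Pearson-correlation pruning of redundant survivors, can be sketched as below. The feature matrix, penalty strength, and the 0.8 correlation cutoff are assumptions for illustration, not the study's settings:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# hypothetical stand-in for a radiomics matrix: 50 lesions x 100 features,
# with only the first 5 features actually informative for the label
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 100))
w = np.zeros(100)
w[:5] = [3, -3, 2, -2, 2]
y = (X @ w + 0.5 * rng.standard_normal(50) > 0).astype(int)

# stage 1: L1-penalized logistic regression zeroes out most coefficients
Xs = StandardScaler().fit_transform(X)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.3).fit(Xs, y)
selected = np.flatnonzero(lasso.coef_[0])

# stage 2: drop any survivor whose |Pearson r| with an already-kept
# feature exceeds 0.8, to remove redundancy
kept = []
for j in selected:
    if all(abs(np.corrcoef(Xs[:, j], Xs[:, k])[0, 1]) < 0.8 for k in kept):
        kept.append(j)
```

The pruned list `kept` plays the role of the final compact signature that downstream classifiers are trained on.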