Feature Selection For High-Dimensional Clustering
We present a nonparametric method for selecting informative features in
high-dimensional clustering problems. We start with a screening step that uses
a test for multimodality. Then we apply kernel density estimation and mode
clustering to the selected features. The output of the method consists of a
list of relevant features, and cluster assignments. We provide explicit bounds
on the error rate of the resulting clustering. In addition, we provide the
first error bounds on mode-based clustering.
Comment: 11 pages, 2 figures
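The kernel-density-plus-mode-clustering step described above can be illustrated with a minimal 1-D mean-shift sketch (Gaussian kernel, fixed bandwidth). This is an illustrative toy under assumed data and bandwidth, not the authors' implementation.

```python
import numpy as np

def mean_shift_modes(x, bandwidth=0.5, iters=200):
    """Assign each point to the KDE mode it ascends to (1-D mean shift)."""
    pts = x.astype(float).copy()
    for _ in range(iters):
        # Gaussian-kernel weighted mean of all sample points, for every point
        w = np.exp(-0.5 * ((pts[:, None] - x[None, :]) / bandwidth) ** 2)
        pts = (w * x[None, :]).sum(axis=1) / w.sum(axis=1)
    # points that converged to the same mode receive the same cluster label
    modes = np.round(pts, 3)
    _, labels = np.unique(modes, return_inverse=True)
    return labels

# two well-separated bumps -> two mode clusters
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 0.3, 50), rng.normal(5, 0.3, 50)])
labels = mean_shift_modes(x)
```

Each point is iteratively shifted toward the weighted mean of its neighbours, which is gradient ascent on the kernel density estimate; points ascending to the same local maximum form one cluster.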
Causality-Based Feature Importance Quantifying Methods: PN-FI, PS-FI and PNS-FI
In the current ML field, models are growing larger and more complex, and the data
used for model training are growing in both quantity and dimensionality. Therefore,
to train better models while saving training time and computational resources, a
good Feature Selection (FS) method in the preprocessing stage is necessary. Feature
importance (FI) is central, since it is the basis of feature selection. This paper
therefore introduces the causal quantities PN (the probability of necessity), PS
(the probability of sufficiency), and PNS (the probability of necessity and
sufficiency) into the quantification of feature importance, creating three new FI
measures: PN-FI, which captures how important a feature is in image recognition
tasks; PS-FI, which captures how important a feature is in image generation tasks;
and PNS-FI, which measures both. The main body of the paper is three RCTs, whose
results show how the PS-FI, PN-FI, and PNS-FI of three features (dog nose, dog
eyes, and dog mouth) are calculated. The experiments show, first, that FI values
are intervals with tight upper and lower bounds; second, that the feature dog eyes
is the most important while the other two are nearly equal; and third, that the
bounds on PNS and PN are tighter than the bounds on PS.
Comment: 7 pages
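The interval-valued nature of these causal quantities can be seen from the classical Tian–Pearl bounds on PNS for binary treatment and outcome. The sketch below uses made-up experimental probabilities, not the paper's RCT results.

```python
def pns_bounds(p_y_do_x, p_y_do_notx):
    """Tian-Pearl bounds on PNS (probability of necessity and sufficiency)
    from experimental data for binary X and Y:
        max(0, P(y|do(x)) - P(y|do(x')))  <=  PNS  <=  min(P(y|do(x)), P(y'|do(x')))
    """
    lower = max(0.0, p_y_do_x - p_y_do_notx)
    upper = min(p_y_do_x, 1.0 - p_y_do_notx)
    return lower, upper

# hypothetical numbers: recognition succeeds 90% of the time with the
# feature present, 20% of the time with it removed
lo, hi = pns_bounds(0.9, 0.2)  # roughly (0.7, 0.8)
```

The feature's importance is only identified up to this interval, which is why the paper reports FI values as intervals rather than point estimates.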
Robust variable selection for model-based learning in presence of adulteration
The problem of identifying the most discriminating features when performing
supervised learning has been extensively investigated. In particular, several
methods for variable selection in model-based classification have been
proposed. Surprisingly, the impact of outliers and wrongly labeled units on the
determination of relevant predictors has received far less attention, with
almost no dedicated methodologies available in the literature. In the present
paper, we introduce two robust variable selection approaches: one that embeds a
robust classifier within a greedy-forward selection procedure and the other
based on the theory of maximum likelihood estimation and irrelevance. The
former recasts the feature identification as a model selection problem, while
the latter regards the relevant subset as a model parameter to be estimated.
The benefits of the proposed methods, in contrast with non-robust solutions,
are assessed via an experiment on synthetic data. An application to a
high-dimensional classification problem of contaminated spectroscopic data
concludes the paper.
HYBRID FEATURE SELECTION AND SUPPORT VECTOR MACHINE FRAMEWORK FOR PREDICTING MAINTENANCE FAILURES
The main aim of predictive maintenance is to minimize downtime, failure risks, and maintenance costs in manufacturing systems. Over the past few years, machine learning methods have gained ground, with diverse and successful applications in predictive maintenance. This study shows that applying preprocessing techniques such as oversampling and feature selection before failure prediction is promising. To handle imbalanced data, the SMOTE-Tomek method is used. For feature selection, three different methods are applied: Recursive Feature Elimination, Random Forest, and Variance Threshold. The data considered in this paper, taken from the literature, consists of aircraft engine sensor measurements used to predict engine failure; the prediction algorithm is a Support Vector Machine. The results show that classification accuracy can be significantly boosted by these preprocessing techniques.
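As a rough sketch of the preprocessing stage, the fragment below combines a variance-threshold filter with naive random oversampling of the minority class. It is a simplified stand-in for the Variance Threshold selector and the SMOTE-Tomek resampler named above, with made-up data in place of the engine sensor measurements.

```python
import numpy as np

def variance_threshold(X, threshold=0.01):
    """Drop columns whose variance does not exceed the threshold."""
    keep = X.var(axis=0) > threshold
    return X[:, keep], keep

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows until all classes match the majority
    count (a crude stand-in for SMOTE-Tomek)."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    X_parts, y_parts = [X], [y]
    for c, n in zip(classes, counts):
        if n < n_max:
            idx = rng.choice(np.where(y == c)[0], size=n_max - n)
            X_parts.append(X[idx])
            y_parts.append(y[idx])
    return np.vstack(X_parts), np.concatenate(y_parts)

# toy sensor matrix: two informative columns plus one constant column
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(100, 2)), np.full((100, 1), 5.0)])
y = np.array([0] * 90 + [1] * 10)            # imbalanced failure labels
X_sel, keep = variance_threshold(X)          # constant column removed
X_bal, y_bal = random_oversample(X_sel, y)   # classes balanced
```

In the study's setting the balanced, reduced matrix would then be fed to the SVM classifier; SMOTE-Tomek additionally synthesizes interpolated minority samples and removes Tomek links, which this naive duplication does not do.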
DECISION TREE SIMPLIFICATION THROUGH FEATURE SELECTION APPROACH IN SELECTING FISH FEED SELLERS
Feed is a crucial variable because it can determine the success of fish farming. Breeders can use two types of artificial feed, namely alternative feed and pellets. Many cultivators rely on pellets as the main feed for the fish they are cultivating because pellets contain a composition adjusted to the fish's needs based on type and age. However, cultivators currently face a problem: the high market price of fish pellets. Therefore, a classification analysis of the selection of fish feed sellers is needed, based on several criteria such as the number of feed types, price, ordering, delivery, and availability of discounts. This study conducted a classification analysis with feature simplification for selecting fish feed sellers in Kendal Regency using the Decision Tree C4.5 method, and compared it with a model without feature selection. The best-performing decision tree was C4.5 with feature selection, with an accuracy of 92%, while C4.5 without feature selection reached 86.8%. These results indicate that C4.5 with feature selection outperforms C4.5 without it, and can therefore be applied to the selection of freshwater fish feed sellers in Kendal Regency.
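The C4.5 split criterion behind the reported tree can be sketched in a few lines: gain ratio = information gain / split information. The feature values and labels below are hypothetical toys, not the study's seller criteria.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(feature_values, labels):
    """C4.5 split criterion: information gain divided by split information."""
    n = len(labels)
    gain = entropy(labels)
    split_info = 0.0
    for v in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == v]
        p = len(subset) / n
        gain -= p * entropy(subset)        # subtract conditional entropy
        split_info -= p * math.log2(p)     # penalize many-valued splits
    return gain / split_info if split_info > 0 else 0.0

# toy data: "price" perfectly separates the choice, a second attribute does not
choice = ["yes", "yes", "no", "no"]
gr_price = gain_ratio(["low", "low", "high", "high"], choice)  # 1.0
gr_other = gain_ratio(["a", "b", "a", "b"], choice)            # 0.0
```

C4.5 grows the tree by repeatedly splitting on the attribute with the highest gain ratio, which is also what drives the feature simplification compared in the study.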
Graph Convolutional Network-based Feature Selection for High-dimensional and Low-sample Size Data
Feature selection is a powerful dimension reduction technique which selects a
subset of relevant features for model construction. Numerous feature selection
methods have been proposed, but most of them fail under the high-dimensional
and low-sample size (HDLSS) setting due to the challenge of overfitting. In
this paper, we present a deep learning-based method - GRAph Convolutional
nEtwork feature Selector (GRACES) - to select important features for HDLSS
data. We demonstrate empirical evidence that GRACES outperforms other feature
selection methods on both synthetic and real-world datasets.
Comment: 24 pages, 4 figures, 4 tables
Predicting reaction based on customer's transaction using machine learning approaches
Banking advertisements are important because they help target specific customers for subscribing to packages or other deals, for example by offering current customers more fixed-term deposit offers. This is done through promotional advertisements on the Internet or in the media, a task handled by the marketing department. Many banks and telecommunications firms store their customers' data in order to build a relationship with them, offer them the best deals, and match each offer to the client, with the company's assurance of recovering these deposits. The Portuguese bank increases its sales by establishing relationships with its customers. This study proposes a prediction model, built with machine learning algorithms, of how a customer will react to fixed-term deposit offers, based on their past record. The classification is binary: predicting whether or not a customer will accept these offers. Four classifiers were used: the k-nearest neighbor (k-NN) algorithm, decision tree, naive Bayes, and support vector machines (SVM). The best result was obtained by the decision tree, with an accuracy of 91%, followed by the SVM with an accuracy of 89%.
Using Feature Selection Methods to Discover Common Users’ Preferences for Online Recommender Systems
Recommender systems have taken over users' choice of the items/services they want from online markets, where a great deal of merchandise is traded. Collaborative filtering-based recommender systems use user opinions and preferences. Determining the commonly used attributes that influence preferences, for prediction and subsequent recommendation of unknown or new items, is a significant objective when developing recommender engines. In conventional systems, a study of user behavior would be carried out to learn their likes and dislikes over items. This paper presents feature selection methods to mine such preferences by selecting the most influential attributes of the items. In machine learning, feature selection is normally used as a data pre-processing method; this work extends its use to two objectives: removing redundant, uninformative features, and selecting formative, relevant features based on the response variable. The latter objective identifies the frequent, shared features most preferred by online marketplace users as they express their preferences. A synthetic dataset was used for the experiments, which were run in Jupyter Notebook using Python. Results showed that, among a number of formative features, some were selected with high influence on the response variable. Different feature selection methods produced different feature scores, and the intrinsic method had the best overall results, with 85% model accuracy. The selected features were used as frequently preferred attributes that influence users' preferences.
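A minimal filter-style scorer of the kind described, ranking item attributes by absolute Pearson correlation with the response variable, could look like the sketch below; the synthetic data and variable names are invented for illustration and are not the paper's dataset or methods.

```python
import numpy as np

def score_features(X, y):
    """Absolute Pearson correlation of each attribute column with the
    response (preference) variable; higher means more formative."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = (Xc * yc[:, None]).sum(axis=0)
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                        # five item attributes
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)  # preference driven by attribute 0
scores = score_features(X, y)                        # attribute 0 scores highest
```

Attributes with the top scores would then be kept as the "frequently preferred" features for the recommender; intrinsic (embedded) methods instead obtain such scores from the model itself during training.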
Implications on feature detection when using the benefit–cost ratio
In many practical machine learning applications, there are two objectives: one is to maximize predictive accuracy and the other is to minimize the costs of the resulting model. These costs of individual features may be financial costs, but can also refer to other aspects, for example, evaluation time. Feature selection addresses both objectives, as it reduces the number of features and can improve the generalization ability of the model. If costs differ between features, the feature selection needs to trade off the individual benefit and cost of each feature. A popular trade-off choice is the ratio of the two, the benefit–cost ratio (BCR). In this paper, we analyze the implications of using this measure, with special focus on the ability to distinguish relevant features from noise. We perform simulation studies for different cost and data settings and obtain detection rates of relevant features and empirical distributions of the trade-off ratio. Our simulation studies exposed a clear impact of the cost setting on the detection rate. In situations with large cost differences and small effect sizes, the BCR missed relevant features and preferred cheap noise features. We conclude that a trade-off between predictive performance and costs without a controlling hyperparameter can easily overemphasize very cheap noise features. While the simple benefit–cost ratio offers an easy solution to incorporate costs, it is important to be aware of its risks. Avoiding costs close to 0, rescaling large cost differences, or using a hyperparameter trade-off are ways to counteract the adverse effects exposed in this paper.
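The failure mode described, a near-zero-cost noise feature outranking a genuinely relevant one, can be reproduced with two made-up features; the benefit and cost numbers below are purely illustrative.

```python
def bcr(benefit, cost):
    """Benefit-cost ratio used to rank candidate features."""
    return benefit / cost

features = {
    "relevant": {"benefit": 0.30, "cost": 1.000},  # real signal, ordinary cost
    "noise":    {"benefit": 0.01, "cost": 0.001},  # tiny benefit, near-zero cost
}
ranking = sorted(features, key=lambda f: bcr(**features[f]), reverse=True)
# the cheap noise feature wins: ranking == ["noise", "relevant"]
```

Because the cost sits in the denominator, a cost close to 0 inflates the ratio without bound; rescaling costs or adding a trade-off hyperparameter, as the paper suggests, removes this degeneracy.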