
    Unsupervised Feature Selection Based on Self-configuration Approaches using Multidimensional Scaling

    Researchers often collect many features so that no principal information is lost. However, a large number of features can cause problems: irrelevant or repetitive features reduce the validity of analysis results. Feature selection is one solution. Feature selection methods are divided into two types, supervised and unsupervised. Supervised feature selection can only be carried out on labeled data, whereas unsupervised feature selection has three approaches: correlation, configuration, and variance. This study proposes an unsupervised feature selection method that combines correlation and configuration using multidimensional scaling (MDS). The proposed algorithm, MDS-Clustering, uses hierarchical and non-hierarchical clustering. The result of MDS-Clustering is compared with existing feature selection methods under three schemes, in which 75%, 50%, and 25% of the features are selected. The datasets used in this study come from the UCI repository. Validity is assessed by the goodness-of-fit of the proximity matrix (GoFP) and the accuracy of a classification algorithm. The comparison results show that the proposed feature selection is worth recommending as a new approach to the feature selection process; moreover, on certain data, the algorithm can outperform existing feature selection methods.
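    The abstract does not give code, but the combined correlation-and-configuration idea can be sketched roughly as follows: embed the features (not the samples) with MDS on a correlation-based distance matrix, cluster the embedded features, and keep one representative per cluster. This is a loose illustration, not the authors' exact MDS-Clustering algorithm; the wine dataset and the 25% retention rate are placeholders.

```python
# Rough sketch of correlation + configuration feature selection:
# MDS-embed the features using correlation distances, cluster the
# embedded features, keep one representative feature per cluster.
# Illustrative only -- not the paper's exact MDS-Clustering method.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.manifold import MDS
from sklearn.cluster import AgglomerativeClustering

X, _ = load_wine(return_X_y=True)       # any UCI-style numeric dataset
n_keep = max(1, X.shape[1] // 4)        # e.g. the 25% selection scheme

# Correlation-based distance between features: d = 1 - |r|
dist = 1.0 - np.abs(np.corrcoef(X, rowvar=False))

# Configuration step: place each feature as a point in 2-D via MDS
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dist)

# Cluster the feature configuration, one cluster per feature to keep
labels = AgglomerativeClustering(n_clusters=n_keep).fit_predict(coords)

# Representative per cluster: the feature closest to its cluster centroid
selected = []
for c in range(n_keep):
    members = np.where(labels == c)[0]
    centroid = coords[members].mean(axis=0)
    selected.append(members[np.argmin(
        np.linalg.norm(coords[members] - centroid, axis=1))])
X_reduced = X[:, sorted(selected)]
print("kept features:", sorted(selected))
```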

    What are the Best Hierarchical Descriptors for Complex Networks?

    This work reviews several hierarchical measurements of the topology of complex networks and then applies feature selection concepts and methods to quantify the relative importance of each measurement for discriminating between four representative theoretical network models, namely Erdős–Rényi, Barabási–Albert, Watts–Strogatz, and a geographical type of network. The results confirm that the four models can be well separated by using a combination of measurements. In addition, the relative contribution of each considered feature to the overall discrimination of the models was quantified in terms of its weight in the canonical projection into two dimensions, with the traditional clustering coefficient, hierarchical clustering coefficient, and neighborhood clustering coefficient proving particularly effective. Interestingly, the average shortest path length and hierarchical node degrees contributed little to the separation of the four network models.
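    As a rough illustration of the discrimination task (with a few simple topological measurements standing in for the paper's hierarchical measurement set), one can generate the three classical model families with networkx and project them into two dimensions with a canonical (linear discriminant) projection:

```python
# Sketch: separate Erdos-Renyi, Barabasi-Albert and Watts-Strogatz
# graphs using a few simple topological features and a canonical (LDA)
# projection to 2-D. The paper's hierarchical measurements are richer;
# these three features are illustrative stand-ins.
import numpy as np
import networkx as nx
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def measure(g):
    # restrict to the largest connected component so that the average
    # shortest path length is well defined
    g = g.subgraph(max(nx.connected_components(g), key=len))
    return [nx.average_clustering(g),
            nx.average_shortest_path_length(g),
            float(np.std([d for _, d in g.degree()]))]

makers = {"ER": lambda s: nx.erdos_renyi_graph(100, 0.06, seed=s),
          "BA": lambda s: nx.barabasi_albert_graph(100, 3, seed=s),
          "WS": lambda s: nx.watts_strogatz_graph(100, 6, 0.1, seed=s)}
X, y = [], []
for name, make in makers.items():
    for s in range(30):
        X.append(measure(make(s)))
        y.append(name)

# Canonical projection: with three classes, LDA yields two discriminant
# axes whose weights quantify each feature's contribution
lda = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
print("per-feature weights of the two canonical axes:\n",
      lda.scalings_[:, :2])
```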

    Towards Clustering of Mobile and Smartwatch Accelerometer Data for Physical Activity Recognition

    Mobile and wearable devices can now sense human activity ubiquitously and unobtrusively thanks to advances in miniaturization and sensing. However, outstanding issues remain around the energy restrictions of these devices when processing large sets of data. This paper presents an approach that uses feature selection to refine the clustering of accelerometer data for detecting physical activity. Feature selection also reduces the computational burden of processing large data sets: energy and resource use decrease because the clustering algorithms process less data. Raw accelerometer data, obtained from smartphones and smartwatches, were preprocessed to extract both time- and frequency-domain features. Principal component analysis feature selection (PCAFS) and correlation feature selection (CFS) were used to remove redundant features. The reduced feature sets were then evaluated against three widely used clustering algorithms: hierarchical clustering analysis (HCA), k-means, and density-based spatial clustering of applications with noise (DBSCAN). Using the reduced feature sets resulted in improved separability, reduced uncertainty, and improved efficiency compared with the baseline, which used all features. Overall, CFS in conjunction with HCA produced the highest Dunn index results, 9.7001 for the phone features and 5.1438 for the watch features, an improvement over the baseline. This comparative study of feature selection and clustering with these specific algorithms has not been performed previously and provides an optimistic and usable approach to recognizing activities using either a smartphone or a smartwatch.
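    The pipeline can be sketched as below. The synthetic windows, the correlation-threshold filter standing in for CFS, and the hand-rolled Dunn index are all assumptions for illustration; CFS proper and the exact feature set are described in the paper.

```python
# Sketch of the feature-extraction -> feature-selection -> clustering
# pipeline. Synthetic accelerometer windows stand in for real data, a
# simple correlation filter stands in for CFS, and the Dunn index is
# hand-rolled (it is not in scikit-learn).
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
windows = rng.normal(size=(200, 128))     # 200 windows of 128 samples

def features(w):
    spec = np.abs(np.fft.rfft(w))
    return [w.mean(), w.std(), w.min(), w.max(),     # time domain
            spec.mean(), spec.argmax(), spec.std()]  # frequency domain

X = np.array([features(w) for w in windows])

# Correlation filter (stand-in for CFS): drop one feature of any pair
# with |r| > 0.9
corr = np.abs(np.corrcoef(X, rowvar=False))
drop = {j for i in range(corr.shape[0])
          for j in range(i + 1, corr.shape[0]) if corr[i, j] > 0.9}
X_sel = X[:, [j for j in range(X.shape[1]) if j not in drop]]

labels = AgglomerativeClustering(n_clusters=4).fit_predict(X_sel)  # HCA

def dunn_index(X, labels):
    # min inter-cluster distance / max intra-cluster diameter
    D = pairwise_distances(X)
    ks = np.unique(labels)
    intra = max(D[np.ix_(labels == k, labels == k)].max() for k in ks)
    inter = min(D[np.ix_(labels == a, labels == b)].min()
                for a in ks for b in ks if a < b)
    return inter / intra

print("Dunn index:", dunn_index(X_sel, labels))
```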

    Complete agglomerative hierarchy document’s clustering based on fuzzy luhn’s gibbs latent dirichlet allocation

    Agglomerative hierarchical clustering is a bottom-up clustering method in which the distances between documents can be obtained by extracting feature values with a topic-based latent Dirichlet allocation method. To reduce the number of features, term selection can be done using Luhn's idea. Together, these methods can build better document clusters, yet little research has discussed this. Therefore, in this research, the term-weighting calculation uses Luhn's idea to select terms by defining upper and lower cut-offs, and then extracts term features using Gibbs-sampling latent Dirichlet allocation combined with term frequency and the fuzzy Sugeno method. The feature values serve as the distances between documents, which are clustered with single-, complete-, and average-link algorithms. The evaluations show little difference between feature extraction with and without the lower cut-off. However, topic determination for each term based on term frequency and the fuzzy Sugeno method is better than the Tsukamoto method at finding relevant documents. Using the lower cut-off and fuzzy Sugeno Gibbs latent Dirichlet allocation for complete agglomerative hierarchical clustering yields consistent metric values. This clustering method is suggested as a better method for clustering documents more relevantly to the gold standard.
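    A minimal sketch of the pipeline follows: Luhn-style upper and lower cut-offs (approximated by document-frequency thresholds), topic-model features, then single-, complete-, and average-link clustering. Note that scikit-learn's LDA uses variational inference rather than the paper's Gibbs sampler, and the fuzzy Sugeno topic-assignment step is omitted; both substitutions are assumptions for illustration.

```python
# Sketch: Luhn-style term cut-offs via document-frequency thresholds,
# topic-model document features, then hierarchical clustering with
# single/complete/average linkage. Variational LDA stands in for the
# paper's Gibbs sampler; the fuzzy Sugeno step is not reproduced.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stock markets fell sharply", "investors sold shares today"]

# Upper/lower cut-offs a la Luhn: drop very common and very rare terms
vec = CountVectorizer(max_df=0.9, min_df=1)
counts = vec.fit_transform(docs)

# Topic proportions as document features
theta = LatentDirichletAllocation(n_components=2,
                                  random_state=0).fit_transform(counts)

# Document distances from topic features, then agglomerative clustering
dists = pdist(theta, metric="cosine")
for method in ("single", "complete", "average"):
    tree = linkage(dists, method=method)
    print(method, fcluster(tree, t=2, criterion="maxclust"))
```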

    Partitional Clustering

    People live in a world full of data. Humans collect data from many measurements and observations in their daily work. Sorting these numerous data is important and necessary for analysis, reasoning, and decision-making. For this reason, clustering has been used in many areas and has become very important in recent years. The appropriate feature selection and way of classifying data into subsets can change from dataset to dataset, and various clustering methods have emerged as a result. Hierarchical clustering, partitional clustering, artificial system clustering, kernel-based clustering, and sequential data clustering constitute different clustering strategies. This chapter examines some popular partitional clustering techniques and algorithms. Partitional clustering assigns a set of data points to k clusters through an iterative process: a predefined criterion function (J) assigns each datum to one of the k sets, and clustering is carried out by maximizing or minimizing the value of this criterion function over the k sets, as sketched below. The chapter starts with the criterion function for the clustering process, and applications are presented for each algorithm.
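    The criterion-function formulation can be made concrete with its most common instance, k-means, where J is the sum of squared distances of each datum to its assigned cluster centre and Lloyd's iterations monotonically decrease J. The two-blob toy data are a placeholder.

```python
# Minimal k-means: the criterion function J is the sum of squared
# distances from each datum to its assigned cluster centre; Lloyd's
# iterations (assign, then update) monotonically decrease J.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest centre for each datum
        d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: each centre becomes the mean of its members
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centres[j] for j in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    # final assignment and criterion value
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    J = ((X - centres[labels]) ** 2).sum()
    return labels, centres, J

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=m, size=(50, 2)) for m in (0.0, 5.0)])
labels, centres, J = kmeans(X, k=2)
print("criterion J =", J)
```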

    Feature selection by multi-objective optimization: application to network anomaly detection by hierarchical self-organizing maps.

    Feature selection is an important and active issue in clustering and classification problems. Choosing an adequate feature subset allows dimensionality reduction of a dataset, which decreases the computational complexity of classification and improves classifier performance by avoiding redundant or irrelevant features. Although feature selection can be formally defined as an optimisation problem with only one objective (the classification accuracy obtained with the selected feature subset), several multi-objective approaches to this problem have been proposed in recent years. These either select features that improve not only the classification accuracy but also the generalisation capability, in the case of supervised classifiers, or counterbalance the bias toward lower or higher numbers of features exhibited by some methods used to validate the clustering/classification, in the case of unsupervised classifiers. The main contribution of this paper is a multi-objective approach to feature selection and its application to an unsupervised clustering procedure based on Growing Hierarchical Self-Organizing Maps (GHSOM), which includes a new method for unit labelling and efficient determination of the winning unit. In the network anomaly detection problem considered here, this multi-objective approach makes it possible not only to differentiate between normal and anomalous traffic but also to distinguish among different anomalies. The efficiency of the proposals has been evaluated on the well-known DARPA/NSL-KDD datasets, which contain extracted features and labeled attacks from around 2 million connections. The feature sets selected in the experiments provide detection rates of up to 99.8% for normal traffic and up to 99.6% for anomalous traffic, as well as accuracy values of up to 99.12%. This work has been funded by FEDER funds and the Ministerio de Ciencia e Innovación of the Spanish Government under Project No. TIN2012-32039.
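    The bi-objective idea (maximise accuracy, minimise the number of selected features) can be sketched with a Pareto scan over candidate subsets. A plain k-NN classifier stands in for the paper's GHSOM, and random subset sampling stands in for a proper multi-objective optimiser; both are illustrative assumptions, as is the dataset.

```python
# Sketch of bi-objective feature selection: keep the feature subsets
# that are Pareto-optimal in (higher cross-validated accuracy, fewer
# features). k-NN and random subset sampling are stand-ins for the
# paper's GHSOM and multi-objective optimiser.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
rng = np.random.default_rng(0)

candidates = []
for _ in range(60):                      # random candidate subsets
    mask = rng.random(X.shape[1]) < 0.3
    if not mask.any():
        continue
    acc = cross_val_score(KNeighborsClassifier(), X[:, mask], y,
                          cv=3).mean()
    candidates.append((acc, int(mask.sum()), mask))

# Pareto filter: discard subsets dominated on both objectives
pareto = [c for c in candidates
          if not any(o[0] >= c[0] and o[1] <= c[1] and o is not c
                     for o in candidates)]
for acc, n, _ in sorted(pareto, key=lambda t: -t[0]):
    print(f"accuracy={acc:.3f} with {n} features")
```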

    A supervised clustering approach for fMRI-based inference of brain states

    We propose a method that combines signals from many brain regions observed in functional Magnetic Resonance Imaging (fMRI) to predict the subject's behavior during a scanning session. Such predictions suffer from the huge number of brain regions sampled on the voxel grid of standard fMRI data sets: the curse of dimensionality. Dimensionality reduction is thus needed, but it is often performed using a univariate feature selection procedure that handles neither the spatial structure of the images nor the multivariate nature of the signal. By introducing a hierarchical clustering of the brain volume that incorporates connectivity constraints, we reduce the span of the possible spatial configurations to a single tree of nested regions tailored to the signal. We then prune the tree in a supervised setting (hence the name supervised clustering) in order to extract a parcellation (a division of the volume) such that parcel-based signal averages best predict the target information. Dimensionality reduction is thus achieved by feature agglomeration, and the constructed features provide a multi-scale representation of the signal. Comparisons with reference methods on both simulated and real data show that our approach yields higher prediction accuracy than standard voxel-based approaches. Moreover, the method infers an explicit weighting of the regions involved in the regression or classification task.
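    The connectivity-constrained feature-agglomeration step described here is available directly in scikit-learn; a minimal sketch follows, with synthetic data and a fixed number of parcels. The paper's supervised pruning of the cluster tree is its own contribution and is not reproduced.

```python
# Sketch of the feature-agglomeration step: Ward clustering of voxels
# under grid-connectivity constraints, then parcel-averaged signals
# fed to a classifier. Synthetic data and a fixed parcel count stand
# in for the paper's supervised tree pruning.
import numpy as np
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import FeatureAgglomeration
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

n_x, n_y, n_z = 8, 8, 8                       # toy "brain volume" grid
rng = np.random.default_rng(0)
X = rng.normal(size=(100, n_x * n_y * n_z))   # 100 scans x voxels
y = rng.integers(0, 2, size=100)              # behavioural target

# Only spatially adjacent voxels may be merged
connectivity = grid_to_graph(n_x, n_y, n_z)

agglo = FeatureAgglomeration(n_clusters=50, connectivity=connectivity)
X_parcels = agglo.fit_transform(X)            # parcel-averaged signals

print("voxels -> parcels:", X.shape[1], "->", X_parcels.shape[1])
print("CV accuracy:", cross_val_score(LogisticRegression(),
                                      X_parcels, y, cv=5).mean())
```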

    ENHANCEMENT OF DECISION TREE METHOD BASED ON HIERARCHICAL CLUSTERING AND DISPERSION RATIO

    Classification using a decision tree is a classification method that includes a feature-selection process. Decision tree classification using information gain has a disadvantage when the dataset has attributes that are unique for each record and an imbalanced class distribution. The data used for decision tree classification are of two types, numerical and nominal; numerical data undergo a discretization process to obtain data intervals. The weakness of the information gain method can be reduced by using a dispersion ratio method that depends not on the class distribution but on the frequency distribution. Numeric data are discretized using hierarchical clustering to obtain balanced data clusters. The data used in this study were taken from the UCI machine learning repository and contain both numeric and nominal types. The research has two stages: first, the numeric data are discretized using hierarchical clustering with three methods, namely single link, complete link, and average link; second, the discretization results are merged back, trees are built with attribute splitting using the dispersion ratio, and the result is evaluated with 7-fold cross-validation. The results show that discretizing the data with hierarchical clustering can increase prediction accuracy by 14.6% compared with undiscretized data. Attribute splitting with the dispersion ratio on data discretized by hierarchical clustering can increase prediction accuracy by 6.51%.
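    The discretization step can be sketched as below: cluster each numeric attribute's values with average-link hierarchical clustering and replace them with interval labels before growing the tree. The dispersion-ratio splitting criterion is the paper's contribution and scikit-learn trees cannot swap in a custom criterion, so the default gain-based splitting is used here as a stand-in; the iris dataset is a placeholder.

```python
# Sketch of hierarchical-clustering discretization: bin each numeric
# attribute by average-link clustering of its values, then grow a
# decision tree on the binned data. The paper's dispersion-ratio
# splitting criterion is not reproduced (scikit-learn trees use their
# built-in criteria).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import AgglomerativeClustering
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

def discretise(col, n_bins=3):
    # average-link clustering of the 1-D values into n_bins intervals
    labels = AgglomerativeClustering(
        n_clusters=n_bins, linkage="average").fit_predict(
        col.reshape(-1, 1))
    # relabel clusters by ascending mean so bins form ordered intervals
    order = np.argsort([col[labels == k].mean() for k in range(n_bins)])
    return np.argsort(order)[labels]

X_disc = np.column_stack([discretise(X[:, j]) for j in range(X.shape[1])])

tree = DecisionTreeClassifier(random_state=0)
print("raw:        ", cross_val_score(tree, X, y, cv=7).mean())
print("discretised:", cross_val_score(tree, X_disc, y, cv=7).mean())
```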