31 research outputs found

    Different Subspace Classification

    We introduce the idea of Characteristic Regions to solve a classification problem. By identifying regions in which classes are dense (i.e. contain many observations) and also relevant (for discrimination), we can characterize the different classes. These Characteristic Regions are used to generate a classification rule. The result can be visualized, giving the user insight into the data and an easy interpretation.

    Application of a Genetic Algorithm to Variable Selection in Fuzzy Clustering

    In order to group the observations of a data set into a given number of clusters, an 'optimal' subset out of a greater number of explanatory variables is to be selected. The problem is approached by maximizing a quality measure under certain restrictions that are supposed to keep the subset most representative of the whole data. The restrictions may either be set manually, or generated from the data. A genetic optimization algorithm is developed to solve this problem. The procedure is then applied to a data set describing features of sub-districts of the city of Dortmund, Germany, to detect different social milieus and investigate the variables making up the differences between these.

    Variable selection for discrimination of more than two classes where data are sparse

    In classification, with an increasing number of variables, the required number of observations grows drastically. In this paper we present an approach to put into effect the maximal possible variable selection, by splitting a K class classification problem into pairwise problems. The principle makes use of the possibility that a variable that discriminates two classes will not necessarily do so for all such class pairs. We further present the construction of a classification rule based on the pairwise solutions by the Pairwise Coupling algorithm according to Hastie and Tibshirani (1998). The suggested procedure can be applied to any classification method. Finally, situations with lack of data in multidimensional spaces are investigated on different simulated data sets to illustrate the problem and the possible gain. The principle is compared to the classical approach of linear and quadratic discriminant analysis.
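
    The pairwise idea in this abstract can be illustrated with a minimal sketch: every pair of classes gets its own variable subset, so a variable that only separates some pairs is used only there. The function name `pairwise_variable_subsets`, the parameter `n_keep`, and the separation score (absolute mean difference divided by the pooled standard deviation) are assumptions made for illustration and are not taken from the paper. The coupling of the pairwise rules (Hastie and Tibshirani, 1998) is not shown.

```python
from itertools import combinations
from statistics import mean, pstdev


def pairwise_variable_subsets(X, y, n_keep=1):
    """Select a separate variable subset for every pair of classes.

    X is a list of numeric rows, y the list of class labels.
    The score used here is an illustrative assumption, not the paper's criterion.
    """
    classes = sorted(set(y))
    n_vars = len(X[0])
    subsets = {}
    for a, b in combinations(classes, 2):
        rows_a = [row for row, label in zip(X, y) if label == a]
        rows_b = [row for row, label in zip(X, y) if label == b]
        scores = []
        for j in range(n_vars):
            col_a = [row[j] for row in rows_a]
            col_b = [row[j] for row in rows_b]
            # Spread of the two classes taken together; fall back to 1.0 if it is zero.
            spread = pstdev(col_a + col_b) or 1.0
            scores.append(abs(mean(col_a) - mean(col_b)) / spread)
        # Keep the n_keep highest-scoring variables for this class pair only.
        ranked = sorted(range(n_vars), key=lambda j: scores[j], reverse=True)
        subsets[(a, b)] = sorted(ranked[:n_keep])
    return subsets


# Toy data: variable 0 separates A from B, variable 1 separates B from C.
X = [[0.0, 0.0], [0.2, 0.1], [5.0, 0.1], [5.2, 0.0], [5.1, 5.0], [4.9, 5.2]]
y = ["A", "A", "B", "B", "C", "C"]
subsets = pairwise_variable_subsets(X, y, n_keep=1)
```

    In this toy example the pair (A, B) keeps only variable 0 and the pair (B, C) keeps only variable 1, whereas a single selection for all three classes would need both variables.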

    Predicting eBay Prices: Selecting and Interpreting Machine Learning Models – Results of the AG DANK 2018 Data Science Competition

    The annual meeting of the work group on data analysis and numeric classification (DANK) took place at Stralsund University of Applied Sciences, Germany, on October 26th and 27th, 2018, with a focus theme on interpretable machine learning. Traditionally, the conference is accompanied by a data science competition where the participants are invited to analyze one or several data sets and compare and discuss their solutions. In 2018, the task was to predict end prices of eBay auctions. The paper describes the task and discusses the results provided by the conference participants. These cover aspects of preprocessing, comparison of different models, task-specific hyperparameter tuning, as well as the interpretation of the resulting models and the relevance of additional text information.

    Cluster Validation for Mixed-Type Data

    For cluster analysis based on mixed-type data (i.e. data consisting of numerical and categorical variables), comparatively few clustering methods are available. One popular approach to deal with this kind of problem is an extension of the k-means algorithm (Huang, 1998), the so-called k-prototype algorithm, which is implemented in the R package clustMixType (Szepannek and Aschenbruck, 2019). It is further known that the selection of a suitable number of clusters k is particularly crucial in partitioning cluster procedures. Many implementations of cluster validation indices in R are not suitable for mixed-type data. This paper examines the transferability of validation indices, such as the Gamma index, Average Silhouette Width or Dunn index, to mixed-type data. Furthermore, the R package clustMixType is extended by these indices and their application is demonstrated. Finally, the behaviour of the adapted indices is tested by a short simulation study using different data scenarios.
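
    As a rough illustration of why mixed-type data need their own distance, the following sketch shows the kind of dissimilarity the k-prototype approach is commonly described with: squared Euclidean distance on the numerical variables plus a weighted count of mismatches on the categorical ones. The function names, the argument layout and the weight `lam` are hypothetical and do not reproduce the interface of clustMixType.

```python
def kproto_dissimilarity(x, prototype, numeric_idx, categorical_idx, lam=1.0):
    """Mixed-type dissimilarity between an observation and a cluster prototype.

    Numerical part: squared Euclidean distance.
    Categorical part: number of mismatching categories, weighted by lam.
    """
    num_part = sum((x[j] - prototype[j]) ** 2 for j in numeric_idx)
    cat_part = sum(1 for j in categorical_idx if x[j] != prototype[j])
    return num_part + lam * cat_part


def nearest_prototype(x, prototypes, numeric_idx, categorical_idx, lam=1.0):
    """Return the index of the prototype with the smallest dissimilarity to x."""
    distances = [
        kproto_dissimilarity(x, p, numeric_idx, categorical_idx, lam)
        for p in prototypes
    ]
    return distances.index(min(distances))


# Two numerical variables (positions 0, 1) and two categorical ones (positions 2, 3).
prototypes = [(0.0, 2.0, "red", "large"), (4.0, 4.0, "blue", "small")]
x = (1.0, 2.0, "red", "small")
d = kproto_dissimilarity(x, prototypes[0], [0, 1], [2, 3], lam=0.5)
cluster = nearest_prototype(x, prototypes, [0, 1], [2, 3], lam=0.5)
```

    Validation indices such as the Average Silhouette Width depend only on pairwise distances, so a dissimilarity of this form is one plausible starting point for carrying them over to mixed-type data.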
