3 research outputs found

    Clustering high dimensional data using subspace and projected clustering algorithms

    Problem statement: Many clustering techniques have been developed in statistics, pattern recognition, data mining, and other fields. Subspace clustering enumerates clusters of objects in all subspaces of a dataset and tends to produce many overlapping clusters. Approach: Subspace clustering and projected clustering are research areas for clustering in high-dimensional spaces. In this research we experiment with three clustering-oriented algorithms: PROCLUS, P3C, and STATPC. Results: In general, PROCLUS performs better in terms of computation time and produces the fewest unclustered data points, while STATPC outperforms PROCLUS and P3C in the accuracy of both the cluster points and the relevant attributes found. Conclusions/Recommendations: In this study, we analyze in detail the properties of different data clustering methods. Comment: 9 pages, 6 figures

    EDSC: Efficient document subspace clustering technique for high-dimensional data

    With the advancement of pervasive technology, there is a spontaneous rise in the size of data. Such data are generated from many kinds of sources, from the individual to the organizational level. Because the data are represented in unstructured or semi-structured form, existing data analytics approaches are not directly applicable, which leads to the curse of dimensionality problem. Hence, this paper presents an Efficient Document Subspace Clustering (EDSC) technique for high-dimensional data that contributes to the existing system with respect to identification by eliminating redundant data. Discrete segmentation of data points is used to explicitly expose the dimensionality of hidden subspaces in the clusters. The outcome of the proposed system was compared with the existing system to find the effective document clustering process for high-dimensional data. The processing time of EDSC for subspace clustering is reduced by 50% compared to the existing system.

    Machine learning for understanding complex, interlinked social data

    With the growing availability of ‘big’ data, increasing computer power, and improved data storage capacities, machine learning techniques are now frequently employed to make sense of data. Yet the social sciences have been slow to adopt these techniques, and there is little evidence of their use in some academic fields. This thesis explores the methods most commonly utilised in social science research, that is, linear regression and null hypothesis significance testing, in order to identify how machine learning methods might complement these more established methods. A case study exploring the Troubled Families (TF) programme provides a practical example of how machine learning techniques can be utilised on complex, interlinked social data to provide deeper understanding of, and more insight into, the data. Eleven different types of families were identified using cluster analysis, and analysis was performed to understand how the families’ lives changed after joining the TF programme compared to before. The analysis provided insight into the various types of families that existed and the problems that they had. It also highlighted that, had the data been analysed at an overall global level, it would have been prone to an averaging effect whereby many of the changes that occurred were not apparent; analysis at the cluster level resulted in the identification of cluster-level patterns and a greater understanding of the data. This thesis demonstrated that machine learning techniques, such as cluster analysis and decision tree learning, can be effectively utilised on complex ‘real-life’ social science datasets. These methods can identify hidden groups and relationships and important predictors in a dataset, provide a better understanding of the structure of the data, and aid in generating research questions and hypotheses.
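The averaging effect mentioned in this abstract can be illustrated with a small sketch. The records, cluster labels, and score values below are entirely hypothetical and are not taken from the thesis; the example only shows how opposite changes in two clusters can cancel out in a global mean while remaining visible at the cluster level.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical before/after scores for two made-up clusters of families.
# These numbers are illustrative assumptions, not data from the thesis.
records = [
    {"cluster": "A", "before": 10, "after": 6},
    {"cluster": "A", "before": 12, "after": 8},
    {"cluster": "A", "before": 9, "after": 5},
    {"cluster": "B", "before": 3, "after": 7},
    {"cluster": "B", "before": 4, "after": 8},
    {"cluster": "B", "before": 2, "after": 6},
]


def mean_change(rows):
    """Average of (after - before) over all given rows."""
    return mean(r["after"] - r["before"] for r in rows)


def change_by_cluster(rows):
    """Average of (after - before) computed separately for each cluster."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["cluster"]].append(r)
    return {label: mean_change(members) for label, members in groups.items()}


global_change = mean_change(records)
cluster_change = change_by_cluster(records)
print("global:", global_change)
print("per cluster:", cluster_change)
```

In this toy data the global mean change is zero because cluster A's decrease offsets cluster B's increase, whereas the per-cluster means show a change of 4 points in each direction.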