3 research outputs found

    High-dimensional Sparse Count Data Clustering Using Finite Mixture Models

    Due to the massive amount of available digital data, automating its analysis and modeling for different purposes and applications has become an urgent need. One of the most challenging tasks in machine learning is clustering, defined as the process of assigning observations that share similar characteristics to subgroups. This task is especially significant when complex algorithms must deal with high-dimensional data. Thus, improving the computational power of statistical approaches is becoming an increasingly attractive research domain. Among the successful methods, mixture models have been widely acknowledged and applied in numerous fields, as they provide a convenient yet flexible formal setting for unsupervised and semi-supervised learning. An essential problem with these approaches is developing a probabilistic model that represents the data well by taking its nature into account. Count data are widely used in machine learning and computer vision applications, where an object, e.g., a text document or an image, can be represented by a vector of the appearance frequencies of words or visual words, respectively. Such data usually suffer from the well-known curse of dimensionality, because objects are represented by high-dimensional and sparse vectors, i.e., a few thousand dimensions with a sparsity of 95 to 99%, which degrades the performance of clustering algorithms dramatically. Moreover, count data systematically exhibit the burstiness and overdispersion phenomena, neither of which can be handled by the generic multinomial distribution typically used to model count data, because of its independence assumption.
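As a rough illustration of overdispersion, the sketch below compares the per-category variance of a multinomial with that of a Dirichlet-multinomial (Dirichlet compound multinomial). This is a standard compound model used here purely as an example; the function names and parameter values are assumptions and are not taken from the thesis.

```python
def multinomial_variance(n, p):
    """Per-category variance of a multinomial with n trials and probabilities p."""
    return [n * pk * (1 - pk) for pk in p]


def dirichlet_multinomial_variance(n, alpha):
    """Per-category variance of a Dirichlet-multinomial with parameters alpha.

    The mean proportions are alpha_k / sum(alpha); the variance is the
    multinomial variance inflated by (n + sum(alpha)) / (1 + sum(alpha)).
    """
    total = sum(alpha)
    p = [a / total for a in alpha]
    inflation = (n + total) / (1 + total)
    return [n * pk * (1 - pk) * inflation for pk in p]


# Hypothetical example: 100 word tokens, three vocabulary terms.
n_tokens = 100
alpha = [0.5, 0.3, 0.2]            # small alpha values give bursty counts
proportions = [a / sum(alpha) for a in alpha]

plain = multinomial_variance(n_tokens, proportions)
compound = dirichlet_multinomial_variance(n_tokens, alpha)
for k, (v_plain, v_compound) in enumerate(zip(plain, compound)):
    print(f"term {k}: multinomial={v_plain:.2f}, Dirichlet-multinomial={v_compound:.2f}")
```

With these assumed values the inflation factor is (100 + 1) / (1 + 1) = 50.5, so the compound model allows far more count variability than the multinomial with the same mean proportions.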
This thesis is constructed around six related manuscripts, in which we propose several approaches for clustering high-dimensional sparse count data via various mixture models based on hierarchical Bayesian modeling frameworks that can model the dependency of repetitive word occurrences. In such frameworks, prior information is introduced into the statistical model through a distribution conjugate to the multinomial, e.g., the Dirichlet, the generalized Dirichlet, or the Beta-Liouville, which offers numerous computational advantages. Along these lines, we proposed a novel model that we call the Multinomial Scaled Dirichlet (MSD), which uses the scaled Dirichlet as a prior to the multinomial to allow more modeling flexibility. Although these frameworks model burstiness and overdispersion well, they share a disadvantage: their estimation procedures are very inefficient when the collection size is large. To handle high dimensionality, we considered two approaches. First, we derived close approximations to the distributions in the hierarchical structure to bring them into exponential-family form, aiming to combine the flexibility and efficiency of these models with the desirable statistical and computational properties of the exponential family of distributions, including sufficiency, which reduces complexity and computational effort, especially for sparse and high-dimensional data. Second, we proposed a model-based unsupervised feature selection approach for count data to overcome several issues that may be caused by the high dimensionality of the feature space, such as over-fitting, low efficiency, and poor performance. Furthermore, we addressed two significant aspects of mixture-based clustering methods, namely parameter estimation and model selection.
We considered the Expectation-Maximization (EM) algorithm, a broadly applicable iterative algorithm for estimating mixture model parameters, and incorporated several techniques to reduce its dependency on initialization and to avoid poor local maxima. For model selection, we investigated different approaches to finding the optimal number of components based on the Minimum Message Length (MML) philosophy. The effectiveness of our approaches is evaluated using challenging real-life applications, such as sentiment analysis, hate speech detection on Twitter, topic novelty detection, human interaction recognition in films and TV shows, facial expression recognition, face identification, and age estimation.
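To make the EM procedure mentioned above concrete, here is a minimal sketch of EM for a mixture of plain multinomials. It is deliberately simpler than the hierarchical models proposed in the thesis, and it is not the author's implementation. The function names, the `smoothing` constant, and the initialization from the first `n_components` documents are all assumptions made for illustration.

```python
import math


def normalise(values):
    """Scale a list of positive numbers so that it sums to one."""
    total = sum(values)
    return [v / total for v in values]


def em_multinomial_mixture(counts, n_components, n_iter=20, smoothing=0.01):
    """Fit a mixture of multinomials to count vectors with EM.

    counts: list of equal-length lists of non-negative integers.
    Returns (weights, thetas, responsibilities).
    """
    n_dims = len(counts[0])
    weights = [1.0 / n_components] * n_components
    # Assumed, simplistic initialisation: component k starts at document k.
    thetas = [normalise([c + smoothing for c in counts[k]])
              for k in range(n_components)]
    resp = []
    for _ in range(n_iter):
        # E-step: posterior component probabilities, computed in log space.
        resp = []
        for doc in counts:
            log_post = [
                math.log(weights[k])
                + sum(c * math.log(thetas[k][d]) for d, c in enumerate(doc) if c)
                for k in range(n_components)
            ]
            top = max(log_post)
            resp.append(normalise([math.exp(lp - top) for lp in log_post]))
        # M-step: re-estimate weights and category probabilities (smoothed).
        weights = normalise([smoothing + sum(r[k] for r in resp)
                             for k in range(n_components)])
        thetas = [
            normalise([smoothing + sum(r[k] * doc[d] for r, doc in zip(resp, counts))
                       for d in range(n_dims)])
            for k in range(n_components)
        ]
    return weights, thetas, resp


# Hypothetical toy corpus: two documents per "topic", four vocabulary terms.
docs = [[10, 9, 0, 0], [0, 0, 9, 10], [8, 11, 0, 0], [0, 1, 12, 8]]
weights, thetas, resp = em_multinomial_mixture(docs, n_components=2)
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
print(labels)
```

The E-step skips zero counts, so its cost scales with the number of non-zero entries per document, which matters for sparse data. Because the result depends on the starting point, a practical implementation would replace the naive initialization used here.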

    Evaluation of a Feature Subset Selection method to find informative spectral bands of Hyperion hyperspectral data for hydrothermal alteration mapping: A case study from the Darrehzar porphyry copper mine, Kerman, Iran

    Introduction
In the regional prospecting of ore minerals, geologists usually use remote sensing images to map hydrothermal alteration minerals as a kind of lithological anomaly that may be linked to mineral deposits (Carranza, 2002). Compared with multispectral remote sensing images, which are composed of few spectral bands, hyperspectral data provide much more spectral detail about surface materials across many bands. These high-spectral-resolution images provide subtle spectral data for distinguishing similar surface materials (Camps-Valls et al., 2014). This ability could greatly increase the potential of hyperspectral mineral mapping (Wang and Zheng, 2010). Over the last two decades, hyperspectral remote sensing has been an important tool for studying the Earth's minerals and rocks (Zhang and Peijun, 2014). Although the high number of spectral bands is an important advantage of hyperspectral images, many of those bands are usually irrelevant or redundant and therefore only increase the size and complexity of the band space. This complexity can lead to an ill-posed problem in supervised classification, namely the curse of dimensionality and the Hughes phenomenon, which negatively affect classification accuracy (Camps-Valls et al., 2014). Feature reduction methods can be applied to overcome these problems by eliminating the spectral bands that provide no further useful information for the classification of hyperspectral images. These methods produce an efficient subset of features (spectral bands, in the remote sensing field) from the original feature space. The reduced complexity of the feature space can increase the ability of classifiers to capture the classification rules efficiently. Consequently, speed, generalization, and predictive classification accuracy are increased (Gheyas and Smith, 2010; Camps-Valls et al., 2014).
This study aims to evaluate and manage the risk of the curse of dimensionality in hyperspectral data classification by means of a feature reduction method. The method is used to select the more informative spectral bands of Hyperion hyperspectral data, which are more effective for the classification of hydrothermal alteration zones. The study area is the well-known Darrehzar porphyry copper mine, located 8 km southeast of the giant Sarcheshmeh mine.
Materials and methods
1. Hyperion data: The Hyperion hyperspectral image with 242 spectral bands, acquired on July 26, 2004, was used in this study.
2. Training and test datasets: Two datasets were used. The first, which resulted from the Mixture Tuned Matched Filtering (MTMF) method, was fed to the feature reduction method; the second, containing 17 rock samples collected from the study area, was used to carry out the classification by SVM.
3. The feature reduction method: We applied a hybrid Feature Subset Selection (FSS) method to reduce the number of spectral bands of Hyperion data. Extensive details may be found in Moradkhani et al. (2015).
Discussion and results
The FSS algorithm was applied to reduce the number of spectral bands of Hyperion data. It selected 9 of the 165 usable spectral bands (i.e., about 5% of all usable spectral bands of Hyperion) as the more influential bands for the identification of clay minerals. These bands belong to two spectral ranges, 2125-2250 nm and 2250-2400 nm. The Short-Wave Infrared (SWIR) electromagnetic range (2000-2500 nm) is considered an important spectral range for distinguishing the clay minerals of hydrothermal alteration systems (Hosseinjani Zadeh et al., 2014).
This implies that the two ranges identified by FSS were appropriately selected, because both lie within the SWIR range. Specifically, bands 201, 202, 204, and 205 in the 2125-2250 nm range are used for muscovite, kaolinite, and alunite enhancement. Moreover, bands 217, 220, 222, 223, and 224 in the 2250-2400 nm range are appropriate for chlorite classification. A comparison of the SVM-based classification maps of the alteration zones using 9 spectral bands (selected by the feature selection method) and 165 spectral bands (all usable bands of Hyperion data) confirmed a significant improvement in the results when the 9 more informative bands are used for classification instead of all 165 bands. In fact, classification based on the 9 selected bands is comparable to, and even more effective than, the full-band classification. This is because the reduction in spectral bands lets the SVM learn the classification rules more accurately.
References
Camps-Valls, G., Tuia, D., Bruzzone, L. and Benediktsson, J., 2014. Advances in Hyperspectral Image Classification. IEEE Signal Processing Magazine, 31(1): 45–54.
Carranza, E.J.M., 2002. Geologically-Constrained Mineral Potential Mapping. Ph.D. Thesis, Delft University of Technology, Delft, Netherlands, 480 pp.
Gheyas, A. and Smith, L.S., 2010. Feature subset selection in large dimensionality domains. Pattern Recognition, 43(1): 5–13.
Hosseinjani Zadeh, M., Tangestani, M.H., Velasco Roldan, F. and Yusta, I., 2014. Spectral characteristics of minerals in alteration zones associated with porphyry copper deposits in the middle part of Kerman copper belt, SE Iran. Ore Geology Reviews, 62: 191–198.
Moradkhani, M., Amiri, A., Javaherian, M. and Safari, H., 2015. A hybrid algorithm for feature subset selection in high-dimensional data sets using FICA and IWSSr algorithm. Applied Soft Computing, 35: 123–135.
Wang, Z.H. and Zheng, C.Y., 2010. Rocks/Minerals Information Extraction from EO-1 Hyperion Data Base on SVM. International Conference on Intelligent Computation Technology and Automation, Changsha, China.
Zhang, X. and Peijun, L., 2014. Lithological mapping from hyperspectral data by improved use of spectral angle mapper. International Journal of Applied Earth Observation and Geoinformation, 31: 95–109.
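The band counts and wavelength ranges reported in the discussion of this abstract can be checked with a short calculation. The sketch below uses only the figures stated in the abstract; the variable names are assumptions chosen for illustration.

```python
# Band numbers and wavelength ranges (in nm) as reported in the abstract.
SELECTED_BANDS = {
    (2125, 2250): [201, 202, 204, 205],
    (2250, 2400): [217, 220, 222, 223, 224],
}
USABLE_BANDS = 165          # usable Hyperion bands, per the abstract
SWIR_RANGE = (2000, 2500)   # SWIR range quoted in the abstract

n_selected = sum(len(bands) for bands in SELECTED_BANDS.values())
fraction = n_selected / USABLE_BANDS

# Both selected ranges should lie entirely inside the SWIR range.
inside_swir = all(
    SWIR_RANGE[0] <= low and high <= SWIR_RANGE[1]
    for low, high in SELECTED_BANDS
)

print(f"selected bands: {n_selected}")
print(f"share of usable bands: {fraction:.1%}")
print(f"ranges inside SWIR: {inside_swir}")
```

The nine selected bands amount to roughly 5.5% of the 165 usable bands, consistent with the figure of about 5% given in the abstract.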