High-dimensional Sparse Count Data Clustering Using Finite Mixture Models
Due to the massive amount of available digital data, automating its analysis and modeling for
different purposes and applications has become an urgent need. One of the most challenging tasks
in machine learning is clustering, which is defined as the process of assigning observations sharing
similar characteristics to subgroups. Such a task is significant, especially in implementing complex
algorithms to deal with high-dimensional data. Thus, the advancement of computational power in
statistical-based approaches is increasingly becoming an interesting and attractive research domain.
Among the successful methods, mixture models have been widely acknowledged and successfully
applied in numerous fields as they have been providing a convenient yet flexible formal setting for
unsupervised and semi-supervised learning. An essential problem with these approaches is to develop
a probabilistic model that represents the data well by taking into account its nature. Count
data are widely used in machine learning and computer vision applications where an object, e.g.,
a text document or an image, can be represented by a vector corresponding to the appearance frequencies
of words or visual words, respectively. Thus, they usually suffer from the well-known
curse of dimensionality as objects are represented with high-dimensional and sparse vectors, i.e., a
few thousand dimensions with a sparsity of 95 to 99%, which dramatically degrades the performance of
clustering algorithms. Moreover, count data systematically exhibit the burstiness and overdispersion
phenomena, neither of which can be handled by the generic multinomial distribution typically
used to model count data, because it assumes that word occurrences are independent.
This thesis is constructed around six related manuscripts, in which we propose several approaches
for high-dimensional sparse count data clustering via various mixture models based on hierarchical Bayesian modeling frameworks that have the ability to model the dependency of repetitive
word occurrences. In such frameworks, a suitable distribution is used to introduce the prior
information into the construction of the statistical model, based on a conjugate distribution to the
multinomial, e.g. the Dirichlet, generalized Dirichlet, and the Beta-Liouville, which has numerous
computational advantages. Thus, we proposed a novel model that we call the Multinomial
Scaled Dirichlet (MSD) based on using the scaled Dirichlet as a prior to the multinomial to allow
more modeling flexibility. Although these frameworks can model burstiness and overdispersion
well, they share similar disadvantages that make their estimation procedure very inefficient when
the collection size is large. To handle high dimensionality, we considered two approaches. First,
we derived close approximations to the distributions in a hierarchical structure to bring them to
the exponential-family form aiming to combine the flexibility and efficiency of these models with
the desirable statistical and computational properties of the exponential family of distributions, including
sufficiency, which reduces the complexity and computational effort, especially for sparse
and high-dimensional data. Second, we proposed a model-based unsupervised feature selection approach
for count data to overcome several issues that may be caused by the high dimensionality of
the feature space, such as over-fitting, low efficiency, and poor performance.
Furthermore, we handled two significant aspects of mixture-based clustering methods, namely,
parameter estimation and model selection. We considered the Expectation-Maximization
(EM) algorithm, which is a broadly applicable iterative algorithm for estimating the mixture model
parameters, and incorporated several techniques to reduce its dependence on initialization and to
avoid poor local maxima. For model selection, we investigated different approaches to find the optimal number
of components based on the Minimum Message Length (MML) philosophy. The effectiveness of
our approaches is evaluated using challenging real-life applications, such as sentiment analysis, hate
speech detection on Twitter, topic novelty detection, human interaction recognition in films and TV
shows, facial expression recognition, face identification, and age estimation.
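As a rough illustration of the EM procedure mentioned above, the following minimal sketch fits a plain multinomial mixture to a handful of count vectors. It is only the baseline multinomial case, not the hierarchical models (such as the MSD) proposed in the thesis, and it performs no MML-based model selection. The function name `em_multinomial_mixture`, the smoothing constant `alpha`, the random soft initialization, and the toy data are all assumptions made for the example.

```python
import math
import random


def em_multinomial_mixture(X, k, n_iter=50, alpha=1e-2, seed=0):
    """Fit a k-component multinomial mixture to count vectors X with EM."""
    rng = random.Random(seed)
    n, dim = len(X), len(X[0])
    # Random soft initialization of the responsibilities (each row sums to 1).
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(n_iter):
        # M-step: mixing weights and smoothed per-component word probabilities.
        weights = [(sum(resp[i][j] for i in range(n)) + 1e-12) / (n + k * 1e-12)
                   for j in range(k)]
        log_theta = []
        for j in range(k):
            counts = [alpha] * dim  # alpha is an assumed smoothing constant
            for i in range(n):
                for d, c in enumerate(X[i]):
                    if c:  # skip zero counts, which dominate sparse data
                        counts[d] += resp[i][j] * c
            total = sum(counts)
            log_theta.append([math.log(v / total) for v in counts])

        # E-step: posterior responsibilities, computed in log space.
        # The multinomial coefficient is the same for every component and cancels.
        for i in range(n):
            logp = [math.log(weights[j])
                    + sum(c * log_theta[j][d] for d, c in enumerate(X[i]) if c)
                    for j in range(k)]
            m = max(logp)
            p = [math.exp(v - m) for v in logp]
            total = sum(p)
            resp[i] = [v / total for v in p]
    return weights, log_theta, resp


# Toy data (hypothetical): two groups of "documents" using disjoint vocabularies.
X = [[5, 4, 6, 0, 0, 0], [6, 5, 5, 0, 0, 0], [4, 6, 5, 0, 0, 0],
     [0, 0, 0, 5, 6, 4], [0, 0, 0, 6, 4, 5], [0, 0, 0, 5, 5, 6]]
weights, log_theta, resp = em_multinomial_mixture(X, k=2)
labels = [max(range(2), key=lambda j: r[j]) for r in resp]
print(labels)
```

Because the result depends on the random starting point, a practical implementation would typically run several restarts or use a more careful initialization, which is the kind of issue the abstract refers to.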
Evaluation of a Feature Subset Selection method to find informative spectral bands of Hyperion hyperspectral data for hydrothermal alteration mapping: A case study from the Darrehzar porphyry copper mine, Kerman, Iran
Introduction
In the regional prospecting of ore minerals, geologists usually utilize remote sensing images for hydrothermal alteration mineral mapping as a kind of lithological anomaly, which may be linked to mineral deposits (Carranza, 2002).
Compared to multispectral remote sensing images, which are composed of few spectral bands, hyperspectral data provide much more spectral detail about surface materials in many bands. These high spectral resolution images provide subtle spectral data for distinguishing similar surface materials (Camps-Valls et al., 2014). This ability could greatly enhance the potential of hyperspectral-based mineral mapping (Wang and Zheng, 2010). Over the last two decades, hyperspectral remote sensing has been an important tool for studying earth’s minerals and rocks (Zhang and Peijun, 2014).
Although the large number of spectral bands is an important advantage of hyperspectral images, many of those bands are usually irrelevant or redundant and therefore only increase the size and complexity of the band space. This complexity can lead to an ill-posed problem in supervised classification, namely the curse of dimensionality and the Hughes phenomenon, which negatively affects classification accuracy (Camps-Valls et al., 2014).
Feature reduction methods can be applied to overcome these problems and to eliminate spectral bands that provide no further useful information for the classification of hyperspectral images. These methods produce an efficient subset of features (spectral bands, in the remote sensing field) from the original feature space. The decrease in complexity that results from reducing the feature space can increase the ability of classifiers to capture the classification rules efficiently. Consequently, the speed, generalization, and predictive classification accuracy are increased (Gheyas and Smith, 2010; Camps-Valls et al., 2014).
This study aims to evaluate and manage the risk of the curse of dimensionality in hyperspectral data classification by means of a feature reduction method. The method is used to select the more informative spectral bands of Hyperion hyperspectral data, which are more effective for the classification of hydrothermal alteration zones. The study area is the well-known Darrehzar porphyry copper mine, located 8 km southeast of the giant Sarcheshmeh mine.
Materials and methods
1. Hyperion data
The Hyperion hyperspectral image with 242 spectral bands, acquired on July 26, 2004, was used in this study.
2. Train and test datasets
Two datasets were utilized. The first dataset, which resulted from the Mixture Tuned Matched Filtering (MTMF) method, was used as input to the feature reduction method. The second dataset, containing 17 rock samples collected from the study area, was used to carry out the classification by SVM.
3. The feature reduction method
In this study, we applied a hybrid Feature Subset Selection (FSS) method to reduce the number of spectral bands of Hyperion data. Extensive details may be found in Moradkhani et al. (2015).
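To make the idea of wrapper-style band selection concrete, the sketch below shows a generic greedy forward selection. It is not the hybrid FICA and IWSSr method of Moradkhani et al. (2015), and a leave-one-out nearest-centroid classifier stands in for the SVM used in this study. The function names and the toy reflectance values are hypothetical.

```python
def nearest_centroid_accuracy(X, y, bands):
    """Leave-one-out accuracy of a nearest-centroid classifier on the given bands."""
    correct = 0
    for i in range(len(X)):
        sums, counts = {}, {}
        # Build class centroids from every sample except the held-out one.
        for j, (row, label) in enumerate(zip(X, y)):
            if j == i:
                continue
            acc = sums.setdefault(label, [0.0] * len(bands))
            for p, b in enumerate(bands):
                acc[p] += row[b]
            counts[label] = counts.get(label, 0) + 1

        def dist(label):
            # Squared Euclidean distance from the held-out sample to a centroid.
            return sum((X[i][b] - sums[label][p] / counts[label]) ** 2
                       for p, b in enumerate(bands))

        if min(sums, key=dist) == y[i]:
            correct += 1
    return correct / len(X)


def greedy_band_selection(X, y, max_bands):
    """Add one band at a time, keeping it only if the wrapper score improves."""
    selected, best = [], 0.0
    while len(selected) < max_bands:
        scored = [(nearest_centroid_accuracy(X, y, selected + [b]), b)
                  for b in range(len(X[0])) if b not in selected]
        if not scored:
            break
        # Highest score wins; ties go to the lowest band index.
        score, band = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best:
            break  # no remaining band improves the accuracy
        selected.append(band)
        best = score
    return selected, best


# Hypothetical toy spectra: only the band at index 2 separates the two classes.
X = [[0.5, 0.1, 0.90, 0.3], [0.4, 0.2, 0.80, 0.3], [0.6, 0.1, 0.85, 0.3],
     [0.5, 0.2, 0.20, 0.3], [0.4, 0.1, 0.10, 0.3], [0.6, 0.2, 0.15, 0.3]]
y = ["a", "a", "a", "b", "b", "b"]
selected, best = greedy_band_selection(X, y, max_bands=3)
print(selected, best)
```

The stopping rule reflects the motivation given in the introduction: a band is kept only when it improves classification accuracy, so redundant or irrelevant bands are left out.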
Discussion and results
The Feature Subset Selection (FSS) algorithm was applied to reduce the number of spectral bands of Hyperion data. The implementation of this algorithm resulted in the selection of 9 bands among all 165 spectral bands (i.e., about 5% of all usable spectral bands of Hyperion) as the more influential bands for the identification of clay minerals. These bands fall into two spectral ranges, 2125-2250 nm and 2250-2400 nm. The Short-Wave Infrared (SWIR) electromagnetic range (2000-2500 nm) is considered an important spectral range for distinguishing clay minerals of hydrothermal alteration systems (Hosseinjani Zadeh et al., 2014). This suggests that the two ranges identified by FSS were appropriately selected, because both of them lie within the SWIR range. Specifically, bands 201, 202, 204, and 205 in the range of 2125-2250 nm are used for muscovite, kaolinite, and alunite enhancement. Moreover, bands 217, 220, 222, 223, and 224 in the range of 2250-2400 nm are appropriate for chlorite classification.
A comparison between the maps of SVM-based classification of the alteration zones using 9 spectral bands (selected by the feature selection method) and 165 spectral bands (all usable bands of Hyperion data) confirmed a significant improvement in the output results when the 9 more informative bands are utilized for classification instead of all 165 bands. In fact, the classification based on the 9 selected bands is comparable to, and even more effective than, the full-band classification. This is because the smaller number of spectral bands lets the SVM learn the rules of classification more accurately.
References
Camps-Valls, G., Tuia, D., Bruzzone, L. and Benediktsson, J., 2014. Advances in Hyperspectral Image Classification. IEEE Signal Processing Magazine, 31(1): 45–54.
Carranza, E.J.M., 2002. Geologically-Constrained Mineral Potential Mapping. Ph.D. Thesis, Delft University of Technology, Delft, Netherlands, 480 pp.
Gheyas, A. and Smith, L.S., 2010. Feature subset selection in large dimensionality domains. Pattern Recognition, 43(1): 5–13.
Hosseinjani Zadeh, M., Tangestani, M.H., Velasco Roldan, F. and Yusta, I., 2014. Spectral characteristics of minerals in alteration zones associated with porphyry copper deposits in the middle part of Kerman copper belt, SE Iran. Ore Geology Reviews, 62: 191–198.
Moradkhani, M., Amiri, A., Javaherian, M. and Safari, H., 2015. A hybrid algorithm for feature subset selection in high-dimensional data sets using FICA and IWSSr algorithm. Applied Soft Computing, 35: 123–135.
Wang, Z.H. and Zheng, C.Y., 2010. Rocks/Minerals Information Extraction from EO-1 Hyperion Data Base on SVM. International Conference on Intelligent Computation Technology and Automation, Changsha, China.
Zhang, X. and Peijun, L., 2014. Lithological mapping from hyperspectral data by improved use of spectral angle mapper. International Journal of Applied Earth Observation and Geoinformation, 31: 95–109.