129 research outputs found

    Temporal optimisation of image acquisition for land cover classification with random forest and MODIS time-series

    The analysis and classification of land cover is one of the principal applications in terrestrial remote sensing. Due to the seasonal variability of different vegetation types and land surface characteristics, the ability to discriminate land cover types changes over time. Multi-temporal classification can help to improve classification accuracies, but constraints such as financial restrictions or atmospheric conditions may impede its application. Optimising image acquisition timing and frequency can increase the effectiveness of the classification process. For this purpose, the Feature Importance (FI) measure of the state-of-the-art machine learning method Random Forest was used to determine the optimal image acquisition periods for a general (Grassland, Forest, Water, Settlement, Peatland) and a Grassland-specific (Improved Grassland, Semi-Improved Grassland) land cover classification in central Ireland, based on a nine-year time-series of MODIS Terra 16-day composite data (MOD13Q1). Feature Importances for each acquisition period of the Enhanced Vegetation Index (EVI) and Normalised Difference Vegetation Index (NDVI) were calculated for both classification scenarios. In the general land cover classification, December and January showed the highest, and July and August the lowest, separability for both VIs over the entire nine-year period. This temporal separability was reflected in the classification accuracies, where the optimal choice of image dates outperformed the worst image date by 13% using NDVI and 5% using EVI in a mono-temporal analysis. With the addition of the next-best image periods to the data input, the classification accuracies converged quickly to their limit at around 8–10 images. The binary classification schemes, using two classes only, showed a stronger seasonal dependency with a higher intra-annual, but lower inter-annual, variation. 
Nonetheless, anomalous weather conditions, such as the cold winter of 2009/2010, can alter the temporal separability pattern significantly. Due to the extensive use of the NDVI for land cover discrimination, the findings of this study should be transferable to data from other optical sensors with a higher spatial resolution. However, the high impact of outliers from the general climatic pattern highlights the limitation of spatial transferability to locations with different climatic and land cover conditions. The use of high-temporal, moderate-resolution data such as MODIS in conjunction with machine-learning techniques proved to be a good basis for predicting image acquisition timing for optimal land cover classification results.
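    The core ranking step described above — fit a Random Forest on per-period vegetation-index features and read off which acquisition periods matter — can be sketched as below. Everything here is a hypothetical stand-in: the data are synthetic, not MOD13Q1 composites, and scikit-learn's impurity-based importances substitute for the study's exact FI computation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in for the MOD13Q1 setting: 200 pixels x 23 acquisition
# periods of one vegetation index, two land cover classes, with the classes
# most separable in the "winter" periods (indices 0-2 and 20-22).
n_pixels, n_periods = 200, 23
X = rng.normal(0.5, 0.1, size=(n_pixels, n_periods))
y = rng.integers(0, 2, size=n_pixels)
X[y == 1, :3] += 0.3   # class 1 differs in early-year periods
X[y == 1, 20:] += 0.3  # ...and in late-year periods

# Impurity-based feature importances rank acquisition periods by their
# contribution to class separation, analogous to the study's FI measure.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
best_periods = sorted(ranking[:6].tolist())
```

On data like this, the periods carrying class information rise to the top of the ranking, mirroring how the December/January composites dominated in the study.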

    Kernel-Based Data Mining Approach with Variable Selection for Nonlinear High-Dimensional Data

    In statistical data mining research, datasets often exhibit nonlinearity and high dimensionality, which make them difficult to analyze comprehensively with traditional statistical methodologies. Kernel-based data mining is one of the most effective statistical methodologies for investigating a variety of problems in areas including pattern recognition, machine learning, bioinformatics, chemometrics, and statistics. In particular, statistically sophisticated procedures that emphasize the reliability of results and computational efficiency are required for the analysis of high-dimensional data. In this dissertation, first, a novel wrapper method called SVM-ICOMP-RFE, based on a hybridized support vector machine (SVM) and recursive feature elimination (RFE) with the information-theoretic measure of complexity (ICOMP), is introduced and developed to classify high-dimensional datasets and to carry out subset selection of the variables in the original data space, finding the best subset for discriminating between groups. RFE ranks variables based on the ICOMP criterion. Second, a dual-variables functional support vector machine approach is proposed. The proposed approach uses both the first and second derivatives of the degradation profiles. A modified floating search algorithm for repeated variable selection, with newly added degradation path points, is presented to find a few good variables while reducing the computation time for on-line implementation. Third, a two-stage scheme for the classification of near infrared (NIR) spectral data is proposed. In the first stage, the proposed multi-scale vertical energy thresholding (MSVET) procedure is used to reduce the dimension of the high-dimensional spectral data. In the second stage, a few important wavelet coefficients are selected using the proposed SVM gradient-recursive feature elimination. 
Fourth, a novel methodology for discriminant analysis based on a human decision-making process, called PDCM, is proposed. The proposed methodology consists of three basic steps emulating the thinking process: perception, decision, and cognition. In these steps, two concepts, support vector machines for classification and information complexity, are integrated to evaluate learning models.
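    The backbone of the first contribution, SVM-based recursive feature elimination, can be illustrated with scikit-learn. This is a deliberately simplified sketch on synthetic data: variables are ranked by the magnitude of the linear SVM weights, which stands in for the dissertation's ICOMP criterion.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic high-dimensional data: 100 samples, 50 variables, of which only
# the first 5 carry class information (a stand-in for the dissertation's
# setting; plain coefficient-based ranking replaces the ICOMP criterion).
X = rng.normal(size=(100, 50))
y = (X[:, :5].sum(axis=1) > 0).astype(int)

# Recursive feature elimination with a linear SVM: repeatedly fit, rank
# variables by |weight|, and drop the weakest until 5 remain.
selector = RFE(SVC(kernel="linear", C=1.0), n_features_to_select=5, step=1)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)
```

Because the labels are a linear function of the first five variables, the surviving subset concentrates on them, which is the behaviour a wrapper method of this kind is designed to exhibit.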

    Gene selection algorithms for microarray data based on least squares support vector machine

    BACKGROUND: In discriminant analysis of microarray data, a small number of samples is usually described by the expression levels of a large number of genes. It is not only difficult but also unnecessary to conduct the discriminant analysis with all the genes. Hence, gene selection is usually performed to select important genes. RESULTS: A gene selection method searches for an optimal or near-optimal subset of genes with respect to a given evaluation criterion. In this paper, we propose a new evaluation criterion, named the leave-one-out calculation (LOOC) measure. A gene selection method, named the leave-one-out calculation sequential forward selection (LOOCSFS) algorithm, is then presented by combining the LOOC measure with the sequential forward selection scheme. Further, a novel gene selection algorithm, the gradient-based leave-one-out gene selection (GLGS) algorithm, is also proposed. Both gene selection algorithms originate from an efficient and exact calculation of the leave-one-out cross-validation error of the least squares support vector machine (LS-SVM). The proposed approaches are applied to two microarray datasets and compared to other well-known gene selection methods using codes available from the second author. CONCLUSION: The proposed gene selection approaches can provide gene subsets leading to more accurate classification results, while their computational complexity is comparable to that of existing methods. The GLGS algorithm also scales better to datasets with a very large number of genes.
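    The exact leave-one-out calculation both algorithms build on can be written in a few lines for the bias-free case, where the LS-SVM reduces to kernel ridge regression: with C = K + λI and α = C⁻¹y, the LOO residual for sample i is αᵢ / (C⁻¹)ᵢᵢ, so all n held-out predictions come from a single matrix inversion. The data below are synthetic, and this sketch omits the bias term that the full LS-SVM formulation carries.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "microarray" data: 30 samples, 40 genes, labels in {-1, +1}.
X = rng.normal(size=(30, 40))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=30))

def loo_errors(K, y, lam):
    """Closed-form leave-one-out residuals for bias-free LS-SVM / kernel
    ridge: with C = K + lam*I and alpha = C^{-1} y, the LOO residual for
    sample i is alpha_i / (C^{-1})_{ii}."""
    C = K + lam * np.eye(len(y))
    Cinv = np.linalg.inv(C)
    alpha = Cinv @ y
    return alpha / np.diag(Cinv)

K = X @ X.T  # linear kernel over the current gene subset
res = loo_errors(K, y, lam=1.0)
# LOO prediction for sample i is y_i minus its residual.
loo_error_rate = np.mean(np.sign(y - res) != y)
```

A wrapper such as LOOCSFS would evaluate candidate gene subsets by recomputing this LOO error on the corresponding sub-kernel, avoiding n explicit refits per evaluation.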

    Integration of Spatial and Spectral Information for Hyperspectral Image Classification

    Hyperspectral imaging has become a powerful tool in the biomedical and agricultural fields in recent years, and interest amongst researchers has increased immensely. Hyperspectral imaging combines conventional imaging and spectroscopy to acquire both spatial and spectral information from an object. Consequently, a hyperspectral image contains not only spectral information about objects, but also their spatial arrangement. Information captured in neighboring locations may provide useful supplementary knowledge for analysis. Therefore, this dissertation investigates the integration of information from both the spectral and spatial domains to enhance hyperspectral image classification performance. The major impediment to the combined spatial and spectral approach is that most spatial methods were developed for a single image band only. Based on the traditional single-image local Geary measure, this dissertation proposes a Multidimensional Local Spatial Autocorrelation (MLSA) for hyperspectral image data. Based on the proposed spatial measure, this research work develops a collaborative band selection strategy that combines both a spectral separability measure (divergence) and a spatial homogeneity measure (MLSA) for the hyperspectral band selection task. In order to calculate the divergence more efficiently, a set of recursive equations for the calculation of divergence with an additional band is derived to overcome the computational restrictions. Moreover, this dissertation proposes a collaborative classification method which integrates spectral distance and spatial autocorrelation during the decision-making process. This method therefore fully utilizes the spatial-spectral relationships inherent in the data, and thus improves classification performance. In addition, the usefulness of the proposed band selection and classification methods is evaluated with four case studies. 
The case studies include detection and identification of tumors on poultry carcasses, fecal contamination on apple surfaces, cancer on mouse skin, and crops in an agricultural field using hyperspectral imagery. Through the case studies, the performances of the proposed methods are assessed. The results clearly show the necessity and efficiency of integrating spatial information for hyperspectral image processing.
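    A single-band local Geary measure, and a naive multi-band extension in the spirit of MLSA, can be sketched as below. The exact MLSA formulation is the dissertation's; averaging the per-band measure over all spectral bands is an illustrative assumption here.

```python
import numpy as np

def local_geary(band):
    """Local Geary-style measure for one image band: at each pixel, the
    mean squared difference to its 4-connected neighbours (edge-padded).
    Low values indicate a spatially homogeneous neighbourhood."""
    h, w = band.shape
    padded = np.pad(band.astype(float), 1, mode="edge")
    diffs = []
    for dr, dc in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
        shifted = padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
        diffs.append((band - shifted) ** 2)
    return np.mean(diffs, axis=0)

def multiband_spatial_autocorrelation(cube):
    """Multi-band extension: average the local measure over all spectral
    bands of an (H, W, B) cube -- a simplified stand-in for MLSA."""
    return np.mean([local_geary(cube[..., b])
                    for b in range(cube.shape[-1])], axis=0)
```

A band selection strategy like the one described would then score each candidate band jointly by its divergence (spectral separability) and the homogeneity this spatial measure reports.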

    Stratified Pathway Analysis to Identify Gene Sets Associated with Oral Contraceptive Use and Breast Cancer


    Gene Expression Analysis Methods on Microarray Data: A Review

    In recent years, a new type of experiment has been changing the way that biologists and other specialists analyze many problems. These are called high-throughput experiments, and the main difference from those performed some years ago lies in the quantity of the data obtained. Thanks to the technology known generically as microarrays, it is now possible to study in a single experiment the behavior of all the genes of an organism under different conditions. The data generated by these experiments may consist of thousands to millions of variables, and they pose many challenges to the scientists who have to analyze them. Many of these challenges are of a statistical nature and will be the center of this review. There are many types of microarrays, developed to answer different biological questions, and some of them will be explained later. For the sake of simplicity, we start with the best-known type: expression microarrays.

    Discriminant analysis of multi-sensor data fusion based on percentile forward feature selection

    Feature extraction is a widely used approach for extracting significant features in multi-sensor data fusion. However, feature extraction suffers from some drawbacks, the biggest being its failure to identify discriminative features within multi-group data. Thus, this study proposed a new discriminant analysis of multi-sensor data fusion using feature selection based on the unbounded and bounded Mahalanobis distances, to replace the feature extraction approach in low- and intermediate-level data fusion. This study also developed percentile forward feature selection (PFFS) to identify discriminative features feasible for sensor data classification. The proposed discriminant procedure begins by computing the average distance between the groups using the unbounded and bounded distances. The selection of features then starts by ranking the fused features at the low and intermediate levels based on the computed distances, and feature subsets are selected using the PFFS. The constructed classification rules were evaluated using classification accuracy. The investigations were carried out on ten e-nose and e-tongue sensor datasets. The findings indicated that the bounded Mahalanobis distance is superior in selecting important features, requiring fewer features than the unbounded criterion. Moreover, with the bounded distance approach, feature selection using the PFFS obtained higher classification accuracy. The overall proposed procedure is fit to replace the traditional discriminant analysis of multi-sensor data fusion due to its greater discriminative power and faster convergence to higher accuracy. In conclusion, feature selection can overcome the drawbacks of feature extraction, and the proposed PFFS proved effective in selecting feature subsets of higher accuracy with faster computation. 
The study also specified the advantages of the unbounded and bounded Mahalanobis distances in feature selection for high-dimensional data, which benefit both engineers and statisticians in sensor technology.
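    The distance-then-rank idea can be illustrated as follows. This is a sketch on synthetic two-group data: the plain between-means Mahalanobis distance with a pooled covariance stands in for the study's unbounded criterion, and keeping the top percentile of univariately ranked features stands in for the full PFFS procedure.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy two-group "sensor fusion" data: 12 fused features, of which the first
# 3 separate the groups (a stand-in for the e-nose/e-tongue setting).
n = 100
X0 = rng.normal(0.0, 1.0, size=(n, 12))
X1 = rng.normal(0.0, 1.0, size=(n, 12))
X1[:, :3] += 2.0  # group 1 shifted on the informative features

def mahalanobis_between_groups(A, B):
    """Squared Mahalanobis distance between two group means using the
    pooled within-group covariance (the plain, 'unbounded' form)."""
    d = A.mean(axis=0) - B.mean(axis=0)
    S = (np.atleast_2d(np.cov(A, rowvar=False))
         + np.atleast_2d(np.cov(B, rowvar=False))) / 2.0
    return float(d @ np.linalg.solve(S, d))

# Rank features by their one-dimensional group distance, then keep the top
# percentile, in the spirit of percentile forward feature selection.
scores = np.array([mahalanobis_between_groups(X0[:, [j]], X1[:, [j]])
                   for j in range(12)])
top_quarter = np.argsort(scores)[::-1][:3]  # top 25th percentile
```

The bounded variant in the study would replace the raw distance with a bounded transform of it before ranking; the selection machinery stays the same.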

    Feature Selection in Image Databases

    Even though the problem of determining the number of features required to provide an acceptable classification performance has been a topic of interest to researchers in the pattern recognition community for a few decades, a formal method for solving this problem still does not exist. For instance, the well-known dimensionality reduction method of principal component analysis (PCA) sorts the features it generates in order of their importance, but it does not provide a mechanism for determining the number of sorted features that need to be retained for a meaningful classification. The discrete wavelet transform (DWT) is another linear transformation used for data compaction, in which the coefficients in the transform domain can be sorted in different orders depending on their importance. However, the question of determining the number of features to be retained for a good classification of the data remains unanswered. The objective of this study is to develop schemes for determining the number of features in the PCA and DWT domains that are sufficient for a classifier to provide the maximum possible classifiability of the samples in these transform domains. The energy content of the DWT and PCA coefficients of practical signals follows a specific pattern. The proposed schemes, by exploiting this property of the signals, develop criteria that are based on maintaining the energy of the ensemble of the feature vectors as their dimensionality is reduced. Within this unifying theme, this thesis investigates the problem of dimension reduction when the features are generated by the linear transformation techniques of the discrete wavelet transform and principal component analysis, and by the nonlinear technique of kernel principal component analysis. The first part of this study is concerned with developing a criterion for determining the number of coefficients when the features are represented as wavelet coefficients. 
The reduction in the dimensionality of the feature vectors is performed by letting the matrices of the wavelet coefficients of the data samples undergo Morton scanning and choosing a fixed number of coefficients from these matrices whose energy content approaches that of the original set of all the samples. In the second part of the thesis, the problem of determining a reduced dimensionality of feature vectors is investigated when the features are PCA-generated. The proposed method of finding a reduced dimensionality of feature vectors is based on evaluating a cumulative distance between all pairs of distinct clusters with a reduced set of features and examining its proximity to the distance when all the features are included. The PCA methods for data classification work well when the distinct clusters are linearly separable. For clusters that are nonlinearly separable, the kernel versions of PCA (KPCA) prove to be more efficient for generating features. The method developed in the second part of this thesis for obtaining the reduced dimensionality of the PCA-based feature vectors cannot be readily extended to the kernel space because the feature vectors are not available in explicit form in this space. Therefore, the third part of this study develops a suitable criterion for obtaining a reduced dimensionality of the feature vectors when they are generated by a kernel PCA. Extensive experiments are performed on a series of image databases to demonstrate the effectiveness of the criteria developed in this study for predicting the number of features to be retained. It is shown that there is a direct correlation between the expressions developed for the criteria and the classification accuracy as functions of the number of features retained. 
The results of the experiments show that with the use of the three feature selection techniques, a classifier can provide its maximum classifiability, that is, the classifiability attained by the uncompressed feature vectors, with only a small fraction of the original features. The robustness of the proposed methods is also investigated by applying them to noise-corrupted images.
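    An energy-retention criterion of the kind described, for the PCA case, can be sketched as below. The 95% threshold and the synthetic data are illustrative assumptions, not the thesis's exact expressions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy feature matrix whose variance is concentrated in a few directions,
# mimicking the energy pattern of practical signals that the thesis exploits.
n, d = 200, 20
basis = np.linalg.qr(rng.normal(size=(d, d)))[0]
scales = np.r_[10.0, 6.0, 3.0, np.full(d - 3, 0.1)]
X = rng.normal(size=(n, d)) * scales @ basis.T

def n_components_for_energy(X, frac=0.95):
    """Smallest number of principal components whose cumulative eigenvalue
    energy reaches `frac` of the total ensemble energy."""
    Xc = X - X.mean(axis=0)
    eigvals = np.linalg.svd(Xc, compute_uv=False) ** 2
    cumulative = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(cumulative, frac) + 1)

k = n_components_for_energy(X, frac=0.95)
```

The same retained-energy idea carries over to the DWT coefficients (after Morton scanning) and, with the kernel trick applied to the Gram matrix, to the KPCA features where explicit feature vectors are unavailable.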