
    Resampling Methods for Unsupervised Learning from Sample Data


    Generalized rule antecedent structure for TSK type of dynamic models: Structure identification and parameter estimation

    Scope and Method of Study: A novel rule antecedent structure is proposed to generalize TSK-type dynamic fuzzy models and to address the curse of dimensionality in conventional TSK fuzzy models. The proposed antecedent structure uses only nonlinear variables, which directly reduces the number of possible rules by reducing the antecedent dimension. Additionally, one more degree of freedom is introduced in the antecedent design to cover the antecedent space more efficiently, which further reduces the number of rules. The resultant GTSK model is identified in two stages. A novel recursive estimation based on spatially rearranged data is used to determine the consequent and antecedent variables. Model parameter values are obtained from the partitioned antecedent space, which is the result of solving a series of splitting and regression problems.

    Findings and Conclusions: The proposed rule antecedent structure substantially reduces the complexity of a TSK-type dynamic model. The proposed dynamic order determination and nonlinear component detection methods are tested and shown to identify model structures and to be less sensitive to noise than other methods. Instead of directly estimating model parameters, the proposed approach solves a series of splitting and regression problems to partition the antecedent space and to compute the antecedent and consequent parameters. The resultant antecedent partition is meaningful: the boundaries divide the antecedent space into regions within which a linear relation is valid. The resultant GTSK model is tested on several nonlinear dynamic processes and shown to be more interpretable and informative than other modeling methods without loss of accuracy.
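    For orientation, the following is a minimal, illustrative sketch of standard TSK inference (product firing strengths over Gaussian antecedent sets, weighted average of local linear consequents), included only to show the rule structure that the proposed GTSK antecedent generalizes; the membership functions, rule count, and parameter values are assumptions, not the identification procedure described above.

```python
# Minimal sketch of standard TSK (Takagi-Sugeno-Kang) fuzzy inference.
# Membership functions, rules, and parameters below are illustrative assumptions.
import numpy as np

def gaussian_mf(x, center, width):
    """Gaussian membership of antecedent input x in a fuzzy set."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)

def tsk_predict(x_antecedent, x_consequent, rules):
    """Each rule pairs antecedent fuzzy sets with a local linear consequent.

    rules: list of (centers, widths, theta), where the consequent is
    y_r = theta[0] + theta[1:] @ x_consequent.
    """
    firing, outputs = [], []
    for centers, widths, theta in rules:
        w = np.prod(gaussian_mf(x_antecedent, centers, widths))  # rule firing strength
        y_r = theta[0] + theta[1:] @ x_consequent                # local linear model
        firing.append(w)
        outputs.append(y_r)
    firing = np.asarray(firing)
    return np.dot(firing, outputs) / firing.sum()                # weighted average

# Example: two rules over a 2-D antecedent and a 2-D consequent input.
rules = [
    (np.array([0.0, 0.0]), np.array([1.0, 1.0]), np.array([0.1, 0.5, -0.2])),
    (np.array([1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.3, -0.4, 0.7])),
]
print(tsk_predict(np.array([0.2, 0.8]), np.array([0.2, 0.8]), rules))
```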

    Investigating Abstract Algebra Students' Representational Fluency and Example-Based Intuitions

    The quotient group concept is difficult for many students getting started in abstract algebra (Dubinsky et al., 1994; Melhuish, Lew, Hicks, and Kandasamy, 2020). The first study in this thesis explores the representational fluency of an undergraduate, a first-year graduate student, and a second-year graduate student as they work on a "collapsing structure" (quotient) task across multiple registers: Cayley tables, group presentations, Cayley digraphs to Schreier coset digraphs, and formal-symbolic mappings. The second study characterizes the (partial) make-up of two graduate learners' example-based intuitions related to orbit-stabilizer relationships induced by group actions. The (partial) make-up of a learner's intuition, treated as a quantifiable object, was defined in this thesis as a point in R^17: 12 variable values collected with a new prototype instrument, The Non-Creative versus Creative Forms of Intuition Survey (NCCFIS), 2 values for confidence in truth value, and 3 additional variables: error to non-error type, unique versus common, and network thinking. The revised Fuzzy C-Means Clustering Algorithm (FCM) by Bezdek et al. (1981) was used to classify the (partial) make-up of learners' reported intuitions into fuzzy sets based on attribute similarity.
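    As a reference point for the clustering step, the following is a minimal sketch of the standard fuzzy c-means updates (the thesis uses a revised variant); the data, number of clusters, and fuzzifier m here are assumed values, not those of the study.

```python
# Minimal sketch of standard fuzzy c-means (FCM), for illustration only.
import numpy as np

def fcm(X, c=3, m=2.0, n_iter=100, seed=0):
    """X: (n_samples, n_features). Returns membership matrix U and centers V."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.dirichlet(np.ones(c), size=n)           # fuzzy memberships, rows sum to 1
    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]    # membership-weighted cluster centers
        d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12
        # Membership update: u_ik = 1 / sum_j (d_ik / d_ij)^(2/(m-1))
        U = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1)), axis=2)
    return U, V

# Example on random 2-D points (assumed data).
U, V = fcm(np.random.default_rng(1).normal(size=(60, 2)))
print(U.shape, V.shape)
```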

    CONTRIBUTIONS IN CLASSIFICATION: VISUAL PRUNING FOR DECISION TREES, P-SPLINE BASED CLUSTERING OF CORRELATED SERIES, BOOSTED-ORIENTED PROBABILISTIC CLUSTERING OF SERIES.

    This work consists of three papers written during my Ph.D. period. The thesis consists of five chapters. In Chapter 2 the basic building blocks of our work are introduced; in particular, we briefly recall the concepts of classification (supervised and unsupervised) and of penalized splines.

    In Chapter 3 we present a paper whose idea was presented at the Cladag 2013 Symposium. Within the framework of recursive partitioning algorithms by tree-based methods, this paper contributes both a visual representation of the data partition in a geometrical space and a method for selecting the decision tree. In our visual approach, the identification of both the best tree and the weakest links can be assessed immediately from the graphical analysis of the tree structure, without considering the pruning sequence. The results in terms of error rate are very similar to those returned by the Classification And Regression Trees procedure, showing that this new way of selecting the best tree is a valid alternative to the well-known cost-complexity pruning.

    In Chapter 4 we present a paper on parsimonious clustering of correlated series. Clustering of time series has become an important topic, motivated by the increased interest in this type of data. Most existing procedures do not facilitate the removal of noise from the data, have difficulty handling time series of unequal length, and require a preprocessing step, e.g. modeling each series with an appropriate time series model. In this work we propose a new way of clustering (time) series data that can be considered as belonging to both the model-based and the feature-based approaches. Our method models each series with a penalized spline (P-spline) smoother and performs clustering directly on the spline coefficients. Using the P-spline smoothers, the signal of each series is separated from the noise, capturing the different shapes of the series. The P-spline coefficients are close to the fitted curve and represent the skeleton of the fit. Thus, summarizing each series by its coefficients reduces the dimensionality of the problem, significantly improving computation time without reducing the performance of the clustering procedure. To select the smoothing parameter we adopt a V-curve procedure; this criterion does not require the computation of the effective model dimension and is insensitive to serial correlation in the noise around the trend. Using the P-spline smoothers, moments of the original data are conserved, which implies that the mean and variance of the estimated series are equal to those of the raw series. This allows a similar approach to be used for series of different lengths. The performance is evaluated on a simulated data set, also considering series of different lengths, and an application of our proposal to financial time series is presented.

    In Chapter 5 we present a paper that proposes a fuzzy clustering algorithm that is independent of the choice of the fuzzifier. It combines two approaches, theoretically motivated for the unsupervised and the supervised classification case respectively. The first is the Probabilistic Distance (PD) clustering procedure; the second is the well-known Boosting philosophy. From the PD approach we take the idea of determining the probability of each series belonging to each of the k clusters; as this probability is unequivocally related to the distance of each series from the cluster centers, there are no degrees of freedom in determining the membership matrix. From the Boosting approach we take the idea of weighting each series according to some measure of badness of fit, in order to define an unsupervised learning process based on a weighted re-sampling procedure. Our idea is to adapt the boosting philosophy to unsupervised learning problems, especially to non-hierarchical cluster analysis. In such a case there is no target variable, but since the goal is to assign each instance (i.e. a series) of a data set to a cluster, we have a target instance: the representative instance of a given cluster (i.e. the cluster center) can be taken as the target instance, a loss function to be minimized can be taken as a synthetic index of the global performance, and the probability of each series belonging to a given cluster can be taken as the individual contribution of that instance to the overall solution. In contrast to the boosting approach, the higher the probability of a given series being a member of a given cluster, the higher the weight of that instance in the re-sampling process. As a learner we use a P-spline smoother, and to define the probability of each series belonging to a given cluster we use the PD clustering approach. This allows us to define a suitable loss function and, at the same time, to propose a fuzzy clustering procedure that does not depend on the definition of a fuzzifier parameter. The global performance of the proposed method is investigated in three experiments (one on simulated data and two on data sets known in the literature), evaluated using a fuzzy variant of the Rand Index. Chapter 6 concludes the thesis.
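    The following is a minimal sketch of the PD membership rule described above, under which the product of membership probability and distance, p_ik * d_ik, is constant across clusters for each series i; the P-spline fitting and the boosted, weighted re-sampling loop are omitted, and the coefficient vectors and centers shown are assumed inputs.

```python
# Minimal sketch of the Probabilistic Distance (PD) clustering membership rule:
# p_ik * d_ik constant over k  =>  p_ik = (1/d_ik) / sum_m (1/d_im).
# Inputs here are assumed spline-coefficient vectors and cluster centers.
import numpy as np

def pd_memberships(coefs, centers, eps=1e-12):
    """coefs: (n_series, n_coef) spline coefficients; centers: (k, n_coef)."""
    d = np.linalg.norm(coefs[:, None, :] - centers[None, :, :], axis=2) + eps
    inv = 1.0 / d
    return inv / inv.sum(axis=1, keepdims=True)   # each row sums to 1

rng = np.random.default_rng(0)
coefs = rng.normal(size=(10, 8))    # assumed P-spline coefficients for 10 series
centers = rng.normal(size=(3, 8))   # assumed centers of 3 clusters
P = pd_memberships(coefs, centers)
print(P.sum(axis=1))                # all ones: valid cluster probabilities
```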

    Data Clustering and Partial Supervision with Some Parallel Developments

    Data Clustering and Partial Supervision with Some Parallel Developments, by Sameh A. Salem. Clustering is an important and irreplaceable step in the search for structure in data. Many different clustering algorithms have been proposed, yet the sources of variability in most clustering algorithms affect the reliability of their results. Moreover, the majority rely on knowledge of the number of clusters as one of the input parameters; unfortunately, there are many scenarios where this knowledge is not available. In addition, clustering algorithms are very computationally intensive, which poses a major challenge in scaling up to large datasets. This thesis gives possible solutions to such problems. First, new measures, called clustering performance measures (CPMs), for assessing the reliability of a clustering algorithm are introduced. These CPMs can be used to evaluate: 1) clustering algorithms that have a structural bias towards certain types of data distribution as well as those that have no such biases, and 2) clustering algorithms that have initialisation dependency as well as clustering algorithms that have a unique solution for a given set of parameter values with no initialisation dependency. Then, a novel clustering algorithm, a RAdius based Clustering ALgorithm (RACAL), is proposed. RACAL uses a distance-based principle to map the distributions of the data, assuming that clusters are determined by a distance parameter, without having to specify the number of clusters. Furthermore, RACAL is enhanced by a validity index to choose the best clustering result, i.e. the result with compact clusters and wide cluster separations, for a given input parameter. Comparisons with other clustering algorithms indicate the applicability and reliability of the proposed clustering algorithm. Additionally, an adaptive partial supervision strategy is proposed for use in conjunction with RACAL to make it act as a classifier. Results from RACAL with partial supervision, RACAL-PS, indicate its robustness in classification. Additionally, a parallel version of RACAL (P-RACAL) is proposed. The parallel evaluations of P-RACAL indicate that it is scalable in terms of speedup and scaleup, which gives it the ability to handle large, high-dimensional datasets in a reasonable time. Next, a novel clustering algorithm, which achieves clustering without any control of cluster sizes, is introduced. This algorithm, called the Nearest Neighbour Clustering Algorithm (NNCA), uses the same concept as the K-Nearest Neighbour (KNN) classifier, with the advantage that it needs no training set and is completely unsupervised. Additionally, NNCA is augmented with a partial supervision strategy, NNCA-PS, to act as a classifier. Comparisons with other methods indicate the robustness of the proposed method in classification. Additionally, experiments in a parallel environment indicate the suitability and scalability of the parallel NNCA, P-NNCA, in handling large datasets. Further investigations on more challenging data are carried out. In this context, microarray data is considered: in such data the number of clusters is not clearly defined, which points directly towards clustering algorithms that do not require knowledge of the number of clusters. Therefore, the efficacy of one of these algorithms is examined.
Finally, a novel integrated clustering performance measure (ICPM) is proposed to be used as a guideline for choosing the proper clustering algorithm, i.e. the one able to extract useful biological information from a particular dataset.
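    To illustrate the distance-parameter principle attributed to RACAL (clusters determined by a distance threshold rather than a prespecified number of clusters), the following is a generic single-pass, radius-based clustering sketch; it is not the RACAL algorithm itself, and the radius value and centre-update rule are assumptions.

```python
# Generic radius-based clustering sketch (illustrative only; not RACAL).
import numpy as np

def radius_clustering(X, radius):
    """Assign each point to the nearest existing cluster centre within `radius`,
    otherwise start a new cluster; no number of clusters is specified."""
    centres, counts, labels = [], [], []
    for x in X:
        if centres:
            d = np.linalg.norm(np.asarray(centres) - x, axis=1)
            k = int(np.argmin(d))
            if d[k] <= radius:
                counts[k] += 1
                centres[k] = centres[k] + (x - centres[k]) / counts[k]  # running mean
                labels.append(k)
                continue
        centres.append(x.astype(float))   # start a new cluster at this point
        counts.append(1)
        labels.append(len(centres) - 1)
    return np.array(labels), np.array(centres)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
labels, centres = radius_clustering(X, radius=1.0)
print(len(centres), "clusters found")
```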