
    A possibilistic approach to latent structure analysis for symmetric fuzzy data

    In many situations the amount of available data is huge and can be intractable. When the data set is single valued, latent structure models are well-established techniques that provide a useful compression of the information. In this paper, an extension of latent structure analysis to deal with fuzzy data is proposed; this is done by considering a regression model between observed and unobserved (latent) fuzzy variables. Our extension follows the possibilistic approach, widely used in both the clustering and regression frameworks. Here, the possibilistic approach leads to formulating latent structure analysis for fuzzy data as an optimization problem: specifically, a non-linear programming problem in which the fuzziness of the model is minimized. To show how the model works, the results of two applications are given. Keywords: latent structure analysis, symmetric fuzzy data set, possibilistic approach.
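
    The abstract does not spell out the optimization, but the possibilistic regression framework it invokes is usually traced back to Tanaka-style fuzzy linear regression, where the total spread (fuzziness) of the fitted fuzzy coefficients is minimized subject to the model covering the observations. The sketch below illustrates that spread-minimizing formulation as a linear program; the variable names, the h-level, and the use of crisp outputs are illustrative assumptions, not the paper's latent-structure model.

```python
import numpy as np
from scipy.optimize import linprog

def possibilistic_regression(X, y, h=0.5):
    """Tanaka-style possibilistic linear regression (sketch).

    Coefficients are symmetric triangular fuzzy numbers (center a_j, spread c_j >= 0).
    Objective: minimize the total fuzziness sum_i sum_j c_j * |x_ij|, subject to each
    crisp observation y_i lying inside the h-level of the predicted fuzzy output.
    """
    n, p = X.shape
    Xa = np.hstack([np.ones((n, 1)), X])              # add intercept column
    absX = np.abs(Xa)
    # decision vector z = [a_0..a_p, c_0..c_p]
    cost = np.concatenate([np.zeros(p + 1), absX.sum(axis=0)])
    # coverage constraints written as A_ub @ z <= b_ub:
    #   -Xa a - (1-h) |Xa| c <= -y   (upper bound of the fuzzy output >= y_i)
    #    Xa a - (1-h) |Xa| c <=  y   (lower bound of the fuzzy output <= y_i)
    A_ub = np.vstack([np.hstack([-Xa, -(1 - h) * absX]),
                      np.hstack([ Xa, -(1 - h) * absX])])
    b_ub = np.concatenate([-y, y])
    bounds = [(None, None)] * (p + 1) + [(0, None)] * (p + 1)
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    centers, spreads = res.x[:p + 1], res.x[p + 1:]
    return centers, spreads
```

    With latent scores in place of the raw predictors, the same spread-minimizing program would give a possibilistic fit on the compressed representation.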

    Possibilistic and fuzzy clustering methods for robust analysis of non-precise data

    This work focuses on robust clustering of data affected by imprecision, where the imprecision is managed in terms of fuzzy sets. The clustering process is based on the fuzzy and possibilistic approaches. In both approaches the observations are assigned to the clusters by means of membership degrees. In fuzzy clustering the membership degrees express the degrees of sharing of the observations among the clusters; in possibilistic clustering, they are degrees of typicality. These two sources of information are complementary: the former helps to discover the best fuzzy partition of the observations, while the latter reflects how well the observations are described by the centroids and is therefore helpful for identifying outliers. First, a fully possibilistic k-means clustering procedure is suggested. Then, in order to exploit the benefits of both approaches, a joint possibilistic and fuzzy clustering method for fuzzy data is proposed. A selection procedure for choosing the parameters of the new clustering method is introduced. The effectiveness of the proposal is investigated by means of simulated and real-life data.
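
    As background for the two kinds of membership degrees the abstract contrasts, the sketch below implements standard fuzzy c-means for crisp data and then derives Krishnapuram-Keller possibilistic typicality degrees from its output; the paper's joint method for fuzzy (non-precise) data is more elaborate, and the fuzzifier m, the bandwidth choice, and the crisp-data assumption are illustrative.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means: membership degrees (rows sum to 1) and centroids."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        V = (W.T @ X) / W.sum(axis=0)[:, None]                    # weighted centroids
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
        U_new = 1.0 / ((d2[:, :, None] / d2[:, None, :]) ** (1.0 / (m - 1))).sum(axis=2)
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V

def typicality_degrees(X, V, U, m=2.0):
    """Krishnapuram-Keller typicalities: low values in every cluster flag outliers."""
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
    W = U ** m
    eta = (W * d2).sum(axis=0) / W.sum(axis=0)                    # per-cluster bandwidths
    return 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1)))
```

    Points whose maximum typicality across clusters is small are candidate outliers, while the fuzzy memberships still define the partition.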

    On Fuzzy Regression Adapting Partial Least Squares

    Partial Least Squares (PLS) regression is a model linking a dependent variable y to a set X of (numerical or categorical) explanatory variables. It can be obtained as a series of simple and multiple regressions. PLS is an alternative to the classical regression model when there are many explanatory variables or the variables are correlated. On the other hand, an alternative approach to modelling data is Fuzzy Linear Regression (FLR), one of the modelling techniques based on fuzzy set theory, which has been applied in diverse areas such as engineering, biology and finance. Development of FLR has followed two main paths. The first improves parameter estimation methods, enabling more reliable and more accurate parameter estimates in the fuzzy setting. The second applies these methods to data that usually do not satisfy strict assumptions. The application side of FLR has not been examined widely, except for the outlier case; for example, how FLR behaves in the multivariate case has received little attention. To overcome such a problem in the classical setting, PLS is one practically useful method. In this paper, FLR with several explanatory variables is examined from an application point of view by adapting PLS.
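
    A minimal sketch of the adaptation the abstract describes, under the assumption that the PLS step is used only to produce a small number of mutually uncorrelated score components, on which a fuzzy regression such as the spread-minimizing program sketched earlier can then be fitted; the number of components and the synthetic data are illustrative.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 8))
X[:, 4:] = X[:, :4] + 0.05 * rng.normal(size=(100, 4))   # strongly correlated block
y = X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

# Step 1: compress the correlated predictors into a few orthogonal PLS scores.
pls = PLSRegression(n_components=2).fit(X, y)
T = pls.transform(X)                        # score matrix; columns are uncorrelated

# Step 2: fit the fuzzy regression on the scores instead of the raw predictors,
# e.g. the possibilistic (spread-minimizing) linear program sketched above:
# centers, spreads = possibilistic_regression(T, y, h=0.5)
```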

    Image annotation and retrieval based on multi-modal feature clustering and similarity propagation

    The performance of content-based image retrieval systems has proved to be inherently constrained by the low-level features used, and such systems cannot give satisfactory results when the user's high-level concepts cannot be expressed by low-level features. In an attempt to bridge this semantic gap, recent approaches have started integrating both low-level visual features and high-level textual keywords. Unfortunately, manual image annotation is a tedious process and may not be possible for large image databases. In this thesis we propose a system for image retrieval that has three main components. The first component consists of a novel possibilistic clustering and feature weighting algorithm based on robust modeling of the Generalized Dirichlet (GD) finite mixture. Robust estimation of the mixture model parameters is achieved by incorporating two complementary types of membership degrees. The first one is a posterior probability that indicates the degree to which a point fits the estimated distribution. The second represents the degree of typicality and is used to identify and discard noise points. Robustness to noisy and irrelevant features is achieved by transforming the data so that the features are independent and follow a Beta distribution, and by learning an optimal relevance weight for each feature subset within each cluster. We extend our algorithm to find the optimal number of clusters in an unsupervised and efficient way by exploiting some properties of the possibilistic membership function. We also outline a semi-supervised version of the proposed algorithm. The second component of our system consists of a novel approach to unsupervised image annotation based on: (i) the proposed semi-supervised possibilistic clustering; (ii) a greedy selection and joining algorithm (GSJ); (iii) Bayes' rule; and (iv) a probabilistic model based on possibilistic membership degrees to annotate an image. The third component consists of an image retrieval framework based on multi-modal similarity propagation. The framework is designed to deal with two data modalities: low-level visual features and high-level textual keywords generated by our image annotation algorithm. The multi-modal similarity propagation system exploits the mutual reinforcement of relational data and results in a nonlinear combination of the different modalities. Specifically, it is used to learn the semantic similarities between images by leveraging the relationships between features from the different modalities. The proposed image annotation and retrieval approaches are implemented and tested on a standard benchmark dataset. We show the effectiveness of our clustering algorithm in handling high-dimensional and noisy data, and we compare our image annotation approach to three state-of-the-art methods, demonstrating the effectiveness of the proposed image retrieval system.
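
    To illustrate the idea of combining a probabilistic posterior with a possibilistic typicality degree, the sketch below uses an ordinary Gaussian mixture in place of the Generalized Dirichlet mixture used in the thesis; the rescaling of the per-component density into a typicality and the noise threshold are illustrative choices, not the thesis's estimator.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def dual_memberships(X, n_components=5, seed=0):
    """Posterior probabilities (degrees of sharing) and typicality-like degrees."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)
    posterior = gmm.predict_proba(X)                       # rows sum to 1
    dens = np.column_stack([
        multivariate_normal(gmm.means_[k], gmm.covariances_[k]).pdf(X)
        for k in range(n_components)
    ])
    typicality = dens / dens.max(axis=0)                   # ~1 near each component's mode
    return posterior, typicality

# Points that are atypical for every component can be treated as noise
# before re-estimating the mixture.
X = np.random.default_rng(1).normal(size=(500, 3))
posterior, typicality = dual_memberships(X, n_components=3)
noise = typicality.max(axis=1) < 0.05
```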

    Metabolic flux understanding of Pichia pastoris grown on heterogenous culture media

    Within the emergent field of Systems Biology, mathematical models obtained from physical and chemical laws (the so-called first-principles models) of microbial systems are employed to discern the principles that govern cellular behaviour and to achieve a predictive understanding of cellular functions. Reliance on this biochemical knowledge has the drawback that some of the assumptions (specific kinetics of the reaction system, unknown dynamics and values of the model parameters) may not be valid for all the possible metabolic states of the network. In this context of uncertainty, the combined use of fundamental knowledge and data measured in the fermentation, which describe the behaviour of the microorganism in the manufacturing process, is paramount to overcoming this problem. In this paper, a grey modelling approach combining data-driven and first-principles information at different scales is presented, developed for Pichia pastoris cultures grown on different carbon sources. This approach allows us to relate patterns of recombinant protein production to intracellular metabolic states and to correlate intra- and extracellular reactions in order to understand how the internal state of the cells determines the observed behaviour in P. pastoris cultivations. Research in this study was partially supported by the Spanish Ministry of Science and Innovation and FEDER funds from the European Union through grants DPI2011-28112-C04-01 and DPI2011-28112-C04-02. The authors are also grateful to Biopolis SL for supporting this research. We also gratefully acknowledge Associate Professor Jose Camacho for providing the Exploratory Data Analysis Toolbox. González Martínez, JM.; Folch-Fortuny, A.; Llaneras Estrada, F.; Tortajada Serra, M.; Picó Marco, JA.; Ferrer, A. (2014). Metabolic flux understanding of Pichia pastoris grown on heterogenous culture media. Chemometrics and Intelligent Laboratory Systems. 134:89-99. https://doi.org/10.1016/j.chemolab.2014.02.003
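
    The grey model ties measured extracellular rates to intracellular fluxes through the stoichiometry of the metabolic network. As background, the sketch below shows the classical steady-state flux estimation step that such approaches build on: given a stoichiometric matrix and a few measured exchange fluxes, the remaining fluxes are recovered by least squares. The toy matrix, the measured indices, and the steady-state assumption are illustrative; the paper's multi-scale, data-driven extension is not reproduced here.

```python
import numpy as np

def estimate_fluxes(S, measured_idx, v_meas):
    """Steady-state metabolic flux analysis (sketch): solve S @ v = 0 for the
    unmeasured fluxes given measured exchange fluxes, in the least-squares sense."""
    n_flux = S.shape[1]
    unmeasured_idx = [j for j in range(n_flux) if j not in measured_idx]
    S_m, S_u = S[:, measured_idx], S[:, unmeasured_idx]
    v_u, *_ = np.linalg.lstsq(S_u, -S_m @ np.asarray(v_meas), rcond=None)
    v = np.zeros(n_flux)
    v[measured_idx] = v_meas
    v[unmeasured_idx] = v_u
    return v

# Toy network: A -> B -> C, with uptake of A and secretion of C measured.
S = np.array([[ 1, -1,  0,  0],     # metabolite A
              [ 0,  1, -1,  0],     # metabolite B
              [ 0,  0,  1, -1]])    # metabolite C
v = estimate_fluxes(S, measured_idx=[0, 3], v_meas=[1.0, 1.0])
```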

    Improving Monitoring and Diagnosis for Process Control using Independent Component Analysis

    Statistical Process Control (SPC) is the general field concerned with monitoring the operation and performance of systems. SPC consists of a collection of techniques for characterizing the operation of a system using a probability distribution consistent with the system's inputs and outputs. Classical SPC monitors a single variable to characterize the operation of a single machine tool or process step using tools such as Shewhart charts. The traditional approach works well for simple, small to medium-sized processes. For more complex processes, a number of multivariate SPC techniques have been developed in recent decades. These advanced methods suffer from several disadvantages compared to univariate techniques: they tend to be statistically less powerful, and they tend to complicate process diagnosis when a disturbance is detected. This research introduces a general method for simplifying multivariate process monitoring in such a manner as to allow the use of traditional SPC tools while facilitating process diagnosis. Latent variable representations of complex processes are developed that directly relate disturbances to process steps or segments. The method models disturbances in the process rather than the process itself. The basic tool used is Independent Component Analysis (ICA). The methodology is illustrated on the problem of monitoring Electrical Test (E-Test) data from a semiconductor manufacturing process. Development and production data from a working semiconductor plant are used to estimate a factor model that is then used to develop univariate control charts for particular types of process disturbances. Detection and false alarm rates for data with known disturbances are given; the charts correctly detect and classify all the disturbance cases with a very low false alarm rate. A secondary contribution is the introduction of a method for performing an ICA-like analysis using possibilistic data instead of probabilistic data. This technique extends the general ICA framework to a broader range of uncertainty types. Further development of this technique could lead to the capability to use extremely sparse data to estimate ICA process models.
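
    A minimal sketch of the general scheme the abstract describes: independent components are estimated from in-control data and each component score is then monitored with an ordinary univariate Shewhart-style chart. The number of components, the 3-sigma limits, and the in-control/new data split are illustrative assumptions; the thesis's disturbance-oriented factor model is more specific.

```python
import numpy as np
from sklearn.decomposition import FastICA

def fit_ica_charts(X_incontrol, n_components=3, k=3.0, seed=0):
    """Estimate independent components on in-control data and derive
    Shewhart-style limits (mean +/- k*sigma) for each component score."""
    ica = FastICA(n_components=n_components, random_state=seed)
    scores = ica.fit_transform(X_incontrol)
    mu, sigma = scores.mean(axis=0), scores.std(axis=0)
    return ica, mu - k * sigma, mu + k * sigma

def out_of_control(ica, lcl, ucl, X_new):
    """Boolean matrix flagging which component chart each new sample violates."""
    scores = ica.transform(X_new)
    return (scores < lcl) | (scores > ucl)
```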

    Possibilistic classifiers for numerical data

    Naive Bayesian classifiers, which rely on independence hypotheses together with a normality assumption to estimate densities for numerical data, are known for their simplicity and their effectiveness. However, estimating densities, even under the normality assumption, may be problematic in the case of poor data. In such a situation, possibility distributions may provide a more faithful representation of these data. Naive Possibilistic Classifiers (NPC), based on possibility theory, have recently been proposed as a counterpart of Bayesian classifiers for classification tasks. There are only a few works that treat possibilistic classification, and most existing NPCs deal only with categorical attributes. This work focuses on the estimation of possibility distributions for continuous data. In this paper we investigate two kinds of possibilistic classifiers. The first is derived from classical or flexible Bayesian classifiers by applying a probability-possibility transformation to Gaussian distributions, which introduces some further tolerance in the description of classes. The second is based on a direct interpretation of data in possibilistic formats that exploits an idea of proximity between data values in different ways, which provides a less constrained representation of them. We show that possibilistic classifiers have a better capability than Bayesian classifiers to detect new instances for which the classification is ambiguous, where probabilities may be poorly estimated and illusorily precise. Moreover, we propose a hybrid possibilistic classification approach based on a nearest-neighbour heuristic to improve the accuracy of the proposed possibilistic classifiers when the available information is insufficient to choose between classes. Possibilistic classifiers are compared with classical or flexible Bayesian classifiers on a collection of benchmark databases. The reported experiments show the interest of possibilistic classifiers. In particular, flexible possibilistic classifiers perform well for data agreeing with the normality assumption, while proximity-based possibilistic classifiers outperform the others in the remaining cases. The hybrid possibilistic classification exhibits a good ability to improve accuracy.
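
    A minimal sketch of the first kind of classifier described above, under a simple (not necessarily the paper's) probability-possibility transformation: each class-conditional Gaussian density is rescaled by its mode so that it becomes a possibility distribution, attribute possibilities are combined with the minimum, and a prediction is flagged as ambiguous when the two best classes are close. The ambiguity gap and the variance smoothing are illustrative choices.

```python
import numpy as np

def fit_naive_possibilistic(X, y):
    """Per class and attribute, fit a Gaussian and keep (mean, std)."""
    return {c: (X[y == c].mean(axis=0), X[y == c].std(axis=0) + 1e-9)
            for c in np.unique(y)}

def attribute_possibility(x, mean, std):
    # Gaussian density rescaled by its mode, i.e. exp(-(x - mu)^2 / (2 sigma^2)):
    # a simple probability-possibility transformation (value 1 at the class centre).
    return np.exp(-0.5 * ((x - mean) / std) ** 2)

def classify(model, x, ambiguity_gap=0.1):
    """Class with highest possibility; 'ambiguous' if the runner-up is close."""
    scores = {c: attribute_possibility(x, m, s).min()   # min acts as a fuzzy AND
              for c, (m, s) in model.items()}
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    label, best = ranked[0]
    ambiguous = len(ranked) > 1 and best - ranked[1][1] < ambiguity_gap
    return label, ambiguous
```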

    EGMM: an Evidential Version of the Gaussian Mixture Model for Clustering

    The Gaussian mixture model (GMM) provides a convenient yet principled framework for clustering, with properties suitable for statistical inference. In this paper, we propose a new model-based clustering algorithm, called EGMM (evidential GMM), in the theoretical framework of belief functions, to better characterize cluster-membership uncertainty. With a mass function representing the cluster membership of each object, an evidential Gaussian mixture distribution whose components are defined over the powerset of the desired clusters is proposed to model the entire dataset. The parameters in EGMM are estimated by a specially designed Expectation-Maximization (EM) algorithm. A validity index allowing automatic determination of the proper number of clusters is also provided. The proposed EGMM is as convenient as the classical GMM, but can generate a more informative evidential partition for the considered dataset. Experiments with synthetic and real datasets demonstrate the good performance of the proposed method compared with some other prototype-based and model-based clustering techniques.
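
    For reference, the sketch below fits the classical GMM that EGMM generalizes and inspects the probabilistic partition it produces; in EGMM these singleton-cluster responsibilities are replaced by mass functions over subsets of clusters, which is not reproduced here. The synthetic data and the number of components are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in (-2.0, 0.0, 2.0)])

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
resp = gmm.predict_proba(X)     # probabilistic partition: one probability per singleton cluster
hard = resp.argmax(axis=1)      # hard assignment, discarding the membership uncertainty
bic = gmm.bic(X)                # classical criterion for choosing the number of clusters;
                                # EGMM instead relies on its own evidential validity index
```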

    Robustness and Outliers

    Unexpected deviations from assumed models, as well as the presence of certain amounts of outlying data, are common in most practical statistical applications. This fact can lead to undesirable solutions when non-robust statistical techniques are applied, and cluster analysis is no exception. The search for homogeneous groups with large heterogeneity between them can be spoiled by the lack of robustness of standard clustering methods. For instance, the presence of even a few outlying observations may result in heterogeneous clusters being artificially joined together, or in the detection of spurious clusters merely made up of outlying observations. In this chapter we analyze the effects of different kinds of outlying data in cluster analysis and explore several alternative methodologies designed to avoid or minimize their undesirable effects. Funding: Ministerio de Economía, Industria y Competitividad (MTM2014-56235-C2-1-P); Junta de Castilla y León (research project support programme, Ref. VA212U13).
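
    One family of alternatives surveyed in this line of work is trimming-based clustering, in which a fixed fraction of the worst-fitting observations is left out of every centroid update so that outliers cannot drag the clusters. The sketch below is a simplified trimmed k-means (no restarts, no scatter constraints, fixed number of iterations); the trimming level alpha is an illustrative choice.

```python
import numpy as np

def trimmed_kmeans(X, k, alpha=0.1, n_iter=50, seed=0):
    """Simplified trimmed k-means: the ceil(alpha * n) points farthest from their
    nearest centroid are excluded from each centroid update and reported as outliers."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    n_keep = int(np.ceil((1 - alpha) * n))
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        nearest = d2[np.arange(n), labels]
        kept = np.argsort(nearest)[:n_keep]          # drop the alpha fraction that fits worst
        for j in range(k):
            members = kept[labels[kept] == j]
            if members.size:
                centers[j] = X[members].mean(axis=0)
    outliers = np.setdiff1d(np.arange(n), kept)
    return centers, labels, outliers
```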

    Recent advances in directional statistics

    Mainstream statistical methodology is generally applicable to data observed in Euclidean space. There are, however, numerous contexts of considerable scientific interest in which the natural supports for the data under consideration are Riemannian manifolds like the unit circle, torus, sphere and their extensions. Typically, such data can be represented using one or more directions, and directional statistics is the branch of statistics that deals with their analysis. In this paper we provide a review of the many recent developments in the field since the publication of Mardia and Jupp (1999), still the most comprehensive text on directional statistics. Many of those developments have been stimulated by interesting applications in fields as diverse as astronomy, medicine, genetics, neurology, aeronautics, acoustics, image analysis, text mining, environmetrics, and machine learning. We begin by considering developments for the exploratory analysis of directional data before progressing to distributional models, general approaches to inference, hypothesis testing, regression, nonparametric curve estimation, methods for dimension reduction, classification and clustering, and the modelling of time series, spatial and spatio-temporal data. An overview of currently available software for analysing directional data is also provided, and potential future developments are discussed.
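
    A small example of why directional data need their own summaries: angles just below 2π and just above 0 are close on the circle, so the arithmetic mean is misleading, whereas the standard mean direction and mean resultant length behave sensibly. The sample angles are illustrative.

```python
import numpy as np

def circular_summary(theta):
    """Mean direction (radians) and mean resultant length for circular data."""
    C, S = np.cos(theta).sum(), np.sin(theta).sum()
    mean_direction = np.arctan2(S, C) % (2 * np.pi)
    resultant_length = np.hypot(C, S) / len(theta)   # 1 = concentrated, 0 = dispersed
    return mean_direction, resultant_length

angles = np.array([0.10, 6.20, 0.05, 6.15, 0.20])    # all near 0 (i.e. near 2*pi)
print(np.mean(angles))            # ~2.54, far from every observation
print(circular_summary(angles))   # mean direction ~0.03 rad, resultant length ~0.99
```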