47 research outputs found

    An empirical comparison between grade of membership and principal component analysis

    WOS:000321021700005 (Web of Science accession number). It is the purpose of this paper to contribute to the discussion initiated by Wachter about the parallelism between principal component (PC) analysis and a typological grade of membership (GoM) analysis. That author tested empirically the close relationship between both analyses in a low-dimensional framework comprising up to nine dichotomous variables and two typologies. Our contribution to the subject is also empirical. It relies on a dataset from a survey which was specially designed to study the reward of skills in the banking sector in Portugal. The statistical data comprise thirty polytomous variables and were decomposed into four typologies using an optimality criterion. The empirical evidence shows a high correlation between the first PC scores and individual GoM scores. No correlation with the remaining PCs was found, however. In addition, the first PC also proved effective in ranking individuals by skill, following the particularity of the data distribution unveiled in the GoM analysis.

    Index of satisfaction with public transport: a fuzzy clustering approach

    Increasing public transport use is recognized by many countries as crucial to the pursuit of a global strategy for environmental sustainability and improving urban mobility. Understanding what users value in a public transport service is essential to carry out this strategy. Using fuzzy clustering, we developed an index that measures individual user satisfaction with the public transport service in the metropolitan area of Lisbon and subsequently identified the possible determinants of satisfaction by means of a regression tree model. The results achieved unveil a hierarchical partition of the data, highlighting the diversified level of satisfaction among public transport users that is reflected in the distribution of the index. The managerial implications of the findings for the public transport service are addressed.

    A Modified Fuzzy k-Partition Based on Indiscernibility Relation for Categorical Data Clustering

    Categorical data clustering has been adopted by many scientific communities to classify objects from large databases. In order to classify the objects, the Fuzzy k-Partition approach has been proposed for categorical data clustering. However, existing Fuzzy k-Partition approaches suffer from high computational time and low clustering accuracy. Moreover, the parameters maximizing the classification likelihood function in the Fuzzy k-Partition approach will always have the same categories, hence producing the same results. To overcome these issues, we propose a modified Fuzzy k-Partition based on the indiscernibility relation. The indiscernibility relation induces an approximation space which is constructed from equivalence classes of indiscernible objects, and thus it can be applied to classify categorical data. The novelty of the proposed approach is that, unlike previous approaches that use the likelihood function of multivariate multinomial distributions, it is based on the indiscernibility relation. We performed an extensive theoretical analysis of the proposed approach to show its effectiveness in achieving lower computational complexity. Further, we compared the proposed approach with the Fuzzy Centroid and Fuzzy k-Partition approaches in terms of response time and clustering accuracy on several UCI benchmark and real-world datasets. The results show that the proposed approach achieves lower response time and higher clustering accuracy than other Fuzzy k-based approaches.
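
    The equivalence classes mentioned in this abstract can be illustrated with a minimal sketch. It shows only the standard rough-set notion of indiscernibility (two objects are indiscernible when they agree on every chosen attribute), not the modified Fuzzy k-Partition algorithm itself. The function name, the attribute names and the toy table are assumptions made for illustration.

```python
from collections import defaultdict


def indiscernibility_classes(objects, attributes):
    """Group object ids that agree on every attribute in `attributes`.

    `objects` maps an object id to a dict of categorical attribute values.
    Each returned list is one equivalence class of the indiscernibility relation.
    """
    classes = defaultdict(list)
    for obj_id, values in objects.items():
        # Objects with the same value tuple cannot be told apart.
        key = tuple(values[a] for a in attributes)
        classes[key].append(obj_id)
    return [sorted(ids) for ids in classes.values()]


# Hypothetical toy table of categorical data (not from the paper).
toy = {
    "o1": {"colour": "red", "shape": "round"},
    "o2": {"colour": "red", "shape": "round"},
    "o3": {"colour": "blue", "shape": "round"},
    "o4": {"colour": "blue", "shape": "square"},
}

by_both = indiscernibility_classes(toy, ["colour", "shape"])
by_shape = indiscernibility_classes(toy, ["shape"])
print(by_both)
print(by_shape)
```

    Using fewer attributes gives coarser classes: with both attributes only o1 and o2 are grouped together, while with shape alone o3 joins them.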

    Latent class analysis by regularized spectral clustering

    The latent class model is a powerful tool for identifying latent classes within populations that share common characteristics for categorical data in social, psychological, and behavioral sciences. In this article, we propose two new algorithms to estimate a latent class model for categorical data. Our algorithms are developed by using a newly defined regularized Laplacian matrix calculated from the response matrix. We provide theoretical convergence rates of our algorithms by considering a sparsity parameter and show that our algorithms stably yield consistent latent class analysis under mild conditions. Additionally, we propose a metric to capture the strength of latent class analysis and several procedures designed based on this metric to infer how many latent classes one should use for real-world categorical data. The efficiency and accuracy of our algorithms are verified by extensive simulated experiments, and we further apply our algorithms to real-world categorical data with promising results. Comment: 22 pages, 7 figures, 2 tables

    Sub-Cluster Dynamics of an Organizational Population Ecological Analysis of Wine making in Tokaj-Hegyalja 1989-2014

    This thesis addresses two aspects of contrast dependence theory. First, it aimed to test whether the theory applies to similarity clusters, in other words whether it already takes effect in the early stages of the legitimation process. Second, the theory was tested at the sub-category level with multiple overlapping clusters that dynamically changed in terms of their defining features. The empirical setting was the wine producer population of Tokaj-Hegyalja, a traditional wine region in Hungary, which went through a major transition in terms of winemaking technology, cultivation method and products between 1989 and 2014. This work argues that the groups of wineries that took different paths in terms of these features were perceived by the audience as fuzzy sub-clusters within the main population. Thus, their yearly vital rates were (also) determined by their contrast level, even though these similarity clusters never became legitimate sub-categories. Besides that, the introduction of novel methods and innovations was perceived as an expansion of the relevant feature set, so the clustering system of the audience was dynamic. In terms of methodology, the research differed significantly from existing studies. Instead of gathering membership data directly from the audience, similarity sub-clusters were modelled using the retrospectively collected relevant features of the main population. As the relevant feature set changed during the studied period, this approach allowed the modelling of a dynamic space of fuzzy similarity clusters at the sub-population level. The steps in the analysis were as follows. First, the main population was defined as a crisp set of wineries. Second, the yearly sets of relevant features were modelled, based on past publications of wine experts. Third, the feature vectors of the wineries were coded according to the collected feature value data.
Fourth, fuzzy cluster analysis was conducted for each year, which determined the number of similarity clusters, their centres, their contrast levels and the grades of membership of organizations. Finally, a statistically significant correlation was found between the entry rates and the contrasts of the sub-clusters. In addition, the analysis showed that the initial membership of new entrants correlated with the cluster contrasts.
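
    The abstract does not say which fuzzy clustering algorithm was used in the fourth step. As a rough illustration of how cluster centres and grades of membership can be obtained, the sketch below assumes standard fuzzy c-means with fuzzifier m = 2. The function names, the starting centres and the two-dimensional toy points are hypothetical and do not come from the thesis.

```python
import math


def membership_grades(points, centres, m=2.0, eps=1e-12):
    """Fuzzy c-means grades: one row per point, one column per cluster."""
    exponent = 2.0 / (m - 1.0)
    rows = []
    for p in points:
        # eps avoids division by zero when a point coincides with a centre.
        dists = [max(math.dist(p, c), eps) for c in centres]
        rows.append(
            [1.0 / sum((d_i / d_j) ** exponent for d_j in dists) for d_i in dists]
        )
    return rows


def fuzzy_c_means(points, centres, m=2.0, iterations=50):
    """Alternate membership and centre updates for a fixed number of iterations."""
    centres = [list(c) for c in centres]
    dims = len(points[0])
    for _ in range(iterations):
        grades = membership_grades(points, centres, m)
        for i in range(len(centres)):
            # Each centre is the mean of all points weighted by grade ** m.
            weights = [row[i] ** m for row in grades]
            total = sum(weights)
            centres[i] = [
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)
            ]
    # Recompute grades so they match the final centres.
    return centres, membership_grades(points, centres, m)


# Hypothetical feature vectors and starting centres (illustration only).
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (2.5, 3.0)]
centres, grades = fuzzy_c_means(points, [(0.0, 0.0), (5.0, 5.0)])
for point, row in zip(points, grades):
    print(point, [round(g, 3) for g in row])
```

    Each row of grades sums to one, so a point lying between two clusters receives partial membership in both rather than a crisp assignment.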

    Relatedness, National Boarders, Perceptions of Firms and the Value of Their innovations

    The main goal of this dissertation is to better understand how external corporate stakeholder perceptions of relatedness affect important outcomes for companies. In pursuit of this goal, I apply the lens of category studies. Categories not only help audiences to distinguish between members of different categories, they also convey patterns of relatedness. In turn, this may have implications for understanding how audiences search, what they attend to, and how the members are ultimately valued. In the first chapter, I apply insights from social psychology to show how the nationality of audience members affects the way that they cognitively group objects into similar categories. I find that the geographic location of stock market analysts affects the degree to which they will revise their earnings estimates for a given company in the wake of an earnings miss by another firm in the same industry. Foreign analysts revise their earnings estimates downward more than local analysts do, suggesting that foreign analysts ascribe the earnings miss more broadly and tend to lump companies located in the same country into larger groups than local analysts do. In the second chapter, I demonstrate that the structure of inter-category relationships can have consequential effects for the members of a focal category. Leveraging an experimental-like design, I study the outcomes of nanotechnology patents and the pattern of forward citations across multiple patent jurisdictions. I find that members of technology categories with many close category 'neighbors' are more broadly cited than members of categories with few category 'neighbors'. My findings highlight how category embeddedness and category system structure affect the outcomes of category members as well as the role that classification plays in the valuation of innovation.
In the third chapter, I propose a novel and dynamic measure of corporate similarity that is constructed from the two-mode analyst and company coverage network. The approach creates a fine-grained continuous measure of company similarity that can be used as an alternative or supplement to existing static industry classification systems. I demonstrate the value of this new measure in the context of predicting financial market responses to merger and acquisition deals.

    Automated Detection of Electric Energy Consumption Load Profile Patterns

    Load profiles of energy consumption from smart meters are becoming more and more available, and the amount of data to analyse is huge. In order to automate this analysis, the application of state-of-the-art data mining techniques for time series analysis is reviewed. In particular, the use of dynamic clustering techniques to obtain and visualise temporal patterns characterising the users of electrical energy is studied in depth. The review can be used as a guide for those interested in the automatic analysis of load profile databases and the detection of behaviour groups within them. Additionally, a selection of dynamic clustering algorithms has been implemented and their performances compared using an available electric energy consumption load profile database. The results allow experts to easily evaluate how users consume energy, to assess trends and to predict future scenarios. The data analysed were provided by the Spanish distributor Iberdrola Electrical Distribution S.A. as part of the research project GAD (Active Management of the Demand), a national project by DEVISE 2010 funded by the INGENIIO 2010 program and the CDTI (Centre for Industrial Technology Development), a Business Public Entity dependent on the Ministry of Economy and Competitiveness of the Government of Spain. Benítez, I.; Diez, J. (2022). Automated Detection of Electric Energy Consumption Load Profile Patterns. Energies. 15(6):1-26. https://doi.org/10.3390/en15062176

    Spatiotemporal Wireless Sensor Network Field Approximation with Multilayer Perceptron Artificial Neural Network Models

    As sensors become increasingly compact and dependable in natural environments, spatially-distributed heterogeneous sensor network systems steadily become more pervasive. However, any environmental monitoring system must account for potential data loss due to a variety of natural and technological causes. Modeling a natural spatial region can be problematic due to spatial nonstationarities in environmental variables, and as particular regions may be subject to specific influences at different spatial scales. Relationships between processes within these regions are often ephemeral, so models designed to represent them cannot remain static. Integrating temporal factors into this model engenders further complexity. This dissertation evaluates the use of multilayer perceptron neural network models in the context of sensor networks as a possible solution to many of these problems given their data-driven nature, their representational flexibility and straightforward fitting process. The relative importance of parameters is determined via an adaptive backpropagation training process, which converges to a best-fit model for sensing platforms to validate collected data or approximate missing readings. As conditions evolve over time such that the model can no longer adapt to changes, new models are trained to replace the old. We demonstrate accuracy results for the MLP generally on par with those of spatial kriging, but able to integrate additional physical and temporal parameters, enabling its application to any region with a collection of available data streams. Potential uses of this model might be not only to approximate missing data in the sensor field, but also to flag potentially incorrect, unusual or atypical data returned by the sensor network. Given the potential for spatial heterogeneity in a monitored phenomenon, this dissertation further explores the benefits of partitioning a space and applying individual MLP models to these partitions. 
A system of neural models using both spatial and temporal parameters can be envisioned such that a spatiotemporal space partitioned by k-means is modeled by k neural models with internal weightings varying individually according to the dominant processes within the assigned region of each. Evaluated on simulated and real data on surface currents of the Gulf of Maine, partitioned models show significantly improved results over single global models.

    Bayesian inference for tensor factorization models

    Multivariate categorical data are routinely collected in several applications, including epidemiology, biology, and sociology, among many others. Popular models dealing with these variables include log-linear and tensor factorization models, with the latter having the advantage of flexibly characterizing the dependence structure underlying the data. Within this framework, this thesis aims to provide novel approaches to define compact representations of the dependence structures and to introduce new inference possibilities in tensor factorization approaches. We introduce a new class of GROuped Tensor (GROT) factorizations, which have superior performance in terms of data compression compared with the standard Parafac approach, using relatively few components to represent the joint probability mass function of the data. While popular Parafac factorizations rely on mixing together independent components, GROT mixes together grouped factorizations, equivalent to replacing vector arms in Parafac with low-dimensional tensor arms. We consider a Bayesian approach to inference with Dirichlet priors on the mixing weights and arm components, to obtain a combined low-rank and sparse structure, while facilitating efficient posterior computation via Markov chain Monte Carlo. Motivated by an application on malaria risk assessment, we also introduce a novel multivariate generalization of mixed membership models, which allows identification of correlated profiles related to different domains corresponding to separate groups of variables. We consider as a case study the Machadinho settlement project in Brazil, with the aim of defining survey-based environmental and behavioral risk profiles and studying their interaction and evolution.
To achieve this goal, we show that the use of correlated multiple membership vectors leads to interpretable inference requiring a lower number of profiles compared to standard formulations while inducing a more compact representation of the population-level model. We propose a novel multivariate logistic normal distribution for the membership vectors, which allows easy introduction of auxiliary information in the membership profiles by leveraging a multivariate latent logistic regression. A Bayesian approach to inference, relying on Pólya-gamma data augmentation, facilitates efficient posterior computation via Markov chain Monte Carlo. The proposed approach is shown to outperform the classical mixed membership model in simulations and in the malaria diffusion application.

    Phenotyping Risk Profiles of Substance Use and Exploring the Dynamic Transitions in Use Patterns: Machine Learning Models using the COMPASS Data

    Background: Polysubstance use is on the rise among Canadian youth. Examining risk profiles and understanding how the transition occurs in use patterns can inform the design and implementation of polysubstance risk reduction interventions. The COMPASS study is longitudinal research examining health-related behaviours among Canadian secondary school students, capturing data from multiple sources. Machine learning (ML) techniques can reveal non-linearity and multivariate couplings associated with population-level longitudinal data to inform public health policies. Objectives: The overarching goal of this thesis is to identify phenotypes of risk profiles of youth polysubstance use and examine the dynamic transitions of use patterns across time, utilizing both unsupervised ML methods and a latent variable modelling approach. This thesis also aims to understand how ML techniques are best used in modelling transitions and discovering the “hidden” patterns from large complex population-based health survey data, using the COMPASS dataset as a showcase. Methods: A linked sample (N = 8824) of three annual waves of the COMPASS data collected starting from the school year of 2016-17 was used. Multiple imputations for missing values were performed. Substance use indicators, including cigarette smoking, e-cigarette use, alcohol drinking, and marijuana consumption, were categorized into “never use,” “occasional use,” and “current use.” To examine phenotypes of risk profiles, hierarchical clustering, partitioning around medoids (PAM), and fuzzy clustering algorithms were applied. The Boruta algorithm was used to identify a subset of features for cluster analysis. Both internal and external indices were employed to evaluate the clustering validity. A multivariate latent Markov model (LMM) was implemented to explore the dynamic transitions of use patterns over time.
The least absolute shrinkage and selection operator (LASSO) approach was applied to select the appropriate covariates for entering the LMM. Model selection was based on the Bayesian information criterion (BIC) and the goodness-of-fit test. Results: The top factors impacting youth polysubstance use included the number of smoking friends, the number of skipped classes, the weekly money to spend/save oneself, and others. Four risk profiles of polysubstance use were identified across the three waves: low, medium-low, medium-high, and high-risk profiles. The heterogeneity in the prevalence and phenotype across these four risk profiles was confirmed. The internal clustering performance, measured by average silhouette width, ranged from 0.51 to 0.55 across the three waves using different clustering algorithms. The clustering algorithms achieved a relatively high degree of agreement on cluster membership. Comparing the fuzzy (FANNY) clustering with PAM clustering, the adjusted Rand indices were 0.9698, 0.7676, and 0.6452 for the three waves. Four distinct use patterns were identified: no use (S1), occasional single-use of alcohol (S2), dual-use of e-cigarette and alcohol (S3), and current multi-use (S4). The initial probabilities of each subgroup were 0.5887, 0.2156, 0.1487, and 0.0470. The marginal distribution of S1 decreased, while that of S3 and S4 increased over time, indicating a tendency towards increased substance use as the students grew older. Generally, most students remained in the same subgroup across time, particularly the individuals in S4, who had the highest transition probability (0.8668). Over time, those who transitioned typically moved towards a more severe use pattern group, e.g., S3 -> S4. Factors that impact the initial membership of use patterns and the dynamic transitions were multifaceted and complex across the four use patterns and the three waves.
Not only do use patterns change with time, but so does the evidence in use patterns. Conclusion: As the first study of its kind to ascertain risk profiles and dynamics of use patterns in youth polysubstance use by applying ML approaches to the COMPASS dataset, this thesis provides insights into the opportunities and possibilities ahead for ML in Public Health. Findings from this thesis can be beneficial to practitioners in the field, such as school program managers or policymakers, in their capacity to develop interventions to prevent or remedy polysubstance use among youth.
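
    The agreement between two clusterings reported in this abstract is measured with the adjusted Rand index. The sketch below is a minimal stdlib implementation of the usual pair-counting formula, given only to show what the index measures. The function name and the toy label vectors are hypothetical, and nothing here reproduces the thesis's data or its reported values.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two hard partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    n = len(labels_a)
    if n < 2:
        raise ValueError("at least two items are required")
    # Pairs of items placed together in both partitions, in A only, in B only.
    together_both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    together_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    together_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = together_a * together_b / comb(n, 2)
    maximum = (together_a + together_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions put every item in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Hypothetical labelings: the index ignores how clusters are named.
same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_up_to_renaming, crossed)
```

    A value of 1 means the two partitions are identical up to relabelling, and values near 0 mean agreement no better than chance. To compare a fuzzy clustering with a hard one in this way, each item would first be assigned to its highest-membership cluster, which is an assumption about the procedure and is not stated in the abstract.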