192 research outputs found

    Topology-based Clusterwise Regression for User Segmentation and Demand Forecasting

    Topological Data Analysis (TDA) is a recent approach to analyzing data sets from the perspective of their topological structure. Its use for time series data has been limited. In this work, a system developed for a leading provider of cloud computing combining both user segmentation and demand forecasting is presented. It consists of a TDA-based clustering method for time series inspired by a popular managerial framework for customer segmentation and extended to the case of clusterwise regression using matrix factorization methods to forecast demand. Increasing customer loyalty and producing accurate forecasts remain active topics of discussion both for researchers and managers. Using a public and a novel proprietary data set of commercial data, this research shows that the proposed system enables analysts to both cluster their user base and plan demand at a granular level with significantly higher accuracy than a state-of-the-art baseline. This work thus seeks to introduce TDA-based clustering of time series and clusterwise regression with matrix factorization methods as viable tools for the practitioner.

    Including Item Characteristics in the Probabilistic Latent Semantic Analysis Model for Collaborative Filtering

    We propose a new hybrid recommender system that combines some advantages of collaborative and content-based recommender systems. While it uses ratings data of all users, as do collaborative recommender systems, it is also able to recommend new items and provide an explanation of its recommendations, as do content-based systems. Our approach is based on the idea that there are communities of users that find the same characteristics important to like or dislike a product. This model is an extension of the probabilistic latent semantic model for collaborative filtering with ideas based on clusterwise linear regression. On a movie data set, we show that the model is competitive with other recommenders and can be used to explain the recommendations to the users. Keywords: algorithms; probabilistic latent semantic analysis; hybrid recommender systems; recommender systems

    Novel clustering methods for complex cluster structures in behavioral sciences

    Large-scale data sets with a large number of variables are becoming increasingly available in behavioral research. Encompassing a wide range of measurements and indicators, they provide behavioral scientists with unprecedented opportunities to synthesize different pieces of information so that novel, and sometimes subtle, subgroups (also called clusters) of populations can be identified. The successful detection of clusters is of great practical significance for a wide range of social and behavioral research topics. For example, in treating depressed patients, the first step in generating personalized recommendations is to accurately link the patients to the many subtypes of depression. In the organizational context, it is highly problematic to assume that all leaders should follow the same developmental paths; in fact, tailoring training programs to the unique strengths of different leadership subgroups (e.g., the down-to-earth leaders and the excessively charismatic leaders) is always more effective than general developmental programs. When trying to understand the cognitive process underlying one’s voting behavior, once again, a one-size-fits-all approach likely produces erroneous descriptions. The broad social context as well as the surrounding environment in which a person grows up likely yields clusters of voters; only those belonging to the same cluster share a similar decision-making process for voting. To provide behavioral researchers with the best tool for accurately recovering the clusters hidden in large, complex data sets, this dissertation developed new statistical models and computational tools and implemented these novel approaches in publicly accessible software. Generally speaking, the novel methods developed here advance previous approaches by addressing the following three major challenges. First, as noise is ubiquitous in psychological measures, a considerable number of variables collected may be completely irrelevant to the hidden clusters. These irrelevant variables have to be completely and automatically filtered out during data analysis. Second, when integrating variables from diverse data sources (for example questionnaires and genetic information, GPS coordinates, social media footprints, etc.), it is desirable to capture both the unique characteristics pertaining to each data source and the shared or connected characteristics across the many data sources. Third, when translating data analytics results into substantive conclusions so as to inform critical decisions (e.g., medical decisions, personnel selection, etc.), effective and accurate communication is vital yet not necessarily easy to achieve. The two most prominent difficulties are communicating the confidence and (un)certainty in the clusters recovered and visualizing the results through very accessible graphs. With a variety of computer-simulated data and empirical behavioral data covering topics in clinical, social, personality, and organizational psychology, we were able to conclude that the various methods developed in the dissertation are more versatile, effective, and accurate in identifying subtle clusters in complex data sets, provide rich and unique insights in interpreting these clusters, and, thanks to the development of several software packages, can be readily accessed without many technical barriers. These methods are therefore useful for behavioral researchers to navigate an increasingly digitized world and to recognize structures from massive information.

    50 Years of Data Analysis: From Exploratory Data Analysis to Predictive Modeling and Machine Learning


    A unified framework for bivariate clustering and regression problems via mixed-integer linear programming

    Clustering and regression are two of the most important problems in data analysis and machine learning. Recently, mixed-integer linear programs (MILPs) have been presented in the literature to solve these problems. By modelling the problems as MILPs, they can be solved very quickly by commercial solvers. In particular, MILPs for bivariate clusterwise linear regression (CLR) and (continuous) piecewise linear regression (PWLR) have recently appeared. These MILP models make use of binary variables and logical implications modelled through big-$\mathcal{M}$ constraints. In this paper, we present these models in the context of a unifying MILP framework for bivariate clustering and regression problems. We then present two new formulations within this framework, the first for ordered CLR, and the second for clusterwise piecewise linear regression (CPWLR). The CPWLR problem concerns simultaneously clustering discrete data, while modelling each cluster with a continuous PWL function. Extending the framework, we discuss how outlier detection can be implemented within the models, and how specific decomposition methods can be used to reduce the runtime. Experimental results show when each model is the most effective.