16 research outputs found

    A Regulatory Impact Analysis (RIA) approach based on evolutionary association patterns

    The present paper focuses on ex post analysis to assess the impact of an adopted policy by measuring system performance. Since accurate impact assessment requires in-depth knowledge of the structure underlying the system, this contribution proposes a suitable use of multidimensional data analysis (MDA) to investigate the associations characterizing the indicators/attributes of the system. The general aim is to identify homogeneous subsets of objects that are described by subsets of attributes. The approach was designed to study student performance in Italian universities, with a focus on student careers. The example data set is a data mart selected from the University of Macerata database and refers to the students at the Economics Faculty from 2001 to 2007.

    Dynamic Visualization of Changes in Association Patterns

    The present proposal deals with high-dimensional binary data collected on different occasions in time or space. When studying the associations in data collected on different occasions, a primary aim is to detect changes in the association structure from one occasion to another. A suitable exploratory technique for the analysis of multiple associations in high-dimensional data is multiple correspondence analysis (MCA; Greenacre, 2007). However, comparing MCA factorial displays that refer to different occasions is meaningless. A possible solution for linking the association structures of different data batches is to start from an MCA display of a reference batch and incrementally update the solution with further batches (Iodice D'Enza and Greenacre, 2010). This approach, however, does not take into account the presence of a cluster structure in the set of statistical units. This contribution presents an approach that, through the combination of clustering and factorial techniques, aims to visualize the evolution of the association structure of binary attributes over different data batches. The proposal is to introduce a latent categorical variable that is determined and updated at each incoming batch; in other words, this variable is determined according to the association structure and represents the 'link' among the solutions. The latent categorical variable is endogenously determined by the procedure; in particular, it refers to the cluster structure characterizing the data set in question. A starting solution is updated incrementally as new data sets are analysed. The factorial display describes the patterns of change in the multiple associations when shifting the analysis from one occasion to the other. Procedures suitably combining clustering with factorial analysis techniques have been proposed. Vichi and Kiers (2001) propose a combination of principal component analysis (PCA) with the k-means clustering method. In the framework of categorical data, another interesting approach combining clustering and MCA is proposed by Hwang et al. (2006). Similarly, yet dealing with binary data, Palumbo and Iodice D'Enza (2010) propose a suitable dimension reduction and clustering procedure. The present proposal is an enhancement of the latter approach to the comparative analysis of multiple batches.

    Dynamic data analysis of evolving association patterns

    When dealing with large amounts of data or data flows, it can be convenient or necessary to process them in different 'pieces'; if the data in question refer to different occasions or positions in time or space, a comparative analysis of data stratified in batches can be suitable. The present approach combines clustering and factorial techniques to study the association structure of binary attributes over homogeneous subsets of data; moreover, it seeks to update the result as new statistical units are processed in order to monitor and describe the evolutionary patterns of association. © Springer-Verlag Berlin Heidelberg 2013

    Clustering and Dimensionality Reduction to Discover Interesting Patterns in Binary Data

    Attention towards binary data coding has increased considerably in the last decade for several reasons. The analysis of binary data characterizes several fields of application, such as market basket analysis, DNA microarray data, image mining, text mining and web-clickstream mining. The paper illustrates two different approaches exploiting a profitable combination of clustering and dimensionality reduction for the identification of non-trivial association structures in binary data. An application in the Association Rules framework supports the theory with empirical evidence. © 2010 Springer-Verlag Berlin Heidelberg

    A Two-Step Iterative Procedure for Clustering of Binary Sequences

    Association Rules (AR) are a well-known data mining tool for detecting patterns of association in databases. The major drawback of knowledge extraction through AR mining is the huge number of rules produced when dealing with large amounts of data. Several proposals in the literature tackle this problem with different approaches. In this framework, the general aim of the present proposal is to identify patterns of association in large binary data sets. We propose an iterative procedure combining clustering and dimensionality reduction techniques: each iteration involves a quantification of the starting binary attributes and an agglomerative algorithm applied to the resulting quantitative variables. The objective is to find a quantification that emphasizes the presence of groups of co-occurring attributes in the data.
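
    To make the rule-explosion problem concrete, the following is a minimal sketch of brute-force rule mining on a tiny binary table. It uses the standard definitions of support and confidence and is not the iterative procedure proposed in the paper; the function names, the thresholds, and the toy data are illustrative assumptions.

```python
from itertools import combinations

def support(rows, items):
    """Fraction of rows in which every attribute in `items` equals 1."""
    return sum(all(r[i] for i in items) for r in rows) / len(rows)

def mine_rules(rows, min_support=0.4, min_confidence=0.7):
    """Enumerate every rule lhs -> rhs that passes both thresholds (brute force)."""
    p = len(rows[0])
    found = []
    for size in range(2, p + 1):
        for itemset in combinations(range(p), size):
            s = support(rows, itemset)
            if s == 0 or s < min_support:
                continue
            # Split the itemset into every non-empty antecedent / consequent pair.
            for k in range(1, size):
                for lhs in combinations(itemset, k):
                    rhs = tuple(i for i in itemset if i not in lhs)
                    conf = s / support(rows, lhs)
                    if conf >= min_confidence:
                        found.append((lhs, rhs, s, conf))
    return found

def n_possible_rules(p):
    """Number of distinct rules over p binary attributes: 3^p - 2^(p+1) + 1."""
    return 3 ** p - 2 ** (p + 1) + 1

# Toy data (hypothetical): 5 statistical units, 3 binary attributes.
rows = [(1, 1, 0), (1, 1, 1), (0, 0, 1), (1, 1, 0), (0, 1, 1)]
found = mine_rules(rows)
for lhs, rhs, s, conf in found:
    print(lhs, "->", rhs, f"support={s:.2f} confidence={conf:.2f}")
print("candidate rules for 20 attributes:", n_possible_rules(20))
```

    The candidate count grows roughly as 3^p, which is why the search space has to be reduced before rules are enumerated on data with many attributes.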

    Iterative factor clustering of binary data

    Binary data represent a very special condition where both measures of distance and co-occurrence can be adopted. Euclidean distance-based non-hierarchical methods, like the k-means algorithm or one of its variants, can be profitably used. When the number of available attributes increases, the global clustering performance usually worsens. In such cases, to enhance group separability, it is necessary to remove irrelevant, redundant, and noisy information from the data. The present approach belongs to the category of attribute transformation strategies and combines clustering and factorial techniques to identify attribute associations that characterize one or more homogeneous groups of statistical units. Furthermore, it provides graphical representations that facilitate the interpretation of the results. © 2012 Springer-Verlag
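
    As a point of reference for the Euclidean baseline mentioned above, here is a minimal sketch of plain k-means (Lloyd's algorithm) on binary rows. It does not include the factorial step of the proposed approach; the function name, the starting centres, and the toy data are illustrative assumptions.

```python
def kmeans(rows, centroids, n_iter=20):
    """Lloyd's algorithm with squared Euclidean distance.

    `centroids` holds the starting centres; they are copied, not modified.
    """
    centroids = [list(map(float, c)) for c in centroids]
    labels = []
    for _ in range(n_iter):
        # Assignment step: each row goes to its nearest centre.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(row, centroids[k])))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centres will not move either
        labels = new_labels
        # Update step: each centre becomes the mean of its members.
        for k in range(len(centroids)):
            members = [row for row, lab in zip(rows, labels) if lab == k]
            if members:  # keep the old centre if a cluster empties
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy data (hypothetical): 6 statistical units, 4 binary attributes.
rows = [(1, 1, 0, 0), (1, 1, 1, 0), (1, 0, 0, 0),
        (0, 0, 1, 1), (0, 1, 1, 1), (0, 0, 0, 1)]
labels, centroids = kmeans(rows, centroids=[rows[0], rows[3]])
print(labels)
print(centroids)
```

    With binary input, each coordinate of a centre is the proportion of cluster members possessing that attribute, so the centres can be read directly as attribute profiles of the groups.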

    Chunk-wise regularised PCA-based imputation of missing data

    Standard multivariate techniques like Principal Component Analysis (PCA) are based on the eigendecomposition of a matrix and therefore require complete data sets. Recent comparative reviews of PCA algorithms for missing data showed the regularised iterative PCA algorithm (RPCA) to be effective. This paper presents two chunk-wise implementations of RPCA suitable for the imputation of “tall” data sets, that is, data sets with many observations. A “chunk” is a subset of the whole set of available observations. In particular, one implementation is suitable for distributed computation as it imputes each chunk independently. The other implementation, instead, is suitable for incremental computation, where the imputation of each new chunk is based on all the chunks analysed thus far. The proposed procedures were compared to batch RPCA considering different data sets and missing data mechanisms. Experimental results showed that the distributed approach had similar performance to batch RPCA for data with entries missing completely at random. The incremental approach showed appreciable performance when the data is missing not completely at random and the first analysed chunks contain sufficient information on the data structure.
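
    The distributed idea can be sketched as follows, with loud caveats: this is a plain (unregularised) iterative PCA imputation, so the shrinkage step that defines RPCA is omitted, and it is not the authors' implementation. The function names, the rank, the stopping rule, and the toy data are illustrative assumptions; every column is assumed to have at least one observed value in each chunk.

```python
import numpy as np

def iterative_pca_impute(X, rank=1, n_iter=200, tol=1e-8):
    """Fill NaNs by alternating a low-rank PCA fit with re-imputation.

    Plain iterative PCA: no regularisation (shrinkage) is applied.
    """
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Start from column means computed on the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mean = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        fitted = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mean
        # Only the missing cells are overwritten; observed cells never change.
        updated = np.where(missing, fitted, filled)
        converged = np.max(np.abs(updated - filled)) < tol
        filled = updated
        if converged:
            break
    return filled

def impute_in_chunks(X, n_chunks, **kwargs):
    """Distributed-style variant: split the rows and impute each chunk on its own."""
    chunks = np.array_split(np.asarray(X, dtype=float), n_chunks, axis=0)
    return np.vstack([iterative_pca_impute(c, **kwargs) for c in chunks])

# Toy "tall" data (hypothetical): 6 observations, 3 variables, 2 missing cells.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, np.nan, 6.0],
              [3.0, 6.0, 9.0],
              [4.0, 8.0, 12.0],
              [5.0, 10.0, np.nan],
              [6.0, 12.0, 18.0]])
imputed = impute_in_chunks(X, n_chunks=2, rank=1)
print(imputed)
```

    Because each chunk is processed independently, the calls inside `impute_in_chunks` could run on separate workers; an incremental variant would instead carry the fitted structure forward from one chunk to the next.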