
    A Fast Minimal Infrequent Itemset Mining Algorithm

    A novel fast algorithm for finding quasi-identifiers in large datasets is presented. Performance measurements on a broad range of datasets demonstrate substantial reductions in run time relative to the state of the art, and the scalability of the algorithm to realistically sized datasets of up to several million records.
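
    The abstract does not reproduce the algorithm itself; the sketch below is a brute-force baseline (not the paper's fast method) showing how minimal infrequent value combinations can be read as candidate quasi-identifiers: a column set is flagged when some value combination occurs fewer than k times and no proper subset is already flagged. The column names, records and threshold are illustrative assumptions.

```python
from itertools import combinations
from collections import Counter

def minimal_infrequent_combos(rows, columns, k=2):
    """Brute-force search for minimal column sets containing a value
    combination that occurs fewer than k times (candidate quasi-identifiers).
    Not the paper's fast algorithm -- an illustrative baseline only."""
    minimal = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # Skip supersets of an already-found minimal combination.
            if any(set(m).issubset(combo) for m in minimal):
                continue
            counts = Counter(tuple(row[c] for c in combo) for row in rows)
            if any(count < k for count in counts.values()):
                minimal.append(combo)
    return minimal

# Illustrative usage with made-up records.
records = [
    {"age": 34, "zip": "4000", "sex": "F"},
    {"age": 34, "zip": "4000", "sex": "M"},
    {"age": 51, "zip": "4000", "sex": "F"},
    {"age": 34, "zip": "8010", "sex": "F"},
]
print(minimal_infrequent_combos(records, ["age", "zip", "sex"], k=2))
```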

    Discovering Rare Data Correlations for Digging Bulk Items

    Frequent weighted itemsets represent correlations that frequently hold in data in which items may weight differently. However, in certain contexts, e.g., when the need is to reduce a particular cost function, finding rare data correlations is more interesting than mining frequent ones. This paper tackles the problem of finding rare and weighted itemsets, i.e., the infrequent weighted itemset (IWI) mining problem. Within the context of data center resource management and application profiling, transactions may represent CPU usage readings collected at a fixed sampling rate; focusing on minimal IWIs lets the expert concentrate on the smallest CPU sets that contain at least one underutilized or idle CPU, and thus reduces the bias due to the possible inclusion of highly weighted items in the extracted patterns. Two novel quality measures are suggested to drive the IWI mining process. In addition, two algorithms that perform IWI and minimal IWI mining efficiently, driven by the suggested measures, are presented. Experimental results show the effectiveness and efficiency of the suggested approach.
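
    To illustrate the data-center profiling context mentioned above, the following sketch shows how fixed-rate CPU usage readings might be turned into weighted transactions, with low weights marking underutilized or idle CPUs. The data layout and identifiers are assumptions for illustration, not the paper's data model; a sketch of mining infrequent weighted itemsets from such transactions follows a later abstract below.

```python
def readings_to_weighted_transactions(samples):
    """Turn fixed-rate CPU usage samples into weighted transactions.

    Each sampling interval becomes one transaction mapping a CPU identifier
    to its utilization in [0, 100]; low weights correspond to underutilized
    or idle CPUs, the targets of infrequent weighted itemset mining.
    Illustrative layout only, not the paper's exact data model.
    """
    return [{cpu: float(util) for cpu, util in interval.items()}
            for interval in samples]

# Three hypothetical sampling intervals over three CPUs.
samples = [
    {"cpu0": 95.0, "cpu1": 12.0, "cpu2": 3.0},
    {"cpu0": 88.0, "cpu1": 10.0, "cpu2": 5.0},
    {"cpu0": 91.0, "cpu1": 15.0, "cpu2": 2.0},
]
transactions = readings_to_weighted_transactions(samples)
```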

    Feedback-based integration of the whole process of data anonymization in a graphical interface

    The interactive, web-based point-and-click application presented in this article allows anonymizing data without any knowledge of a programming language. Anonymization matters in data mining, but creating safe, anonymized data is by no means a trivial task. Both methodological issues and know-how from subject-matter specialists should be taken into account when anonymizing data. Even though specialized software such as sdcMicro exists, it is often difficult for nonexperts in a particular software package and without programming skills to actually anonymize datasets without an appropriate app. The presented app is not restricted to applying disclosure limitation techniques but rather facilitates the entire anonymization process. The interface allows users to upload data to the system, modify them, and create an object defining the disclosure scenario. Once such a statistical disclosure control (SDC) problem has been defined, users can apply anonymization techniques to this object and get instant feedback on the impact on risk and data utility after SDC methods have been applied. Additional features, such as an Undo button, the possibility to export the anonymized dataset or the code required for reproducibility, as well as its interactive features, make it convenient for both experts and nonexperts in R – the free software environment for statistical computing and graphics – to protect a dataset using this app.
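
    Purely as an illustration of the workflow described above (define a disclosure scenario, apply a method, get instant risk feedback, undo), the sketch below models it as a small Python class. All names and the placeholder risk metric are hypothetical; the real app builds on R and sdcMicro and also reports data utility.

```python
import copy
from collections import Counter

class SDCProblem:
    """Hypothetical sketch of the app's workflow: a dataset plus a disclosure
    scenario, with an undo stack and instant risk feedback after each step."""

    def __init__(self, records, key_vars):
        self.records = records     # list of dict records (uploaded data)
        self.key_vars = key_vars   # quasi-identifiers in the scenario
        self._history = []         # previous states, powering the Undo button

    def apply(self, method):
        """Apply one anonymization step (any function mapping records -> records)."""
        self._history.append(copy.deepcopy(self.records))
        self.records = method(self.records)
        return self.feedback()

    def undo(self):
        """Revert the most recent anonymization step, if any."""
        if self._history:
            self.records = self._history.pop()
        return self.feedback()

    def feedback(self):
        """Placeholder risk metric: share of records unique on the key
        variables. A real tool would also report data utility."""
        keys = Counter(tuple(r[v] for v in self.key_vars) for r in self.records)
        uniques = sum(1 for r in self.records
                      if keys[tuple(r[v] for v in self.key_vars)] == 1)
        return {"risk": uniques / len(self.records)}
```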

    Infrequent Weighted Itemset Mining Using Frequent Pattern Growth

    Frequent weighted itemsets represent correlations frequently holding in data in which items may weight differently. However, in some contexts, e.g., when the need is to minimize a certain cost function, discovering rare data correlations is more interesting than mining frequent ones. This paper tackles the issue of discovering rare and weighted itemsets, i.e., the infrequent weighted itemset (IWI) mining problem. Two novel quality measures are proposed to drive the IWI mining process. Furthermore, two algorithms that perform IWI and minimal IWI mining efficiently, driven by the proposed measures, are presented. Experimental results show the efficiency and effectiveness of the proposed approach.
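
    The abstract does not spell out the two quality measures; the brute-force sketch below assumes one plausible reading: an itemset's weight in a transaction is the minimum weight of its items there, its IWI-support is the sum of those weights over supporting transactions, and an itemset is reported when that support falls below a maximum threshold. The pattern-growth optimisations of the paper are not reproduced, and the threshold and data are illustrative.

```python
from itertools import combinations

def iwi_support_min(itemset, transactions):
    """Sum over supporting transactions of the minimum item weight
    (one plausible reading of an IWI-support measure)."""
    total = 0.0
    for tx in transactions:                 # tx: dict {item: weight}
        if all(item in tx for item in itemset):
            total += min(tx[item] for item in itemset)
    return total

def mine_iwis(transactions, max_support):
    """Enumerate all itemsets whose IWI-support is below max_support.
    Brute force for illustration only; the paper mines them efficiently."""
    items = sorted({item for tx in transactions for item in tx})
    result = []
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            support = iwi_support_min(itemset, transactions)
            if support < max_support:
                result.append((itemset, support))
    return result

# Weighted transactions like those built from the CPU readings above.
transactions = [
    {"cpu0": 95.0, "cpu1": 12.0, "cpu2": 3.0},
    {"cpu0": 88.0, "cpu1": 10.0, "cpu2": 5.0},
    {"cpu0": 91.0, "cpu1": 15.0, "cpu2": 2.0},
]
print(mine_iwis(transactions, max_support=40.0))
```

    Minimal IWIs would then be the reported itemsets none of whose proper subsets is also reported; the paper's algorithms avoid this exhaustive enumeration by using a frequent-pattern-growth approach.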

    Privacy of study participants in open-access health and demographic surveillance system data: requirements analysis for data anonymization

    Background: Data anonymization and sharing have become popular topics for individuals, organizations, and countries worldwide. Open-access sharing of anonymized data containing sensitive information about individuals makes the most sense whenever the utility of the data can be preserved and the risk of disclosure can be kept below acceptable levels. In this case, researchers can use the data without access restrictions and limitations. Objective: This study aimed to highlight the requirements and possible solutions for sharing health surveillance event history data. The challenges lie in the anonymization of multiple event dates and time-varying variables. Methods: A sequential approach that adds noise to event dates is proposed. This approach maintains the event order and preserves the average time between events. In addition, a nosy-neighbor, distance-based matching approach to estimate the risk is proposed. Regarding the key variables that change over time, such as educational level or occupation, we make two proposals: one based on limiting the intermediate statuses of the individual and the other on achieving k-anonymity in subsets of the data. The proposed approaches were applied to the Karonga health and demographic surveillance system (HDSS) core residency data set, which contains longitudinal data from 1995 to the end of 2016 and includes 280,381 events with time-varying socioeconomic variables and demographic information. Results: An anonymized version of the event history data, including longitudinal information on individuals over time, with high data utility, was created. Conclusions: The proposed anonymization of event history data comprising static and time-varying variables, applied to HDSS data, led to acceptable disclosure risk, preserved utility, and a data set that can be shared as public use data. High utility was achieved even with the highest level of noise added to the core event dates; preserving such details is important to ensure consistency and credibility. Importantly, the sequential noise addition approach presented in this study not only maintains the event order recorded in the original data but also preserves the time between events. We proposed an approach that preserves the data utility well but limits the number of response categories for the time-varying variables. Furthermore, using distance-based neighborhood matching, we simulated an attack under a nosy-neighbor situation and a worst-case scenario in which attackers have full information on the original data. We showed that the disclosure risk is very low, even when assuming that the attacker’s database and information are optimal. The HDSS and medical science research communities in low- and middle-income country settings will be the primary beneficiaries of the results and methods presented in this paper; however, the results will be useful for anyone working on anonymizing longitudinal event history data with time-varying variables for the purpose of sharing.
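
    A minimal sketch of one way the sequential noise addition described above could work, assuming zero-mean noise is added to the gaps between consecutive events and negative gaps are clipped to zero, so the event order is preserved and the expected time between events stays roughly unchanged. The noise scale and date handling are assumptions, not the paper's exact parameters.

```python
import random
from datetime import date, timedelta

def add_sequential_noise(event_dates, scale_days=30, seed=None):
    """Perturb an ordered list of event dates for one individual.

    Zero-mean uniform noise is added to each inter-event gap and clipped at
    zero, so events keep their order; because the noise is zero-mean, the
    average gap is approximately preserved. Illustrative sketch only.
    """
    rng = random.Random(seed)
    dates = sorted(event_dates)
    # Shift the first event, then rebuild the sequence from noised gaps.
    noised = [dates[0] + timedelta(days=rng.randint(-scale_days, scale_days))]
    for prev, curr in zip(dates, dates[1:]):
        gap = (curr - prev).days
        noise = rng.randint(-scale_days, scale_days)
        noised.append(noised[-1] + timedelta(days=max(gap + noise, 0)))
    return noised

events = [date(1995, 3, 1), date(1998, 7, 15), date(2004, 1, 2), date(2016, 12, 31)]
print(add_sequential_noise(events, scale_days=30, seed=1))
```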

    A universal global measure of univariate and bivariate data utility for anonymised microdata

    This paper presents a new global data utility measure based on a benchmarking approach. Data utility measures assess the utility of anonymised microdata by measuring changes in distributions and their impact on bias, variance and other statistics derived from the data. Most existing data utility measures have significant shortcomings: they are limited to continuous variables, to univariate utility assessment, or to local information-loss measurements. Several solutions are presented in the proposed global data utility model. It combines univariate and bivariate data utility measures, which calculate information loss using various statistical tests and association measures, such as the two-sample Kolmogorov–Smirnov test, the chi-squared test (Cramér's V), the ANOVA F test (eta squared), the Kruskal–Wallis H test (epsilon squared), the Spearman coefficient (rho) and the Pearson correlation coefficient (r). The model is universal, since it also includes new local utility measures for the global recoding and variable removal data reduction approaches, and it can be used for data protected with all common masking methods and techniques, from data reduction and data perturbation to generation of synthetic data and sampling. At the bivariate level, the model includes all required data analysis steps: assumptions for statistical tests, statistical significance of the association, direction of the association and strength of the association (effect size). Since the model should be executed automatically with statistical software code or a package, our aim was to allow all steps to be done with no additional user input. For this reason, we propose approaches to automatically establish the direction of the association between two variables using test-reported standardised residuals and sums of squares between groups. Although the model is a global data utility model, individual local univariate and bivariate utility can still be assessed for different types of variables, as well as for both normal and non-normal distributions. The next important step in global data utility assessment would be to develop either program code or an R statistical software package for measuring data utility, and to establish the relationship between the univariate, bivariate and multivariate data utility of anonymised data.
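
    As an illustration (not the paper's code or model), the sketch below compares a bivariate association before and after anonymization using two of the statistics listed above, the Pearson correlation and Cramér's V derived from a chi-squared test, with SciPy. The column layout and helper names are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr, chi2_contingency

def cramers_v(x, y):
    """Cramér's V for two categorical variables (illustrative helper)."""
    table = np.array([[np.sum((x == a) & (y == b)) for b in np.unique(y)]
                      for a in np.unique(x)])
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k)) if k > 0 else 0.0

def bivariate_utility_report(orig, anon, num_pair, cat_pair):
    """Report how much two association measures change after anonymization.

    `orig` and `anon` map column names to NumPy arrays; `num_pair` names a
    pair of numeric columns and `cat_pair` a pair of categorical columns.
    """
    r_orig = pearsonr(orig[num_pair[0]], orig[num_pair[1]])[0]
    r_anon = pearsonr(anon[num_pair[0]], anon[num_pair[1]])[0]
    v_orig = cramers_v(orig[cat_pair[0]], orig[cat_pair[1]])
    v_anon = cramers_v(anon[cat_pair[0]], anon[cat_pair[1]])
    return {"pearson_r_change": abs(r_orig - r_anon),
            "cramers_v_change": abs(v_orig - v_anon)}
```

    Smaller changes indicate higher retained utility; the full model described above would additionally check test assumptions, significance, and the direction of the association.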
