
    A Fast Minimal Infrequent Itemset Mining Algorithm

    A novel fast algorithm for finding quasi-identifiers in large datasets is presented. Performance measurements on a broad range of datasets demonstrate substantial reductions in run time relative to the state of the art, and the scalability of the algorithm to realistically sized datasets of up to several million records.
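
    The abstract does not reproduce the algorithm itself; the sketch below is a brute-force baseline (not the paper's fast method) showing how minimal infrequent value combinations can be read as candidate quasi-identifiers: a column set is flagged when some value combination occurs fewer than k times and no proper subset is already flagged. The column names, records and threshold are illustrative assumptions.

```python
from itertools import combinations
from collections import Counter

def minimal_infrequent_combos(rows, columns, k=2):
    """Brute-force search for minimal column sets containing a value
    combination that occurs fewer than k times (candidate quasi-identifiers).
    Not the paper's fast algorithm -- an illustrative baseline only."""
    minimal = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # Skip supersets of an already-found minimal combination.
            if any(set(m).issubset(combo) for m in minimal):
                continue
            counts = Counter(tuple(row[c] for c in combo) for row in rows)
            if any(count < k for count in counts.values()):
                minimal.append(combo)
    return minimal

# Illustrative usage with made-up records.
records = [
    {"age": 34, "zip": "4000", "sex": "F"},
    {"age": 34, "zip": "4000", "sex": "M"},
    {"age": 51, "zip": "4000", "sex": "F"},
    {"age": 34, "zip": "8010", "sex": "F"},
]
print(minimal_infrequent_combos(records, ["age", "zip", "sex"], k=2))
```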

    Discovering Rare Data Correlations for Digging Bulk Items

    Frequent weighted itemsets represent correlations that frequently hold in data in which items may weight differently. However, in certain contexts, e.g., when the need is to reduce a particular cost function, finding rare data correlations is more interesting than mining frequent ones. This paper tackles the problem of finding rare and weighted itemsets, i.e., the infrequent weighted itemset (IWI) mining problem. Within the context of data center resource management and application profiling, transactions may represent CPU usage readings collected at a fixed sampling rate; focusing on minimal IWIs lets the expert concentrate on the smallest CPU sets that contain at least one underutilized or idle CPU, and thus reduces the bias due to the possible inclusion of highly weighted items in the extracted patterns. Two novel quality measures are suggested to drive the IWI mining process. In addition, two algorithms that perform IWI and minimal IWI mining efficiently, driven by the suggested measures, are presented. Experimental results show the effectiveness and efficiency of the suggested approach.
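
    To illustrate the data-center profiling context mentioned above, the following sketch shows how fixed-rate CPU usage readings might be turned into weighted transactions, with low weights marking underutilized or idle CPUs. The data layout and identifiers are assumptions for illustration, not the paper's data model; a sketch of mining infrequent weighted itemsets from such transactions follows a later abstract below.

```python
def readings_to_weighted_transactions(samples):
    """Turn fixed-rate CPU usage samples into weighted transactions.

    Each sampling interval becomes one transaction mapping a CPU identifier
    to its utilization in [0, 100]; low weights correspond to underutilized
    or idle CPUs, the targets of infrequent weighted itemset mining.
    Illustrative layout only, not the paper's exact data model.
    """
    return [{cpu: float(util) for cpu, util in interval.items()}
            for interval in samples]

# Three hypothetical sampling intervals over three CPUs.
samples = [
    {"cpu0": 95.0, "cpu1": 12.0, "cpu2": 3.0},
    {"cpu0": 88.0, "cpu1": 10.0, "cpu2": 5.0},
    {"cpu0": 91.0, "cpu1": 15.0, "cpu2": 2.0},
]
transactions = readings_to_weighted_transactions(samples)
```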

    Feedback-based integration of the whole process of data anonymization in a graphical interface

    The interactive, web-based point-and-click application presented in this article allows anonymizing data without any knowledge of a programming language. Anonymization matters in data mining, but creating safe, anonymized data is by no means a trivial task. Both methodological issues and know-how from subject-matter specialists should be taken into account when anonymizing data. Even though specialized software such as sdcMicro exists, it is often difficult for nonexperts in a particular software package and without programming skills to actually anonymize datasets without an appropriate app. The presented app is not restricted to applying disclosure limitation techniques but rather facilitates the entire anonymization process. The interface allows users to upload data to the system, modify them, and create an object defining the disclosure scenario. Once such a statistical disclosure control (SDC) problem has been defined, users can apply anonymization techniques to this object and get instant feedback on the impact on risk and data utility after SDC methods have been applied. Additional features, such as an Undo button, the possibility to export the anonymized dataset or the code required for reproducibility, as well as its interactive features, make it convenient for both experts and nonexperts in R – the free software environment for statistical computing and graphics – to protect a dataset using this app.
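
    Purely as an illustration of the workflow described above (define a disclosure scenario, apply a method, get instant risk feedback, undo), the sketch below models it as a small Python class. All names and the placeholder risk metric are hypothetical; the real app builds on R and sdcMicro and also reports data utility.

```python
import copy
from collections import Counter

class SDCProblem:
    """Hypothetical sketch of the app's workflow: a dataset plus a disclosure
    scenario, with an undo stack and instant risk feedback after each step."""

    def __init__(self, records, key_vars):
        self.records = records     # list of dict records (uploaded data)
        self.key_vars = key_vars   # quasi-identifiers in the scenario
        self._history = []         # previous states, powering the Undo button

    def apply(self, method):
        """Apply one anonymization step (any function mapping records -> records)."""
        self._history.append(copy.deepcopy(self.records))
        self.records = method(self.records)
        return self.feedback()

    def undo(self):
        """Revert the most recent anonymization step, if any."""
        if self._history:
            self.records = self._history.pop()
        return self.feedback()

    def feedback(self):
        """Placeholder risk metric: share of records unique on the key
        variables. A real tool would also report data utility."""
        keys = Counter(tuple(r[v] for v in self.key_vars) for r in self.records)
        uniques = sum(1 for r in self.records
                      if keys[tuple(r[v] for v in self.key_vars)] == 1)
        return {"risk": uniques / len(self.records)}
```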

    Infrequent Weighted Itemset Mining Using Frequent Pattern Growth

    Frequent weighted itemsets represent correlations frequently holding in data in which items may weight differently. However, in some contexts, e.g., when the need is to minimize a certain cost function, discovering rare data correlations is more interesting than mining frequent ones. This paper tackles the issue of discovering rare and weighted itemsets, i.e., the infrequent weighted itemset (IWI) mining problem. Two novel quality measures are proposed to drive the IWI mining process. Furthermore, two algorithms that perform IWI and minimal IWI mining efficiently, driven by the proposed measures, are presented. Experimental results show the efficiency and effectiveness of the proposed approach.
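
    The abstract does not spell out the two quality measures; the brute-force sketch below assumes one plausible reading: an itemset's weight in a transaction is the minimum weight of its items there, its IWI-support is the sum of those weights over supporting transactions, and an itemset is reported when that support falls below a maximum threshold. The pattern-growth optimisations of the paper are not reproduced, and the threshold and data are illustrative.

```python
from itertools import combinations

def iwi_support_min(itemset, transactions):
    """Sum over supporting transactions of the minimum item weight
    (one plausible reading of an IWI-support measure)."""
    total = 0.0
    for tx in transactions:                 # tx: dict {item: weight}
        if all(item in tx for item in itemset):
            total += min(tx[item] for item in itemset)
    return total

def mine_iwis(transactions, max_support):
    """Enumerate all itemsets whose IWI-support is below max_support.
    Brute force for illustration only; the paper mines them efficiently."""
    items = sorted({item for tx in transactions for item in tx})
    result = []
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            support = iwi_support_min(itemset, transactions)
            if support < max_support:
                result.append((itemset, support))
    return result

# Weighted transactions like those built from the CPU readings above.
transactions = [
    {"cpu0": 95.0, "cpu1": 12.0, "cpu2": 3.0},
    {"cpu0": 88.0, "cpu1": 10.0, "cpu2": 5.0},
    {"cpu0": 91.0, "cpu1": 15.0, "cpu2": 2.0},
]
print(mine_iwis(transactions, max_support=40.0))
```

    Minimal IWIs would then be the reported itemsets none of whose proper subsets is also reported; the paper's algorithms avoid this exhaustive enumeration by using a frequent-pattern-growth approach.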

    Privacy of study participants in open-access health and demographic surveillance system data: requirements analysis for data anonymization

    Background: Data anonymization and sharing have become popular topics for individuals, organizations, and countries worldwide. Open-access sharing of anonymized data containing sensitive information about individuals makes the most sense whenever the utility of the data can be preserved and the risk of disclosure can be kept below acceptable levels. In this case, researchers can use the data without access restrictions and limitations. Objective: This study aimed to highlight the requirements and possible solutions for sharing health surveillance event history data. The challenges lie in the anonymization of multiple event dates and time-varying variables. Methods: A sequential approach that adds noise to event dates is proposed. This approach maintains the event order and preserves the average time between events. In addition, a nosy-neighbor, distance-based matching approach to estimate the risk is proposed. Regarding the key variables that change over time, such as educational level or occupation, we make two proposals: one based on limiting the intermediate statuses of the individual and the other on achieving k-anonymity in subsets of the data. The proposed approaches were applied to the Karonga health and demographic surveillance system (HDSS) core residency data set, which contains longitudinal data from 1995 to the end of 2016 and includes 280,381 events with time-varying socioeconomic variables and demographic information. Results: An anonymized version of the event history data, including longitudinal information on individuals over time, with high data utility, was created. Conclusions: The proposed anonymization of event history data comprising static and time-varying variables, applied to HDSS data, led to acceptable disclosure risk, preserved utility, and a data set that can be shared as public use data. High utility was achieved even with the highest level of noise added to the core event dates; preserving such details is important to ensure consistency and credibility. Importantly, the sequential noise addition approach presented in this study not only maintains the event order recorded in the original data but also preserves the time between events. We proposed an approach that preserves the data utility well but limits the number of response categories for the time-varying variables. Furthermore, using distance-based neighborhood matching, we simulated an attack under a nosy-neighbor situation and a worst-case scenario in which attackers have full information on the original data. We showed that the disclosure risk is very low, even when assuming that the attacker’s database and information are optimal. The HDSS and medical science research communities in low- and middle-income country settings will be the primary beneficiaries of the results and methods presented in this paper; however, the results will be useful for anyone working on anonymizing longitudinal event history data with time-varying variables for the purpose of sharing.
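
    A minimal sketch of one way the sequential noise addition described above could work, assuming zero-mean noise is added to the gaps between consecutive events and negative gaps are clipped to zero, so the event order is preserved and the expected time between events stays roughly unchanged. The noise scale and date handling are assumptions, not the paper's exact parameters.

```python
import random
from datetime import date, timedelta

def add_sequential_noise(event_dates, scale_days=30, seed=None):
    """Perturb an ordered list of event dates for one individual.

    Zero-mean uniform noise is added to each inter-event gap and clipped at
    zero, so events keep their order; because the noise is zero-mean, the
    average gap is approximately preserved. Illustrative sketch only.
    """
    rng = random.Random(seed)
    dates = sorted(event_dates)
    # Shift the first event, then rebuild the sequence from noised gaps.
    noised = [dates[0] + timedelta(days=rng.randint(-scale_days, scale_days))]
    for prev, curr in zip(dates, dates[1:]):
        gap = (curr - prev).days
        noise = rng.randint(-scale_days, scale_days)
        noised.append(noised[-1] + timedelta(days=max(gap + noise, 0)))
    return noised

events = [date(1995, 3, 1), date(1998, 7, 15), date(2004, 1, 2), date(2016, 12, 31)]
print(add_sequential_noise(events, scale_days=30, seed=1))
```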

    A universal global measure of univariate and bivariate data utility for anonymised microdata

    This paper presents a new global data utility measure based on a benchmarking approach. Data utility measures assess the utility of anonymised microdata by measuring changes in distributions and their impact on bias, variance and other statistics derived from the data. Most existing data utility measures have significant shortcomings: they are limited to continuous variables, to univariate utility assessment, or to local information-loss measurements. Several solutions are presented in the proposed global data utility model. It combines univariate and bivariate data utility measures, which calculate information loss using various statistical tests and association measures, such as the two-sample Kolmogorov–Smirnov test, the chi-squared test (Cramér's V), the ANOVA F test (eta squared), the Kruskal–Wallis H test (epsilon squared), the Spearman coefficient (rho) and the Pearson correlation coefficient (r). The model is universal, since it also includes new local utility measures for the global recoding and variable removal data reduction approaches, and it can be used for data protected with all common masking methods and techniques, from data reduction and data perturbation to generation of synthetic data and sampling. At the bivariate level, the model includes all required data analysis steps: assumptions for statistical tests, statistical significance of the association, direction of the association and strength of the association (effect size). Since the model should be executed automatically with statistical software code or a package, our aim was to allow all steps to be done with no additional user input. For this reason, we propose approaches to automatically establish the direction of the association between two variables using test-reported standardised residuals and sums of squares between groups. Although the model is a global data utility model, individual local univariate and bivariate utility can still be assessed for different types of variables, as well as for both normal and non-normal distributions. The next important step in global data utility assessment would be to develop either program code or an R statistical software package for measuring data utility, and to establish the relationship between the univariate, bivariate and multivariate data utility of anonymised data.
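
    As an illustration (not the paper's code or model), the sketch below compares a bivariate association before and after anonymization using two of the statistics listed above, the Pearson correlation and Cramér's V derived from a chi-squared test, with SciPy. The column layout and helper names are assumptions.

```python
import numpy as np
from scipy.stats import pearsonr, chi2_contingency

def cramers_v(x, y):
    """Cramér's V for two categorical variables (illustrative helper)."""
    table = np.array([[np.sum((x == a) & (y == b)) for b in np.unique(y)]
                      for a in np.unique(x)])
    chi2 = chi2_contingency(table)[0]
    n = table.sum()
    k = min(table.shape) - 1
    return np.sqrt(chi2 / (n * k)) if k > 0 else 0.0

def bivariate_utility_report(orig, anon, num_pair, cat_pair):
    """Report how much two association measures change after anonymization.

    `orig` and `anon` map column names to NumPy arrays; `num_pair` names a
    pair of numeric columns and `cat_pair` a pair of categorical columns.
    """
    r_orig = pearsonr(orig[num_pair[0]], orig[num_pair[1]])[0]
    r_anon = pearsonr(anon[num_pair[0]], anon[num_pair[1]])[0]
    v_orig = cramers_v(orig[cat_pair[0]], orig[cat_pair[1]])
    v_anon = cramers_v(anon[cat_pair[0]], anon[cat_pair[1]])
    return {"pearson_r_change": abs(r_orig - r_anon),
            "cramers_v_change": abs(v_orig - v_anon)}
```

    Smaller changes indicate higher retained utility; the full model described above would additionally check test assumptions, significance, and the direction of the association.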
