
    Improving Efficiency of Incremental Mining by Trie Structure and Pre-Large Itemsets

    Incremental data mining has been discussed widely in recent years, as it has many practical applications, and various incremental mining algorithms have been proposed. Hong et al. proposed an efficient incremental mining algorithm that handles newly inserted transactions by using the concept of pre-large itemsets; the algorithm reduces the need to rescan the original database and also cuts maintenance costs. More recently, Lin et al. proposed the Pre-FUFP algorithm to handle new transactions more efficiently and to make the FP-tree easier to update; however, frequent itemsets must still be mined with the FP-growth algorithm. In this paper, we propose the Pre-FUT algorithm (a Fast-Update algorithm using the Trie data structure and the concept of pre-large itemsets), which not only builds and updates the trie structure when new transactions are inserted, but also mines all frequent itemsets easily from the tree. Experimental results show the good performance of the proposed algorithm.
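    A minimal sketch (not the authors' Pre-FUT implementation) of the underlying idea: a trie that stores sorted transactions as counted paths, so newly inserted transactions only update counts along their paths and the support of any itemset can be read back from the tree. All names below are illustrative.

    class TrieNode:
        def __init__(self):
            self.children = {}   # item -> child TrieNode
            self.count = 0       # transactions whose sorted path passes through here

    class TransactionTrie:
        def __init__(self):
            self.root = TrieNode()

        def insert(self, transaction):
            """Add one transaction; items are sorted to give a canonical path."""
            node = self.root
            for item in sorted(transaction):
                node = node.children.setdefault(item, TrieNode())
                node.count += 1

        def support(self, itemset):
            """Count transactions containing `itemset` by matching its sorted
            items as a subsequence of the sorted trie paths."""
            items = sorted(itemset)

            def walk(node, idx):
                if idx == len(items):
                    return node.count
                total = 0
                for item, child in node.children.items():
                    if item == items[idx]:
                        total += walk(child, idx + 1)
                    elif item < items[idx]:
                        total += walk(child, idx)
                    # item > items[idx]: a sorted path below here cannot contain items[idx]
                return total

            return walk(self.root, 0)

    # Incremental use: insert the new transactions, then query supports again.
    trie = TransactionTrie()
    for t in [{"a", "b", "c"}, {"b", "c"}, {"a", "c"}]:
        trie.insert(t)
    print(trie.support({"b", "c"}))   # 2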

    Cloud based privacy preserving data mining model using hybrid k-anonymity and partial homomorphic encryption

    The evolution of information and communication technologies has encouraged numerous organizations to outsource their business and data to cloud computing to perform data mining and other data processing operations. Despite the great benefits of the cloud, it has real problems with the security and privacy of data. Many studies have shown that attackers often obtain information from third-party services or third-party clouds. When data owners outsource their data to the cloud, especially under the SaaS cloud model, it is difficult to preserve the confidentiality and integrity of the data. Privacy-Preserving Data Mining (PPDM) aims to accomplish data mining operations while protecting the owner's data from violation. Current PPDM models have some limitations: they suffer from data disclosure caused by identity and attribute disclosure, where private information is revealed and enables various types of attacks. Besides, existing solutions have poor data utility and high computational overhead. Therefore, this research aims to design and develop a Hybrid Anonymization Cryptography PPDM (HAC-PPDM) model to improve the privacy-preserving level by reducing data disclosure before outsourcing data for mining over the cloud, while maintaining data utility. The proposed HAC-PPDM model further aims to reduce the computational overhead to improve efficiency. A Quasi-Identifiers Recognition (QIR) algorithm is defined and designed, based on attribute classification and quasi-identifier dimension determination, to overcome the identity disclosure caused by quasi-identifier linking and thus reduce privacy leakage. An Enhanced Homomorphic Scheme is designed by hybridizing the Cloud-RSA encryption scheme, the Extended Euclidean algorithm (EE), the Fast Modular Exponentiation algorithm (FME), and the Chinese Remainder Theorem (CRT) to minimize the computational time complexity while reducing attribute disclosure. The proposed QIR algorithm, Enhanced Homomorphic Scheme, and k-anonymity privacy model are hybridized to obtain optimal data privacy preservation before the data is outsourced to the cloud, while maintaining data utility that meets the needs of mining with good efficiency. Real-world datasets have been used to evaluate the proposed algorithms and model. The experimental results show that the proposed QIR algorithm improved the data privacy-preserving percentage by 23% while maintaining the same or slightly better data utility. Meanwhile, the proposed Enhanced Homomorphic Scheme is more efficient than related works in terms of time complexity as expressed in Big O notation, and it reduced the encryption, decryption, and key generation times. Finally, the proposed HAC-PPDM model successfully reduced data disclosure and improved the privacy-preserving level while preserving data utility, as it reduced information loss. In short, it improved privacy preservation and data mining (classification) accuracy by 7.59% and 0.11%, respectively.
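    The Enhanced Homomorphic Scheme itself is not reproduced here; the following is only a minimal sketch of the classical building blocks the abstract names (square-and-multiply fast modular exponentiation, the extended Euclidean algorithm for modular inverses, and CRT-accelerated RSA decryption), using textbook toy parameters rather than the paper's construction.

    def mod_exp(base, exp, mod):
        """Square-and-multiply modular exponentiation: O(log exp) multiplications."""
        result = 1
        base %= mod
        while exp > 0:
            if exp & 1:
                result = (result * base) % mod
            base = (base * base) % mod
            exp >>= 1
        return result

    def ext_gcd(a, b):
        """Extended Euclidean algorithm: returns (g, x, y) with a*x + b*y = g."""
        if b == 0:
            return a, 1, 0
        g, x, y = ext_gcd(b, a % b)
        return g, y, x - (a // b) * y

    def mod_inv(a, m):
        g, x, _ = ext_gcd(a, m)
        if g != 1:
            raise ValueError("inverse does not exist")
        return x % m

    def crt_rsa_decrypt(c, p, q, d):
        """Decrypt c with private exponent d, using the CRT to work modulo p and q."""
        dp, dq = d % (p - 1), d % (q - 1)
        mp = mod_exp(c % p, dp, p)
        mq = mod_exp(c % q, dq, q)
        q_inv = mod_inv(q, p)
        h = (q_inv * (mp - mq)) % p
        return mq + h * q

    # Toy textbook key (p=61, q=53, e=17), purely illustrative. RSA is
    # multiplicatively homomorphic: Enc(m1)*Enc(m2) mod n decrypts to m1*m2.
    p, q, e = 61, 53, 17
    n = p * q
    d = mod_inv(e, (p - 1) * (q - 1))
    c1, c2 = mod_exp(42, e, n), mod_exp(7, e, n)
    assert crt_rsa_decrypt((c1 * c2) % n, p, q, d) == (42 * 7) % n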

    Challenges and opportunities beyond structured data in analysis of electronic health records

    Electronic health records (EHR) contain a wealth of valuable information about individual patients and the whole population. Besides structured data, unstructured data in EHRs can provide extra, valuable information, but the analytics processes are complex, time-consuming, and often require excessive manual effort. Among unstructured data, clinical text and images are the two most common and important sources of information. Advanced statistical algorithms in natural language processing, machine learning, deep learning, and radiomics have increasingly been used for analyzing clinical text and images. Although many challenges that can hinder the use of unstructured data have not been fully addressed, there are clear opportunities for well-designed diagnosis and decision support tools that efficiently incorporate both structured and unstructured data to extract useful information and provide better outcomes. However, access to clinical data is still very restricted due to data sensitivity and ethical issues. Data quality is also an important challenge, for which methods that improve data completeness, conformity, and plausibility are needed. Further, generalizing and explaining the results of machine learning models are important open problems for healthcare. A possible way to improve the quality and accessibility of unstructured data is to develop machine learning methods that can generate clinically relevant synthetic data, and to accelerate further research on privacy-preserving techniques such as deidentification and pseudonymization of clinical text.
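    As a rough illustration of the pseudonymization idea mentioned above (not a production clinical deidentification tool), a deterministic scheme can map known identifiers to stable surrogates and mask simple patterns such as dates and record numbers; the key, patterns, and example note below are hypothetical.

    import hmac, hashlib, re

    SECRET_KEY = b"replace-with-a-managed-secret"   # hypothetical key management

    def pseudonym(identifier: str, prefix: str = "PATIENT") -> str:
        """Map an identifier to a stable, non-reversible surrogate."""
        digest = hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()
        return f"{prefix}_{digest[:8]}"

    def deidentify(text: str, known_names: list) -> str:
        for name in known_names:
            text = text.replace(name, pseudonym(name))
        text = re.sub(r"\b\d{4}-\d{2}-\d{2}\b", "[DATE]", text)   # ISO dates
        text = re.sub(r"\bMRN[: ]*\d+\b", "[MRN]", text)          # record numbers
        return text

    note = "Jane Doe (MRN: 123456) was admitted on 2021-03-14."
    print(deidentify(note, ["Jane Doe"]))
    # e.g. "PATIENT_xxxxxxxx ([MRN]) was admitted on [DATE]."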

    LACE: Supporting Privacy-Preserving Data Sharing in Transfer Defect Learning

    Cross-Project Defect Prediction (CPDP) is a field of study in which an organization lacking enough local data can use data from other organizations or projects to build defect predictors. Research in CPDP has shown challenges in using "other" data, so transfer defect learning has emerged to improve the quality of CPDP results. With this newfound success in CPDP, it is now increasingly important to focus on the privacy concerns of data owners.

    To support CPDP, data must be shared, yet many privacy threats inhibit data sharing. We focus on sensitive attribute disclosure threats or attacks, where an attacker seeks to associate a record (or records) in a data set with its sensitive information. Solutions to this sharing problem come from the field of Privacy-Preserving Data Publishing (PPDP), which has emerged as a means to confuse the efforts of sensitive attribute disclosure attacks and therefore reduce privacy concerns. PPDP covers methods and tools used to disguise raw data for publishing. However, prior work warned that increasing data privacy decreases the efficacy of data mining on privatized data.

    The goal of this research is to encourage organizations and individuals to share their data publicly and/or with each other, for research purposes and/or to improve the quality of their software products through defect prediction. The contributions of this work offer three benefits to data owners willing to share privatized data: 1) they are fully aware of the sensitive attribute disclosure risks involved, so they can make an informed decision about what to share; 2) they are able to privatize their data and have it remain useful; and 3) they can work with others to share their data based on what they learn from each other's data. We call this private multiparty data sharing.

    To achieve these benefits, this dissertation presents LACE (Large-scale Assurance of Confidentiality Environment). LACE incorporates a privacy metric called IPR (Increased Privacy Ratio), which calculates the risk of sensitive attribute disclosure by comparing the results of queries (attacks) on the original data and on a privatized version of that data. LACE also includes a privacy algorithm that uses intelligent instance selection to prune the data to as little as 10% of the original (thus offering complete privacy to the other 90%). It then mutates the remaining data, making it possible for over 70% of sensitive attribute disclosure attacks to be unsuccessful. Finally, LACE can facilitate private multiparty data sharing via a unique leader-follower algorithm developed for this dissertation. The algorithm allows data owners to serially build a privatized data set by contributing only data that are not already in the private cache. In this scenario, each data owner shares even less of their data, some as little as 2%.

    The experiments of this thesis lead to the following conclusion: at least for the defect data studied here, data can be minimized, privatized, and shared without a significant degradation in utility. Specifically, in comparative studies with standard privacy models (k-anonymity and data swapping), applied to 10 open-source data sets and 3 proprietary data sets, LACE produces privatized data sets that are significantly smaller than the original data (as small as 2%). As a result, LACE offers better protection against sensitive attribute disclosure attacks than the other methods.
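    The exact IPR definition is given in the dissertation; the following is only a hedged sketch of the query-comparison idea it builds on: run the same attack queries against the original and the privatized data and measure how often the privatized version stops revealing the true sensitive value. The attribute names, rows, and queries are hypothetical.

    from collections import Counter

    def most_common_sensitive(rows, query, sensitive_key):
        """Majority sensitive value among rows matching the quasi-identifier query."""
        matches = [r[sensitive_key] for r in rows
                   if all(r.get(k) == v for k, v in query.items())]
        return Counter(matches).most_common(1)[0][0] if matches else None

    def privacy_gain(original, privatized, queries, sensitive_key="defects"):
        """Fraction of attack queries whose inferred sensitive value changes
        (or disappears) after privatization -- higher means more protection."""
        protected = 0
        for q in queries:
            truth = most_common_sensitive(original, q, sensitive_key)
            leaked = most_common_sensitive(privatized, q, sensitive_key)
            if truth is not None and leaked != truth:
                protected += 1
        return protected / len(queries) if queries else 0.0

    # Hypothetical defect-data rows and attack queries, for illustration only.
    original = [{"loc": "high", "cbo": "low", "defects": 1},
                {"loc": "high", "cbo": "low", "defects": 1},
                {"loc": "low",  "cbo": "low", "defects": 0}]
    privatized = [{"loc": "high", "cbo": "low", "defects": 0},
                  {"loc": "low",  "cbo": "low", "defects": 0}]
    queries = [{"loc": "high", "cbo": "low"}, {"loc": "low"}]
    print(privacy_gain(original, privatized, queries))   # 0.5 in this toy example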

    Privacy by Design in Data Mining

    Privacy is an ever-growing concern in our society: the lack of reliable privacy safeguards in many current services and devices often limits their diffusion more than expected. Moreover, people are reluctant to provide true personal data unless it is absolutely necessary. Thus, privacy is becoming a fundamental aspect to take into account when one wants to use, publish, and analyze data involving sensitive information. Many recent research works have focused on the study of privacy protection: some of these studies aim at individual privacy, i.e., the protection of sensitive individual data, while others aim at corporate privacy, i.e., the protection of strategic information at the organization level. Unfortunately, it is increasingly hard to transform the data in a way that protects sensitive information: we live in the era of big data, characterized by unprecedented opportunities to sense, store, and analyze complex data that describes human activities in great detail and resolution. As a result, anonymization cannot be accomplished simply by de-identification. In the last few years, several techniques for creating anonymous or obfuscated versions of data sets have been proposed, which essentially aim to find an acceptable trade-off between data privacy on the one hand and data utility on the other. So far, the common finding is that no general method exists which is capable of both dealing with “generic personal data” and preserving “generic analytical results”. In this thesis we propose the design of technological frameworks to counter the threats of undesirable, unlawful effects of privacy violation, without obstructing the knowledge discovery opportunities of data mining technologies. Our main idea is to inscribe privacy protection into the knowledge discovery technology by design, so that the analysis incorporates the relevant privacy requirements from the start. Therefore, we propose the privacy-by-design paradigm, which sheds new light on the study of privacy protection: once specific assumptions are made about the sensitive data and the target mining queries that are to be answered with the data, it is conceivable to design a framework to a) transform the source data into an anonymous version with a quantifiable privacy guarantee, and b) guarantee that the target mining queries can be answered correctly using the transformed data instead of the original ones. This thesis investigates two new research issues which arise in modern Data Mining and Data Privacy: individual privacy protection in data publishing while preserving specific data mining analysis, and corporate privacy protection in data mining outsourcing.

    On security and privacy of consensus-based protocols in blockchain and smart grid

    In recent times, distributed consensus protocols have received widespread attention in the areas of blockchain and smart grid. Consensus algorithms aim to solve an agreement problem among a set of nodes in a distributed environment. Participants in a blockchain use consensus algorithms to agree on data blocks containing an ordered set of transactions. Similarly, agents in the smart grid employ consensus to agree on specific values (e.g., energy output, market-clearing price, control parameters) in distributed energy management protocols. This thesis focuses on the security and privacy aspects of a few popular consensus-based protocols in blockchain and smart grid. In the blockchain area, we analyze the consensus protocol of one of the most popular payment systems: Ripple. We show how the parameters chosen by the Ripple designers do not prevent the occurrence of forks in the system. Furthermore, we provide the conditions to prevent any fork in the Ripple network. In the smart grid area, we discuss the privacy issues in the Economic Dispatch (ED) optimization problem and some of its recent solutions using distributed consensus-based approaches. We analyze two state-of-the-art consensus-based ED protocols from Yang et al. (2013) and Binetti et al. (2014), and show how these protocols leak private information about the participants. We propose privacy-preserving versions of these consensus-based ED protocols and, in some cases, also improve upon the communication cost.
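    Consensus-based economic dispatch protocols build on the standard distributed averaging iteration, sketched below under illustrative assumptions (a 4-node path graph with Metropolis weights); this is not the specific protocol of Yang et al. or Binetti et al., nor its privacy-preserving variant.

    import numpy as np

    def average_consensus(x0, W, iterations=200):
        """Iterate x(k+1) = W x(k) with a doubly stochastic weight matrix W;
        on a connected graph every agent's state converges to the average."""
        x = np.array(x0, dtype=float)
        for _ in range(iterations):
            x = W @ x
        return x

    # Metropolis weights for the path graph 0-1-2-3 (symmetric, rows sum to 1).
    W = np.array([[2/3, 1/3, 0.0, 0.0],
                  [1/3, 1/3, 1/3, 0.0],
                  [0.0, 1/3, 1/3, 1/3],
                  [0.0, 0.0, 1/3, 2/3]])

    x0 = [4.0, 8.0, 1.0, 3.0]          # e.g. local incremental-cost estimates
    print(average_consensus(x0, W))    # all entries approach the mean, 4.0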