211 research outputs found

    Improved Technique for Preserving Privacy while Mining Real Time Big Data

    With the evolution of Big Data, data owners require the assistance of a third party (e.g., the cloud) to store and analyse data and obtain information at a lower cost. However, maintaining privacy is a challenge in such scenarios, as outsourcing may reveal sensitive information. Existing research discusses different techniques to implement privacy in the original data using anonymization, randomization, and suppression. However, those techniques are not scalable, suffer from information loss, and do not support real-time data, and are hence not suitable for privacy-preserving big data mining. In this research, a novel two-level privacy approach is proposed using pseudonymization and homomorphic encryption in the Spark framework. Several simulations are carried out on the collected dataset. The results show that execution time is reduced by 50% and privacy is enhanced by 10%. This scheme is suitable for both privacy-preserving Big Data publishing and mining.
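
    The abstract names its two building blocks but gives no implementation details. A minimal sketch of the combination it describes, pseudonymization of identifiers plus additively homomorphic encryption of numeric fields, might look like the following; the python-paillier (`phe`) package, the secret key, and the toy records are assumptions, and the Spark integration is omitted.

```python
import hmac
import hashlib
from phe import paillier  # python-paillier: additively homomorphic (Paillier) encryption

SECRET_SALT = b"org-wide-secret"  # hypothetical pseudonymization key held by the data owner
public_key, private_key = paillier.generate_paillier_keypair()

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed, irreversible pseudonym (level 1)."""
    return hmac.new(SECRET_SALT, identifier.encode(), hashlib.sha256).hexdigest()

# Toy records: (user_id, numeric value) -- illustrative only
records = [("alice", 120.0), ("bob", 75.5), ("alice", 30.0)]

# Level 1: pseudonymize identifiers; level 2: encrypt the numeric values
protected = [(pseudonymize(uid), public_key.encrypt(value)) for uid, value in records]

# A third party can aggregate without ever seeing plaintext values
encrypted_total = protected[0][1]
for _, enc_value in protected[1:]:
    encrypted_total += enc_value  # homomorphic addition on ciphertexts

print(private_key.decrypt(encrypted_total))  # 225.5, recovered only by the key holder
```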

    The Danish National Energy Data Lake: Requirements, Technical Architecture, and Tool Selection


    An Integrated Big and Fast Data Analytics Platform for Smart Urban Transportation Management

    Smart urban transportation management can be considered a multifaceted big data challenge. It relies strongly on information collected from multiple, widespread, and heterogeneous data sources, as well as on the ability to extract actionable insights from them. Besides data, full-stack (from platform to services and applications) Information and Communications Technology (ICT) solutions need to be specifically adopted to address smart-city challenges. Smart urban transportation management is one of the key use cases addressed in the context of the EUBra-BIGSEA (Europe-Brazil Collaboration of Big Data Scientific Research through Cloud-Centric Applications) project. This paper specifically focuses on the City Administration Dashboard, a public transport analytics application developed on top of the EUBra-BIGSEA platform and used by municipal stakeholders in Curitiba, Brazil, to tackle urban traffic data analysis and planning challenges. The solution proposed in this paper joins together a scalable big and fast data analytics platform, a flexible and dynamic cloud infrastructure, data quality and entity matching algorithms, as well as security and privacy techniques. By exploiting an interoperable programming framework based on a Python Application Programming Interface (API), it allows easy, rapid, and transparent development of smart-city applications. This work was supported by the European Commission through the Cooperation Programme under the EUBra-BIGSEA Horizon 2020 Grant [this project results from the 3rd BR-EU Coordinated Call in Information and Communication Technologies (ICT), announced by the Ministério de Ciência, Tecnologia e Inovação (MCTI)] under Grant 690116.

    Analytics infrastructure for aggregated data

    Medical data aggregated at a national level can bring a large number of insights to medical researchers and pharmaceutical companies. Research on the aggregated data will allow improving patient care, improving preventive healthcare, and measuring the effectiveness of medications and medical treatment. Medworq has developed a solution that aggregates medical data at a regional level from different healthcare sources. The goal of this project is to design and implement a solution that aggregates healthcare data at a national level. Because medical data is highly sensitive, the main challenge of the data aggregation is to preserve the privacy of individuals. This report describes a project to analyze the data aggregation process, its challenges and use cases within Medworq, and to design a system that provides the data aggregation functionality. An inventory and comparative analysis of existing privacy-preserving techniques are provided in the report. The main deliverable of this project is a system that provides the data aggregation functionality together with a configurable choice of privacy-preserving techniques. The system was verified on real patient data. The report also gives recommendations for future improvements.
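
    The report itself is the deliverable; purely as an illustration of a "configurable choice of privacy-preserving techniques", the field names and technique labels below are hypothetical and not taken from the Medworq system.

```python
import hashlib

# Hypothetical per-field configuration: which technique protects which column
CONFIG = {
    "patient_name": "suppress",
    "postal_code":  "generalize",
    "diagnosis":    "keep",
    "patient_id":   "hash",
}

def generalize_postal(code: str) -> str:
    """Coarsen a postal code to its first two characters (region level)."""
    return code[:2] + "**"

def apply_privacy(record: dict) -> dict:
    out = {}
    for field, value in record.items():
        technique = CONFIG.get(field, "suppress")  # unknown fields are dropped by default
        if technique == "keep":
            out[field] = value
        elif technique == "hash":
            out[field] = hashlib.sha256(str(value).encode()).hexdigest()[:12]
        elif technique == "generalize":
            out[field] = generalize_postal(str(value))
        # "suppress": the field is omitted entirely
    return out

print(apply_privacy({"patient_name": "J. Doe", "postal_code": "5612AB",
                     "diagnosis": "J45", "patient_id": 1042}))
```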

    Anonymization of Event Logs for Network Security Monitoring

    A managed security service provider (MSSP) must collect security event logs from their customers’ networks for monitoring and cybersecurity protection. These logs need to be processed by the MSSP before being displayed to the security operation center (SOC) analysts. Employees generate event logs during their working hours at the customers’ sites. One challenge is that the collected event logs contain personally identifiable information (PII), visible in clear text to the SOC analysts or to any user with access to the SIEM platform. We explore how pseudonymization can be applied to security event logs to help protect individuals’ identities from the SOC analysts while preserving data utility when possible. We compare the impact of using different pseudonymization functions on sensitive information or PII. Non-deterministic methods provide a higher level of privacy but reduce the utility of the data. Our contribution in this thesis is threefold. First, we study available architectures with different threat models, including their strengths and weaknesses. Second, we study pseudonymization functions and their application to PII fields; we benchmark them individually, as well as in our experimental platform. Last, we obtain valuable feedback and lessons from SOC analysts based on their experience. Existing works [43, 44, 48, 39] are generally restricted to the anonymization of IP traces, which is only one part of the SOC analysts’ investigation when inspecting PCAP files. In one of the closest works [47], the authors provide useful, practical anonymization methods for IP addresses, ports, and raw logs.
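
    The deterministic versus non-deterministic trade-off described in the abstract can be sketched for a single IP field as below; the key, function names, and log lines are illustrative assumptions, not the thesis's actual functions.

```python
import hmac
import hashlib
import secrets

KEY = b"soc-pseudonymization-key"  # hypothetical shared secret held by the MSSP

def deterministic_pseudonym(ip: str) -> str:
    """Same input always maps to the same token, so analysts can still
    correlate events from one host across the whole log stream."""
    return "ip-" + hmac.new(KEY, ip.encode(), hashlib.sha256).hexdigest()[:10]

def random_pseudonym(ip: str) -> str:
    """A fresh random token per event: stronger privacy, but correlation
    across events (and hence much of the data's utility) is lost."""
    return "ip-" + secrets.token_hex(5)

events = ["10.0.0.5 login failed", "10.0.0.5 login failed", "10.0.0.9 login ok"]

for line in events:
    ip, _, rest = line.partition(" ")
    print(deterministic_pseudonym(ip), rest, "|", random_pseudonym(ip), rest)
```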

    LETEO: Scalable anonymization of big data and its application to learning analytics

    ANII Fondo Sectorial de Investigación con Datos - 2018. Created in 2007, Plan Ceibal is an inclusion and equal opportunities plan with the aim of supporting Uruguayan educational policies with technology. Throughout these years, and within the framework of its tasks, Ceibal has accumulated an important amount of data related to the use of technology in education, needed to manage the plan and fulfill its assigned legal tasks. However, this data cannot be studied without first addressing the problem of de-identifying the users of the Plan. To exploit this data, Ceibal has deployed an instance of the Hortonworks Data Platform (HDP), an open-source platform for the storage and parallel processing of massive data (big data). HDP offers a wide range of functional components, ranging from large file storage (HDFS) to distributed programming of machine learning algorithms (Apache Spark / MLlib). However, as of today there are no open-source solutions for the de-identification of personal data that are integrated into the Hortonworks ecosystem. On the one hand, existing de-identification tools have not been designed to scale easily to large volumes of data, and they do not offer easy integration mechanisms with HDFS. This forces the data to be exported outside the platform that stores it in order to anonymize it, with the consequent risk of exposure of confidential information. On the other hand, the few solutions integrated into the Hortonworks ecosystem are proprietary and the cost of their licenses is very significant. The objective of this project is to promote the use of the enormous amount of educational and technological data that Ceibal possesses, removing one of the greatest obstacles to doing so, namely, the preservation of privacy and the protection of the personal data of the beneficiaries of the Plan. To this end, this project seeks to build anonymization tools that extend the HDP platform. In particular, it seeks to develop open-source modules to integrate into said platform, implementing a set of anonymization techniques and algorithms programmed in a distributed manner using Apache Spark and applicable to data sets stored in HDFS files.
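
    As a rough sketch of what a Spark-based anonymization step over HDFS-resident data could look like (the column names, HDFS paths, and generalization rule are assumptions for illustration, not the project's actual modules):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("anonymize-sketch").getOrCreate()

# Read the sensitive dataset directly from HDFS, so it never leaves the cluster
df = spark.read.parquet("hdfs:///data/ceibal/usage_logs")  # hypothetical path

anonymized = (
    df.drop("student_name")                                            # suppress a direct identifier
      .withColumn("student_id", F.sha2(F.col("student_id").cast("string"), 256))  # pseudonymize
      .withColumn("age", F.floor(F.col("age") / 5) * 5)                # generalize age into 5-year bands
)

anonymized.write.mode("overwrite").parquet("hdfs:///data/ceibal/usage_logs_anon")
```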

    BIG DATA ANALYTICS - AN OVERVIEW

    Big Data Analytics has been gaining more attention recently, since researchers in business and academia are trying to effectively mine and use all possible knowledge from the vast amount of data being generated and collected. Traditional data analysis methods stumble over such large amounts of data arriving in a short period of time, demanding a paradigm shift in the storage, processing, and analysis of Big Data. Because of its importance, many agencies, including the U.S. government, have in recent years released large funds for research in Big Data and related fields. This paper gives a concise summary of research growth in various areas related to Big Data processing and analysis and concludes with a discussion of research directions in those areas.

    Data Profiling in Cloud Migration: Data Quality Measures while Migrating Data from a Data Warehouse to the Google Cloud Platform

    Internship Report presented as a partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. Nowadays, corporations have gained a vast interest in data. More and more, companies have realized that the key to improving their efficiency and effectiveness, and to understanding their customers’ needs and preferences better, is reachable by mining data. However, as the amount of data grows, so do companies’ needs for storage capacity and for ensuring data quality for more accurate insights. As such, new data storage methods must be considered, evolving from old ones while still keeping data integrity. Migrating a company’s data from an old method like a Data Warehouse to a new one, the Google Cloud Platform, is an elaborate task, even more so when data quality needs to be assured and sensitive data, like Personally Identifiable Information, needs to be anonymized in a cloud computing environment. To ensure these points, profiling the data, before or after it is migrated, has significant value: a profile is designed for the data available in each data source (e.g., databases, files, and others) based on statistics, metadata information, and pattern rules. This ensures that data quality is within reasonable standards, measured through statistical metrics, and that all Personally Identifiable Information is identified and anonymized accordingly. This work describes the process by which profiling Data Warehouse data can improve data quality for a better migration to the Cloud.
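
    A small illustration of the kind of per-column profile the abstract describes (basic statistics plus pattern rules for flagging likely PII); the column samples, thresholds, and regex patterns are examples, not the rules used in the internship.

```python
import re
from statistics import mean

# Hypothetical pattern rules flagging fields that look like PII
PII_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "phone": re.compile(r"^\+?\d{9,15}$"),
}

def profile_column(name, values):
    """Build a profile of one column: null ratio, cardinality, numeric stats or PII hints."""
    non_null = [v for v in values if v is not None]
    profile = {
        "column": name,
        "rows": len(values),
        "null_ratio": 1 - len(non_null) / len(values),
        "distinct": len(set(non_null)),
    }
    if non_null and all(isinstance(v, (int, float)) for v in non_null):
        profile.update(min=min(non_null), max=max(non_null), mean=mean(non_null))
    else:
        hits = {label: sum(bool(p.match(str(v))) for v in non_null)
                for label, p in PII_PATTERNS.items()}
        # Flag the column as likely PII if most values match a pattern rule
        profile["likely_pii"] = [k for k, n in hits.items() if n >= 0.8 * len(non_null)]
    return profile

print(profile_column("contact", ["a@x.com", "b@y.org", None, "c@z.net"]))
print(profile_column("amount", [10.5, 20.0, None, 12.25]))
```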

    Big Data Processing Attribute Based Access Control Security

    The purpose of this research is to analyze the security of next-generation big data processing (BDP) and examine the feasibility of applying advanced security features to meet the needs of modern multi-tenant, multi-level data analysis. The research methodology was to survey the status of security mechanisms in BDP systems and identify areas that require further improvement. Access control (AC) security services were identified as a priority area, specifically Attribute Based Access Control (ABAC). The exemplar BDP system analyzed is the Apache Hadoop ecosystem. We created data generation software and analysis programs, and posted the detailed experiment configuration on GitHub. Overall, our research indicates that before a BDP system such as Hadoop can be used in an operational environment, significant security configuration is required. We believe that the tools are available to achieve a secure system, with ABAC, using Apache Ranger and Apache Atlas. However, these systems are immature and require verification by an independent third party. We identified the following specific actions for overall improvement: consistent provisioning of security services through a data analyst workstation, a common backplane of security services, and a management console. These areas are partially satisfied in the current Hadoop ecosystem; continued AC improvements through the open-source community and rigorous independent testing should further address the remaining security challenges. Robust security will enable further use of distributed, clustered BDP systems, such as Apache Hadoop and Hadoop-like systems, to meet future government and business requirements.
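
    The paper evaluates ABAC through Apache Ranger and Apache Atlas; as a generic illustration of the attribute-based decision model itself (not Ranger's actual API or policy format), a policy check could be expressed as follows.

```python
# Minimal ABAC sketch: a request is allowed only if some policy's attribute
# rules all match the subject and resource attributes for the requested action.
POLICIES = [
    {   # hypothetical policy: analysts may read data tagged at their clearance or below
        "action": "read",
        "subject": {"role": "analyst"},
        "resource": {"classification": {"public", "internal"}},
    },
]

def is_allowed(subject: dict, resource: dict, action: str) -> bool:
    for policy in POLICIES:
        if policy["action"] != action:
            continue
        subject_ok = all(subject.get(k) == v for k, v in policy["subject"].items())
        resource_ok = all(resource.get(k) in allowed
                          for k, allowed in policy["resource"].items())
        if subject_ok and resource_ok:
            return True
    return False

print(is_allowed({"role": "analyst"}, {"classification": "internal"}, "read"))    # True
print(is_allowed({"role": "analyst"}, {"classification": "restricted"}, "read"))  # False
```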