
    Privacy Preservation in Analyzing E-Health Records in Big Data Environment

    Increased use of the Internet and progress in cloud computing are creating large new data sets of growing value to business. The data that cloud applications must process are growing much faster than the available computing power, and Hadoop MapReduce has become a powerful computation model for addressing this gap. Many cloud services now require users to share confidential data such as electronic health records for research analysis or data mining, which raises privacy concerns. K-anonymity is one of the most widely used privacy models. The scale of data in cloud applications rises sharply in line with the Big Data trend, making it a challenge for conventional software tools to process such large data sets within a tolerable elapsed time. Consequently, it is difficult for current anonymization techniques to preserve privacy on large confidential data sets, because they do not scale. In this project, we propose a scalable two-phase approach to anonymizing large data sets using a dynamic MapReduce framework, the Top-Down Specialization (TDS) algorithm and the k-anonymity privacy model. Resources are optimized via three key mechanisms. First, the under-utilization of map and reduce tasks is reduced through Dynamic Hadoop Slot Allocation (DHSA). Second, the performance tradeoff between a single job and a batch of jobs is balanced using Speculative Execution Performance Balancing (SEPB). Third, data locality is improved without any impact on fairness using Slot PreScheduling. Experimental evaluation results demonstrate that the scalability, efficiency and privacy of data sets can be significantly improved over existing approaches. DOI: 10.17762/ijritcc2321-8169.160413
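
    The k-anonymity model named above requires that every combination of quasi-identifier values appear in at least k records. As an illustration only (a minimal sketch, not this project's implementation; the column names and toy records are assumptions), the following Python check expresses the invariant a Top-Down Specialization pass must re-verify after each specialization step:

        from collections import Counter

        def is_k_anonymous(records, quasi_identifiers, k):
            """True if every quasi-identifier combination occurs in >= k records."""
            groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
            return all(count >= k for count in groups.values())

        # Hypothetical generalized health records: ages bucketed, ZIP codes truncated.
        records = [
            {"age": "30-39", "zip": "130**", "diagnosis": "flu"},
            {"age": "30-39", "zip": "130**", "diagnosis": "asthma"},
            {"age": "40-49", "zip": "148**", "diagnosis": "flu"},
            {"age": "40-49", "zip": "148**", "diagnosis": "diabetes"},
        ]
        print(is_k_anonymous(records, ["age", "zip"], k=2))  # True

    TDS starts from fully generalized values and specializes them step by step, rejecting any specialization that would make this check fail.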

    Building Efficient Large-Scale Big Data Processing Platforms

    In the era of big data, many cluster platforms and resource management schemes have been created to satisfy the increasing demands of processing large volumes of data. A typical big data processing job consists of multiple stages, each representing a generic data operation such as filtering or sorting. To parallelize job execution in a cluster, each stage comprises a number of identical tasks that can be launched concurrently on multiple servers. Practical clusters often involve hundreds or thousands of servers processing a large batch of jobs, so resource management, which governs cluster resource allocation and job execution, is critical for system performance. Generally speaking, there are three main challenges in resource management for these new big data processing systems. First, with tasks pending from different jobs and stages, it is difficult to determine which ones deserve priority for execution, given the tasks' differing characteristics such as resource demand and execution time. Second, there are dependencies among tasks that can run concurrently: for any two consecutive stages of a job, the output data of the former stage is the input data of the latter, and resource management must comply with this dependency. The third challenge is the inconsistent performance of cluster nodes: in practice, the run-time performance of every server varies, so resource management needs to dynamically adjust allocation according to each server's performance changes. Resource management in existing platforms and prior work often relies on fixed, user-specified configurations and assumes consistent performance on each node; the resulting performance is not satisfactory under varied workloads. This dissertation explores new approaches to improving the efficiency of large-scale big data processing platforms. In particular, run-time dynamic factors are carefully considered when the system allocates resources. New algorithms are developed to collect run-time data and predict the characteristics of jobs and of the cluster, and resource management schemes are further developed that dynamically tune the resource allocation for each stage of every running job in the cluster. The new findings and techniques in this dissertation should provide valuable and inspiring insights into similar problems in the research community
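
    To make the third challenge concrete, the sketch below (illustrative only; the greedy heuristic and all names are assumptions, not the dissertation's algorithms) places tasks using freshly measured per-node speeds, so a node that slows down at run time automatically receives less work:

        import heapq

        def allocate(task_costs, node_speeds):
            """Greedy placement: each task goes to the node with the earliest
            estimated finish time, given measured node speeds (work units/sec)."""
            heap = [(0.0, n) for n in range(len(node_speeds))]  # (finish_time, node)
            heapq.heapify(heap)
            plan = {n: [] for n in range(len(node_speeds))}
            # Longest tasks first, a standard heuristic for a balanced makespan.
            for task in sorted(range(len(task_costs)), key=lambda t: -task_costs[t]):
                finish, node = heapq.heappop(heap)
                plan[node].append(task)
                heapq.heappush(heap, (finish + task_costs[task] / node_speeds[node], node))
            return plan, max(f for f, _ in heap)

        # Node 2 was measured at a quarter of node 0's speed, so it gets less work.
        plan, makespan = allocate([8, 5, 5, 3, 2], node_speeds=[2.0, 1.0, 0.5])
        print(plan, round(makespan, 2))  # {0: [0, 3], 1: [1, 4], 2: [2]} 10.0

    Re-running such an allocator as speed measurements change is one simple way to let placement track inconsistent node performance rather than assume it away.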

    A Survey on Job and Task Scheduling in Big Data

    Big Data deals with data sets that exceed the ability of commonly used software tools to store, share and process them. Workload classification is a major issue for the Big Data community, namely job type evolution and job size evolution. On the basis of job type, job size and disk performance, clusters are formed with a data node, a name node and a secondary name node. To classify the workload and perform job scheduling, the MapReduce algorithm is applied, and workload is allocated based on the performance of each individual machine. MapReduce processes data in two phases: map and reduce. In the map phase, the input data set is split into key-value pairs and an intermediate output is produced; in the reduce phase, those key-value pairs undergo shuffle and sort operations. Intermediate files created by map tasks are written to local disk, while output files are written to the Hadoop distributed file system. The scheduling of different jobs to different disks is determined after the MapReduce tasks complete. The Johnson algorithm is used to schedule the jobs and to find an optimal ordering for them (as sketched below); it places jobs into different pools and performs the scheduling. The main task is to minimize the computation time of the entire set of jobs and to analyze performance using response-time factors in the Hadoop distributed file system. Based on the data set size and the number of nodes in the Hadoop cluster, the performance of individual jobs is identified.
    Keywords — hadoop; mapreduce; johnson algorithm
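
    For reference, Johnson's rule for this two-stage (map-then-reduce) setting is short enough to state in full. A minimal sketch with hypothetical job times follows (the function names and data are assumptions; the rule itself is the classical two-machine flow-shop result the survey cites):

        def johnson_order(jobs):
            """jobs: {name: (map_time, reduce_time)} -> names in makespan-optimal order."""
            front = sorted((n for n, (m, r) in jobs.items() if m <= r),
                           key=lambda n: jobs[n][0])               # shortest map first
            back = sorted((n for n, (m, r) in jobs.items() if m > r),
                          key=lambda n: jobs[n][1], reverse=True)  # shortest reduce last
            return front + back

        def makespan(order, jobs):
            map_done = reduce_done = 0.0
            for n in order:
                m, r = jobs[n]
                map_done += m                                  # maps run back to back
                reduce_done = max(reduce_done, map_done) + r   # a reduce waits for its map
            return reduce_done

        jobs = {"J1": (3, 6), "J2": (5, 2), "J3": (1, 2), "J4": (6, 6)}
        order = johnson_order(jobs)
        print(order, makespan(order, jobs))  # ['J3', 'J1', 'J4', 'J2'] 18.0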