
    A genetic algorithm coupled with tree-based pruning for mining closed association rules

    Due to the voluminous number of itemsets that are generated, the association rules extracted from them contain redundancy, and designing an effective approach to address this issue is of paramount importance. Although multiple algorithms have been proposed in recent years for mining closed association rules, most of them underperform in terms of run time or memory. Another persistent challenge is the nature of the dataset: some of the existing algorithms perform well on dense datasets, while others perform well on sparse ones. This paper aims to address these drawbacks by using a genetic algorithm for mining closed association rules. Recent studies have shown that genetic algorithms can outperform conventional algorithms because their crossover and mutation operators can be implemented as bitwise operations. Bitwise operations are substantially faster than conventional approaches, and bit-level encodings consume less memory, improving the overall performance of the algorithm. To address the redundancy in the mined association rules, a tree-based pruning algorithm is designed here; it works on the principle of minimal antecedent and maximal consequent. Experiments show that the proposed approach works well on both dense and sparse datasets while surpassing existing techniques in run time and memory.
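
    The bitwise claim can be made concrete with a small sketch. The following is an illustrative encoding under assumed details, not the paper's implementation: each candidate itemset is an integer bitmask (bit i set means item i is present), so crossover and mutation reduce to masked OR/AND and XOR operations, and support counting becomes one AND test per transaction. N_ITEMS and the operator choices are hypothetical.

        import random

        N_ITEMS = 16  # hypothetical number of distinct items

        def crossover(parent_a, parent_b):
            """Uniform crossover: a random mask picks each bit from one parent."""
            mask = random.getrandbits(N_ITEMS)
            return (parent_a & mask) | (parent_b & ~mask)

        def mutate(chromosome, rate=0.05):
            """Flip each bit independently with probability `rate` via XOR."""
            for i in range(N_ITEMS):
                if random.random() < rate:
                    chromosome ^= 1 << i
            return chromosome

        def support(itemset, transactions):
            """Fraction of transactions (also bitmasks) containing the itemset."""
            hits = sum(1 for t in transactions if t & itemset == itemset)
            return hits / len(transactions)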

    Improvement of Data-Intensive Applications Running on Cloud Computing Clusters

    MapReduce, designed by Google, is the most widely used distributed programming model in cloud environments. Hadoop, an open-source implementation of MapReduce, is a data management framework that runs on large clusters of commodity machines to handle data-intensive applications. Many prominent enterprises, including Facebook, Twitter, and Adobe, use Hadoop for their data-intensive processing needs. Task stragglers in MapReduce jobs dramatically impede job execution on massive datasets in cloud computing systems. This impedance is due to the uneven distribution of input data and computation load among cluster nodes, heterogeneous data nodes, data skew in the reduce phase, resource contention, and network configuration. All of these can cause delays, failures, and violations of job completion time. One of the key issues that can significantly affect the performance of cloud computing is the balancing of computation load among cluster nodes. Replica placement in the Hadoop distributed file system plays a significant role in data availability and the balanced utilization of clusters. Under the current replica placement policy (RPP) of the Hadoop distributed file system (HDFS), the replicas of data blocks cannot be evenly distributed across the cluster's nodes, so HDFS must rely on a load-balancing utility, which incurs extra overhead in time and resources. This dissertation addresses the data load balancing problem and presents an innovative replica placement policy for HDFS that evenly balances the data load among the cluster's nodes. Because the heterogeneity of cluster nodes exacerbates the issue of computational load balancing, another replica placement algorithm is proposed for heterogeneous cluster environments. The timing of identifying a straggler map task is critical for straggler mitigation in data-intensive cloud computing. To mitigate straggler map tasks, the Present progress and Feedback based Speculative Execution (PFSE) algorithm is proposed: a new straggler identification scheme that identifies straggler map tasks based on feedback from completed tasks alongside the progress of the currently running task. Straggler reduce tasks aggravate violations of MapReduce job completion time and are typically the result of bad data partitioning during the reduce phase: the hash partitioner employed by Hadoop may cause intermediate data skew, which results in straggler reduce tasks. A new partitioning scheme, named Balanced Data Clusters Partitioner (BDCP), is therefore proposed to mitigate straggler reduce tasks. BDCP is based on sampling the input data and on feedback about the currently processing task; it assists in straggler mitigation during the reduce phase and minimizes job completion time. The results of extensive experiments corroborate that the algorithms and policies proposed in this dissertation improve the performance of data-intensive applications running on cloud platforms.
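
    The reduce-phase skew problem and the sampling-based remedy can be sketched as follows. This shows only the general idea behind skew-aware partitioners such as BDCP, not the dissertation's algorithm; the greedy least-loaded placement is an assumed simplification.

        from collections import Counter

        def build_partition_table(sampled_keys, num_reducers):
            """Greedily assign sampled keys to the least-loaded reducer."""
            freq = Counter(sampled_keys)
            load = [0] * num_reducers
            table = {}
            # Place heavy keys first so no single reducer accumulates them.
            for key, count in freq.most_common():
                target = load.index(min(load))
                table[key] = target
                load[target] += count
            return table

        def partition(key, table, num_reducers):
            """Keys unseen during sampling fall back to plain hashing."""
            return table.get(key, hash(key) % num_reducers)

    A plain hash partitioner sends every occurrence of a hot key to one reducer chosen blindly by the hash; pre-sampling key frequencies lets heavy keys be packed across reducers so each receives a roughly equal total load.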

    Understanding and Optimizing Flash-based Key-value Systems in Data Centers

    Flash-based key-value systems are widely deployed in today's data centers to provide high-speed data processing services. These systems deploy flash-friendly data structures, such as slabs and the Log-Structured Merge (LSM) tree, on flash-based Solid State Drives (SSDs) and provide efficient solutions for caching and storage scenarios. As data centers evolve rapidly, many challenges and opportunities for optimization arise. In this dissertation, we focus on understanding and optimizing flash-based key-value systems from the perspectives of workloads, software, and hardware. We first propose an on-line compression scheme, called SlimCache, that exploits the unique characteristics of key-value workloads to virtually enlarge the cache space, increase the hit ratio, and improve cache performance. Furthermore, to appropriately configure increasingly complex modern key-value data systems, which can have more than 50 parameters in addition to hardware and system settings, we quantitatively study and compare five multi-objective optimization methods for auto-tuning the performance of an LSM-tree based key-value store in terms of throughput, 99th-percentile tail latency, convergence time, real-time system throughput, and the iteration process. Last but not least, we conduct an in-depth, comprehensive measurement study of flash-optimized key-value stores on recently emerged 3D XPoint SSDs. We reveal several unexpected bottlenecks in current key-value store designs and present three case studies showing that simple methods can remove these bottlenecks on 3D XPoint SSDs. Our experimental results show that the proposed solutions significantly outperform traditional methods, and our study provides system implications for auto-tuning key-value systems on flash-based SSDs and optimizing them on 3D XPoint SSDs.
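
    The intuition behind on-line cache compression can be shown with a toy sketch; a real design such as SlimCache batches values into flash-friendly slabs and skips incompressible data, so every detail below (per-value zlib, FIFO eviction) is a simplifying assumption.

        import zlib

        class CompressedCache:
            """Toy cache that stores values compressed to fit more entries."""

            def __init__(self, capacity_bytes):
                self.capacity = capacity_bytes
                self.used = 0
                self.store = {}  # key -> compressed bytes (insertion-ordered)

            def put(self, key, value):
                if key in self.store:
                    self.used -= len(self.store.pop(key))
                blob = zlib.compress(value)
                self.store[key] = blob
                self.used += len(blob)
                # Evict oldest entries when over budget (FIFO for brevity;
                # real caches use LRU and slab-level bookkeeping).
                while self.used > self.capacity and len(self.store) > 1:
                    oldest = next(iter(self.store))
                    self.used -= len(self.store.pop(oldest))

            def get(self, key):
                blob = self.store.get(key)
                return zlib.decompress(blob) if blob is not None else None

    Holding compressed values means the same byte budget retains more key-value pairs, which is what raises the hit ratio.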

    A novel approach for the hardware implementation of a PPMC statistical data compressor

    This thesis aims to understand how to design high-performance compression algorithms suitable for hardware implementation and to provide hardware support for an efficient compression algorithm. Lossless data compression techniques have been developed to exploit the available bandwidth of applications in data communications and computer systems by reducing the amount of data they transmit or store. As the amount of data to handle keeps increasing, traditional methods for compressing data become insufficient. To overcome this problem, more powerful methods have been developed, among them the so-called statistical data compression methods, which compress data based on their statistics. However, their high complexity and space requirements have prevented their hardware implementation and the full exploitation of their potential benefits. This thesis examines the feasibility of implementing one of these statistical data compression methods, PPMC, in hardware by exploring how the method can be reorganised and restructured for hardware implementation and by investigating ways of achieving efficient and cost-effective designs. [Continues.]
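
    For orientation, the statistical core of a PPM-style compressor with method C escape estimation can be written in a few lines of software. This sketch shows only the modelling step: the returned probabilities would drive an arithmetic coder, escapes would fall back to shorter contexts, and nothing here reflects the hardware reorganisation the thesis investigates.

        from collections import defaultdict, Counter

        class PPMContextModel:
            """Order-k byte model with PPMC (method C) escape estimation."""

            def __init__(self, order=2):
                self.order = order
                self.tables = defaultdict(Counter)  # context -> symbol counts

            def probability(self, context, symbol):
                # context is a bytes object; only its last `order` bytes matter.
                counts = self.tables[context[-self.order:]]
                n, d = sum(counts.values()), len(counts)
                if counts[symbol] > 0:
                    return counts[symbol] / (n + d)  # symbol seen in context
                return d / (n + d) if n else 1.0     # escape probability mass

            def update(self, context, symbol):
                self.tables[context[-self.order:]][symbol] += 1

    Under method C, the escape event is given a count equal to the number of distinct symbols d seen in the context, so a seen symbol s gets probability c(s)/(n+d) and the escape gets d/(n+d).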

    Towards a secure and efficient search over encrypted cloud data

    Cloud computing enables new types of services where computational and network resources are available online through the Internet. One of the most popular of these services is data outsourcing: for reasons of cost and convenience, public as well as private organizations can outsource their large amounts of data to the cloud and enjoy the benefits of remote storage and management. At the same time, the confidentiality of data stored remotely on an untrusted cloud server is a serious concern. To reduce these concerns, sensitive data, such as personal health records, emails, and income tax and financial reports, are usually outsourced in encrypted form using well-known cryptographic techniques. Although encrypted storage protects remote data from unauthorized access, it complicates some basic yet essential data utilization services, such as plaintext keyword search. The simple solution of downloading the data, decrypting, and searching locally is clearly inefficient, since storing data in the cloud is of little value unless it can be easily searched and utilized. Thus, cloud services should enable efficient search on encrypted data to provide the benefits of a first-class cloud computing environment. This dissertation is concerned with developing novel searchable encryption techniques that allow the cloud server to perform multi-keyword ranked search as well as substring search incorporating position information. We present our results in this area, including a comprehensive evaluation of existing solutions and searchable encryption schemes for ranked search and substring position search.
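
    The basic building block of searchable encryption can be sketched with a keyed-hash index: the client tags each document with an HMAC of every keyword, and the server matches opaque "trapdoor" tokens without seeing plaintext. This minimal example is not the dissertation's multi-keyword ranked or substring-position scheme, and a real construction would also protect document identifiers and access patterns.

        import hmac, hashlib, os
        from collections import defaultdict

        def keyword_token(key, word):
            """Deterministic keyed token; only the client holds the key."""
            return hmac.new(key, word.lower().encode(), hashlib.sha256).digest()

        class EncryptedIndex:
            def __init__(self):
                self.postings = defaultdict(set)  # token -> matching doc ids

            def add_document(self, key, doc_id, words):
                for w in set(words):
                    self.postings[keyword_token(key, w)].add(doc_id)

            def search(self, trapdoor):
                # The server sees only opaque tokens, never the keywords.
                return self.postings.get(trapdoor, set())

        key = os.urandom(32)  # client-side secret
        index = EncryptedIndex()
        index.add_document(key, "doc1", ["personal", "health", "records"])
        print(index.search(keyword_token(key, "health")))  # {'doc1'}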

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of a manifold helps describe its essential properties and how they vary in space. However, when the manifold evolves through time, joint spatio-temporal modelling is needed to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a supernova.
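
    One simple way to realise this first-order propagation is to fit a Gaussian mixture to the particle positions at each snapshot while seeding every fit with the previous one, so the spatial model evolves instead of being re-learned from scratch. The sketch below uses scikit-learn's warm_start for this; the component count and the mixture model itself are illustrative assumptions, not the paper's exact formulation.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def track_manifold(snapshots, n_components=5):
            """snapshots: list of (n_particles, 3) position arrays over time."""
            gmm = GaussianMixture(n_components=n_components, warm_start=True)
            models = []
            for points in snapshots:
                gmm.fit(points)  # warm_start: the previous fit seeds this one
                models.append((gmm.weights_.copy(), gmm.means_.copy(),
                               gmm.covariances_.copy()))
            return models

        # Toy usage: a particle cloud that slowly expands, like a growing shell.
        snapshots = [np.random.randn(2000, 3) * (1.0 + 0.1 * t) for t in range(5)]
        models = track_manifold(snapshots)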

    Unipept: computational exploration of metaproteome data


    Computational methods for gene characterization and the extraction of genomic knowledge

    Joint MAPi Doctoral Programme in Computer Science. Motivation: Medicine and the health sciences are shifting from the classical symptom-based paradigm to a more personalized, genetics-based one, with an invaluable impact on health care. While advances in genetics were already contributing significantly to our knowledge of the human organism, the breakthroughs achieved by several recent initiatives have provided a comprehensive characterization of human genetic differences, paving the way for a new era of medical diagnosis and personalized medicine. Data generated from these and subsequent experiments are now becoming available, but their volume is well beyond what is humanly feasible to explore. It is therefore the responsibility of computer scientists to create the means for extracting the information and knowledge contained in those data. Within the available data, genetic structures contain significant amounts of encoded information that has been uncovered over the past decades. Finding, reading, and interpreting that information are necessary steps for building computational models of genetic entities, organisms, and diseases, a goal that in due course leads to human benefits. Aims: Numerous patterns can be found within the human variome and exome. Exploring these patterns enables the computational analysis and manipulation of digital genomic data but requires specialized algorithmic approaches. In this work we sought to create and explore efficient methodologies to computationally calculate and combine known biological patterns for various purposes, such as the in silico optimization of genetic structures, the analysis of human genes, and the prediction of pathogenicity from human genetic variants. Results: We devised several computational strategies to evaluate genes, explore genomes, manipulate sequences, and analyze patients' variomes. By resorting to combinatorial and optimization techniques we created and combined sequence redesign algorithms to control genetic structures; by combining access to several web services and external resources we created tools to explore and analyze available genetic and patient data; and by using machine learning we developed a workflow for analyzing human mutations and predicting their pathogenicity.
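
    The machine-learning step for pathogenicity prediction can be sketched as a standard supervised pipeline; the feature set, labels, and classifier below are illustrative assumptions, not the thesis's actual workflow.

        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import cross_val_score

        # Hypothetical per-variant features: conservation score, population
        # allele frequency, and a substitution-severity score.
        X = [[0.92, 0.0001, 0.8],
             [0.15, 0.2300, 0.1],
             [0.88, 0.0004, 0.7],
             [0.10, 0.1800, 0.2]]
        y = [1, 0, 1, 0]  # 1 = pathogenic, 0 = benign

        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        scores = cross_val_score(clf, X, y, cv=2)
        print("cross-validated accuracy:", scores.mean())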

    Algorithmic tools for data-oriented law enforcement

    The increase in the capabilities of information technology over the last decade has led to a large increase in the creation of raw data. Data mining, a form of computer-guided statistical data analysis, attempts to draw from these sources knowledge that is usable, human-understandable, and previously unknown. One of its potential application domains is law enforcement. This thesis describes a number of efforts in this direction and reports the results of applying the resulting algorithms to actual police data. Specifically tailored data mining algorithms are shown to have great potential in this area, foreshadowing a future in which algorithmic assistance in "combating" crime will be a valuable asset.