72 research outputs found

    Acceleration of Computational Geometry Algorithms for High Performance Computing Based Geo-Spatial Big Data Analysis

    Geo-spatial computing and data analysis is the branch of computer science that deals with real-world location-based data. Computational geometry algorithms process geometry and shapes and are one of the pillars of geo-spatial computing. Real-world map and location-based data can be huge, and the data structures used to process them extremely large, leading to high computational costs. Furthermore, geo-spatial datasets are growing on all the V's (Volume, Variety, Value, etc.) and are becoming larger and more complex to process, in turn demanding more computational resources. High Performance Computing breaks a problem down so that it can run in parallel on large computers with massive processing power, reducing computing time while delivering the same results. This dissertation explores different techniques to accelerate computational geometry algorithms and geo-spatial computing: many-core Graphics Processing Units (GPUs), multi-core Central Processing Units (CPUs), multi-node setups with the Message Passing Interface (MPI), cache optimizations, memory and communication optimizations, load balancing, algorithmic modifications, directive-based parallelization with OpenMP or OpenACC, and vectorization with compiler intrinsics (AVX). The dissertation applies at least one of these techniques to each of the following problems. A novel method to parallelize plane-sweep-based geometric intersection for GPUs with directives is presented. Parallelization of plane-sweep-based Voronoi construction, segment tree construction, segment tree queries, and segment-tree-based operations is presented. Spatial autocorrelation and the computation of Getis-Ord hotspots are also presented. Acceleration performance and speedup results are reported in each corresponding chapter.
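    As an illustration of one of the problems named above, the following is a minimal serial sketch of the standard Getis-Ord Gi* hotspot statistic. It is not the dissertation's implementation: the function name, the dense weight matrix, and the toy data are assumptions made for this example. The score of each location depends only on the shared inputs, so the outer loop is the kind of independent work that directive-based or GPU parallelization can target.

```python
from math import sqrt


def getis_ord_gi_star(values, weights):
    """Return the Gi* z-score for every location.

    values  : list of n observations, one per location.
    weights : n x n spatial weight matrix (list of lists); Gi* includes
              the location itself, so weights[i][i] is normally non-zero.
    The names and the dense-matrix layout are assumptions for this sketch.
    """
    n = len(values)
    mean = sum(values) / n
    # Global standard deviation term S of the Gi* formula.
    s = sqrt(sum(v * v for v in values) / n - mean * mean)

    scores = []
    for i in range(n):  # each iteration is independent of the others
        row = weights[i]
        sum_w = sum(row)
        sum_w2 = sum(w * w for w in row)
        numerator = sum(w * x for w, x in zip(row, values)) - mean * sum_w
        denominator = s * sqrt((n * sum_w2 - sum_w * sum_w) / (n - 1))
        # Degenerate cases (constant data, or a row covering every location
        # with equal weight) have a zero denominator; report 0.0 for them.
        scores.append(numerator / denominator if denominator else 0.0)
    return scores


# Toy example: five locations on a line, each neighbouring itself and the
# locations one step away. High values sit at the left end.
toy_values = [10, 10, 1, 1, 1]
toy_weights = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1],
]
toy_scores = getis_ord_gi_star(toy_values, toy_weights)
```

    In this toy case the left end receives positive scores (a hotspot) and the right end negative ones (a coldspot).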

    Content sensitivity based access control model for big data

    Big data technologies have seen tremendous growth in recent years and are widely used in both industry and academia. In spite of this growth, these technologies lack adequate measures to protect data from misuse or abuse. Corporations that collect data from multiple sources are at risk of liabilities due to exposure of sensitive information. In the current implementation of Hadoop, only file-level access control is feasible. Giving users the ability to access data based on attributes in a dataset, or based on their role, is complicated by the sheer volume and the multiple formats (structured, unstructured, and semi-structured) of the data. This dissertation proposes an access control framework that enforces access control policies dynamically, based on the sensitivity of the data. The framework enforces these policies by harnessing the data context, usage patterns, and information sensitivity. Information sensitivity changes over time as datasets are added and removed, which can change access control decisions, and the proposed framework accommodates these changes. The framework is automated to a large extent and requires minimal user intervention. The experimental results show that it is capable of enforcing access control policies on non-multimedia datasets with minimal overhead.

    Comparing restricted propagation graphs for the similarity flooding algorithm

    Advisor: Prof. Dr. Marcos Didonet Del Fabro. Dissertation (master's), Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 26/05/2015. Includes references: f. 51-54. Abstract: In Model-Driven Software Engineering (MDSE), different approaches can be used to establish links between elements of different models for distinct purposes, such as model transformation, model traceability, and model integration. In this work, the links are established through the matching operation. Once the links have been established, it is common to assign a similarity value to indicate equivalence (or lack thereof) between the elements. Similarity Flooding (SF) is one of the best-known algorithms for enhancing the similarity of structurally similar elements. The algorithm is generic and has proven to be efficient. However, it depends on a graph-based structure and a less generic encoding. We created nine methods to propagate the similarities between links of elements of metamodels and models. These elements comprise classes, attributes, references, instances, and the type of element, e.g., Integer, String, or Boolean. In order to verify the viability of these methods, two case studies are discussed. In the first case study, we execute our methods between the metamodels and models of Mantis and Bugzilla. In the second, the metamodels and models of AccountOwner and Customer are used. Finally, a comparative study of the metamodel-based encoding is presented to verify whether a less generic implementation, involving fewer model elements and based on the metamodel and model structures, might be a viable implementation and adaptation of the SF algorithm. We compare these methods with an implementation comprising all the propagation structures (non-restricted propagation), which is more similar (though not equivalent) to the original SF implementation. According to the results, using the restricted propagation graphs of the SF, the similarity values between the links increased in relation to the non-restricted algorithm. This is because reducing the number of links increases the propagation values between the links of elements.
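    The idea of restricting the propagation graph can be sketched as follows. This is a simplified fixpoint iteration in the spirit of Similarity Flooding (each round adds the similarity propagated from neighbouring pairs and then normalizes by the maximum), not the dissertation's encoding. The edge "kind" labels, the weights, and the function names are assumptions introduced for this example.

```python
def restrict(edges, allowed_kinds):
    """Keep only propagation edges whose kind is allowed (hypothetical labels)."""
    return [edge for edge in edges if edge[3] in allowed_kinds]


def flood(initial, edges, max_iterations=100, tol=1e-9):
    """Iterate sigma <- normalize(sigma + propagated(sigma)) until it stabilizes.

    initial : dict mapping a pair of elements (a, b) to its initial similarity.
    edges   : list of (source_pair, target_pair, weight, kind) tuples that make
              up the propagation graph.
    """
    sigma = dict(initial)
    for _ in range(max_iterations):
        nxt = dict(sigma)
        # Each edge pushes a weighted share of the source pair's similarity
        # from the previous round onto the target pair.
        for src, dst, weight, _kind in edges:
            nxt[dst] = nxt.get(dst, 0.0) + weight * sigma.get(src, 0.0)
        if not nxt:
            return sigma
        top = max(nxt.values())
        if top > 0:
            nxt = {pair: value / top for pair, value in nxt.items()}
        delta = max(abs(value - sigma.get(pair, 0.0)) for pair, value in nxt.items())
        sigma = nxt
        if delta < tol:
            break
    return sigma


# Toy example: class pair (A, X) and attribute pair (a1, x1) reinforce each
# other; the pair (a1, x2) has no structural support.
initial = {("A", "X"): 1.0, ("a1", "x1"): 1.0, ("a1", "x2"): 1.0}
edges = [
    (("A", "X"), ("a1", "x1"), 1.0, "attribute"),
    (("a1", "x1"), ("A", "X"), 1.0, "attribute"),
]
flooded = flood(initial, edges)
reference_only = flood(initial, restrict(edges, {"reference"}))
```

    With the attribute edges in place, the unsupported pair decays toward zero relative to the supported ones; with every edge filtered out, the initial similarities are left unchanged. Which restrictions help in practice is exactly what the comparative study above examines.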

    A clustering approach based on divide-and-conquer and optimum-path forest (original title: Uma abordagem de agrupamento baseada na técnica de divisão e conquista e floresta de caminhos ótimos)

    Advisor: Alexandre Xavier Falcão. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação. Abstract: Data clustering is one of the main challenges when solving Data Science problems. Despite its progress over almost one century of research, clustering algorithms still fail in identifying groups naturally related to the semantics of the problem. Moreover, the advances in data acquisition, communication, and storage technologies add crucial challenges with a considerable increase in data, which most techniques do not handle. We address these issues by proposing a divide-and-conquer approach to a clustering technique that is unique in finding one group per dome of the probability density function of the data: the Optimum-Path Forest (OPF) clustering algorithm. In the OPF-clustering technique, samples are taken as nodes of a graph whose arcs connect the k-nearest neighbors in the feature space. The nodes are weighted by their probability density values, and a connectivity map is maximized such that each maximum of the probability density function becomes the root of an optimum-path tree (cluster). The best value of k is estimated by optimization within an application-specific interval of values. The problem with this method is that a high number of samples makes the algorithm prohibitive, due to the memory required to store the graph and the computational time needed to obtain the clusters for the best value of k. Since the existing solutions lead to ineffective results, we revisit the problem by proposing a two-level divide-and-conquer approach. At the first level, the dataset is divided into smaller subsets (blocks) and the samples belonging to each block are grouped by the OPF algorithm. Then, the representative samples of each group (more specifically, the roots of the optimum-path forest) are taken to a second level, where they are clustered again. Finally, the group labels obtained at the second level are transferred to all samples of the dataset through their first-level representatives. With this approach, we can use all samples, or at least many samples, in the unsupervised learning process without affecting the grouping performance; the procedure is therefore less likely to lose relevant grouping information. We show that our proposal can obtain satisfactory results in two scenarios, image segmentation and the general data clustering problem, in comparison with some popular baselines. In the first scenario, our technique achieves better results than the others in all tested image databases. In the second scenario, it obtains outcomes similar to an optimized version of the traditional OPF-clustering algorithm. Master's degree. Computer Science. Master in Computer Science. CAPE
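    The two-level scheme described in this abstract can be sketched as below. The sketch does not implement OPF: as a loudly labelled stand-in it clusters by connected components of an eps-neighbourhood graph and picks the member closest to the group centroid as the representative (OPF would use the root of each optimum-path tree). The names `cluster`, `representatives`, `two_level_cluster`, `block_size`, and `eps` are assumptions for this example.

```python
from math import dist


def cluster(points, eps):
    """Stand-in clusterer (NOT OPF): connected components of the graph that
    links points at distance <= eps. Returns one label per point."""
    labels = [-1] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in range(len(points)):
                if labels[v] == -1 and dist(points[u], points[v]) <= eps:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels


def representatives(points, labels):
    """Map each group label to the index of the member nearest the centroid."""
    groups = {}
    for index, label in enumerate(labels):
        groups.setdefault(label, []).append(index)
    reps = {}
    for label, members in groups.items():
        dim = len(points[0])
        centroid = tuple(
            sum(points[i][d] for i in members) / len(members) for d in range(dim)
        )
        reps[label] = min(members, key=lambda i: dist(points[i], centroid))
    return reps


def two_level_cluster(points, block_size, eps):
    """Cluster each block, re-cluster the representatives, transfer labels."""
    rep_points = []   # one representative sample per first-level group
    rep_members = []  # dataset indices that each representative stands for
    # Level 1: split the dataset into blocks and cluster each block alone.
    for start in range(0, len(points), block_size):
        block_idx = list(range(start, min(start + block_size, len(points))))
        block = [points[i] for i in block_idx]
        labels = cluster(block, eps)
        for label, local_rep in representatives(block, labels).items():
            rep_points.append(block[local_rep])
            rep_members.append(
                [block_idx[j] for j, lab in enumerate(labels) if lab == label]
            )
    # Level 2: cluster only the representatives.
    top_labels = cluster(rep_points, eps)
    # Transfer: every sample inherits the label of its representative.
    final = [None] * len(points)
    for rep, members in enumerate(rep_members):
        for i in members:
            final[i] = top_labels[rep]
    return final


# Toy example: two well-separated groups interleaved across three blocks.
samples = [(0, 0), (10, 10), (0.5, 0), (10.5, 10), (0, 0.5), (10, 10.5)]
sample_labels = two_level_cluster(samples, block_size=2, eps=1.5)
```

    Only one block plus the list of representatives is held by the clusterer at any time, which is the memory argument made in the abstract; how well the second level merges groups depends on the clusterer and its parameters.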

    DKWS: A Distributed System for Keyword Search on Massive Graphs (Complete Version)

    Due to the unstructuredness and the lack of schemas of graphs, such as knowledge graphs, social networks, and RDF graphs, keyword search for querying such graphs has been proposed. As graphs have become voluminous, large-scale distributed processing has attracted much interest from the database research community. While there have been several distributed systems, distributed querying techniques for keyword search are still limited. This paper proposes a novel distributed keyword search system called DKWS. First, we present a monotonic property with keyword search algorithms that guarantees correct parallelization. Second, we present a keyword search algorithm as monotonic backward and forward search phases. Moreover, we propose new tight bounds for pruning nodes being searched. Third, we propose a notify-push paradigm and the PINE programming model of DKWS. The notify-push paradigm allows asynchronously exchanging the upper bounds of matches across the workers and the coordinator in DKWS. The PINE programming model naturally fits keyword search algorithms, as they have distinguished phases, to allow preemptive searches to mitigate staleness in a distributed system. Finally, we investigate the performance and effectiveness of DKWS through experiments using real-world datasets. We find that DKWS is up to two orders of magnitude faster than related techniques, and its communication costs are 7.6 times smaller than those of other techniques.

    Exploring Hidden Coherent Feature Groups and Temporal Semantics for Multimedia Big Data Analysis

    Thanks to advanced technologies and social networks that allow data to be widely shared across the Internet, there is an explosion of pervasive multimedia data, generating high demand for multimedia services and applications that let people easily access and manage multimedia data in various areas. In response to this demand, multimedia big data analysis has become an emerging hot topic in both industry and academia, ranging from basic infrastructure, management, search, and mining to security, privacy, and applications. Within the scope of this dissertation, a multimedia big data analysis framework is proposed for semantic information management and retrieval, with a focus on rare event detection in videos. The proposed framework is able to explore hidden semantic feature groups in multimedia data and incorporate temporal semantics, especially for video event detection. First, a hierarchical semantic data representation is presented to alleviate the semantic gap issue, and the Hidden Coherent Feature Group (HCFG) analysis method is proposed to capture the correlation between features and separate the original feature set into semantic groups, seamlessly integrating multimedia data in multiple modalities. Next, an Importance Factor based Temporal Multiple Correspondence Analysis (IF-TMCA) approach is presented for effective event detection. Specifically, the HCFG algorithm is integrated with the Hierarchical Information Gain Analysis (HIGA) method to generate the Importance Factor (IF) used to produce the initial detection results. Then, the TMCA algorithm is proposed to efficiently incorporate temporal semantics for re-ranking and improving the final performance. Finally, a sampling-based ensemble learning mechanism is applied to further accommodate imbalanced datasets. In addition to the multimedia semantic representation and class imbalance problems, lack of organization is another critical issue for multimedia big data analysis. In this framework, an affinity propagation-based summarization method is also proposed to transform the unorganized data into a better structure with clean and well-organized information. The whole framework has been thoroughly evaluated across multiple domains, such as soccer goal event detection and disaster information management.

    Advances in Data Mining Knowledge Discovery and Applications

    Advances in Data Mining Knowledge Discovery and Applications aims to help data miners, researchers, scholars, and PhD students who wish to apply data mining techniques. The primary contribution of this book is highlighting frontier fields and implementations of knowledge discovery and data mining. It may seem that the same things are repeated, but in general the same approaches and techniques can help in different fields and areas of expertise. This book presents knowledge discovery and data mining applications in two sections. Data mining covers areas of statistics, machine learning, data management and databases, pattern recognition, artificial intelligence, and other fields. In this book, most of these areas are covered with different data mining applications. The eighteen chapters are classified in two parts: Knowledge Discovery and Data Mining Applications.


    Processing, mining, and analyzing big data adds significant value towards solving previously unverified research questions and improving our ability to understand problems in the geographical sciences. This dissertation contributes to developing a solution that supports researchers who may not otherwise have access to traditional high-performance computing resources, so that they can benefit from the “big data” era and carry out big geographical research in ways that have not previously been possible. Using approaches from the fields of geographic information science, remote sensing, and computer science, this dissertation addresses three major challenges in big geographical research: 1) how to exploit cloud computing to implement a universal scalable solution that classifies multi-sourced remotely sensed imagery datasets with high efficiency; 2) how to overcome the missing data issue in land use land cover studies with a high-performance framework on the cloud through the use of available auxiliary datasets; and 3) the design considerations underlying a universal massive-scale voxel geographical simulation model that simulates complex geographical systems from a three-dimensional spatial perspective. This dissertation implements an in-memory distributed remotely sensed imagery classification framework on the cloud using both unsupervised and supervised classifiers, and classifies remotely sensed imagery datasets of the Suez Canal area, Egypt, and Inner Mongolia, China, under different cloud environments. It also implements and tests a cloud-based gap-filling model with eleven biophysical and socioeconomic auxiliary datasets for Inner Mongolia, China. This research also extends a voxel-based Cellular Automata model using graph theory and develops it into a massive-scale voxel geographical simulation framework that simulates dynamic processes, such as the dispersal of air pollution particles, on the cloud.