33 research outputs found

    Segment Oriented Compression Scheme for MOLAP Based on Extendible Multidimensional Arrays

    Many statistical and MOLAP applications use multidimensional arrays as their basic data structure because they allow efficient and convenient storage and retrieval of large volumes of business data for decision making. Data allocation, or data compression, is a key performance factor here because performance depends strongly on the amount of storage required and on the availability of memory. This holds especially for data warehousing environments, which have to deal with huge amounts of data. The most evident consequence of data compression is that it reduces storage cost by packing more logical data per unit of physical capacity. Improved performance is a net outcome, because less physical data needs to be retrieved during scan-oriented queries. In this paper, an efficient data compression technique is proposed based on the notion of the extendible array. The main idea of the scheme is to compress each segment of the extendible array using position information only. We compare the proposed scheme with prominent compression schemes on several performance measures.

    Chunking of Large Multidimensional Arrays


    On indexing highly dynamic multidimensional datasets for interactive analytics

    Advisor: Prof. Dr. Luis Carlos Erpen de Bona. Thesis (doctorate), Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 15/04/2016. Includes references: leaves 77-91. Area of concentration: Computer science. Abstract: Indexing multidimensional data has been an active focus of research in the last few decades. In this work, we present a new type of OLAP workload found at Facebook and characterized by (a) high dynamicity and dimensionality, (b) scale and (c) interactivity and simplicity of queries, which is unsuited to most current OLAP DBMSs and multidimensional indexing techniques. To address this use case, we propose a novel multidimensional data organization and indexing strategy for in-memory DBMSs called Granular Partitioning. This technique extends the traditional view of database partitioning by range partitioning every dimension of the dataset and organizing the data within small containers in an unordered and sparse fashion, so as to provide high ingestion rates and indexed access through every dimension without maintaining any auxiliary data structures. We also describe how an OLAP DBMS able to support a multidimensional data model (composed of cubes, dimensions and metrics) and operations such as roll-up, drill-down and efficient slice and dice (filtering) can be built on top of this new data organization technique. To validate the described technique experimentally, we present Cubrick, a new in-memory distributed OLAP DBMS for interactive analytics based on Granular Partitioning, which we have written from the ground up at Facebook. Finally, we present results from a thorough experimental evaluation that leveraged datasets and queries collected from a few pilot Cubrick deployments. We show that by properly organizing the dataset according to Granular Partitioning and focusing the design on simplicity, we are able to achieve the target scale and store tens of terabytes of in-memory data, continuously ingest millions of records per second from realtime data streams and still execute sub-second queries.
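The range-partitioning idea in the abstract above can be illustrated with a minimal Python sketch. It assumes non-negative integer coordinates and a fixed range width per dimension; the class name `GranularStore`, the `range_sizes` parameter and the dictionary-of-lists layout are hypothetical choices made for illustration and are not taken from Cubrick itself.

```python
from collections import defaultdict


class GranularStore:
    """Toy store that range-partitions every dimension (hypothetical sketch)."""

    def __init__(self, range_sizes):
        # range_sizes[d] = width of each range along dimension d (assumed fixed).
        self.range_sizes = tuple(range_sizes)
        # Sparse: only bricks that actually received data exist in the dictionary.
        self.bricks = defaultdict(list)

    def brick_id(self, coords):
        # One range index per dimension identifies the container ("brick").
        return tuple(c // size for c, size in zip(coords, self.range_sizes))

    def ingest(self, coords, metric):
        # Append-only and unordered inside the brick: no auxiliary index to maintain.
        self.bricks[self.brick_id(coords)].append((tuple(coords), metric))

    def sum_where(self, lows, highs):
        """Sum metrics of records with lows[d] <= coords[d] <= highs[d] for all d."""
        lo_id = self.brick_id(lows)
        hi_id = self.brick_id(highs)
        total = 0
        for bid, rows in self.bricks.items():
            # Prune bricks whose range lies outside the filter on any dimension.
            if any(b < lo or b > hi for b, lo, hi in zip(bid, lo_id, hi_id)):
                continue
            for coords, metric in rows:
                if all(l <= c <= h for c, l, h in zip(coords, lows, highs)):
                    total += metric
        return total


store = GranularStore(range_sizes=(10, 4))
store.ingest((3, 1), 5)
store.ingest((12, 2), 7)
store.ingest((25, 7), 11)
result = store.sum_where(lows=(0, 0), highs=(19, 3))
```

In this sketch a filter on any dimension prunes whole bricks before individual records are scanned, which is one plausible reading of "indexed access through every dimension without maintaining any auxiliary data structures".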

    Parallel Access of Out-Of-Core Dense Extendible Arrays

    Datasets used in scientific and engineering applications are often modeled as dense multi-dimensional arrays. For very large datasets, the corresponding array models are typically stored out-of-core as array files. The array elements are mapped onto consecutive linear locations that correspond to the linear ordering of the multi-dimensional indices. The two conventional mappings are the row-major order and the column-major order of multi-dimensional arrays. Such conventional mappings of dense array files severely limit both the performance of applications and the extendibility of the dataset. Firstly, an array file organized in, say, row-major order causes applications that subsequently access the data in column-major order to perform abysmally. Secondly, any subsequent expansion of the array file is limited to only one dimension. Expanding such conventional out-of-core arrays along arbitrary dimensions requires storage reorganization that can be very expensive. We present a solution for storing out-of-core dense extendible arrays that resolves these two limitations. The method uses a mapping function F*(), together with information maintained in axial vectors, to compute the linear address of an extendible array element when passed its k-dimensional index. We also give the inverse function, F⁻¹*(), for deriving the k-dimensional index when given the linear address. We show how the mapping function, in combination with MPI-IO and a parallel file system, allows the extendible array to grow without reorganization and without significant performance degradation for applications accessing elements in any desired order. We give methods for reading and writing sub-arrays into and out of parallel applications that run on a cluster of workstations. The axial vectors are replicated and maintained in each node that accesses sub-array elements.
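The two conventional mappings named in the abstract above, and the reason they restrict expansion to one dimension, can be sketched in Python as follows. This shows only the standard row-major and column-major address computations; it does not reproduce the paper's F*() or its axial vectors, whose details are not given in the abstract. All function names are hypothetical.

```python
def row_major_address(index, shape):
    # Last dimension varies fastest.
    addr = 0
    for i, n in zip(index, shape):
        addr = addr * n + i
    return addr


def column_major_address(index, shape):
    # First dimension varies fastest.
    addr = 0
    for i, n in zip(reversed(index), reversed(shape)):
        addr = addr * n + i
    return addr


def row_major_index(addr, shape):
    # Inverse of row_major_address: peel off dimensions from last to first.
    index = []
    for n in reversed(shape):
        addr, i = divmod(addr, n)
        index.append(i)
    return tuple(reversed(index))


shape = (3, 4)
a = row_major_address((1, 2), shape)      # 1 * 4 + 2
b = column_major_address((1, 2), shape)   # 2 * 3 + 1

# Growing the first dimension keeps existing row-major addresses valid;
# growing the second one changes them, which forces a reorganization.
grown_rows = row_major_address((1, 2), (5, 4))
grown_cols = row_major_address((1, 2), (3, 6))
```

The same element lands at a different address once the second dimension grows, which is the storage reorganization cost the abstract says the extendible-array mapping avoids.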

    AMCIS 2002 Panels and Workshops II: Spreadsheet-Based DSS Curriculum Issues

    When challenged to justify the value of information systems (IS) research, decision support systems (DSS) is usually cited as one of the most compelling examples of where IS research made the transition successfully from theoretical academic journals into the real world. In light of this assessment, it is somewhat surprising that offerings of DSS courses have waned over the years. This paper identifies several possible reasons for the decline in DSS course offerings and suggests innovative approaches using spreadsheets for breathing new life into this cornerstone of the IS field.

    A Survey on Spatial Indexing

    Spatial information processing has been a focus of research over the past decade. In spatial databases, data associated with spatial coordinates and extents are retrieved based on spatial proximity. A large number of spatial indexes have been proposed to support efficient indexing and retrieval of spatial objects in large databases. The goal of this paper is to review advanced access-method techniques. The paper classifies the existing multidimensional access methods according to their type of indexing and their performance on spatial queries. K-d trees outperform quadtrees without requiring additional memory.

    Query processing techniques for arrays


    Enabling Model-Driven Live Analytics For Cyber-Physical Systems: The Case of Smart Grids

    Advances in software, embedded computing, sensors, and networking technologies will lead to a new generation of smart cyber-physical systems that will far exceed the capabilities of today’s embedded systems. They will be entrusted with increasingly complex tasks like controlling electric grids or autonomously driving cars. These systems have the potential to lay the foundations for tomorrow’s critical infrastructures, to form the basis of emerging and future smart services, and to improve the quality of our everyday lives in many areas. To carry out their tasks, they have to continuously monitor and collect data from physical processes, analyse this data, and make decisions based on it. Making smart decisions requires a deep understanding of the environment, internal state, and the impacts of actions. Such deep understanding relies on efficient data models to organise the sensed data and on advanced analytics. Because cyber-physical systems control physical processes, decisions need to be taken very fast. This makes it necessary to analyse data live, as opposed to with conventional batch analytics. However, the complex nature of such systems, combined with the massive amount of data they generate, imposes fundamental challenges. While data in the context of cyber-physical systems shares some characteristics with big data, it holds a particular complexity. This complexity results from the complicated physical phenomena described by this data, which make it difficult to extract a model able to explain such data and its various multi-layered relationships. Existing solutions fail to provide sustainable mechanisms to analyse such data live. This dissertation presents a novel approach, named model-driven live analytics. The main contribution of this thesis is a multi-dimensional graph data model that brings raw data, domain knowledge, and machine learning together in a single model, which can drive live analytic processes. 
This model is continuously updated with the sensed data and can be leveraged by live analytic processes to support the decision-making of cyber-physical systems. The presented approach has been developed in collaboration with an industrial partner and, in the form of a prototype, applied to the domain of smart grids. The addressed challenges are derived from this collaboration as a response to shortcomings in the current state of the art. More specifically, this dissertation provides solutions for the following challenges. First, data handled by cyber-physical systems is usually dynamic (data in motion, as opposed to traditional data at rest) and changes frequently and at different paces. Analysing such data is challenging, since data models usually can only represent a snapshot of a system at one specific point in time. A common approach is discretisation, which regularly samples and stores such snapshots at specific timestamps to keep track of the history. Continuously changing data is then represented as a finite sequence of such snapshots. Such a representation is very inefficient to analyse, since it requires mining the snapshots, extracting a relevant dataset, and finally analysing it. For this problem, this thesis presents a temporal graph data model and storage system that treat time as a first-class property. A time-relative navigation concept makes it possible to analyse frequently changing data very efficiently. Secondly, making sustainable decisions requires anticipating the impacts that certain actions would have. In complex cyber-physical systems, situations can arise in which hundreds or thousands of such hypothetical actions must be explored before a solid decision can be made. Every action leads to an independent alternative, from which a further set of actions can be applied, and so forth. 
Finding the sequence of actions that leads to the desired alternative requires efficiently creating, representing, and analysing many different alternatives. Given that every alternative has its own history, this creates a very high combinatorial complexity of alternatives and histories, which is hard to analyse. To tackle this problem, this dissertation introduces a multi-dimensional graph data model (an extension of the temporal graph data model) that makes it possible to efficiently represent, store, and analyse many different alternatives live. Thirdly, complex cyber-physical systems are often distributed, but to fulfil their tasks these systems typically need to share context information between computational entities. This requires analytic algorithms to reason over distributed data, which is a complex task since it relies on the aggregation and processing of various distributed and constantly changing data. To address this challenge, this dissertation proposes an approach to transparently distribute the presented multi-dimensional graph data model in a peer-to-peer manner and defines a stream processing concept to efficiently handle frequent changes. Fourthly, to meet future needs, cyber-physical systems need to become increasingly intelligent. To make smart decisions, these systems have to continuously refine behavioural models that are known at design time with what can only be learned from live data. Machine learning algorithms can help to resolve this unknown behaviour by extracting commonalities over massive datasets. Nevertheless, searching for a coarse-grained common behaviour model can be very inaccurate for cyber-physical systems, which are composed of completely different entities with very different behaviour. For these systems, fine-grained learning can be significantly more accurate. However, modelling, structuring, and synchronising many fine-grained learning units is challenging. 
To tackle this, this thesis presents an approach to define reusable, chainable, and independently computable fine-grained learning units, which can be modelled together with, and on the same level as, domain data. This makes it possible to weave machine learning directly into the presented multi-dimensional graph data model. In summary, this thesis provides an efficient multi-dimensional graph data model to enable live analytics of complex, frequently changing, and distributed data of cyber-physical systems. This model can significantly improve data analytics for such systems and empower cyber-physical systems to make smart decisions live. The presented solutions combine and extend methods from model-driven engineering, models@run.time, data analytics, database systems, and machine learning.
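The contrast the abstract draws between full snapshots and treating time as a first-class property can be illustrated with a small Python sketch. It assumes each attribute keeps its own sorted list of timestamped versions and that a read at time t resolves to the latest version written at or before t. The class `TemporalNode`, its methods and the attribute name `load_kw` are hypothetical and are not the thesis's actual data model or API.

```python
import bisect


class TemporalNode:
    """Node whose attribute values are versioned by timestamp (hypothetical sketch)."""

    def __init__(self):
        self._times = {}   # attribute -> sorted list of timestamps
        self._values = {}  # attribute -> values aligned with self._times

    def set(self, attr, time, value):
        times = self._times.setdefault(attr, [])
        values = self._values.setdefault(attr, [])
        pos = bisect.bisect_left(times, time)
        if pos < len(times) and times[pos] == time:
            values[pos] = value          # overwrite the version at this timestamp
        else:
            times.insert(pos, time)      # keep timestamps sorted
            values.insert(pos, value)

    def get(self, attr, time):
        # Resolve the latest version written at or before `time`.
        times = self._times.get(attr, [])
        pos = bisect.bisect_right(times, time)
        if pos == 0:
            return None                  # attribute did not exist yet
        return self._values[attr][pos - 1]


meter = TemporalNode()
meter.set("load_kw", 10, 1.5)
meter.set("load_kw", 20, 2.0)
```

Only changed values are stored, so a query at an arbitrary time costs a binary search per attribute instead of a scan over whole-system snapshots.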

    An investigation into the issues of multi-agent data mining

    Multi-agent systems (MAS) often deal with complex applications that require distributed problem solving. In many applications the individual and collective behaviour of the agents depends on the observed data from distributed sources. The field of Distributed Data Mining (DDM) deals with these challenges in analyzing distributed data and offers many algorithmic solutions to perform different data analysis and mining operations in a fundamentally distributed manner that pays careful attention to the resource constraints. Since multi-agent systems are often distributed and agents have proactive and reactive features, combining DM with MAS for data intensive applications is therefore appealing. This Chapter discusses a number of research issues concerned with the use of Multi-Agent Systems for Data Mining (MADM), also known as agent-driven data mining. The Chapter also examines the issues affecting the design and implementation of a generic and extendible agent-based data mining framework. An Extendible Multi-Agent Data mining System (EMADS) Framework for integrating distributed data sources is presented. This framework achieves high availability and high performance without compromising the data integrity and security. © 2010 Nova Science Publishers, Inc. All rights reserved.

    New data structures and algorithms for the efficient management of large spatial datasets

    In this thesis we study the efficient representation of multidimensional grids, presenting new compact data structures to store and query grids in different application domains. We propose several static and dynamic data structures for the representation of binary grids and grids of integers, and study applications to the representation of raster data in Geographic Information Systems, RDF databases, etc. We first propose a collection of static data structures for the representation of binary grids and grids of integers: 1) a new representation of bi-dimensional binary grids with large clusters of uniform values, with applications to the representation of binary raster data; 2) a new data structure to represent multidimensional binary grids; 3) a new data structure to represent grids of integers with support for top-k range queries. We also propose a new dynamic representation of binary grids, a new data structure that provides the same functionality as our static representations of binary grids but also supports changes in the grid. Our data structures can be used in several application domains. We propose specific variants and combinations of our generic proposals to represent temporal graphs, RDF databases, OLAP databases, binary or general raster data, and temporal raster data. We also propose a new algorithm to jointly query a raster dataset (stored using our representations) and a vectorial dataset stored in a classic data structure, showing that our proposal can be faster and require less space than the usual alternatives. Our representations provide interesting trade-offs and are competitive in terms of space and query times with usual representations in the different domains.