11 research outputs found

    Study and Performance Analysis of Different Techniques for Computing Data Cubes

    Get PDF
    Data is an integrated form of observable and recordable facts drawn from operational or transactional systems into the data warehouse. A data warehouse usually stores aggregated and historical data in multi-dimensional schemas. Data only has value to end users when it is formulated and represented as information, and information is a composed collection of facts for decision making. Cube computation is the most efficient way to answer such decision-making queries and retrieve information from data, and Online Analytical Processing (OLAP) is used for this purpose. There are two types of OLAP: Relational Online Analytical Processing (ROLAP) and Multidimensional Online Analytical Processing (MOLAP). This research applies both ROLAP and MOLAP and compares their computation times as the data volume grows. A large data warehouse generally produces an extensive output that occupies a large space with a huge number of empty data cells. To solve this problem, data compression is inevitable; therefore, Compressed Row Storage (CRS) is applied to reduce the empty-cell overhead.
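    The Compressed Row Storage idea the abstract applies can be sketched as follows. This is a minimal illustration, not the paper's implementation; the sales-cube slice and cell values are invented for the example. Only non-empty cells are stored, together with their column indices and per-row offsets:

```python
def to_crs(dense):
    """Compress a dense 2-D array (list of rows) into CRS form:
    values, col_index, and row_ptr arrays; empty cells (0) are dropped."""
    values, col_index, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:          # skip empty cells
                values.append(v)
                col_index.append(j)
        row_ptr.append(len(values))
    return values, col_index, row_ptr

def crs_get(values, col_index, row_ptr, i, j):
    """Read cell (i, j) back from the compressed representation."""
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_index[k] == j:
            return values[k]
    return 0                    # cell was empty

# A sparse slice of a (hypothetical) sales cube: rows = products, columns = months.
slice_ = [
    [0, 5, 0, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 3],
]
vals, cols, ptr = to_crs(slice_)
print(vals, cols, ptr)   # [5, 7, 3] [1, 0, 3] [0, 1, 1, 3]
```

    With 12 cells but only 3 non-empty ones, the compressed form stores 3 values plus index arrays, which is where the empty-cell overhead saving comes from.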

    SPARSITY HANDLING AND DATA EXPLOSION IN OLAP SYSTEMS

    Get PDF
    A common problem with OnLine Analytical Processing (OLAP) databases is data explosion: the data size multiplies when it is loaded from the source data into multidimensional cubes. Data explosion is not an issue for small databases, but it can be a serious problem for large ones. In this paper we discuss the sparsity and data explosion phenomena in the multidimensional data model, which lies at the core of OLAP systems. Our research over five companies in different branches of business confirms the observation that, in reality, most cubes are extremely sparse. We also consider the different methods that relational and multidimensional servers apply to reduce the data explosion and sparsity problems, such as compression and indexing techniques, partitioning, and preliminary aggregations.
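    The data explosion effect can be made concrete with a back-of-the-envelope calculation (an illustrative sketch; the cardinalities and fact-row count are invented, not from the paper). A fully materialized cube has a cell for every combination of dimension members plus the aggregate "ALL" member, while the number of non-empty cells is bounded by the fact rows times the number of cuboids:

```python
from math import prod

def full_cube_cells(cardinalities):
    """Number of cells in a fully materialized cube: every dimension
    contributes its members plus the aggregate 'ALL' member."""
    return prod(c + 1 for c in cardinalities)

def max_nonempty_cells(n_facts, n_dims):
    """Upper bound on non-empty cells: each fact row populates at most
    one cell in each of the 2^d group-by combinations (cuboids)."""
    return n_facts * 2 ** n_dims

# Hypothetical warehouse: 5 dimensions of 100 members each, 10,000 fact rows.
total = full_cube_cells([100] * 5)
bound = max_nonempty_cells(10_000, 5)
print(total, bound, bound / total)   # the density ratio is far below 1
```

    Even with this generous upper bound the cube is overwhelmingly empty, which is why the compression, partitioning, and pre-aggregation techniques the paper surveys matter.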

    Benchmarking Big Data OLAP NoSQL Databases

    Get PDF
    With the advent of Big Data, new challenges have emerged regarding the evaluation of decision support systems (DSS). Existing evaluation benchmarks are not configured to handle a massive data volume and wide data diversity. In this paper, we introduce a new DSS benchmark that supports multiple data storage systems, such as relational and Not Only SQL (NoSQL) systems. Our scheme recognizes numerous data models (snowflake, star and flat topologies) and several data formats (CSV, JSON, TBL, XML, etc.). It entails complex data generation characterized within the "volume, variety, and velocity" (3V) framework. Our scheme also enables distributed and parallel data generation. Furthermore, we present some experimental results with KoalaBench.

    Interacting with Statistical Linked Data via OLAP Operations

    Get PDF
    Online Analytical Processing (OLAP) promises an interface for analysing Linked Data containing statistics that goes beyond other interaction paradigms such as follow-your-nose browsers, faceted-search interfaces, and query builders. Transforming statistical Linked Data into a star schema to populate a relational database and applying a common OLAP engine does not allow OLAP queries to be optimised on RDF or changes in Linked Data sources to be propagated directly to clients. Therefore, as a new way to interact with statistics published as Linked Data, we investigate the problem of executing OLAP queries via SPARQL on an RDF store. To that end, we first define projection, slice, dice and roll-up operations on single data cubes published as Linked Data reusing the RDF Data Cube vocabulary, and show how a nested set of operations leads to an OLAP query. Second, we show how to transform an OLAP query into a SPARQL query which generates all required tuples from the data cube. In a small experiment, we show the applicability of our OLAP-to-SPARQL mapping in answering a business question in the financial domain.
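    The general shape of such an OLAP-to-SPARQL mapping can be sketched as below. This is a simplified illustration of a roll-up translated to a SPARQL aggregation over the RDF Data Cube (qb:) vocabulary; the dataset, dimension, and measure IRIs are placeholders, and the paper's actual mapping is richer than this:

```python
def rollup_to_sparql(dataset_iri, groupby_dims, measure_iri):
    """Translate a roll-up on a qb:DataSet into a SPARQL aggregation
    query: group observations by the given dimensions, sum the measure."""
    vars_ = [f"?d{i}" for i in range(len(groupby_dims))]
    patterns = "\n      ".join(
        f"?obs <{dim}> {v} ." for dim, v in zip(groupby_dims, vars_))
    return f"""PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT {' '.join(vars_)} (SUM(?m) AS ?total)
WHERE {{
      ?obs a qb:Observation ;
           qb:dataSet <{dataset_iri}> ;
           <{measure_iri}> ?m .
      {patterns}
}}
GROUP BY {' '.join(vars_)}"""

# Hypothetical IRIs: roll a sales cube up to the region level.
q = rollup_to_sparql(
    "http://example.org/ds/sales",
    ["http://example.org/dim/region"],
    "http://example.org/measure/amount")
print(q)
```

    Nesting further operations (slice, dice) would add FILTER clauses or fix dimension values before grouping, which is how a chain of OLAP operations becomes a single SPARQL query.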

    Multi-Dimensional Partitioning in BUC for Data Cubes

    Get PDF
    Bottom-Up Computation (BUC) is one of the most studied algorithms for data cube generation in on-line analytical processing. Its bottom-up style of computation allows the algorithm to efficiently generate a data cube for memory-sized input data. When the entire input data cannot fit into memory, much of the literature suggests partitioning the data by one dimension and running the algorithm on each single-dimensional partition. For very large input data, the partitioned data might still not fit into memory, and partitioning by additional dimensions is required; however, this multi-dimensional partitioning is more complicated than single-dimensional partitioning and has not been fully discussed before. Our goal is to provide a heuristic implementation of multi-dimensional partitioning in BUC. To validate our design, we compare it with our implementation of PipeSort, a top-down data cubing algorithm, and examine the advantages and disadvantages of top-down versus bottom-up data cubing.
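    The core recursion of BUC that the abstract builds on can be sketched as follows (a minimal in-memory sketch with COUNT as the aggregate and no minimum-support pruning or out-of-core partitioning, which are the paper's actual concern; the data is illustrative):

```python
from collections import defaultdict

def buc(rows, dims, prefix=None, out=None):
    """Bottom-Up Computation sketch: recursively partition the input on
    each remaining dimension and emit (group-by cell, count) for every
    cuboid. `rows` are tuples of dimension values; `dims` are the
    indices of the dimensions not yet grouped on."""
    if out is None:
        out, prefix = {}, ()
    out[prefix] = len(rows)                 # aggregate for this cell
    for pos, d in enumerate(dims):
        parts = defaultdict(list)           # partition the rows on dim d
        for r in rows:
            parts[r[d]].append(r)
        for val, part in parts.items():
            # recurse only on dimensions after d (bottom-up order)
            buc(part, dims[pos + 1:], prefix + ((d, val),), out)
    return out

data = [("A", "x"), ("A", "y"), ("B", "x")]
cube = buc(data, [0, 1])
print(cube[()])            # 3: the apex (ALL, ALL) cell
print(cube[((0, "A"),)])   # 2: rows with dim0 = A
```

    The out-of-core variant the paper studies applies this same recursion to each partition of the input, and the difficulty it addresses is choosing the partitioning dimensions when one dimension is not enough.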

    View and Index Selection on Graph Databases

    Get PDF
    One of the most important aspects of native graph-database systems is their index-free adjacency property, which enforces that nodes have direct physical RAM addresses and physically point to other adjacent nodes. The index-free adjacency property accelerates query answering for queries that are bound to one (or more) specific nodes within the graph, namely anchor nodes. The corresponding anchor node is used as the starting point for answering the query by examining its adjacent nodes instead of the whole graph. Nevertheless, non-anchored-node queries are much harder to answer, since the query planner must examine a large portion of the graph in order to answer the corresponding query. In this work we study view and index selection techniques in order to accelerate the aforementioned class of queries. We analyze different index and view selection strategies for query answering and show that, depending on the characteristics of the query, the graph database, and the corresponding answer set, a different strategy may be optimal among the indexing and view materialization alternatives.
    Before selecting the views and indices, our system employs pattern mining techniques in order to guess the characteristics of future queries. Thus, the initial query workload is represented by a much smaller summary of the query patterns that are most likely to appear in future queries, each pattern having a corresponding expected number of appearances. Our selection strategy is based on a greedy view and index selection strategy that, at each step of its execution, tries to maximize the ratio of the benefit of materializing a view or index to the corresponding cost of storing it. Our selection algorithm is inspired by the corresponding greedy algorithm for "Maximizing a Nondecreasing Submodular Set Function Subject to a Knapsack Constraint". Our experimental evaluation shows that all the steps of the index selection process are completed in a few seconds, while the corresponding rewritings accelerate 15.44% of the queries in the DbPedia query workload; those queries are executed in 1.63% of their initial time on average.
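    The greedy benefit-to-cost selection described above can be sketched as follows. This is a deliberate simplification: the candidate names, benefits, and costs are invented, and unlike the submodular-knapsack algorithm the paper draws on, this sketch does not re-estimate each candidate's benefit after every materialization decision:

```python
def greedy_select(candidates, budget):
    """Greedy view/index selection sketch: repeatedly take the candidate
    with the highest benefit-to-storage-cost ratio, keeping it only if
    it fits the remaining storage budget.
    candidates: dict of name -> (benefit, storage_cost)."""
    chosen, used = [], 0
    remaining = dict(candidates)
    while remaining:
        name, (benefit, cost) = max(remaining.items(),
                                    key=lambda kv: kv[1][0] / kv[1][1])
        del remaining[name]
        if used + cost <= budget:
            chosen.append(name)
            used += cost
    return chosen

# Hypothetical candidates: two materialized views and one index.
views = {"v1": (100, 10), "v2": (90, 30), "idx1": (40, 5)}
print(greedy_select(views, 20))   # ['v1', 'idx1']
```

    Here v2 has the largest absolute benefit but the worst benefit-per-unit-of-storage ratio, so under a budget of 20 the greedy pass materializes v1 and idx1 instead.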

    Materialização à medida de vistas multidimensionais de dados

    Get PDF
    Master's dissertation in Informatics Engineering. With the emergence of the information era, many companies resorted to data warehouses to store the increasing amount of business data at their disposal. With this growth in data volume comes the need to explore the data better, so that it can be useful in evaluating and making business decisions. OLAP (On-Line Analytical Processing) systems respond to this need by helping business analysts explore and evaluate the data, giving them autonomy of exploration and providing a multi-perspective, quick-answer structure. However, in order to provide quick access to this information, the materialization of multi-dimensional structures with this data already pre-computed is required, reducing query time to the time needed to read the answer and avoiding the processing time of each query. The complete materialization of the required data is impractical due to the volume of data the systems are subjected to and the processing time needed to compute all possible combinations.
    Since the business analyst is the differentiating element in the effective use of these structures, this work proposes a set of techniques that study the user's behaviour in order to understand its seasonality and the target views of the user's explorations, so that it becomes possible to define new structures containing the views most appropriate for materialization, thus better satisfying the exploration needs of the users. In this dissertation, structures that collect the users' query records are defined, and with these data, techniques for identifying user profiles and usage patterns are applied, namely the definition of OLAP sessions, the application of Markov chains, and the determination of equivalence classes of queried attributes. At the end of this study we propose the definition of an OLAP signature capable of describing the user's OLAP behaviour from the elements identified by the studied techniques, thus allowing the system administrator to restructure the multi-dimensional structures "tailored" to the usage made by the analysts.
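    The Markov-chain step mentioned above can be sketched as follows (a minimal illustration, not the dissertation's model; the session logs and operation names are invented). From logged OLAP sessions, first-order transition probabilities between operations are estimated by counting consecutive pairs:

```python
from collections import Counter, defaultdict

def transition_probs(sessions):
    """Estimate first-order Markov transition probabilities between
    OLAP operations from session logs (each session is an ordered
    list of operation names)."""
    counts = defaultdict(Counter)
    for session in sessions:
        for a, b in zip(session, session[1:]):   # consecutive pairs
            counts[a][b] += 1
    return {a: {b: n / sum(c.values()) for b, n in c.items()}
            for a, c in counts.items()}

# Hypothetical session logs for one analyst.
logs = [["rollup", "slice", "drilldown"],
        ["rollup", "slice", "slice"],
        ["slice", "drilldown"]]
probs = transition_probs(logs)
print(probs["rollup"])   # {'slice': 1.0}
```

    Transition matrices of this kind, together with the session boundaries and attribute equivalence classes, are the raw material from which an OLAP signature of a user's behaviour can be composed.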

    Flexible Integration and Efficient Analysis of Multidimensional Datasets from the Web

    Get PDF
    If numeric data from the Web are brought together, natural scientists can compare climate measurements with estimations, financial analysts can evaluate companies based on balance sheets and daily stock market values, and citizens can explore the GDP per capita from several data sources. However, the heterogeneity and size of the data remain a problem. This work presents methods to query a uniform view - the Global Cube - of available datasets from the Web, building on Linked Data query approaches.
