Study and Performance Analysis of Different Techniques for Computing Data Cubes
In a data warehouse, data is an integrated form of observable and recordable facts drawn from operational or transactional systems. A data warehouse usually stores aggregated and historical data in multi-dimensional schemas. Data has value to end users only when it is formulated and presented as information, a composed collection of facts for decision making. Cube computation is the most efficient way to answer such decision-making queries and retrieve information from the data; Online Analytical Processing (OLAP) is used for this purpose. There are two types of OLAP: Relational Online Analytical Processing (ROLAP) and Multidimensional Online Analytical Processing (MOLAP). This research implemented both ROLAP and MOLAP and compared the two methods to determine how computation time varies with data volume. Generally, a large data warehouse produces extensive output that occupies considerable space with a huge number of empty data cells. To solve this problem, data compression is inevitable; therefore, Compressed Row Storage (CRS) is applied to reduce the empty-cell overhead
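The CRS idea named in the abstract can be sketched as follows. This is a minimal illustration of Compressed Row Storage for a sparse 2-D measure matrix, not the paper's actual implementation; all function names are assumptions.

```python
def to_crs(matrix):
    """Convert a dense 2-D list into CRS form: (values, col_index, row_ptr).

    Only non-empty cells are stored; row_ptr marks where each row's
    values begin, so empty cells cost no space at all.
    """
    values, col_index, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:                 # store only non-empty cells
                values.append(v)
                col_index.append(j)
        row_ptr.append(len(values))    # cumulative non-empty count per row
    return values, col_index, row_ptr

def crs_get(values, col_index, row_ptr, i, j):
    """Read cell (i, j) back from the compressed representation."""
    for k in range(row_ptr[i], row_ptr[i + 1]):
        if col_index[k] == j:
            return values[k]
    return 0                           # cell is empty

dense = [[0, 5, 0],
         [0, 0, 0],
         [7, 0, 9]]
vals, cols, ptr = to_crs(dense)
# vals == [5, 7, 9], cols == [1, 0, 2], ptr == [0, 1, 1, 3]
```

For a cube slice where most cells are empty, the three CRS arrays are far smaller than the dense layout, which is exactly the overhead reduction the abstract targets.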
SPARSITY HANDLING AND DATA EXPLOSION IN OLAP SYSTEMS
A common problem with OnLine Analytical Processing (OLAP) databases is data explosion: data size multiplies when it is loaded from the source data into multidimensional cubes. Data explosion is not an issue for small databases, but it can be a serious problem for large ones. In this paper we discuss the sparsity and data explosion phenomena in the multidimensional data model, which lies at the core of OLAP systems. Our research across five companies in different branches of business confirms the observation that, in reality, most cubes are extremely sparse. We also consider the different methods that relational and multidimensional servers apply to reduce the data explosion and sparsity problems, such as compression and indexing techniques, partitioning, and preliminary aggregations
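The sparsity the paper measures can be made concrete with a small sketch (illustrative only; the function name and sample figures are assumptions, not from the study). A cube's cell space is the product of its dimension cardinalities, and sparsity is the fraction of cells holding no fact.

```python
def sparsity(fact_rows, dim_cardinalities):
    """Fraction of the cube's cell space that is empty.

    fact_rows: iterable of dimension-key tuples, one per loaded fact.
    dim_cardinalities: number of distinct members per dimension.
    """
    total_cells = 1
    for c in dim_cardinalities:
        total_cells *= c               # full multidimensional cell space
    filled = len({tuple(r) for r in fact_rows})  # distinct occupied cells
    return 1.0 - filled / total_cells

# 3 facts loaded into a 10 x 12 x 5 cube (600 cells)
facts = [(1, 2, 3), (1, 2, 4), (9, 0, 0)]
print(round(sparsity(facts, [10, 12, 5]), 3))  # 0.995
```

Even this toy cube is 99.5% empty, which is why loading source rows into a dense multidimensional structure can explode storage.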
Benchmarking Big Data OLAP NoSQL Databases
With the advent of Big Data, new challenges have emerged regarding the evaluation of decision support systems (DSS). Existing evaluation benchmarks are not configured to handle massive data volume and wide data diversity. In this paper, we introduce a new DSS benchmark that supports multiple data storage systems, such as relational and Not Only SQL (NoSQL) systems. Our scheme recognizes numerous data models (snowflake, star, and flat topologies) and several data formats (CSV, JSON, TBL, XML, etc.). It entails complex data generation characterized within the “volume, variety, and velocity” (3V) framework. Our scheme also enables distributed and parallel data generation. Furthermore, we present experimental results with KoalaBench
Interacting with Statistical Linked Data via OLAP Operations
Online Analytical Processing (OLAP) promises an interface for analysing Linked Data containing statistics that goes beyond other interaction paradigms such as follow-your-nose browsers, faceted-search interfaces, and query builders. Transforming statistical Linked Data into a star schema to populate a relational database and applying a common OLAP engine does not allow optimising OLAP queries on RDF or directly propagating changes of Linked Data sources to clients. Therefore, as a new way to interact with statistics published as Linked Data, we investigate the problem of executing OLAP queries via SPARQL on an RDF store. To that end, we first define projection, slice, dice, and roll-up operations on single data cubes published as Linked Data, reusing the RDF Data Cube vocabulary, and show how a nested set of operations leads to an OLAP query. Second, we show how to transform an OLAP query into a SPARQL query that generates all required tuples from the data cube. In a small experiment, we show the applicability of our OLAP-to-SPARQL mapping in answering a business question in the financial domain
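The core idea, a roll-up over an RDF Data Cube becoming a SPARQL GROUP BY, can be sketched as below. This is a hypothetical simplification, not the paper's actual mapping rules: the `qb:` namespace is the real RDF Data Cube vocabulary, but the dimension and measure property IRIs under `example.org` are invented for illustration.

```python
def rollup_to_sparql(dataset, group_dims, measure):
    """Build a SPARQL aggregation query for a roll-up over an RDF Data Cube.

    Each observation of the dataset is matched, its dimension values are
    bound to variables, and the measure is summed per group.
    """
    dim_vars = " ".join(f"?{d}" for d in group_dims)
    dim_patterns = "\n  ".join(
        f"?obs <http://example.org/dim/{d}> ?{d} ." for d in group_dims)
    return f"""SELECT {dim_vars} (SUM(?m) AS ?total)
WHERE {{
  ?obs a <http://purl.org/linked-data/cube#Observation> ;
       <http://purl.org/linked-data/cube#dataSet> <{dataset}> ;
       <http://example.org/measure/{measure}> ?m .
  {dim_patterns}
}}
GROUP BY {dim_vars}"""

print(rollup_to_sparql("http://example.org/ds/sales",
                       ["year", "region"], "amount"))
```

A slice or dice would add FILTER or fixed-value triple patterns to the same skeleton, which is how nested operations compose into a single query.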
Multi-Dimensional Partitioning in BUC for Data Cubes
Bottom-Up Computation (BUC) is one of the most studied algorithms for data cube generation in on-line analytical processing. Its bottom-up style allows the algorithm to efficiently generate a data cube for memory-sized input data. When the entire input cannot fit into memory, much of the literature suggests partitioning the data by one dimension and running the algorithm on each single-dimensional partition. For very large input data, the partitions might still not fit into memory, and partitioning by additional dimensions is required; however, this multi-dimensional partitioning is more complicated than single-dimensional partitioning and has not been fully discussed before. Our goal is to provide a heuristic implementation of multi-dimensional partitioning in BUC. To validate our design, we compare it with our implementation of PipeSort, a top-down data cubing algorithm, and we examine the advantages and disadvantages of the top-down versus the bottom-up data cubing approach
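The bottom-up style the abstract refers to can be sketched for an in-memory relation. This is a minimal, illustrative BUC-style recursion (COUNT(*) as the only aggregate, with the usual iceberg-style minimum-support pruning), not the authors' implementation or their partitioning heuristic.

```python
def buc(rows, dims, prefix=(), start=0, minsup=1, out=None):
    """Bottom-up cube: recurse from the coarsest group-by to finer ones.

    rows:   list of tuples (one value per dimension).
    dims:   dimension names, used to label output group-bys.
    prefix: the (dimension, value) pairs fixed so far.
    start:  first dimension index still available for partitioning,
            so each group-by is generated exactly once.
    """
    if out is None:
        out = {}
    out[prefix] = len(rows)                    # aggregate for this group-by
    for d in range(start, len(dims)):
        partitions = {}
        for r in rows:                         # partition rows on dimension d
            partitions.setdefault(r[d], []).append(r)
        for val, part in partitions.items():
            if len(part) >= minsup:            # prune sparse partitions early
                buc(part, dims, prefix + ((dims[d], val),),
                    d + 1, minsup, out)
    return out

rows = [("a", "x"), ("a", "y"), ("b", "x")]
cube = buc(rows, ["D1", "D2"])
# cube[()] == 3; cube[(("D1", "a"),)] == 2; cube[(("D1", "a"), ("D2", "y"))] == 1
```

When `rows` exceeds memory, the recursion's first partitioning step is exactly where single- or multi-dimensional external partitioning would be applied, each partition then being cubed independently.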
View and Index Selection on Graph Databases
One of the most important aspects of native graph-database systems is their index-free
adjacency property, which ensures that nodes have direct physical RAM addresses and
physically point to other adjacent nodes. The index-free adjacency property accelerates
query answering for queries that are bound to one (or more) specific nodes within the
graph, namely anchor nodes. The corresponding anchor node is used as the starting point
for answering the query by examining its adjacent nodes instead of the whole graph.
Nevertheless, queries not bound to anchor nodes are much harder to answer, since the query
planner must examine a large portion of the graph in order to answer the corresponding
query. In this work we study view and index selection techniques in order to accelerate the
aforementioned class of queries. We analyze different index and view selection strategies
for query answering and show that, depending on the characteristics of the query, the
graph database, and the corresponding answer set, a different strategy may be optimal
among the indexing and view materialization alternatives. Before selecting the views and
indices, our system employs pattern mining techniques in order to guess the
characteristics of future queries. Thus, the initial query workload is represented by a much
smaller summary of the query patterns that are most likely to appear in future queries,
each pattern having a corresponding expected number of appearances. Our selection
strategy is based on a greedy view & index selection strategy that at each step of its
execution tries to maximize the ratio of the benefit of materializing a view/index, to the
corresponding cost of storing it. Our selection algorithm is inspired by the corresponding
greedy algorithm for “Maximizing a Nondecreasing Submodular Set Function Subject to a
Knapsack Constraint”. Our experimental evaluation shows that all the steps of the index
selection process are completed in a few seconds, while the corresponding rewritings
accelerate 15.44% of the queries in the DBpedia query workload. Those queries are
executed in 1.63% of their initial time on average
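The greedy selection described above can be sketched as follows. This is an illustrative simplification of benefit-per-cost greedy selection under a storage budget, not the thesis's algorithm (which follows the submodular-knapsack greedy with its extra best-single-element comparison); the candidate names and numbers are made up.

```python
def greedy_select(candidates, budget):
    """Pick views/indexes greedily by benefit-to-storage-cost ratio.

    candidates: dict mapping name -> (benefit, storage_cost).
    budget:     total storage available for materialization.
    Returns the list of selected names, in selection order.
    """
    chosen, used = [], 0
    remaining = dict(candidates)
    while remaining:
        # candidate with the highest benefit per unit of storage
        best = max(remaining, key=lambda n: remaining[n][0] / remaining[n][1])
        benefit, cost = remaining.pop(best)
        if used + cost <= budget:      # materialize only if it still fits
            chosen.append(best)
            used += cost
    return chosen

views = {"idx_author": (90, 30), "view_city_pop": (50, 10),
         "idx_type": (40, 40), "view_gdp": (8, 1)}
print(greedy_select(views, 45))  # ['view_gdp', 'view_city_pop', 'idx_author']
```

The ratio rule favors cheap, high-benefit structures first, which is why the full algorithm's extra comparison against the single best candidate is needed for its approximation guarantee.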
Materialização à medida de vistas multidimensionais de dados [Tailored materialization of multidimensional data views]
Master's dissertation in Informatics Engineering. With the emergence of the information era, many companies resorted to data warehouses to store
an increasing amount of their business data. With this growth in data volume comes the need to
explore the data more effectively so that it can usefully support business evaluation and
decisions. OLAP (On-Line Analytical Processing) systems respond to this need by helping
business analysts explore the data autonomously, providing them with a multi-perspective,
quick-response structure. However, in order to provide quick access to this information,
the materialization of multi-dimensional structures with the data already pre-calculated
is required, reducing query time to the time needed to read the answer and avoiding the
processing time of each query. The complete materialization of the required data is
impractical in practice, given the volume of data these systems handle and the processing time needed to
calculate all possible combinations. Since the business analyst is the differentiating element in the
effective use of these structures, this work proposes a set of techniques that study the user's
behaviour in order to understand its seasonal patterns and the target views of the analyst's
explorations, so that it becomes possible to define new structures containing the views most
appropriate for materialization, thereby better satisfying the exploration needs of its users. In this
dissertation, structures that collect users' query records are defined, and to this data
techniques for identifying user profiles and usage patterns are applied, namely the
definition of OLAP sessions, the application of Markov chains, and the determination of equivalence
classes of queried attributes. At the end of this study, we propose the definition of an OLAP
signature capable of describing the user's OLAP behaviour from the elements identified by the studied techniques, thus allowing the system administrator to restructure the
multi-dimensional structures tailored to the analysts' actual usage
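The Markov-chain technique mentioned above can be sketched in a few lines. This is an illustrative first-order model over OLAP sessions (sequences of queried views), with all names assumed rather than taken from the dissertation: the estimated transition probabilities indicate which view is likely to be queried next, and hence which views are candidates for materialization.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Estimate first-order Markov transition probabilities.

    sessions: list of sessions, each a list of view identifiers
              in the order the user queried them.
    Returns: {current_view: {next_view: probability}}.
    """
    counts = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):  # consecutive query pairs
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
            for cur, nxts in counts.items()}

sessions = [["sales_by_year", "sales_by_month", "sales_by_day"],
            ["sales_by_year", "sales_by_month"],
            ["sales_by_year", "sales_by_region"]]
model = fit_transitions(sessions)
# model["sales_by_year"] gives P(next view | current = sales_by_year)
```

In this toy log, `sales_by_month` follows `sales_by_year` in two of three sessions, so it would rank highest for pre-materialization after that view.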
Flexible Integration and Efficient Analysis of Multidimensional Datasets from the Web
If numeric data from the Web are brought together, natural scientists can compare climate measurements with estimations, financial analysts can evaluate companies based on balance sheets and daily stock market values, and citizens can explore the GDP per capita from several data sources. However, heterogeneities and size of data remain a problem. This work presents methods to query a uniform view - the Global Cube - of available datasets from the Web and builds on Linked Data query approaches