Integrating analytics with relational databases
The database research community has made tremendous strides in developing powerful database engines that allow for efficient analytical query processing. However, these powerful systems have gone largely unused by analysts and data scientists. This poor adoption is caused primarily by the state of database-client integration. In this thesis we attempt to overcome this challenge by investigating how we can facilitate efficient and painless integration of analytical tools and relational database management systems. We focus our investigation on the three primary methods for database-client integration: client-server connections, in-database processing, and embedding the database inside the client application.
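As a rough illustration of the third integration method (embedding the database inside the client application), the sketch below uses Python's standard-library `sqlite3` module as a stand-in for an embedded engine. SQLite is not the system studied in the thesis, and the table name `measurements`, its columns, and the function name are hypothetical. The point is only that the database runs inside the client process, so results reach the analysis code without a client-server transfer.

```python
import sqlite3


def average_by_group(rows):
    """Load (group, value) tuples into an in-process database and aggregate them.

    The table `measurements` and its columns are hypothetical names chosen
    for this illustration only.
    """
    conn = sqlite3.connect(":memory:")  # the database lives inside this process
    try:
        conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
        conn.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
        cursor = conn.execute(
            "SELECT grp, AVG(value) FROM measurements GROUP BY grp ORDER BY grp"
        )
        return cursor.fetchall()
    finally:
        conn.close()


result = average_by_group([("a", 1.0), ("a", 3.0), ("b", 10.0)])
```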
An Open Source BI Approach: Concept Proof Tracking Fleet
Today a marked process of technological evolution can be observed across the globe. Companies, whether small, medium, or large, are increasingly dependent on information systems to carry out their business processes and, consequently, to generate business information, yet the resulting data often bear no relationship to one another.
Most conventional computer systems are not designed to manage and store strategic information, which prevents that information from serving as a strategic resource. Decisions are therefore made on the basis of managers' experience, when they could be based on historical facts stored by the various systems.
In general, organizations hold a great deal of data but in most cases extract little information from it, which is a problem in competitive markets. As organizations seek to evolve and to outperform the competition in decision-making, the term Business Intelligence (BI) arises in this context.
GisGeo Information Systems is a company that develops software based on geographic information systems (GIS), following an open-source philosophy. Its main product is based on the geographical location of various types of vehicles, the collection of data, and the subsequent analysis of those data (kilometres travelled, duration of a trip between two defined points, fuel consumption, etc.). The theme of this project arises in this context: it aims to give a different perspective on the existing data by combining BI concepts with the system implemented in the company, in keeping with its philosophy.
This project addresses some of the most important concepts surrounding BI, such as the dimensional model, the data warehouse, the ETL process, and OLAP, following the methodology of Ralph Kimball. Some of the main open-source tools on the market are also studied, along with their advantages and disadvantages relative to one another.
In conclusion, the solution developed in accordance with the criteria listed by the company is presented as a proof of concept of the applicability of Business Intelligence to the field of geographic information systems, using an open-source tool that supports data visualization through dashboards.
Yavaa: supporting data workflows from discovery to visualization
Recent years have witnessed an increasing number of data silos being opened up, both within organizations and to the general public. Scientists publish their raw data as supplements to articles or even as standalone artifacts to enable others to verify and extend their work. Governments pass laws to open up formerly protected data treasures to improve accountability and transparency as well as to enable new business ideas based on this public good. Even companies share structured information about their products and services to advertise their use and thus increase revenue. Exploiting this wealth of information poses many challenges for users, though. Data is often provided as tables whose seemingly endless rows of daunting numbers are barely accessible. InfoVis can help bridge this gap. However, the visualization options offered are generally very limited, and next to no support is given in applying any of them. The same holds true for data wrangling: only very few options exist to adjust the data to current needs, and barely any protection is in place to prevent even the most obvious mistakes. When it comes to data from multiple providers, the situation gets even bleaker. Only recently have tools emerged that search for datasets across institutional borders reasonably well. Easy-to-use ways to combine these datasets are still missing, though. Finally, results generally lack proper documentation of their provenance, so even the most compelling visualizations can be called into question when their origin remains unclear. The foundations for a vivid exchange and exploitation of open data are set, but the barrier to entry remains relatively high, especially for non-expert users. This thesis aims to lower that barrier by providing tools and assistance, reducing the amount of prior experience and skills required. It covers the whole workflow, from identifying suitable datasets, through possible transformations, to exporting the result in the form of suitable visualizations.
Physical Plan Instrumentation in Databases: Mechanisms and Applications
Database management systems (DBMSs) are designed to compile SQL queries into physical plans that, when executed, produce the results of those queries. Building on this functionality, an ever-increasing number of application domains (e.g., provenance management, online query optimization, physical database design, interactive data profiling, monitoring, and interactive data visualization) seek to operate on how queries are executed by the DBMS, for purposes ranging from debugging and data explanation to optimization and monitoring. Unfortunately, DBMSs provide little, if any, support to facilitate the development of this important class of applications. As a result, database application developers and database system architects either rewrite the database internals in ad hoc ways; work around the SQL interface, if possible, with inevitable performance penalties; or even build new databases from scratch, only to express and optimize their domain-specific application logic over how queries are executed.
To address this problem in a principled manner in this dissertation, we introduce a prototype DBMS, namely, Smoke, that exposes instrumentation mechanisms in the form of a framework to allow external applications to manipulate physical plans. Intuitively, a physical plan is the underlying representation that DBMSs use to encode how a SQL query will be executed, and providing instrumentation mechanisms at this representation level allows applications to express and optimize their logic on how queries are executed.
With such an instrumentation-enabled DBMS in place, we then consider how to express and optimize applications that base their logic on how queries are executed. To best demonstrate the expressive and optimization power of instrumentation-enabled DBMSs, we express and optimize applications across several important domains, including provenance management, interactive data visualization, interactive data profiling, physical database design, online query optimization, and query discovery. In terms of expressivity, we show that Smoke can express known techniques, introduce novel semantics on known techniques, and introduce new techniques across domains. In terms of performance, we show case by case that Smoke is on par with, or up to several orders of magnitude faster than, state-of-the-art imperative and declarative implementations of important applications across domains.
As such, we believe our contributions provide evidence and form the basis for a class of instrumentation-enabled DBMSs designed to express and optimize applications across important domains whose core logic operates over how queries are executed by DBMSs.
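The idea of instrumenting a physical plan can be sketched with a toy iterator-style plan in Python. This is not Smoke's API or implementation; the operator names `scan` and `select`, the `lineage` hook, and the sample table are all assumptions made for illustration. The filter operator accepts an optional hook that records which input row produced each output row, which is the kind of information a provenance application would need.

```python
def scan(table):
    """Leaf operator: yield (row_id, row) pairs from an in-memory table."""
    for row_id, row in enumerate(table):
        yield row_id, row


def select(child, predicate, lineage=None):
    """Filter operator with an optional instrumentation hook.

    When `lineage` is a list, the operator appends the input row id of every
    emitted row, so lineage[i] is the input row behind output row i.
    """
    for row_id, row in child:
        if predicate(row):
            if lineage is not None:
                lineage.append(row_id)
            yield row_id, row


table = [{"v": 1}, {"v": 5}, {"v": 3}]  # hypothetical data
lineage = []
output = [row for _, row in select(scan(table), lambda r: r["v"] > 1, lineage)]
```

Passing `lineage=None` runs the same plan without instrumentation, so an application pays for lineage capture only when it asks for it.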
Scalable Automated Incrementalization for Real-Time Static Analyses
This thesis proposes a framework for easy development of static analyses, whose results are incrementalized to provide instantaneous feedback in an integrated development environment (IDE).
Today, IDEs feature many tools that have static analyses as their foundation to assess software quality and catch correctness problems.
Yet, these tools often fail to provide instantaneous feedback and are thus restricted to nightly build processes. This precludes developers from fixing issues at their inception time, i.e., when the problem and the developed solution are both still fresh in mind.
To provide instantaneous feedback, incrementalization is a well-known technique that exploits the fact that developers make only small changes to the code and, hence, analysis results can be recomputed quickly based on these changes. Yet, incrementalization requires carefully crafted static analyses. Thus, a manual approach to incrementalization is unattractive. Automated incrementalization can alleviate these problems and allows analysis writers to formulate their analyses as queries with the full data set in mind, without worrying about the semantics of incremental changes.
Existing approaches to automated incrementalization utilize standard technologies, such as deductive databases, that provide declarative query languages, yet also require the full dataset to be materialized in main memory, i.e., the memory is permanently blocked by the data required for the analyses. Other standard technologies such as relational databases offer better scalability due to persistence, yet incur long transaction times for data. Neither technology is a perfect match for integrating static analyses into an IDE, since the underlying data, i.e., the code base, is already persisted and managed by the IDE. Hence, transitioning the data into a database is redundant work.
In this thesis a novel approach is proposed that provides a declarative query language and automated incrementalization, yet retains in memory only a necessary minimum of data, i.e., only the data that is required for the incrementalization. The approach allows static analyses to be declared as incrementally maintained views, where the underlying formalism for incrementalization is the relational algebra with extensions for object-orientation and recursion. The algebra makes it possible to deduce which data is the necessary minimum for incremental maintenance and indeed shows that many views are self-maintainable, i.e., do not require any data to be materialized in memory. In addition, an optimization for the algebra is proposed that widens the range of self-maintainable views, based on domain knowledge of the underlying data. The optimization works similarly to declaring primary keys for databases, i.e., the optimization is declared on the schema of the data and defines which data is incrementally maintained in the same scope. The scope makes all analyses (views) that correlate only data within the boundaries of the scope self-maintainable.
The approach is implemented as an embedded domain-specific language in a general-purpose programming language. The implementation can be understood as a database-like engine with an SQL-style query language and the execution semantics of the relational algebra. As such, the system is a general-purpose database-like query engine and can be used to incrementalize domains other than static analyses. To evaluate the approach, a large variety of static analyses were sampled from real-world tools and formulated as incrementally maintained views in the implemented engine.
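A minimal Python sketch of a self-maintainable view follows. It is not the thesis's engine or query language; the class name `SelectionView`, its event methods, and the example analysis (flagging methods with more than five parameters) are hypothetical. A selection can be maintained from change events alone: each added or removed element is checked against the predicate, so the view never needs the full code base in memory, only its own result.

```python
class SelectionView:
    """Incrementally maintained selection over a stream of change events.

    The view stores only its own result set, never the underlying data.
    """

    def __init__(self, predicate):
        self.predicate = predicate
        self.result = set()

    def on_added(self, item):
        # A new element affects the view only if it satisfies the predicate.
        if self.predicate(item):
            self.result.add(item)

    def on_removed(self, item):
        # Removing an element that was never selected is a no-op.
        self.result.discard(item)

    def on_updated(self, old, new):
        self.on_removed(old)
        self.on_added(new)


# Hypothetical analysis: methods are (name, parameter_count) tuples.
view = SelectionView(lambda method: method[1] > 5)
view.on_added(("parse", 7))
view.on_added(("run", 1))
view.on_updated(("parse", 7), ("parse", 3))  # the developer fixes `parse`
view.on_added(("build", 6))
```

A view that correlates two inputs, such as a join, generally cannot be maintained this way without keeping some of its input, which is where the distinction between self-maintainable and other views matters.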
Fine-Grained Provenance And Applications To Data Analytics Computation
Data provenance tools seek to facilitate reproducible data science and auditable data analyses by capturing the analytics steps used in generating data analysis results. However, analysts must choose among workflow provenance systems, which allow arbitrary code but only track provenance at the granularity of files; provenance APIs, which provide tuple-level provenance but incur overhead in all computations; and database provenance tools, which track tuple-level provenance through relational operators and support optimization but support only a limited subset of data science tasks. None of these solutions is well suited for tracing errors introduced during common ETL, record alignment, and matching tasks for data types such as strings and images. Additionally, we need a provenance archival layer to store and manage the tracked fine-grained provenance, one that enables future sophisticated reasoning about why individual output results appear or fail to appear. For reproducibility and auditing, the provenance archival system should be tamper-resistant. At the same time, the provenance collected over time or within the same query computation tends to be partially repeated (i.e., the same operation with the same input records in an intermediate computation step). Hence, we desire efficient provenance storage that compresses repeated results. We address these challenges with novel formalisms and algorithms, implemented in the PROVision system, for reconstructing fine-grained provenance for a broad class of ETL-style workflows. We extend database-style provenance techniques to capture equivalences, support optimizations, and enable lazy evaluations. We develop solutions for storing fine-grained provenance in relational storage systems while both compressing and protecting it via cryptographic hashes. We experimentally validate our proposed solutions using both scientific and OLAP workloads.
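One way to picture how cryptographic hashes can both compress and protect stored provenance is a content-addressed store, sketched below in Python. This is an illustrative assumption, not the PROVision design: the class `ProvenanceStore`, its methods, and the record layout are hypothetical. Each provenance record is keyed by the SHA-256 hash of its canonical serialization, so a repeated step (same operation, same inputs) is stored once, and any later modification of a stored record no longer matches its key.

```python
import hashlib
import json


class ProvenanceStore:
    """Content-addressed store for provenance records (illustrative only)."""

    def __init__(self):
        self.records = {}  # hex digest -> serialized record

    def put(self, operation, inputs):
        """Store one derivation step and return its hash key."""
        record = {"op": operation, "inputs": list(inputs)}
        payload = json.dumps(record, sort_keys=True).encode("utf-8")
        key = hashlib.sha256(payload).hexdigest()
        self.records.setdefault(key, payload)  # repeated steps are stored once
        return key

    def verify(self, key):
        """Return True if the stored record still hashes to its key."""
        payload = self.records.get(key)
        return payload is not None and hashlib.sha256(payload).hexdigest() == key


store = ProvenanceStore()
a = store.put("source", ["rec-1"])
b = store.put("source", ["rec-2"])
first = store.put("match", [a, b])
second = store.put("match", [a, b])  # the same step seen again
```

Because a derived record's inputs are themselves hash keys, a change to any upstream record would change every key that depends on it, which is what makes tampering detectable in this kind of scheme.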
Analytical Query Processing Using Heterogeneous SIMD Instruction Sets
Numerous applications gather increasing amounts of data, which have to be managed and queried. Different hardware developments help to meet this challenge. The growing capacity of main memory enables database systems to keep all their data in memory. Additionally, the hardware landscape is becoming more diverse. A plethora of homogeneous and heterogeneous co-processors is available, where heterogeneity refers not only to different computing power, but also to different instruction set architectures. For instance, modern Intel® CPUs offer different instruction sets supporting the Single Instruction Multiple Data (SIMD) paradigm, e.g. SSE, AVX, and AVX512.
Database systems have started to exploit SIMD to increase performance. However, this is still a challenging task, because existing algorithms were mainly developed for scalar processing and because there is a huge variety of different instruction sets, which were never standardized and have no unified interface. Porting a system to another hardware architecture therefore requires completely rewriting the source code, even if those architectures are not fundamentally different and are designed by the same company. Moreover, operations on large registers, which are the core principle of SIMD processing, behave counter-intuitively in several cases. This is especially true for analytical query processing, where different memory access patterns and data dependencies caused by the compression of data challenge the limits of the SIMD principle. Finally, there are physical constraints on the use of such instructions that affect CPU frequency scaling, which is further influenced by the use of multiple cores. This is because the supply power of a CPU is limited, such that not all transistors can be powered at the same time. Hence, there is a complex relationship between performance and power, and therefore also between performance and energy consumption.
This thesis addresses the specific challenges introduced by the application of SIMD in general, and by the heterogeneity of SIMD ISAs in particular. Hence, the goal of this thesis is to exploit the potential of heterogeneous SIMD ISAs for increasing both performance and energy efficiency.
Index-based Join Operations in Hive
MAHSA MOFIDPOOR
The exponential growth of data being generated, manipulated, analyzed, and archived nowadays introduces new challenges and opportunities for dealing with so-called big data. Hive is a batch-oriented big data system, well suited for query processing and data analysis. Originally developed by Facebook in 2009 and now under the Apache Software Foundation, Hive is gaining popularity for its SQL-like query language, HiveQL, and for supporting the majority of the SQL operations found in relational database management systems (RDBMS). Because the join is an expensive operation in an RDBMS, it has been the focus of many query optimization techniques to improve the performance of database systems. We investigate such techniques for join operations in Hive and develop an index-based join algorithm for queries in HiveQL. When a query requires only a small subset of data selected by a predicate in the WHERE clause, the brute-force method, which scans the entire tables, performs poorly because of redundant disk I/O and, when the query is executed using MapReduce, the initiation of unnecessary map tasks.
In this work, we implement the proposed index-based technique and integrate it in Hive. To add our extension, we obtain Hive architecture details by reverse engineering the code and map our design to the conceptual optimization flow. To evaluate the performance, after setting up the environment, we run relevant test queries on datasets generated using the industry-standard benchmark TPC-H. Our results indicate a significant performance gain on relatively large data or highly selective queries.
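The contrast between a full scan and an index-based join can be sketched in a few lines of in-memory Python. This is not the algorithm implemented in Hive in this work; the function names, table layouts, and sample rows are hypothetical. An index on the predicate column lets the join touch only the rows selected by the WHERE clause, and an index on the join key avoids scanning the other table for each match.

```python
from collections import defaultdict


def build_index(rows, column):
    """Map each value of `column` to the positions of the rows holding it."""
    index = defaultdict(list)
    for position, row in enumerate(rows):
        index[row[column]].append(position)
    return index


def indexed_join(customers, orders, nation_index, order_index, nation):
    """Join customers of one nation with their orders using only index lookups."""
    results = []
    for c_pos in nation_index.get(nation, []):  # rows selected by the predicate
        customer = customers[c_pos]
        for o_pos in order_index.get(customer["id"], []):  # matching join keys
            results.append((customer["name"], orders[o_pos]["total"]))
    return results


# Hypothetical data for illustration.
customers = [
    {"id": 1, "name": "Ann", "nation": "CA"},
    {"id": 2, "name": "Bo", "nation": "DE"},
    {"id": 3, "name": "Cy", "nation": "CA"},
]
orders = [
    {"cust": 1, "total": 10},
    {"cust": 2, "total": 20},
    {"cust": 1, "total": 30},
    {"cust": 3, "total": 40},
]
nation_index = build_index(customers, "nation")
order_index = build_index(orders, "cust")
matches = indexed_join(customers, orders, nation_index, order_index, "CA")
```

The more selective the predicate, the fewer rows the indexed path reads compared with a scan of both tables, which is consistent with the abstract's observation that gains are largest for highly selective queries.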