5 research outputs found
Supporting Database Designers in Entity-Relationship Modeling: An Ontology-Based Approach
Database design has long been recognized as a difficult problem, requiring a great deal of skill on the part of the designer. Research has been carried out that provides methodologies and rules for creating good designs. There have even been attempts to automate the design process. However, before these can be truly successful, methodologies and tools are needed that can incorporate and use domain knowledge. In this research, a methodology for supporting database design is proposed that makes use of domain-specific knowledge about an application, stored in the form of ontologies. The ontologies provide information that is useful both in the creation of new designs and in the verification of existing ones. They also capture the constraints of an application domain. A methodology for assisting database design that takes advantage of the ontologies has been implemented in a prototype system. Initial testing of the prototype illustrates that the incorporation and use of ontologies are effective in creating database designs.
Formal design of data warehouse and OLAP systems: a dissertation presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Information Systems at Massey University, Palmerston North, New Zealand
A data warehouse is a single data store where data from multiple data sources is integrated for online analytical processing (OLAP) across an entire organisation. The rationale for being single and integrated is to ensure a consistent view of organisational business performance, independent of the different angles of business perspectives. Due to its wide coverage of subjects, data warehouse design is a highly complex, lengthy and error-prone process. Furthermore, the business analytical tasks change over time, which results in changes in the requirements for the OLAP systems. Thus, data warehouse and OLAP systems are rather dynamic, and the design process is continuous. In this thesis, we propose a method that is integrated, formal and application-tailored, in order to overcome the complexity problem, deal with the system dynamics, and improve the quality of the system and its chance of success.
Our method comprises three important parts: the general ASM method with types, the application-tailored design framework for data warehouse and OLAP design, and the schema integration method with a set of provably correct refinement rules.
By using the ASM method, we are able to model both data and operations in a uniform conceptual framework, which enables us to design an integrated approach for data warehouse and OLAP design. The freedom given by the ASM method allows us to model the system at an abstract level that is easy to understand for both users and designers. More specifically, the language allows us to use terms from the user domain, not biased by the terms used in computer systems. The pseudo-code-like transition rules, which give the simplest form of operational semantics in ASMs, are close to programming languages and therefore easy for designers to understand. Furthermore, these rules are rooted in mathematics, which assists in improving the quality of the system design.
By extending the ASMs with types, the modelling language is tailored for data warehousing with terms that are well established for data-intensive applications, which makes it easy to model schema evolution as refinements in dynamic data warehouse design.
By providing the application-tailored design framework, we break down the design complexity by business processes (also called subjects in data warehousing) and by design concerns. By designing the data warehouse subject by subject, our method resembles Kimball's "bottom-up" approach; with the schema integration method, however, our method resolves that approach's stovepipe issue. By building up a data warehouse iteratively in an integrated framework, our method not only results in an integrated data warehouse, but also resolves the issues of complexity and delayed ROI (Return On Investment) in Inmon's "top-down" approach. By dealing with user change requests in the same way as new subjects, and by modelling data and operations explicitly in a three-tier architecture, namely the data sources, the data warehouse and the OLAP layer, our method facilitates dynamic design with system integrity.
By introducing a notion of refinement specific to schema evolution, namely schema refinement, which captures the notion of schema dominance in schema integration, we are able to build a set of refinement rules proven correct. By providing this set of refinement rules, we simplify the designer's work in verifying design correctness. Nevertheless, we do not aim for a complete set, since there are many different ways to perform schema integration, nor do we prescribe a single way of integration, so that designers remain free to favour their own designs.
Furthermore, given its flexibility in the process, our method can easily be extended to address newly emerging design issues.
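The idea of schema dominance behind the refinement rules can be illustrated with a minimal sketch, outside the thesis's ASM formalism: a refinement step is lossless when every instance of the old schema can be embedded into the refined schema without losing information. The schemas, attribute names and default-value rule below are illustrative assumptions, not taken from the dissertation.

```python
# Hedged sketch of schema dominance: a refinement adds an attribute with a
# default value, together with an embedding that maps old-schema rows into
# the new schema. Schemas are modelled simply as sets of attribute names.

def refine_add_attribute(schema, attr, default):
    """One refinement-rule sketch: extend a schema with a new attribute.

    Returns the refined schema and an embedding function that lifts a row
    of the old schema into the new one (dominance: nothing is lost).
    """
    new_schema = schema | {attr}

    def embed(row):
        lifted = dict(row)          # keep every old attribute unchanged
        lifted[attr] = default      # fill the new attribute with the default
        return lifted

    return new_schema, embed

old_schema = {"product_id", "amount"}
new_schema, embed = refine_add_attribute(old_schema, "currency", "EUR")

old_row = {"product_id": 42, "amount": 9.9}
new_row = embed(old_row)

# the embedding preserves all old information, so the refined schema
# dominates the old one with respect to this instance
assert all(new_row[a] == old_row[a] for a in old_schema)
print(new_row["currency"])  # EUR
```

The check at the end is the essence of the dominance argument: the refinement is acceptable because the embedding is information-preserving on every old attribute.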
A relational algebra approach to ETL modeling
The MAP-i Doctoral Programme in Informatics, of the Universities of Minho, Aveiro and Porto
Information Technology has been one of the drivers of the revolution currently happening in management decisions in most organizations. The amount of data gathered and processed through the use of computing devices grows every day, providing a valuable source of information for the decision makers managing every type of organization, public or private. Gathering the right amount of data into a centralized and unified repository like a data warehouse is similar to building the foundations of a system that will act as a base to support decision-making processes requiring factual information. Nevertheless, building such a repository is very challenging, as is developing all the components of a data warehousing system. One of the most critical components of a data warehousing system is the Extract-Transform-Load component, ETL for short, which is responsible for gathering data from information sources and for cleaning, transforming and conforming it in order to store it in the data warehouse. Several design methodologies for ETL components have been presented in the last few years, with very little impact on commercial ETL tools. Basically, this was due to the gap between the conceptual design of an ETL system and its corresponding physical implementation. The proposed methodologies ranged from new approaches, with novel notations and diagrams, to the adoption and extension of current standard modeling notations, like UML or BPMN. However, none of these proposals contains enough detail to be translated automatically into a specific execution platform. The use of a standard, well-known notation like Relational Algebra might bridge the gap between the conceptual design and the physical design of an ETL component, mainly due to its formal approach, based on a limited set of operators, and to its functional characteristics, namely being a procedural language operating over data stored in relational format. The abstraction that Relational Algebra provides over the technological infrastructure might also be an advantage for uncommon execution platforms, like computing grids, which provide the exceptional amount of processing power that is critical for ETL systems. Additionally, partitioning data and distributing tasks over computing nodes works quite well with a Relational Algebra approach. Extensive research on the use of Relational Algebra in the ETL context was conducted to validate its usage. To complement this, a set of Relational Algebra patterns was also developed to support the most common ETL tasks, like change data capture, data quality enforcement, data conciliation and integration, slowly changing dimensions and surrogate key pipelining. All these patterns provide a formal approach to the referred ETL tasks by specifying all the operations needed to accomplish them as a series of Relational Algebra operations. To evaluate the feasibility of the work done in this thesis, we used a real ETL application scenario for the extraction of data from the operational systems of two different social networks, storing hashtag usage information in a specific data mart. The ability to analyze trends in social network usage is a hot topic in today's media and information coverage. A complete design of the ETL component using the patterns developed is also provided, as well as a critical evaluation of its usage.
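One of the pattern families named above, change data capture, lends itself to a minimal Relational Algebra sketch: the rows to insert and to delete follow from two set differences between snapshots. This is a hedged illustration only, not the thesis's own pattern notation; relations are modelled as Python frozensets of tuples, and the (id, hashtag, count) schema is an invented example inspired by the data-mart scenario.

```python
# Hedged sketch: change data capture expressed with the Relational Algebra
# difference operator. Relations are frozensets of tuples; the schema
# (id, hashtag, count) and the sample values are illustrative assumptions.

def difference(r, s):
    """Relational difference R - S: tuples present in R but not in S."""
    return r - s

def change_data_capture(source, warehouse):
    """Rows to insert = source - warehouse; rows to delete = warehouse - source."""
    return difference(source, warehouse), difference(warehouse, source)

previous = frozenset({(1, "#etl", 10), (2, "#olap", 3)})   # warehouse snapshot
current  = frozenset({(1, "#etl", 10), (3, "#dw", 7)})     # fresh source extract

to_insert, to_delete = change_data_capture(current, previous)
print(sorted(to_insert))  # [(3, '#dw', 7)]
print(sorted(to_delete))  # [(2, '#olap', 3)]
```

Because the whole computation is two difference operations, it partitions naturally over computing nodes, which is consistent with the grid-execution argument made in the abstract.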