2 research outputs found
Captura de dados em tempo real em sistemas de data warehousing
Master's dissertation in Informatics Engineering
The widespread adoption of information systems has contributed significantly to the way
users interact with companies and their systems. This new relationship between customer
and supplier has significantly increased the volume of data generated by organizations,
creating new needs regarding how to maintain and manage all this information. Companies
have therefore invested more and more in solutions that keep all processed information
consolidated in a single data repository. These systems are commonly called data
warehousing systems. Traditionally, they are refreshed offline, at intervals that may be
daily or weekly. However, growing competitiveness in the business world makes this kind
of refresh inadequate, because it delays the reaction to the action that generated the
information. Long refresh periods leave the information out of date, which reduces its
importance and value to the organization. It is therefore increasingly necessary for the
information stored in a data warehousing system to be as recent as possible, without
interruptions in its availability. The need to obtain information in real time poses
several challenges, such as keeping the data accessible 24 hours a day, 7 days a week,
365 days a year, reducing data latency, and avoiding operational bottlenecks in the
transactional systems. It is thus imperative to use non-intrusive data collection
techniques that act at the moment an event occurs in an operational system and reflect
its information immediately (or almost immediately) in the data warehousing system. This
dissertation studies the problems related to real-time change data capture and designs a
component capable of supporting a universal real-time data extraction system, one that
captures the changes occurring in transactional systems in a non-intrusive way and
communicates them to the data warehousing system at the right time.
Pragmatic development of service based real-time change data capture
This thesis makes a contribution to the Change Data Capture (CDC) field by providing an empirical evaluation of the performance of CDC architectures in the context of real-time data warehousing. CDC is a mechanism for providing data warehouse architectures with fresh data from Online Transaction Processing (OLTP) databases. There are two types of CDC architecture: pull architectures and push architectures. There is little data on the performance of CDC architectures in a real-time environment. Performance data is required to determine the real-time viability of the two architectures. We propose that push CDC architectures are optimal for real-time CDC. However, push CDC architectures are seldom implemented because they are highly intrusive towards existing systems and arduous to maintain. As part of our contribution, we pragmatically develop a service-based push CDC solution, which addresses the issues of intrusiveness and maintainability. Our solution uses Data Access Services (DAS) to decouple CDC logic from the applications. A requirement for the DAS is to place minimal overhead on a transaction in an OLTP environment. We synthesize DAS literature and pragmatically develop DAS that efficiently execute transactions in an OLTP environment. Essentially, we develop efficient RESTful DAS, which expose Transactions As A Resource (TAAR). We evaluate the TAAR solution and three pull CDC mechanisms in a real-time environment, using the industry-recognised TPC-C benchmark. The optimal CDC mechanism in a real-time environment will capture change data with minimal latency and will have a negligible effect on the database's transactional throughput. Capture latency is the time it takes a CDC mechanism to capture a data change that has been applied to an OLTP database. A standard definition for capture latency and how to measure it does not exist in the field. We create this definition and extend the TPC-C benchmark to make the capture latency measurement.
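The definition of capture latency above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the thesis's TAAR implementation or its TPC-C extension: the names `OltpStore`, `PullCapture`, `PushCapture` and `simulate` are hypothetical, time is modelled as integer ticks, and the push path is assumed to observe a change in the same tick as the commit, which a real system would not achieve exactly. The model also ignores the effect of capture on transaction throughput.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Change:
    key: str
    committed_at: int  # simulated tick at which the OLTP commit happened


@dataclass
class OltpStore:
    """Toy stand-in for an OLTP database: an append-only list of changes."""
    changes: List[Change] = field(default_factory=list)
    subscribers: List[Callable[[Change, int], None]] = field(default_factory=list)

    def commit(self, key: str, now: int) -> None:
        change = Change(key, committed_at=now)
        self.changes.append(change)
        # Push CDC (assumed model): the data access layer hands the change
        # to every subscriber as part of the commit itself.
        for notify in self.subscribers:
            notify(change, now)


class PullCapture:
    """Pull CDC: periodically asks the store for changes it has not seen."""

    def __init__(self, store: OltpStore) -> None:
        self.store = store
        self.offset = 0
        self.latencies: List[int] = []

    def poll(self, now: int) -> None:
        unseen = self.store.changes[self.offset:]
        self.offset = len(self.store.changes)
        # Capture latency = capture time minus commit time.
        self.latencies.extend(now - c.committed_at for c in unseen)


class PushCapture:
    """Push CDC: is notified by the store when a change is committed."""

    def __init__(self, store: OltpStore) -> None:
        self.latencies: List[int] = []
        store.subscribers.append(self.on_change)

    def on_change(self, change: Change, now: int) -> None:
        self.latencies.append(now - change.committed_at)


def simulate(commit_times: List[int], poll_interval: int,
             end_time: int) -> Tuple[List[int], List[int]]:
    """Replay commits and polls in time order; return (pull, push) latencies."""
    store = OltpStore()
    pull, push = PullCapture(store), PushCapture(store)
    # Priority 0 sorts commits before a poll that falls on the same tick.
    events = [(t, 0) for t in commit_times]
    events += [(t, 1) for t in range(poll_interval, end_time + 1, poll_interval)]
    for now, kind in sorted(events):
        if kind == 0:
            store.commit("row", now)
        else:
            pull.poll(now)
    return pull.latencies, push.latencies


if __name__ == "__main__":
    pull, push = simulate([1, 4, 7], poll_interval=5, end_time=10)
    print(pull)  # [4, 1, 3]
    print(push)  # [0, 0, 0]
```

In this model the pull latency is bounded by the polling interval, so shortening the interval lowers latency at the price of more queries against the store. The sketch deliberately leaves that cost out.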
The results from our evaluation show that pull CDC is capable of real-time CDC at low levels of user concurrency. However, as the level of user concurrency scales upwards, pull CDC has a significant impact on the database's transaction rate, which affirms the theory that pull CDC architectures are not viable in a real-time architecture. TAAR CDC, on the other hand, is capable of real-time CDC and places a minimal overhead on the transaction rate, although this performance comes at the expense of CPU resources.
EThOS - Electronic Theses Online Service, GB