135 research outputs found

    A Proposed Architecture for Big Data Driven Supply Chain Analytics

    Advances in information and communication technology (ICT) have given rise to an explosion of data in every field of operations. Working with this enormous volume of data (popularly known as Big Data) to extract useful information that supports decision making is one of the sources of competitive advantage for organizations today. Enterprises are leveraging the power of analytics in formulating business strategy in every facet of their operations to mitigate business risk. A volatile global market has compelled organizations to redefine their supply chain management (SCM). In this paper, we delineate the relevance of Big Data and its importance in managing end-to-end supply chains for achieving business excellence. A Big Data-centric architecture for SCM is proposed that exploits current state-of-the-art technology for data management, analytics and visualization. The security and privacy requirements of a Big Data system are also highlighted, and several mechanisms are discussed for implementing these features in a real-world Big Data system deployment in the context of SCM. Some future scope of work is also pointed out.
    Keywords: Big Data, Analytics, Cloud, Architecture, Protocols, Supply Chain Management, Security, Privacy.
    Comment: 24 pages, 4 figures, 3 tables

    Collaboration and Virtualization in Large Information Systems Projects

    A project evolves through different phases, from idea and conception through experiments, implementation and maintenance. Globalization, the Internet, the Web and mobile computing have changed many human activities, including the realization of Information System (IS) projects. Projects are growing, teams are geographically distributed, and users are heterogeneous. The realization of large Information Technology (IT) projects therefore requires collaborative technologies. The distribution of the team, the heterogeneity of the users and the complexity of the project determine the virtualization. This paper gives an overview of these aspects for large IT projects. It briefly presents a general framework developed by the authors for collaborative systems in general and adapted to collaborative project management. The general considerations are illustrated with the case of a large IT project in which the authors were involved.
    Keywords: large IT projects, collaborative systems, virtualization, framework for collaborative virtual systems

    Comparative Study Of Implementing The On-Premises and Cloud Business Intelligence On Business Problems In a Multi-National Software Development Company

    Internship Report presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence
    Nowadays every enterprise wants to be competitive. In the last decade, data volumes have increased dramatically. As the amount of data in the market grows each year, the ability to extract, analyze and manage data becomes a basic condition for an organization to be competitive. Organizations therefore need to adapt their technologies to the new business reality and provide new solutions that meet new requests. Business Intelligence, by its main definition, is the ability to extract, analyze and manage data through which an organization gains a competitive advantage. Before adopting this approach, it is important to decide which computing system it will be based on, considering the volume of data, the business context of the organization and the technology requirements of the market. In the last 10 years, the popularity of cloud computing has increased and divided computing systems into On-Premises and cloud. The benefits of the cloud are scalability, availability and lower costs. On the other hand, traditional On-Premises systems provide independence of software configuration, control over data and high security. Deciding which computing paradigm to follow is not an easy task, and it depends on the business context of the organization and on the performance characteristics of the current On-Premises systems in its business processes. Business Intelligence functions therefore require in-depth analysis to understand whether cloud computing technologies could perform better in those processes than traditional systems. The objective of this internship is to conduct a comparative study of the two computing systems in routine Business Intelligence functions.
The study compares On-Premises Business Intelligence based on Oracle architecture with cloud Business Intelligence based on Google Cloud services. The comparative study is conducted through participation, over 12 months, in activities and projects in the Business Intelligence department of a company that develops digital software solutions for the telecommunications market, as an internship student in the 2nd year of a master's degree in Information Management, with a specialization in Knowledge Management and Business Intelligence, at Nova Information Management School (NOVA IMS).

    Snapshot : friend or foe of data management - on optimizing transaction processing in database and blockchain systems

    Data management is a complicated task. Because data management tasks are so varied, businesses often need a sophisticated infrastructure with a plethora of distinct systems to fulfill their requirements. Moreover, since snapshots are an essential ingredient in solving many data management tasks, such as checkpointing and recovery, they have been widely exploited in almost all major data management systems that have appeared in recent years. However, snapshots do not always guarantee exceptional performance. In this dissertation, we will see two different faces of the snapshot: one where it has a tremendous positive impact on the performance and usability of the system, and another where incorrect usage of the snapshot can have a significant negative impact on the performance of the system. This dissertation consists of three loosely coupled parts that represent three distinct projects that emerged during this doctoral research. In the first part, we analyze the importance of utilizing snapshots in relational database systems. We identify the bottlenecks in state-of-the-art snapshotting algorithms, propose two snapshotting techniques, and optimize multi-version concurrency control (MVCC) for handling hybrid workloads effectively. Our snapshotting algorithm is up to 100x faster and reduces the latency of analytical queries by up to 4x in comparison to state-of-the-art techniques. In the second part, we recognize the strict snapshotting used by Fabric as a critical bottleneck, replace it with MVCC, and propose some additional optimizations that improve the throughput of the permissioned-blockchain system by up to 12x under highly contended workloads. In the last part, we propose ChainifyDB, a platform that transforms an existing database infrastructure into a blockchain infrastructure. ChainifyDB achieves up to 6x higher throughput in comparison to another state-of-the-art permissioned blockchain system.
Furthermore, its external concurrency control protocol outperforms the internal concurrency control protocol of PostgreSQL and MySQL, achieving up to 2.6x higher throughput in a blockchain setup in comparison to a standalone isolated setup. We also utilize snapshots in ChainifyDB to support recovery, which has been missing so far from the permissioned-blockchain world.
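As an illustration of the multi-version idea mentioned in this abstract, the following is a minimal sketch of snapshot reads over versioned data. It is not the dissertation's snapshotting algorithm, nor the mechanism used by Fabric or ChainifyDB; the class `VersionedStore` and its methods are hypothetical names chosen for this example. The point it shows is that a snapshot can be a mere timestamp, so readers see a consistent state without blocking later writers.

```python
from bisect import bisect_right


class VersionedStore:
    """Toy multi-version key-value store (hypothetical, for illustration)."""

    def __init__(self):
        self._clock = 0          # logical commit counter
        self._timestamps = {}    # key -> commit timestamps, ascending
        self._values = {}        # key -> values, parallel to _timestamps

    def write(self, key, value):
        """Commit a new version of key and return its commit timestamp."""
        self._clock += 1
        self._timestamps.setdefault(key, []).append(self._clock)
        self._values.setdefault(key, []).append(value)
        return self._clock

    def snapshot(self):
        """Return a snapshot timestamp; no data is copied."""
        return self._clock

    def read(self, key, snapshot_ts):
        """Return the newest version committed at or before snapshot_ts."""
        timestamps = self._timestamps.get(key, [])
        idx = bisect_right(timestamps, snapshot_ts)
        return self._values[key][idx - 1] if idx else None


store = VersionedStore()
store.write("balance", 100)
snap = store.snapshot()          # an analytical reader starts here
store.write("balance", 250)      # a later transactional write
old_view = store.read("balance", snap)
new_view = store.read("balance", store.snapshot())
```

In this sketch the reader holding `snap` keeps seeing the older value while newer snapshots see the later write, which is the property that lets analytical and transactional work proceed side by side.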

    A relational algebra approach to ETL modeling

    The MAP-i Doctoral Programme in Informatics, of the Universities of Minho, Aveiro and Porto
    Information Technology has been one of the drivers of the revolution currently happening in management decisions in most organizations. The amount of data gathered and processed through the use of computing devices has been growing every day, providing a valuable source of information for decision makers managing every type of organization, public or private. Gathering the right amount of data in a centralized and unified repository like a data warehouse is similar to building the foundations for a system that will act as a base to support decision-making processes requiring factual information. Nevertheless, building such a repository is very challenging, as is developing all the components of a data warehousing system. One of the most critical components of a data warehousing system is the Extract-Transform-Load component, ETL for short, which is responsible for gathering data from information sources and cleaning, transforming and conforming it in order to store it in a data warehouse. Several design methodologies for ETL components have been presented in the last few years, with very little impact on commercial ETL tools. Basically, this was due to an existing gap between the conceptual design of an ETL system and its corresponding physical implementation. The methodologies proposed ranged from new approaches, with novel notation and diagrams, to the adoption and expansion of current standard modeling notations, like UML or BPMN. However, none of these proposals contains enough detail to be translated automatically into a specific execution platform.
The use of a standard, well-known notation like Relational Algebra might bridge the gap between the conceptual design and the physical design of an ETL component, mainly due to its formal approach, which is based on a limited set of operators, and also due to its functional characteristics, such as being a procedural language operating over data stored in relational format. The abstraction that Relational Algebra provides over the technological infrastructure might also be an advantage for uncommon execution platforms, like computing grids, which provide an exceptional amount of processing power that is critical for ETL systems. Additionally, partitioning data and distributing tasks over computing nodes works quite well with a Relational Algebra approach. Extensive research on the use of Relational Algebra in the ETL context was conducted to validate its usage. To complement this, a set of Relational Algebra patterns was developed to support the most common ETL tasks, like changing data capture, data quality enforcement, data conciliation and integration, slowly changing dimensions and surrogate key pipelining. All these patterns provide a formal approach to these ETL tasks by specifying the operations needed to accomplish them as a series of Relational Algebra operations. To evaluate the feasibility of the work done in this thesis, we used a real ETL application scenario: extracting data from the operational systems of two different social networks and storing hashtag usage information in a specific data mart. The ability to analyze trends in social network usage is a hot topic in today's media and information coverage. A complete design of the ETL component using the patterns developed previously is also provided, as well as a critical evaluation of its usage.
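To make the idea of expressing an ETL task as a series of Relational Algebra operations concrete, here is a minimal sketch of a change-capture step built only from projection, semijoin and set difference. It is an assumption-laden illustration, not one of the thesis's patterns: the helpers `row`, `project`, `semijoin` and `capture_changes`, and the sample data, are all hypothetical. Relations are modelled as Python sets of rows, and each row is a frozenset of (attribute, value) pairs.

```python
def row(**attrs):
    """Build a hashable row from attribute=value pairs."""
    return frozenset(attrs.items())


def project(relation, attrs):
    """Projection: keep only the named attributes, dropping duplicate rows."""
    return {frozenset((a, v) for a, v in r if a in attrs) for r in relation}


def semijoin(left, right, attrs):
    """Semijoin: rows of left whose attrs values also occur in right."""
    keys = project(right, attrs)
    return {r for r in left
            if frozenset((a, v) for a, v in r if a in attrs) in keys}


def capture_changes(previous, current, key):
    """Compare two extracts of a source table and classify the changes."""
    inserted = current - semijoin(current, previous, key)    # key is new
    deleted = previous - semijoin(previous, current, key)    # key disappeared
    updated = semijoin(current - previous, previous, key)    # key kept, row differs
    return inserted, deleted, updated


previous = {row(id=1, tag="#etl"), row(id=2, tag="#dw"), row(id=3, tag="#bi")}
current = {row(id=1, tag="#etl"), row(id=2, tag="#datawarehouse"), row(id=4, tag="#grid")}
inserted, deleted, updated = capture_changes(previous, current, {"id"})
```

Because every step is a set-level operator, the same specification could in principle be partitioned by key and evaluated on separate nodes, which matches the abstract's remark about task distribution.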

    Data virtualization design model for near real time decision making in business intelligence environment

    The main purpose of Business Intelligence (BI) is to support an organization's strategic, operational and tactical decisions by providing comprehensive, accurate and vivid data to decision makers. A data warehouse (DW), which is considered the input for decision-making system activities, is created through a complex process known as Extract, Transform and Load (ETL). ETL operates at pre-defined times and requires time to process and transfer data. However, providing near real-time information to facilitate data integration in support of the decision-making process is a known issue. The inaccessibility of near real-time information could be overcome with Data Virtualization (DV), as it provides a unified, abstracted, near real-time and encapsulated view of information for querying. Nevertheless, there is currently a lack of studies on BI models for developing and managing data in a virtual manner that can fulfil an organization's needs. Therefore, the main aim of this study is to propose a DV model for near real-time decision making in a BI environment. Design science research methodology was adopted to accomplish the research objectives. As a result of this study, a model called the Data Virtualization Development Model (DVDeM) is proposed that addresses the phases and components which affect the BI environment. To validate the model, expert reviews and focus group discussions were conducted. A prototype based on the proposed model was also developed and then implemented in two case studies. In addition, an instrument was developed to measure the usability of the prototype in providing near real-time data. In total, 60 participants were involved, and the findings indicated that 93% of the participants agreed that the DVDeM-based prototype was able to provide near real-time data for supporting the decision-making process.
The findings also showed that the majority of the participants (more than 90%) in both the education and business sectors affirmed the workability of the DVDeM and the usability of the prototype, in particular its ability to deliver near real-time decision-making data. The findings also indicate theoretical and practical contributions for developers building efficient BI applications using the DV technique. The mean value for each measurement item is greater than 4, indicating that the respondents agreed with the statement for each item. Meanwhile, the mean scores for the overall usability attributes of the DVDeM design model fall under "High" or "Fairly High". The results therefore give sufficient indication that, when the DVDeM model is adopted in developing a system, the usability of the resulting system is perceived by the majority of respondents as high, and the system is able to support near real-time decision-making data.

    A data management and analytic model for business intelligence applications

    Most organisations use several data management and business intelligence solutions, which are on-premise and/or cloud-based, to manage and analyse their constantly growing business data. Challenges faced by organisations nowadays include, but are not limited to, growth limitations, big data, inadequate analytics, computing, and data storage capabilities. Although these organisations are able to generate reports and dashboards for decision-making in most cases, effective use of their business data and an appropriate business intelligence solution could achieve and retain informed decision-making and allow competitive reaction to the dynamic external environment. A data management and analytic model has been proposed on which organisations could rely for decisive guidance when planning to procure and implement a unified business intelligence solution. To achieve a sound model, literature was reviewed by extensively studying business intelligence in general, and by exploring and developing various deployment models and architectures (naïve, on-premise, and cloud-based), which revealed their benefits and challenges. The outcome of the literature review was the development of a hybrid business intelligence model and the accompanying architecture as the main contribution of the study. In order to assess the state of business intelligence utilisation, and to validate and improve the proposed architecture, two case studies targeting users and experts were conducted using quantitative and qualitative approaches. The case studies established that a decision to procure and implement a successful business intelligence solution is based on a number of crucial elements, such as applications, devices, tools, business intelligence services, data management and infrastructure. The findings further recognised that the proposed hybrid architecture is the solution for managing complex organisations with serious data challenges.
    Computing. M. Sc. (Computing)

    Just-in-time Analytics Over Heterogeneous Data and Hardware

    Industry and academia are continuously becoming more data-driven and data-intensive, relying on the analysis of a wide variety of datasets to gain insights. At the same time, data variety increases continuously across multiple axes. First, data comes in multiple formats, such as the binary tabular data of a DBMS, raw textual files, and domain-specific formats. Second, different datasets follow different data models, such as the relational and the hierarchical one. Data location also varies: Some datasets reside in a central "data lake", whereas others lie in remote data sources. In addition, users execute widely different analysis tasks over all these data types. Finally, the process of gathering and integrating diverse datasets introduces several inconsistencies and redundancies in the data, such as duplicate entries for the same real-world concept. In summary, heterogeneity significantly affects the way data analysis is performed. In this thesis, we aim for data virtualization: Abstracting data out of its original form and manipulating it regardless of the way it is stored or structured, without a performance penalty. To achieve data virtualization, we design and implement systems that i) mask heterogeneity through the use of heterogeneity-aware, high-level building blocks and ii) offer fast responses through on-demand adaptation techniques. Regarding the high-level building blocks, we use a query language and algebra to handle multiple collection types, such as relations and hierarchies, express transformations between these collection types, as well as express complex data cleaning tasks over them. In addition, we design a location-aware compiler and optimizer that masks away the complexity of accessing multiple remote data sources. Regarding on-demand adaptation, we present a design to produce a new system per query. 
The design uses customization mechanisms that trigger runtime code generation to mimic the system most appropriate to answer a query fast: Query operators are thus created based on the query workload and the underlying data models; the data access layer is created based on the underlying data formats. In addition, we exploit emerging hardware by customizing the system implementation based on the available heterogeneous processors (CPUs and GPGPUs). We thus pair each workload with its ideal processor type. The end result is a just-in-time database system that is specific to the query, data, workload, and hardware instance. This thesis redesigns the data management stack to natively cater for data heterogeneity and exploit hardware heterogeneity. Instead of centralizing all relevant datasets, converting them to a single representation, and loading them in a monolithic, static, suboptimal system, our design embraces heterogeneity. Overall, our design decouples the type of performed analysis from the original data layout; users can perform their analysis across data stores, data models, and data formats, but at the same time experience the performance offered by a custom system that has been built on demand to serve their specific use case

    Pragmatic development of service based real-time change data capture

    This thesis makes a contribution to the Change Data Capture (CDC) field by providing an empirical evaluation of the performance of CDC architectures in the context of real-time data warehousing. CDC is a mechanism for providing data warehouse architectures with fresh data from Online Transaction Processing (OLTP) databases. There are two types of CDC architectures: pull architectures and push architectures. There is exiguous data on the performance of CDC architectures in a real-time environment. Performance data is required to determine the real-time viability of the two architectures. We propose that push CDC architectures are optimal for real-time CDC. However, push CDC architectures are seldom implemented because they are highly intrusive towards existing systems and arduous to maintain. As part of our contribution, we pragmatically develop a service-based push CDC solution, which addresses the issues of intrusiveness and maintainability. Our solution uses Data Access Services (DAS) to decouple CDC logic from the applications. A requirement for the DAS is to place minimal overhead on a transaction in an OLTP environment. We synthesize DAS literature and pragmatically develop DAS that efficiently execute transactions in an OLTP environment. Essentially, we develop efficient RESTful DAS, which expose Transactions As A Resource (TAAR). We evaluate the TAAR solution and three pull CDC mechanisms in a real-time environment, using the industry-recognised TPC-C benchmark. The optimal CDC mechanism in a real-time environment will capture change data with minimal latency and will have a negligible effect on the database's transactional throughput. Capture latency is the time it takes a CDC mechanism to capture a data change that has been applied to an OLTP database. A standard definition of capture latency, and of how to measure it, does not exist in the field. We create this definition and extend the TPC-C benchmark to make the capture latency measurement.
The results from our evaluation show that pull CDC is capable of real-time CDC at low levels of user concurrency. However, as the level of user concurrency scales upwards, pull CDC has a significant impact on the database's transaction rate, which affirms the theory that pull CDC architectures are not viable in a real-time architecture. TAAR CDC, on the other hand, is capable of real-time CDC and places a minimal overhead on the transaction rate, although this performance comes at the expense of CPU resources.
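The notion of capture latency described in this abstract can be sketched with a small model. This is only an illustration under stated assumptions, not the thesis's formal definition, its TPC-C extension, or its measured results: the function names, the fixed polling interval, the fixed push overhead and all numbers are hypothetical. A pull mechanism is modelled as a poller that sees a change at the next polling tick, and a push mechanism as capturing each change a fixed overhead after commit.

```python
def pull_capture_time(commit_ms, poll_interval_ms):
    """Time of the first polling tick at or after the commit (integer ms)."""
    ticks = -(-commit_ms // poll_interval_ms)   # ceiling division
    return ticks * poll_interval_ms


def push_capture_time(commit_ms, overhead_ms):
    """Push model: the change is captured a fixed overhead after commit."""
    return commit_ms + overhead_ms


def capture_latencies(commit_times, capture_time):
    """Capture latency per change: capture time minus commit time."""
    return [capture_time(t) - t for t in commit_times]


commits = [100, 1250, 1900, 3000]   # made-up commit times in milliseconds

pull = capture_latencies(commits, lambda t: pull_capture_time(t, 1000))
push = capture_latencies(commits, lambda t: push_capture_time(t, 5))
```

In this toy model the pull latencies depend on where each commit falls relative to the polling schedule, while the push latencies stay constant. The model says nothing about the load each approach places on the database, which is the other half of the trade-off the abstract evaluates.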