    Quality measures for ETL processes: from goals to implementation

    Extraction-transformation-loading (ETL) processes play an increasingly important role in supporting modern business operations. These business processes are centred around artifacts with high variability and diverse lifecycles, which correspond to key business entities. The apparent complexity of these activities has been examined through the prism of business process management, mainly focusing on functional requirements and performance optimization. However, the quality dimension has not yet been thoroughly investigated, and there is a need for a more human-centric approach that brings these processes closer to business users' requirements. In this paper, we take a first step in this direction by defining a sound model for ETL process quality characteristics and quantitative measures for each characteristic, based on existing literature. Our model shows dependencies among quality characteristics and can provide the basis for subsequent analysis using goal modeling techniques. We showcase the use of goal modeling for ETL process design through a use case, where we employ a goal model that includes quantitative components (i.e., indicators) for evaluation and analysis of alternative design decisions. Peer reviewed. Postprint (author's final draft).

    Standard-compliant snapshotting for SystemC virtual platforms

    The steady increase in complexity of high-end embedded systems goes along with an increasingly complex design process. We are currently still in a transition phase from Hardware-Description Language (HDL) based design towards virtual-platform-based design of embedded systems. As design complexity rises faster than developer productivity, a gap forms. Restoring productivity while at the same time managing increased design complexity can also be achieved by focussing on the development of new tools and design methodologies. In most application areas, high-level modelling languages such as SystemC are used in early design phases. In modern software development, Continuous Integration (CI) is used to automatically test whether a submitted piece of code breaks functionality. Applying the CI concept to embedded system design and testing requires fast build and test execution times from the virtual platform framework. For this use case, the ability to save a specific state of a virtual platform becomes necessary. Saving and restoring specific states of a simulation requires the ability to serialize all data structures within the simulation models. Improving the frameworks and establishing better methods will only help to narrow the design gap if these changes are introduced with the needs of the engineers and developers in mind. Ultimately, it is their productivity that shall be improved. The ability to save the state of a virtual platform enables developers to run longer test campaigns that can even contain randomized test stimuli. If the saved states are modifiable, the developers can inject faulty states into the simulation models. This work contributes an extension to the SoCRocket virtual platform framework to enable snapshotting. The snapshotting extension can be considered a reference implementation, as its use of current SystemC/TLM standards makes it compatible with other frameworks. 
Furthermore, integrating the UVM SystemC library into the framework enables test-driven development and fast validation of SystemC/TLM models using snapshots. These extensions narrow the design gap by helping designers, testers and developers to work more efficiently.
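
    The save-and-restore idea described in this abstract can be illustrated with a minimal sketch. It is written in Python rather than SystemC and does not use the SoCRocket framework; the class `ToyModel` and the helpers `save_snapshot`, `restore_snapshot` and `run` are hypothetical names introduced only for illustration. The point it shows is that a snapshot must serialize all model state, including the state of the random stimulus generator, so that a restored run replays identically.

```python
import pickle
import random


class ToyModel:
    """Hypothetical stand-in for a simulation model; not a SystemC/TLM class."""

    def __init__(self, seed: int) -> None:
        self.time = 0                    # simulated time in ticks
        self.register = 0                # some model-internal state
        self.rng = random.Random(seed)   # source of randomized test stimuli

    def step(self) -> None:
        """Advance the simulation by one tick with a random stimulus."""
        self.time += 1
        self.register = (self.register + self.rng.randint(0, 255)) % 65536


def save_snapshot(model: ToyModel) -> bytes:
    """Serialize the complete model state, including the RNG state."""
    return pickle.dumps(model)


def restore_snapshot(blob: bytes) -> ToyModel:
    """Rebuild an independent model instance from a snapshot."""
    return pickle.loads(blob)


def run(model: ToyModel, ticks: int) -> ToyModel:
    """Advance the model by the given number of ticks and return it."""
    for _ in range(ticks):
        model.step()
    return model


# Run 100 ticks, take a snapshot, then continue for 50 more ticks.
original = run(ToyModel(seed=42), 100)
snapshot = save_snapshot(original)
run(original, 50)

# Restoring the snapshot and replaying 50 ticks reproduces the same state.
restored = run(restore_snapshot(snapshot), 50)

# A modifiable snapshot also allows injecting a faulty state before resuming.
faulty = restore_snapshot(snapshot)
faulty.register ^= 0x1  # flip one bit of the saved register value
```

    In a CI setting, a long boot or initialization phase would be simulated once, and each test would start from the stored snapshot instead of from tick zero.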

    Performance Evaluation of Structured and Unstructured Data in PIG/HADOOP and MONGO-DB Environments

    The exponential growth of data initially posed difficulties for prominent organizations such as Google, Yahoo, Amazon, Microsoft, Facebook and Twitter. The volume of information that cloud applications must handle is growing significantly faster than storage capacity. This growth requires new systems for managing and analyzing data. The term Big Data refers to large volumes of unstructured (or semi-structured) and structured data created by different applications, messages, weblogs, and social networks. Big Data is data whose size, variety and uncertainty require new models, procedures, algorithms, and research to manage it and to extract value and hidden knowledge from it. To process more information efficiently, analysis is parallelized. NoSQL databases were introduced to deal with unstructured and semi-structured information. Hadoop is well suited to Big Data analysis requirements. It is designed to scale from a single server to a large cluster of machines and offers a high level of fault tolerance. Many businesses and research institutes, such as Facebook, Yahoo and Google, had a growing need to import, store, and analyze dynamic semi-structured data and its metadata. In addition, the significant growth of semi-structured data inside large web-based organizations has prompted the creation of NoSQL data stores for flexible sorting and of MapReduce for scalable parallel analysis. They assessed, used and modified Hadoop, the most popular open-source implementation of MapReduce, to address the needs of various analytics problems. These institutes also use MongoDB, a document-oriented NoSQL store. However, there is limited understanding of the performance trade-offs of using these two technologies. 
This paper assesses the performance, scalability, and fault tolerance of using MongoDB and Hadoop, with the objective of identifying the right software environment for scientific data analytics and research. Recently, a growing number of organizations have developed different kinds of non-relational databases (such as MongoDB, Cassandra, Hypertable, HBase/Hadoop and CouchDB), generally referred to as NoSQL databases. The enormous amount of information generated requires an effective system to analyze the data in various scenarios and under various limits. In this paper, the objective is to find the break-even point of Hadoop/Pig and MongoDB and to develop a robust environment for data analytics.

    Resilience for large ensemble computations

    With the increasing power of supercomputers, ever more detailed models of physical systems can be simulated, and ever larger problem sizes can be considered for any kind of numerical system. During the last twenty years the performance of the fastest clusters went from the teraFLOPS domain (ASCI RED: 2.3 teraFLOPS) to the pre-exaFLOPS domain (Fugaku: 442 petaFLOPS), and we will soon have the first supercomputer with a peak performance exceeding one exaFLOPS (El Capitan: 1.5 exaFLOPS). Ensemble techniques are experiencing a renaissance with the availability of those extreme scales. Recent techniques in particular, such as particle filters, will benefit from it. Current ensemble methods in climate science, such as ensemble Kalman filters, exhibit a linear dependency between the problem size and the ensemble size, while particle filters show an exponential dependency. Nevertheless, with the prospect of massive computing power come challenges such as power consumption and fault tolerance. The mean time between failures shrinks with the number of components in the system, and failures are expected every few hours at exascale. In this thesis, we explore and develop techniques to protect large ensemble computations from failures. We present novel approaches in differential checkpointing, elastic recovery, fully asynchronous checkpointing, and checkpoint compression. Furthermore, we design and implement a fault-tolerant particle filter with pre-emptive particle prefetching and caching. Finally, we design and implement a framework for the automatic validation and application of lossy compression in ensemble data assimilation. Altogether, we present five contributions in this thesis, where the first two improve state-of-the-art checkpointing techniques, and the last three address the resilience of ensemble computations. The contributions represent stand-alone fault-tolerance techniques; however, they can also be used to improve the properties of each other. 
For instance, we utilize elastic recovery (2nd contribution) to provide resilience in an online ensemble data assimilation framework (3rd contribution), and we build our validation framework (5th contribution) on top of our particle filter implementation (4th contribution). We further demonstrate that our contributions improve resilience and performance with experiments on various architectures such as Intel, IBM, and ARM processors. Postprint (published version).
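
    Differential checkpointing, one of the techniques named in this abstract, can be sketched in a few lines. The sketch below is a generic illustration and not the thesis implementation: the block size, the use of SHA-256 digests and the function names `block_hashes`, `diff_checkpoint` and `apply_delta` are all assumptions made for this example. It also assumes the state never shrinks between checkpoints. The idea shown is that only blocks whose content changed since the previous checkpoint are written.

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 4  # bytes per block; deliberately tiny, a hypothetical value


def split_blocks(data: bytes) -> List[bytes]:
    """Cut the state into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def block_hashes(data: bytes) -> List[str]:
    """Digest of every block, kept in memory between checkpoints."""
    return [hashlib.sha256(block).hexdigest() for block in split_blocks(data)]


def diff_checkpoint(data: bytes, prev_hashes: List[str]) -> Dict[int, bytes]:
    """Return only the blocks that differ from the previous checkpoint."""
    delta = {}
    for index, block in enumerate(split_blocks(data)):
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(prev_hashes) or prev_hashes[index] != digest:
            delta[index] = block
    return delta


def apply_delta(base: bytes, delta: Dict[int, bytes]) -> bytes:
    """Rebuild the newer state from the full base checkpoint plus a delta."""
    blocks = split_blocks(base)
    for index, block in sorted(delta.items()):
        if index < len(blocks):
            blocks[index] = block
        else:
            blocks.append(block)
    return b"".join(blocks)


# First checkpoint is written in full; the second stores one changed block.
base_state = b"AAAABBBBCCCCDDDD"
new_state = b"AAAAXXXXCCCCDDDD"
delta = diff_checkpoint(new_state, block_hashes(base_state))
recovered = apply_delta(base_state, delta)
```

    Here the second checkpoint stores a single 4-byte block instead of all 16 bytes; the saving grows with the fraction of the state that stays unchanged between checkpoints.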

    Optimising Fault Tolerance in Real-time Cloud Computing IaaS Environment

    Fault tolerance is the ability of a system to respond swiftly to an unexpected failure. Failures in a cloud computing environment are normal rather than exceptional, but fault detection and system recovery in a real-time cloud system is a crucial issue. To deal with this problem and to minimize the risk of failure, an optimal fault tolerance mechanism was introduced in which fault tolerance is achieved using the combination of the Cloud Master, Compute nodes, Cloud load balancer, Selection mechanism and Cloud Fault handler. In this paper, we propose an optimized fault tolerance approach in which a model is designed to tolerate faults based on the reliability of each compute node (virtual machine), and a node can be replaced if its performance is not optimal. Preliminary tests of our algorithm indicate that the increase in pass rate exceeds the decrease in failure rate; the algorithm also considers forward and backward recovery using diverse software tools. Our results are demonstrated through experimental validation, laying a foundation for a fully fault-tolerant IaaS cloud environment, and suggest good performance of our model compared with existing approaches. Petroleum Technology Development Fund (PTDF).
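
    The reliability-based selection described in this abstract can be sketched as follows. The abstract does not give the update rule, so everything concrete here is an assumption for illustration: the additive reward and penalty, the threshold value, and the names `ComputeNode`, `record_result` and `select_node`. The sketch only shows the general pattern of scoring each virtual machine by its pass/fail history and excluding nodes whose score falls too low.

```python
from dataclasses import dataclass
from typing import List, Optional

MIN_RELIABILITY = 0.5   # hypothetical cut-off below which a node is replaced
MAX_RELIABILITY = 1.0   # hypothetical upper bound of the score
REWARD = 0.05           # hypothetical score gain for a passed task
PENALTY = 0.20          # hypothetical score loss for a failed task


@dataclass
class ComputeNode:
    """A virtual machine together with its current reliability score."""
    name: str
    reliability: float = MAX_RELIABILITY


def record_result(node: ComputeNode, passed: bool) -> None:
    """Raise the score after a pass, lower it after a failure."""
    if passed:
        node.reliability = min(MAX_RELIABILITY, node.reliability + REWARD)
    else:
        node.reliability = max(0.0, node.reliability - PENALTY)


def select_node(nodes: List[ComputeNode]) -> Optional[ComputeNode]:
    """Pick the most reliable node among those above the threshold."""
    candidates = [n for n in nodes if n.reliability >= MIN_RELIABILITY]
    if not candidates:
        return None
    return max(candidates, key=lambda n: n.reliability)


vm1, vm2, vm3 = ComputeNode("vm1"), ComputeNode("vm2"), ComputeNode("vm3")
for _ in range(3):
    record_result(vm1, passed=False)   # vm1 fails repeatedly
record_result(vm2, passed=False)       # vm2 fails once
record_result(vm3, passed=True)        # vm3 passes

chosen = select_node([vm1, vm2, vm3])
```

    With these assumed constants, `vm1` drops below the threshold after three failures and is no longer eligible, while `vm3` is selected.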

    Optimizations for Energy-Aware, High-Performance and Reliable Distributed Storage Systems

    With the decreasing cost and widespread use of commodity hard drives, it has become possible to create very large-scale storage systems at lower expense. However, as we approach exabyte-scale storage systems, maintaining important features such as energy efficiency, performance, reliability and usability becomes increasingly difficult. Despite the decreasing cost of storage systems, the energy consumption of these systems still needs to be addressed in order to retain cost-effectiveness. Any improvements in a storage system can be outweighed by high energy costs. On the other hand, large-scale storage systems can benefit from object storage features for improved performance and usability. One area of concern is the metadata performance bottleneck of applications that read large directories or create a large number of files. Similarly, computation on big data, where data needs to be transferred between compute and storage clusters, adversely affects I/O performance. As storage systems become larger and more complex, transferring data between remote compute and storage tiers becomes impractical. Furthermore, storage systems typically implement reliability at the file system or client level. This approach might not always be practical in terms of performance. Lastly, object storage features are usually tailored to specific use cases, which makes them harder to use in other contexts. In this thesis, we present several approaches to enhance the energy efficiency, performance, reliability and usability of large-scale storage systems. To begin with, we improve the energy efficiency of storage systems by moving I/O load to a subset of the storage nodes with energy-aware node allocation methods and turning off the unused nodes, while preserving load balance on demand. To address the metadata performance issue associated with large creates and directory reads, we represent directories with object storage collections and implement lazy creation of objects. 
Similarly, in-situ computation on large-scale data is enabled by using object storage features to integrate a computational framework with the existing object storage layer, which eliminates the need to transfer data between compute and storage silos and thereby improves performance. We then present parity-based redundancy using object storage features to achieve reliability with less performance impact. Finally, unified storage brings together the object storage features to meet the needs of distinct use cases, such as cloud storage, big data or high-performance computing, to alleviate the unnecessary fragmentation of storage resources. We evaluate each proposed approach thoroughly and validate its effectiveness in improving the energy efficiency, performance, reliability and usability of a large-scale storage system.
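
    The parity-based redundancy mentioned in this abstract can be illustrated with a minimal single-parity sketch. The thesis does not specify its scheme here, so this example assumes plain XOR parity over equal-length chunks, in the style of RAID 5; the function names `xor_bytes`, `compute_parity` and `recover_missing` are hypothetical. It shows that one lost chunk can be rebuilt from the surviving chunks plus the parity chunk.

```python
from functools import reduce
from typing import List


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    if len(a) != len(b):
        raise ValueError("chunks must have equal length")
    return bytes(x ^ y for x, y in zip(a, b))


def compute_parity(chunks: List[bytes]) -> bytes:
    """Parity chunk: the XOR of all data chunks."""
    return reduce(xor_bytes, chunks)


def recover_missing(surviving: List[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing chunk from the survivors and the parity."""
    return reduce(xor_bytes, surviving, parity)


# Three data chunks of equal length, as if stored on three storage nodes.
chunks = [b"obj-data-1", b"obj-data-2", b"obj-data-3"]
parity = compute_parity(chunks)

# Simulate losing the second chunk and rebuilding it.
rebuilt = recover_missing([chunks[0], chunks[2]], parity)
```

    Single parity tolerates the loss of one chunk per group at the storage cost of one extra chunk, which is why it is cheaper than full replication.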
