363 research outputs found
TANDEM: taming failures in next-generation datacenters with emerging memory
The explosive growth of online services to unforeseen scales has made modern datacenters highly prone to failures. Taming these failures hinges on fast and correct recovery that minimizes service interruptions.
To be recoverable, applications must take additional measures during failure-free execution to maintain a recoverable state of their data and computation logic. However, these precautionary measures carry
severe implications for performance, correctness, and programmability, making recovery incredibly challenging to realize in practice.
Emerging memory, particularly non-volatile memory (NVM) and disaggregated memory (DM), offers a promising opportunity to achieve fast recovery with maximum performance. However, incorporating these technologies into datacenter architecture presents significant challenges: their distinct architectural attributes, which differ significantly from those of traditional memory devices, introduce new semantic challenges for
implementing recovery, complicating correctness and programmability.
Can emerging memory enable fast, performant, and correct recovery in the datacenter? This thesis aims to answer this question while addressing the associated challenges.
When architecting datacenters with emerging memory, system architects face four key challenges: (1) how to guarantee correct semantics; (2) how to efficiently enforce correctness with optimal performance; (3) how to validate end-to-end correctness, including recovery; and (4) how to preserve programmer productivity (programmability).
This thesis aims to address these challenges through the following approaches: (a)
defining precise consistency models that formally specify correct end-to-end semantics
in the presence of failures (consistency models also play a crucial role in programmability); (b) developing new low-level mechanisms to efficiently enforce the prescribed models given the capabilities of emerging memory; and (c) creating robust testing frameworks to validate end-to-end correctness and recovery.
We start our exploration with non-volatile memory (NVM), which offers fast persistence capabilities directly accessible through the processor's load-store (memory) interface. Notably, these capabilities can be leveraged to enable fast recovery for Log-Free Data Structures (LFDs) while maximizing performance. However, due to the complexity of modern cache hierarchies, data rarely persist in any specific order, jeopardizing recovery and correctness. Recovery therefore needs primitives that explicitly control the order of updates to NVM (known as persistency models). We outline the precise specification of a novel persistency model, Release Persistency (RP), that provides a consistency guarantee for LFDs on what remains in non-volatile memory upon failure. To efficiently enforce RP, we propose a novel microarchitectural mechanism,
lazy release persistence (LRP). Using standard LFD benchmarks, we show that LRP achieves fast recovery while incurring minimal overhead on performance.
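To make the ordering problem concrete, the sketch below shows the kind of explicit persist-ordering discipline that persistency models govern. It is a generic x86 illustration in C (assuming CLWB support and a node allocated in NVM-mapped memory), not the thesis's RP or LRP mechanism:

    /* Generic x86 persist-ordering sketch (compile with -mclwb); assumes the
     * node lives in NVM-mapped memory. Not the thesis's RP/LRP mechanism. */
    #include <immintrin.h>   /* _mm_clwb, _mm_sfence */
    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct node { int value; struct node *next; } node_t;

    static void persist(const void *addr, size_t len) {
        /* Write back every cache line covering [addr, addr+len), then fence
         * so the write-backs are ordered before any later store. */
        for (uintptr_t p = (uintptr_t)addr & ~(uintptr_t)63;
             p < (uintptr_t)addr + len; p += 64)
            _mm_clwb((void *)p);
        _mm_sfence();
    }

    void push(_Atomic(node_t *) *head, node_t *n, int v) {
        n->value = v;
        n->next  = atomic_load(head);
        persist(n, sizeof *n);                     /* node durable first ...  */
        atomic_store(head, n);                     /* ... then published      */
        persist((const void *)head, sizeof *head); /* publication durable     */
    }

Without the first persist, a crash between the two stores could leave a durable pointer to a node whose contents never reached NVM, which is exactly the recovery hazard described above.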
We continue our discussion with memory disaggregation, which decouples memory from traditional monolithic servers and offers a promising pathway to very high availability in replicated in-memory data stores. Achieving such availability hinges on transaction protocols that can efficiently handle recovery in this setting, where
compute and memory are independent. However, there is a challenge: disaggregated memory (DM) cannot serve RPC-style protocols, mandating one-sided transaction protocols. Exacerbating the problem, one-sided transactions expose critical low-level
ordering to architects, posing a threat to correctness. We present a highly available transaction protocol, Pandora, that is specifically designed to achieve fast recovery in disaggregated key-value stores (DKVSes).
Pandora is the first one-sided transactional protocol that ensures correct, non-blocking, and fast recovery in a DKVS. Our experimental implementation demonstrates that Pandora achieves fast recovery and high availability while causing minimal disruption to services.
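To illustrate what "one-sided" means here, the sketch below implements a generic optimistic commit that touches the memory side only through read/CAS/write operations, with C11 atomics standing in for RDMA one-sided verbs. It is an illustrative seqlock-style scheme with an assumed even/odd version encoding, not Pandora's actual protocol:

    /* Illustrative one-sided optimistic commit (seqlock-style versioning,
     * even = unlocked, odd = locked). C11 atomics stand in for RDMA one-sided
     * READ/CAS/WRITE verbs; this is not Pandora's actual protocol. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    typedef struct {
        _Atomic uint64_t version;   /* even = consistent, odd = mid-update */
        uint64_t value;
    } slot_t;

    /* read_version is the (even) version observed by an earlier one-sided
     * read of the slot. */
    bool commit(slot_t *remote, uint64_t read_version, uint64_t new_value) {
        uint64_t expected = read_version;
        /* One-sided lock: CAS the version from what we read to a locked value.
         * Failure means another client committed in between -> abort/retry. */
        if (!atomic_compare_exchange_strong(&remote->version, &expected,
                                            read_version + 1))
            return false;
        remote->value = new_value;                     /* one-sided WRITE */
        /* Unlock and advance; a reader that sees the same even version before
         * and after its read knows it observed a consistent slot. */
        atomic_store(&remote->version, read_version + 2);
        return true;
    }

The protocol-level hazard the abstract alludes to is visible even in this toy: the payload write and the unlocking version write must reach memory in order, an ordering that one-sided operations expose directly to the protocol designer.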
Finally, we introduce a novel targeted litmus-testing framework, DART, to validate the end-to-end correctness of transactional protocols with recovery. Using DART's targeted testing capabilities, we have found several critical bugs in Pandora, highlighting the need for robust end-to-end testing methods in the design loop to iteratively fix correctness bugs. Crucially, DART is lightweight and black-box, thereby eliminating any intervention from programmers.
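For intuition, a litmus test is a tiny concurrent program run many times while a harness records which outcomes actually occur. The sketch below is a classic store-buffering litmus harness for memory models, far simpler than DART's transactional, recovery-aware tests, but it shows the same black-box run-and-observe structure:

    /* Generic store-buffering (SB) litmus harness: run the racy snippet many
     * times and histogram the outcomes. The weak outcome r0 == r1 == 0 can
     * show up on x86 because stores drain through the store buffer. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static _Atomic int x, y, r0, r1;

    static void *t0(void *arg) {
        (void)arg;
        atomic_store_explicit(&x, 1, memory_order_relaxed);
        atomic_store_explicit(&r0, atomic_load_explicit(&y, memory_order_relaxed),
                              memory_order_relaxed);
        return NULL;
    }

    static void *t1(void *arg) {
        (void)arg;
        atomic_store_explicit(&y, 1, memory_order_relaxed);
        atomic_store_explicit(&r1, atomic_load_explicit(&x, memory_order_relaxed),
                              memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        int hist[2][2] = {{0}};
        for (int i = 0; i < 100000; i++) {
            x = y = 0;
            pthread_t a, b;
            pthread_create(&a, NULL, t0, NULL);
            pthread_create(&b, NULL, t1, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            hist[r0][r1]++;
        }
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                printf("r0=%d r1=%d : %d\n", i, j, hist[i][j]);
        return 0;
    }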
Multi-objective scheduling for real-time data warehouses
The issue of write-read contention is one of the most prevalent problems when deploying real-time data warehouses. With increasing load, updates are increasingly delayed and previously fast queries tend to be slowed down considerably. However, depending on the user requirements, we can improve the response time or the data quality by scheduling the queries and updates appropriately. If both criteria are to be considered simultaneously, we are faced with a so-called multi-objective optimization problem. We transformed this problem into a knapsack problem with additional inequalities and solved it efficiently. Based on our solution, we developed a scheduling approach that provides the optimal schedule with regard to the user requirements at any given point in time. We evaluated our scheduling in an extensive experimental study, where we compared our approach with the respective optimal scheduling policies for each single optimization objective.
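As a rough illustration of the kind of reduction mentioned above (a minimal sketch with made-up numbers, not the paper's exact formulation), the two objectives can be scalarized into a single profit and the choice of updates to run within a scheduling window solved as a 0/1 knapsack:

    /* Illustrative only: scalarized update scheduling as a 0/1 knapsack.
     * Costs are update execution times, profits a weighted data-quality gain;
     * all numbers are made up. */
    #include <stdio.h>
    #include <string.h>

    #define BUDGET 100   /* scheduling window (time units) */
    #define N 4          /* candidate update jobs */

    int main(void) {
        int cost[N]   = {30, 20, 50, 40};
        int profit[N] = { 9,  5, 12,  8};
        int best[BUDGET + 1];
        memset(best, 0, sizeof best);
        /* Classic 0/1 knapsack DP over the remaining time budget. */
        for (int i = 0; i < N; i++)
            for (int t = BUDGET; t >= cost[i]; t--)
                if (best[t - cost[i]] + profit[i] > best[t])
                    best[t] = best[t - cost[i]] + profit[i];
        printf("max weighted quality gain within the window: %d\n", best[BUDGET]);
        return 0;
    }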
Anpassen verteilter eingebetteter Anwendungen im laufenden Betrieb (Adapting distributed embedded applications during operation)
The availability of third-party apps is among the key success factors for software ecosystems: The users benefit from more features and innovation speed, while third-party solution vendors can leverage the platform to create successful offerings.
However, this requires a degree of decoupling between the engineering activities of the different parties that has not yet been achieved for distributed control systems.
While late and dynamic integration of third-party components would be required, the resulting control systems must still provide high reliability with respect to real-time requirements, which leads to integration complexity.
Closing this gap would particularly contribute to the vision of software-defined manufacturing, where an ecosystem of modern IT-based control system components could lead to faster innovation due to their higher level of abstraction and the availability of various frameworks.
Therefore, this thesis addresses the research question:
How can we use modern IT technologies to enable independent evolution and easy third-party integration of software components in distributed control systems where deterministic end-to-end reactivity is required, and, in particular, how can we apply distributed changes to such systems consistently and reactively during operation?
This thesis describes the challenges and related approaches in detail and points out that existing approaches do not fully address our research question.
To tackle this gap, a formal specification of a runtime platform concept is presented in conjunction with a model-based engineering approach.
The engineering approach decouples the engineering steps of component definition, integration, and deployment.
The runtime platform supports this approach by isolating the components, while still offering predictable end-to-end real-time behavior.
Independent evolution of software components is supported through a concept for synchronous reconfiguration during full operation, i.e., dynamic orchestration of components.
Time-critical state transfer is supported as well and leads to at most a bounded quality degradation.
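As a rough illustration of cycle-synchronous reconfiguration (a minimal C sketch, not the thesis's real-time container platform), a new component version can be installed atomically and bound only at a cycle boundary, so every cycle runs exactly one consistent configuration:

    /* Minimal sketch of cycle-synchronous reconfiguration: an orchestrator
     * installs a new component version atomically, and the control loop binds
     * it only at a cycle boundary, never mid-cycle. Not the thesis's
     * real-time container platform. */
    #include <stdatomic.h>
    #include <stdio.h>
    #include <unistd.h>

    typedef int (*component_fn)(int input);

    static int controller_v1(int in) { return in + 1; }
    static int controller_v2(int in) { return in * 2; }   /* evolved version */

    static _Atomic component_fn active = controller_v1;

    void reconfigure(component_fn next) { atomic_store(&active, next); }

    int main(void) {
        for (int cycle = 0; cycle < 6; cycle++) {
            component_fn f = atomic_load(&active);  /* bind once per cycle */
            printf("cycle %d -> %d\n", cycle, f(cycle));
            if (cycle == 2)
                reconfigure(controller_v2);  /* takes effect next cycle */
            usleep(10000);  /* stand-in for the fixed real-time period */
        }
        return 0;
    }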
The reconfiguration planning is supported by analysis concepts, including simulation of a formally specified system and its reconfiguration, and analysis of potential quality degradation with the evolving dataflow graph (EDFG) method.
A platform-specific realization of the concepts, the real-time container architecture, is described as a reference implementation.
The feasibility and applicability of the concepts are evaluated with the model and the prototype in two case studies.
The first case study is a minimalistic distributed control system used in different setups with different component variants and reconfiguration plans to compare the model and the prototype and to gather runtime statistics.
The second case study is a smart factory showcase system with more challenging application components and interface technologies.
The conclusion is that the concepts are feasible and applicable, even though the concepts and the prototype still need to be worked on in the future -- for example, to reach shorter cycle times.
Data aggregation for multi-instance security management tools in telecommunication networks
Communication Service Providers employ multiple instances of network monitoring tools within extensive networks that span large geographical regions, encompassing entire countries. By collecting monitoring data from various nodes and consolidating it in a central location, a comprehensive control dashboard is established, presenting an overall network status categorized under different perspectives.
In order to achieve this centralized view, we evaluated three architectural options: polling data from individual nodes to a central node, asynchronous push of data from individual nodes to a central node, and a cloud-based Extract, Transform, Load (ETL) approach. Our analysis leads us to the conclusion that the third option is most suitable for the telecommunication system use case.
Remarkably, we observed that the quantity of monitoring results is approximately 30 times greater than the total number of devices monitored within the network. Implementing the ETL-based approach, we achieved favorable performance times of 2.23 seconds, 7.16 seconds, and 27.96 seconds for small, medium, and large networks, respectively. Notably, the extraction operation required the most significant amount of time, followed by the load and processing phases. Furthermore, in terms of average memory consumption, the small, medium, and large networks necessitated 323.59 MB, 497.34 MB, and 1668.59 MB, respectively. It is worth noting that the relationship between the total number of devices in the system and both performance and memory consumption is linear.
Edge and Big Data technologies for Industry 4.0 to create an integrated pre-sale and after-sale environment
The fourth industrial revolution, also known as Industry 4.0, has rapidly gained traction in businesses across Europe and the world, becoming a central theme in small, medium, and large enterprises alike. This new paradigm shifts the focus from locally based and barely automated firms to a globally interconnected industrial sector, stimulating economic growth and productivity and supporting the upskilling and reskilling of employees. However, despite the maturity and scalability of information and cloud technologies, the support systems already deployed in the field are often outdated and lack the necessary security, access control, and advanced communication capabilities.
This dissertation proposes architectures and technologies designed to bridge the gap between Operational and Information Technology, in a manner that is non-disruptive, efficient, and scalable. The proposal presents cloud-enabled data-gathering architectures that make use of the newest IT and networking technologies to achieve the desired quality of service and non-functional properties. By harnessing industrial and business data, processes can be optimized even before product sale, while the integrated environment enhances data exchange for post-sale support.
The architectures have been tested and have shown encouraging performance results, providing a promising solution for companies looking to embrace Industry 4.0, enhance their operational capabilities, and prepare themselves for the upcoming fifth, human-centric industrial revolution.
Rise of the Planet of Serverless Computing: A Systematic Review
Serverless computing is an emerging cloud computing paradigm that is being adopted to develop a wide range of software applications. It allows developers to focus on application logic at the granularity of functions, thereby freeing them from tedious and error-prone infrastructure management. Meanwhile, its unique characteristics pose new challenges to the development and deployment of serverless-based applications, and enormous research effort has been devoted to tackling them. This paper provides a comprehensive literature review that characterizes the current state of serverless computing research. Specifically, it covers 164 papers across 17 research directions, including performance optimization, programming frameworks, application migration, multi-cloud development, testing and debugging, and more. It also derives research trends, focus areas, and commonly used platforms for serverless computing, as well as promising research opportunities.
Technologies and Applications for Big Data Value
This open access book explores cutting-edge solutions and best practices for big data and data-driven AI applications for the data-driven economy. It provides the reader with a basis for understanding how technical issues can be overcome to offer real-world solutions to major industrial areas. The book starts with an introductory chapter that provides an overview of the book by positioning the following chapters in terms of their contributions to technology frameworks which are key elements of the Big Data Value Public-Private Partnership and the upcoming Partnership on AI, Data and Robotics. The remainder of the book is then arranged in two parts. The first part, "Technologies and Methods", contains horizontal contributions of technologies and methods that enable data value chains to be applied in any sector. The second part, "Processes and Applications", details experience reports and lessons from using big data and data-driven approaches in processes and applications. Its chapters are co-authored with industry experts and cover domains including health, law, finance, retail, manufacturing, mobility, and smart cities. Contributions emanate from the Big Data Value Public-Private Partnership and the Big Data Value Association, which have acted as the European data community's nucleus to bring together businesses with leading researchers to harness the value of data to benefit society, business, science, and industry. The book is of interest to two primary audiences: first, undergraduate and postgraduate students and researchers in various fields, including big data, data science, data engineering, and machine learning and AI; second, practitioners and industry experts engaged in data-driven systems and software design and deployment projects who are interested in employing these advanced methods to address real-world problems.
- …