Performance Optimizations and Operator Semantics for Streaming Data Flow Programs
Modern companies collect more data than ever before and need insights from it faster. Relational databases do not meet the requirements for processing these often unstructured data sets with reasonable performance. The database research community began to address these trends in the early 2000s. Two research directions have attracted major interest since then: large-scale non-relational data processing and low-latency data stream processing.
Large-scale non-relational data processing, commonly known as "Big Data" processing, was quickly adopted in industry. In parallel, low-latency data stream processing was driven mainly by the research community, which developed new systems that embrace a distributed architecture and scalability and that exploit data parallelism. While these systems have gained increasing attention in industry, operating them at large scale still poses major challenges.
The goal of this dissertation is two-fold: first, to investigate the runtime characteristics of large-scale, data-parallel, distributed streaming systems; and second, to propose the "Dual Streaming Model", which expresses the semantics of continuous queries over data streams and tables.
For the first part, our goal is to improve the understanding of system and query runtime behavior, with the aim of provisioning queries automatically. We introduce a cost model for streaming data flow programs that takes into account two techniques: record batching and data parallelization. Additionally, we introduce optimization algorithms that leverage our model for cost-based query provisioning.
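To make the idea of cost-based provisioning concrete, the following is a minimal sketch under stated assumptions; it is not the cost model from the dissertation. It assumes a linear chain of operators, a hypothetical per-batch processing time of `per_batch_us + batch_size * per_record_us` microseconds, and a fixed selectivity per operator. The `Operator` class, its fields, and the `provision` function are illustrative names only.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Operator:
    """Hypothetical operator description (all fields are assumptions)."""
    name: str
    per_record_us: int        # processing cost per record, in microseconds
    per_batch_us: int         # fixed overhead per batch, in microseconds
    batch_size: int           # records per batch
    selectivity_percent: int  # output records per 100 input records


def provision(pipeline: List[Operator], source_rate: int) -> Dict[str, int]:
    """Return the minimal parallelism per operator for a given source rate.

    source_rate is in records per second. Under the toy model, one instance
    processes batch_size records every (per_batch_us + batch_size *
    per_record_us) microseconds, so larger batches amortize the overhead.
    """
    plan: Dict[str, int] = {}
    rate = source_rate
    for op in pipeline:
        time_per_batch_us = op.per_batch_us + op.batch_size * op.per_record_us
        # Integer ceiling of rate / capacity_of_one_instance.
        numerator = rate * time_per_batch_us
        denominator = op.batch_size * 1_000_000
        plan[op.name] = max(1, -(-numerator // denominator))
        # The downstream operator sees the rate scaled by the selectivity.
        rate = rate * op.selectivity_percent // 100
    return plan


pipeline = [
    Operator("parse", per_record_us=500, per_batch_us=2000, batch_size=4,
             selectivity_percent=100),
    Operator("filter", per_record_us=200, per_batch_us=1000, batch_size=10,
             selectivity_percent=50),
    Operator("aggregate", per_record_us=1000, per_batch_us=4000, batch_size=4,
             selectivity_percent=100),
]
plan = provision(pipeline, source_rate=2000)
print(plan)
```

In this toy setting the plan depends on both knobs named in the abstract: raising an operator's batch size lowers the parallelism it needs, because the fixed per-batch overhead is spread over more records.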
The proposed Dual Streaming Model expresses the result of a streaming operator as a stream of successive updates to a result table, inducing a duality between streams and tables. Our model natively handles the inconsistency between the logical and the physical order of records within a data stream, which allows for deterministic semantics as well as low-latency query execution.
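The stream-table duality described above can be illustrated with a small sketch. This is a simplified illustration under stated assumptions, not the formal model from the dissertation: records are `(key, event_time)` pairs, the operator is a tumbling-window count, and the names `windowed_count` and `replay` are hypothetical. Each input record updates one row of the result table and emits that update to a changelog stream; records arriving out of order simply update an older row.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int]        # (key, event_time)
WindowKey = Tuple[str, int]     # (key, window_start)
Update = Tuple[WindowKey, int]  # (window key, new count)


def windowed_count(
    records: Iterable[Record], window_size: int
) -> Tuple[Dict[WindowKey, int], List[Update]]:
    """Count records per key and tumbling window.

    Returns the final result table and the changelog, i.e. the stream of
    successive updates that were applied to that table.
    """
    table: Dict[WindowKey, int] = defaultdict(int)
    changelog: List[Update] = []
    for key, event_time in records:
        # The window is chosen by event time, not by arrival order.
        window_start = (event_time // window_size) * window_size
        window_key = (key, window_start)
        table[window_key] += 1
        changelog.append((window_key, table[window_key]))
    return dict(table), changelog


def replay(changelog: Iterable[Update]) -> Dict[WindowKey, int]:
    """Rebuild the result table from its changelog stream."""
    table: Dict[WindowKey, int] = {}
    for window_key, count in changelog:
        table[window_key] = count
    return table


in_order = [("a", 1), ("a", 4), ("a", 12), ("b", 3)]
out_of_order = [("a", 12), ("b", 3), ("a", 4), ("a", 1)]

table_1, log_1 = windowed_count(in_order, window_size=10)
table_2, log_2 = windowed_count(out_of_order, window_size=10)
print(table_1 == table_2)          # same table regardless of arrival order
print(replay(log_2) == table_2)    # the changelog reconstructs the table
```

Because every record is applied immediately, no buffering or waiting for late data is needed in this sketch, and the final table is the same for both arrival orders.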
Healthcare-associated prosthetic heart valve, aortic vascular graft, and disseminated Mycobacterium chimaera infections subsequent to open heart surgery
Aims: We identified 10 patients with disseminated Mycobacterium chimaera infections subsequent to open-heart surgery at three European hospitals. Infections originated from the heater-cooler unit of the heart-lung machine. Here we describe the clinical aspects and treatment course of this novel clinical entity. Methods and results: Interdisciplinary care and follow-up of all patients were documented by the study team. Patients' characteristics, clinical manifestations, microbiological findings, and therapeutic measures including surgical reinterventions were reviewed, and treatment outcomes are described. The 10 patients comprise a 1-year-old child and nine adults with a median age of 61 years (range 36-76 years). The median duration from cardiac surgery to diagnosis was 21 (range 5-40) months. All patients had prosthetic material-associated infections with either prosthetic valve endocarditis, aortic graft infection, myocarditis, or infection of the prosthetic material following banding of the pulmonary artery. Extracardiac manifestations preceded cardiovascular disease in some cases. Despite targeted antimicrobial therapy, M. chimaera infection required cardiosurgical reinterventions in eight patients. Six out of 10 patients experienced breakthrough infections, of which four were fatal. Three patients are in a post-treatment monitoring period. Conclusion: Healthcare-associated infections due to M. chimaera occurred in patients subsequent to cardiac surgery with extracorporeal circulation and implantation of prosthetic material. Infections became clinically apparent after a time lag of months to years. Mycobacterium chimaera infections are easily missed by routine bacterial diagnostics, and outcome is poor despite long-term antimycobacterial therapy, probably because biofilm formation hinders eradication of the pathogen.
Translocator protein (18 kDa) (TSPO) marks mesenchymal glioblastoma cell populations characterized by elevated numbers of tumor-associated macrophages
TSPO is a promising novel tracer target for positron-emission tomography (PET) imaging of brain tumors. However, due to the heterogeneity of cell populations that contribute to the TSPO-PET signal, imaging interpretation may be challenging. We therefore evaluated TSPO enrichment/expression in connection with its underlying histopathological and molecular features in gliomas. We analyzed TSPO expression and its regulatory mechanisms in large in silico datasets and by performing direct bisulfite sequencing of the TSPO promoter. In glioblastoma tissue samples of our TSPO-PET imaging study cohort, we dissected the association of TSPO tracer enrichment and protein labeling with the expression of cell lineage markers by immunohistochemistry and fluorescence multiplex stains. Furthermore, we identified relevant TSPO-associated signaling pathways by RNA sequencing. We found that TSPO expression is associated with prognostically unfavorable glioma phenotypes and that TSPO promoter hypermethylation is linked to IDH mutation. Careful histological analysis revealed that TSPO immunohistochemistry correlates with the TSPO-PET signal and that TSPO is expressed by diverse cell populations. While tumor core areas are the major contributor to the overall TSPO signal, TSPO signals in the tumor rim are mainly driven by CD68-positive microglia/macrophages. Molecularly, high TSPO expression marks prognostically unfavorable glioblastoma cell subpopulations characterized by an enrichment of mesenchymal gene sets and higher amounts of tumor-associated macrophages. In conclusion, our study improves the understanding of TSPO as an imaging marker in gliomas by unveiling IDH-dependent differences in TSPO expression/regulation, regional heterogeneity of the TSPO-PET signal, and functional implications of TSPO in terms of tumor immune cell interactions.
Performance Optimization for Distributed Intra-Node-Parallel Streaming Systems
Abstract: The performance of intra-node parallel dataflow programs in the context of streaming systems depends mainly on two parameters: the degree of parallelism for each node of the dataflow program and the batching size for each node. In state-of-the-art systems the user has to specify these values manually, and manual tuning of both parameters is necessary to get good performance. However, this process is difficult and time consuming, even for experts. In this paper we introduce an optimization algorithm that optimizes both parameters automatically. We define a novel cost model for intra-node parallel dataflow programs with user-defined functions. Furthermore, we introduce different batching schemes to reduce the number of output buffers, i.e., main memory consumption. We implemented our approach on top of the open source system Storm and ran experiments with different workloads. Our results show a throughput improvement of more than one order of magnitude, while the optimization time is less than a second.
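As a rough illustration of tuning both parameters together, here is a minimal sketch under stated assumptions. It is not the paper's optimization algorithm or cost model, and it does not use any Storm API. It assumes a single dataflow node whose throughput follows a hypothetical formula in which each batch costs a fixed overhead plus a per-record cost, and it searches a small grid exhaustively. The function names, the grid bounds, and the tie-breaking rule (fewest parallel instances first, then smallest batch) are all assumptions made for the example.

```python
from itertools import product
from typing import Optional, Sequence, Tuple


def throughput(parallelism: int, batch_size: int,
               per_record_us: int, per_batch_us: int) -> float:
    """Records per second for one node under the toy cost formula."""
    time_per_batch_us = per_batch_us + batch_size * per_record_us
    return parallelism * batch_size * 1_000_000 / time_per_batch_us


def tune(target_rate: int, per_record_us: int, per_batch_us: int,
         max_parallelism: int = 8,
         batch_sizes: Sequence[int] = (1, 2, 4, 8, 16),
         ) -> Optional[Tuple[int, int]]:
    """Pick (parallelism, batch_size) that sustains target_rate.

    Prefers the fewest parallel instances, then the smallest batch size.
    Returns None if no configuration on the grid reaches the target.
    """
    feasible = [
        (p, b)
        for p, b in product(range(1, max_parallelism + 1), batch_sizes)
        if throughput(p, b, per_record_us, per_batch_us) >= target_rate
    ]
    return min(feasible) if feasible else None


best = tune(target_rate=1000, per_record_us=1000, per_batch_us=4000)
print(best)
```

With these example numbers a single instance cannot reach the target at any batch size on the grid, and two instances can only reach it once batching amortizes enough of the per-batch overhead, which is why the two parameters have to be chosen jointly.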