1,499 research outputs found

    Compilation and Code Optimization for Data Analytics

    Get PDF
    The trade-offs between the use of modern high-level and low-level programming languages in constructing complex software artifacts are well known. High-level languages allow for greater programmer productivity: abstraction and genericity allow for the same functionality to be implemented with significantly less code compared to low-level languages. Modularity, object-orientation, functional programming, and powerful type systems allow programmers not only to create clean abstractions and protect them from leaking, but also to define code units that are reusable and easily composable, and software architectures that are adaptable and extensible. The abstraction, succinctness, and modularity of high-level code help to avoid software bugs and facilitate debugging and maintenance. The use of high-level languages comes at a performance cost: increased indirection due to abstraction, virtualization, and interpretation, and superfluous work, particularly in the form of tempory memory allocation and deallocation to support objects and encapsulation. As a result of this, the cost of high-level languages for performance-critical systems may seem prohibitive. The vision of abstraction without regret argues that it is possible to use high-level languages for building performance-critical systems that allow for both productivity and high performance, instead of trading off the former for the latter. In this thesis, we realize this vision for building different types of data analytics systems. Our means of achieving this is by employing compilation. The goal is to compile away expensive language features -- to compile high-level code down to efficient low-level code

    Compilation and Code Optimization for Data Analytics

    Get PDF
    The trade-offs between the use of modern high-level and low-level programming languages in constructing complex software artifacts are well known. High-level languages allow for greater programmer productivity: abstraction and genericity allow for the same functionality to be implemented with significantly less code compared to low-level languages. Modularity, object-orientation, functional programming, and powerful type systems allow programmers not only to create clean abstractions and protect them from leaking, but also to define code units that are reusable and easily composable, and software architectures that are adaptable and extensible. The abstraction, succinctness, and modularity of high-level code help to avoid software bugs and facilitate debugging and maintenance. The use of high-level languages comes at a performance cost: increased indirection due to abstraction, virtualization, and interpretation, and superfluous work, particularly in the form of tempory memory allocation and deallocation to support objects and encapsulation. As a result of this, the cost of high-level languages for performance-critical systems may seem prohibitive. The vision of abstraction without regret argues that it is possible to use high-level languages for building performance-critical systems that allow for both productivity and high performance, instead of trading off the former for the latter. In this thesis, we realize this vision for building different types of data analytics systems. Our means of achieving this is by employing compilation. The goal is to compile away expensive language features -- to compile high-level code down to efficient low-level code

    High Throughput Push Based Storage Manager

    Full text link
    The storage manager, as a key component of the database system, is responsible for organizing, reading, and delivering data to the execution engine for processing. According to the data serving mechanism, existing storage managers are either pull-based, incurring high latency, or push-based, leading to a high number of I/O requests when the CPU is busy. To improve these shortcomings, this thesis proposes a push-based prefetching strategy in a column-wise storage manager. The proposed strategy implements an efficient cache layer to store shared data among queries to reduce the number of I/O requests. The capacity of the cache is maintained by a time access-aware eviction mechanism. Our strategy enables the storage manager to coordinate multiple queries by merging their requests and dynamically generate an optimal read order that maximizes the overall I/O throughput. We evaluated our storage manager both over a disk-based redundant array of independent disks (RAID) and an NVM Express (NVMe) solid-state drive (SSD). With the high read performance of the SSD, we successfully minimized the total read time and number of I/O accesses

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Get PDF
    Based on the information provided by European projects and national initiatives related to multimedia search as well as domains experts that participated in the CHORUS Think-thanks and workshops, this document reports on the state of the art related to multimedia content search from, a technical, and socio-economic perspective. The technical perspective includes an up to date view on content based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark inititiatives to measure the performance of multimedia search engines. From a socio-economic perspective we inventorize the impact and legal consequences of these technical advances and point out future directions of research

    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    Get PDF
    After addressing the state-of-the-art during the first year of Chorus and establishing the existing landscape in multimedia search engines, we have identified and analyzed gaps within European research effort during our second year. In this period we focused on three directions, notably technological issues, user-centred issues and use-cases and socio- economic and legal aspects. These were assessed by two central studies: firstly, a concerted vision of functional breakdown of generic multimedia search engine, and secondly, a representative use-cases descriptions with the related discussion on requirement for technological challenges. Both studies have been carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations in international conferences, and surveys addressed to EU projects coordinators as well as National initiatives coordinators. Based on the obtained feedback we identified two types of gaps, namely core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges, but have impact on innovation progress. New socio-economic trends are presented as well as emerging legal challenges

    Continuously Providing Approximate Results under Limited Resources: Load Shedding and Spilling in XML Streams

    Get PDF
    Because of the high volume and unpredictable arrival rates, stream processing systems may not always be able to keep up with the input data streams, resulting in buffer overflow and uncontrolled loss of data. To continuously supply online results, two alternate solutions to tackle this problem of unpredictable failures of such overloaded systems can be identified. One technique, called load shedding, drops some fractions of data from the input stream to reduce the memory and CPU requirements of the workload. However, dropping some portions of the input data means that the accuracy of the output is reduced since some data is lost. To produce eventually complete results, the second technique, called data spilling, pushes some fractions of data to persistent storage temporarily when the processing speed cannot keep up with the arrival rate. The processing of the disk resident data is then postponed until a later time when system resources become available. This dissertation explores these load reduction technologies in the context of XML stream systems. Load shedding in the specific context of XML streams poses several unique opportunities and challenges. Since XML data is hierarchical, subelements, extracted from different positions of the XML tree structure, may vary in their importance. Further, dropping different subelements may vary in their savings of storage and computation. Hence, unlike prior work in the literature that drops data completely or not at all, in this dissertation we introduce the notion of structure-oriented load shedding, meaning selectively some XML subelements are shed from the possibly complex XML objects in the XML stream. First we develop a preference model that enables users to specify the relative importance of preserving different subelements within the XML result structure. This transforms shedding into the problem of rewriting the user query into shed queries that return approximate answers with their utility as measured by the user preference model. Our optimizer finds the appropriate shed queries to maximize the output utility driven by our structure-based preference model under the limitation of available computation resources. The experimental results demonstrate that our proposed XML-specific shedding solution consistently achieves higher utility results compared to the existing relational shedding techniques. Second, we introduces structure-based spilling, a spilling technique customized for XML streams by considering the spilling of partial substructures of possibly complex XML elements. Several new challenges caused by structure-based spilling are addressed. When a path is spilled, multiple other paths may be affected. We categorize varying types of spilling side effects on the query caused by spilling. How to execute the reduced query to produce the correct runtime output is also studied. Three optimization strategies are developed to select the reduced query that maximizes the output quality. We also examine the clean-up stage to guarantee that an entire result set is eventually generated by producing supplementary results to complement the partial results output earlier. The experimental study demonstrates that our proposed solutions consistently achieve higher quality results compared to the state-of-the-art techniques. Third, we design an integrated framework that combines both shedding and spilling policies into one comprehensive methodology. Decisions on the choice of whether to shed or spill data may be affected by the application needs and data arrival patterns. For some input data, it may be worth to flush it to disk if a delayed output of its result will be important, while other data would best directly dropped from the system given that a delayed delivery of these results would no longer be meaningful to the application. Therefore we need sophisticated technologies capable of deploying both shedding and spilling techniques within one integrated strategy with the ability to deliver the most appropriate decision customers need for each specific circumstance. We propose a novel flexible framework for structure-based shed and spill approaches, applicable in any XML stream system. We propose a solution space that represents all the shed and spill candidates. An age-based quality model is proposed for evaluating the output quality for different reduced query and supplementary query pairs. We also propose a family of four optimization strategies, OptF, OptSmart, HiX and Fex. OptF and OptSmart are both guaranteed to identify an optimal solution of reduced and supplementary query pair, with OptSmart exhibiting significantly less overhead than OptF. HiX and Fex use heuristic-based approaches that are much more efficient than OptF and OptSmart

    Spontananfragen auf Datenströmen

    Get PDF
    Many modern applications require processing large amounts of data in a real-time fashion. As a result, distributed stream processing engines (SPEs) have gained significant attention as an important new class of big data processing systems. The central design principle of these SPEs is to handle queries that potentially run forever on data streams with a query-at-a-time model, i.e., each query is optimized and executed separately. However, in many real applications, not only long-running queries but also many short-running queries are processed on data streams. In these applications, multiple stream queries are created and deleted concurrently, in an ad-hoc manner. The best practice to handle ad-hoc stream queries is to fork input stream and add additional resources for each query. However, this approach leads to redundant computation and data copy. This thesis lays the foundation for efficient ad-hoc stream query processing. To bridge the gap between stream data processing and ad-hoc query processing, we follow a top-down approach. First, we propose a benchmarking framework to analyze state-of-the-art SPEs. We provide a definition of latency and throughput for stateful operators. Moreover, we carefully separate the system under test and the driver, to correctly represent the open-world model of typical stream processing deployments. This separation enables us to measure the system performance under realistic conditions. Our solution is the first benchmarking framework to define and test the sustainable performance of SPEs. Throughout our analysis, we realize that the state-of-the-art SPEs are unable to execute stream queries in an ad-hoc manner. Second, we propose the first ad-hoc stream query processing engine for distributed data processing environments. We develop our solution based on three main requirements: (1) Integration: Ad-hoc query processing should be a composable layer that can extend stream operators, such as join, aggregation, and window operators; (2) Consistency: Ad-hoc query creation and deletion must be performed consistently and ensure exactly-once semantics and correctness; (3) Performance: In contrast to modern SPEs, ad-hoc SPEs should not only maximize data throughput but also query throughout via incremental computation and resource sharing. Third, we propose an ad-hoc stream join processing framework that integrates dynamic query processing and query re-optimization techniques with ad-hoc stream query processing. Our solution comprises an optimization layer and a stream data processing layer. The optimization layer periodically re-optimizes the query execution plan, performing join reordering and vertical and horizontal scaling at runtime without stopping the execution. The data processing layer enables incremental and consistent query processing, supporting all the actions triggered by the optimizer. The result of the second and the third contributions forms a complete ad-hoc SPE. We utilize the first contribution not only for benchmarking modern SPEs but also for evaluating the ad-hoc SPE.Eine Vielzahl moderner Anwendungen setzten die Echtzeitverarbeitung großer Datenmengen voraus. Aus diesem Grund haben neuerdings verteilte Systeme zur Verarbeitung von Datenströmen (sog. Datenstrom-Verarbeitungssysteme, abgek. "DSV") eine wichtige Bedeutung als neue Kategorie von Massendaten-Verarbeitungssystemen erlangt. Das zentrale Entwurfsprinzip dieser DSVs ist es, Anfragen, die potenziell unendlich lange auf einem Datenstrom laufen, jeweils Eine nach der Anderen zu verarbeiten (Englisch: "query-at-a-time model"). Das bedeutet, dass jede Anfrage eigenständig vom System optimiert und ausgeführt wird. Allerdings stellen vielen reale Anwendungen nicht nur lang laufende Anfragen auf Datenströmen, sondern auch kurz laufende Spontananfragen. Solche Anwendungen können mehrere Anfragen spontan und zeitgleich erstellen und entfernen. Das bewährte Verfahren, um Spontananfragen zu bearbeiten, zweigt den eingehenden Datenstrom ab und belegt zusätzliche Ressourcen für jede neue Anfrage. Allerdings ist dieses Verfahren ineffizient, weil Spontananfragen damit redundante Berechnungen und Daten-Kopieroperationen verursachen. In dieser Arbeit legen wir das Fundament für die effiziente Verarbeitung von Spontananfragen auf Datenströmen. Wir schließen in den folgenden drei Schritten die Lücke zwischen verteilter Datenstromanfrage-Verarbeitung und Spontananfrage-Verarbeitung. Erstens stellen wir ein Benchmark-Framework zur Analyse von modernen DSVs vor. In diesem Framework stellen wir eine neue Definition für die Latenz und den Durchsatz von zustandsbehafteten Operatoren vor. Zudem unterscheiden wir genau zwischen dem zu testenden System und dem Treibersystem, um das offene-Welt Modell, welches den typischen Anwendungsszenarien in der Datenstromverabeitung entspricht, korrekt zu repräsentieren. Diese strikte Unterscheidung ermöglicht es, die Systemleistung unter realen Bedingungen zu messen. Unsere Lösung ist damit das erste Benchmark-Framework, welches die dauerhaft durchhaltbare Systemleistung von DSVs definiert und testet. Durch eine systematische Analyse aktueller DSVs stellen wir fest, dass aktuelle DSVs außerstande sind, Spontananfragen effizient zu verarbeiten. Zweitens stellen wir das erste verteilte DSV zur Spontananfrageverarbeitung vor. Wir entwickeln unser Lösungskonzept basierend auf drei Hauptanforderungen: (1) Integration: Spontananfrageverarbeitung soll ein modularer Baustein sein, mit dem Datenstrom-Operatoren wie z.B. Join, Aggregation, und Zeitfenster-Operatoren erweitert werden können; (2) Konsistenz: die Erstellung und Entfernung von Spontananfragen müssen konsistent ausgeführt werden, die Semantik für einmalige Nachrichtenzustellung erhalten, sowie die Korrektheit des Anfrage-Ergebnisses sicherstellen; (3) Leistung: Im Gegensatz zu modernen DSVs sollen DSVs zur Spontananfrageverarbeitung nicht nur den Datendurchsatz, sondern auch den Anfragedurchsatz maximieren. Dies ermöglichen wir durch inkrementelle Kompilation und der Ressourcenteilung zwischen Anfragen. Drittens stellen wir ein Programmiergerüst zur Verbeitung von Spontananfragen auf Datenströmen vor. Dieses integriert die dynamische Anfrageverarbeitung und die Nachoptimierung von Anfragen mit der Spontananfrageverarbeitung auf Datenströmen. Unser Lösungsansatz besteht aus einer Schicht zur Anfrageoptimierung und einer Schicht zur Anfrageverarbeitung. Die Optimierungsschicht optimiert periodisch den Anfrageverarbeitungsplan nach, wobei sie zur Laufzeit Joins neu anordnet und vertikal sowie horizontal skaliert, ohne die Verarbeitung anzuhalten. Die Verarbeitungsschicht ermöglicht eine inkrementelle und konsistente Anfrageverarbeitung und unterstützt alle zuvor beschriebenen Eingriffe der Optimierungsschicht in die Anfrageverarbeitung. Zusammengefasst ergeben unsere zweiten und dritten Lösungskonzepte eine vollständige DSV zur Spontananfrageverarbeitung. Wir verwenden hierzu unseren ersten Beitrag nicht nur zur Bewertung moderner DSVs, sondern auch zur Evaluation unseres DSVs zur Spontananfrageverarbeitung
    • …
    corecore