
    Dias: Dynamic Rewriting of Pandas Code

    In recent years, dataframe libraries such as pandas have exploded in popularity. Due to their flexibility, they are increasingly used in ad-hoc exploratory data analysis (EDA) workloads. These workloads are diverse, including custom functions which can span libraries or be written in pure Python. The majority of systems available to accelerate EDA workloads focus on bulk-parallel workloads, which contain vastly different computational patterns, typically within a single library. As a result, they can introduce excessive overheads for ad-hoc EDA workloads due to their expensive optimization techniques. Instead, we identify program rewriting as a lightweight technique which can offer substantial speedups while also avoiding slowdowns. We implemented our techniques in Dias, which rewrites notebook cells to be more efficient for ad-hoc EDA workloads. We develop techniques for efficient rewrites in Dias, including dynamic checking of preconditions under which rewrites are correct and just-in-time rewrites for notebook environments. We show that Dias can rewrite individual cells to be 57× faster compared to pandas and 1909× faster compared to optimized systems such as modin. Furthermore, Dias can accelerate whole notebooks by up to 3.6× compared to pandas and 26.4× compared to modin.
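
To make the mechanism concrete, here is a minimal sketch of the kind of rewrite Dias performs, assuming a hypothetical rule that vectorizes row-wise string lowering; the function and rule are illustrative and not taken from Dias's actual rule set. The dynamically checked precondition guards correctness and falls back to the original code when it fails.

```python
import pandas as pd

# Hypothetical rewrite rule in the spirit of Dias (illustrative only): a
# row-wise .apply() is replaced by a vectorized equivalent, guarded by a
# dynamically checked precondition.
def rewritten_tolower(df: pd.DataFrame, col: str) -> pd.Series:
    # Runtime precondition check: the vectorized .str.lower() only matches
    # the original lambda if every value in the column is a string.
    if df[col].map(lambda v: isinstance(v, str)).all():
        return df[col].str.lower()                 # fast, vectorized path
    return df[col].apply(lambda v: v.lower())      # fall back to original code

df = pd.DataFrame({"name": ["Alice", "Bob", "Carol"]})
print(rewritten_tolower(df, "name"))
```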

    Modern data analytics in the cloud era

    Cloud computing has been the groundbreaking technology of the last decade.
The ease of use of the managed environment, combined with a nearly unlimited amount of resources and a pay-per-use price model, enables fast and cost-efficient project realization for a broad range of users. Cloud computing also changes the way software is designed, deployed and used. This thesis focuses on database systems deployed in the cloud environment. We identify three major interaction points between the database engine and its environment whose requirements differ from those of traditional on-premise data warehouse solutions. First, software is deployed on elastic resources. Consequently, systems should support elasticity in order to match workload requirements and be cost-effective. We present an elastic scaling mechanism for distributed database engines, combined with a partition manager that provides load balancing while minimizing partition reassignments in the case of elastic scaling. Furthermore, we introduce a buffer pre-heating strategy that mitigates the cold start after scaling and yields an immediate performance benefit from the newly added resources. Second, cloud-based systems are accessible and available from nearly everywhere. Consequently, data is frequently ingested from numerous endpoints, which differs from the bulk loads or ETL pipelines of a traditional data warehouse solution. Many users do not define database constraints, either to avoid transaction aborts due to conflicts or to speed up data ingestion. To mitigate this issue we introduce the concept of PatchIndexes, which allow the definition of approximate constraints. PatchIndexes maintain exceptions to these constraints, make them usable in query optimization and execution, and offer efficient update support. The concept can be applied to arbitrary constraints, and we provide examples of approximate uniqueness and approximate sorting constraints. Moreover, we show how PatchIndexes can be exploited to define advanced constraints like an approximate multi-key partitioning, which offers robust query performance over workloads with different partition key requirements. Third, data-centric workloads have changed over the last decade. Besides traditional SQL workloads for business intelligence, data science workloads are of significant importance nowadays. In these cases the database system often acts only as a data provider, while the computational effort takes place in dedicated data science or machine learning (ML) environments. As this workflow has several drawbacks, we pursue the goal of pushing advanced analytics towards the database engine and introduce the Grizzly framework as a DataFrame-to-SQL transpiler. Building on this, we identify user-defined functions (UDFs) and ML inference as important tasks that would benefit from deeper engine integration, and we investigate and evaluate approaches for in-database execution of Python UDFs and in-database ML inference.
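
To illustrate the PatchIndex concept, the following is a minimal sketch, not the paper's implementation: an approximately unique column is backed by a set of exception row ids, and a duplicate-eliminating operation only needs to do real work on rows inside that set.

```python
# Sketch of an approximate uniqueness constraint backed by a patch of
# exceptions (hypothetical code, not the paper's implementation).
class PatchIndex:
    def __init__(self, values):
        seen = set()
        self.exceptions = set()        # row ids that violate uniqueness
        for row_id, v in enumerate(values):
            if v in seen:
                self.exceptions.add(row_id)
            seen.add(v)

    def distinct(self, values):
        # Rows outside the patch are unique by the constraint and pass
        # through untouched; only exception rows can introduce duplicates.
        seen, out = set(), []
        for row_id, v in enumerate(values):
            if row_id in self.exceptions and v in seen:
                continue
            seen.add(v)
            out.append(v)
        return out

values = ["a", "b", "a", "c"]
print(PatchIndex(values).distinct(values))  # ['a', 'b', 'c']
```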

    Learned interpreters: structural and learned systematicity in neural networks for program execution

    General purpose deep neural network architectures have made startling advances in machine learning for code, advancing code completion, enabling natural language programming, detecting and repairing bugs, and even solving competitive programming problems at a human level of performance. Nevertheless, these methods struggle to understand the execution behavior of code, even when it is code they write themselves. To this end, we explore interpreter-inspired neural network architectures, introducing a novel architecture family called instruction pointer attention graph neural networks (IPA-GNN).
We apply this family of approaches to several tasks that require reasoning about the execution behavior of programs: learning to execute full and partial programs, code coverage prediction for hardware verification, and predicting runtime errors in competition programs. Through this series of works, we make several contributions and encounter multiple surprising and promising results. We introduce a Python library for constructing graph representations of programs for use in machine learning research, which serves as a bedrock for the research in this thesis and in the broader research community. We also introduce rich, large-scale datasets of programs annotated with program behavior, such as the outputs produced and errors raised during execution, to facilitate research in this domain. We find that IPA-GNN methods exhibit improved strong generalization over general purpose methods, performing well when trained to execute only short programs but tested on significantly longer programs. In fact, we find that IPA-GNN methods outperform generic methods on each of the behavior modeling tasks we consider, across both hardware and software domains. We even find that interpreter-inspired methods that model exception handling explicitly have a desirable interpretability property, enabling the prediction of error locations even when trained only on error presence and kind. In total, interpreter-inspired architectures like the IPA-GNN represent a promising path forward for imbuing neural networks with novel capabilities for learning to reason about program executions.
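
The instruction-pointer idea can be sketched in a deliberately simplified form, assuming fixed branch weights where the real IPA-GNN would learn them from per-node hidden states: the model keeps a soft distribution over program locations and propagates it along control-flow edges.

```python
import numpy as np

# Control-flow successors: node -> [(successor, branch weight)]. In the
# real IPA-GNN these weights come from a neural network reading per-node
# hidden states; fixed constants are used here for illustration.
edges = {0: [(1, 1.0)],            # straight-line step
         1: [(2, 0.7), (3, 0.3)],  # soft branch
         2: [(1, 1.0)],            # loop back
         3: [(3, 1.0)]}            # exit node absorbs mass

p = np.zeros(4)
p[0] = 1.0                         # execution starts at node 0
for _ in range(8):
    nxt = np.zeros(4)
    for node, succs in edges.items():
        for succ, w in succs:
            nxt[succ] += p[node] * w   # probability mass flows along edges
    p = nxt
print(p)                           # soft instruction pointer after 8 steps
```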

    Data Management for Dynamic Multimedia Analytics and Retrieval

    Multimedia data in its various manifestations poses a unique challenge from a data storage and data management perspective, especially when search, analysis and analytics in large data corpora are considered. The inherently unstructured nature of the data itself, and the curse of dimensionality that afflicts the representations we typically work with in its stead, cause a broad range of issues that require sophisticated solutions at different levels. This has given rise to a huge corpus of research focusing on techniques for effective and efficient multimedia search and exploration. Many of these contributions have led to an array of purpose-built multimedia search systems. However, recent progress in multimedia analytics and interactive multimedia retrieval has demonstrated that several of the assumptions usually made for such multimedia search workloads do not hold once a session has a human user in the loop. Firstly, many of the required query operations cannot be expressed by mere similarity search, and since the concrete requirements cannot always be anticipated, one needs a flexible and adaptable data management and query framework. Secondly, the widespread notion of static data collections does not hold for analytics workloads, whose purpose is to produce and store new insights and information. And finally, it is impossible even for an expert user to specify exactly how a data management system should produce and arrive at the desired outcomes of the potentially many different queries. Guided by these shortcomings, and motivated by the fact that similar questions have once been answered for structured data in classical database research, this thesis presents three contributions that seek to mitigate the aforementioned issues. We present a query model that generalises the notion of proximity-based query operations and formalises the connection between those queries and high-dimensional indexing. We complement this with a cost model that makes the often implicit trade-off between query execution speed and result quality transparent to the system and the user. And we describe a model for the transactional and durable maintenance of high-dimensional index structures. All contributions are implemented in the open-source multimedia database system Cottontail DB, on top of which we present an evaluation that demonstrates the effectiveness of the proposed models. We conclude by discussing avenues for future research in the quest for converging the fields of databases on the one hand and (interactive) multimedia retrieval and analytics on the other.
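
To illustrate the kind of speed/quality trade-off such a cost model makes transparent, here is a minimal sketch, not Cottontail DB's API: the same proximity (k-NN) query answered exactly by a full scan or approximately over a random sample, trading recall for latency.

```python
import numpy as np

def knn(vectors, query, k, sample_fraction=1.0, seed=0):
    # With sample_fraction < 1 only a random subset is scanned: the query
    # gets cheaper but may miss true neighbours (the quality side of the
    # trade-off a cost model can expose to the user).
    idx = np.arange(len(vectors))
    if sample_fraction < 1.0:
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(vectors), int(len(vectors) * sample_fraction),
                         replace=False)
    dists = np.linalg.norm(vectors[idx] - query, axis=1)
    order = np.argsort(dists)[:k]
    return idx[order], dists[order]

vecs = np.random.default_rng(1).normal(size=(10_000, 64))
query = vecs[42] + 0.01
print(knn(vecs, query, k=5))                        # exact full scan
print(knn(vecs, query, k=5, sample_fraction=0.1))   # faster, approximate
```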

    Spark Optimization: A Column Recommendation System for Data Partitioning and Z-Ordering on ETL Platforms

    In this thesis, we present a solution to the challenge of optimizing data retrieval in Spark. Our column recommendation system is based on Spark's event logs and finds influential columns for Z-ordering and partitioning. The recommendation system consists of four methods, each looking for different query patterns and query characteristics. In our experiment, the recommendation system improved run time by 17% compared to the baseline. This improvement demonstrates our column recommendation system's potential for optimizing data retrieval in Spark. Our system was developed on an ETL platform and is a flexible solution for ETL platforms utilizing Spark.
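
A minimal sketch of the core idea behind such a recommender, with event-log parsing elided and all details assumed rather than taken from the thesis: count how often each column appears in filter predicates across observed queries and recommend the most frequent ones.

```python
from collections import Counter

def recommend_columns(filter_predicates, top_k=3):
    # filter_predicates: one list of referenced columns per observed query,
    # e.g. extracted from the physical plans recorded in the event logs.
    counts = Counter(col for preds in filter_predicates for col in preds)
    return [col for col, _ in counts.most_common(top_k)]

observed = [["event_date", "country"],
            ["event_date"],
            ["user_id", "event_date"]]
print(recommend_columns(observed))  # ['event_date', 'country', 'user_id']
```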

    Scalable and fault-tolerant data stream processing on multi-core architectures

    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them to a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.
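
As an illustration of computation sharing based on algebraic properties, here is a minimal sketch for an invertible aggregation function: a sliding sum maintained incrementally by adding the arriving tuple and subtracting the evicted one, instead of re-aggregating every overlapping window.

```python
from collections import deque

def sliding_sums(stream, window_size):
    # Sum is invertible, so each window result costs O(1): add the new
    # tuple, subtract the evicted one, never re-scan the window.
    window, total, results = deque(), 0, []
    for value in stream:
        window.append(value)
        total += value                 # incorporate arriving tuple
        if len(window) > window_size:
            total -= window.popleft()  # invert the evicted tuple
        if len(window) == window_size:
            results.append(total)
    return results

print(sliding_sums([3, 1, 4, 1, 5, 9, 2], 3))  # [8, 6, 10, 15, 16]
```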

    High-level ETL for semantic data warehouses

    The popularity of the Semantic Web (SW) encourages organizations to organize and publish semantic data using the RDF model. This growth poses new requirements to Business Intelligence (BI) technologies to enable On-Line Analytical Processing (OLAP)-like analysis over semantic data. The incorporation of semantic data into a Data Warehouse (DW) is not supported by traditional Extract-Transform-Load (ETL) tools because they do not consider semantic issues in the integration process. In this paper, we propose a layer-based integration process and a set of high-level RDF-based ETL constructs required to define, map, extract, process, transform, integrate, update, and load (multidimensional) semantic data. Unlike other ETL tools, we automate the ETL data flows by creating metadata at the schema level, which relieves ETL developers from the burden of manual mapping at the ETL operation level. We create a prototype, named Semantic ETL Construct (SETLCONSTRUCT), based on the innovative ETL constructs proposed here. To evaluate SETLCONSTRUCT, we use it to create a multidimensional semantic DW by integrating a Danish Business dataset and an EU Subsidy dataset, and compare it with the previous programmable framework SETLPROG in terms of productivity, development time and performance. The evaluation shows that 1) SETLCONSTRUCT uses 92% fewer Number of Typed Characters (NOTC) than SETLPROG, and SETLAUTO (the extension of SETLCONSTRUCT that generates ETL execution flows automatically) further reduces the Number of Used Concepts (NOUC) by another 25%; 2) using SETLCONSTRUCT, the development time is almost cut in half compared to SETLPROG, and is cut by another 27% using SETLAUTO; 3) SETLCONSTRUCT is scalable and has similar performance compared to SETLPROG. This research is partially funded by the European Commission through the Erasmus Mundus Joint Doctorate Information Technologies for Business Intelligence (EM IT4BI-DC), the Poul Due Jensen Foundation, and the Danish Council for Independent Research (DFF) under grant agreement no. DFF-8048-00051B.
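
To give a flavor of ETL over RDF, here is a minimal sketch using the rdflib Python library; the namespaces, property names, and the mapping itself are illustrative assumptions, not SETLCONSTRUCT's actual constructs.

```python
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/")                 # illustrative namespace
QB = Namespace("http://purl.org/linked-data/cube#")   # RDF Data Cube vocabulary

# Extract: a toy semantic source with one subsidy fact.
src = Graph()
src.add((EX.company1, EX.receivedSubsidy, Literal(50000)))

# Transform and load: map each fact onto an observation of a
# multidimensional cube with one dimension and one measure.
dw = Graph()
for company, _, amount in src.triples((None, EX.receivedSubsidy, None)):
    obs = URIRef(str(company) + "/obs")
    dw.add((obs, QB.dataSet, EX.subsidyDataSet))  # attach to the cube
    dw.add((obs, EX.refCompany, company))         # dimension value
    dw.add((obs, EX.amount, amount))              # measure value

print(dw.serialize(format="turtle"))
```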