8 research outputs found

    Exploiting Event Log Event Attributes in RNN Based Prediction

    In predictive process analytics, current and historical process data in event logs are used to predict the future, e.g., the next activity or how long a process will still require to complete. Recurrent neural networks (RNN) and their subclasses have been shown to be well suited for building such prediction models. So far, however, event attributes have not been fully exploited in these models; the biggest challenge is the potentially large number of event attributes and attribute values. We present a novel clustering technique which allows for trade-offs between prediction accuracy and the time needed for model training and prediction. As an additional finding, combining this clustering method with raw event attribute values in some cases provides even better prediction accuracy, at the cost of additional time required for training and prediction.
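    To illustrate the general idea (a minimal sketch, not the authors' exact method), the code below clusters one-hot-encoded event attribute values into a small number of cluster labels that can then be combined with each event's activity token before feeding the sequence to an RNN; the attribute names, cluster count, and token encoding are hypothetical.

```python
# Sketch: compress many event attribute values into a few cluster labels
# before sequence modeling. Attribute names and k are illustrative only.
from sklearn.preprocessing import OneHotEncoder
from sklearn.cluster import KMeans

# Toy event log: one row per event, with categorical event attributes.
events = [
    {"activity": "Create",  "resource": "alice", "channel": "web"},
    {"activity": "Check",   "resource": "bob",   "channel": "phone"},
    {"activity": "Approve", "resource": "alice", "channel": "web"},
    {"activity": "Check",   "resource": "carol", "channel": "mail"},
]

# One-hot encode the (potentially high-cardinality) attribute values.
attrs = [[e["resource"], e["channel"]] for e in events]
X = OneHotEncoder().fit_transform(attrs).toarray()

# Cluster attribute combinations into k labels; k trades accuracy for
# training/prediction time, which is the paper's central trade-off.
k = 2
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# Each event is now an (activity, attribute-cluster) token, a far smaller
# vocabulary for the downstream RNN than the raw attribute values.
tokens = [f'{e["activity"]}|c{c}' for e, c in zip(events, labels)]
print(tokens)  # e.g. ['Create|c0', 'Check|c1', 'Approve|c0', 'Check|c1']
```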

    Advancements and Challenges in Object-Centric Process Mining: A Systematic Literature Review

    Recent years have seen the emergence of object-centric process mining techniques. Born as a response to the limitations of traditional process mining in analyzing event data from prevalent information systems like CRM and ERP, these techniques aim to tackle the deficiency, convergence, and divergence issues of traditional event logs. Despite their promise, adoption in real-world process mining analyses remains limited. This paper presents a comprehensive literature review of object-centric process mining, providing insights into the current status of the discipline and its historical trajectory.
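    For readers unfamiliar with these issues, the toy example below (not from the paper, standard in the object-centric process mining literature) shows the convergence problem: an event that refers to multiple objects is duplicated when the log is flattened onto a single case notion.

```python
# Toy illustration of "convergence": one delivery event refers to two
# orders, so flattening the log per order duplicates that event.
events = [
    {"id": "e1", "activity": "place order",     "orders": ["o1"]},
    {"id": "e2", "activity": "place order",     "orders": ["o2"]},
    {"id": "e3", "activity": "create delivery", "orders": ["o1", "o2"]},
]

# Traditional single-case-notion log (case = order): replicate per order.
flat = [(o, e["id"], e["activity"]) for e in events for o in e["orders"]]
for row in flat:
    print(row)
# e3 now appears in both case o1 and case o2, so frequency counts over
# the flattened log over-state how often deliveries actually happened.
```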

    Hardware-conscious query processing for the many-core era

    Exploiting the opportunities offered by modern hardware to accelerate query processing is no trivial task. Many DBMS and also DSMS from past decades are based on assumptions that no longer hold; for example, today's servers with terabytes of main memory capacity allow spilling data to disk to be avoided entirely, which some time ago paved the way for main-memory databases. One of the recent hardware trends is many-core processors with hundreds of logical cores on a single CPU, providing a high degree of parallelism through multithreading as well as vectorized instructions (SIMD). Their demand for memory bandwidth has driven the further development of high-bandwidth memory (HBM) to overcome the memory wall. However, many-core CPUs as well as HBM have many pitfalls that can easily nullify any performance gain. In this work, we explore the many-core architecture together with HBM for database and data stream query processing. We demonstrate that a hardware-conscious cost model with a calibration approach allows reliable performance prediction of various query operations.
    Based on that information, we derive an adaptive partitioning and merging strategy for stream query parallelization as well as an ideal parameter configuration for one of the most common tasks in the history of DBMS, join processing. However, not all operations and applications can exploit a many-core processor or HBM. Stream queries optimized for low latency and quick individual responses usually benefit little from more bandwidth and additionally suffer from penalties such as the low clock frequencies of many-core CPUs. Shared data structures between cores also lead to problems with cache coherence as well as high contention. Based on our insights, we give a rule of thumb for which data structures are suitable for parallelization with a focus on HBM usage. In addition, different parallelization schemes and synchronization techniques are evaluated using the example of a multiway stream join operation.
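    A minimal sketch of the calibration idea (assumed details, not the thesis' actual cost model): time an operator at a few input sizes, fit a linear cost function, and use it to predict runtimes at unseen sizes, which can then drive partitioning decisions.

```python
# Sketch of a calibrated linear cost model: cost(n) ~ a * n + b.
# The operator, measurement harness, and cost shape are assumptions.
import time
import numpy as np

def scan_operator(n):
    # Stand-in for a real query operator: a scan-and-aggregate over n values.
    data = np.arange(n, dtype=np.float64)
    return data.sum()

# Calibration phase: measure the operator at a few input sizes.
sizes = np.array([10_000, 100_000, 1_000_000])
costs = []
for n in sizes:
    t0 = time.perf_counter()
    scan_operator(int(n))
    costs.append(time.perf_counter() - t0)

# Least-squares fit of cost = a*n + b (per-tuple cost a, fixed cost b).
a, b = np.polyfit(sizes, costs, deg=1)

# Prediction for an unseen input size, usable for partitioning decisions.
n_new = 500_000
print(f"predicted cost for n={n_new}: {a * n_new + b:.6f} s")
```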

    SO-TGDs und Skolemisierung für ChaTEAU

    Schema mappings and schema evolution can be expressed as ST-TGDs. Since the composition of two schema mappings cannot always be expressed as an ST-TGD, so-called SO-TGDs were introduced. We examine when SO-TGDs are necessary to express a composition and whether the inverses of schema mappings can always be expressed as ST-TGDs. We also consider the inverse of SO-TGDs. Finally, we present how SO-TGDs and Skolem functions can be represented in the chase tool ChaTEAU.
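    As a standard textbook illustration (not taken from this thesis), Skolemization replaces the existentially quantified variable of an ST-TGD by a function term, turning the first-order dependency into a second-order one:

```latex
% ST-TGD with an existential variable z:
\forall x \,\forall y \,\bigl( E(x,y) \rightarrow \exists z \,( F(x,z) \land F(z,y) ) \bigr)

% Skolemized SO-TGD: z is replaced by the Skolem term f(x,y),
% and the function f is quantified existentially at second order:
\exists f \,\forall x \,\forall y \,\bigl( E(x,y) \rightarrow F(x, f(x,y)) \land F(f(x,y), y) \bigr)
```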

    Data-Efficient Learned Database Components

    While databases are the backbone of many software systems, database components such as query optimizers often have to be redesigned to meet the increasing variety in workloads, data, and hardware, which incurs significant engineering effort. It was thus recently proposed to replace DBMS components such as optimizers and cardinality estimators with ML models, which not only eliminates this engineering effort but also provides superior performance for many components. The predominant approach to deriving such learned components is workload-driven learning, where tens of thousands of queries have to be executed first to obtain the necessary training data. Unfortunately, this training data collection, which can take days even for medium-sized datasets, has to be repeated for every new database (i.e., every combination of dataset, schema, and workload) a component should be deployed for. This is especially problematic for cloud databases such as Snowflake or Redshift, since the effort has to be incurred for every customer.

    This dissertation thus proposes data-efficient learned database components, which reduce or fully eliminate the high cost of training data collection. In particular, three directions are pursued: (i) we first reduce the number of training queries needed for workload-driven components, before we (ii) propose data-driven learning, which uses the data stored in the database as training data instead of queries, and (iii) introduce zero-shot learned components, which generalize to new databases out of the box, such that no training data collection is required at all.

    First, we reduce the number of training queries required for workload-driven components by using simulation models that convey the basic trade-offs of the underlying problem, e.g., that in database partitioning the network cost of shuffling tuples for joins is the dominating factor. This substantially reduces the number of training queries, since the basic principles are already covered by the simulation model and only subtleties not captured by it have to be learned from observed query executions; we demonstrate this for the problem of database partitioning. An alternative direction is to incorporate domain knowledge (e.g., in a cost model we could encode that scan costs increase linearly with the number of tuples) into components by designing them using differentiable programming. This significantly reduces the number of learnable parameters and thus the number of required training queries, as we demonstrate for the problem of cost estimation in databases. While both approaches reduce the number of training queries, a significant number is still required for unseen databases. This motivates our second direction, data-driven learning: we propose to train the database component on the data distribution present in the database instead of on observed query executions. This not only completely eliminates the need to collect training queries but can even improve the state of the art in problems such as cardinality estimation or AQP.
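    A minimal sketch of the data-driven idea (assumed model and toy data, not the dissertation's actual architecture): fit a density model over a column and estimate a predicate's cardinality as table size times estimated selectivity, with no query executions at all.

```python
# Sketch: data-driven cardinality estimation via a fitted density model.
# The model choice (Gaussian KDE) and the toy table are assumptions.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
N = 100_000
ages = rng.normal(40, 12, size=N)      # toy numeric column "age"

kde = gaussian_kde(ages)               # density model learned from the data

# Estimate |SELECT * FROM t WHERE age BETWEEN 25 AND 35| without running it:
sel = kde.integrate_box_1d(25, 35)     # estimated selectivity
print(f"estimated rows: {sel * N:.0f}")
print(f"actual rows:    {((ages >= 25) & (ages <= 35)).sum()}")
```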
    While we demonstrate the applicability of data-driven learning to a wide range of further database tasks, such as the completion of incomplete relational datasets, it is only useful for problems where the data distribution provides sufficient information for the task at hand. For tasks where observations of query executions are indispensable, such as cost estimation, data-driven learning cannot be leveraged. In a third direction, we thus propose zero-shot learned database components, which are applicable to a broader set of tasks, including those that require query observations. Motivated by recent advances in transfer learning, we pretrain a model once on a variety of databases and workloads, allowing the component to generalize to unseen databases out of the box; as with data-driven learning, no training queries have to be collected. In this dissertation, we demonstrate that zero-shot learning can indeed yield learned cost models that predict query latencies on entirely unseen databases more accurately than state-of-the-art workload-driven approaches, which require tens of thousands of query executions on every unseen database.

    Overall, the proposed techniques yield state-of-the-art performance for many database tasks while significantly reducing or completely eliminating the expensive training data collection for unseen databases. Nevertheless, many opportunities remain to improve learned components. First, their robustness and debuggability should be improved, since they do not yet offer the same transparency as standard database code, which can make them less attractive for production systems. Moreover, to increase the applicability of data-driven models, the coverage of supported queries should be extended, e.g., to queries involving wildcard predicates on string columns, which are currently not supported by data-driven learning. Finally, we envision that zero-shot models will support a broader set of tasks in the future (e.g., query optimization), potentially converging towards completely zero-shot learned systems.
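    The sketch below illustrates, with invented features and data, the transferable-representation idea behind zero-shot cost models: describe plan operators with database-agnostic features (no table names or database-specific identifiers) so that one model trained across many databases can score operators of an unseen database.

```python
# Sketch of zero-shot cost estimation: database-agnostic plan features
# (operator type, estimated rows, row width) feed one shared regressor.
# Features, training data, and model choice are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

OPS = {"scan": 0, "filter": 1, "join": 2}

def featurize(op, est_rows, row_width):
    # Transferable features only: nothing tied to a specific database.
    return [OPS[op], np.log1p(est_rows), row_width]

# Pretend training set gathered across many *different* databases.
X = np.array([featurize("scan", 1e6, 64), featurize("join", 5e4, 128),
              featurize("filter", 2e5, 32), featurize("join", 1e6, 64)])
y = np.array([0.8, 0.3, 0.1, 2.5])     # observed latencies in seconds

model = GradientBoostingRegressor().fit(X, y)

# Zero-shot: score an operator of a database never seen during training.
print(model.predict([featurize("join", 3e5, 96)]))
```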

    Big Data Analytics für die effiziente Aktivitätserkennung und -vorhersage in Assistenzsystemen

    This thesis investigates to what extent parallel relational database systems can be profitably employed for activity recognition and prediction methods in assistance systems. The focus is on the efficient and scalable implementation and composition of basic linear algebra operators. Besides enabling the implementation of the corresponding machine learning methods, this allows numerous further methods from scientific computing to be incorporated. To this end, numerous aspects of their potential implementation are discussed and evaluated experimentally.
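    To illustrate the underlying idea (a generic sketch, not the thesis' implementation): sparse matrices can be stored as (row, column, value) relations, and matrix multiplication then becomes a join plus aggregation that a parallel RDBMS can execute with its ordinary relational operators.

```python
# Sketch: matrix multiplication as a relational query over coordinate
# tables (i, j, v). SQLite stands in for a parallel relational DBMS.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE A (i INT, j INT, v REAL)")
con.execute("CREATE TABLE B (i INT, j INT, v REAL)")

# A = [[1, 2], [0, 3]],  B = [[4, 0], [5, 6]]  (zeros omitted: sparse)
con.executemany("INSERT INTO A VALUES (?,?,?)",
                [(0, 0, 1), (0, 1, 2), (1, 1, 3)])
con.executemany("INSERT INTO B VALUES (?,?,?)",
                [(0, 0, 4), (1, 0, 5), (1, 1, 6)])

# C = A * B expressed as join + group-by, a basic linear algebra operator
# composed entirely from relational primitives.
rows = con.execute("""
    SELECT A.i, B.j, SUM(A.v * B.v) AS v
    FROM A JOIN B ON A.j = B.i
    GROUP BY A.i, B.j
    ORDER BY A.i, B.j
""").fetchall()
print(rows)  # [(0, 0, 14.0), (0, 1, 12.0), (1, 0, 15.0), (1, 1, 18.0)]
```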

    New Fundamental Technologies in Data Mining

    Get PDF
    The progress of data mining technology and its growing public popularity establish a need for a comprehensive text on the subject. The book series entitled "Data Mining" addresses this need by presenting in-depth descriptions of novel mining algorithms and many useful applications. Beyond an in-depth treatment of each topic, the two books offer useful hints and strategies for solving the problems discussed in the respective chapters. The contributing authors have highlighted many future research directions that will foster multi-disciplinary collaborations and hence lead to significant development in the field of data mining.