14 research outputs found

    Relational Algebra for In-Database Process Mining

    The execution logs that are used for process mining in practice are often obtained by querying an operational database and storing the result in a flat file. Consequently, the data processing power of the database system can no longer be used for this information, which constrains flexibility in the definition of mining patterns and limits execution performance when mining large logs. Enabling process mining directly on a database - instead of via intermediate storage in a flat file - therefore provides additional flexibility and efficiency. To help facilitate this ideal of in-database process mining, this paper formally defines a database operator that extracts the 'directly follows' relation from an operational database. This operator can be used both for in-database process mining and to flexibly evaluate process mining related queries, such as: "which employee most frequently changes the 'amount' attribute of a case from one task to the next". We define the operator using the well-known relational algebra that forms the formal underpinning of relational databases. We formally prove equivalence properties of the operator that are useful for query optimization and present time-complexity properties of the operator. By doing so, this paper formally defines the relational algebraic elements of a 'directly follows' operator that are required to implement such an operator in a DBMS.
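As a rough illustration of the 'directly follows' relation described in the abstract above, the following minimal sketch computes it in plain Python rather than in relational algebra. The event layout `(case_id, timestamp, activity)`, the function name `directly_follows`, and the sample log are assumptions made for this example; they are not taken from the paper.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter


def directly_follows(events):
    """Count how often activity a is immediately followed by activity b
    within the same case.

    `events` is an iterable of (case_id, timestamp, activity) tuples;
    this layout is an assumption for the sketch.
    """
    # Order events by case, then by time, so consecutive rows of one
    # case are consecutive steps of its trace.
    ordered = sorted(events, key=itemgetter(0, 1))
    counts = Counter()
    for _, trace in groupby(ordered, key=itemgetter(0)):
        activities = [activity for _, _, activity in trace]
        # Pair each activity with its immediate successor.
        counts.update(zip(activities, activities[1:]))
    return counts


# Hypothetical two-case log.
log = [
    ("c1", 1, "register"), ("c1", 2, "check"), ("c1", 3, "pay"),
    ("c2", 1, "register"), ("c2", 2, "pay"),
]
df = directly_follows(log)
```

In a DBMS the same pairing would typically be expressed as a self-join on the case identifier with a condition that no event lies between the two joined events, which is the kind of expression the abstract says the operator is defined with.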

    Comparing global optimization and default settings of stream-based joins

    One problem encountered in real-time data integration is the join of a continuous incoming data stream with a disk-based relation. In this paper we investigate a stream-based join algorithm, called mesh join (MESHJOIN), and focus on a critical component in the algorithm, called the disk-buffer. In MESHJOIN the size of the disk-buffer varies with the total memory budget, and tuning is required to get the maximum service rate within the limited available memory. Until now there was little data on how the position of the optimum value depends on the memory size, and no performance comparison has been carried out between the optimum and reasonable default sizes for the disk-buffer. To avoid tuning, we propose a reasonable default value for the disk-buffer size with a small and acceptable performance loss. The experimental results validate our arguments.

    Memory system for a relational database processor

    An associative memory for a relational database management system, with content addressing capability, is studied and analyzed. The system utilizes one level of indexing and the database is clustered. The logic-per-track approach is used for parallel processing of the data in a cylinder. The attributes and the tuples are allowed to have an arbitrary length and no encoding algorithm is used. The performance of the system is analyzed and it is demonstrated to have superior performance in comparison to software-based systems. The cost effectiveness of the system is also shown

    An Analytical Model for Evaluating Database Update Schemes

    A methodology is presented for evaluating the performance of database update schemes. The methodology uses the M/Hr/1 queueing model as a basis for this analysis and makes use of the history of how data is used in the database. Parameters have been introduced which can be set based on the characteristics of a specific system. These include update to retrieval ratio, average file size, overhead, block size and the expected number of items in the database. The analysis is specifically directed toward the support of derived data within the relational model. Three support methods are analyzed. These are first examined in a central database system. The analysis is then extended in order to measure performance in a distributed system. Because concurrency is a major problem in a distributed system, the support of derived data is analyzed with respect to three distributive concurrency control techniques: master/slave, distributed and synchronized. In addition to its use as a performance predictor, the development of the methodology serves to demonstrate how queueing theory may be used to investigate other related database problems. This is an important benefit, given the lack of fundamental results in the area of using queueing theory to analyze database performance.
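For orientation, the standard mean-waiting-time result for a single-server queue with Poisson arrivals can be written out as below. This is a sketch under the assumption that "Hr" in the abstract above denotes an r-branch hyperexponential service distribution; the symbols $\lambda$, $p_i$, $\mu_i$, $S$, $\rho$ and $W_q$ are introduced here for illustration and do not come from the abstract.

```latex
% Assumed notation: arrival rate \lambda; service branch i is chosen with
% probability p_i and is exponential with rate \mu_i, for i = 1, ..., r.
\[
  E[S] = \sum_{i=1}^{r} \frac{p_i}{\mu_i},
  \qquad
  E[S^2] = \sum_{i=1}^{r} \frac{2\,p_i}{\mu_i^{2}},
  \qquad
  \rho = \lambda\, E[S] < 1
\]
% Pollaczek-Khinchine formula for the mean time spent waiting in queue:
\[
  W_q = \frac{\lambda\, E[S^2]}{2\,(1-\rho)}
\]
```

A hyperexponential service time lets the model mix short and long operations, for example quick retrievals and slower updates, which may be why a parameter such as the update to retrieval ratio appears in the methodology.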

    A Memory-Optimal Many-To-Many Semi-Stream Join

    Semi-stream join algorithms join a fast stream input with a disk-based master data relation. A common class of these algorithms is derived from hash joins: they use the stream as build input for a main hash table, and also include a cache for frequent master data. The composition of the cache is very important for performance; however, the decision of which master data to cache has so far been solely based on heuristics. We present the first formal criterion, a cache inequality that leads to a provably optimal composition of the cache in a semi-stream many-to-many equijoin algorithm. We propose a novel algorithm, Semi-Stream Balanced Join (SSBJ), which exploits this cache inequality to achieve a given service rate with a provably minimal amount of memory for all stream distributions. We present a cost model for SSBJ and compare its service rate empirically and analytically with other related approaches.

    Efficient computation and communication management for all-pairs interactions

    Big data continues to grow in size for all sciences. New methods like those proposed are needed to further reduce memory footprints and distribute work equally across compute nodes both in local HPC systems and rented cluster resources in the cloud. Modern infrastructures have evolved to support these big data computations and that includes key pieces like our internet backbones and data center networks. Many optical networks face heterogeneous communication requests requiring topologies to be efficient and fault tolerant. The all-pairs problem requires all elements (computation datasets or communication nodes) to be paired with all other elements. These all-pairs problems occur in many research fields and have significant impacts, which has led to continued interest in them. We proposed using cyclic quorum sets to efficiently manage all-pairs computations. We proved these sets have an all-pairs property that allows for minimal data replication and for distributed, load balanced, and communication-less computation management. The quorums are $O\left(\frac{N}{\sqrt{P}}\right)$ in size, up to 50% smaller than dual $\frac{N}{\sqrt{P}}$ array implementations, and significantly smaller than solutions requiring all data. Scaling from 16 to 512 cores (1 to 32 compute nodes) and using real dataset inputs, application experiments demonstrated scalability with greater than 150x (super-linear) speedup and less than 1/4th the memory usage per process. Cyclic quorum sets can provide benefits to more than just computations. The sets can also provide a guarantee that all pairs of optical nodes in a network can communicate. Our evaluation analyzed the fault tolerance of routing optical cycles based on cyclic quorum sets. With this method of topology construction, unicast and multicast communication requests do not need to be known or even modeled a priori. In the presence of network single-link faults, our simulated cycle routing had greater than 99% average fault coverage. 
Hence, even in the presence of a network fault, the optical networks could continue operation of nearly all node pair communications. Lastly, we proposed a generalized $R$ redundant cyclic quorum set. These sets guarantee all pairs of nodes occur at least $R$ times. When applied to routing cycles in optical networks, this technique provided almost fault-tolerant communications. More importantly, when applied using only single cycles rather than the standard paired cycles, the generalized $R$ redundancy technique almost halved the necessary light-trail resources while maintaining the fault tolerance and dependability expected from cycle-based routing. Problem Description: Big Data in recent years has become a focal point for science and commerce. As datasets grow larger, traditional methods and algorithms are challenged on whether they are able to truly scale. This has led to phrases like "swimming in sensors, drowning in data". Our work addresses some of the challenges facing a particular type of big data interaction. The interaction considered requires all elements in a set to interact with all other elements in the set. The all-pairs interaction is a general computation or communication problem that occurs frequently and can be as simple as considering the shaking of hands by all attendees to a party. More formally, there is a set $E_N$ with $N$ elements indexed $0$ to $N-1$: \[ E_N = \left\lbrace e_0, e_1, \ldots, e_{N-1} \right\rbrace \] The elements in this general formulation can be simple, single communication nodes or single-item data structures, e.g., $E_N$ could simply be all nodes in a network or be a large array of $N$ values. Or, elements can be complex data structures with many fields / values. Fields are not restricted to a single data type either, as many big data problems can rely on heterogeneous datasets. The all-pairs interaction considers all $\binom{N}{2}$ possible pairs of elements: 
\[ \left\lbrace \left(e_0,e_1\right), \left(e_0,e_2\right), \ldots, \left(e_0,e_{N-1}\right), \left(e_1,e_2\right), \left(e_1,e_3\right), \ldots, \left(e_1,e_{N-1}\right), \ldots, \left(e_{N-2},e_{N-1}\right) \right\rbrace \] The simple handshake example could be considered a symmetric interaction: \[ e_i \leftrightarrow e_j, \quad i < j \] The all-pairs interaction can be more generally represented by two separate interactions to better represent the computational or communication complexity in those problems where the all-pairs operation is not commutative: \[ e_i \rightarrow e_j, \quad i < j \] \[ e_i \leftarrow e_j, \quad i < j \] The computational complexity of this general algorithmic form is not daunting: \[ \binom{N}{2} = \frac{\left(N-1\right) N}{2} = O\left(N^2\right) \] In fact, even for pair computations that do not have the commutative property, the complexity is unchanged. In general, polynomial $O\left(N^2\right)$ computations are considered highly computationally scalable. When performing an all-pairs data interaction at big data scale, the computational complexity is theoretically manageable, but the data management becomes complex. The problem definition inherently requires access to the entire dataset, such that every data element can be paired and processed with every other data element in the set. When the datasets exceed a system's memory size, this presents a challenge, which our methods address. Solution Approach: For efficiency and distributed control, it is common in distributed systems and algorithms to group nodes into intersecting sets referred to as quorum sets. Our management techniques rely on the established quorum set theories for their efficiencies and management. We then proved an all-pairs property of cyclic quorum sets, which is central to guaranteeing that all pairs of elements (nodes or data) are able to interact in the system. 
The all-pairs data computation problem requires all data elements to be paired with all other data elements. These all-pairs problems occur in many science fields, which has led to continued interest in them. Our research addresses the memory and computation time challenges of the general all-pairs big data interaction computations through the use of memory efficient computation management techniques. We proposed methods that use distributed computing to share the computational workload. Although the problem definition requires every data element to have access to and interact with the entire dataset, our cyclic quorum set techniques relax this restriction in distributed systems. This computation management is used to reduce memory resource requirements per node and enable big data scalability. Implementation evaluation of a large bioinformatics application demonstrated scalability on real datasets with linear and at times super-linear speedups. Reductions in memory requirements per node allowed for processing larger datasets that would not have been feasible on a single node either due to memory or time requirements. Similar cyclic quorum set techniques were used to address efficient and fault tolerant communication routing challenges in optical networking. Cycle-based optical network routing, whether using SONET rings or p-cycles, provides sufficient reliability in networks. Light-trails forming a cycle in the network allow broadcasts within a cycle to be used for efficient multicast communications. Using the proven "all-pairs" property of cyclic quorum sets, we could guarantee all pairs of nodes will occur in one or more quorums, so efficient, arbitrary unicast communication can occur between any two nodes. Efficient broadcasts to all network nodes are possible by a node broadcasting to all quorum cycles to which it belongs ($O\left(\sqrt{N}\right)$). 
We analyzed node pair communications in networks, specifically the fault tolerance aspects of using cyclic quorum sets to route cycles. We observed better than 99% average single fault coverage, and some node pair communications were protected by more than one cycle. Exploiting these redundant node pair protections revealed even greater resource efficiencies. Common cycle routing techniques use pairs of cycles to achieve both routing and fault tolerance, which uses substantial resources and creates the potential for underutilization. Instead, when we intentionally designed cyclic quorum sets with $R$ redundant pairs of nodes and utilized the $R$ redundancy within the quorum cycles to replace the pair of cycles with just a single cycle, we saw network resource usage almost halved. Our analysis of several networks showed $R=2$ redundant single cycles had 96.60 - 99.37% single link fault coverage, while reducing resource usage by 42.9 - 47.18% on average. Increasing redundancy to $R=3$ redundant cycles maintained a 93.23 - 99.34% average fault coverage even with two simultaneous link faults and used 38.85 - 42.39% fewer resources on average.
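The all-pairs property discussed in the abstract above can be checked on a toy instance. The sketch below is an illustration only and does not reproduce the authors' construction: it assumes $N = 7$ and the base set `{0, 1, 3}` (a difference set modulo 7), builds one quorum per cyclic shift, and verifies that every one of the $\binom{7}{2} = 21$ pairs appears together in at least one quorum. The names `cyclic_quorums` and `uncovered_pairs` are hypothetical.

```python
from itertools import combinations


def cyclic_quorums(base, n):
    """Return n quorums, where quorum i is the base set shifted by i mod n."""
    return [frozenset((b + i) % n for b in base) for i in range(n)]


def uncovered_pairs(quorums, n):
    """List the element pairs (i, j), i < j, that share no quorum."""
    covered = set()
    for quorum in quorums:
        # Sorting keeps each pair in (smaller, larger) order so it can be
        # compared against combinations(range(n), 2).
        covered.update(combinations(sorted(quorum), 2))
    return [pair for pair in combinations(range(n), 2) if pair not in covered]


N = 7             # assumed toy size
BASE = (0, 1, 3)  # assumed base set: its differences cover 1..6 modulo 7

quorums = cyclic_quorums(BASE, N)
all_pairs = list(combinations(range(N), 2))
missing = uncovered_pairs(quorums, N)
```

Each quorum here holds 3 of the 7 elements, so a worker assigned one quorum needs only that subset in memory, yet no pair is left without a worker able to process it.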

    Database design: A practical methodology.


    ObInject Query Language.

    The emergence of databases revolutionized the way data is stored. They allowed an enormous amount of data to be stored in structures and made that data easier to manipulate. Along with databases came query languages. These languages transferred to the database the task of manipulating the structures and, consequently, responsibility for the performance of data extraction. More recently, data persistence frameworks have become very popular. Among them, the Object-Inject framework (CARVALHO et al., 2013) has proved quite promising for object persistence. However, this framework does not yet offer a query language, so the structures must be manipulated directly to extract data. This work aims to define a query language for this framework.

    Optimisation of partitioned temporal joins


    Query management in document management systems

    The monograph develops new methods for accelerating queries in databases and document management systems to support decision-making in socially oriented structures. A search technique based on fuzzy measures is proposed to reduce retrieval time in large and very large databases. Methods for building automated integrated multimedia complexes that display dynamic situations in objects of the socio-communal structure are improved.