Search CORE

1,530 research outputs found

Session stitching using sequence fingerprinting for web page visits

Author: De Smedt Johannes
Kohls Hans-Helmut
Lacka Ewelina
Nita Spyro
Paton Ross
Publication venue: 'Elsevier BV'
Publication date: 28/04/2021
Field of study

Partial-order-based process mining: a survey and outlook

Author: Leemans Sander J. J.
Lu Xixi
Zelst Sebastiaan J. van
Publication venue
Publication date: 01/01/2023
Field of study

The field of process mining focuses on distilling knowledge of the (historical) execution of a process based on the operational event data generated and stored during its execution. Most existing process mining techniques assume that the event data describe activity executions as degenerate time intervals, i.e., intervals of the form [t, t], yielding a strict total order on the observed activity instances. However, for various practical use cases, e.g., the logging of activity executions with a nonzero duration and uncertainty on the correctness of the recorded timestamps of the activity executions, assuming a partial order on the observed activity instances is more appropriate. Using partial orders to represent process executions, i.e., based on recorded event data, allows for new classes of process mining algorithms, i.e., aware of parallelism and robust to uncertainty. Yet, interestingly, only a limited number of studies consider using intermediate data abstractions that explicitly assume a partial order over a collection of observed activity instances. Considering recent developments in process mining, e.g., the prevalence of high-quality event data and techniques for event data abstraction, the need for algorithms designed to handle partially ordered event data is expected to grow in the upcoming years. Therefore, this paper presents a survey of process mining techniques that explicitly use partial orders to represent recorded process behavior. We performed a keyword search, followed by a snowball sampling strategy, yielding 68 relevant articles in the field. We observe a recent uptake in works covering partial-order-based process mining, e.g., due to the current trend of process mining based on uncertain event data. Furthermore, we outline promising novel research directions for the use of partial orders in the context of process mining algorithms

Utrecht University Repository

Frequent Itemset Mining for Big Data

Author: Pulvirenti Fabio
Publication venue: Politecnico di Torino
Publication date: 01/01/2017
Field of study

Traditional data mining tools, developed to extract actionable knowledge from data, demonstrated to be inadequate to process the huge amount of data produced nowadays. Even the most popular algorithms related to Frequent Itemset Mining, an exploratory data analysis technique used to discover frequent items co-occurrences in a transactional dataset, are inefficient with larger and more complex data. As a consequence, many parallel algorithms have been developed, based on modern frameworks able to leverage distributed computation in commodity clusters of machines (e.g., Apache Hadoop, Apache Spark). However, frequent itemset mining parallelization is far from trivial. The search-space exploration, on which all the techniques are based, is not easily partitionable. Hence, distributed frequent itemset mining is a challenging problem and an interesting research topic. In this context, our main contributions consist in an (i) exhaustive theoretical and experimental analysis of the best-in-class approaches, whose outcomes and open issues motivated (ii) the development of a distributed high-dimensional frequent itemset miner. The dissertation introduces also a data mining framework which takes strongly advantage of distributed frequent itemset mining for the extraction of a specific type of itemsets (iii). The theoretical analysis highlights the challenges related to the distribution and the preliminary partitioning of the frequent itemset mining problem (i.e. the search-space exploration) describing the most adopted distribution strategies. The extensive experimental campaign, instead, compares the expectations related to the algorithmic choices against the actual performances of the algorithms. We run more than 300 experiments in order to evaluate and discuss the performances of the algorithms with respect to different real life use cases and data distributions. The outcomes of the review is that no algorithm is universally superior and performances are heavily skewed by the data distribution. Moreover, we were able to identify a concrete lack as regards frequent pattern extraction within high-dimensional use cases. For this reason, we have developed our own distributed high-dimensional frequent itemset miner based on Apache Hadoop. The algorithm splits the search-space exploration into independent sub-tasks. However, since the exploration strongly benefits of a full-knowledge of the problem, we introduced an interleaving synchronization phase. The result is a trade-off between the benefits of a centralized state and the ones related to the additional computational power due to parallelism. The experimental benchmarks, performed on real-life high-dimensional use cases, show the efficiency of the proposed approach in terms of execution time, load balancing and reliability to memory issues. Finally, the dissertation introduces a data mining framework in which distributed itemset mining is a fundamental component of the processing pipeline. The aim of the framework is the extraction of a new type of itemsets, called misleading generalized itemsets

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Strategies for Managing Linked Enterprise Data

Author: Galkin Mikhail
Publication venue: Universitäts- und Landesbibliothek Bonn
Publication date
Field of study

Data, information and knowledge become key assets of our 21st century economy. As a result, data and knowledge management become key tasks with regard to sustainable development and business success. Often, knowledge is not explicitly represented residing in the minds of people or scattered among a variety of data sources. Knowledge is inherently associated with semantics that conveys its meaning to a human or machine agent. The Linked Data concept facilitates the semantic integration of heterogeneous data sources. However, we still lack an effective knowledge integration strategy applicable to enterprise scenarios, which balances between large amounts of data stored in legacy information systems and data lakes as well as tailored domain specific ontologies that formally describe real-world concepts. In this thesis we investigate strategies for managing linked enterprise data analyzing how actionable knowledge can be derived from enterprise data leveraging knowledge graphs. Actionable knowledge provides valuable insights, supports decision makers with clear interpretable arguments, and keeps its inference processes explainable. The benefits of employing actionable knowledge and its coherent management strategy span from a holistic semantic representation layer of enterprise data, i.e., representing numerous data sources as one, consistent, and integrated knowledge source, to unified interaction mechanisms with other systems that are able to effectively and efficiently leverage such an actionable knowledge. Several challenges have to be addressed on different conceptual levels pursuing this goal, i.e., means for representing knowledge, semantic data integration of raw data sources and subsequent knowledge extraction, communication interfaces, and implementation. In order to tackle those challenges we present the concept of Enterprise Knowledge Graphs (EKGs), describe their characteristics and advantages compared to existing approaches. We study each challenge with regard to using EKGs and demonstrate their efficiency. In particular, EKGs are able to reduce the semantic data integration effort when processing large-scale heterogeneous datasets. Then, having built a consistent logical integration layer with heterogeneity behind the scenes, EKGs unify query processing and enable effective communication interfaces for other enterprise systems. The achieved results allow us to conclude that strategies for managing linked enterprise data based on EKGs exhibit reasonable performance, comply with enterprise requirements, and ensure integrated data and knowledge management throughout its life cycle

bonndoc – Der Publikationsserver der Universität Bonn

When Things Matter: A Data-Centric View of the Internet of Things

Author: Dustdar Schahram
Falkner Nickolas J. G.
Qin Yongrui
Sheng Quan Z.
Vasilakos Athanasios V.
Wang Hua
Publication venue
Publication date: 01/01/2014
Field of study

With the recent advances in radio-frequency identification (RFID), low-cost wireless sensor devices, and Web technologies, the Internet of Things (IoT) approach has gained momentum in connecting everyday objects to the Internet and facilitating machine-to-human and machine-to-machine communication with the physical world. While IoT offers the capability to connect and integrate both digital and physical entities, enabling a whole new class of applications and services, several significant challenges need to be addressed before these applications and services can be fully realized. A fundamental challenge centers around managing IoT data, typically produced in dynamic and volatile environments, which is not only extremely large in scale and volume, but also noisy, and continuous. This article surveys the main techniques and state-of-the-art research efforts in IoT from data-centric perspectives, including data stream processing, data storage models, complex event processing, and searching in IoT. Open research issues for IoT data management are also discussed

arXiv.org e-Print Archive

Victoria University Eprints Repository

Recommended from our members

Fast, Scalable, and Accurate Algorithms for Time-Series Analysis

Author: Paparrizos Ioannis
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2018
Field of study

Time is a critical element for the understanding of natural processes (e.g., earthquakes and weather) or human-made artifacts (e.g., stock market and speech signals). The analysis of time series, the result of sequentially collecting observations of such processes and artifacts, is becoming increasingly prevalent across scientific and industrial applications. The extraction of non-trivial features (e.g., patterns, correlations, and trends) in time series is a critical step for devising effective time-series mining methods for real-world problems and the subject of active research for decades. In this dissertation, we address this fundamental problem by studying and presenting computational methods for efficient unsupervised learning of robust feature representations from time series. Our objective is to (i) simplify and unify the design of scalable and accurate time-series mining algorithms; and (ii) provide a set of readily available tools for effective time-series analysis. We focus on applications operating solely over time-series collections and on applications where the analysis of time series complements the analysis of other types of data, such as text and graphs. For applications operating solely over time-series collections, we propose a generic computational framework, GRAIL, to learn low-dimensional representations that natively preserve the invariances offered by a given time-series comparison method. GRAIL represents a departure from classic approaches in the time-series literature where representation methods are agnostic to the similarity function used in subsequent learning processes. GRAIL relies on the attractive idea that once we construct the data-to-data similarity matrix most time-series mining tasks can be trivially solved. To overcome scalability issues associated with approaches relying on such matrices, GRAIL exploits time-series clustering to construct a small set of landmark time series and learns representations to reduce the data-to-data matrix to a data-to-landmark points matrix. To demonstrate the effectiveness of GRAIL, we first present domain-independent, highly accurate, and scalable time-series clustering methods to facilitate exploration and summarization of time-series collections. Then, we show that GRAIL representations, when combined with suitable methods, significantly outperform, in terms of efficiency and accuracy, state-of-the-art methods in major time-series mining tasks, such as querying, clustering, classification, sampling, and visualization. Overall, GRAIL rises as a new primitive for highly accurate, yet scalable, time-series analysis. For applications where the analysis of time series complements the analysis of other types of data, such as text and graphs, we propose generic, simple, and lightweight methodologies to learn features from time-varying measurements. Such applications often organize operations over different types of data in a pipeline such that one operation provides input---in the form of feature vectors---to subsequent operations. To reason about the temporal patterns and trends in the underlying features, we need to (i) track the evolution of features over different time periods; and (ii) transform these time-varying features into actionable knowledge (e.g., forecasting an outcome). To address this challenging problem, we propose principled approaches to model time-varying features and study two large-scale, real-world, applications. Specifically, we first study the problem of predicting the impact of scientific concepts through temporal analysis of characteristics extracted from the metadata and full text of scientific articles. Then, we explore the promise of harnessing temporal patterns in behavioral signals extracted from web search engine logs for early detection of devastating diseases. In both applications, combinations of features with time-series relevant features yielded the greatest impact than any other indicator considered in our analysis. We believe that our simple methodology, along with the interesting domain-specific findings that our work revealed, will motivate new studies across different scientific and industrial settings

Columbia University Academic Commons

Supporting Evolution and Maintenance of android Apps

Author: Linares-Vasquez Mario
Publication venue: W&M ScholarWorks
Publication date: 01/10/2016
Field of study

Mobile developers and testers face a number of emerging challenges. These include rapid platform evolution and API instability; issues in bug reporting and reproduction involving complex multitouch gestures; platform fragmentation; the impact of reviews and ratings on the success of their apps; management of crowd-sourced requirements; continuous pressure from the market for frequent releases; lack of effective and usable testing tools; and limited computational resources for handheld devices. Traditional and contemporary methods in software evolution and maintenance were not designed for these types of challenges; therefore, a set of studies and a new toolbox of techniques for mobile development are required to analyze current challenges and propose new solutions. This dissertation presents a set of empirical studies, as well as solutions for some of the key challenges when evolving and maintaining android apps. In particular, we analyzed key challenges experienced by practitioners and open issues in the mobile development community such as (i) android API instability, (ii) performance optimizations, (iii) automatic GUI testing, and (iv) energy consumption. When carrying out the studies, we relied on qualitative and quantitative analyses to understand the phenomena on a large scale by considering evidence extracted from software repositories and the opinions of open-source mobile developers. From the empirical studies, we identified that dynamic analysis is a relevant method for several evolution and maintenance tasks, in particular, because of the need of practitioners to execute/validate the apps on a diverse set of platforms (i.e., device and OS) and under pressure for continuous delivery. Therefore, we designed and implemented an extensible infrastructure that enables large-scale automatic execution of android apps to support different evolution and maintenance tasks (e.g., testing and energy optimization). In addition to the infrastructure we present a taxonomy of issues, single solutions to the issues, and guidelines to enable large execution of android apps. Finally, we devised novel approaches aimed at supporting testing and energy optimization of mobile apps (two key challenges in evolution and maintenance of android apps). First, we propose a novel hybrid approach for automatic GUI-based testing of apps that is able to generate (un)natural test sequences by mining real applications usages and learning statistical models that represent the GUI interactions. In addition, we propose a multi-objective approach for optimizing the energy consumption of GUIs in android apps that is able to generate visually appealing color compositions, while reducing the energy consumption and keeping a design concept close to the original

College of William & Mary: W&M Publish

Process Mining for Smart Product Design

Author: van Eck Maikel Leroy
Publication venue: Eindhoven University of Technology
Publication date: 15/03/2022
Field of study

Pure OAI Repository