116 research outputs found

    The Minimum Description Length Principle for Pattern Mining: A Survey

    Full text link
    This is about the Minimum Description Length (MDL) principle applied to pattern mining. The length of this description is kept to the minimum. Mining patterns is a core task in data analysis and, beyond issues of efficient enumeration, the selection of patterns constitutes a major challenge. The MDL principle, a model selection method grounded in information theory, has been applied to pattern mining with the aim to obtain compact high-quality sets of patterns. After giving an outline of relevant concepts from information theory and coding, as well as of work on the theory behind the MDL and similar principles, we review MDL-based methods for mining various types of data and patterns. Finally, we open a discussion on some issues regarding these methods, and highlight currently active related data analysis problems

    Network Model Selection Using Task-Focused Minimum Description Length

    Full text link
    Networks are fundamental models for data used in practically every application domain. In most instances, several implicit or explicit choices about the network definition impact the translation of underlying data to a network representation, and the subsequent question(s) about the underlying system being represented. Users of downstream network data may not even be aware of these choices or their impacts. We propose a task-focused network model selection methodology which addresses several key challenges. Our approach constructs network models from underlying data and uses minimum description length (MDL) criteria for selection. Our methodology measures efficiency, a general and comparable measure of the network's performance of a local (i.e. node-level) predictive task of interest. Selection on efficiency favors parsimonious (e.g. sparse) models to avoid overfitting and can be applied across arbitrary tasks and representations. We show stability, sensitivity, and significance testing in our methodology

    Predefined pattern detection in large time series

    Get PDF
    Predefined pattern detection from time series is an interesting and challenging task. In order to reduce its computational cost and increase effectiveness, a number of time series representation methods and similarity measures have been proposed. Most of the existing methods focus on full sequence matching, that is, sequences with clearly defined beginnings and endings, where all data points contribute to the match. These methods, however, do not account for temporal and magnitude deformations in the data and result to be ineffective on several real-world scenarios where noise and external phenomena introduce diversity in the class of patterns to be matched. In this paper, we present a novel pattern detection method, which is based on the notions of templates, landmarks, constraints and trust regions. We employ the Minimum Description Length (MDL) principle for time series preprocessing step, which helps to preserve all the prominent features and prevents the template from overfitting. Templates are provided by common users or domain experts, and represent interesting patterns we want to detect from time series. Instead of utilising templates to match all the potential subsequences in the time series, we translate the time series and templates into landmark sequences, and detect patterns from landmark sequence of the time series. Through defining constraints within the template landmark sequence, we effectively extract all the landmark subsequences from the time series landmark sequence, and obtain a number of landmark segments (time series subsequences or instances). We model each landmark segment through scaling the template in both temporal and magnitude dimensions. To suppress the influence of noise, we introduce the concept oftrust region, which not only helps to achieve an improved instance model, but also helps to catch the accurate boundaries of instances of the given template. Based on the similarities derived from instance models, we introduce the probability density function to calculate a similarity threshold. The threshold can be used to judge if a landmark segment is a true instance of the given template or not. To evaluate the effectiveness and efficiency of the proposed method, we apply it to two real-world datasets. The results show that our method is capable of detecting patterns of temporal and magnitude deformations with competitive performance

    Mining time-series data using discriminative subsequences

    Get PDF
    Time-series data is abundant, and must be analysed to extract usable knowledge. Local-shape-based methods offer improved performance for many problems, and a comprehensible method of understanding both data and models. For time-series classification, we transform the data into a local-shape space using a shapelet transform. A shapelet is a time-series subsequence that is discriminative of the class of the original series. We use a heterogeneous ensemble classifier on the transformed data. The accuracy of our method is significantly better than the time-series classification benchmark (1-nearest-neighbour with dynamic time-warping distance), and significantly better than the previous best shapelet-based classifiers. We use two methods to increase interpretability: First, we cluster the shapelets using a novel, parameterless clustering method based on Minimum Description Length, reducing dimensionality and removing duplicate shapelets. Second, we transform the shapelet data into binary data reflecting the presence or absence of particular shapelets, a representation that is straightforward to interpret and understand. We supplement the ensemble classifier with partial classifocation. We generate rule sets on the binary-shapelet data, improving performance on certain classes, and revealing the relationship between the shapelets and the class label. To aid interpretability, we use a novel algorithm, BruteSuppression, that can substantially reduce the size of a rule set without negatively affecting performance, leading to a more compact, comprehensible model. Finally, we propose three novel algorithms for unsupervised mining of approximately repeated patterns in time-series data, testing their performance in terms of speed and accuracy on synthetic data, and on a real-world electricity-consumption device-disambiguation problem. We show that individual devices can be found automatically and in an unsupervised manner using a local-shape-based approach

    MDL-Based Analysis of Time Series at Multiple Time-Scales

    Full text link

    Mining sensor data from complex systems

    Get PDF
    Today, virtually everything, from natural phenomena to complex artificial and physical systems, can be measured and the resulting information collected, stored and analyzed in order to gain new insight. This thesis shows how complex systems often exhibit diverse behavior at different temporal scales, and that data mining methods should be able to cope with the multiple resolutions (scales) at the same time in order to fully understand the data at hand and extract useful information from it. Under these assumptions, we introduce novel data mining and visualization methods for large time series data collected from complex physical systems. In particular, we focus on three fundamental problems: the detection of multi-scale patterns, the recognition of recurrent events, and the interactive visualization of massive time series data. We evaluate our methods on a real-world scenario provided by InfraWatch, a Structural Health Monitoring project centered around the management and analysis of data collected by a large sensor network deployed on a Dutch highway bridge. The application of our methods resulted in the identification of the relevant scales of analysis in the InfraWatch data (and other datasets), the detection of the different recurring motifs and the visualization of terabytes of time series data interactively.STWAlgorithms and the Foundations of Software technolog

    Handling oversampling in dynamic networks using link prediction

    Full text link
    Oversampling is a common characteristic of data representing dynamic networks. It introduces noise into representations of dynamic networks, but there has been little work so far to compensate for it. Oversampling can affect the quality of many important algorithmic problems on dynamic networks, including link prediction. Link prediction seeks to predict edges that will be added to the network given previous snapshots. We show that not only does oversampling affect the quality of link prediction, but that we can use link prediction to recover from the effects of oversampling. We also introduce a novel generative model of noise in dynamic networks that represents oversampling. We demonstrate the results of our approach on both synthetic and real-world data.Comment: ECML/PKDD 201
    • …
    corecore