528,520 research outputs found

    A neural network for mining large volumes of time series data

    Get PDF
    Efficiently mining large volumes of time series data is amongst the most challenging problems that are fundamental in many fields such as industrial process monitoring, medical data analysis and business forecasting. This paper discusses a high-performance neural network for mining large time series data set and some practical issues on time series data mining. Examples of how this technology is used to search the engine data within a major UK eScience Grid project (DAME) for supporting the maintenance of Rolls-Royce aero-engine are presented

    Time Series Data Mining: A Retail Application Using SAS Enterprise Miner

    Get PDF
    Modern technologies have allowed for the amassment of data at a rate never encountered before. Organizations are now able to routinely collect and process massive volumes of data. A plethora of regularly collected information can be ordered using an appropriate time interval. The data would thus be developed into a time series. With such data, analytical techniques can be employed to collect information pertaining to historical trends and seasonality. Time series data mining methodology allows users to identify commonalities between sets of time-ordered data. This technique is supported by a variety of algorithms, notably dynamic time warping (DTW). This mathematical technique supports the identification of similarities between numerous time series. The following research aims to provide a practical application of this methodology using SAS Enterprise Miner, an industry-leading software platform for business analytics. Due to the prevalence of time series data in retail settings, a realistic product sales transaction data set was analyzed. This information was provided by dunnhumbyUSA. Interpretations were drawn from output that was generated using “TS nodes” in SAS Enterprise Miner

    Mining Novel Multivariate Relationships in Time Series Data Using Correlation Networks

    Full text link
    In many domains, there is significant interest in capturing novel relationships between time series that represent activities recorded at different nodes of a highly complex system. In this paper, we introduce multipoles, a novel class of linear relationships between more than two time series. A multipole is a set of time series that have strong linear dependence among themselves, with the requirement that each time series makes a significant contribution to the linear dependence. We demonstrate that most interesting multipoles can be identified as cliques of negative correlations in a correlation network. Such cliques are typically rare in a real-world correlation network, which allows us to find almost all multipoles efficiently using a clique-enumeration approach. Using our proposed framework, we demonstrate the utility of multipoles in discovering new physical phenomena in two scientific domains: climate science and neuroscience. In particular, we discovered several multipole relationships that are reproducible in multiple other independent datasets and lead to novel domain insights.Comment: This is the accepted version of article submitted to IEEE Transactions on Knowledge and Data Engineering 201

    Mining time-series data using discriminative subsequences

    Get PDF
    Time-series data is abundant, and must be analysed to extract usable knowledge. Local-shape-based methods offer improved performance for many problems, and a comprehensible method of understanding both data and models. For time-series classification, we transform the data into a local-shape space using a shapelet transform. A shapelet is a time-series subsequence that is discriminative of the class of the original series. We use a heterogeneous ensemble classifier on the transformed data. The accuracy of our method is significantly better than the time-series classification benchmark (1-nearest-neighbour with dynamic time-warping distance), and significantly better than the previous best shapelet-based classifiers. We use two methods to increase interpretability: First, we cluster the shapelets using a novel, parameterless clustering method based on Minimum Description Length, reducing dimensionality and removing duplicate shapelets. Second, we transform the shapelet data into binary data reflecting the presence or absence of particular shapelets, a representation that is straightforward to interpret and understand. We supplement the ensemble classifier with partial classifocation. We generate rule sets on the binary-shapelet data, improving performance on certain classes, and revealing the relationship between the shapelets and the class label. To aid interpretability, we use a novel algorithm, BruteSuppression, that can substantially reduce the size of a rule set without negatively affecting performance, leading to a more compact, comprehensible model. Finally, we propose three novel algorithms for unsupervised mining of approximately repeated patterns in time-series data, testing their performance in terms of speed and accuracy on synthetic data, and on a real-world electricity-consumption device-disambiguation problem. We show that individual devices can be found automatically and in an unsupervised manner using a local-shape-based approach

    Time Series Data Mining Algorithms for Identifying Short RNA in Arabidopsis thaliana

    Get PDF
    The class of molecules called short RNAs (sRNAs) are known to play a key role in gene regulation. Th are typically sequences of nucleotides between 21-25 nucleotides in length. They are known to play a key role in gene regulation. The identification, clustering and classification of sRNA has recently become the focus of much research activity. The basic problem involves detecting regions of interest on the chromosome where the pattern of candidate matches is somehow unusual. Currently, there are no published algorithms for detecting regions of interest, and the unpublished methods that we are aware of involve bespoke rule based systems designed for a specific organism. Work in this very new field has understandably focused on the outcomes rather than the methods used to obtain the results. In this paper we propose two generic approaches that place the specific biological problem in the wider context of time series data mining problems. Both methods are based on treating the occurrences on a chromosome, or “hit count” data, as a time series, then running a sliding window along a chromosome and measuring unusualness. This formulation means we can treat finding unusual areas of candidate RNA activity as a variety of time series anomaly detection problem. The first set of approaches is model based. We specify a null hypothesis distribution for not being a sRNA, then estimate the p-values along the chromosome. The second approach is instance based. We identify some typical shapes from known sRNA, then use dynamic time warping and fourier trans-form based distance to measure how closely the candidate series matches. We demonstrate that these methods can find known sRNA on Arabidopsis thaliana chromosomes and illustrate the benefits of the added information provided by these algorithms
    • …
    corecore