148 research outputs found

    Implications of Z-normalization in the matrix profile

    Companies are increasingly measuring their products and services, resulting in a rising amount of available time series data and a need for techniques that extract usable information from it. One state-of-the-art technique for time series is the Matrix Profile, which has been used for various applications including motif/discord discovery, visualization and semantic segmentation. Internally, the Matrix Profile uses the z-normalized Euclidean distance to compare the shape of subsequences between two series. However, when comparing subsequences that are relatively flat and contain noise, the resulting distance is high despite the visual similarity of these subsequences. This property violates some of the assumptions made by Matrix Profile based techniques, resulting in worse performance when series contain flat and noisy subsequences. By studying the properties of the z-normalized Euclidean distance, we derived a method to eliminate this effect that requires only an estimate of the standard deviation of the noise. In this paper we describe various practical properties of the z-normalized Euclidean distance and show how these can be used to correct the performance of Matrix Profile related techniques. We demonstrate our technique on anomaly detection with a Yahoo! Webscope anomaly dataset, on semantic segmentation with the PAMAP2 activity dataset, and on data visualization with a UCI activity dataset, all containing real-world data, and obtain overall better results after applying it. Our technique is a straightforward extension of the distance calculation in the Matrix Profile and will benefit any derived technique dealing with time series containing flat and noisy subsequences.
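The effect described in this abstract can be reproduced with a few lines of code. The following is a minimal sketch, not the paper's implementation: the function names, the subsequence length `m = 50`, and the noise level are assumptions chosen for illustration, and the noise correction proposed in the paper is not included.

```python
import math
import random


def znorm(seq):
    """Z-normalize a subsequence: subtract its mean, divide by its standard deviation."""
    m = len(seq)
    mean = sum(seq) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / m)
    if std == 0:
        raise ValueError("a constant subsequence cannot be z-normalized")
    return [(v - mean) / std for v in seq]


def znorm_euclidean(a, b):
    """Euclidean distance between the z-normalized versions of a and b."""
    za, zb = znorm(a), znorm(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(za, zb)))


rng = random.Random(0)  # fixed seed so the illustration is repeatable
m = 50                  # subsequence length (arbitrary choice)

# Same shape, different offset and scale: z-normalization removes both.
sine_a = [math.sin(2 * math.pi * i / m) for i in range(m)]
sine_b = [5.0 + 3.0 * math.sin(2 * math.pi * i / m) for i in range(m)]

# Two flat subsequences that differ only by small independent noise.
flat_a = [rng.gauss(0.0, 0.01) for _ in range(m)]
flat_b = [rng.gauss(0.0, 0.01) for _ in range(m)]

d_shape = znorm_euclidean(sine_a, sine_b)  # close to zero
d_flat = znorm_euclidean(flat_a, flat_b)   # large: the noise is scaled up to unit variance

print(f"same shape:     {d_shape:.6f}")
print(f"flat and noisy: {d_flat:.6f} (upper bound 2*sqrt(m) = {2 * math.sqrt(m):.3f})")
```

Although the two flat subsequences look almost identical when plotted, z-normalization inflates their noise to unit variance, so their distance lands in the upper part of the possible range, while the two sine waves are judged nearly identical.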

    A generalized matrix profile framework with support for contextual series analysis

    The Matrix Profile is a state-of-the-art time series analysis technique that can be used for motif discovery, anomaly detection, segmentation, and other tasks, in various domains such as healthcare, robotics, and audio. Where recent techniques use the Matrix Profile as a preprocessing or modeling step, we believe there is unexplored potential in generalizing the approach. We derived a framework that focuses on the implicit distance matrix calculation. We present this framework as the Series Distance Matrix (SDM). In this framework, distance measures (SDM-generators) and distance processors (SDM-consumers) can be freely combined, allowing for more flexibility and easier experimentation. In SDM, the Matrix Profile is but one specific configuration. We also introduce the Contextual Matrix Profile (CMP) as a new SDM-consumer capable of discovering repeating patterns. The CMP provides intuitive visualizations for data analysis and can find anomalies that are not discords. We demonstrate this using two real-world cases. The CMP is the first of a wide variety of new techniques for series analysis that fit within SDM and can complement the Matrix Profile.
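The split between a distance generator and a distance consumer can be pictured roughly as follows. This is a sketch under stated assumptions, not the SDM framework's actual API: `distance_matrix` and `matrix_profile` are hypothetical names, the generator is a brute-force z-normalized Euclidean distance, and the consumer takes the column-wise minimum, which is the usual definition of a Matrix Profile for a join between two series.

```python
import math


def znorm(seq):
    """Z-normalize a subsequence; a constant subsequence maps to all zeros."""
    m = len(seq)
    mean = sum(seq) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in seq) / m)
    if std == 0:
        return [0.0] * m
    return [(v - mean) / std for v in seq]


def distance_matrix(series_a, series_b, m):
    """Generator (hypothetical): distances between every length-m window of A and of B."""
    subs_a = [znorm(series_a[i:i + m]) for i in range(len(series_a) - m + 1)]
    subs_b = [znorm(series_b[j:j + m]) for j in range(len(series_b) - m + 1)]
    return [[math.dist(x, y) for y in subs_b] for x in subs_a]


def matrix_profile(dist):
    """Consumer (hypothetical): best match in A for each window of B (column-wise minimum)."""
    n_cols = len(dist[0])
    return [min(row[j] for row in dist) for j in range(n_cols)]


series_a = [0, 1, 3, 2, 0, 1, 4, 1, 0, 2]
# Windows 1 and 2 of series_b are scaled, shifted copies of windows in series_a.
series_b = [5, 5, 7, 11, 9, 5, 6]

dist = distance_matrix(series_a, series_b, m=4)
profile = matrix_profile(dist)
print([round(d, 3) for d in profile])
```

Because the consumer only sees the distance matrix, another consumer (for example, one that takes minima over rectangular regions instead of whole columns) could be swapped in without touching the generator.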

    ALDI++: Automatic and parameter-less discord and outlier detection for building energy load profiles

    Data-driven building energy prediction is an integral part of the process for measurement and verification, building benchmarking, and building-to-grid interaction. The ASHRAE Great Energy Predictor III (GEPIII) machine learning competition used an extensive meter data set to crowdsource the most accurate machine learning workflow for whole building energy prediction. A significant component of the winning solutions was the pre-processing phase to remove anomalous training data. Contemporary pre-processing methods focus on filtering statistical threshold values or on deep learning methods requiring training data and multiple hyper-parameters. A recent method named ALDI (Automated Load profile Discord Identification) managed to identify these discords using the matrix profile, but the technique still requires user-defined parameters. We develop ALDI++, a method based on the previous work that bypasses user-defined parameters and takes advantage of discord similarity. We evaluate ALDI++ against a statistical threshold, a variational auto-encoder, and the original ALDI as baselines in classifying discords and in energy forecasting scenarios. Our results demonstrate that while the classification performance improvement over the original method is marginal, ALDI++ helps achieve the best forecasting error, improving 6% over the winning team's approach with six times less computation time. Comment: 10 pages, 5 figures, 3 tables

    A New Time Series Similarity Measure and Its Smart Grid Applications

    Many smart grid applications involve data mining, clustering, classification, identification, and anomaly detection, among others. These applications primarily depend on the measurement of similarity, which is the distance between different time series or subsequences of a time series. The commonly used time series distance measures, namely Euclidean Distance (ED) and Dynamic Time Warping (DTW), do not quantify the flexible nature of electricity usage data in terms of temporal dynamics. As a result, there is a need for a new distance measure that can quantify both the amplitude and temporal changes of electricity time series for smart grid applications, e.g., demand response and load profiling. This paper introduces a novel distance measure to compare electricity usage patterns. The method consists of two phases that quantify the effort required to reshape one time series into another, considering both amplitude and temporal changes. The proposed method is evaluated against ED and DTW using real-world data in three smart grid applications. Overall, the proposed measure outperforms ED and DTW in accurately identifying the best load scheduling strategy, anomalous days with irregular electricity usage, and determining electricity users' behind-the-meter (BTM) equipment. Comment: 7 pages, 6 figures, conference

    Refining the Optimization Target for Automatic Univariate Time Series Anomaly Detection in Monitoring Services

    Time series anomaly detection is crucial for industrial monitoring services that handle a large volume of data, aiming to ensure reliability and optimize system performance. Existing methods often require extensive labeled resources and manual parameter selection, highlighting the need for automation. This paper proposes a comprehensive framework for automatic parameter optimization in time series anomaly detection models. The framework introduces three optimization targets: prediction score, shape score, and sensitivity score, which can be easily adapted to different model backbones without prior knowledge or manual labeling efforts. The proposed framework has been successfully applied online for over six months, serving more than 50,000 time series every minute. It simplifies the user's experience by requiring only an expected sensitive value, offering a user-friendly interface, and achieving desired detection results. Extensive evaluations conducted on public datasets and comparison with other methods further confirm the effectiveness of the proposed framework. Comment: Accepted by 2023 IJCAI Workshop

    Contributions to time series data mining towards the detection of outliers/anomalies

    148 p. Recent technological advances have brought great progress in data collection, making it possible to gather large amounts of data over time. These data commonly take the form of time series, in which observations are recorded chronologically and are correlated in time. These temporal dependencies often contain meaningful and useful information, so in recent years there has been great interest in extracting it. The research area that focuses on this task is called time series data mining. The research community in this area has worked on tasks such as classification, forecasting, clustering, and outlier/anomaly detection. Outliers or anomalies are observations that do not follow the expected behavior of a time series. They usually represent unwanted measurements or events of interest, so detecting them is often relevant, since they can degrade data quality or reflect phenomena of interest to the analyst. This thesis presents several contributions to the field of time series data mining, specifically on outlier or anomaly detection. These contributions fall into two parts. On the one hand, the thesis presents contributions to outlier or anomaly detection in time series: it offers a review of the techniques in the literature and presents a new anomaly detection technique for univariate time series, based on self-supervised learning, for detecting water leaks. On the other hand, the thesis also introduces contributions related to handling time series with missing values and demonstrates their applicability to anomaly detection.

    TRANSOM: An Efficient Fault-Tolerant System for Training LLMs

    Large language models (LLMs) with hundreds of billions or trillions of parameters, represented by ChatGPT, have had a profound impact on various fields. However, training LLMs with super-large-scale parameters requires large high-performance GPU clusters and long training periods lasting for months. Due to the inevitable hardware and software failures in large-scale clusters, maintaining uninterrupted and long-duration training is extremely challenging. As a result, a substantial amount of training time is devoted to task checkpoint saving and loading, task rescheduling and restart, and manual task anomaly checks, which greatly harms the overall training efficiency. To address these issues, we propose TRANSOM, a novel fault-tolerant LLM training system. In this work, we design three key subsystems: the training pipeline automatic fault tolerance and recovery mechanism named Transom Operator and Launcher (TOL), the training task multi-dimensional metric automatic anomaly detection system named Transom Eagle Eye (TEE), and the training checkpoint asynchronous access automatic fault tolerance and recovery technology named Transom Checkpoint Engine (TCE). Here, TOL manages the lifecycle of training tasks, while TEE is responsible for task monitoring and anomaly reporting. TEE detects training anomalies and reports them to TOL, which automatically enters the fault tolerance strategy to eliminate abnormal nodes and restart the training task. The asynchronous checkpoint saving and loading functionality provided by TCE greatly shortens the fault tolerance overhead. The experimental results indicate that TRANSOM significantly enhances the efficiency of large-scale LLM training on clusters. Specifically, the pre-training time for GPT3-175B has been reduced by 28%, while checkpoint saving and loading performance have improved by a factor of 20. Comment: 14 pages, 9 figures

    A study of time series: anomaly detection and trend prediction.

    Leung Tat Wing. Thesis (M.Phil.), Chinese University of Hong Kong, 2006. Includes bibliographical references (leaves 94-98). Abstracts in English and Chinese.
    Contents: Abstract; Acknowledgement.
    Chapter 1, Introduction: 1.1 Unusual Pattern Discovery; 1.2 Trend Prediction; 1.3 Thesis Organization.
    Chapter 2, Unusual Pattern Discovery: 2.1 Introduction; 2.2 Related Work (2.2.1 Time Series Discords; 2.2.2 Brute Force Algorithm; 2.2.3 Keogh et al.'s Algorithm; 2.2.4 Performance Analysis); 2.3 Proposed Approach (2.3.1 Haar Transform; 2.3.2 Discretization; 2.3.3 Augmented Trie; 2.3.4 Approximating the Magic Outer Loop; 2.3.5 Approximating the Magic Inner Loop; 2.3.6 Experimental Result); 2.4 More on discord length (2.4.1 Modified Haar Transform; 2.4.2 Fast Haar Transform Algorithm; 2.4.3 Relation between discord length and discord location); 2.5 Further Optimization (2.5.1 Improved Inner Loop Heuristic; 2.5.2 Experimental Result); 2.6 Top K discords (2.6.1 Utility of top K discords; 2.6.2 Algorithm; 2.6.3 Experimental Result); 2.7 Conclusion.
    Chapter 3, Trend Prediction: 3.1 Introduction; 3.2 Technical Analysis (3.2.1 Relative Strength Index; 3.2.2 Chart Analysis; 3.2.3 Dow Theory; 3.2.4 Moving Average); 3.3 Proposed Algorithm (3.3.1 Piecewise Linear Representation; 3.3.2 Prediction Tree; 3.3.3 Trend Prediction); 3.4 Experimental Results (3.4.1 Experimental setup; 3.4.2 Experiment on accuracy; 3.4.3 Experiment on performance); 3.5 Conclusion.
    Chapter 4, Conclusion.
    Bibliography.

    Multivariate Time Series Retrieval with Symbolic Aggregate Approximation, Regular Expression, and Query Expansion

    We present SAXRegEx, a method for pattern search in multivariate time series in the presence of various distortions, such as duration variation, warping, and time delay between signals. For example, in the automotive industry, calibration engineers spontaneously search for event-induced patterns in fresh measurements under time pressure. Current methods do not sufficiently address duration scaling (horizontal, along the time axis) and inter-track time delay. One reason is that it can be overwhelmingly complex to consider scaling and warping jointly and to analyze temporal dynamics and attribute interrelation simultaneously. SAXRegEx meets this challenge with a novel symbolic representation adapted to handle time series with multiple tracks. We employ methods from text retrieval, i.e., regular expression matching, to perform pattern retrieval, and develop a novel query expansion algorithm to deal flexibly with pattern distortions. Experiments show the effectiveness of our approach, especially in the presence of such distortions, and an efficiency that surpasses state-of-the-art methods. While we design the method primarily for automotive data, it is well transferable to other domains.
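The general idea of pairing a symbolic representation with regular expressions can be illustrated on a single track. This is a simplified sketch and not the SAXRegEx algorithm: it uses plain univariate SAX with a four-letter alphabet, the `sax_word` helper and the query pattern are hypothetical, and multi-track handling and query expansion are left out.

```python
import math
import re

# Breakpoints that cut the standard normal distribution into four equal-probability bins.
BREAKPOINTS = (-0.6745, 0.0, 0.6745)
ALPHABET = "abcd"


def sax_word(series, n_segments):
    """Convert a series to a SAX word: z-normalize, average per segment, discretize.

    Assumes len(series) is divisible by n_segments.
    """
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    z = [(v - mean) / std if std > 0 else 0.0 for v in series]
    seg_len = n // n_segments
    symbols = []
    for s in range(n_segments):
        segment = z[s * seg_len:(s + 1) * seg_len]
        avg = sum(segment) / len(segment)
        # The symbol index is the number of breakpoints below the segment average.
        symbols.append(ALPHABET[sum(avg > b for b in BREAKPOINTS)])
    return "".join(symbols)


# Hypothetical query: low level, then a high plateau of any duration, then low again.
PULSE = re.compile(r"[ab]+d+[ab]+")

long_pulse = [0] * 4 + [10] * 6 + [0] * 6
short_pulse = [0] * 6 + [10] * 4 + [0] * 6
ramp = list(range(16))

for name, series in [("long pulse", long_pulse), ("short pulse", short_pulse), ("ramp", ramp)]:
    word = sax_word(series, n_segments=8)
    print(name, word, bool(PULSE.fullmatch(word)))
```

The `+` quantifiers are what absorb duration variation: both pulses match the same query even though their plateaus differ in length, while the ramp does not.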