Implications of Z-normalization in the matrix profile
Companies are increasingly measuring their products and services, producing a growing amount of time series data and a need for techniques that extract usable information from it. One state-of-the-art technique for time series is the Matrix Profile, which has been used for various applications including motif/discord discovery, visualizations and semantic segmentation. Internally, the Matrix Profile utilizes the z-normalized Euclidean distance to compare the shape of subsequences between two series. However, when comparing subsequences that are relatively flat and contain noise, the resulting distance is high despite the visual similarity of these subsequences. This property violates some of the assumptions made by Matrix Profile based techniques, resulting in worse performance when series contain flat and noisy subsequences. By studying the properties of the z-normalized Euclidean distance, we derived a method to eliminate this effect that requires only an estimate of the standard deviation of the noise. In this paper we describe various practical properties of the z-normalized Euclidean distance and show how these can be used to correct the performance of Matrix Profile related techniques. We demonstrate our technique for anomaly detection on a Yahoo! Webscope anomaly dataset, for semantic segmentation on the PAMAP2 activity dataset, and for data visualization on a UCI activity dataset, all containing real-world data, and obtain overall better results after applying it. Our technique is a straightforward extension of the distance calculation in the Matrix Profile and will benefit any derived technique dealing with time series containing flat and noisy subsequences.
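The flat-and-noisy effect described in this abstract can be reproduced with a minimal sketch. It assumes a plain z-normalized Euclidean distance written from its textbook definition; the helper names (`znorm`, `znorm_euclidean`) and the synthetic series are illustrative assumptions, not the authors' code, and the paper's noise correction itself is not implemented here.

```python
import math
import random


def znorm(x):
    """Z-normalize a sequence: subtract the mean, divide by the standard deviation."""
    m = len(x)
    mean = sum(x) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / m)
    if std == 0:
        raise ValueError("constant sequence cannot be z-normalized")
    return [(v - mean) / std for v in x]


def znorm_euclidean(x, y):
    """Euclidean distance between the z-normalized versions of x and y."""
    zx, zy = znorm(x), znorm(y)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(zx, zy)))


rng = random.Random(0)  # fixed seed so the illustration is repeatable
m = 100                 # subsequence length (illustrative choice)
noise_std = 0.1         # noise level (illustrative choice)

# Two visually similar flat subsequences: same level, independent noise.
flat_a = [5.0 + rng.gauss(0, noise_std) for _ in range(m)]
flat_b = [5.0 + rng.gauss(0, noise_std) for _ in range(m)]

# Two subsequences sharing a clear shape, with the same noise level.
sine_a = [math.sin(2 * math.pi * i / m) + rng.gauss(0, noise_std) for i in range(m)]
sine_b = [math.sin(2 * math.pi * i / m) + rng.gauss(0, noise_std) for i in range(m)]

d_flat = znorm_euclidean(flat_a, flat_b)
d_sine = znorm_euclidean(sine_a, sine_b)
print(f"flat vs flat: {d_flat:.2f}   sine vs sine: {d_sine:.2f}")
```

Z-normalization rescales the noise in the flat pair to unit variance, so the two subsequences look like uncorrelated noise and their distance is large. The same noise barely affects the sine pair, whose shape dominates after normalization.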
A generalized matrix profile framework with support for contextual series analysis
The Matrix Profile is a state-of-the-art time series analysis technique that can be used for motif discovery, anomaly detection, segmentation and others, in various domains such as healthcare, robotics, and audio. Where recent techniques use the Matrix Profile as a preprocessing or modeling step, we believe there is unexplored potential in generalizing the approach. We derived a framework that focuses on the implicit distance matrix calculation. We present this framework as the Series Distance Matrix (SDM). In this framework, distance measures (SDM-generators) and distance processors (SDM-consumers) can be freely combined, allowing for more flexibility and easier experimentation. In SDM, the Matrix Profile is but one specific configuration. We also introduce the Contextual Matrix Profile (CMP) as a new SDM-consumer capable of discovering repeating patterns. The CMP provides intuitive visualizations for data analysis and can find anomalies that are not discords. We demonstrate this using two real-world cases. The CMP is the first of a wide variety of new techniques for series analysis that fits within SDM and can complement the Matrix Profile.
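The generator/consumer split described in this abstract can be sketched as follows. This is a toy illustration under stated assumptions: the function names are hypothetical and are not the authors' API, the generator uses plain (not z-normalized) Euclidean distance, the full distance matrix is materialized for clarity, and the exclusion zone of `m - 1` is an illustrative choice.

```python
import math


def euclidean_generator(series, m):
    """Hypothetical SDM-generator: distances between all length-m subsequences."""
    subs = [series[i:i + m] for i in range(len(series) - m + 1)]
    return [
        [math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t))) for t in subs]
        for s in subs
    ]


def matrix_profile_consumer(dist, exclusion):
    """Hypothetical SDM-consumer: per-row minimum, skipping trivial matches
    that lie within `exclusion` positions of the diagonal."""
    return [
        min(d for j, d in enumerate(row) if abs(i - j) > exclusion)
        for i, row in enumerate(dist)
    ]


# Toy series: a repeating small bump plus one large spike.
series = [0, 1, 0, 0, 5, 0, 0, 1, 0, 0]
m = 3

dist = euclidean_generator(series, m)                     # generator step
profile = matrix_profile_consumer(dist, exclusion=m - 1)  # consumer step

# Subsequences overlapping the spike have no close match (high profile values),
# while the repeated bump has an exact match (profile value 0).
print([round(v, 2) for v in profile])
```

Swapping in a different generator (another distance) or a different consumer (another reduction over the same matrix) leaves the other half untouched, which is the flexibility the abstract attributes to SDM.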
ALDI++: Automatic and parameter-less discord and outlier detection for building energy load profiles
Data-driven building energy prediction is an integral part of the process for
measurement and verification, building benchmarking, and building-to-grid
interaction. The ASHRAE Great Energy Predictor III (GEPIII) machine learning
competition used an extensive meter data set to crowdsource the most accurate
machine learning workflow for whole building energy prediction. A significant
component of the winning solutions was the pre-processing phase to remove
anomalous training data. Contemporary pre-processing methods focus on filtering
statistical threshold values or deep learning methods requiring training data
and multiple hyper-parameters. A recent method named ALDI (Automated Load
profile Discord Identification) managed to identify these discords using matrix
profile, but the technique still requires user-defined parameters. We develop
ALDI++, a method based on the previous work that bypasses user-defined
parameters and takes advantage of discord similarity. We evaluate ALDI++
against a statistical threshold, variational auto-encoder, and the original
ALDI as baselines in classifying discords and energy forecasting scenarios. Our
results demonstrate that while the classification performance improvement over
the original method is marginal, ALDI++ helps achieve the best forecasting
error, improving 6% over the winning team's approach with six times less
computation time. Comment: 10 pages, 5 figures, 3 tables
A New Time Series Similarity Measure and Its Smart Grid Applications
Many smart grid applications involve data mining, clustering, classification,
identification, and anomaly detection, among others. These applications
primarily depend on the measurement of similarity, which is the distance
between different time series or subsequences of a time series. The commonly
used time series distance measures, namely Euclidean Distance (ED) and Dynamic
Time Warping (DTW), do not quantify the flexible nature of electricity usage
data in terms of temporal dynamics. As a result, there is a need for a new
distance measure that can quantify both the amplitude and temporal changes of
electricity time series for smart grid applications, e.g., demand response and
load profiling. This paper introduces a novel distance measure to compare
electricity usage patterns. The method consists of two phases that quantify the
effort required to reshape one time series into another, considering both
amplitude and temporal changes. The proposed method is evaluated against ED and
DTW using real-world data in three smart grid applications. Overall, the
proposed measure outperforms ED and DTW in accurately identifying the best load
scheduling strategy, anomalous days with irregular electricity usage, and
determining electricity users' behind-the-meter (BTM) equipment. Comment: 7 pages, 6 figures, conference
Refining the Optimization Target for Automatic Univariate Time Series Anomaly Detection in Monitoring Services
Time series anomaly detection is crucial for industrial monitoring services
that handle a large volume of data, aiming to ensure reliability and optimize
system performance. Existing methods often require extensive labeled resources
and manual parameter selection, highlighting the need for automation. This
paper proposes a comprehensive framework for automatic parameter optimization
in time series anomaly detection models. The framework introduces three
optimization targets: prediction score, shape score, and sensitivity score,
which can be easily adapted to different model backbones without prior
knowledge or manual labeling efforts. The proposed framework has been
successfully applied online for over six months, serving more than 50,000 time
series every minute. It simplifies the user's experience by requiring only an
expected sensitive value, offering a user-friendly interface, and achieving
desired detection results. Extensive evaluations conducted on public datasets
and comparison with other methods further confirm the effectiveness of the
proposed framework. Comment: Accepted by 2023 IJCAI Workshop
Contributions to time series data mining towards the detection of outliers/anomalies
148 p. Recent technological advances have brought great progress in data collection, making it possible to gather large amounts of data over time. These data commonly take the form of time series, in which observations are recorded chronologically and are correlated in time. These temporal dependencies often contain meaningful and useful information, so in recent years great interest has arisen in extracting it. The research area that focuses on this task is called time series data mining. The research community in this area has worked on tasks such as classification, forecasting, clustering, and outlier/anomaly detection. Outliers or anomalies are observations that do not follow the expected behavior of a time series. They usually represent unwanted measurements or events of interest, so detecting them is often relevant, since they can degrade data quality or reflect phenomena of interest to the analyst. This thesis presents several contributions to the field of time series data mining, specifically on outlier or anomaly detection. These contributions fall into two parts. On the one hand, the thesis presents contributions to outlier or anomaly detection in time series: it offers a review of the techniques in the literature and presents a new technique, based on self-supervised learning, for anomaly detection in univariate time series aimed at detecting water leaks.
On the other hand, the thesis also introduces contributions related to handling time series with missing values and demonstrates their applicability to anomaly detection.
TRANSOM: An Efficient Fault-Tolerant System for Training LLMs
Large language models (LLMs) with hundreds of billions or trillions of
parameters, represented by ChatGPT, have achieved profound impact on various
fields. However, training LLMs with super-large-scale parameters requires large
high-performance GPU clusters and long training periods lasting for months. Due
to the inevitable hardware and software failures in large-scale clusters,
maintaining uninterrupted and long-duration training is extremely challenging.
As a result, a substantial amount of training time is devoted to task
checkpoint saving and loading, task rescheduling and restart, and task manual
anomaly checks, which greatly harms the overall training efficiency. To address
these issues, we propose TRANSOM, a novel fault-tolerant LLM training system.
In this work, we design three key subsystems: the training pipeline automatic
fault tolerance and recovery mechanism named Transom Operator and Launcher
(TOL), the training task multi-dimensional metric automatic anomaly detection
system named Transom Eagle Eye (TEE), and the training checkpoint asynchronous
access automatic fault tolerance and recovery technology named Transom
Checkpoint Engine (TCE). Here, TOL manages the lifecycle of training tasks,
while TEE is responsible for task monitoring and anomaly reporting. TEE detects
training anomalies and reports them to TOL, which automatically applies the fault
tolerance strategy to eliminate abnormal nodes and restart the training task.
The asynchronous checkpoint saving and loading functionality provided by
TCE greatly shortens the fault tolerance overhead. The experimental results
indicate that TRANSOM significantly enhances the efficiency of large-scale LLM
training on clusters. Specifically, the pre-training time for GPT3-175B has
been reduced by 28%, while checkpoint saving and loading performance have
improved by a factor of 20. Comment: 14 pages, 9 figures
A study of time series: anomaly detection and trend prediction.
Leung Tat Wing. Thesis (M.Phil.), Chinese University of Hong Kong, 2006. Includes bibliographical references (leaves 94-98). Abstracts in English and Chinese.
Contents:
Abstract
Acknowledgement
Chapter 1. Introduction
  1.1 Unusual Pattern Discovery
  1.2 Trend Prediction
  1.3 Thesis Organization
Chapter 2. Unusual Pattern Discovery
  2.1 Introduction
  2.2 Related Work
    2.2.1 Time Series Discords
    2.2.2 Brute Force Algorithm
    2.2.3 Keogh et al.'s Algorithm
    2.2.4 Performance Analysis
  2.3 Proposed Approach
    2.3.1 Haar Transform
    2.3.2 Discretization
    2.3.3 Augmented Trie
    2.3.4 Approximating the Magic Outer Loop
    2.3.5 Approximating the Magic Inner Loop
    2.3.6 Experimental Result
  2.4 More on discord length
    2.4.1 Modified Haar Transform
    2.4.2 Fast Haar Transform Algorithm
    2.4.3 Relation between discord length and discord location
  2.5 Further Optimization
    2.5.1 Improved Inner Loop Heuristic
    2.5.2 Experimental Result
  2.6 Top K discords
    2.6.1 Utility of top K discords
    2.6.2 Algorithm
    2.6.3 Experimental Result
  2.7 Conclusion
Chapter 3. Trend Prediction
  3.1 Introduction
  3.2 Technical Analysis
    3.2.1 Relative Strength Index
    3.2.2 Chart Analysis
    3.2.3 Dow Theory
    3.2.4 Moving Average
  3.3 Proposed Algorithm
    3.3.1 Piecewise Linear Representation
    3.3.2 Prediction Tree
    3.3.3 Trend Prediction
  3.4 Experimental Results
    3.4.1 Experimental setup
    3.4.2 Experiment on accuracy
    3.4.3 Experiment on performance
  3.5 Conclusion
Chapter 4. Conclusion
Bibliography
Multivariate Time Series Retrieval with Symbolic Aggregate Approximation, Regular Expression, and Query Expansion
We present SAXRegEx, a method for pattern search in multivariate time series in the presence of various distortions, such as duration variation, warping, and time delay between signals. For example, in the automotive industry, calibration engineers spontaneously search for event-induced patterns in fresh measurements under time pressure. Current methods do not sufficiently address duration scaling (horizontal, along the time axis) and inter-track time delay. One reason is that it can be overwhelmingly complex to consider scaling and warping jointly and to analyze temporal dynamics and attribute interrelation simultaneously. SAXRegEx meets this challenge with a novel symbolic representation adapted to handle time series with multiple tracks. We employ methods from text retrieval, i.e., regular expression matching, to perform pattern retrieval and develop a novel query expansion algorithm to deal flexibly with pattern distortions. Experiments show the effectiveness of our approach, especially in the presence of such distortions, and an efficiency that surpasses state-of-the-art methods. While we design the method primarily for automotive data, it is well transferable to other domains.
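The general idea of pairing a symbolic representation with regular expressions can be illustrated with a simplified, univariate sketch. This is not SAXRegEx itself: it uses basic SAX (z-normalization, piecewise averaging, the standard Gaussian breakpoints for a three-letter alphabet), and the `sax` helper, the toy series and the pattern `a+c+a+` are assumptions chosen for illustration. Multi-track handling and query expansion are not covered.

```python
import math
import re

BREAKPOINTS = (-0.43, 0.43)  # standard SAX breakpoints for an alphabet of size 3
ALPHABET = "abc"


def sax(series, segments):
    """Convert a series to a SAX word (assumes len(series) is divisible by segments)."""
    m = len(series)
    mean = sum(series) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / m) or 1.0
    z = [(v - mean) / std for v in series]
    size = m // segments
    word = ""
    for s in range(segments):
        chunk = z[s * size:(s + 1) * size]
        avg = sum(chunk) / len(chunk)
        # Count how many breakpoints the segment average exceeds.
        word += ALPHABET[sum(avg > b for b in BREAKPOINTS)]
    return word


# The same event (a pulse) with two different durations, plus an unrelated ramp.
short_pulse = [0] * 8 + [1] * 4 + [0] * 8
long_pulse = [0] * 4 + [1] * 8 + [0] * 8
ramp = list(range(20))

# "low, then high, then low", with each phase allowed to last any number of segments.
pulse_pattern = re.compile(r"a+c+a+")

for name, series in [("short", short_pulse), ("long", long_pulse), ("ramp", ramp)]:
    word = sax(series, segments=10)
    print(name, word, bool(pulse_pattern.fullmatch(word)))
```

The `+` quantifiers absorb the difference in pulse duration, so both pulses match the same query while the ramp does not. This is one simple way in which a regular expression over symbols can tolerate scaling along the time axis.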