16 research outputs found

    Clustering of Time Series Data: Measures, Methods, and Applications

    Clustering is an essential branch of data mining and statistical analysis that helps us explore the distribution of data and extract knowledge. With the broad accumulation and application of time series data, the study of its clustering is a natural extension of existing unsupervised learning heuristics. We discuss the components that make up the clustering of time series data: the similarity measure, the clustering heuristic, the evaluation of cluster quality, and the applications of these heuristics. Because the similarity measure is the groundwork for the task of data analysis, we propose a scalable and efficient time series similarity measure: segmented-Dynamic Time Warping. For time series clustering, we formulate the Distance Density Clustering heuristic, a deterministic clustering algorithm that adopts concepts from both density and distance separation. In addition, we explore the characteristics and discuss the limitations of existing cluster evaluation methods. Finally, all components lead to the goal of real-world applications.

    Similarity searching in sequence databases under time warping.

    Wong, Siu Fung. Thesis submitted in: December 2003. Thesis (M.Phil.)--Chinese University of Hong Kong, 2004. Includes bibliographical references (leaves 77-84). Abstracts in English and Chinese.

    Abstract
    Acknowledgement
    Chapter 1 --- Introduction
    Chapter 2 --- Preliminary
    Chapter 2.1 --- Dynamic Time Warping (DTW)
    Chapter 2.2 --- Spatial Indexing
    Chapter 2.3 --- Relevance Feedback
    Chapter 3 --- Literature Review
    Chapter 3.1 --- Searching Sequences under Euclidean Metric
    Chapter 3.2 --- Searching Sequences under Dynamic Time Warping Metric
    Chapter 4 --- Subsequence Matching under Time Warping
    Chapter 4.1 --- Subsequence Matching
    Chapter 4.1.1 --- Sequential Search
    Chapter 4.1.2 --- Indexing Scheme
    Chapter 4.2 --- Lower Bound Technique
    Chapter 4.2.1 --- Properties of Lower Bound Technique
    Chapter 4.2.2 --- Existing Lower Bound Functions
    Chapter 4.3 --- Point-Based indexing
    Chapter 4.3.1 --- Lower Bound for subsequences matching
    Chapter 4.3.2 --- Algorithm
    Chapter 4.4 --- Rectangle-Based indexing
    Chapter 4.4.1 --- Lower Bound for subsequences matching
    Chapter 4.4.2 --- Algorithm
    Chapter 4.5 --- Experimental Results
    Chapter 4.5.1 --- Candidate ratio vs Width of warping window
    Chapter 4.5.2 --- CPU time vs Number of subsequences
    Chapter 4.5.3 --- CPU time vs Width of warping window
    Chapter 4.5.4 --- CPU time vs Threshold
    Chapter 4.6 --- Summary
    Chapter 5 --- Relevance Feedback under Time Warping
    Chapter 5.1 --- Integrating Relevance Feedback with DTW
    Chapter 5.2 --- Query Reformulation
    Chapter 5.2.1 --- Constraint Updating
    Chapter 5.2.2 --- Weight Updating
    Chapter 5.2.3 --- Overall Strategy
    Chapter 5.3 --- Experiments and Evaluation
    Chapter 5.3.1 --- Effectiveness of the strategy
    Chapter 5.3.2 --- Efficiency of the strategy
    Chapter 5.3.3 --- Usability
    Chapter 5.4 --- Summary
    Chapter 6 --- Conclusion
    Chapter A --- Deduction of Data Bounding Hyper-rectangle
    Chapter B --- Proof of Theorem 2
    Bibliography
    Publications
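
The outline above centers on Dynamic Time Warping (DTW) and on experiments that vary the width of the warping window. As a rough illustration of what those terms refer to, the sketch below computes a textbook DTW distance with an optional band constraint on the warping path. It is not taken from the thesis: the function name `dtw_distance`, the `window` parameter, and the choice of absolute difference as the point-wise cost are all assumptions made for this example, and the thesis's lower-bound and indexing techniques are not reproduced.

```python
import math


def dtw_distance(a, b, window=None):
    """Textbook DTW distance between two numeric sequences.

    `window` (hypothetical parameter) limits how far the warping path may
    stray from the diagonal; None means unconstrained.
    """
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    if window is None:
        window = max(n, m)
    # The band must be at least |n - m| wide, or no path reaches the corner.
    window = max(window, abs(n - m))

    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        lo = max(1, i - window)
        hi = min(m, i + window)
        for j in range(lo, hi + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed point-wise cost
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


if __name__ == "__main__":
    print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
    print(dtw_distance([1, 2, 3], [2, 2, 5], window=0))
```

With `window=0` and equal-length inputs, the path is forced onto the diagonal, so the result reduces to the sum of point-wise differences. Widening the window can only lower or preserve the distance, which is why window width trades accuracy of pruning against CPU time in experiments of this kind.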

    A scalable machine learning system for anomaly detection in manufacturing

    Reports of recall campaigns in the automotive industry have become part of everyday media coverage. Indeed, their frequency and the number of vehicles affected have continued to increase in recent years. Most recalls can be traced back to production defects. For manufacturers, the intelligent and automated analysis of production process data represents, alongside improvements in quality management, a potential that has so far barely been exploited. The technical challenges, however, are enormous: the data volumes are huge, and the data patterns characteristic of a defect are necessarily unknown. The use of machine learning (ML) methods is a promising approach to making this search for the proverbial needle in the haystack possible. Algorithms are meant to learn autonomously from the data to distinguish normal from anomalous process behavior in order to warn process experts at an early stage. Industry and research have been trying for years to establish such ML systems in production environments. Most ML projects, however, fail before reaching the production phase, or they consume enormous resources in operation and deliver no economic added value. The aim of this thesis is to develop a technical framework for implementing a scalable ML system for anomaly detection in process data. The training processes for initializing and adapting the models should be highly automatable in order to enable a structured scaling process. The DM/ML method developed makes it possible to reduce the long-term effort of system operation through additional initial effort in the model training process, and it has proven in practice to be scalable in both relative and absolute terms. As a result, complexity at the system level can be reduced to a manageable degree, making subsequent system operation possible.

    Similarity processing in multi-observation data

    Many real-world application domains such as sensor-monitoring systems for environmental research or medical diagnostic systems deal with data that is represented by multiple observations. In contrast to single-observation data, where each object is assigned to exactly one occurrence, multi-observation data is based on several occurrences that are subject to two key properties: temporal variability and uncertainty. When defining similarity between data objects, these properties play a significant role. In general, methods designed for single-observation data hardly apply to multi-observation data, as they are either not supported by the data models or do not provide sufficiently efficient or effective solutions. Prominent directions incorporating the key properties are the fields of time series, where data is created by temporally successive observations, and uncertain data, where observations are mutually exclusive. This thesis provides research contributions for similarity processing - similarity search and data mining - on time series and uncertain data. The first part of this thesis focuses on similarity processing in time series databases. A variety of similarity measures have recently been proposed that support similarity processing w.r.t. various aspects. In particular, this part deals with time series that consist of periodic occurrences of patterns. Examining an application scenario from the medical domain, a solution for activity recognition is presented. Finally, the extraction of feature vectors allows the application of spatial index structures, which support the acceleration of search and mining tasks resulting in a significant efficiency gain. As feature vectors are potentially of high dimensionality, this part introduces indexing approaches for the high-dimensional space for the full-dimensional case as well as for arbitrary subspaces. The second part of this thesis focuses on similarity processing in probabilistic databases.
The presence of uncertainty is inherent in many applications dealing with data collected by sensing devices. Often, the collected information is noisy or incomplete due to measurement or transmission errors. Furthermore, data may be rendered uncertain due to privacy-preserving issues with the presence of confidential information. This creates a number of challenges in terms of effectively and efficiently querying and mining uncertain data. Existing work in this field either neglects the presence of dependencies or provides only approximate results while applying methods designed for certain data. Other approaches dealing with uncertain data are not able to provide efficient solutions. This part presents query processing approaches that outperform existing solutions of probabilistic similarity ranking. This part finally leads to the application of the introduced techniques to data mining tasks, such as the prominent problem of probabilistic frequent itemset mining.

Many application areas, such as environmental research or medical diagnostics, use sensor-monitoring systems. Such systems often have to handle data that is represented by multiple observations. In contrast to data with only one observation (single-observation data), data from multiple observations (multi-observation data) is based on a multitude of observations that are subject to two key properties: temporal variability and data uncertainty. These properties play an important role in similarity search and data mining. Common solutions in these areas that were developed for single-observation data are generally not applicable to multiple observations per object. The reason is that these approaches are either incompatible with the data models or offer no solutions that meet current demands on solution quality or efficiency.
Well-known research directions dealing with multi-observation data and its key properties are the analysis of time series and similarity search in probabilistic databases. While the former presupposes a temporal ordering of an object's observations, uncertain data objects are based on observations that are mutually dependent or mutually exclusive. This dissertation comprises current research contributions from both areas, presenting methods for similarity search and for application in data mining. The first part of this thesis deals with similarity search and data mining in time series databases. In particular, it considers time series that consist of periodically occurring patterns. In the context of a medical application scenario, an approach to activity recognition is presented. By means of feature extraction, this approach permits efficient storage and analysis with the help of spatial index structures. For the case of high-dimensional feature vectors, this part presents two indexing methods for accelerating similarity queries. The first method takes all attributes of the feature vectors into account, while the second allows the query to be projected onto a user-defined subspace of the vector space. The second part of this thesis addresses similarity search in the context of probabilistic databases. Data from sensor measurements frequently have properties that are subject to a certain degree of uncertainty. Owing to measurement or transmission errors, measured values are often incomplete or noisy. In various scenarios, for example with personal or medically confidential data, noise can also be added to the data manually after the fact, so that an exact reconstruction of the original information is not possible.
These circumstances pose several challenges for query techniques and data mining methods. Existing research on uncertain databases often disregards various problems. Either the presence of dependencies is ignored, or only approximate solutions are offered that allow methods for certain data to be applied. Other approaches compute exact solutions but do not return answers within an acceptable running time. This part of the thesis presents efficient methods for answering similarity queries that return results in descending order of relevance, that is, as a ranking of the results. Finally, the techniques applied are transferred to problems in probabilistic data mining, for example to solve the frequent itemset mining problem while taking the full content of uncertainty information into account.

    Modelagem simbólica de padrões morfológicos para classificação de séries temporais

    Advisor: Prof. Dr. Fabiano Silva. Thesis (Doctorate) - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 14/09/2015. Includes references: leaves 149-167.
    Summary: The continuous storage of data over time, such as time series, has motivated the development of new approaches based on data mining methods. In this scenario, a new research area has emerged over the last two decades: time series data mining. More specifically, approaches based on machine learning techniques have attracted the greatest interest among researchers. Among data mining tasks, time series classification has been widely explored, and recent studies using non-symbolic learning algorithms have reported significant results in terms of classification accuracy. However, in applications that involve decision-support processes, such as medical diagnosis, industrial production control, and safety monitoring systems in aircraft or power plants, it is necessary to make the reasoning used in the classification process understandable. The shapelet primitive was proposed in the literature as a descriptor of local morphological characteristics to enable a better understanding of the concepts, owing to its closeness to human perception in identifying patterns in time series. Nevertheless, most work on this primitive has been devoted to developing approaches that are more efficient in terms of time and accuracy, disregarding the need for intelligible classifiers.
In this context, this work proposes a method that uses the shapelet transform to build symbolic classification models through a hybrid approach that combines the decision tree representation with the nearest neighbor algorithm. Strategies were also developed to improve the representation quality of the shapelet transform when used with symbolic classifiers such as decision trees. To evaluate the performance of these proposals, an experimental evaluation was conducted that compared them with algorithms considered state of the art, using datasets widely studied in the time series classification literature. Based on the results and analyses carried out in this thesis, it was possible to verify that improving the shapelet identification process enables the construction of intelligible and competitive classifiers, and that hybrid methods can help provide a symbolic representation of the models with performance equivalent or even superior to that of non-symbolic methods. Keywords: data mining. machine learning. time series. classification. symbolic models.
Abstract: The large amount of data stored over time, such as time series, has motivated the development of new approaches based on data mining methods. In this context, a new research area has emerged over the last two decades: time series data mining. In particular, approaches based on machine learning techniques have attracted great interest among researchers. Among the data mining tasks, time series classification has been widely explored. Recent studies using non-symbolic learning algorithms have reported significant results in terms of classification accuracy.
However, in applications related to decision-making processes, such as medical diagnosis, industrial production control, and security monitoring systems in aircraft and in power plants, it is necessary to allow the reasoning used in the classification process to be understood. To take this into account, the shapelet primitive has been proposed in the literature as a descriptor of local morphological characteristics, which is closer to human perception for pattern identification in time series. On the other hand, most of the existing work related to shapelets has been dedicated to the development of more effective approaches in terms of time and accuracy, disregarding the need for interpretability of the classifiers. In this work, we propose to build symbolic models for time series classification using the shapelet transformation. This method is based on a hybrid approach that merges the decision tree representation and the nearest neighbor algorithm. We also developed strategies to improve the representation quality of the shapelet transformation using feature selection algorithms. We performed an experimental evaluation to analyze the performance of our proposals in comparison to the algorithms considered state of the art, using datasets widely studied in the literature of time series classification. Based on the results and analysis carried out in this thesis, we found that the improvement of shapelet representation allows the construction of interpretable and competitive classifiers. Moreover, we found that hybrid methods can help to provide symbolic models with equivalent or even superior performance to non-symbolic methods. Keywords: data mining. machine learning. time series. classification. symbolic models.
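
To make the idea of a shapelet transformation concrete, here is a minimal sketch of the commonly described form: each series is mapped to a vector of its distances to a set of shapelets, where the distance to one shapelet is the smallest Euclidean distance over all equally long windows of the series. This is an illustration under assumptions, not the thesis's method: the function names are hypothetical, the shapelets are given by hand rather than discovered, and normalization steps that are often applied in practice are omitted.

```python
import math


def shapelet_distance(series, shapelet):
    """Smallest Euclidean distance between `shapelet` and any window of `series`."""
    length = len(shapelet)
    if length == 0 or length > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - length + 1):
        squared = sum((series[start + k] - shapelet[k]) ** 2 for k in range(length))
        best = min(best, squared)
    return math.sqrt(best)


def shapelet_transform(dataset, shapelets):
    """Map each series to its vector of distances to the given shapelets."""
    return [[shapelet_distance(series, s) for s in shapelets] for series in dataset]


if __name__ == "__main__":
    data = [[0, 1, 2, 1, 0], [0, 0, 0, 0, 0]]
    peak = [1, 2, 1]  # hand-picked shapelet, for illustration only
    print(shapelet_transform(data, [peak]))
```

The resulting vectors are ordinary feature rows, so a symbolic learner such as a decision tree can split on "distance to shapelet k is below some value", which is one way such a model can stay readable.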

    The efficient market hypothesis through the eyes of an artificial technical analyst

    The academic literature has been reluctant to accept technical analysis as a rational strategy of traders in financial markets. In practice traders and analysts heavily use technical analysis to make investment decisions. To resolve this incongruence the aim of this study is to translate technical analysis into a rigorous formal framework and to investigate its potential failure or success. To avoid subjectivism we design an Artificial Technical Analyst. The empirical study presents the evidence of past market inefficiencies observed on the Tokyo Stock Exchange. The market can be perceived as inefficient if the technical analyst's transaction costs are below the break-even level derived from technical analysis. (English

    Enhanced Query Processing on Complex Spatial and Temporal Data

    Innovative technologies in the area of multimedia and mechanical engineering as well as novel methods for data acquisition in different scientific subareas, including geo-science, environmental science, medicine, biology and astronomy, enable a more exact representation of the data, and thus, a more precise data analysis. The resulting quantitative and qualitative growth of specifically spatial and temporal data leads to new challenges for the management and processing of complex structured objects and requires the employment of efficient and effective methods for data analysis. Spatial data denote the description of objects in space by a well-defined extension, a specific location and by their relationships to the other objects. Classical representatives of complex structured spatial objects are three-dimensional CAD data from the sector "mechanical engineering" and two-dimensional bounded regions from the area "geography". For industrial applications, efficient collision and intersection queries are of great importance. Temporal data denote data describing time-dependent processes, for instance the duration of specific events or the description of time-varying attributes of objects. Time series are among the most popular and complex types of temporal data and are the most important form of description for time-varying processes. An elementary type of query in time series databases is the similarity query, which serves as a basic query for data mining applications. The main target of this thesis is to develop an effective and efficient algorithm supporting collision queries on spatial data as well as similarity queries on temporal data, in particular, time series. The presented concepts are based on the efficient management of interval sequences, which are suitable for spatial and temporal data. The effective analysis of the underlying objects will be efficiently supported by adequate access methods.
First, this thesis deals with collision queries on complex spatial objects which can be reduced to intersection queries on interval sequences. We introduce statistical methods for the grouping of subsequences. Involving the concept of multi-step query processing, these methods enable the user to accelerate the query process drastically. Furthermore, in this thesis we will develop a cost model for the multi-step query process of interval sequences in distributed systems. The proposed approach successfully supports a cost-based query strategy. Second, we introduce a novel similarity measure for time series. It allows the user to focus on specific time series amplitudes for the similarity measurement. The new similarity model defines two time series to be similar iff they show similar temporal behavior w.r.t. being below or above a specific threshold. This type of query is primarily required in natural science applications. The main goal of this new query method is the detection of anomalies and the adaptation to new demands in the area of data mining in time series databases. In addition, a semi-supervised cluster analysis method will be presented which is based on the introduced similarity model for time series. The efficiency and effectiveness of the proposed techniques will be extensively discussed, and their advantages over existing methods will be demonstrated experimentally by means of datasets derived from real-world applications.
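
The threshold-based similarity model described above, together with the thesis's reliance on interval sequences, can be pictured with a small sketch. It first turns a time series into the sequence of index intervals during which it lies above a threshold, and then scores two equally long series by the fraction of time steps at which they agree on being above or below that threshold. The abstract does not give the actual distance used on interval sequences, so the agreement ratio here is a stand-in chosen for simplicity, and the names `threshold_intervals`, `threshold_agreement`, and `tau` are assumptions made for this example.

```python
def threshold_intervals(series, tau):
    """Return half-open index intervals [start, end) where series[i] > tau."""
    intervals = []
    start = None
    for i, value in enumerate(series):
        if value > tau and start is None:
            start = i                      # an above-threshold run begins
        elif value <= tau and start is not None:
            intervals.append((start, i))   # the run ended just before i
            start = None
    if start is not None:
        intervals.append((start, len(series)))
    return intervals


def threshold_agreement(x, y, tau):
    """Fraction of time steps where x and y are on the same side of tau.

    A stand-in similarity score, not the measure defined in the thesis.
    """
    if len(x) != len(y):
        raise ValueError("series must have equal length")
    if not x:
        return 1.0
    same = sum((a > tau) == (b > tau) for a, b in zip(x, y))
    return same / len(x)


if __name__ == "__main__":
    x = [0, 2, 3, 0, 5]
    y = [0, 9, 9, 0, 0]
    print(threshold_intervals(x, 1))
    print(threshold_agreement(x, y, 1))
```

Note that `y` has much larger amplitudes than `x` in the usage lines, yet the two agree at most time steps: only the side of the threshold matters, not the magnitude, which is the property the abstract emphasizes.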