19 research outputs found

    SE-shapelets: Semi-supervised Clustering of Time Series Using Representative Shapelets

    Shapelets, which discriminate time series using local features (subsequences), are promising for time series clustering. Existing time series clustering methods may fail to capture representative shapelets because they discover shapelets from a large pool of uninformative subsequences, resulting in low clustering accuracy. This paper proposes a Semi-supervised Clustering of Time Series Using Representative Shapelets (SE-Shapelets) method, which utilizes a small number of labeled and propagated pseudo-labeled time series to help discover representative shapelets, thereby improving clustering accuracy. In SE-Shapelets, we propose two techniques to discover representative shapelets for effective clustering of time series: 1) a salient subsequence chain (SSC) that extracts salient subsequences (as candidate shapelets) of a labeled/pseudo-labeled time series, which helps remove massive numbers of uninformative subsequences from the pool; and 2) a linear discriminant selection (LDS) algorithm that identifies shapelets capturing representative local features of time series in different classes, for convenient clustering. Experiments on UCR time series datasets demonstrate that SE-Shapelets discovers representative shapelets and achieves higher clustering accuracy than counterpart semi-supervised time series clustering methods.
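    For readers unfamiliar with shapelets, the core primitive is the distance between a candidate shapelet and a whole time series, conventionally defined as the minimum z-normalized Euclidean distance over all same-length subsequences. Below is a minimal Python sketch of that standard definition; SE-Shapelets' actual SSC extraction and LDS selection steps are not reproduced here.

        import numpy as np

        def shapelet_distance(shapelet, series):
            """Minimum z-normalized Euclidean distance from a shapelet to any
            same-length subsequence of the series (standard shapelet distance;
            SE-Shapelets' exact scoring may differ)."""
            m = len(shapelet)
            s = (shapelet - shapelet.mean()) / (shapelet.std() + 1e-8)
            best = np.inf
            for i in range(len(series) - m + 1):
                w = series[i:i + m]
                w = (w - w.mean()) / (w.std() + 1e-8)
                best = min(best, np.linalg.norm(s - w))
            return best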

    Simulation Analytics for Deeper Comparisons

    Output analysis for stochastic simulation has traditionally focused on obtaining statistical summaries of time-averaged and replication-averaged performance measures. Although providing a useful overview of expected long-run results, this focus ignores the finer behaviour and dynamic interactions that characterise a stochastic system, motivating an opening for simulation analytics. Data analysis efforts directed towards the detailed event logs of simulation sample paths can extend the analytical toolkit of simulation beyond static summaries of long-run behaviour. This thesis contributes novel methodologies to the field of simulation analytics. Through a careful mining of sample path data and application of appropriate machine learning techniques, we unlock new opportunities for understanding and improving the performance of stochastic systems. Our first area of focus is the real-time prediction of dynamic performance measures, for which we demonstrate a k-nearest-neighbours model on the multivariate state of a simulation. In conjunction with this, metric learning is employed to refine a system-specific distance measure that operates between simulation states. The involvement of metric learning is found not only to enhance prediction accuracy, but also to offer insight into the driving factors behind a system's stochastic performance. Our main contribution within this approach is the adaptation of a metric learning formulation to accommodate the type of data that is typical of simulation sample paths. Secondly, we explore the continuous-time trajectories of simulation variables. Shapelets are found to identify the patterns that characterise and distinguish the trajectories of competing systems. By tailoring shapelet extraction to the structure of discrete-event sample paths, we pursue a deeper understanding and comparison of the dynamic behaviours of stochastic simulations.
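    A minimal sketch of the kind of predictor described above: k-nearest-neighbours regression over multivariate simulation states under a Mahalanobis distance parameterized by a learned matrix M. This is a generic illustration under assumed inputs; the thesis's own metric learning formulation for sample-path data is not reproduced.

        import numpy as np

        def knn_predict(states, outcomes, query, M, k=5):
            """Predict a dynamic performance measure from the current simulation
            state via k-nearest neighbours under a Mahalanobis metric M.
            states: (n, d) array of observed states; outcomes: (n,) targets."""
            diffs = states - query                           # (n, d) differences
            d2 = np.einsum('nd,de,ne->n', diffs, M, diffs)   # squared distances
            nearest = np.argsort(d2)[:k]
            return outcomes[nearest].mean()

        # M = np.eye(d) recovers plain Euclidean kNN; metric learning fits M so
        # that states with similar future performance are close to each other.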

    Co-movement clustering: A novel approach for predicting inflation in the food and beverage industry

    In the realm of food and beverage businesses, inflation poses a significant hurdle as it affects pricing, profitability, and consumers' purchasing power, setting the industry apart from others. This study proposes a novel approach, co-movement clustering, to predict which items will be inflated together according to historical time-series data. Experiments were conducted to evaluate the proposed approach on real-world data obtained from the UK Office for National Statistics. The predicted results of the proposed approach were compared against four classical methods (correlation, Euclidean distance, cosine similarity, and DTW). According to our experimental results, the accuracy of the proposed approach outperforms the above-mentioned classical methods. Moreover, the accuracy of the proposed approach is higher when an additional filter is applied. Our approach aids hospitality operators in accurately predicting food and beverage inflation, enabling the development of effective strategies to navigate the current challenging business environment in hospitality management. Little previous work has explored how time series clustering can be applied to support inflation prediction. This study opens a new research paradigm for the related field and can serve as a useful reference for future research in this emerging area. In addition, this study contributes to the data analytics research stream in the hospitality management literature.
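    The four classical baselines named above are standard pairwise measures on time series; a compact Python sketch of them follows (the proposed co-movement measure itself is not described in enough detail to reproduce).

        import numpy as np

        def baseline_similarities(x, y):
            """Correlation, Euclidean distance, and cosine similarity between
            two equal-length series (three of the study's four baselines)."""
            corr = np.corrcoef(x, y)[0, 1]
            euclid = np.linalg.norm(x - y)
            cosine = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
            return corr, euclid, cosine

        def dtw(x, y):
            """Classic O(len(x)*len(y)) dynamic time warping distance."""
            n, m = len(x), len(y)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = abs(x[i - 1] - y[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]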

    Diffeomorphic Transformations for Time Series Analysis: An Efficient Approach to Nonlinear Warping

    The proliferation and ubiquity of temporal data across many disciplines has sparked interest in similarity, classification, and clustering methods specifically designed to handle time series data. A core issue when dealing with time series is determining their pairwise similarity, i.e., the degree to which a given time series resembles another. Traditional distance measures such as the Euclidean distance are not well-suited due to the time-dependent nature of the data. Elastic metrics such as dynamic time warping (DTW) offer a promising approach, but are limited by their computational complexity, non-differentiability, and sensitivity to noise and outliers. This thesis proposes novel elastic alignment methods that use parametric and diffeomorphic warping transformations as a means of overcoming the shortcomings of DTW-based metrics. The proposed method is differentiable and invertible, well-suited for deep learning architectures, robust to noise and outliers, computationally efficient, and expressive and flexible enough to capture complex patterns. Furthermore, a closed-form solution was developed for the gradient of these diffeomorphic transformations, which allows an efficient search in the parameter space, leading to better solutions at convergence. Leveraging the benefits of these closed-form diffeomorphic transformations, this thesis proposes a suite of advancements that include: (a) an enhanced temporal transformer network for time series alignment and averaging, (b) a deep-learning-based time series classification model to simultaneously align and classify signals with high accuracy, (c) an incremental time series clustering algorithm that is warping-invariant, scalable, and able to operate under limited computational and time resources, and finally, (d) a normalizing flow model that enhances the flexibility of affine transformations in coupling and autoregressive layers. Comment: PhD thesis, defended at the University of Navarra on July 17, 2023. 277 pages, 8 chapters, 1 appendix.
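    To make the idea of parametric warping concrete, here is a minimal Python sketch of a monotone, invertible, piecewise-linear warp of the unit interval with positive learnable slopes. It illustrates why such transformations are differentiable in their parameters, unlike DTW's discrete alignment; the thesis's actual transformations are diffeomorphic with closed-form gradients, which this simplified warp does not reproduce.

        import numpy as np

        def parametric_warp(t, theta):
            """Monotone warp of [0, 1] built from K softplus-parameterized
            segment slopes; theta is a length-K parameter vector and t an
            array of times in [0, 1]. Illustrative sketch only."""
            slopes = np.log1p(np.exp(theta))          # softplus > 0 => monotone
            knots = np.concatenate([[0.0], np.cumsum(slopes)])
            knots /= knots[-1]                         # warp fixes both endpoints
            K = len(theta)
            seg = np.minimum((t * K).astype(int), K - 1)
            frac = t * K - seg
            return knots[seg] + frac * (knots[seg + 1] - knots[seg])

        # Aligning series y to x then reduces to minimizing
        #   sum_i (x(t_i) - y(parametric_warp(t_i, theta)))^2 over theta,
        # a smooth optimization problem amenable to gradient-based search.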

    Deep Clustering and Deep Network Compression

    The use of deep learning has grown increasingly in recent years, thereby becoming a much-discussed topic across a diverse range of fields, especially in computer vision, text mining, and speech recognition. Deep learning methods have proven to be robust in representation learning and have attained extraordinary achievements. Their success is primarily due to the ability of deep learning to discover and automatically learn feature representations by mapping input data into abstract and composite representations in a latent space. Deep learning's ability to deal with high-level representations of data has inspired us to make use of learned representations, aiming to enhance unsupervised clustering and to evaluate the characteristic strength of internal representations to compress and accelerate deep neural networks.

    Traditional clustering algorithms attain limited performance as the dimensionality increases. Therefore, the ability to extract high-level representations provides beneficial components that can support such clustering algorithms. In this work, we first present DeepCluster, a clustering approach embedded in a deep convolutional auto-encoder (DCAE). We introduce two clustering methods, namely DCAE-Kmeans and DCAE-GMM. DeepCluster allows data points to be grouped into their identical cluster in the latent space via a joint cost function, by simultaneously optimizing the clustering objective and the DCAE objective, producing stable representations appropriate for the clustering process. Both qualitative and quantitative evaluations of the proposed methods are reported, showing the efficiency of deep clustering on several public datasets in comparison to previous state-of-the-art methods.

    Following this, we propose a new version of the DeepCluster model to include varying degrees of discriminative power. This introduces a mechanism which enables the imposition of regularization techniques and the involvement of a supervision component. The key idea of our approach is to distinguish the discriminatory power of numerous structures when searching for a compact structure to form robust clusters. The effectiveness of injecting various levels of discriminatory power into the learning process is investigated alongside an exploration and analytical study of the discriminatory power obtained through the use of two discriminative attributes: data-driven discriminative attributes with the support of regularization techniques, and supervision discriminative attributes with the support of the supervision component. An evaluation is provided on four different datasets.

    The use of neural networks in various applications is accompanied by a dramatic increase in computational costs and memory requirements. Making use of the characteristic strength of learned representations, we propose an iterative pruning method that simultaneously identifies the critical neurons and prunes the model during training, without involving any pre-training or fine-tuning procedures. We introduce a majority voting technique to compare the activation values among neurons and assign a voting score to evaluate their importance quantitatively. This mechanism effectively reduces model complexity by eliminating the less influential neurons, and aims to determine a subset of the whole model that can represent the reference model with far fewer parameters within the training process. Empirically, we demonstrate that our pruning method is robust across various scenarios, including fully-connected networks (FCNs), sparsely-connected networks (SCNs), and convolutional neural networks (CNNs), using two public datasets.

    Moreover, we propose a novel framework to measure the importance of individual hidden units by computing a measure of relevance that identifies the most critical filters, which are then pruned to compress and accelerate CNNs. Unlike existing methods, we introduce the use of the activation of feature maps to detect valuable information and essential semantic parts, with the aim of evaluating the importance of feature maps, inspired by recent work on neural network interpretability. A majority voting technique based on the degree of alignment between a semantic concept and individual hidden unit representations is utilized to evaluate feature maps' importance quantitatively. We also propose a simple yet effective method to estimate new convolution kernels based on the remaining crucial channels, to accomplish effective CNN compression. Experimental results show the effectiveness of our filter selection criteria, which outperform the state-of-the-art baselines.

    To conclude, we present a comprehensive, detailed review of time-series data analysis, with emphasis on deep time-series clustering (DTSC), and a founding contribution to the area of applying deep clustering to time-series data by presenting the first case study in the context of movement behavior clustering utilizing the DeepCluster method. The results are promising, showing that the latent space encodes sufficient patterns to facilitate accurate clustering of movement behaviors. Finally, we identify the state of the art and present an outlook on the important field of DTSC from five perspectives.
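    The joint cost function described above can be sketched as a reconstruction loss plus a latent clustering penalty. The following PyTorch illustration is a hedged approximation under assumed names and weighting (the coefficient lam and the exact loss terms are assumptions, not the thesis's implementation):

        import torch

        def deepcluster_loss(x, x_hat, z, centroids, lam=0.1):
            """Joint DCAE-Kmeans-style objective: autoencoder reconstruction
            error plus the distance of each latent code to its nearest
            cluster centroid. z: (batch, d) codes; centroids: (k, d)."""
            recon = torch.mean((x - x_hat) ** 2)
            d2 = torch.cdist(z, centroids) ** 2        # (batch, k) squared dists
            cluster = d2.min(dim=1).values.mean()      # pull codes to centroids
            return recon + lam * cluster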

    Enhancing Grid Reliability With Phasor Measurement Units

    Over the last decades, great efforts and investments have been made to increase the integration level of renewable energy resources in power grids. New York State has set the goal of achieving 70% renewable generation by 2030, and eventually realizing carbon neutrality by 2040. However, the increased level of uncertainty brought about by renewables makes it more challenging to maintain stable and robust power grid operation. In addition to renewable energy resources, the ever-increasing number of electric vehicles and active loads has further increased the uncertainties in power systems. All these factors challenge the way power grids are operated, and thus call for new solutions to maintain stable and reliable grids. To meet the emerging requirements, advanced metering infrastructures are being integrated into power grids, transforming traditional grids into "smart grids". One example is the widely deployed phasor measurement units (PMUs), which generate time-synchronized measurements with high sampling frequency and pave a new path to real-time monitoring and control in power grids. However, the massive data generated by PMUs raises the question of how to efficiently utilize the obtained measurements to understand and control the present system. Additionally, to meet the communication requirements between the advanced meters, the connectivity of the cyber layer has become more sophisticated, and is thus exposed to more cyber-attacks than before. Therefore, to enhance grid reliability with PMUs, robust and efficient grid monitoring and control methods are required. This dissertation focuses on three important aspects of improving grid reliability with PMUs: (1) power system event detection; (2) impact assessment regarding both steady-state and transient stability; and (3) impact mitigation. In this dissertation, a comprehensive introduction to PMUs in the wide-area monitoring system, together with comparisons against existing supervisory control and data acquisition (SCADA) systems, is presented first. Next, a data-driven event detection method is developed for efficient event detection with PMU measurements. A text mining approach is utilized to extract event oscillation patterns and determine event types. To ensure the integrity of the received data, the detection method is further designed to identify fake events, and is thus robust against cyber-threats. Once a real event is detected, it is critical to promptly understand the consequences of the event in both steady and dynamic states. Sometimes a single system event, e.g., a transmission line fault, may cause subsequent failures that lead to a cascading failure in the grid. In the worst case, these failures can result in large-scale blackouts. To assess the risk of an event in steady state, a probabilistic cascading failure model is developed. With real-time phasor measurements, the failure probability of each system component at a specific operating condition can be predicted. In terms of the dynamic state, a failure of a system component may cause generators to lose synchronism, which will damage the power plant and lead to a blackout. To predict the transient stability after an event, a predictive online transient stability assessment (TSA) tool is developed in this dissertation. With only one sample of the PMU voltage measurements, the status of the transient stability can be predicted within cycles.

    In addition to impact detection and assessment, it is also critical to identify proper mitigations to alleviate failures. In this dissertation, a data-driven model predictive control strategy is developed. As a parameter-based system model is vulnerable to topology errors, a data-driven model is developed to mimic the grid behavior. Rather than utilizing system parameters to construct the grid model, the data-driven model leverages only the received phasor measurements to determine proper corrective actions. Furthermore, to be robust against cyber-attacks, a check-point protocol, in which past stored trustworthy data can be used to amend attacked data, is utilized. The overall objective of this dissertation is to efficiently utilize advanced PMUs to detect, assess, and mitigate system failures, and to help improve grid reliability.
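    As a minimal illustration of data-driven event detection on a PMU stream, the sketch below flags samples whose robust z-score against a rolling window exceeds a threshold. The dissertation's actual detector is based on text mining of oscillation patterns; the window length and threshold here are illustrative assumptions.

        import numpy as np

        def detect_events(signal, window=30, threshold=4.0):
            """Flag indices where the sample deviates from the rolling median
            by more than `threshold` robust (MAD-based) standard deviations."""
            events = []
            for i in range(window, len(signal)):
                ref = signal[i - window:i]
                med = np.median(ref)
                mad = np.median(np.abs(ref - med)) + 1e-9   # robust scale
                if abs(signal[i] - med) / (1.4826 * mad) > threshold:
                    events.append(i)
            return events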

    Flexible estimation of temporal point processes and graphs

    Handling complex data types with spatial structures, temporal dependencies, or discrete values is generally a challenge in statistics and machine learning. In recent years, there has been an increasing need for methodological and theoretical work to analyse non-standard data types, for instance, data collected on protein structures, gene interactions, social networks, or physical sensors. In this thesis, I propose methodology and provide theoretical guarantees for analysing two general types of discrete data emerging from interactive phenomena, namely temporal point processes and graphs. On the one hand, temporal point processes are stochastic processes used to model event data, i.e., data that come as discrete points in time or space where some phenomenon occurs. Some of the most successful applications of these discrete processes include online messages, financial transactions, earthquake strikes, and neuronal spikes. The popularity of these processes notably comes from their ability to model unobserved interactions and dependencies between temporally and spatially distant events. However, statistical methods for point processes generally rely on estimating a latent, unobserved, stochastic intensity process. In this context, designing flexible models and consistent estimation methods is often a challenging task. On the other hand, graphs are structures made of nodes (or agents) and edges (or links), where an edge represents an interaction or relationship between two nodes. Graphs are ubiquitous in modelling real-world social, transport, and mobility networks, where edges can correspond to virtual exchanges, physical connections between places, or migrations across geographical areas. Besides, graphs are used to represent correlations and lead-lag relationships between time series, and local dependence between random objects. Graphs are typical examples of non-Euclidean data, for which adequate distance measures, similarity functions, and generative models need to be formalised. In the deep learning community, graphs have become particularly popular within the field of geometric deep learning. Structure and dependence can both be modelled by temporal point processes and graphs, although predominantly, the former act on the temporal domain while the latter conceptualise spatial interactions. Nonetheless, some statistical models combine graphs and point processes in order to account for both spatial and temporal dependencies. For instance, temporal point processes have been used to model the birth times of edges and nodes in temporal graphs. Moreover, some multivariate point process models have a latent graph parameter governing the pairwise causal relationships between the components of the process. In this thesis, I notably study such a model, called the Hawkes model, as well as graphs evolving in time. This thesis aims at designing inference methods that provide flexibility in the contexts of temporal point processes and graphs. The manuscript is presented in an integrated format, with four main chapters and two appendices. Chapters 2 and 3 are dedicated to the study of Bayesian nonparametric inference methods in the generalised Hawkes point process model. While Chapter 2 provides theoretical guarantees for existing methods, Chapter 3 also proposes, analyses, and evaluates a novel variational Bayes methodology.

    The other main chapters introduce and study model-free inference approaches for two estimation problems on graphs, namely spectral methods for the signed graph clustering problem in Chapter 4, and a deep learning algorithm for the network change point detection task on temporal graphs in Chapter 5. Additionally, Chapter 1 provides an introduction and background preliminaries on point processes and graphs. Chapter 6 concludes the thesis with a summary and critical reflection on the works in this manuscript, and proposals for future research. Finally, the appendices contain two supplementary papers. The first, in Appendix A, initiated after the COVID-19 outbreak in March 2020, is an application of a discrete-time Hawkes model to COVID-related death counts during the first wave of the pandemic. The second, in Appendix B, was conducted during an internship at Amazon Research in 2021, and proposes an explainability method for anomaly detection models acting on multivariate time series.
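    For concreteness, the generalised Hawkes model studied in Chapters 2 and 3 builds on the classical linear Hawkes process, whose conditional intensity is lambda(t) = mu + sum over past events t_i of h(t - t_i). A Python sketch with an exponential kernel, together with Ogata's thinning simulation, follows; the parameter values are illustrative, and the thesis treats the intensity nonparametrically in a Bayesian framework.

        import numpy as np

        def hawkes_intensity(t, events, mu=0.5, alpha=0.8, beta=1.2):
            """lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i))."""
            past = events[events < t]
            return mu + np.sum(alpha * np.exp(-beta * (t - past)))

        def simulate_hawkes(T, mu=0.5, alpha=0.8, beta=1.2, seed=0):
            """Ogata's thinning algorithm on [0, T] (requires alpha/beta < 1
            for the process to be stationary)."""
            rng = np.random.default_rng(seed)
            events, t = np.array([]), 0.0
            while t < T:
                # intensity is decreasing between events, so lambda(t) + alpha
                # dominates it on the next interval and is a valid upper bound
                lam_bar = hawkes_intensity(t, events, mu, alpha, beta) + alpha
                t += rng.exponential(1.0 / lam_bar)
                if t < T and rng.uniform() * lam_bar <= hawkes_intensity(t, events, mu, alpha, beta):
                    events = np.append(events, t)   # accept with prob lambda/lam_bar
            return events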

    Kern-basierte Lernverfahren für das virtuelle Screening (Kernel-Based Learning Methods for Virtual Screening)

    We investigate the utility of modern kernel-based machine learning methods for ligand-based virtual screening. In particular, we introduce a new graph kernel based on iterative graph similarity and optimal assignments, apply kernel principal component analysis to projection-error-based novelty detection, and discover a new selective agonist of the peroxisome proliferator-activated receptor gamma using Gaussian process regression. Virtual screening, the computational ranking of compounds with respect to a predicted property, is a cheminformatics problem relevant to the hit generation phase of drug development. Its ligand-based variant relies on the similarity principle, which states that (structurally) similar compounds tend to have similar properties. We describe the kernel-based machine learning approach to ligand-based virtual screening; in doing so, we stress the role of molecular representations, including the (dis)similarity measures defined on them, investigate effects in high-dimensional chemical descriptor spaces and their consequences for similarity-based approaches, review literature recommendations on retrospective virtual screening, and present an example workflow. Graph kernels are formal similarity measures that are defined directly on graphs, such as the annotated molecular structure graph, and correspond to inner products. We review graph kernels, in particular those based on random walks, subgraphs, and optimal vertex assignments. Combining the latter with an iterative graph similarity scheme, we develop the iterative similarity optimal assignment graph kernel, give an iterative algorithm for its computation, prove convergence of the algorithm and the uniqueness of the solution, and provide an upper bound on the number of iterations necessary to achieve a desired precision. In a retrospective virtual screening study, our kernel consistently improved performance over chemical descriptors as well as other optimal assignment graph kernels. Chemical data sets often lie on manifolds of lower dimensionality than the embedding chemical descriptor space. Dimensionality reduction methods try to identify these manifolds, effectively providing descriptive models of the data. For spectral methods based on kernel principal component analysis, the projection error is a quantitative measure of how well new samples are described by such models. This can be used for the identification of compounds structurally dissimilar to the training samples, leading to projection-error-based novelty detection for virtual screening using only positive samples. We provide proof of principle by using principal component analysis to learn the concept of fatty acids. The peroxisome proliferator-activated receptor (PPAR) is a nuclear transcription factor that regulates lipid and glucose metabolism, playing a crucial role in the development of type 2 diabetes and dyslipidemia. We establish a Gaussian process regression model for PPAR gamma agonists using a combination of chemical descriptors and the iterative similarity optimal assignment kernel via multiple kernel learning. Screening of a vendor library and subsequent testing of 15 selected compounds in a cell-based transactivation assay resulted in 4 active compounds. One compound, a natural product with a cyclobutane scaffold, is a full selective PPAR gamma agonist (EC50 = 10 ± 0.2 µM, inactive on PPAR alpha and PPAR beta/delta at 10 µM).
    The study delivered a novel PPAR gamma agonist, de-orphanized a natural bioactive product, and hints at the natural-product origins of pharmacophore patterns in synthetic ligands.
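    A minimal sketch of the projection-error novelty score described above, computed from a centred training kernel matrix via kernel PCA; this is illustrative only, and omits details such as kernel centring of the test point and the thesis's exact screening workflow.

        import numpy as np

        def kpca_projection_error(K_train, k_new, k_self, n_components=5):
            """Squared feature-space distance of a new sample to the span of
            the leading kernel principal components: the novelty score for
            one-class (positives-only) virtual screening.
            K_train: centred (n, n) train kernel; k_new: (n,) kernel values
            between the new sample and the training set; k_self: k(x, x)."""
            vals, vecs = np.linalg.eigh(K_train)
            idx = np.argsort(vals)[::-1][:n_components]
            vals = np.clip(vals[idx], 1e-12, None)     # guard tiny eigenvalues
            alphas = vecs[:, idx] / np.sqrt(vals)      # normalized eigenvectors
            proj = alphas.T @ k_new                    # feature-space coordinates
            return k_self - proj @ proj                # ||phi(x)||^2 - ||P phi(x)||^2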