
    Automatically estimating iSAX parameters

    The Symbolic Aggregate Approximation (iSAX) is widely used in time series data mining. Its popularity arises from the fact that it greatly reduces time series size, is symbolic, allows lower bounding, and is space efficient. However, it requires setting two parameters, the symbolic word length and the alphabet size, which limits the applicability of the technique. The optimal parameter values are highly application dependent. Typically, they are either set to a fixed value or experimentally probed for the best configuration. In this work we propose an approach to automatically estimate iSAX's parameters. The approach – AutoiSAX – not only discovers the best parameter setting for each time series in the database, but also finds the alphabet size for each iSAX symbol within the same word. It is based on simple and intuitive ideas from time series complexity and statistics. The technique can be smoothly embedded in existing data mining tasks as an efficient sub-routine. We analyze its impact on visualization interpretability, classification accuracy, and motif mining. Our contribution aims to make iSAX a more general approach as it evolves towards a parameter-free method.
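    The abstract does not spell out the estimation rules, but the core idea of choosing a cardinality per symbol from local statistics of the segment it summarizes can be sketched roughly as follows. This is a minimal Python illustration assuming a z-normalised series and PAA segmentation; the standard-deviation thresholds and the segment_alphabet_size heuristic are assumptions, not the authors' actual procedure.

        # Minimal sketch, not the authors' method: the mapping from segment variability
        # to alphabet size and its thresholds are illustrative assumptions.
        import numpy as np
        from scipy.stats import norm

        def paa(series, word_length):
            """Piecewise Aggregate Approximation: the mean of each of `word_length` segments."""
            segments = np.array_split(series, word_length)
            return [seg.mean() for seg in segments], segments

        def segment_alphabet_size(segment, sizes=(2, 4, 8, 16)):
            """Assumed heuristic: more variable segments receive a finer alphabet."""
            s = segment.std()
            if s < 0.25:
                return sizes[0]
            if s < 0.5:
                return sizes[1]
            if s < 1.0:
                return sizes[2]
            return sizes[3]

        def symbolize(series, word_length=8):
            series = (series - series.mean()) / series.std()   # z-normalise
            means, segments = paa(series, word_length)
            word = []
            for m, seg in zip(means, segments):
                a = segment_alphabet_size(seg)
                # Breakpoints splitting a standard normal into `a` equiprobable regions.
                breakpoints = norm.ppf(np.linspace(0, 1, a + 1)[1:-1])
                word.append((int(np.searchsorted(breakpoints, m)), a))  # (symbol, cardinality)
            return word

        rng = np.random.default_rng(0)
        print(symbolize(np.cumsum(rng.normal(size=240))))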

    Time series motif discovery

    MAP-i Doctoral Programme in Computer Science. Time series data are produced daily in massive quantities in virtually every field, and most of them are stored in time series databases. Finding patterns in these databases is an important problem. These patterns, also known as motifs, provide useful insight to the domain expert and summarize the database. They have been widely used in areas as diverse as finance and medicine. Although there are many algorithms for the task, they typically do not scale and require several parameters to be set. We propose a novel algorithm that runs in linear time, is space efficient, and needs only one parameter. It fully exploits the state-of-the-art time series representation SAX (Symbolic Aggregate Approximation) to extract motifs at several resolutions. This property allows the algorithm to skip the expensive distance calculations typically employed by other algorithms. We also propose an approach to calculate the statistical significance of time series motifs. Although there are many approaches in the literature to find time series motifs efficiently, surprisingly none of them calculates a motif's statistical significance. Our proposal leverages work from the bioinformatics community by using a symbolic definition of time series motifs to derive each motif's p-value. We estimate the expected frequency of a motif by using Markov chain models. The p-value is then assessed by comparing the actual frequency to the estimated one using statistical hypothesis tests. Our contribution provides the means to apply a powerful technique, statistical tests, in a time series setting, giving researchers and practitioners an important tool to automatically evaluate the degree of relevance of each extracted motif. Finally, we propose an approach to automatically derive the parameters of the Symbolic Aggregate Approximation (iSAX) time series representation. This technique is widely used in time series data mining. Its popularity arises from the fact that it is symbolic, reduces the dimensionality of the series, allows lower bounding, and is space efficient. However, the need to set the symbolic length and alphabet size parameters limits the applicability of the representation, since the best parameter setting is highly application dependent. Typically, these are either set to a fixed value (e.g. 8) or experimentally probed for the best configuration. The technique, referred to as AutoiSAX, not only discovers the best parameter setting for each time series in the database but also finds the alphabet size for each iSAX symbol within the same word. It is based on the simple and intuitive ideas of time series complexity and standard deviation. The technique can be smoothly embedded in existing data mining tasks as an efficient sub-routine. We analyse the impact of using AutoiSAX on visualization interpretability, classification accuracy, and motif mining results. Our contribution aims to make iSAX a more general approach as it evolves towards a parameter-free method.
    Fundação para a Ciência e Tecnologia (FCT) - SFRH / BD / 33303 / 200
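    As a rough illustration of the significance idea, a first-order Markov chain can be fitted to the SAX symbol sequence to estimate a motif word's probability, and the observed occurrence count compared against a binomial approximation. The helper names, the example sequence, and the specific binomial test below are assumptions for the sketch, not the exact procedure from the thesis.

        # Rough sketch under stated assumptions: first-order Markov chain over SAX
        # symbols and a binomial approximation; helper names and the example are hypothetical.
        from collections import Counter
        from scipy.stats import binom

        def markov_word_probability(sequence, word):
            """P(word) under a first-order Markov chain fitted to `sequence`."""
            unigrams = Counter(sequence)
            bigrams = Counter(zip(sequence, sequence[1:]))
            p = unigrams[word[0]] / len(sequence)
            for a, b in zip(word, word[1:]):
                p *= bigrams[(a, b)] / unigrams[a]        # transition probability a -> b
            return p

        def motif_p_value(sequence, word, observed_count):
            p = markov_word_probability(sequence, word)
            n_windows = len(sequence) - len(word) + 1     # candidate starting positions
            # P(count >= observed) if occurrences were independent Bernoulli trials.
            return binom.sf(observed_count - 1, n_windows, p)

        sax_sequence = "abbaabcabbaabbacab"
        print(motif_p_value(sax_sequence, "abba", observed_count=3))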

    Highly comparative feature-based time-series classification

    A highly comparative, feature-based approach to time series classification is introduced that uses an extensive database of algorithms to extract thousands of interpretable features from time series. These features are derived from across the scientific time-series analysis literature, and include summaries of time series in terms of their correlation structure, distribution, entropy, stationarity, scaling properties, and fits to a range of time-series models. After computing thousands of features for each time series in a training set, those that are most informative of the class structure are selected using greedy forward feature selection with a linear classifier. The resulting feature-based classifiers automatically learn the differences between classes using a reduced number of time-series properties, and circumvent the need to calculate distances between time series. Representing time series in this way yields a dimensionality reduction of several orders of magnitude, allowing the method to perform well on very large datasets containing long time series or time series of different lengths. For many of the datasets studied, classification performance exceeded that of conventional instance-based classifiers, including one-nearest-neighbor classifiers using Euclidean distance and dynamic time warping. Most importantly, the features selected provide an understanding of the properties of the dataset, insight that can guide further scientific investigation.
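    The selection step can be pictured with a small sketch, assuming a precomputed feature matrix and scikit-learn's logistic regression as the linear classifier; the three placeholder features below stand in for the thousands of operations used in the paper.

        # Sketch only: a tiny placeholder feature matrix and greedy forward selection
        # with a linear classifier; the paper's feature library is far larger.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import cross_val_score

        def forward_select(X, y, max_features=3):
            selected, remaining = [], list(range(X.shape[1]))
            while remaining and len(selected) < max_features:
                scores = {f: cross_val_score(LogisticRegression(max_iter=1000),
                                             X[:, selected + [f]], y, cv=5).mean()
                          for f in remaining}
                best = max(scores, key=scores.get)        # feature giving the best CV accuracy
                selected.append(best)
                remaining.remove(best)
            return selected

        rng = np.random.default_rng(1)
        series = rng.normal(size=(60, 200))
        y = np.repeat([0, 1], 30)
        series[y == 1] *= 2.5                             # class 1 has higher variance
        # Placeholder features: mean, standard deviation, lag-1 autocorrelation.
        X = np.column_stack([series.mean(axis=1),
                             series.std(axis=1),
                             [np.corrcoef(s[:-1], s[1:])[0, 1] for s in series]])
        print(forward_select(X, y))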

    Bridging the gap between algorithmic and learned index structures

    Index structures such as B-trees and Bloom filters are the well-established petrol engines of database systems. However, these structures do not fully exploit patterns in data distribution. To address this, researchers have suggested using machine learning models as electric engines that can entirely replace index structures. Such a paradigm shift in data system design, however, opens many unsolved design challenges. More research is needed to understand the theoretical guarantees and to design efficient support for insertion and deletion. In this thesis, we adopt a different position: index algorithms are good enough, and instead of going back to the drawing board to fit data systems with learned models, we should develop lightweight hybrid engines that build on the benefits of both algorithmic and learned index structures. The indexes that we suggest provide the theoretical performance guarantees and updatability of algorithmic indexes while using position prediction models to exploit the data distribution and thereby improve the performance of the index structure. We investigate the potential for minimal modifications to algorithmic indexes so that they can leverage the data distribution in a way similar to learned indexes. In this regard, we propose and explore the use of helping models that boost classical index performance using techniques from machine learning. Our suggested approach inherits performance guarantees from its algorithmic baseline index, but at the same time considers the data distribution to improve performance considerably. We study single-dimensional range indexes, spatial indexes, and stream indexing, and show that the suggested approach results in range indexes that outperform the algorithmic indexes and have comparable performance to the read-only, fully learned indexes, and hence can be reliably used as a default index structure in a database engine. In addition, we consider the updatability of the indexes and suggest solutions for updating the index, notably when the data distribution drastically changes over time (e.g., for indexing data streams). In particular, we propose a specific learning-augmented index for indexing a sliding window with timestamps in a data stream. Additionally, we highlight the limitations of learned indexes for low-latency lookup on real-world data distributions. To tackle this issue, we suggest adding an algorithmic enhancement layer to a learned model to correct the prediction error with a small memory latency. This approach enables efficient modelling of the data distribution and resolves the local biases of a learned model at the cost of roughly one memory lookup.
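    The helping-model idea can be sketched in a few lines: a cheap position prediction model narrows the search range, and a classical search inside the model's worst-case error window guarantees a correct answer. The linear model, the fixed error bound, and the HybridIndex class name below are illustrative assumptions, not the thesis's actual design.

        # Toy sketch, not the thesis's design: a least-squares line predicts a key's
        # position in a sorted array, and binary search within the model's worst-case
        # error window corrects the prediction.
        import bisect
        import numpy as np

        class HybridIndex:
            def __init__(self, keys):
                self.keys = np.sort(np.asarray(keys))
                positions = np.arange(len(self.keys))
                self.slope, self.intercept = np.polyfit(self.keys, positions, deg=1)
                predictions = self.slope * self.keys + self.intercept
                self.max_error = int(np.ceil(np.abs(predictions - positions).max()))

            def lookup(self, key):
                guess = int(self.slope * key + self.intercept)
                lo = max(0, guess - self.max_error)
                hi = min(len(self.keys), guess + self.max_error + 1)
                # Algorithmic correction: search only inside the bounded error window.
                i = lo + bisect.bisect_left(self.keys[lo:hi].tolist(), key)
                return i if i < len(self.keys) and self.keys[i] == key else None

        rng = np.random.default_rng(2)
        keys = np.unique(rng.integers(0, 1_000_000, size=10_000))
        index = HybridIndex(keys)
        print(index.lookup(int(keys[1234])), index.lookup(-1))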

    Linking pensions to life expectancy: tackling conceptual uncertainty through Bayesian Model Averaging

    Linking pensions to longevity developments at retirement age has been one of the most common policy responses of pension schemes to aging populations. The introduction of automatic stabilizers is primarily motivated by cost containment objectives, but there are other dimensions of welfare restructuring in the politics of pension reforms, including recalibration, rationalization, and blame avoidance for unpopular policies that involve retrenchments. This paper examines the policy designs and implications of linking entry pensions to life expectancy developments through sustainability factors or life expectancy coefficients in Finland, Portugal, and Spain. To address conceptual and specification uncertainty in policymaking, we propose and apply a Bayesian model averaging approach to stochastic mortality modeling and life expectancy computation. The results show that: (i) sustainability factors will generate substantial pension entitlement reductions in the three countries analyzed; (ii) the magnitude of the pension losses depends on the factor design; (iii) to offset pension cuts and safeguard pension adequacy, individuals will have to prolong their working lives significantly; and (iv) factor designs considering cohort longevity markers would have generated higher pension cuts in countries with an increasing life expectancy gap.
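    The model averaging step can be illustrated with a small sketch in which posterior model probabilities are approximated from information criteria and used to average model forecasts; the BIC-based weights, candidate models, and numbers below are hypothetical and do not reproduce the paper's mortality models or results.

        # Illustrative sketch: posterior model weights approximated from BIC scores and
        # used to average forecasts; the models, numbers, and priors are hypothetical.
        import numpy as np

        def bma_weights(bic_scores):
            """Approximate posterior model probabilities, w_k proportional to exp(-BIC_k / 2)."""
            bic = np.asarray(bic_scores, dtype=float)
            w = np.exp(-(bic - bic.min()) / 2.0)          # shift for numerical stability
            return w / w.sum()

        # Hypothetical example: three candidate mortality models with BIC scores and
        # life-expectancy-at-65 forecasts.
        bics = [1520.3, 1523.8, 1531.1]
        forecasts = [21.4, 21.9, 22.6]
        w = bma_weights(bics)
        print(w, np.dot(w, forecasts))                    # model-averaged forecast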

    A Time-Series Compression Technique and its Application to the Smart Grid

    Time-series data is increasingly collected in many domains. One example is the smart electricity infrastructure, which generates huge volumes of such data from sources such as smart electricity meters. Although today this data is mostly used for visualization and billing at 15-minute resolution, its original temporal resolution is frequently much finer-grained, e.g., seconds. This is useful for various analytical applications such as short-term forecasting, disaggregation, and visualization. However, transmitting and storing huge amounts of such fine-grained data is in many cases prohibitively expensive in terms of storage space. In this article, we present a compression technique based on piecewise regression, together with two methods that describe the performance of the compression. Although our technique is a general approach for time-series compression, smart grids serve as our running example and as our evaluation scenario. Depending on the data and the use-case scenario, the technique achieves compression ratios of up to 5,000 while maintaining the data's usefulness for analytics. The proposed technique has outperformed related work and has been applied to three real-world energy datasets in different scenarios. Finally, we show that the proposed compression technique can be implemented in a state-of-the-art database management system.
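    The underlying idea of piecewise-regression compression can be sketched as follows: grow each segment greedily while a linear fit stays within a maximum-deviation tolerance, and store only the segment boundaries and fit coefficients. The greedy segmentation rule and the tolerance below are illustrative assumptions, not the exact algorithm from the article.

        # Illustrative sketch, not the article's exact algorithm: grow each segment while
        # a linear fit stays within a maximum-deviation tolerance, store only boundaries
        # and coefficients.
        import numpy as np

        def compress(values, tolerance):
            """Greedy piecewise-linear approximation within a maximum-deviation tolerance."""
            segments, start, n = [], 0, len(values)
            while start < n:
                if start == n - 1:                        # lone trailing point, stored as a constant
                    segments.append((start, n, 0.0, float(values[start])))
                    break
                end = start + 2                           # a line always fits two points exactly
                coeffs = np.polyfit([start, start + 1], values[start:end], deg=1)
                while end < n:
                    x = np.arange(start, end + 1)
                    trial = np.polyfit(x, values[start:end + 1], deg=1)
                    if np.abs(np.polyval(trial, x) - values[start:end + 1]).max() > tolerance:
                        break
                    coeffs, end = trial, end + 1
                segments.append((start, end, float(coeffs[0]), float(coeffs[1])))
                start = end
            return segments

        def decompress(segments, n):
            out = np.empty(n)
            for start, end, slope, intercept in segments:
                x = np.arange(start, end)
                out[start:end] = slope * x + intercept
            return out

        t = np.linspace(0, 6 * np.pi, 2000)
        signal = np.sin(t) + 0.01 * np.random.default_rng(3).normal(size=t.size)
        segs = compress(signal, tolerance=0.05)
        restored = decompress(segs, signal.size)
        print(len(segs), "segments,", float(np.abs(restored - signal).max()))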

    Feature-based Time Series Analytics

    Time series analytics is a fundamental prerequisite for decision-making as well as automation and occurs in several applications such as energy load control, weather research, and consumer behavior analysis. It encompasses time series engineering, i.e., the representation of time series by their important characteristics, and data mining, i.e., the application of the representation to a specific task. Due to the exhaustive data gathering that results from the "Industry 4.0" vision and its shift towards automation and digitalization, time series analytics is undergoing a revolution. Big datasets with very long time series are gathered, which is challenging for engineering techniques. Traditionally, one focus has been on raw-data-based or shape-based engineering, which assesses the time series' similarity in shape and is only suitable for short time series. Another focus has been on model-based engineering, which assesses the time series' similarity in structure and is suitable for long time series, but requires larger models or time-consuming modeling. Feature-based engineering tackles these challenges by efficiently representing time series and comparing their similarity in structure. However, current feature-based techniques are unsatisfactory as they are designed for specific data-mining tasks. In this work, we introduce a novel feature-based engineering technique. It efficiently provides a short representation of time series, focusing on their structural similarity. Based on a design rationale, we derive important time series characteristics such as long-term and cyclically repeated characteristics as well as distribution and correlation characteristics. Moreover, we define a feature-based distance measure for their comparison. Both the representation technique and the distance measure provide desirable properties regarding storage and runtime. Subsequently, we introduce techniques based on our feature-based engineering and apply them to important data-mining tasks such as time series generation, time series matching, time series classification, and time series clustering. First, our feature-based generation technique outperforms state-of-the-art techniques regarding the accuracy of the evolved datasets. Second, with our features, a matching method retrieves a match for a time series query much faster than with current representations. Third, our features provide discriminative characteristics to classify datasets as accurately as state-of-the-art techniques, but orders of magnitude faster. Finally, our features recommend an appropriate clustering of time series, which is crucial for subsequent data-mining tasks. All these techniques are assessed on datasets from the energy, weather, and economic domains, and thus demonstrate their applicability to real-world use cases. The findings demonstrate the versatility of our feature-based engineering and suggest several courses of action for designing and improving analytical systems for the paradigm shift of Industry 4.0.
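    A toy sketch of what a feature-based representation and distance look like in practice, assuming a handful of placeholder characteristics (mean, spread, trend, lag-1 autocorrelation, dominant period); the thesis's actual feature set and distance measure are not reproduced here.

        # Toy sketch with placeholder characteristics; the thesis's feature set and
        # distance measure are not reproduced here.
        import numpy as np

        def features(series):
            s = np.asarray(series, dtype=float)
            x = np.arange(s.size)
            slope = np.polyfit(x, s, deg=1)[0]            # long-term trend
            acf1 = np.corrcoef(s[:-1], s[1:])[0, 1]       # short-term correlation structure
            spectrum = np.abs(np.fft.rfft(s - s.mean()))
            dominant_period = s.size / (int(spectrum[1:].argmax()) + 1)   # cyclic behaviour
            return np.array([s.mean(), s.std(), slope, acf1, dominant_period])

        def feature_distance(a, b):
            return float(np.linalg.norm(features(a) - features(b)))   # structural, not shape-based

        rng = np.random.default_rng(4)
        daily = np.sin(np.arange(1000) * 2 * np.pi / 24) + 0.1 * rng.normal(size=1000)
        weekly = np.sin(np.arange(1000) * 2 * np.pi / 168) + 0.1 * rng.normal(size=1000)
        print(feature_distance(daily, weekly), feature_distance(daily, daily))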

    Some Recent Developments on Nonparametric Econometrics

    In this paper, we survey some recent developments of nonparametric econometrics in the following areas: (i) nonparametric estimation of regression models with mixed discrete and continuous data; (ii) nonparametric models with nonstationary data; (iii) nonparametric models with instrumental variables; and (iv) nonparametric estimation of conditional quantile functions. In each of the above areas, we also point out some open research problems. This paper was published in Advances in Econometrics, Volume 25 (2009), 495–549.
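    As a brief illustration of the basic building block behind several of the surveyed methods, the local-constant (Nadaraya-Watson) kernel regression estimator can be written in a few lines; the Gaussian kernel and the fixed bandwidth below are illustrative choices for this sketch, not recommendations from the survey.

        # Illustrative sketch: the local-constant (Nadaraya-Watson) estimator with a
        # Gaussian kernel and a fixed bandwidth chosen for the example.
        import numpy as np

        def nadaraya_watson(x_train, y_train, x_eval, bandwidth):
            """m_hat(x) = sum_i K((x - x_i)/h) y_i / sum_i K((x - x_i)/h)."""
            u = (x_eval[:, None] - x_train[None, :]) / bandwidth
            weights = np.exp(-0.5 * u**2)                 # Gaussian kernel
            return (weights * y_train).sum(axis=1) / weights.sum(axis=1)

        rng = np.random.default_rng(5)
        x = np.sort(rng.uniform(0, 10, size=300))
        y = np.sin(x) + 0.3 * rng.normal(size=x.size)
        grid = np.linspace(0, 10, 5)
        print(nadaraya_watson(x, y, grid, bandwidth=0.5))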