Highly comparative feature-based time-series classification
A highly comparative, feature-based approach to time series classification is
introduced that uses an extensive database of algorithms to extract thousands
of interpretable features from time series. These features are derived from
across the scientific time-series analysis literature, and include summaries of
time series in terms of their correlation structure, distribution, entropy,
stationarity, scaling properties, and fits to a range of time-series models.
After computing thousands of features for each time series in a training set,
those that are most informative of the class structure are selected using
greedy forward feature selection with a linear classifier. The resulting
feature-based classifiers automatically learn the differences between classes
using a reduced number of time-series properties, and circumvent the need to
calculate distances between time series. Representing time series in this way
results in orders of magnitude of dimensionality reduction, allowing the method
to perform well on very large datasets containing long time series or time
series of different lengths. For many of the datasets studied, classification
performance exceeded that of conventional instance-based classifiers,
including one-nearest-neighbor classifiers using Euclidean distances and
dynamic time warping. Most importantly, the selected features provide an
understanding of the properties of the dataset, insight that can guide
further scientific investigation.
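As a rough illustration of the approach described above, the sketch below computes a handful of interpretable features (stand-ins for the thousands of literature-derived features in the paper) and then runs greedy forward feature selection scored by a simple linear nearest-centroid rule. The specific features and the toy two-class data are assumptions for the demo, not the paper's actual feature library.

```python
# Minimal sketch: feature-based time-series classification with greedy
# forward feature selection. Feature set and data are illustrative only.
import numpy as np

def extract_features(ts):
    """A few interpretable summaries: distribution, correlation, roughness."""
    diffs = np.diff(ts)
    return np.array([
        np.mean(ts),                          # distribution: location
        np.std(ts),                           # distribution: spread
        np.corrcoef(ts[:-1], ts[1:])[0, 1],   # lag-1 autocorrelation
        np.mean(np.abs(diffs)),               # local variability
    ])

def greedy_forward_select(X, y, n_select=2):
    """Greedily add the feature that most improves a nearest-centroid rule."""
    selected, remaining = [], list(range(X.shape[1]))
    def accuracy(feats):
        Z = X[:, feats]
        c0, c1 = Z[y == 0].mean(0), Z[y == 1].mean(0)
        pred = np.linalg.norm(Z - c1, axis=1) < np.linalg.norm(Z - c0, axis=1)
        return np.mean(pred == y)
    for _ in range(n_select):
        best = max(remaining, key=lambda f: accuracy(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
# Class 0: white noise; class 1: random walks (much higher autocorrelation).
series = [rng.normal(size=200) for _ in range(20)] + \
         [np.cumsum(rng.normal(size=200)) for _ in range(20)]
X = np.array([extract_features(s) for s in series])
y = np.array([0] * 20 + [1] * 20)
print(greedy_forward_select(X, y))
```

Because each time series is reduced to four numbers before classification, distances between raw series are never computed, mirroring the dimensionality-reduction benefit the abstract describes.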
OFTER: An Online Pipeline for Time Series Forecasting
We introduce OFTER, a time series forecasting pipeline tailored for mid-sized
multivariate time series. OFTER utilizes the non-parametric models of k-nearest
neighbors and Generalized Regression Neural Networks, integrated with a
dimensionality reduction component. To circumvent the curse of dimensionality,
we employ a weighted norm based on a modified version of the maximal
correlation coefficient. The pipeline we introduce is specifically designed for
online tasks, has an interpretable output, and is able to outperform several
state-of-the-art baselines. The computational efficiency of the algorithm, its
online nature, and its ability to operate in low signal-to-noise regimes,
render OFTER an ideal approach for financial multivariate time series problems,
such as daily equity forecasting. Our work demonstrates that while deep
learning models hold significant promise for time series forecasting,
traditional methods carefully integrating mainstream tools remain very
competitive alternatives with the added benefits of scalability and
interpretability.
Comment: 26 pages, 12 figures
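The weighted-norm k-NN idea at the heart of the pipeline can be sketched as follows. OFTER weights the norm with a modified maximal correlation coefficient; here plain absolute Pearson correlation between each predictor and the target is substituted, which is an assumption for illustration rather than OFTER's exact construction.

```python
# Hedged sketch of weighted-norm k-NN forecasting in the spirit of OFTER.
# Weights: |Pearson correlation| of each predictor with the target
# (a stand-in for the modified maximal correlation coefficient).
import numpy as np

def knn_forecast(X_train, y_train, x_query, k=5):
    # Per-feature weights from correlation with the target.
    w = np.abs([np.corrcoef(X_train[:, j], y_train)[0, 1]
                for j in range(X_train.shape[1])])
    w = w / w.sum()
    # Weighted Euclidean distance to every training point.
    d = np.sqrt(((X_train - x_query) ** 2 * w).sum(axis=1))
    nearest = np.argsort(d)[:k]
    return y_train[nearest].mean()

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=300)   # only feature 0 matters
pred = knn_forecast(X[:-1], y[:-1], X[-1])
print(pred)
```

Down-weighting uninformative coordinates this way is what lets a nearest-neighbor method survive in the low signal-to-noise, moderately high-dimensional regime the abstract targets.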
Efficient similarity search in high-dimensional data spaces
Similarity search in high-dimensional data spaces is a popular paradigm for many modern database applications, such as content based image retrieval, time series analysis in financial and marketing databases, and data mining. Objects are represented as high-dimensional points or vectors based on their important features. Object similarity is then measured by the distance between feature vectors and similarity search is implemented via range queries or k-Nearest Neighbor (k-NN) queries.
Implementing k-NN queries via a sequential scan of large tables of feature vectors is computationally expensive. Building multi-dimensional indexes on the feature vectors for k-NN search also tends to be unsatisfactory when the dimensionality is high. This is due to the poor index performance caused by the dimensionality curse.
Dimensionality reduction using the Singular Value Decomposition (SVD) method is the approach adopted in this study to deal with high-dimensional data. However, because the data distributions of many real-world datasets tend to be heterogeneous, dimensionality reduction applied to the entire dataset may cause a significant loss of information. A more efficient representation is sought by clustering the data into homogeneous subsets of points and applying dimensionality reduction to each cluster separately, i.e., utilizing local rather than global dimensionality reduction.
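The local strategy just described can be sketched in a few lines: partition the data with k-means, then run an SVD within each cluster and keep only the leading singular directions. This is a minimal illustration on assumed synthetic data, not the thesis's CSVD implementation.

```python
# Sketch of local dimensionality reduction: cluster into homogeneous
# subsets, then apply SVD-based reduction within each cluster rather
# than to the whole (heterogeneous) dataset.
import numpy as np

def local_svd_reduce(X, n_clusters=2, n_dims=1, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), n_clusters, replace=False)]
    for _ in range(n_iter):  # plain k-means (Lloyd iterations)
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        centers = np.array([X[labels == c].mean(0) for c in range(n_clusters)])
    reduced = {}
    for c in range(n_clusters):
        Xc = X[labels == c] - centers[c]
        # Per-cluster SVD: project onto the top n_dims right singular vectors.
        _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
        reduced[c] = Xc @ Vt[:n_dims].T
    return labels, reduced

rng = np.random.default_rng(2)
A = rng.normal(size=(100, 5)) + 10      # two well-separated groups
B = rng.normal(size=(100, 5)) - 10
X = np.vstack([A, B])
labels, reduced = local_svd_reduce(X)
```

A single global SVD of `X` would spend its leading direction on the offset between the two groups; the per-cluster SVDs instead capture the variation inside each homogeneous subset, which is the point of going local.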
This thesis addresses improving the efficiency of query processing in local dimensionality reduction methods such as Clustering and Singular Value Decomposition (CSVD) and Local Dimensionality Reduction (LDR). Variations in the implementation of CSVD are considered, and the two methods are compared in terms of compression ratio, CPU time, and retrieval efficiency.
An exact k-NN algorithm is presented for local dimensionality reduction methods by extending an existing multi-step k-NN search algorithm designed for global dimensionality reduction. Experimental results show that the new method requires less CPU time than the approximate method originally proposed for CSVD at a comparable level of accuracy.
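The multi-step (filter-and-refine) idea behind such exact algorithms can be sketched as follows: distances in an orthogonal SVD subspace lower-bound full-space Euclidean distances, so candidates are scanned in lower-bound order and refined exactly, stopping once the k-th exact distance is no larger than the next lower bound. This is a generic global-reduction sketch, not the thesis's extended local-method algorithm.

```python
# Sketch of multi-step exact k-NN via SVD lower bounds.
import numpy as np

def multistep_knn(X, q, k=3, n_dims=2):
    mu = X.mean(0)
    Xc, qc = X - mu, q - mu
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_dims].T                               # orthonormal basis
    lb = np.linalg.norm(Xc @ P - qc @ P, axis=1)    # lower-bound distances
    exact = []                                      # (true distance, index)
    for i in np.argsort(lb):                        # scan in lower-bound order
        if len(exact) >= k and sorted(exact)[k - 1][0] <= lb[i]:
            break         # no unseen point can enter the top k: stop early
        exact.append((np.linalg.norm(Xc[i] - qc), i))
    exact.sort()
    return [i for _, i in exact[:k]]

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
q = rng.normal(size=8)
print(multistep_knn(X, q, k=3))
```

Because the projection basis is orthonormal, the subspace distance never exceeds the true distance, which is exactly what makes the early-termination test safe and the result exact rather than approximate.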
Optimal subspace dimensionality reduction aims to minimize total query cost. The problem is complicated by the fact that each cluster can retain a different number of dimensions. A hybrid method is presented, combining the best features of the CSVD and LDR methods, to find optimal subspace dimensionalities for clusters generated by local dimensionality reduction methods. The experiments show that the proposed method works well for both real-world and synthetic datasets.
Dynamic Mixtures of Factor Analyzers to Characterize Multivariate Air Pollutant Exposures
The assessment of pollution exposure is based on the analysis
of multivariate time series that include the concentrations of several
pollutants as well as the measurements of multiple atmospheric variables.
It typically requires methods of dimensionality reduction that
are capable of identifying potentially dangerous combinations of pollutants
and, simultaneously, of segmenting exposure periods according
to air quality conditions. When the data are high-dimensional, however,
efficient methods of dimensionality reduction are challenging
because of the formidable structure of cross-correlations that arise
from the dynamic interaction between weather conditions and natural/anthropogenic
pollution sources. In order to assess pollution exposure
in an urban area while taking the above-mentioned difficulties
into account, we develop a class of parsimonious hidden Markov
models. In a multivariate time-series setting, this approach allows us to
simultaneously perform temporal segmentation and dimensionality
reduction. We specifically approximate the distribution of multiple
pollutant concentrations by mixtures of factor analysis models, whose
parameters evolve according to a latent Markov chain. Covariates are
included as predictors of the chain transition probabilities. Parameter
constraints on the factorial component of the model are exploited
to tune the flexibility of dimensionality reduction. In order to estimate
the model parameters efficiently, we propose a novel three-step
Alternating Expected Conditional Maximization (AECM) algorithm,
which is also assessed in a simulation study. In the case study, the
proposed methods were able (1) to describe the exposure to pollution
in terms of a few latent regimes, (2) to associate these regimes
with specific combinations of pollutant concentration levels as well
as distinct correlation structures between concentrations, and (3) to
capture the influence of weather conditions on transitions between
regimes.
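The model structure being fitted can be illustrated generatively: a hidden Markov chain switches between regimes, and within regime k the high-dimensional observation is y_t = Lambda_k f_t + e_t with a few latent factors f_t, so the emission covariance is Sigma_k = Lambda_k Lambda_k' + Psi_k. The toy parameters below are assumptions for illustration; this is not the paper's AECM estimator.

```python
# Generative sketch of a dynamic mixture of factor analyzers:
# a 2-state hidden Markov chain with factor-analysis emissions.
import numpy as np

rng = np.random.default_rng(4)
p, q, T = 6, 2, 300                   # observed dim, latent factors, length
A = np.array([[0.95, 0.05],           # transition matrix of the latent chain
              [0.10, 0.90]])
Lambda = [rng.normal(size=(p, q)),    # regime-specific factor loadings
          3 * rng.normal(size=(p, q))]
Psi = [0.1 * np.eye(p), 0.1 * np.eye(p)]   # regime-specific noise covariances

state, ys = 0, []
for t in range(T):
    f = rng.normal(size=q)                          # latent factors
    e = rng.multivariate_normal(np.zeros(p), Psi[state])
    ys.append(Lambda[state] @ f + e)                # y_t = Lambda_k f_t + e_t
    state = rng.choice(2, p=A[state])               # Markov transition
Y = np.array(ys)

# Implied emission covariance per regime: Sigma_k = Lambda_k Lambda_k' + Psi_k
Sigma0 = Lambda[0] @ Lambda[0].T + Psi[0]
```

Each regime needs only p*q loading parameters plus noise variances instead of a full p-by-p covariance, which is the parsimony that makes dimensionality reduction and temporal segmentation feasible jointly.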