Optimal Time-Series Motifs
Motifs are the most repetitive and frequent patterns of a time series. The
discovery of motifs is crucial for practitioners who need to understand and
interpret the phenomena occurring in sequential data. Currently, motifs are
searched for among series sub-sequences, with the aim of selecting the most
frequently occurring ones. Search-based methods, which try out series
sub-sequences as motif candidates, are currently believed to be the best
methods for finding the most frequent patterns.
However, this paper proposes an entirely new perspective in finding motifs.
We demonstrate that searching is non-optimal since the domain of motifs is
restricted, and instead we propose a principled optimization approach able to
find optimal motifs. We treat the occurrence frequency as a function and
time-series motifs as its parameters, therefore we \textit{learn} the optimal
motifs that maximize the frequency function. In contrast to searching, our
method is able to discover the most repetitive patterns (and hence is optimal),
even in cases where they do not explicitly occur as sub-sequences. Experiments
on several real-life time-series datasets show that the motifs found by our
method are far more frequent than those found through searching, for exactly
the same distance threshold.
Comment: Submitted to KDD201
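The occurrence-frequency objective described above can be made concrete with a small sketch (function names, the Euclidean distance, and the toy data are illustrative, not taken from the paper): a candidate motif's frequency is the number of series sub-sequences within a distance threshold of it, and a search-based method only ever evaluates this objective at the sub-sequences themselves.

```python
import numpy as np

def motif_frequency(series, candidate, threshold):
    """Count sub-sequences of `series` within Euclidean `threshold` of `candidate`."""
    m = len(candidate)
    count = 0
    for i in range(len(series) - m + 1):
        window = series[i:i + m]
        if np.linalg.norm(window - candidate) <= threshold:
            count += 1
    return count

# A search-based method restricts candidates to actual sub-sequences:
def best_searched_motif(series, m, threshold):
    candidates = [series[i:i + m] for i in range(len(series) - m + 1)]
    return max(candidates, key=lambda c: motif_frequency(series, c, threshold))
```

The paper's perspective is to instead treat `candidate` as free parameters and maximize `motif_frequency` over them (via a smooth surrogate of the count), so the optimum need not coincide with any observed sub-sequence.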
Mining Constrained Gradients
Many data analysis tasks can be viewed as search or mining in a multidimensional space (MDS). In such MDSs, dimensions capture potentially important factors for given applications, and cells represent combinations of values for the factors. To systematically analyze data in MDS, an interesting notion, called "cubegrade" was recently introduced by Imielinski et al. [14], which focuses on the notable changes in measures in MDS by comparing a cell (which we refer to as probe cell) with its gradient cells, namely, its ancestors, descendants, and siblings. We call such queries gradient analysis queries (GQs). Since an MDS can contain billions of cells, it is important to answer GQs efficiently. In this study, we focus on developing efficient methods for mining GQs constrained by certain (weakly) antimonotone constraints. Instead of conducting an independent gradient-cell search once per probe cell, which is inefficient due to much repeated work, we propose an efficient algorithm, LiveSet-Driven. This algorithm finds all good gradient-probe cell pairs in one search pass. It utilizes measure-value analysis and dimension-match analysis in a set-oriented manner, to achieve bidirectional pruning between the sets of hopeful probe cells and of hopeful gradient cells. Moreover, it adopts a hypertree structure and an H-cubing method to compress data and to maximize sharing of computation. Our performance study shows that this algorithm is efficient and scalable. In addition to data cubes, we extend our study to another important scenario: mining constrained gradients in transactional databases where each item is associated with some measures such as price. Such transactional databases can be viewed as sparse MDSs where items represent dimensions, although they have signi..
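A gradient query of the kind described above can be illustrated on a toy cube (the dimension values, measures, and the simple ratio constraint are illustrative; this is not the LiveSet-Driven algorithm itself, which avoids the per-probe-cell search shown here):

```python
# Toy multidimensional cells: a tuple of dimension values plus an
# aggregated measure. '*' denotes a generalized ("any") value, so a cell
# with '*' in some dimension is an ancestor of a cell fixing that dimension.
cells = {
    ("2023", "NY", "laptop"): 100.0,
    ("2023", "NY", "*"):      400.0,   # ancestor over product
    ("2023", "*", "laptop"):  250.0,   # ancestor over city
}

def is_ancestor(anc, cell):
    """anc generalizes cell: every dimension is equal or '*'."""
    return all(a == c or a == "*" for a, c in zip(anc, cell)) and anc != cell

def gradient_pairs(cells, probe, min_ratio):
    """Yield (ancestor, ratio) pairs whose measure ratio passes the constraint."""
    base = cells[probe]
    for cell, measure in cells.items():
        if is_ancestor(cell, probe) and measure / base >= min_ratio:
            yield cell, measure / base

probe = ("2023", "NY", "laptop")
results = dict(gradient_pairs(cells, probe, min_ratio=3.0))
```

The point of the set-oriented pruning in the abstract is precisely to avoid running such an independent loop once per probe cell over billions of cells.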
Visual Feature Fusion and its Application to Support Unsupervised Clustering Tasks
In visual analytics applications, the concept of putting the user in the loop
refers to the ability to replace heuristics with user knowledge in machine
learning and data mining tasks. In supervised tasks, user engagement occurs
via the manipulation of the training data. However, in unsupervised tasks,
user involvement is limited to changes in the algorithm parametrization or the
input data representation, also known as features. Depending on the application
domain, different types of features can be extracted from the raw data.
Therefore, the result of unsupervised algorithms heavily depends on the type of
feature employed. Since there is no perfect feature extractor, combining
different features has been explored in a process called feature fusion.
Feature fusion is straightforward when the machine learning or data mining task
has a cost function. However, when such a function does not exist, user support
for the combination needs to be provided; otherwise the process is impractical.
In this paper, we present a novel feature fusion approach that uses small data
samples to allow users not only to effortlessly control the combination of
different feature sets but also to interpret the attained results. The
effectiveness of our approach is confirmed by a comprehensive set of
qualitative and quantitative tests, opening up different possibilities for
user-guided analytical scenarios not covered yet. The ability of our approach
to provide real-time feedback for the feature fusion is exploited in the
context of unsupervised clustering techniques, where the composed groups
reflect the semantics of the feature combination.
Comment: 15 pages, 21 Figures
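The user-controlled combination described above can be sketched as a weighted concatenation of per-extractor feature sets (the normalization and the weight semantics are illustrative assumptions; the paper's approach additionally drives the weights through visual feedback on small data samples):

```python
import numpy as np

def fuse_features(feature_sets, weights):
    """Concatenate normalized feature sets, scaled by user-chosen weights.

    feature_sets: list of (n_samples, d_i) arrays from different extractors.
    weights: one non-negative scalar per feature set, steering its influence
    on any downstream distance-based task (e.g. clustering).
    """
    fused = []
    for X, w in zip(feature_sets, weights):
        # Normalize each set so no extractor dominates by scale alone.
        norm = np.linalg.norm(X, axis=1, keepdims=True)
        norm[norm == 0] = 1.0
        fused.append(w * X / norm)
    return np.hstack(fused)
```

Raising one weight makes distances in the fused space, and therefore the composed clusters, reflect that feature set more strongly, which is the "semantics of the feature combination" the abstract refers to.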
Theoretical foundations of emergent constraints: relationships between climate sensitivity and global temperature variability in conceptual models
There is as yet no theoretical framework to guide the search for emergent
constraints. As a result, there are significant risks that indiscriminate
data-mining of the multidimensional outputs from GCMs could lead to spurious
correlations and less-than-robust constraints on future changes. To mitigate
this risk, Cox et al (hereafter CHW18) proposed a theory-motivated
emergent constraint, using the one-box Hasselmann model to identify a linear
relationship between ECS and a metric of global temperature variability
involving both the temperature standard deviation and autocorrelation. A
number of doubts have been raised about this approach, some concerning the
theory and the application of the one-box model to understand relationships in
complex GCMs, which are known to have more than a single characteristic
timescale. We illustrate theory-driven testing of emergent constraints using
this as an example: we demonstrate that the linear proportionality between
this variability metric and ECS is not an artifact of the one-box model and
holds, to a good approximation, in more realistic yet still analytically
soluble conceptual models, namely the two-box and diffusion models. Each of the
conceptual models predicts a different power spectrum, with only the diffusion
model's pink spectrum being compatible with observations and the complex CMIP5
GCMs. We also show that the theoretically predicted relationship between the
variability metric and ECS exists in the \texttt{piControl} as well as
\texttt{historical} CMIP5 experiments, and that the differing gradients of the
proportionality are inversely related to the effective forcing in each
experiment.
Comment: 12 pages, 4 figures, accepted for publication in Dynamics and
Statistics of the Climate System. V2 with rewritten abstract to conform to
journal style and references no longer corrupted
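For reference, the one-box Hasselmann model invoked above is the stochastic energy-balance equation below; the variability metric is written here in the form commonly used in this literature (the exact definition in CHW18 may differ in detail):

```latex
% One-box Hasselmann model: heat capacity C, climate feedback parameter
% \lambda, stochastic (white-noise) forcing Q(t):
C \frac{dT}{dt} = -\lambda T + Q(t)
% Equilibrium climate sensitivity for a CO2-doubling forcing F_{2\times}:
\mathrm{ECS} = F_{2\times} / \lambda
% Variability metric built from the standard deviation \sigma_T and lag-1
% autocorrelation \alpha_{1T} of annual global temperature:
\Psi = \frac{\sigma_T}{\sqrt{-\ln \alpha_{1T}}}
% The one-box model then predicts the linear proportionality \Psi \propto \mathrm{ECS}.
```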
Scalable Gaussian Processes for Characterizing Multidimensional Change Surfaces
We present a scalable Gaussian process model for identifying and
characterizing smooth multidimensional changepoints, and automatically learning
changes in expressive covariance structure. We use Random Kitchen Sink features
to flexibly define a change surface in combination with expressive spectral
mixture kernels to capture the complex statistical structure. Finally, through
the use of novel methods for additive non-separable kernels, we can scale the
model to large datasets. We demonstrate the model on numerical and real-world
data, including a large spatio-temporal disease dataset where we identify
previously unknown heterogeneous changes in space and time.
Comment: 18 pages, 8 figures
Change Surfaces for Expressive Multidimensional Changepoints and Counterfactual Prediction
Identifying changes in model parameters is fundamental in machine learning
and statistics. However, standard changepoint models are limited in
expressiveness, often addressing unidimensional problems and assuming
instantaneous changes. We introduce change surfaces as a multidimensional and
highly expressive generalization of changepoints. We provide a model-agnostic
formalization of change surfaces, illustrating how they can provide variable,
heterogeneous, and non-monotonic rates of change across multiple dimensions.
Additionally, we show how change surfaces can be used for counterfactual
prediction. As a concrete instantiation of the change surface framework, we
develop Gaussian Process Change Surfaces (GPCS). We demonstrate counterfactual
prediction with Bayesian posterior mean and credible sets, as well as massive
scalability by introducing novel methods for additive non-separable kernels.
Using two large spatio-temporal datasets we employ GPCS to discover and
characterize complex changes that can provide scientific and policy relevant
insights. Specifically, we analyze twentieth century measles incidence across
the United States and discover previously unknown heterogeneous changes after
the introduction of the measles vaccine. Additionally, we apply the model to
requests for lead testing kits in New York City, discovering distinct spatial
and demographic patterns.
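The change-surface idea above can be sketched with a minimal covariance function (the sigmoid warping and RBF kernels here are illustrative stand-ins for the paper's Random Kitchen Sinks parameterization and spectral mixture kernels): a smooth weighting function s(x) in [0, 1] blends two covariance regimes, so change can be gradual rather than instantaneous and can vary across input dimensions.

```python
import numpy as np

def rbf(X, Z, lengthscale):
    """Squared-exponential kernel between row-wise point sets X and Z."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def change_surface_kernel(X, Z, lengthscale1, lengthscale2, w, b):
    """Blend two RBF regimes through a sigmoid change surface s(x)."""
    def s(U):
        return 1.0 / (1.0 + np.exp(-(U @ w + b)))  # soft region membership
    s_x, s_z = s(X), s(Z)
    K1 = rbf(X, Z, lengthscale1)
    K2 = rbf(X, Z, lengthscale2)
    # Each regime's kernel is weighted by how much both points lie in it.
    return s_x[:, None] * K1 * s_z[None, :] + \
           (1 - s_x)[:, None] * K2 * (1 - s_z)[None, :]
```

Far from the change surface one regime dominates; near it, the covariance transitions smoothly, which is what allows heterogeneous, non-instantaneous changes to be captured.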
RetainVis: Visual Analytics with Interpretable and Interactive Recurrent Neural Networks on Electronic Medical Records
We have recently seen many successful applications of recurrent neural
networks (RNNs) on electronic medical records (EMRs), which contain histories
of patients' diagnoses, medications, and other various events, in order to
predict the current and future states of patients. Despite the strong
performance of RNNs, it is often challenging for users to understand why the
model makes a particular prediction. This black-box nature of RNNs can impede
their wide adoption in clinical practice. Furthermore, there are no established
methods to interactively leverage users' domain expertise and prior knowledge
as inputs for steering the model. Therefore, our design study aims to provide a
visual analytics solution to increase interpretability and interactivity of
RNNs via a joint effort of medical experts, artificial intelligence scientists,
and visual analytics researchers. Following the iterative design process
between the experts, we design, implement, and evaluate a visual analytics tool
called RetainVis, which couples a newly improved, interpretable and interactive
RNN-based model called RetainEX and visualizations for users' exploration of
EMR data in the context of prediction tasks. Our study shows the effective use
of RetainVis for gaining insights into how individual medical codes contribute
to making risk predictions, using EMRs of patients with heart failure and
cataract symptoms. Our study also demonstrates how we made substantial changes
to the state-of-the-art RNN model called RETAIN in order to make use of
temporal information and increase interactivity. This study will provide a
useful guideline for researchers who aim to design an interpretable and
interactive visual analytics tool for RNNs.
Comment: Accepted at IEEE VIS 2018. To appear in IEEE Transactions on
Visualization and Computer Graphics in January 201
From BOP to BOSS and Beyond: Time Series Classification with Dictionary Based Classifiers
A family of algorithms for time series classification (TSC) involve running a
sliding window across each series, discretising the window to form a word,
forming a histogram of word counts over the dictionary, then constructing a
classifier on the histograms. A recent evaluation of two algorithms of this
type, Bag of Patterns (BOP) and Bag of Symbolic Fourier Approximation Symbols
(BOSS), found a significant difference in accuracy between these
seemingly similar algorithms. We investigate this phenomenon by deconstructing
the classifiers and measuring the relative importance of the four key
components between BOP and BOSS. We find that whilst ensembling is a key
component for both algorithms, the effect of the other components is mixed and
more complex. We conclude that BOSS represents the state of the art for
dictionary based TSC. Both BOP and BOSS can be classed as bag of words
approaches. These are particularly popular in Computer Vision for tasks such as
image classification. Converting approaches from vision requires careful
engineering. We adapt three techniques used in Computer Vision for TSC: Scale
Invariant Feature Transform; Spatial Pyramids; and Histogram Intersection. We
find that using Spatial Pyramids in conjunction with BOSS (SP) produces a
significantly more accurate classifier. SP is significantly more accurate than
standard benchmarks and the original BOSS algorithm. It is not significantly
worse than the best shapelet-based approach, and is only outperformed by
HIVE-COTE, an ensemble that includes BOSS as a constituent module.
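The dictionary pipeline described above (sliding window → word → histogram) can be sketched as follows; the discretization here is a simple mean-split binning for illustration, not the SAX or SFA transforms that BOP and BOSS actually use.

```python
import numpy as np
from collections import Counter

def series_to_histogram(series, window, word_len):
    """Slide a window, discretize it into a word, and count word frequencies."""
    words = []
    for i in range(len(series) - window + 1):
        seg = np.asarray(series[i:i + window], dtype=float)
        seg = (seg - seg.mean()) / (seg.std() + 1e-9)   # z-normalize the window
        # Split into word_len chunks; letter 'a' if the chunk mean is
        # below zero, else 'b' (a two-symbol alphabet for simplicity).
        chunks = np.array_split(seg, word_len)
        words.append("".join("a" if c.mean() < 0 else "b" for c in chunks))
    return Counter(words)
```

A classifier is then built on these histograms, e.g. by nearest-neighbour comparison of the word-count vectors, which is where BOP and BOSS diverge in their design choices.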
A Moving Least Squares Based Approach for Contour Visualization of Multi-Dimensional Data
Analysis of high dimensional data is a common task. Often, small multiples
are used to visualize 1 or 2 dimensions at a time, such as in a scatterplot
matrix. Associating data points between different views can be difficult
though, as the points are not fixed. Other times, dimensional reduction
techniques are employed to summarize the whole dataset in one image, but
individual dimensions are lost in this view. In this paper, we present a means
of augmenting a dimensional reduction plot with isocontours to reintroduce the
original dimensions. By applying this to each dimension in the original data,
we create multiple views where the points are consistent, which facilitates
their comparison. Our approach employs a combination of a novel, graph-based
projection technique with a GPU accelerated implementation of moving least
squares to interpolate the space between the points. We also present
evaluations of this approach both with a case study and with a user study.
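The interpolation step described above can be illustrated with the degree-zero case of moving least squares, which reduces to Shepard (inverse-distance-weighted) interpolation of one original dimension's values over the 2D projection (a deliberate simplification of the paper's GPU-accelerated MLS implementation):

```python
import numpy as np

def shepard_interpolate(points, values, query, power=2.0, eps=1e-12):
    """Inverse-distance-weighted estimate of a scalar field at `query`.

    points: (n, 2) projected point positions; values: (n,) one original
    dimension's value per point. Evaluating this field on a grid of queries
    and drawing its isocontours reintroduces that dimension into the
    projection plot.
    """
    d = np.linalg.norm(points - query, axis=1)
    if np.any(d < eps):                      # query coincides with a data point
        return float(values[np.argmin(d)])
    w = 1.0 / d ** power
    return float(np.sum(w * values) / np.sum(w))
```

Repeating this per original dimension yields the multiple consistent views the abstract describes: the projected points never move, only the contoured scalar field changes.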
Visual Analytics and Human Involvement in Machine Learning
The rapidly developing AI systems and applications still require human
involvement in practically all parts of the analytics process. Human decisions
are largely based on visualizations, which provide data scientists with details
of data properties and the results of analytical procedures. Different
visualizations are used in the different steps of the Machine Learning (ML)
process. The decision of which visualization to use depends on factors such as
the data domain, the data model, and the step in the ML process. In this
chapter, we describe the seven steps in the ML process and review different
visualization techniques that are relevant to the different steps for different
types of data, models, and purposes.