Similarity Measures and Dimensionality Reduction Techniques for Time Series Data Mining
The chapter is organized as follows. Section 2 introduces the similarity matching
problem on time series, noting the importance of efficient data structures for search
and of choosing an adequate distance measure. Section 3 presents some of the most
widely used distance measures for time series data mining. Section 4 reviews the
above-mentioned dimensionality reduction techniques.
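The two ingredients the chapter names, a distance measure and a dimensionality reduction technique, can be illustrated with a minimal sketch: Euclidean distance between equal-length series, and Piecewise Aggregate Approximation (PAA), one of the commonly surveyed reduction techniques. Function names here are illustrative, not taken from the chapter.

```python
import numpy as np

def euclidean_distance(x, y):
    """Pointwise Euclidean distance between two equal-length series."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

def paa(x, segments):
    """Piecewise Aggregate Approximation: represent the series by the
    mean of each of `segments` contiguous chunks."""
    x = np.asarray(x, float)
    return np.array([chunk.mean() for chunk in np.array_split(x, segments)])

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 2, 4, 4, 6, 6, 8, 8]
d = euclidean_distance(x, y)   # → 2.0
x4 = paa(x, 4)                 # → [1.5, 3.5, 5.5, 7.5]
```

Reduced representations like `x4` are what an index structure would store, with the full-resolution distance used only to validate candidate matches.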
Systems aspects of COBE science data compression
A general approach to compression of diverse data from large scientific projects has been developed, and this paper addresses the appropriate system and scientific constraints together with the algorithm development and test strategy. This framework has been implemented for the COsmic Background Explorer (COBE) spacecraft by retrofitting the existing VAX-based data management system with high-performance compression software permitting random access to the data. Algorithms which incorporate scientific knowledge and consume relatively few system resources are preferred over ad hoc methods. COBE exceeded its planned storage by a large and growing factor, and the retrieval of data significantly affects the processing, delaying the availability of data for scientific use and software testing. Embedded compression software is planned to make the project tractable by reducing the data storage volume to an acceptable level during normal processing.
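The abstract's key requirement, compression that still permits random access, is commonly met by compressing fixed-size chunks independently and keeping an offset index. A minimal sketch of that general idea (not the COBE implementation, whose algorithms are not given here), using Python's zlib:

```python
import zlib

def compress_chunks(data: bytes, chunk_size: int = 4096):
    """Compress fixed-size chunks independently; the offset index lets
    any chunk be decompressed without touching the others."""
    chunks, index, offset = [], [], 0
    for i in range(0, len(data), chunk_size):
        c = zlib.compress(data[i:i + chunk_size])
        chunks.append(c)
        index.append(offset)
        offset += len(c)
    return b"".join(chunks), index, chunk_size

def read_chunk(blob, index, chunk_size, n):
    """Random access: decompress only chunk n."""
    end = index[n + 1] if n + 1 < len(index) else len(blob)
    return zlib.decompress(blob[index[n]:end])

data = bytes(range(256)) * 64            # 16 KiB of sample data
blob, index, size = compress_chunks(data, 4096)
assert read_chunk(blob, index, size, 2) == data[8192:12288]
```

The trade-off is that smaller chunks give finer-grained access but a worse compression ratio, since each chunk is compressed without context from its neighbours.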
Advances in Manipulation and Recognition of Digital Ink
Handwriting is one of the most natural ways for a human to record knowledge. Recently, this type of human-computer interaction has received increasing attention due to the rapid evolution of touch-based hardware and software. While hardware support for digital ink has reached maturity, algorithms for recognition of handwriting in certain domains, including mathematics, lack robustness. At the same time, users may possess several pen-based devices, and sharing training data in an adaptive recognition setting can be challenging. In addition, the resolution of pen-based devices keeps improving, making the ink cumbersome to process and store. This thesis develops several advances for efficient processing, storage and recognition of handwriting, applicable to classification methods based on functional approximation. In particular, we propose improvements to the classification of isolated characters and groups of rotated characters, as well as symbols of substantially different size. We then develop an algorithm for adaptive classification of a user's handwritten mathematical characters. The adaptive algorithm can be especially useful in the cloud-based recognition framework described further in the thesis. We investigate whether the training data available in the cloud can be useful to a new writer during the training phase by extracting styles of individuals with similar handwriting and recommending styles to the writer. We also perform a factorial analysis of the algorithm for recognition of n-grams of rotated characters. Finally, we show a fast method for compression of linear pieces of handwritten strokes and compare it with an enhanced version of the algorithm based on functional approximation of strokes. Experimental results demonstrate the validity of the theoretical contributions, which form a solid foundation for next-generation handwriting recognition systems.
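The final contribution mentioned, fast compression of linear pieces of handwritten strokes, belongs to the same family as classic polyline simplification. As an illustrative stand-in (the thesis's own algorithm is not reproduced here), the Ramer-Douglas-Peucker method drops stroke points that deviate little from the chord between segment endpoints:

```python
import math

def rdp(points, epsilon):
    """Ramer-Douglas-Peucker: keep only points deviating more than
    epsilon from the chord between the segment's endpoints."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = math.hypot(dx, dy) or 1.0
    # perpendicular distance from each interior point to the chord
    dmax, imax = 0.0, 0
    for i in range(1, len(points) - 1):
        px, py = points[i]
        d = abs(dy * (px - x1) - dx * (py - y1)) / norm
        if d > dmax:
            dmax, imax = d, i
    if dmax <= epsilon:
        return [points[0], points[-1]]   # chord is a good enough fit
    left = rdp(points[:imax + 1], epsilon)
    right = rdp(points[imax:], epsilon)
    return left[:-1] + right             # merge, dropping duplicate point

stroke = [(0, 0), (1, 0.05), (2, -0.03), (3, 0.02), (4, 5)]
simplified = rdp(stroke, 0.1)   # → [(0, 0), (3, 0.02), (4, 5)]
```

The nearly collinear run is collapsed to its endpoints while the sharp turn at the end is preserved, which is the behaviour one wants when compressing linear pieces of ink.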
Intelligent Pattern Analysis of the Foetal Electrocardiogram
The aim of the project on which this thesis is based is to develop reliable techniques for
foetal electrocardiogram (ECG) based monitoring, to reduce incidents of unnecessary
medical intervention and foetal injury during labour. Worldwide, electronic foetal
monitoring is based almost entirely on the cardiotocogram (CTG), which is a continuous
display of the foetal heart rate (FHR) pattern together with the contractions of the womb.
Despite the widespread use of the CTG, there has been no significant improvement in foetal
outcome. In the UK alone, birth-related negligence claims are estimated to cost the health
authorities over £400M per annum. An expert system, known as INFANT, has recently
been developed to assist CTG interpretation. However, the CTG alone does not always
provide all the information required to improve the outcome of labour. The widespread use
of ECG analysis has been hindered by poor signal quality and by the difficulty of applying
the specialised knowledge required for interpreting ECG patterns, in association with other
events in labour, in an objective way.
A fundamental investigation and development of optimal signal enhancement techniques
that maximise the available information in the ECG signal, along with different techniques
for detecting individual waveforms from poor quality signals, has been carried out. To
automate the visual interpretation of the ECG waveform, novel techniques have been
developed that allow reliable extraction of key features and hence allow a detailed ECG
waveform analysis. Fuzzy logic is used to automatically classify the ECG waveform shape
from these features, using knowledge that was elicited from expert sources and derived
from example data. This allows the subtle changes in the ECG waveform to be
automatically detected in relation to other events in labour, thus improving the clinician's
position for making an accurate diagnosis. To ensure the interpretation is based on reliable
information and takes place in the proper context, a new and sensitive index for assessing
the quality of the ECG has been developed.
New techniques to capture, for the first time in machine form, the clinical expertise /
guidelines for electronic foetal monitoring have been developed based on fuzzy logic and
finite state machines. The software model provides a flexible framework to further develop
and optimise rules for ECG pattern analysis. The signal enhancement, QRS detection and
pattern recognition of important ECG waveform shapes have had extensive testing and
results are presented. Results show that no significant loss of information is incurred as a
result of the signal enhancement and feature extraction techniques.
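As a rough illustration of the QRS detection step discussed above (the thesis's actual detector is not specified in the abstract), a toy slope-based detector can be sketched as follows; all names and parameters are hypothetical:

```python
import numpy as np

def detect_qrs(ecg, fs, threshold_frac=0.5, refractory=0.2):
    """Toy QRS detector: square the first difference to emphasise the
    steep QRS slopes, then pick samples above a fraction of the maximum,
    enforcing a refractory period between detections."""
    feature = np.diff(np.asarray(ecg, float)) ** 2
    thresh = threshold_frac * feature.max()
    min_gap = int(refractory * fs)       # minimum samples between beats
    peaks, last = [], -min_gap
    for i, v in enumerate(feature):
        if v >= thresh and i - last >= min_gap:
            peaks.append(i)
            last = i
    return peaks

# synthetic "ECG": flat baseline with sharp spikes at samples 100, 350, 600
fs = 250
sig = np.zeros(800)
sig[[100, 350, 600]] = 1.0
beats = detect_qrs(sig, fs)   # indices near the spike onsets
```

A real detector for poor-quality labour recordings would of course add band-pass filtering and adaptive thresholds; the sketch only shows the slope-emphasis-plus-refractory principle.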
RUBIK: Efficient Threshold Queries on Massive Time Series
An increasing number of applications in finance, meteorology, science and other fields produce time series as output. Analysis of this vast amount of time series is key to understanding the phenomena studied, particularly in the simulation sciences, where analysing the time series resulting from a simulation allows scientists to refine the model simulated. Existing approaches to querying time series typically keep a compact representation in main memory, use it to answer queries approximately, and then access the exact time series data on disk to validate the result. The more precise the in-memory representation, the fewer disk accesses are needed to validate the result. With the massive sizes of today's datasets, however, current in-memory representations oftentimes no longer fit into main memory. To make them fit, their precision has to be reduced considerably, resulting in substantial disk access which impedes query execution today and limits scalability for even bigger datasets in the future. In this paper we develop RUBIK, a novel approach to compressing and indexing time series. RUBIK exploits the fact that time series in many applications, and particularly in the simulation sciences, are similar to each other. It compresses similar time series, i.e., observation values as well as time information, achieving better space efficiency and improved precision. RUBIK translates threshold queries into two-dimensional spatial queries and efficiently executes them on the compressed time series by exploiting the pruning power of a tree structure to find the result, thereby outperforming the state of the art by a factor of between 6 and 23. As our experiments further indicate, exploiting similarity within and between time series is crucial to making query execution scale and to ultimately decoupling query execution time from the growth of the data (size and number of time series).
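The principle RUBIK builds on, answering approximately from a compact in-memory summary and validating candidates against exact data, can be shown in miniature with per-series (min, max) summaries. This illustrates only the pruning idea, not RUBIK's actual compressed tree structure:

```python
def build_summaries(series_set):
    """Per-series (min, max) summaries, kept 'in memory'."""
    return [(min(s), max(s)) for s in series_set]

def threshold_query(series_set, summaries, tau):
    """Return indices of series that reach tau at some point. The summary
    prunes series whose max is already below tau, so the 'exact data'
    is only scanned for surviving candidates."""
    hits = []
    for i, (lo, hi) in enumerate(summaries):
        if hi < tau:
            continue                                  # pruned by summary
        if any(v >= tau for v in series_set[i]):      # validate on exact data
            hits.append(i)
    return hits

data = [[1, 2, 3], [5, 1, 0], [2, 2, 2]]
summ = build_summaries(data)
matches = threshold_query(data, summ, 3)   # → [0, 1]
```

The tighter the summary bounds, the more series are pruned without touching the exact data, which is exactly why RUBIK's improved in-memory precision translates into fewer disk accesses.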
Application-Specific Number Representation
Reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), enable
application-specific number representations. Well-known number formats include fixed-point,
floating-point, the logarithmic number system (LNS), and the residue number system (RNS).
Such different number representations lead to different arithmetic designs and error
behaviours, thus producing implementations with different performance, accuracy, and cost.
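The appeal of a format like LNS can be seen in one line: representing values by their logarithms turns multiplication into addition, which is cheap in hardware. A toy sketch of this property (real LNS units also need table-based addition/subtraction circuits, omitted here):

```python
import math

def to_lns(x):
    """Represent a positive value by its base-2 logarithm."""
    return math.log2(x)

def lns_mul(lx, ly):
    """In LNS, multiplication reduces to addition of exponents."""
    return lx + ly

def from_lns(lx):
    """Convert back from the logarithmic domain."""
    return 2.0 ** lx

a, b = 6.0, 7.0
product = from_lns(lns_mul(to_lns(a), to_lns(b)))   # ≈ 42.0
```

Division and square roots become subtraction and halving in the same way, which is one reason LNS can win on area and latency for multiply-heavy kernels.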
To investigate the design options in number representations, the first part of this thesis presents
a platform that enables automated exploration of the number representation design space. The
second part of the thesis shows case studies that optimise the designs for area, latency or
throughput from the perspective of number representations.
Automated design space exploration in the first part addresses the following two major issues:
• Automation requires arithmetic unit generation. This thesis provides optimised
arithmetic library generators for logarithmic and residue arithmetic units, which support
a wide range of bit widths and achieve significant improvement over previous designs.
• Generation of arithmetic units requires specifying the bit widths for each
variable. This thesis describes an automatic bit-width optimisation tool called R-Tool,
which combines dynamic and static analysis methods, and supports different number
systems (fixed-point, floating-point, and LNS numbers).
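The static half of a bit-width analysis like R-Tool's can be illustrated by simple interval propagation: bounding a variable's value range bounds the integer bits its fixed-point representation needs. The sketch below is a generic illustration with hypothetical names, not R-Tool itself:

```python
import math

def static_range(lo, hi, ops):
    """Propagate an interval through a chain of (op, constant) steps --
    the static half of a bit-width analysis."""
    for op, c in ops:
        if op == "add":
            lo, hi = lo + c, hi + c
        elif op == "mul":
            candidates = (lo * c, hi * c)
            lo, hi = min(candidates), max(candidates)
    return lo, hi

def int_bits(lo, hi):
    """Integer bits needed for a signed fixed-point value in [lo, hi]."""
    m = max(abs(lo), abs(hi))
    return max(1, math.ceil(math.log2(m + 1)) + 1)   # +1 for the sign bit

# a variable in [-1, 1], scaled by 3.0 then offset by 0.5
lo, hi = static_range(-1.0, 1.0, [("mul", 3.0), ("add", 0.5)])   # → (-2.5, 3.5)
bits = int_bits(lo, hi)                                          # → 4
```

Static intervals are safe but pessimistic; combining them with dynamic (simulation-based) ranges, as R-Tool does, lets the tool trade guaranteed bounds against tighter, cheaper widths.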
Putting it all together, the second part explores the effects of application-specific number
representation on practical benchmarks, such as radiative Monte Carlo simulation, and seismic
imaging computations. Experimental results show that customising the number representations
brings benefits to hardware implementations: by selecting a more appropriate number format,
we can reduce the area cost by up to 73.5% and improve the throughput by 14.2% to 34.1%; by
performing the bit-width optimisation, we can further reduce the area cost by 9.7% to 17.3%.
On the performance side, hardware implementations with customised number formats achieve
5 to potentially over 40 times speedup over software implementations.
Reservoir Flooding Optimization by Control Polynomial Approximations
In this dissertation, we provide novel parameterization procedures for water-flooding
production optimization problems, using polynomial approximation techniques. The methods project the original infinite-dimensional control space onto a polynomial subspace. Our contributions include new parameterization formulations using natural polynomials, orthogonal Chebyshev polynomials and cubic spline interpolation.
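The parameterization idea can be sketched briefly: a handful of Chebyshev coefficients, rather than one control value per time step, become the optimization variables, and the full control trajectory is recovered by evaluating the expansion. Names and values below are illustrative only:

```python
import numpy as np

def chebyshev_controls(coeffs, n_steps):
    """Expand a low-dimensional coefficient vector into a smooth
    control trajectory over n_steps via Chebyshev polynomials."""
    t = np.linspace(-1.0, 1.0, n_steps)   # normalised simulation time
    return np.polynomial.chebyshev.chebval(t, coeffs)

# 3 optimisation variables instead of 100 per-step control values
coeffs = [0.5, 0.2, -0.1]
u = chebyshev_controls(coeffs, 100)       # smooth injection-rate schedule
```

A global-search optimizer then works on the three coefficients rather than a hundred per-step rates, which both shrinks the search space and guarantees smooth trajectories.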
We show that the proposed methods are well suited to a black-box approach with a
stochastic global-search method, as they tend to produce smooth control trajectories while reducing the size of the solution space. We demonstrate their efficiency on synthetic two-dimensional problems and on a realistic three-dimensional problem.
By contributing a new adjoint method formulation for polynomial approximation,
we also implemented the methods with gradient-based algorithms. In addition to fine-scale simulation, we performed reduced-order modeling, where we demonstrated a synergistic effect when combining polynomial approximation with model-order reduction, leading to faster optimization with higher gains in terms of Net Present Value.
Finally, we performed gradient-based optimization under uncertainty. We proposed
a new multi-objective function with three components: one that maximizes the expected
value over all realizations, and two that maximize the averages of the distribution tails
on both sides. The new objective provides decision makers with the flexibility to choose
the amount of risk they are willing to take when deciding on a production strategy or performing reserves estimation (P10; P50; P90).