19 research outputs found
Indexing Metric Spaces for Exact Similarity Search
With the continued digitalization of societal processes, we are seeing an
explosion in available data. This is referred to as big data. In a research
setting, three aspects of the data are often viewed as the main sources of
challenges when attempting to enable value creation from big data: volume,
velocity and variety. Many studies address volume or velocity, while much fewer
studies concern the variety. Metric space is ideal for addressing variety
because it can accommodate any type of data as long as its associated distance
notion satisfies the triangle inequality. To accelerate search in metric space,
a collection of indexing techniques for metric data have been proposed.
However, existing surveys each offers only a narrow coverage, and no
comprehensive empirical study of those techniques exists. We offer a survey of
all the existing metric indexes that can support exact similarity search, by i)
summarizing all the existing partitioning, pruning and validation techniques
used for metric indexes, ii) providing the time and storage complexity analysis
on the index construction, and iii) report on a comprehensive empirical
comparison of their similarity query processing performance. Here, empirical
comparisons are used to evaluate the index performance during search as it is
hard to see the complexity analysis differences on the similarity query
processing and the query performance depends on the pruning and validation
abilities related to the data distribution. This article aims at revealing
different strengths and weaknesses of different indexing techniques in order to
offer guidance on selecting an appropriate indexing technique for a given
setting, and directing the future research for metric indexes
Aspects of Metric Spaces in Computation
Metric spaces, which generalise the properties of commonly-encountered physical and abstract spaces into a mathematical framework, frequently occur in computer science applications. Three major kinds of questions about metric spaces are considered here: the intrinsic dimensionality of a distribution, the maximum number of distance permutations, and the difficulty of reverse similarity search. Intrinsic dimensionality measures the tendency for points to be equidistant, which is diagnostic of high-dimensional spaces. Distance permutations describe the order in which a set of fixed sites appears while moving away from a chosen point; the number of distinct permutations determines the amount of storage space required by some kinds of indexing data structure. Reverse similarity search problems are constraint satisfaction problems derived from distance-based index structures. Their difficulty reveals details of the structure of the space. Theoretical and experimental results are given for these three questions in a wide range of metric spaces, with commentary on the consequences for computer science applications and additional related results where appropriate
Investigations into Elasticity in Cloud Computing
The pay-as-you-go model supported by existing cloud infrastructure providers
is appealing to most application service providers to deliver their
applications in the cloud. Within this context, elasticity of applications has
become one of the most important features in cloud computing. This elasticity
enables real-time acquisition/release of compute resources to meet application
performance demands. In this thesis we investigate the problem of delivering
cost-effective elasticity services for cloud applications.
Traditionally, the application level elasticity addresses the question of how
to scale applications up and down to meet their performance requirements, but
does not adequately address issues relating to minimising the costs of using
the service. With this current limitation in mind, we propose a scaling
approach that makes use of cost-aware criteria to detect the bottlenecks within
multi-tier cloud applications, and scale these applications only at bottleneck
tiers to reduce the costs incurred by consuming cloud infrastructure resources.
Our approach is generic for a wide class of multi-tier applications, and we
demonstrate its effectiveness by studying the behaviour of an example
electronic commerce site application.
Furthermore, we consider the characteristics of the algorithm for
implementing the business logic of cloud applications, and investigate the
elasticity at the algorithm level: when dealing with large-scale data under
resource and time constraints, the algorithm's output should be elastic with
respect to the resource consumed. We propose a novel framework to guide the
development of elastic algorithms that adapt to the available budget while
guaranteeing the quality of output result, e.g. prediction accuracy for
classification tasks, improves monotonically with the used budget.Comment: 211 pages, 27 tables, 75 figure
Towards Formal Structural Representation of Spoken Language: An Evolving Transformation System (ETS) Approach
Speech recognition has been a very active area of research over the past twenty years. Despite an evident progress, it is generally agreed by the practitioners of the field that performance of the current speech recognition systems is rather suboptimal and new approaches are needed. The motivation behind the undertaken research is an observation that the notion of representation of objects and concepts that once was considered to be central in the early days of pattern recognition, has been largely marginalised by the advent of statistical approaches. As a consequence of a predominantly statistical approach to speech recognition problem, due to the numeric, feature vector-based, nature of representation, the classes inductively discovered from real data using decision-theoretic techniques have little meaning outside the statistical framework. This is because decision surfaces or probability distributions are difficult to analyse linguistically. Because of the later limitation it is doubtful that the gap between speech recognition and linguistic research can be bridged by the numeric representations. This thesis investigates an alternative, structural, approach to spoken language representation and categorisation. The approach pursued in this thesis is based on a consistent program, known as the Evolving Transformation System (ETS), motivated by the development and clarification of the concept of structural representation in pattern recognition and artificial intelligence from both theoretical and applied points of view. This thesis consists of two parts. In the first part of this thesis, a similarity-based approach to structural representation of speech is presented. First, a linguistically well-motivated structural representation of phones based on distinctive phonological features recovered from speech is proposed. The representation consists of string templates representing phones together with a similarity measure. The set of phonological templates together with a similarity measure defines a symbolic metric space. Representation and ETS-inspired categorisation in the symbolic metric spaces corresponding to the phonological structural representation are then investigated by constructing appropriate symbolic space classifiers and evaluating them on a standard corpus of read speech. In addition, similarity-based isometric transition from phonological symbolic metric spaces to the corresponding non-Euclidean vector spaces is investigated. Second part of this thesis deals with the formal approach to structural representation of spoken language. Unlike the approach adopted in the first part of this thesis, the representation developed in the second part is based on the mathematical language of the ETS formalism. This formalism has been specifically developed for structural modelling of dynamic processes. In particular, it allows the representation of both objects and classes in a uniform event-based hierarchical framework. In this thesis, the latter property of the formalism allows the adoption of a more physiologically-concreteapproach to structural representation. The proposed representation is based on gestural structures and encapsulates speech processes at the articulatory level. Algorithms for deriving the articulatory structures from the data are presented and evaluated
Parametric POMDPs for planning in continuous state spaces
This thesis is concerned with planning and acting under uncertainty in partially-observable continuous domains. In particular, it focusses on the problem of mobile robot navigation given a known map. The dominant paradigm for robot localisation is to use Bayesian estimation to maintain a probability distribution over possible robot poses. In contrast, control algorithms often base their decisions on the assumption that a single state, such as the mode of this distribution, is correct. In scenarios involving significant uncertainty, this can lead to serious control errors. It is generally agreed that the reliability of navigation in uncertain environments would be greatly improved by the ability to consider the entire distribution when acting, rather than the single most likely state. The framework adopted in this thesis for modelling navigation problems mathematically is the Partially Observable Markov Decision Process (POMDP). An exact solution to a POMDP problem provides the optimal balance between reward-seeking behaviour and information-seeking behaviour, in the presence of sensor and actuation noise. Unfortunately, previous exact and approximate solution methods have had difficulty scaling to real applications. The contribution of this thesis is the formulation of an approach to planning in the space of continuous parameterised approximations to probability distributions. Theoretical and practical results are presented which show that, when compared with similar methods from the literature, this approach is capable of scaling to larger and more realistic problems. In order to apply the solution algorithm to real-world problems, a number of novel improvements are proposed. Specifically, Monte Carlo methods are employed to estimate distributions over future parameterised beliefs, improving planning accuracy without a loss of efficiency. Conditional independence assumptions are exploited to simplify the problem, reducing computational requirements. Scalability is further increased by focussing computation on likely beliefs, using metric indexing structures for efficient function approximation. Local online planning is incorporated to assist global offline planning, allowing the precision of the latter to be decreased without adversely affecting solution quality. Finally, the algorithm is implemented and demonstrated during real-time control of a mobile robot in a challenging navigation task. We argue that this task is substantially more challenging and realistic than previous problems to which POMDP solution methods have been applied. Results show that POMDP planning, which considers the evolution of the entire probability distribution over robot poses, produces significantly more robust behaviour when compared with a heuristic planner which considers only the most likely states and outcomes
Parametric POMDPs for planning in continuous state spaces
This thesis is concerned with planning and acting under uncertainty in partially-observable continuous domains. In particular, it focusses on the problem of mobile robot navigation given a known map. The dominant paradigm for robot localisation is to use Bayesian estimation to maintain a probability distribution over possible robot poses. In contrast, control algorithms often base their decisions on the assumption that a single state, such as the mode of this distribution, is correct. In scenarios involving significant uncertainty, this can lead to serious control errors. It is generally agreed that the reliability of navigation in uncertain environments would be greatly improved by the ability to consider the entire distribution when acting, rather than the single most likely state. The framework adopted in this thesis for modelling navigation problems mathematically is the Partially Observable Markov Decision Process (POMDP). An exact solution to a POMDP problem provides the optimal balance between reward-seeking behaviour and information-seeking behaviour, in the presence of sensor and actuation noise. Unfortunately, previous exact and approximate solution methods have had difficulty scaling to real applications. The contribution of this thesis is the formulation of an approach to planning in the space of continuous parameterised approximations to probability distributions. Theoretical and practical results are presented which show that, when compared with similar methods from the literature, this approach is capable of scaling to larger and more realistic problems. In order to apply the solution algorithm to real-world problems, a number of novel improvements are proposed. Specifically, Monte Carlo methods are employed to estimate distributions over future parameterised beliefs, improving planning accuracy without a loss of efficiency. Conditional independence assumptions are exploited to simplify the problem, reducing computational requirements. Scalability is further increased by focussing computation on likely beliefs, using metric indexing structures for efficient function approximation. Local online planning is incorporated to assist global offline planning, allowing the precision of the latter to be decreased without adversely affecting solution quality. Finally, the algorithm is implemented and demonstrated during real-time control of a mobile robot in a challenging navigation task. We argue that this task is substantially more challenging and realistic than previous problems to which POMDP solution methods have been applied. Results show that POMDP planning, which considers the evolution of the entire probability distribution over robot poses, produces significantly more robust behaviour when compared with a heuristic planner which considers only the most likely states and outcomes
Relational clustering models for knowledge discovery and recommender systems
Cluster analysis is a fundamental research field in Knowledge Discovery and Data Mining
(KDD). It aims at partitioning a given dataset into some homogeneous clusters so as
to reflect the natural hidden data structure. Various heuristic or statistical approaches
have been developed for analyzing propositional datasets. Nevertheless, in relational
clustering the existence of multi-type relationships will greatly degrade the performance
of traditional clustering algorithms. This issue motivates us to find more effective algorithms
to conduct the cluster analysis upon relational datasets. In this thesis we
comprehensively study the idea of Representative Objects for approximating data distribution
and then design a multi-phase clustering framework for analyzing relational
datasets with high effectiveness and efficiency.
The second task considered in this thesis is to provide some better data models for
people as well as machines to browse and navigate a dataset. The hierarchical taxonomy
is widely used for this purpose. Compared with manually created taxonomies, automatically
derived ones are more appealing because of their low creation/maintenance cost
and high scalability. Up to now, the taxonomy generation techniques are mainly used
to organize document corpus. We investigate the possibility of utilizing them upon relational
datasets and then propose some algorithmic improvements. Another non-trivial
problem is how to assign suitable labels for the taxonomic nodes so as to credibly summarize
the content of each node. Unfortunately, this field has not been investigated
sufficiently to the best of our knowledge, and so we attempt to fill the gap by proposing
some novel approaches.
The final goal of our cluster analysis and taxonomy generation techniques is
to improve the scalability of recommender systems that are developed to tackle the
problem of information overload. Recent research in recommender systems integrates
the exploitation of domain knowledge to improve the recommendation quality, which
however reduces the scalability of the whole system at the same time. We address this
issue by applying the automatically derived taxonomy to preserve the pair-wise similarities
between items, and then modeling the user visits by another hierarchical structure.
Experimental results show that the computational complexity of the recommendation
procedure can be greatly reduced and thus the system scalability be improved
Computer Science & Technology Series : XXI Argentine Congress of Computer Science. Selected papers
CACIC’15 was the 21thCongress in the CACIC series. It was organized by the School of Technology at the UNNOBA (North-West of Buenos Aires National University) in Junín, Buenos Aires.
The Congress included 13 Workshops with 131 accepted papers, 4 Conferences, 2 invited tutorials, different meetings related with Computer Science Education (Professors, PhD students, Curricula) and an International School with 6 courses.
CACIC 2015 was organized following the traditional Congress format, with 13 Workshops covering a diversity of dimensions of Computer Science Research. Each topic was supervised by a committee of 3-5 chairs of different Universities.
The call for papers attracted a total of 202 submissions. An average of 2.5 review reports werecollected for each paper, for a grand total of 495 review reports that involved about 191 different reviewers.
A total of 131 full papers, involving 404 authors and 75 Universities, were accepted and 24 of them were selected for this book.Red de Universidades con Carreras en Informática (RedUNCI