9 research outputs found
On construction, performance, and diversification for structured queries on the semantic desktop
[no abstract]
A novel ensemble learning approach to unsupervised record linkage
© 2017
Record linkage is the process of identifying records that refer to the same real-world entity. Many existing approaches to record linkage apply supervised machine learning techniques to generate a classification model that classifies a pair of records as either match or non-match. The main requirement of such an approach is a labelled training dataset. In many real-world applications no labelled dataset is available, hence manual labelling is required to create a sufficiently sized training dataset for a supervised machine learning algorithm. Semi-supervised machine learning techniques, such as self-learning or active learning, which require only a small manually labelled training dataset, have been applied to record linkage. These techniques reduce the requirement on the manual labelling of the training dataset. However, they have yet to achieve a level of accuracy similar to that of supervised learning techniques. In this paper we propose a new approach to unsupervised record linkage based on a combination of ensemble learning and enhanced automatic self-learning. In the proposed approach, an ensemble of automatic self-learning models is generated with different similarity measure schemes. To further improve the automatic self-learning process, we incorporate field weighting into the automatic seed selection for each of the self-learning models. We propose an unsupervised diversity measure to ensure that there is high diversity among the selected self-learning models. Finally, we propose to use the contribution ratios of self-learning models to remove those with poor accuracy from the ensemble. We have evaluated our approach on 4 publicly available datasets that are commonly used in the record linkage community. Our experimental results show that our proposed approach has advantages over the state-of-the-art semi-supervised and unsupervised record linkage techniques. In 3 out of 4 datasets it also achieves comparable results to those of the supervised approaches.
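As a rough illustration of the field-weighted automatic seed selection mentioned in the abstract above, the sketch below labels candidate record pairs as match or non-match seeds from a weighted similarity score. It is not the paper's method: the similarity measure (`difflib`'s ratio), the field weights, the thresholds `upper` and `lower`, and all function names are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; difflib's ratio is a stand-in measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def weighted_similarity(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Weighted mean of per-field similarities (weights are hypothetical)."""
    total = sum(weights.values())
    score = sum(w * field_similarity(rec_a[f], rec_b[f]) for f, w in weights.items())
    return score / total


def select_seeds(pairs, weights, upper=0.9, lower=0.3):
    """Split candidate pairs into match seeds, non-match seeds, and the rest.

    Pairs scoring at or above `upper` become match seeds, pairs at or below
    `lower` become non-match seeds; both thresholds are assumed values.
    """
    matches, non_matches, unlabelled = [], [], []
    for a, b in pairs:
        s = weighted_similarity(a, b, weights)
        if s >= upper:
            matches.append((a, b))
        elif s <= lower:
            non_matches.append((a, b))
        else:
            unlabelled.append((a, b))
    return matches, non_matches, unlabelled
```

In a self-learning loop, the two seed sets would train an initial classifier, which is then applied to the unlabelled pairs; an ensemble could be formed by repeating this with different similarity measures.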
Techniques for Constructing Efficient Lock-free Data Structures
Building a library of concurrent data structures is an essential way to
simplify the difficult task of developing concurrent software. Lock-free data
structures, in which processes can help one another to complete operations,
offer the following progress guarantee: If processes take infinitely many
steps, then infinitely many operations are performed. Handcrafted lock-free
data structures can be very efficient, but are notoriously difficult to
implement. We introduce numerous tools that support the development of
efficient lock-free data structures, and especially trees.
Comment: PhD thesis, Univ Toronto (2017)
Fast machine translation on parallel and massively parallel hardware
Parallel systems have been widely adopted in the field of machine translation, because
the raw computational power they offer is well suited to this computationally intensive
task. However, programming for parallel hardware is not trivial, as it requires redesign
of the existing algorithms. In my thesis I design efficient algorithms for machine translation
on parallel hardware. I identify memory accesses as the biggest bottleneck to
processing speed and propose novel algorithms that minimize them. I present three distinct
case studies in which minimizing memory access substantially improves speed:
Starting with statistical machine translation, I design a phrase table that makes decoding
ten times faster on a multi-threaded CPU. Next, I design a GPU-based n-gram
language model that is twice as fast per £ as a highly optimized CPU implementation.
Turning to neural machine translation, I design new stochastic gradient descent techniques
that make end-to-end training twice as fast. The work in this thesis has been
incorporated in two popular machine translation toolkits: Moses and Marian.
Novel storage architectures and pointer-free search trees for database systems
Database systems research is an old and well-established field in computer science.
Many of the key concepts appeared as early as the 60s, while the core of relational
databases, which have dominated the database world for a while now, was solidified
during the 80s. However, the underlying hardware has not displayed such stability
in the same period, which means that a lot of assumptions that were made about the
hardware by early database systems are not necessarily true for modern computer
architectures.
In particular, over the last few decades there have been two notable consistent
trends in the evolution of computer hardware. The first is that the memory hierarchy
of mainstream computer systems has been getting deeper, with its different levels
moving away from each other, and new levels being added in between as a result,
in particular cache memories. The second is that, when it comes to data transfers
between any two adjacent levels of the memory hierarchy, access latencies have not
been keeping up with transfer rates. The challenge is therefore to adapt database index
structures so that they become immune to these two trends.
The latter is addressed by gradually increasing the size of the data transfer unit; the
former, by organizing the data so that it exhibits good locality for memory transfers
across multiple memory boundaries. We have developed novel structures that facilitate
both of these strategies. We started our investigation with the venerable B+-tree,
which is the cornerstone order-preserving index of any database system, and we have
developed a novel pointer-free tree structure for its pages that optimizes its cache
performance and makes it immune to the page size. We then adapted our approach to
the R-tree and the GiST, making it applicable to multi-dimensional data indexes as
well as generalized indexes for any abstract data type. Finally, we have investigated our
structure in the context of main memory alone, and have demonstrated its superiority
over the established approaches in that setting too.
While our research has its roots in data structures and algorithms theory, we have
conducted it with a strong experimental focus, as the complex interactions within the
memory hierarchy of a modern computer system can be quite challenging to model
and theorize about effectively. Our findings are therefore backed by solid experimental
results that verify our hypotheses and prove the superiority of our structures over
competing approaches.
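To make the idea of a pointer-free search tree concrete, the sketch below stores sorted keys in a flat array in breadth-first order and navigates with index arithmetic alone. This is a generic implicit layout, shown only as an assumption-laden illustration; it is not the structure developed in the thesis, and the function names are hypothetical.

```python
def build_implicit_tree(sorted_keys):
    """Lay sorted keys out in breadth-first order in a flat list.

    Slot 0 is left unused so that the children of slot i are 2*i and 2*i + 1.
    """
    n = len(sorted_keys)
    tree = [None] * (n + 1)
    it = iter(sorted_keys)

    def fill(i):
        # An in-order walk over slot indices assigns keys in sorted order.
        if i <= n:
            fill(2 * i)
            tree[i] = next(it)
            fill(2 * i + 1)

    fill(1)
    return tree


def contains(tree, key):
    """Search using index arithmetic only; no child pointers are stored."""
    i = 1
    while i < len(tree):
        if key == tree[i]:
            return True
        i = 2 * i if key < tree[i] else 2 * i + 1
    return False
```

Because child positions are computed rather than stored, every slot holds a key, and the top levels of the tree sit next to each other in memory, which is the kind of locality property the abstract alludes to.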