Premise Selection for Mathematics by Corpus Analysis and Kernel Methods
Smart premise selection is essential when using automated reasoning as a tool
for large-theory formal proof development. A good method for premise selection
in complex mathematical libraries is the application of machine learning to
large corpora of proofs. This work develops learning-based premise selection in
two ways. First, a newly available minimal dependency analysis of existing
high-level formal mathematical proofs is used to build a large knowledge base
of proof dependencies, providing precise data for ATP-based re-verification and
for training premise selection algorithms. Second, a new machine learning
algorithm for premise selection based on kernel methods is proposed and
implemented. To evaluate the impact of both techniques, a benchmark consisting
of 2078 large-theory mathematical problems is constructed, extending the older
MPTP Challenge benchmark. The combined effect of the techniques results in a
50% improvement on the benchmark over the Vampire/SInE state-of-the-art system
for automated reasoning in large theories.
Co-Regularized Least-Squares for Label Ranking
Kernel-Based Ranking. Methods for Learning and Performance Estimation
Machine learning provides tools for automated construction of predictive
models in data-intensive areas of engineering and science. The family of
regularized kernel methods has in recent years become one of the mainstream
approaches to machine learning, owing to a number of advantages the
methods share. The approach provides theoretically well-founded solutions
to the problems of under- and overfitting, allows learning from structured
data, and has been empirically demonstrated to yield high predictive performance
on a wide range of application domains. Historically, the problems
of classification and regression have gained the majority of attention in the
field. In this thesis we focus on another type of learning problem, that of
learning to rank.
In learning to rank, the aim is to learn, from a set of past observations,
a ranking function that can order new objects according to how well they
match some underlying criterion of goodness. As an important special case
of the setting, we can recover the bipartite ranking problem, corresponding
to maximizing the area under the ROC curve (AUC) in binary classification.
Ranking applications appear in a large variety of settings; examples
encountered in this thesis include document retrieval in web search, recommender
systems, information extraction, and automated parsing of natural
language. We consider the pairwise approach to learning to rank, where
ranking models are learned by minimizing the expected probability of ranking
any two randomly drawn test examples incorrectly. The development
of computationally efficient kernel methods, based on this approach, has in
the past proven to be challenging. Moreover, it is not clear what techniques
for estimating the predictive performance of learned models are the most
reliable in the ranking setting, and how the techniques can be implemented
efficiently.
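The link between the pairwise criterion and AUC in the bipartite case can be illustrated with a minimal sketch. It is not taken from the thesis: the function names `pairwise_ranking_error` and `auc` are hypothetical, and counting a tied pair as half an error is an assumed convention. The sketch computes the fraction of differently labelled pairs that a scoring function orders incorrectly, and obtains AUC as one minus that fraction.

```python
from itertools import combinations


def pairwise_ranking_error(scores, labels):
    """Fraction of differently labelled pairs that the scores order wrongly.

    Assumption: a tie in score between two differently labelled examples
    counts as half an error.
    """
    errors = 0.0
    n_pairs = 0
    for (s_i, y_i), (s_j, y_j) in combinations(zip(scores, labels), 2):
        if y_i == y_j:
            continue  # pairs with equal labels carry no ranking information
        n_pairs += 1
        if s_i == s_j:
            errors += 0.5
        elif (s_i - s_j) * (y_i - y_j) < 0:
            errors += 1.0  # the higher-labelled example got the lower score
    if n_pairs == 0:
        raise ValueError("need at least one pair with different labels")
    return errors / n_pairs


def auc(scores, labels):
    """In the bipartite (binary-label) case, AUC = 1 - pairwise error."""
    return 1.0 - pairwise_ranking_error(scores, labels)


# Toy data: one of the four positive-negative pairs is misordered.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(pairwise_ranking_error(scores, labels), auc(scores, labels))
```

The explicit loop over pairs is quadratic in the number of examples, which is acceptable for illustration but is exactly the cost that efficient ranking methods try to avoid.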
The contributions of this thesis are as follows. First, we develop
RankRLS, a computationally efficient kernel method for learning to rank,
that is based on minimizing a regularized pairwise least-squares loss. In
addition to training methods, we introduce a variety of algorithms for tasks
such as model selection, multi-output learning, and cross-validation, based
on computational shortcuts from matrix algebra. Second, we improve the
fastest known training method for the linear version of the RankSVM algorithm,
which is one of the best-established methods for learning to
rank. Third, we study the combination of the empirical kernel map and reduced
set approximation, which allows the large-scale training of kernel machines
using linear solvers, and propose computationally efficient solutions
to cross-validation when using the approach. Next, we explore the problem
of reliable cross-validation when using AUC as a performance criterion,
through an extensive simulation study. We demonstrate that the proposed
leave-pair-out cross-validation approach leads to more reliable performance
estimation than commonly used alternative approaches. Finally, we present
a case study on applying machine learning to information extraction from
biomedical literature, which combines several of the approaches considered
in the thesis. The thesis is divided into two parts. Part I provides the background
for the research work and summarizes the most central results; Part
II consists of the five original research articles that are the main contribution
of this thesis.
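As a rough illustration of two ideas named in the abstract, the following sketch fits a linear model by minimizing a regularized pairwise least-squares loss and estimates AUC with leave-pair-out cross-validation. It is a sketch under stated assumptions, not the thesis's RankRLS implementation: the model is linear rather than kernelized, the function names `fit_pairwise_rls` and `leave_pair_out_auc` are hypothetical, and the cross-validation retrains from scratch for every held-out pair instead of using the matrix-algebra shortcuts the thesis develops. The closed form relies on the identity that the sum of (r_i - r_j)^2 over all pairs i < j equals r^T (nI - 11^T) r.

```python
import numpy as np


def fit_pairwise_rls(X, y, lam=1.0):
    """Minimize sum_{i<j} ((y_i - y_j) - (x_i - x_j) @ w)^2 + lam * ||w||^2.

    With residual r = y - X @ w, the pairwise sum equals r^T L r, where
    L = n*I - 1 1^T, so w solves (X^T L X + lam*I) w = X^T L y.
    """
    n, d = X.shape
    L = n * np.eye(n) - np.ones((n, n))
    A = X.T @ L @ X + lam * np.eye(d)
    b = X.T @ L @ y
    return np.linalg.solve(A, b)


def leave_pair_out_auc(X, y, lam=1.0):
    """Naive leave-pair-out AUC estimate for binary labels in {0, 1}.

    Each positive-negative pair is held out in turn, the model is retrained
    on the remaining examples, and the pair counts as correct when the
    positive example receives the higher score (a tie counts as 0.5).
    """
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    total = 0.0
    for i in pos:
        for j in neg:
            keep = np.ones(len(y), dtype=bool)
            keep[[i, j]] = False
            w = fit_pairwise_rls(X[keep], y[keep], lam)
            diff = X[i] @ w - X[j] @ w
            total += 1.0 if diff > 0 else (0.5 if diff == 0 else 0.0)
    return total / (len(pos) * len(neg))


# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))
w_true = np.array([2.0, -1.0, 0.5])

# Real-valued targets with a constant offset: the pairwise loss ignores it.
y_real = X @ w_true + 10.0
w_hat = fit_pairwise_rls(X, y_real, lam=1e-8)

# Binary targets for the leave-pair-out AUC estimate.
y_bin = (X @ w_true > 0).astype(float)
lpo_auc = leave_pair_out_auc(X, y_bin, lam=1.0)
print(w_hat, lpo_auc)
```

Forming the dense n-by-n matrix `L` keeps the algebra visible but costs quadratic memory; in practice one would use the equivalent expression `n * X.T @ X - s @ s.T` with `s = X.sum(axis=0)` to avoid it.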