Unsupervised Representation Learning with Minimax Distance Measures
We investigate the use of Minimax distances to extract in a nonparametric way
the features that capture the unknown underlying patterns and structures in the
data. We develop a general-purpose and computationally efficient framework to
employ Minimax distances with many machine learning methods that operate on
numerical data. We study both computing the pairwise Minimax distances for all
pairs of objects and computing the Minimax distances of all the
objects to/from a fixed (test) object.
We first efficiently compute the pairwise Minimax distances between the
objects, using the equivalence of Minimax distances over a graph and over a
minimum spanning tree constructed on that graph. Then, we embed the
pairwise Minimax distances into a new vector space, such that the squared
Euclidean distances in the new space equal the pairwise Minimax distances in
the original space. We also study the case of having multiple pairwise Minimax
matrices, instead of a single one. For this case, we propose an embedding that first
sums up the centered matrices and then performs an eigenvalue
decomposition to obtain the relevant features.
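The two steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a dense symmetric base distance matrix, uses a plain Kruskal-style merge to read off the Minimax distances from the minimum spanning tree, and uses a standard double-centering plus eigendecomposition for the embedding. The names `pairwise_minimax` and `embed` are hypothetical.

```python
import numpy as np

def pairwise_minimax(base):
    """All-pairs Minimax distances from a symmetric base distance matrix.

    Edges are visited in increasing order (Kruskal). An edge that joins two
    components is a minimum-spanning-tree edge, and it is the largest edge on
    the tree path between every pair of objects split across the components.
    """
    n = base.shape[0]
    iu, ju = np.triu_indices(n, k=1)
    order = np.argsort(base[iu, ju], kind="stable")
    label = list(range(n))                 # component label of each object
    members = {i: [i] for i in range(n)}   # component label -> its objects
    mm = np.zeros((n, n))
    for e in order:
        a, b = label[iu[e]], label[ju[e]]
        if a == b:
            continue                       # edge would close a cycle
        w = base[iu[e], ju[e]]
        for p in members[a]:
            for q in members[b]:
                mm[p, q] = mm[q, p] = w
        for q in members[b]:
            label[q] = a
        members[a].extend(members.pop(b))
    return mm

def embed(mm):
    """Coordinates whose squared Euclidean distances reproduce `mm`."""
    n = mm.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    G = -0.5 * J @ mm @ J                  # centered (Gram-like) matrix
    vals, vecs = np.linalg.eigh(G)
    vals = np.clip(vals, 0.0, None)        # remove tiny negative round-off
    return vecs * np.sqrt(vals)

# Four points on a line; the last one is far from the rest.
pts = np.array([0.0, 1.0, 2.0, 10.0])
base = np.abs(pts[:, None] - pts[None, :])
mm = pairwise_minimax(base)
X = embed(mm)
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
```

In this toy case the Minimax distance between the first and third points is 1 rather than 2, because the path through the middle point has no edge longer than 1. For several Minimax matrices, the same eigendecomposition could be applied to the sum of their centered matrices, as the abstract describes.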
Next, we study computing Minimax distances from a fixed (test)
object, which can be used, for instance, in K-nearest neighbor search. Similar to
the case of all-pair Minimax distances, we develop an efficient and
general-purpose algorithm that is applicable with any base distance
measure. Moreover, we investigate in detail the edges selected by the Minimax
distances and thereby explore the ability of Minimax distances to detect
outlier objects.
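As a rough illustration of Minimax distances from a single fixed object, the sketch below uses a generic bottleneck-path variant of Dijkstra's algorithm, where the cost of a path is its largest edge instead of the sum of its edges. This is a textbook technique chosen for clarity, not the algorithm developed in the paper, and the function name `minimax_from_source` is hypothetical.

```python
import numpy as np

def minimax_from_source(base, src):
    """Minimax distance from object `src` to every object.

    `base` is a dense symmetric base distance matrix. Dense O(n^2) loop.
    """
    n = base.shape[0]
    dist = np.full(n, np.inf)
    dist[src] = 0.0
    done = np.zeros(n, dtype=bool)
    for _ in range(n):
        # Pick the unfinished object with the smallest tentative distance.
        u = int(np.argmin(np.where(done, np.inf, dist)))
        done[u] = True
        # Extending a path through u costs the larger of the path so far
        # and the new edge, so relax with max instead of addition.
        cand = np.maximum(dist[u], base[u])
        dist = np.where(done, dist, np.minimum(dist, cand))
    return dist

pts = np.array([0.0, 1.0, 2.0, 10.0])
base = np.abs(pts[:, None] - pts[None, :])
d0 = minimax_from_source(base, 0)
# Nearest neighbours of object 0 under the Minimax distance (itself excluded).
nearest = np.argsort(d0, kind="stable")[1:3]
```

Any base distance can be plugged in by changing how `base` is built; the relaxation step itself does not depend on it.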
Finally, for each setting, we perform several experiments to demonstrate the
effectiveness of our framework.
Comment: 32 pages
LambdaFM: Learning Optimal Ranking with Factorization Machines Using Lambda Surrogates
State-of-the-art item recommendation algorithms, which apply
Factorization Machines (FM) as a scoring function and
pairwise ranking loss as a trainer (PRFM for short), have
been recently investigated for the implicit feedback based
context-aware recommendation problem (IFCAR). However,
good recommenders particularly emphasize accuracy
near the top of the ranked list, and typical pairwise loss functions
might not match well with such a requirement. In this
paper, we demonstrate, both theoretically and empirically, that
PRFM models usually lead to non-optimal item recommendation
results due to such a mismatch. Inspired by the success
of LambdaRank, we introduce Lambda Factorization
Machines (LambdaFM), which is particularly intended for
optimizing ranking performance for IFCAR. We also point
out that the original lambda function is computationally
expensive in such settings because of the
large amount of unobserved feedback. Hence, instead
of directly adopting the original lambda strategy, we create
three effective lambda surrogates by conducting a theoretical
analysis for lambda from the top-N optimization perspective.
Further, we prove that the proposed lambda surrogates
are generic and applicable to a large set of pairwise
ranking loss functions. Experimental results demonstrate that
LambdaFM significantly outperforms state-of-the-art algorithms
on three real-world datasets in terms of four standard
ranking measures.
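To make the PRFM baseline mentioned in the abstract concrete, here is a minimal sketch of a second-order Factorization Machine score combined with a logistic pairwise ranking loss. It does not implement LambdaFM's lambda surrogates, which the abstract does not specify; the logistic loss is one common choice of pairwise loss, and the names `fm_score` and `pairwise_loss` are hypothetical.

```python
import numpy as np

def fm_score(x, w0, w, V):
    """Second-order FM score for one feature vector x.

    w0: global bias, w: linear weights (n,), V: factor matrix (n, k).
    The pairwise term sum_{i<j} <V_i, V_j> x_i x_j is computed in O(n k)
    with the usual square-of-sum minus sum-of-squares identity.
    """
    linear = w0 + w @ x
    sum_sq = (V.T @ x) ** 2             # (sum_i V_if x_i)^2 for each factor f
    sq_sum = (V ** 2).T @ (x ** 2)      # sum_i V_if^2 x_i^2 for each factor f
    return linear + 0.5 * np.sum(sum_sq - sq_sum)

def pairwise_loss(score_pos, score_neg):
    """Logistic pairwise loss: log(1 + exp(-(s_pos - s_neg))), computed stably."""
    return np.logaddexp(0.0, -(score_pos - score_neg))

rng = np.random.default_rng(0)
n, k = 6, 3
w0, w, V = 0.1, rng.normal(size=n), rng.normal(size=(n, k))
x_pos, x_neg = rng.normal(size=n), rng.normal(size=n)   # observed vs. unobserved item
loss = pairwise_loss(fm_score(x_pos, w0, w, V), fm_score(x_neg, w0, w, V))
```

The loss treats every mis-ordered pair alike regardless of where it falls in the ranked list, which is the mismatch with top-of-list accuracy that the abstract points out.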
Second-generation PLINK: rising to the challenge of larger and richer datasets
PLINK 1 is a widely used open-source C/C++ toolset for genome-wide
association studies (GWAS) and research in population genetics. However, the
steady accumulation of data from imputation and whole-genome sequencing studies
has exposed a strong need for even faster and more scalable implementations of
key functions. In addition, GWAS and population-genetic data now frequently
contain probabilistic calls, phase information, and/or multiallelic variants,
none of which can be represented by PLINK 1's primary data format.
To address these issues, we are developing a second-generation codebase for
PLINK. The first major release from this codebase, PLINK 1.9, introduces
extensive use of bit-level parallelism, O(sqrt(n))-time/constant-space
Hardy-Weinberg equilibrium and Fisher's exact tests, and many other algorithmic
improvements. In combination, these changes accelerate most operations by 1-4
orders of magnitude, and allow the program to handle datasets too large to fit
in RAM. This will be followed by PLINK 2.0, which will introduce (a) a new data
format capable of efficiently representing probabilities, phase, and
multiallelic variants, and (b) extensions of many functions to account for the
new types of information.
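The idea of bit-level parallelism can be illustrated with a toy example: many 2-bit genotype codes are packed into one machine word, and a whole word is summarised with a few bitwise operations and population counts rather than one genotype at a time. The encoding below (0, 1, 2 = number of alternate alleles, 3 = missing) is an assumption made for this sketch and is not claimed to match PLINK's file format or code; `pack` and `alt_allele_count` are hypothetical names.

```python
MASK = 0x5555555555555555  # low bit of each 2-bit slot in a 64-bit word

def pack(genotypes):
    """Pack up to 32 two-bit genotype codes into one integer word."""
    word = 0
    for slot, g in enumerate(genotypes):
        word |= (g & 3) << (2 * slot)
    return word

def alt_allele_count(word):
    """Count alternate alleles in all 32 slots at once, skipping missing (3)."""
    lo = word & MASK            # low bit of every slot
    hi = (word >> 1) & MASK     # high bit of every slot
    het = lo & ~hi              # slots holding code 1
    hom = hi & ~lo              # slots holding code 2
    return bin(het).count("1") + 2 * bin(hom).count("1")

word = pack([0, 1, 2, 3, 2, 1])
total = alt_allele_count(word)
```

In a compiled language the two population counts map to single hardware instructions, so one word of 32 genotypes costs about as much as a handful of scalar operations.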
The second-generation versions of PLINK will offer dramatic improvements in
performance and compatibility. For the first time, users without access to
high-end computing resources can perform several essential analyses of the
feature-rich and very large genetic datasets coming into use.
Comment: 2 figures, 1 additional file