Parametric t-Distributed Stochastic Exemplar-centered Embedding
Parametric embedding methods such as parametric t-SNE (pt-SNE) have been
widely adopted for data visualization and out-of-sample data embedding without
further computationally expensive optimization or approximation. However, the
performance of pt-SNE is highly sensitive to the hyper-parameter batch size due
to conflicting optimization goals, and often produces dramatically different
embeddings with different choices of user-defined perplexities. To effectively
solve these issues, we present parametric t-distributed stochastic
exemplar-centered embedding methods. Our strategy learns embedding parameters
by comparing given data only with precomputed exemplars, resulting in a cost
function with linear computational and memory complexity, which is further
reduced by noise contrastive samples. Moreover, we propose a shallow embedding
network with high-order feature interactions for data visualization, which is
much easier to tune but produces comparable performance in contrast to a deep
neural network employed by pt-SNE. We empirically demonstrate, using several
benchmark datasets, that our proposed methods significantly outperform pt-SNE
in terms of robustness, visual effects, and quantitative evaluations.
Classifying document types to enhance search and recommendations in digital libraries
In this paper, we address the problem of classifying documents available from
the global network of (open access) repositories according to their type. We
show that the metadata provided by repositories enabling us to distinguish
research papers, theses and slides are missing in over 60% of cases. While
these metadata describing document types are useful in a variety of scenarios
ranging from research analytics to improving search and recommender (SR)
systems, this problem has not yet been sufficiently addressed in the context of
the repositories infrastructure. We have developed a new approach for
classifying document types using supervised machine learning based exclusively
on text-specific features. We achieve an F1-score of 0.96 using the random forest and
AdaBoost classifiers, which are the best-performing models on our data. By
analysing the SR system logs of the CORE [1] digital library aggregator, we
show that users are an order of magnitude more likely to click on research
papers and theses than on slides. This suggests that using document types as a
feature for ranking/filtering SR results in digital libraries has the potential
to improve user experience.
Comment: 12 pages, 21st International Conference on Theory and Practice of
Digital Libraries (TPDL), 2017, Thessaloniki, Greec
Disentangling Factors of Variation with Cycle-Consistent Variational Auto-Encoders
Generative models that learn disentangled representations for different
factors of variation in an image can be very useful for targeted data
augmentation. By sampling from the disentangled latent subspace of interest, we
can efficiently generate new data necessary for a particular task. Learning
disentangled representations is a challenging problem, especially when certain
factors of variation are difficult to label. In this paper, we introduce a
novel architecture that disentangles the latent space into two complementary
subspaces by using only weak supervision in the form of pairwise similarity labels.
Inspired by the recent success of cycle-consistent adversarial architectures,
we use cycle-consistency in a variational auto-encoder framework. Our
non-adversarial approach is in contrast with the recent works that combine
adversarial training with auto-encoders to disentangle representations. We show
compelling results of disentangled latent subspaces on three datasets and
compare with recent works that leverage adversarial training
Heavy-tailed kernels reveal a finer cluster structure in t-SNE visualisations
T-distributed stochastic neighbour embedding (t-SNE) is a widely used data
visualisation technique. It differs from its predecessor SNE by the
low-dimensional similarity kernel: the Gaussian kernel was replaced by the
heavy-tailed Cauchy kernel, solving the "crowding problem" of SNE. Here, we
develop an efficient implementation of t-SNE for a $t$-distribution kernel with
an arbitrary degree of freedom $\nu$, with $\nu \to \infty$ corresponding to SNE
and $\nu = 1$ corresponding to the standard t-SNE. Using theoretical analysis and
toy examples, we show that $\nu < 1$ can further reduce the crowding problem and
reveal finer cluster structure that is invisible in standard t-SNE. We further
demonstrate the striking effect of heavier-tailed kernels on large real-life
data sets such as MNIST, single-cell RNA-sequencing data, and the HathiTrust
library. We use domain knowledge to confirm that the revealed clusters are
meaningful. Overall, we argue that modifying the tail heaviness of the t-SNE
kernel can yield additional insight into the cluster structure of the data
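
As a rough illustration of the kernel family this abstract describes, the sketch below uses the textbook Student-t shape with degrees of freedom nu. The function names and this exact parameterisation are assumptions made for illustration and may differ from the one used in the paper. It checks that nu = 1 gives the Cauchy kernel of standard t-SNE, that very large nu approaches a Gaussian kernel as in SNE, and that nu < 1 puts more weight on large distances.

```python
import math


def t_kernel(d: float, nu: float) -> float:
    """Unnormalised Student-t similarity at embedding distance d (assumed form)."""
    return (1.0 + d * d / nu) ** (-(nu + 1.0) / 2.0)


def cauchy_kernel(d: float) -> float:
    """Low-dimensional kernel of standard t-SNE."""
    return 1.0 / (1.0 + d * d)


def gaussian_kernel(d: float) -> float:
    """Gaussian kernel with unit variance, the large-nu limit of t_kernel."""
    return math.exp(-d * d / 2.0)


if __name__ == "__main__":
    # Compare the tails of the three regimes at a few distances.
    for d in (0.5, 2.0, 5.0):
        print(d, t_kernel(d, 0.5), t_kernel(d, 1.0), t_kernel(d, 1e8))
```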
Γ-stochastic neighbour embedding for feed-forward data visualization
t-distributed Stochastic Neighbour Embedding (t-SNE) is one of the most popular nonlinear dimension reduction techniques used in multiple application domains. In this paper we propose a variation on the embedding neighbourhood distribution, resulting in Γ-SNE, which can construct a feed-forward mapping using an RBF network. We compare the visualizations generated by Γ-SNE with those of t-SNE and provide empirical evidence suggesting the network is capable of robust interpolation and automatic weight regularization
New modification version of principal component analysis with kinetic correlation matrix using kinetic energy
Principal Component Analysis (PCA) is a direct, non-parametric method for extracting pertinent information from confusing data sets. It presents a roadmap for reducing a complex data set to a lower dimension to disclose the hidden, simplified structures that often underlie it. However, most PCA methods are not able to realize the desired benefits when they handle real-world, nonlinear data. In this work, a modified version of PCA with a kinetic correlation matrix using kinetic energy is proposed. The features of this modified PCA have been assessed on different data sets of air passenger numbers. The results show that the modified version of PCA is more effective in data compression, class separability and classification accuracy than traditional PCA.
Unsupervised user behavior representation for fraud review detection with cold-start problem
© Springer Nature Switzerland AG 2019. Detecting fraud reviews is becoming extremely important for providing reliable information in cyberspace. Handling the cold-start problem is a critical and urgent challenge, since a cold-start fraud review rarely provides sufficient information for assessing its authenticity. Existing work on detecting cold-start cases relies on the limited content of the review posted by the user and a traditional classifier to make the decision. However, simply modeling the review is not reliable, since reviews can be easily manipulated. Also, it is hard to obtain high-quality labeled data for training the classifier. In this paper, we tackle cold-start problems by (1) using a user's behavior representation rather than review contents to measure authenticity, which (2) further considers the user's social relations with other existing users when posting reviews. The method is (3) completely unsupervised. Comprehensive experiments on Yelp data sets demonstrate that our method significantly outperforms the state-of-the-art methods.
Localizing the Common Action Among a Few Videos
This paper strives to localize the temporal extent of an action in a long
untrimmed video. Where existing work leverages many examples with their start,
their ending, and/or the class of the action during training time, we propose
few-shot common action localization. The start and end of an action in a long
untrimmed video are determined based on just a handful of trimmed video
examples containing the same action, without knowing their common class label.
To address this task, we introduce a new 3D convolutional network architecture
able to align representations from the support videos with the relevant query
video segments. The network contains: (i) a mutual enhancement module
to simultaneously complement the representation of the few trimmed support
videos and the untrimmed query video; (ii) a progressive alignment
module that iteratively fuses the support videos into the query branch; and
(iii) a pairwise matching module to weigh the importance of different
support videos. Evaluation of few-shot common action localization in untrimmed
videos containing a single or multiple action instances demonstrates the
effectiveness and general applicability of our proposal.
Comment: ECCV 202
Graph Layouts by t‐SNE
We propose a new graph layout method based on a modification of the t-distributed Stochastic Neighbor Embedding (t-SNE) dimensionality reduction technique. Although t-SNE is one of the best techniques for visualizing high-dimensional data as 2D scatterplots, it has not been used in the context of classical graph layout. Our method, tsNET, represents a graph with a distance matrix, which together with a modified t-SNE cost function results in desirable layouts. We evaluate our method by a formal comparison with state-of-the-art methods, both visually and via established quality metrics on a comprehensive benchmark containing real-world and synthetic graphs. As evidenced by the quality metrics and visual inspection, tsNET produces excellent layouts.
Can Genetic Programming Do Manifold Learning Too?
Exploratory data analysis is a fundamental aspect of knowledge discovery that
aims to find the main characteristics of a dataset. Dimensionality reduction,
such as manifold learning, is often used to reduce the number of features in a
dataset to a manageable level for human interpretation. Despite this, most
manifold learning techniques do not explain anything about the original
features nor the true characteristics of a dataset. In this paper, we propose a
genetic programming approach to manifold learning called GP-MaL which evolves
functional mappings from a high-dimensional space to a lower dimensional space
through the use of interpretable trees. We show that GP-MaL is competitive with
existing manifold learning algorithms, while producing models that can be
interpreted and re-used on unseen data. A number of promising future directions
of research are found in the process.
Comment: 16 pages, accepted in EuroGP '1