A Review and Taxonomy of Methods for Quantifying Dataset Similarity
In statistics and machine learning, measuring the similarity between two or
more datasets is important for several purposes. The performance of a
predictive model on novel datasets, referred to as generalizability, critically
depends on how similar the dataset used for fitting the model is to the novel
datasets. Exploiting or transferring insights between similar datasets is a key
aspect of meta-learning and transfer learning. Two-sample testing checks
whether the underlying (multivariate) distributions of two datasets coincide.
Numerous approaches for quantifying dataset similarity have been
proposed in the literature. A structured overview is a crucial first step for
comparisons of approaches. We examine more than 100 methods and provide a
taxonomy, classifying them into ten classes, including (i) comparisons of
cumulative distribution functions, density functions, or characteristic
functions, (ii) methods based on multivariate ranks, (iii) discrepancy measures
for distributions, (iv) graph-based methods, (v) methods based on inter-point
distances, (vi) kernel-based methods, (vii) methods based on binary
classification, (viii) distance and similarity measures for datasets, (ix)
comparisons based on summary statistics, and (x) different testing approaches.
Here, we present an extensive review of these methods. We introduce the main
underlying ideas, formal definitions, and important properties. Comment: 90 pages, submitted to Statistics Survey
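As a minimal illustration of one of the classes above, kernel-based methods (vi) compare datasets through statistics such as the maximum mean discrepancy (MMD). The following sketch (not taken from the review; kernel choice, bandwidth `sigma`, and sample sizes are illustrative assumptions) computes the biased squared-MMD estimate with a Gaussian kernel:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of a and b."""
    sq = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(x, y, sigma=1.0):
    """Biased estimate of the squared MMD between samples x and y.

    Equals the squared RKHS distance between the empirical kernel
    mean embeddings, so it is always non-negative.
    """
    kxx = rbf_kernel(x, x, sigma)
    kyy = rbf_kernel(y, y, sigma)
    kxy = rbf_kernel(x, y, sigma)
    return kxx.mean() + kyy.mean() - 2 * kxy.mean()

rng = np.random.default_rng(0)
# Two samples from the same distribution vs. two from shifted distributions.
same = mmd2_biased(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = mmd2_biased(rng.normal(size=(200, 2)),
                   rng.normal(3.0, 1.0, size=(200, 2)))
```

In a two-sample test, such a statistic would be compared against a permutation-based null distribution; here the point is only that samples from different distributions yield a markedly larger value than samples from the same one.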
Deliverable D7.3 LinkedTV Dissemination and Standardisation Report v1
This deliverable presents the LinkedTV dissemination and standardisation report for the first 18 months of the project.
Deliverable D1.4 Visual, text and audio information analysis for hypervideo, final release
Having extensively evaluated the performance of the technologies included in the first release of the WP1 multimedia analysis tools, using content from the LinkedTV scenarios and by participating in international benchmarking activities, we made concrete decisions about the appropriateness and importance of each individual method or combination of methods. Combined with an updated list of information needs for each scenario, these decisions led to a new set of analysis requirements to be addressed by the final release of the WP1 analysis techniques. To this end, coordinated efforts in three directions, namely (a) improving a number of methods in terms of accuracy and time efficiency, (b) developing new technologies, and (c) defining synergies between methods for obtaining new types of information via multimodal processing, resulted in the final set of multimedia analysis methods for video hyperlinking. Moreover, the different analysis modules developed have been integrated into a web-based infrastructure, enabling fully automatic linking between the multitude of WP1 technologies and the overall LinkedTV platform.
Representation Learning: A Review and New Perspectives
The success of machine learning algorithms generally depends on data
representation, and we hypothesize that this is because different
representations can entangle and hide more or less the different explanatory
factors of variation behind the data. Although specific domain knowledge can be
used to help design representations, learning with generic priors can also be
used, and the quest for AI is motivating the design of more powerful
representation-learning algorithms implementing such priors. This paper reviews
recent work in the area of unsupervised feature learning and deep learning,
covering advances in probabilistic models, auto-encoders, manifold learning,
and deep networks. This motivates longer-term unanswered questions about the
appropriate objectives for learning good representations, for computing
representations (i.e., inference), and the geometrical connections between
representation learning, density estimation, and manifold learning.
Divergence Measures
Data science, information theory, probability theory, statistical learning, and other related disciplines greatly benefit from non-negative measures of dissimilarity between pairs of probability measures. These are known as divergence measures, and exploring their mathematical foundations and diverse applications is of significant interest. The present Special Issue, entitled “Divergence Measures: Mathematical Foundations and Applications in Information-Theoretic and Statistical Problems”, includes eight original contributions and focuses on the mathematical properties and applications of classical and generalized divergence measures from an information-theoretic perspective. It mainly deals with two key generalizations of the relative entropy: namely, the Rényi divergence and the important class of f-divergences. We hope that readers will find interest in this Special Issue, which will stimulate further research into the mathematical foundations and applications of divergence measures.
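For concreteness, the two generalizations mentioned can be computed directly for discrete distributions. The sketch below (an illustrative example, not drawn from the Special Issue; the distributions `p` and `q` are arbitrary) evaluates the relative entropy and the Rényi divergence of order alpha, which recovers the relative entropy in the limit alpha → 1:

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(p || q) for discrete distributions, in nats.

    Terms with p_i = 0 contribute zero by the convention 0 * log 0 = 0.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def renyi_divergence(p, q, alpha):
    """Rényi divergence of order alpha (alpha > 0, alpha != 1), in nats.

    D_alpha(p || q) = log( sum_i p_i^alpha * q_i^(1-alpha) ) / (alpha - 1).
    """
    s = sum(pi**alpha * qi**(1 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1)

p = [0.5, 0.5]
q = [0.9, 0.1]
kl = kl_divergence(p, q)          # strictly positive since p != q
d2 = renyi_divergence(p, q, 2.0)  # Rényi divergence is nondecreasing in alpha
```

Both quantities vanish exactly when the two distributions coincide, and the monotonicity of the Rényi divergence in its order means D_2 upper-bounds the relative entropy.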