Fractional norms and quasinorms do not help to overcome the curse of dimensionality
The curse of dimensionality causes well-known and widely discussed problems for machine learning methods. There is a hypothesis that using the Manhattan distance, or even fractional quasinorms lp (for p less than 1), can help to overcome the curse of dimensionality in classification problems. In this study, we systematically test this hypothesis. We confirm that fractional quasinorms have a greater relative contrast or coefficient of variation than the Euclidean norm l2, but we also demonstrate that distance concentration shows qualitatively the same behaviour for all tested norms and quasinorms, and that the difference between them decays as the dimension tends to infinity. Evaluating the classification quality of kNN under different norms and quasinorms shows that a greater relative contrast does not imply better classifier performance, and the worst performance on different databases was produced by different norms (quasinorms). A systematic comparison shows that the difference in performance of kNN based on lp for p = 2, 1, and 0.5 is statistically insignificant.
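As a minimal sketch of the kind of measurement this abstract describes (the uniform data, sample sizes, and parameter grid here are illustrative assumptions, not the study's exact protocol), one can compare the relative contrast and coefficient of variation of lp distances as the dimension grows:

```python
import numpy as np

def lp_dist(X, q, p):
    # Minkowski distance between each row of X and query q;
    # for p < 1 this is a quasinorm rather than a norm.
    return np.sum(np.abs(X - q) ** p, axis=1) ** (1.0 / p)

rng = np.random.default_rng(0)
n = 1000
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(n, d))
    q = rng.uniform(size=d)
    for p in (0.5, 1.0, 2.0):
        dist = lp_dist(X, q, p)
        rc = (dist.max() - dist.min()) / dist.min()  # relative contrast
        cv = dist.std() / dist.mean()                # coefficient of variation
        print(f"d={d:4d}  p={p:3.1f}  RC={rc:7.3f}  CV={cv:6.4f}")
```

For small p the contrast is indeed larger at any fixed dimension, but both measures shrink toward zero for every p as d grows, which is the concentration behaviour the abstract reports.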
Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search
Retrieval pipelines commonly rely on a term-based search to obtain candidate
records, which are subsequently re-ranked. Some candidates are missed by this
approach, e.g., due to a vocabulary mismatch. We address this issue by
replacing the term-based search with a generic k-NN retrieval algorithm, where
a similarity function can take into account subtle term associations. While an
exact brute-force k-NN search using this similarity function is slow, we
demonstrate that an approximate algorithm can be nearly two orders of magnitude
faster at the expense of only a small loss in accuracy. A retrieval pipeline
using an approximate k-NN search can be more effective and efficient than the
term-based pipeline. This opens up new possibilities for designing effective
retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection are available online.
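The abstract's similarity function is learned and domain-specific, but the exact-versus-approximate trade-off it describes can be sketched with an off-the-shelf ANN index. Below is a minimal illustration using cosine similarity and the HNSW method from the nmslib library; the random vectors and index parameters are assumptions for illustration, not the paper's setup:

```python
import numpy as np
import nmslib  # pip install nmslib

rng = np.random.default_rng(0)
docs = rng.normal(size=(100_000, 128)).astype(np.float32)  # stand-in document vectors
query = rng.normal(size=128).astype(np.float32)

# Exact brute-force k-NN under cosine similarity: O(n*d) work per query.
sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
exact_top10 = set(np.argsort(-sims)[:10])

# Approximate k-NN via an HNSW graph: fast queries after a one-time index build.
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(docs)
index.createIndex({"M": 16, "efConstruction": 200})
approx_ids, _ = index.knnQuery(query, k=10)

print("recall@10:", len(exact_top10 & set(approx_ids)) / 10)
```

The speed/accuracy balance is tuned through index parameters such as M and efConstruction: a denser graph costs more to build but recovers more of the exact top-k.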
Dimensionality's blessing: Clustering images by underlying distribution
Distances between high-dimensional vectors tend to a constant. This is typically considered a negative "contrast-loss" phenomenon that hinders clustering and other machine learning techniques. We reinterpret "contrast-loss" as a blessing. Re-deriving "contrast-loss" using the law of large numbers, we show it results in a distribution's instances concentrating on a thin "hyper-shell". The hollow center means that apparently chaotically overlapping distributions are actually intrinsically separable. We use this to develop distribution-clustering, an elegant algorithm for grouping data points by their (unknown) underlying distribution. Distribution-clustering creates notably clean clusters from raw unlabeled data, estimates the number of clusters by itself, and is inherently robust to "outliers", which form their own clusters. This enables trawling for patterns in unorganized data and may be the key to enabling machine intelligence.
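The "thin hyper-shell" effect is easy to reproduce: for a standard d-dimensional Gaussian, point norms concentrate around sqrt(d) with a relative spread that vanishes as d grows. A minimal sketch (the Gaussian data and sample size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
for d in (2, 10, 100, 1000, 10_000):
    X = rng.normal(size=(n, d))    # sample from a standard d-dimensional Gaussian
    r = np.linalg.norm(X, axis=1)  # distance of each point from the distribution mean
    # The shell radius grows like sqrt(d) while its relative thickness shrinks.
    print(f"d={d:6d}  mean radius={r.mean():8.2f}  std/mean={r.std() / r.mean():.4f}")
```

Two distributions with coincident centers but different scales therefore occupy distinct shells in high dimension, which is the kind of separability that distribution-clustering exploits.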
Why and When Can Deep -- but Not Shallow -- Networks Avoid the Curse of Dimensionality: a Review
The paper characterizes classes of functions for which deep learning can be exponentially better than shallow learning. Deep convolutional networks satisfy these conditions as a special case, though weight sharing is not the main reason for their exponential advantage.
LDA-Based Industry Classification
Industry classification is a crucial step in financial analysis. However, existing industry classification schemes have several limitations. To overcome these limitations, we propose an industry classification methodology based on business commonalities, using the topic features learned by Latent Dirichlet Allocation (LDA) from firms' business descriptions. Two types of classification, firm-centric and industry-centric, were explored. Preliminary evaluation results showed the effectiveness of our method.
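A minimal sketch of the pipeline the abstract outlines, using scikit-learn's LDA implementation (the toy descriptions, topic count, and clustering step are illustrative assumptions, not the paper's exact method):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical business descriptions; real inputs would come from firms' filings.
descriptions = [
    "We design and manufacture semiconductor chips for mobile devices.",
    "Our bank provides retail lending and wealth management services.",
    "We fabricate integrated circuits and license processor designs.",
    "The company offers consumer loans, deposits, and insurance products.",
]

counts = CountVectorizer(stop_words="english").fit_transform(descriptions)
topics = LatentDirichletAllocation(n_components=2, random_state=0).fit_transform(counts)

# Firm-centric classification: rank each firm's peers by topic-mixture similarity.
print(cosine_similarity(topics).round(2))

# Industry-centric classification: partition firms by clustering topic mixtures.
print(KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(topics))
```

Firms with similar topic mixtures, i.e., similar business commonalities, end up as close peers or in the same cluster regardless of how a legacy scheme would label them.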
Detecting the ultra low dimensionality of real networks
Reducing dimensional redundancy to find simplifying patterns in high-dimensional datasets and complex networks has become a major endeavor in many scientific fields. However, detecting the dimensionality of their latent space is challenging but necessary to generate efficient embeddings to be used in a multitude of downstream tasks. Here, we propose a method to infer the dimensionality of networks without the need for any a priori spatial embedding. Due to the ability of hyperbolic geometry to capture the complex connectivity of real networks, we detect ultra-low dimensionality, far below values reported using other approaches. We applied our method to real networks from different domains and found unexpected regularities, including: tissue-specific biomolecular networks being extremely low dimensional; brain connectomes being close to the three dimensions of their anatomical embedding; and social networks and the Internet requiring slightly higher dimensionality. Beyond paving the way towards an ultra-efficient dimensional reduction, our findings help address fundamental issues that hinge on dimensionality, such as universality in critical behavior.