Adaptive Preferential Attached kNN Graph With Distribution-Awareness
Graph-based kNN algorithms have garnered widespread popularity for machine
learning tasks, due to their simplicity and effectiveness. However, the
conventional kNN graph's reliance on a fixed value of k can hinder its
performance, especially in scenarios involving complex data distributions.
Moreover, as with other classification models, ambiguous samples along decision
boundaries often present a challenge, as they are more prone to
incorrect classification. To address these issues, we propose the Preferential
Attached k-Nearest Neighbors Graph (paNNG), which combines adaptive kNN with
distribution-based graph construction. By incorporating distribution
information, paNNG can significantly improve performance for ambiguous samples
by "pulling" them towards their original classes and hence enable enhanced
overall accuracy and generalization capability. Through rigorous evaluations on
diverse benchmark datasets, paNNG outperforms state-of-the-art algorithms,
showcasing its adaptability and efficacy across various real-world scenarios.
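To make the fixed-k limitation mentioned in the abstract concrete, here is a minimal sketch of a conventional kNN graph. It is not paNNG, whose construction the abstract does not detail. The function name `knn_graph` and the toy points are assumptions chosen for illustration.

```python
import math

def knn_graph(points, k):
    """Conventional kNN graph: link each point to its k nearest other points.

    Illustrative baseline only (hypothetical helper, not the paNNG algorithm).
    """
    graph = {}
    for i, p in enumerate(points):
        # Euclidean distance from point i to every other point, nearest first.
        dists = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )
        graph[i] = [j for _, j in dists[:k]]
    return graph

# Three tightly clustered points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
graph = knn_graph(points, k=2)
```

With a fixed k=2, the isolated point at index 3 is still forced to link to two members of the distant cluster, even though none of them is genuinely close. This is the kind of behavior an adaptive choice of k is meant to avoid.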
One Size Cannot Fit All: a Self-Adaptive Dispatcher for Skewed Hash Join in Shared-nothing RDBMSs
Shared-nothing architecture has been widely adopted in various commercial
distributed RDBMSs. Thanks to this architecture, queries can be processed in
parallel and accelerated by scaling out the cluster on demand. Even so, load
balancing remains a challenging issue in all distributed RDBMSs, including
shared-nothing ones, and it suffers greatly from skewed data distribution. In
this work, we focus on one representative operator,
namely Hash Join, and investigate how skewness among the nodes of a cluster
will affect the load balance and eventual efficiency of an arbitrary query in
shared-nothing RDBMSs. We found that existing Distributed Hash Join (Dist-HJ)
solutions may not provide satisfactory performance when a value is skewed in
both the probe and build tables. To address that, we propose a novel Dist-HJ
solution, namely Partition and Replication (PnR). Although PnR provides the best
efficiency in some skew scenarios, our exhaustive experiments over a group
of shared-nothing RDBMSs show that no single Dist-HJ solution
wins in all data skew scenarios. To this end, we further propose a
self-adaptive Dist-HJ solution with a built-in sub-operator cost model that
dynamically selects the best Dist-HJ implementation strategy at runtime
according to the data skew of the target query. We implement the solution in
our commercial shared-nothing RDBMS, KaiwuDB (formerly ZNBase), and an
empirical study shows that the self-adaptive model achieves the best
performance compared to a series of solutions adopted in many existing RDBMSs.
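The load-balancing problem described in the abstract can be illustrated with a small sketch of plain hash partitioning on the join key. It shows only why a skewed key overloads one node. It is not PnR or the self-adaptive dispatcher, neither of which the abstract specifies. The function name `partition_load`, the modulo stand-in for a hash function, and the toy key sets are assumptions.

```python
from collections import Counter

def partition_load(keys, num_nodes):
    """Count the rows each node receives when rows are routed by join key.

    Hypothetical helper: `key % num_nodes` stands in for a real hash function.
    """
    load = Counter({node: 0 for node in range(num_nodes)})
    for key in keys:
        load[key % num_nodes] += 1
    return load

# 1000 rows with distinct keys: the work spreads evenly over 4 nodes.
uniform = partition_load(list(range(1000)), num_nodes=4)

# 1000 rows where key 7 accounts for 900 of them: one node gets almost all.
skewed = partition_load([7] * 900 + list(range(100)), num_nodes=4)
```

In the uniform case each node receives 250 rows. In the skewed case the node that owns key 7 receives 925 rows while the others receive 25 each, so the whole join waits on a single node.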