
4,249 research outputs found

Hybrid Cluster based Collaborative Filtering using Firefly and Agglomerative Hierarchical Clustering

Author: G. Spoorthy
Sriram G Sanjeevi
Publication venue: 'Asian Online Journals'
Publication date: 31/12/2021

Recommendation systems find user preferences based on an individual's purchase history using data mining and machine learning techniques. To reduce computation time, recommendation systems generally use a pre-processing technique, which in turn helps to improve performance and overcome over-fitting of the data. In this paper, we propose a hybrid collaborative filtering algorithm using firefly and agglomerative hierarchical clustering techniques with a priority queue and Principal Component Analysis (PCA). We applied our hybrid algorithm to the MovieLens dataset and used Pearson correlation to obtain Top-N recommendations. Experimental results show that our algorithm delivers accurate and reliable recommendations, showing high performance when compared with existing algorithms.

International Journal of Computer and Information Technology
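The abstract above mentions using Pearson correlation to obtain Top-N recommendations. The following is a minimal sketch of that one step only, not the authors' hybrid algorithm: it omits the firefly, hierarchical clustering, and PCA stages entirely. The function names `pearson` and `top_n` and the toy ratings are assumptions made for illustration.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation over items both users rated; 0.0 if undefined."""
    common = sorted(set(a) & set(b))
    if len(common) < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / len(common)
    mean_b = sum(b[i] for i in common) / len(common)
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def top_n(ratings, user, n=2):
    """Rank items the user has not rated by the similarity-weighted
    average rating of positively correlated users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore uncorrelated or negatively correlated users
        for item, r in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * r
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: (-scores[i] / weights[i], i))
    return ranked[:n]

# Hypothetical toy data: user -> {movie id: rating}
ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m3": 5, "m4": 5},
    "carol": {"m1": 1, "m2": 5, "m3": 2, "m5": 4},
}
print(top_n(ratings, "alice"))  # ['m4']
```

Here "carol" correlates negatively with "alice", so only "bob" contributes and the single unseen item "m4" is recommended.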

Fast Approximate $K$-Means via Cluster Closures

Author: Ke Qifa
Li Shipeng
Wang Jing
Wang Jingdong
Zeng Gang
Publication date: 01/01/2012

K-means, a simple and effective clustering algorithm, is one of the most widely used algorithms in the multimedia and computer vision communities. Traditional k-means is an iterative algorithm: in each iteration, new cluster centers are computed and each data point is re-assigned to its nearest center. The cluster re-assignment step becomes prohibitively expensive when the numbers of data points and cluster centers are large. In this paper, we propose a novel approximate k-means algorithm to greatly reduce the computational complexity of the assignment step. Our approach is motivated by the observation that most active points changing their cluster assignments at each iteration are located on or near cluster boundaries. The idea is to efficiently identify those active points by pre-assembling the data into groups of neighboring points using multiple random spatial partition trees, and to use the neighborhood information to construct a closure for each cluster, in such a way that only a small number of cluster candidates need to be considered when assigning a data point to its nearest cluster. Using complexity analysis, image data clustering, and applications to image retrieval, we show that our approach outperforms state-of-the-art approximate k-means algorithms in terms of clustering quality and efficiency.

arXiv.org e-Print Archive

Crossref
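To make the cost of the assignment step in the k-means abstract above concrete, here is a minimal sketch of the traditional (exact) iteration, which compares every point against every center. It is not the cluster-closure method the paper proposes. It also records how many points change cluster per iteration, the "active points" the abstract refers to. All function names and the toy data are assumptions made for illustration.

```python
def assign(points, centers):
    """Assign each point to the index of its nearest center
    (squared Euclidean distance, all centers checked)."""
    def d2(p, c):
        return sum((x - y) ** 2 for x, y in zip(p, c))
    return [min(range(len(centers)), key=lambda j: d2(p, centers[j]))
            for p in points]

def update(points, labels, centers):
    """Recompute each center as the mean of its members;
    an empty cluster keeps its previous center."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == j]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(tuple(
            sum(p[k] for p in members) / len(members)
            for k in range(len(old))))
    return new_centers

def kmeans(points, centers, max_iter=20):
    """Plain Lloyd iteration; also returns the number of points
    that changed cluster in each iteration."""
    labels = assign(points, centers)
    active_counts = []
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        changed = sum(a != b for a, b in zip(labels, new_labels))
        active_counts.append(changed)
        labels = new_labels
        if changed == 0:
            break
    return labels, centers, active_counts

# Hypothetical toy data: two well-separated groups, poor initial centers
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers, active_counts = kmeans(points, [(0.0, 0.0), (1.0, 0.0)])
print(labels)         # [0, 0, 0, 1, 1, 1]
print(active_counts)  # [1, 0]
```

Even in this toy run only one of six points changes cluster after the first update, yet `assign` still recomputes every point-to-center distance. That full recomputation is the cost an approximate method would try to avoid.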

Knowledge Base Population using Semantic Label Propagation

Author: Deleu Johannes
Demeester Thomas
Develder Chris
Sterckx Lucas
Publication date: 01/01/2016

A crucial aspect of a knowledge base population system that extracts new facts from text corpora is the generation of training data for its relation extractors. In this paper, we present a method that maximizes the effectiveness of newly trained relation extractors at a minimal annotation cost. Manual labeling can be significantly reduced by distant supervision, which is a method to construct training data automatically by aligning a large text corpus with an existing knowledge base of known facts. For example, all sentences mentioning both 'Barack Obama' and 'US' may serve as positive training instances for the relation born_in(subject,object). However, distant supervision typically results in a highly noisy training set: many training sentences do not really express the intended relation. We propose to combine distant supervision with minimal manual supervision in a technique called feature labeling, to eliminate noise from the large and noisy initial training set, resulting in a significant increase in precision. We further improve on this approach by introducing the Semantic Label Propagation method, which uses the similarity between low-dimensional representations of candidate training instances to extend the training set in order to increase recall while maintaining high precision. Our proposed strategy for generating training data is studied and evaluated on an established test collection designed for knowledge base population tasks. The experimental results show that the Semantic Label Propagation strategy leads to substantial performance gains when compared to existing approaches, while requiring an almost negligible manual annotation effort.

Comment: Submitted to Knowledge Based Systems, special issue on Knowledge Bases for Natural Language Processin

arXiv.org e-Print Archive

Ghent University Academic Bibliography
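The distant supervision step described in the abstract above can be sketched in a few lines. This is only an illustration of the alignment idea and of why it produces noisy labels; it does not implement feature labeling or Semantic Label Propagation. The function name `distant_labels`, the naive substring matching, and the example sentences are assumptions (only the 'Barack Obama' / 'US' / born_in example comes from the abstract).

```python
# Knowledge base of known facts: (subject, object) -> relation
kb = {("Barack Obama", "US"): "born_in"}

# Hypothetical corpus sentences
sentences = [
    "Barack Obama was born in the US state of Hawaii.",
    "Barack Obama returned to the US after the summit.",
    "Angela Merkel visited the US.",
]

def distant_labels(sentences, kb):
    """Label a sentence with a KB relation whenever it mentions both
    arguments of a known fact (naive substring match, for illustration)."""
    examples = []
    for s in sentences:
        for (subj, obj), rel in kb.items():
            if subj in s and obj in s:
                examples.append((s, subj, obj, rel))
    return examples

for sentence, subj, obj, rel in distant_labels(sentences, kb):
    print(f"{rel}({subj}, {obj}) <- {sentence}")
```

Both of the first two sentences are labeled as born_in instances, although only the first expresses that relation. The second is the kind of noisy training example the abstract says must be filtered out.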

Survey of Vector Database Management Systems

Author: Li Guoliang
Pan James Jie
Wang Jianguo
Publication date: 21/10/2023

There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. But embedding-based retrieval has been studied for over ten years, and similarity search for a staggering half century and more. Driving this shift from algorithms to systems are new data-intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist for addressing these needs; however, there is no comprehensive survey that thoroughly reviews these techniques and systems. We start by identifying five main obstacles to vector data management, namely vagueness of semantic similarity, large size of vectors, high cost of similarity comparison, lack of natural partitioning that can be used for indexing, and difficulty of efficiently answering hybrid queries that require both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, and partitioning based on randomization, learning partitioning, and navigable partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, and hardware accelerated execution. These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including native systems specialized for vectors and extended systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally we outline research challenges and point the direction for future work.

Comment: 25 page

arXiv.org e-Print Archive
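The abstract above singles out hybrid queries, which combine an attribute condition with vector similarity. As a minimal sketch of what such a query computes, the following brute-force example pre-filters on an attribute and then ranks the survivors by cosine similarity. It uses no index, quantization, or query optimizer, and does not reflect how any particular VDBMS executes the query. The names `cosine` and `hybrid_search`, the record layout, and the data are assumptions made for illustration.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def hybrid_search(items, query, predicate, k=2):
    """Pre-filter on attributes, then return the ids of the k most
    similar remaining items (exhaustive scan)."""
    candidates = [it for it in items if predicate(it["attrs"])]
    candidates.sort(key=lambda it: cosine(it["vector"], query), reverse=True)
    return [it["id"] for it in candidates[:k]]

# Hypothetical records: an embedding plus structured attributes
items = [
    {"id": "a", "vector": (1.0, 0.0), "attrs": {"lang": "en"}},
    {"id": "b", "vector": (0.9, 0.1), "attrs": {"lang": "de"}},
    {"id": "c", "vector": (0.0, 1.0), "attrs": {"lang": "en"}},
    {"id": "d", "vector": (0.7, 0.7), "attrs": {"lang": "en"}},
]

result = hybrid_search(items, (1.0, 0.0), lambda a: a["lang"] == "en", k=2)
print(result)  # ['a', 'd']
```

Item "b" is the second-closest vector overall but is excluded by the attribute filter. Whether to filter before or after the vector search is one of the choices the query optimization techniques mentioned in the abstract would have to make.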