Hybrid Cluster based Collaborative Filtering using Firefly and Agglomerative Hierarchical Clustering
Recommendation systems find user preferences from an individual's purchase history using data mining and machine learning techniques. To reduce computation time, recommendation systems generally use a pre-processing technique, which in turn helps to improve performance and overcome over-fitting of the data. In this paper, we propose a hybrid collaborative filtering algorithm using firefly and agglomerative hierarchical clustering with a priority queue and Principal Component Analysis (PCA). We applied our hybrid algorithm to the MovieLens dataset and used Pearson correlation to obtain Top-N recommendations. Experimental results show that our algorithm delivers accurate and reliable recommendations, with high performance compared with existing algorithms.
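As a rough illustration of only the final step mentioned in this abstract (Pearson correlation feeding a Top-N list), the sketch below ranks unrated items by similarity-weighted neighbour ratings. It is not the authors' hybrid algorithm: the firefly, hierarchical clustering, priority queue, and PCA stages are omitted, and the function names `pearson` and `top_n` and the dictionary layout of the ratings are assumptions made for this example.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation over items both users rated; 0.0 when undefined."""
    common = [item for item in a if item in b]
    n = len(common)
    if n < 2:
        return 0.0
    mean_a = sum(a[i] for i in common) / n
    mean_b = sum(b[i] for i in common) / n
    cov = sum((a[i] - mean_a) * (b[i] - mean_b) for i in common)
    var_a = sum((a[i] - mean_a) ** 2 for i in common)
    var_b = sum((b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def top_n(ratings, user, n=2):
    """Rank items the user has not rated by similarity-weighted neighbour ratings."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = pearson(ratings[user], other_ratings)
        if sim <= 0:
            continue  # assumption: ignore dissimilar or uncorrelated neighbours
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(scores, key=lambda i: scores[i] / weights[i], reverse=True)
    return ranked[:n]
```

In a cluster-based variant, the loop over `ratings` would presumably run only over users in the target user's cluster, which is where the pre-processing would save time.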
Fast Approximate K-Means via Cluster Closures
K-means, a simple and effective clustering algorithm, is one of the most
widely used algorithms in the multimedia and computer vision community. Traditional
K-means is an iterative algorithm: in each iteration new cluster centers are
computed and each data point is re-assigned to its nearest center. The cluster
re-assignment step becomes prohibitively expensive when the number of data
points and cluster centers are large.
In this paper, we propose a novel approximate K-means algorithm to greatly
reduce the computational complexity in the assignment step. Our approach is
motivated by the observation that most active points changing their cluster
assignments at each iteration are located on or near cluster boundaries. The
idea is to efficiently identify those active points by pre-assembling the data
into groups of neighboring points using multiple random spatial partition
trees, and to use the neighborhood information to construct a closure for each
cluster, in such a way that only a small number of cluster candidates need to be
considered when assigning a data point to its nearest cluster. Using complexity
analysis, image data clustering, and applications to image retrieval, we show
that our approach outperforms state-of-the-art approximate K-means
algorithms in terms of clustering quality and efficiency.
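To make the cost described in this abstract concrete, here is a minimal sketch of the traditional (exact) K-means iteration, not the approximate algorithm the paper proposes. The function names and the tuple-based point representation are assumptions for illustration. The `assign` step compares every point with every center, which is the part that becomes expensive when both counts are large.

```python
def assign(points, centers):
    """Assignment step: every point scans every center (n * k distance evaluations)."""
    labels = []
    for p in points:
        dists = [sum((x - c) ** 2 for x, c in zip(p, center)) for center in centers]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centers):
    """Update step: move each center to the mean of its assigned points."""
    new_centers = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            new_centers.append(old)  # keep an empty cluster's center unchanged
            continue
        dim = len(old)
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
        )
    return new_centers


def kmeans(points, centers, max_iter=100):
    """Alternate update and assignment until the labels stop changing."""
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels
```

An approximate variant along the lines the abstract describes would shrink the inner list in `assign` to a small set of candidate centers per point; how those candidates are built from the partition trees is not specified here and is left out.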
Knowledge Base Population using Semantic Label Propagation
A crucial aspect of a knowledge base population system that extracts new
facts from text corpora is the generation of training data for its relation
extractors. In this paper, we present a method that maximizes the effectiveness
of newly trained relation extractors at a minimal annotation cost. Manual
labeling can be significantly reduced by Distant Supervision, which is a method
to construct training data automatically by aligning a large text corpus with
an existing knowledge base of known facts. For example, all sentences
mentioning both 'Barack Obama' and 'US' may serve as positive training
instances for the relation born_in(subject,object). However, distant
supervision typically results in a highly noisy training set: many training
sentences do not really express the intended relation. We propose to combine
distant supervision with minimal manual supervision in a technique called
feature labeling, to eliminate noise from the large and noisy initial training
set, resulting in a significant increase of precision. We further improve on
this approach by introducing the Semantic Label Propagation method, which uses
the similarity between low-dimensional representations of candidate training
instances, to extend the training set in order to increase recall while
maintaining high precision. Our proposed strategy for generating training data
is studied and evaluated on an established test collection designed for
knowledge base population tasks. The experimental results show that the
Semantic Label Propagation strategy leads to substantial performance gains when
compared to existing approaches, while requiring an almost negligible manual
annotation effort.
Comment: Submitted to Knowledge Based Systems, special issue on Knowledge
Bases for Natural Language Processing
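The distant supervision step described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions: plain substring matching stands in for real entity linking, the function name `distant_labels` and the triple layout `(relation, subject, object)` are hypothetical, and the feature labeling and Semantic Label Propagation stages are not shown.

```python
def distant_labels(sentences, kb):
    """Label a sentence with a relation whenever it mentions both arguments of a known fact."""
    labelled = []
    for sentence in sentences:
        for relation, subj, obj in kb:
            # Assumption: substring match is a stand-in for entity recognition and linking.
            if subj in sentence and obj in sentence:
                labelled.append((sentence, relation, subj, obj))
    return labelled
```

With the fact `("born_in", "Barack Obama", "US")`, a sentence such as "Barack Obama left the US for a state visit." would also be labelled as a positive instance, which is the kind of noise the abstract says must be filtered out.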
Survey of Vector Database Management Systems
There are now over 20 commercial vector database management systems (VDBMSs),
all produced within the past five years. But embedding-based retrieval has been
studied for over ten years, and similarity search a staggering half century and
more. Driving this shift from algorithms to systems are new data intensive
applications, notably large language models, that demand vast stores of
unstructured data coupled with reliable, secure, fast, and scalable query
processing capability. A variety of new data management techniques now exist
for addressing these needs; however, there is no comprehensive survey to
thoroughly review these techniques and systems. We start by identifying five
main obstacles to vector data management, namely vagueness of semantic
similarity, large size of vectors, high cost of similarity comparison, lack of
natural partitioning that can be used for indexing, and difficulty of
efficiently answering hybrid queries that require both attributes and vectors.
Overcoming these obstacles has led to new approaches to query processing,
storage and indexing, and query optimization and execution. For query
processing, a variety of similarity scores and query types are now well
understood; for storage and indexing, techniques include vector compression,
namely quantization, and partitioning based on randomization, learning
partitioning, and navigable partitioning; for query optimization and execution,
we describe new operators for hybrid queries, as well as techniques for plan
enumeration, plan selection, and hardware accelerated execution. These
techniques lead to a variety of VDBMSs across a spectrum of design and runtime
characteristics, including native systems specialized for vectors and extended
systems that incorporate vector capabilities into existing systems. We then
discuss benchmarks, and finally we outline research challenges and point the
direction for future work.
Comment: 25 pages
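A hybrid query of the kind this abstract mentions, one that constrains both attributes and vectors, can be sketched with a brute-force scan. This is only an illustration under assumptions: it uses cosine similarity as the score, filters on attributes before ranking, and uses no index, compression, or query optimizer. The record layout and the names `cosine` and `hybrid_search` are hypothetical and do not come from any particular VDBMS.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def hybrid_search(records, query, predicate, k=2):
    """Pre-filter records on attributes, then rank the survivors by similarity to the query."""
    candidates = [r for r in records if predicate(r["attrs"])]
    ranked = sorted(candidates, key=lambda r: cosine(r["vector"], query), reverse=True)
    return [r["id"] for r in ranked[:k]]
```

Filtering first keeps the result exact but scans every matching record; a real system would have to choose between filtering before, during, or after an approximate index lookup.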