17,622 research outputs found
A Modified Overlapping Partitioning Clustering Algorithm for Categorical Data Clustering
Clustering is one of the important approaches for Clustering enables the grouping of unlabeled data by partitioning data into clusters with similar patterns. Over the past decades, many clustering algorithms have been developed for various clustering problems. An overlapping partitioning clustering (OPC) algorithm can only handle numerical data. Hence, novel clustering algorithms have been studied extensively to overcome this issue. By increasing the number of objects belonging to one cluster and distance between cluster centers, the study aimed to cluster the textual data type without losing the main functions. The proposed study herein included over twenty newsgroup dataset, which consisted of approximately 20000 textual documents. By introducing some modifications to the traditional algorithm, an acceptable level of homogeneity and completeness of clusters were generated. Modifications were performed on the pre-processing phase and data representation, along with the number methods which influence the primary function of the algorithm. Subsequently, the results were evaluated and compared with the k-means algorithm of the training and test datasets. The results indicated that the modified algorithm could successfully handle the categorical data and produce satisfactory clusters
Clustering Memes in Social Media
The increasing pervasiveness of social media creates new opportunities to
study human social behavior, while challenging our capability to analyze their
massive data streams. One of the emerging tasks is to distinguish between
different kinds of activities, for example engineered misinformation campaigns
versus spontaneous communication. Such detection problems require a formal
definition of meme, or unit of information that can spread from person to
person through the social network. Once a meme is identified, supervised
learning methods can be applied to classify different types of communication.
The appropriate granularity of a meme, however, is hardly captured from
existing entities such as tags and keywords. Here we present a framework for
the novel task of detecting memes by clustering messages from large streams of
social data. We evaluate various similarity measures that leverage content,
metadata, network features, and their combinations. We also explore the idea of
pre-clustering on the basis of existing entities. A systematic evaluation is
carried out using a manually curated dataset as ground truth. Our analysis
shows that pre-clustering and a combination of heterogeneous features yield the
best trade-off between number of clusters and their quality, demonstrating that
a simple combination based on pairwise maximization of similarity is as
effective as a non-trivial optimization of parameters. Our approach is fully
automatic, unsupervised, and scalable for real-time detection of memes in
streaming data.Comment: Proceedings of the 2013 IEEE/ACM International Conference on Advances
in Social Networks Analysis and Mining (ASONAM'13), 201
A Deep Siamese Network for Scene Detection in Broadcast Videos
We present a model that automatically divides broadcast videos into coherent
scenes by learning a distance measure between shots. Experiments are performed
to demonstrate the effectiveness of our approach by comparing our algorithm
against recent proposals for automatic scene segmentation. We also propose an
improved performance measure that aims to reduce the gap between numerical
evaluation and expected results, and propose and release a new benchmark
dataset.Comment: ACM Multimedia 201
Text Line Segmentation of Historical Documents: a Survey
There is a huge amount of historical documents in libraries and in various
National Archives that have not been exploited electronically. Although
automatic reading of complete pages remains, in most cases, a long-term
objective, tasks such as word spotting, text/image alignment, authentication
and extraction of specific fields are in use today. For all these tasks, a
major step is document segmentation into text lines. Because of the low quality
and the complexity of these documents (background noise, artifacts due to
aging, interfering lines),automatic text line segmentation remains an open
research field. The objective of this paper is to present a survey of existing
methods, developed during the last decade, and dedicated to documents of
historical interest.Comment: 25 pages, submitted version, To appear in International Journal on
Document Analysis and Recognition, On line version available at
http://www.springerlink.com/content/k2813176280456k3
Scalable and interpretable product recommendations via overlapping co-clustering
We consider the problem of generating interpretable recommendations by
identifying overlapping co-clusters of clients and products, based only on
positive or implicit feedback. Our approach is applicable on very large
datasets because it exhibits almost linear complexity in the input examples and
the number of co-clusters. We show, both on real industrial data and on
publicly available datasets, that the recommendation accuracy of our algorithm
is competitive to that of state-of-art matrix factorization techniques. In
addition, our technique has the advantage of offering recommendations that are
textually and visually interpretable. Finally, we examine how to implement our
technique efficiently on Graphical Processing Units (GPUs).Comment: In IEEE International Conference on Data Engineering (ICDE) 201
- …