Automatically Discovering the Number of Clusters in Web Page Datasets
Clustering is well suited to Web mining because it automatically organizes Web pages into categories, each containing pages with similar content. However, clustering lacks general methods for automatically determining the number of categories or clusters, and for the Web domain in particular no such method is currently suitable for Web page clustering. To address this problem, we identify a constant factor that characterizes the Web domain and, based on it, propose a new method for automatically determining the number of clusters in Web page data sets. In all our experiments, the average inter-cluster similarity reached a constant of 1.7 when clustering produced its best results. We use this constant as the stopping factor in our clustering process: individual Web pages are arranged into clusters, those clusters into larger clusters, and so on until the average inter-cluster similarity approaches the constant. Combining the method described in this paper with our new Bidirectional Hierarchical Clustering algorithm, reported elsewhere, we have developed a clustering system suitable for mining the Web.
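The stopping idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' Bidirectional Hierarchical Clustering algorithm: it assumes plain bottom-up merging, cosine similarity between cluster centroids, and an illustrative stop value of 0.3. The abstract does not define its similarity measure, so the constant 1.7 does not carry over to this sketch; `cluster_until`, `stop_value`, and the toy vectors are hypothetical names and data.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors (an assumed measure)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def avg_inter_cluster_similarity(centroids):
    """Mean pairwise similarity over all distinct pairs of centroids."""
    k = len(centroids)
    sims = [cosine(centroids[i], centroids[j])
            for i in range(k) for j in range(i + 1, k)]
    return sum(sims) / len(sims)

def cluster_until(vectors, stop_value):
    """Merge the most similar pair of clusters until the average
    inter-cluster similarity drops to stop_value or one cluster remains."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > 1:
        centroids = [vectors[c].mean(axis=0) for c in clusters]
        if avg_inter_cluster_similarity(centroids) <= stop_value:
            break  # clusters are now dissimilar enough: stop merging
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = max(pairs, key=lambda p: cosine(centroids[p[0]], centroids[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy "pages" as 2-D term vectors: two obvious topics.
pages = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(cluster_until(pages, stop_value=0.3))  # [[0, 1], [2, 3]]
```

In this sketch the number of clusters is never supplied; it falls out of where the stop value halts the merging, which is the role the abstract assigns to its constant.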
Automatic Concept Discovery from Parallel Text and Visual Corpora
Humans connect language and vision to perceive the world. How can a similar
connection be built for computers? One possible way is via visual concepts,
which are text terms that relate to visually discriminative entities. We
propose an automatic visual concept discovery algorithm using parallel text and
visual corpora; it filters text terms based on the visual discriminative power
of the associated images, and groups them into concepts using visual and
semantic similarities. We illustrate the applications of the discovered
concepts using bidirectional image and sentence retrieval and image tagging
tasks, and show that the discovered concepts not only significantly outperform
several large sets of manually selected concepts, but also achieve
state-of-the-art performance in the retrieval task. Comment: To appear in ICCV 201
Harnessing Deep Learning Techniques for Text Clustering and Document Categorization
This research paper delves into the realm of deep text clustering algorithms with the aim of enhancing the accuracy of document classification. In recent years, the fusion of deep learning techniques and text clustering has shown promise in extracting meaningful patterns and representations from textual data. This paper provides an in-depth exploration of various deep text clustering methodologies, assessing their efficacy in improving document classification accuracy. Delving into the core of deep text clustering, the paper investigates various feature representation techniques, ranging from conventional word embeddings to contextual embeddings furnished by BERT and GPT models. By critically reviewing and comparing these algorithms, we shed light on their strengths, limitations, and potential applications. Through this comprehensive study, we offer insights into the evolving landscape of document analysis and classification, driven by the power of deep text clustering algorithms. Through an original synthesis of existing literature, this research serves as a beacon for researchers and practitioners in harnessing the prowess of deep learning to enhance the accuracy of document classification endeavors.
Web User Session Characterization via Clustering Techniques
We focus on the identification and definition of "Web user-sessions", an aggregation of several TCP connections generated by the same source host, formed on the basis of TCP connection opening time. Identifying a user session is nontrivial: traditional approaches rely on threshold-based mechanisms, which are very sensitive to the chosen threshold value and may be difficult to set correctly. By applying clustering techniques, we define a novel methodology to identify Web user-sessions without requiring an a priori definition of threshold values. We analyze the characteristics of user sessions extracted from real traces, studying the statistical properties of the identified sessions. The study shows that Web user-sessions tend to be Poisson, but correlation may arise during periods of anomalous network or host behavior.
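To see why the threshold choice matters, here is a minimal Python sketch of the traditional threshold-based approach that the abstract contrasts with clustering. It is not the authors' method: `split_sessions`, the opening times, and the threshold values are hypothetical, and the sketch assumes all connections come from a single source host.

```python
def split_sessions(open_times, threshold):
    """Group TCP connection opening times (in seconds) from one host into
    sessions: a new session starts whenever the gap since the previous
    connection exceeds `threshold`."""
    sessions = []
    for t in sorted(open_times):
        if sessions and t - sessions[-1][-1] <= threshold:
            sessions[-1].append(t)   # close enough: same session
        else:
            sessions.append([t])     # gap too large: start a new session
    return sessions

# Hypothetical opening times for one source host.
times = [0, 2, 3, 40, 42, 100]

for threshold in (1, 10, 50):
    print(threshold, len(split_sessions(times, threshold)))
# 1 5
# 10 3
# 50 2
```

The same trace yields five, three, or two sessions depending only on the threshold, which is the sensitivity that motivates a threshold-free clustering approach.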
Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks
Networks are a general language for representing relational information among
objects. An effective way to model, reason about, and summarize networks is to
discover sets of nodes with common connectivity patterns. Such sets are
commonly referred to as network communities. Research on network community
detection has predominantly focused on identifying communities of densely
connected nodes in undirected networks.
In this paper we develop a novel overlapping community detection method that
scales to networks of millions of nodes and edges and advances research along
two dimensions: the connectivity structure of communities, and the use of edge
directedness for community detection. First, we extend traditional definitions
of network communities by building on the observation that nodes can be densely
interlinked in two different ways: In cohesive communities nodes link to each
other, while in 2-mode communities nodes link in a bipartite fashion, where
links predominate between the two partitions rather than inside them. Our
method successfully detects both 2-mode and cohesive communities, which
may also overlap or be hierarchically nested. Second, while most existing
community detection methods treat directed edges as though they were
undirected, our method accounts for edge directions and is able to identify
novel and meaningful community structures in both directed and undirected
networks, using data from social, biological, and ecological domains. Comment: Published in the proceedings of WSDM '1