Survey of Document Clustering Approach for Real World Objects (Documents)
Because the amount of text data stored in computer repositories grows every day, a reliable way to group or classify text documents is needed more than ever. Clustering introduces a form of organization to the data and can also highlight significant patterns and trends. Document clustering is used in many fields, such as data mining and information retrieval. This thesis presents the results of an experimental study of some common document clustering techniques. In particular, we compare two main approaches to document clustering: the agglomerative hierarchical clustering algorithm BIRCH and the partitional clustering algorithm K-means. By comparing the two algorithms, we attempt to establish an appropriate technique for producing high-quality clusterings of real-world documents.
DOI: 10.17762/ijritcc2321-8169.15080
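As a rough illustration of the partitional side of the comparison described in the abstract above, the following is a minimal K-means sketch over length-normalised term-frequency vectors. It is not the thesis's experimental setup: the toy documents, the whitespace tokenisation, the farthest-point initialisation, and the helper names `tf_vector`, `sq_dist`, and `kmeans` are all assumptions made for this example.

```python
from collections import Counter
import math

def tf_vector(text, vocab):
    """Length-normalised term-frequency vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    vec = [counts[w] for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    # Deterministic initialisation (an assumption of this sketch): start from
    # the first vector, then repeatedly add the vector farthest from the
    # centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        centroids.append(
            max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        )
    labels = [0] * len(vectors)
    for _ in range(iters):
        # Assignment step: each document goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: sq_dist(v, centroids[j])) for v in vectors
        ]
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical toy corpus with two obvious topics.
docs = [
    "stock market prices fall",
    "market traders sell stock",
    "football team wins match",
    "team scores in football match",
]
vocab = sorted({w for d in docs for w in d.lower().split()})
vectors = [tf_vector(d, vocab) for d in docs]
labels = kmeans(vectors, k=2)
```

On this toy corpus the two finance documents share one label and the two football documents share the other. A hierarchical method such as BIRCH would instead build a tree of cluster summaries incrementally, which is the contrast the study examines.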
On the Role of Social Identity and Cohesion in Characterizing Online Social Communities
Two prevailing theories for explaining social group or community structure
are cohesion and identity. The social cohesion approach posits that social
groups arise out of an aggregation of individuals that have mutual
interpersonal attraction as they share common characteristics. These
characteristics can range from common interests to kinship ties and from social
values to ethnic backgrounds. In contrast, the social identity approach posits
that an individual is likely to join a group based on an intrinsic
self-evaluation at a cognitive or perceptual level. In other words, group
members typically share an awareness of a common category membership.
In this work we seek to understand the role of these two contrasting theories
in explaining the behavior and stability of social communities in Twitter. A
specific focal point of our work is to understand the role of these theories in
disparate contexts ranging from disaster response to socio-political activism.
We extract social identity and social cohesion features-of-interest for large
scale datasets of five real-world events and examine the effectiveness of such
features in capturing behavioral characteristics and the stability of groups.
We also propose a novel measure of social group sustainability based on the
divergence in group discussion. Our main findings are: 1) Sharing of social
identities (especially physical location) among group members has a positive
impact on group sustainability, 2) Structural cohesion (represented by high
group density and low average shortest path length) is a strong indicator of
group sustainability, and 3) Event characteristics play a role in shaping group
sustainability, as social groups in transient events behave differently from
groups in events that last longer.
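The two structural cohesion indicators named in the findings above, group density and average shortest path length, can be computed for a small undirected graph as sketched below. This is only an illustration under stated assumptions: the adjacency-dictionary representation, the function names `density` and `avg_shortest_path`, and the example graphs are hypothetical and not taken from the paper's datasets.

```python
from collections import deque

def density(adj):
    """Fraction of possible undirected edges that are present."""
    n = len(adj)
    m = sum(len(neighbours) for neighbours in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

def avg_shortest_path(adj):
    """Mean hop distance over all ordered pairs of mutually reachable nodes."""
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search gives hop distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else float("inf")

# Hypothetical groups: a tightly knit triangle and a loose chain.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}

triangle_stats = (density(triangle), avg_shortest_path(triangle))
chain_stats = (density(chain), avg_shortest_path(chain))
```

The triangle has higher density and a shorter average path than the chain, which is the pattern the abstract associates with more sustainable groups.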
Lossless digraph signal processing via polar decomposition
In this paper, we present a signal processing framework for directed graphs.
Unlike in the undirected case, a graph shift operator such as the adjacency matrix
associated with a directed graph usually does not admit an orthogonal
eigenbasis. This makes it challenging to define the Fourier transform. Our
methodology leverages the polar decomposition to define two distinct
eigendecompositions, each associated with different matrices derived from this
decomposition. We propose to extend the frequency domain and introduce a
Fourier transform that jointly encodes the spectral response of a signal for
the two eigenbases from the polar decomposition. This allows us to define
convolution following a standard routine. Our approach has two features: it is
lossless as the shift operator can be fully recovered from factors of the polar
decomposition. Moreover, it subsumes traditional graph signal processing if
the graph is undirected. We present numerical results to show how the framework
can be applied.
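The polar decomposition that the abstract above builds on can be sketched with NumPy as follows. For a real square matrix A, it factors A = U P with U orthogonal and P symmetric positive semidefinite, so each factor has an orthonormal eigenbasis even when A does not. The example graph and the function name `polar_decomposition` are assumptions for illustration; the paper's joint Fourier transform itself is not reproduced here.

```python
import numpy as np

def polar_decomposition(A):
    """Return (U, P) with A = U @ P, U orthogonal, P symmetric PSD."""
    # Singular value decomposition: A = W @ diag(s) @ Vh.
    W, s, Vh = np.linalg.svd(A)
    U = W @ Vh                      # orthogonal factor
    P = Vh.T @ np.diag(s) @ Vh      # symmetric PSD factor, equal to sqrt(A.T @ A)
    return U, P

# Hypothetical directed graph on three nodes (adjacency matrix is not symmetric).
A = np.array([[0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0],
              [1.0, 0.0, 0.0]])

U, P = polar_decomposition(A)

# "Lossless" in the abstract's sense: the shift operator is recovered from the factors.
reconstructed = U @ P
```

Because `U` is orthogonal (hence normal) and `P` is symmetric, both admit orthonormal eigenbases, which is what makes them usable for defining a Fourier transform on the directed graph.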
Detecting anomalies in heterogeneous population-scale VAT networks
In network science, anomaly detection is the task of identifying aberrant
edges, nodes, subgraphs or other network events. Heterogeneous networks
typically contain information going beyond the observed network itself. Value
Added Tax (VAT, a tax on goods and services) networks, defined from pairwise
interactions of VAT registered taxpayers, are analysed at a population-scale
requiring scalable algorithms. By adopting a quantitative understanding of the
nature of VAT-anomalies, we define a method that identifies them utilising
information from micro-scale, meso-scale and global-scale patterns that can be
interpreted, and efficiently implemented, as population-scale network analysis.
The proposed method is automatable, and implementable in real time, enabling
revenue authorities to prevent large losses of tax revenues through performing
early identification of fraud within the VAT system. Comment: 14 pages, 5 figures, 3 tables
Towards Specificationless Monitoring of Provenance-Emitting Systems
Monitoring often requires insight into the monitored system as well as concrete specifications of expected behavior. More and more systems, however, provide information about their inner procedures by emitting provenance information in a W3C-standardized graph format.
In this work, we present an approach to monitor such provenance data for anomalous behavior by performing spectral graph analysis on slices of the constructed provenance graph and by comparing the characteristics of each slice with those of a sliding window over recently seen slices. We argue that this approach not only simplifies the monitoring of heterogeneous distributed systems, but also enables applying a host of well-studied techniques to monitor such systems.
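The general idea described above, comparing a spectral characteristic of each graph slice against a sliding window of recent slices, might look like the sketch below. Every specific choice here is an assumption for illustration and not the authors' method: the characteristic (largest Laplacian eigenvalue of the symmetrised slice), the relative-deviation rule, the parameters `window` and `rel_tol`, and the function names.

```python
from collections import deque
import numpy as np

def spectral_feature(adj):
    """Largest eigenvalue of the Laplacian of the symmetrised slice."""
    sym = np.maximum(adj, adj.T)               # ignore edge direction (assumption)
    lap = np.diag(sym.sum(axis=1)) - sym
    return float(np.linalg.eigvalsh(lap)[-1])  # eigvalsh sorts ascending

def flag_anomalies(slices, window=3, rel_tol=0.5):
    """Flag slices whose feature deviates strongly from the recent window mean."""
    recent = deque(maxlen=window)
    flags = []
    for adj in slices:
        x = spectral_feature(adj)
        if len(recent) == window:
            mean = sum(recent) / len(recent)
            flags.append(bool(abs(x - mean) > rel_tol * max(abs(mean), 1e-12)))
        else:
            flags.append(False)                # not enough history yet
        recent.append(x)
    return flags

def path_graph(n):
    """Adjacency matrix of an undirected path on n nodes."""
    adj = np.zeros((n, n))
    for i in range(n - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1.0
    return adj

def complete_graph(n):
    """Adjacency matrix of the complete graph on n nodes."""
    return np.ones((n, n)) - np.eye(n)

# Hypothetical stream: sparse chain-like slices with one dense outlier.
slices = [path_graph(4)] * 4 + [complete_graph(6)] + [path_graph(4)]
flags = flag_anomalies(slices)
```

Only the dense slice is flagged in this toy stream. No specification of expected behaviour is needed, since the baseline comes from the recently seen slices themselves.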