27 research outputs found

    Survey of Document Clustering Approach for Real World Objects (Documents)

    Get PDF
    Since the amount of text data stored in computer repositories is growing every day, we need more than ever a reliable way to assemble or classify text documents. Clustering can provide a means of introducing some form of organization to the data, which can also serve to highlight significant patterns and trends. Document clustering is used in many fields such as data mining and information retrieval. This thesis presents the results of an experimental study of some common document clustering techniques. In particular, we compare the two main approaches of document clustering, agglomerative hierarchical clustering BIRCH and Partitional clustering algorithm K-means. As a result of comparing both algorithms we attempt to establish appropriate clustering technique to generate qualitative clustering of real world document. DOI: 10.17762/ijritcc2321-8169.15080

    On the Role of Social Identity and Cohesion in Characterizing Online Social Communities

    Get PDF
    Two prevailing theories for explaining social group or community structure are cohesion and identity. The social cohesion approach posits that social groups arise out of an aggregation of individuals that have mutual interpersonal attraction as they share common characteristics. These characteristics can range from common interests to kinship ties and from social values to ethnic backgrounds. In contrast, the social identity approach posits that an individual is likely to join a group based on an intrinsic self-evaluation at a cognitive or perceptual level. In other words group members typically share an awareness of a common category membership. In this work we seek to understand the role of these two contrasting theories in explaining the behavior and stability of social communities in Twitter. A specific focal point of our work is to understand the role of these theories in disparate contexts ranging from disaster response to socio-political activism. We extract social identity and social cohesion features-of-interest for large scale datasets of five real-world events and examine the effectiveness of such features in capturing behavioral characteristics and the stability of groups. We also propose a novel measure of social group sustainability based on the divergence in group discussion. Our main findings are: 1) Sharing of social identities (especially physical location) among group members has a positive impact on group sustainability, 2) Structural cohesion (represented by high group density and low average shortest path length) is a strong indicator of group sustainability, and 3) Event characteristics play a role in shaping group sustainability, as social groups in transient events behave differently from groups in events that last longer

    Lossless digraph signal processing via polar decomposition

    Full text link
    In this paper, we present a signal processing framework for directed graphs. Unlike undirected graphs, a graph shift operator such as the adjacency matrix associated with a directed graph usually does not admit an orthogonal eigenbasis. This makes it challenging to define the Fourier transform. Our methodology leverages the polar decomposition to define two distinct eigendecompositions, each associated with different matrices derived from this decomposition. We propose to extend the frequency domain and introduce a Fourier transform that jointly encodes the spectral response of a signal for the two eigenbases from the polar decomposition. This allows us to define convolution following a standard routine. Our approach has two features: it is lossless as the shift operator can be fully recovered from factors of the polar decomposition. Moreover, it subsumes the traditional graph signal processing if the graph is directed. We present numerical results to show how the framework can be applied

    Detecting anomalies in heterogeneous population-scale VAT networks

    Full text link
    Anomaly detection in network science is the method to determine aberrant edges, nodes, subgraphs or other network events. Heterogeneous networks typically contain information going beyond the observed network itself. Value Added Tax (VAT, a tax on goods and services) networks, defined from pairwise interactions of VAT registered taxpayers, are analysed at a population-scale requiring scalable algorithms. By adopting a quantitative understanding of the nature of VAT-anomalies, we define a method that identifies them utilising information from micro-scale, meso-scale and global-scale patterns that can be interpreted, and efficiently implemented, as population-scale network analysis. The proposed method is automatable, and implementable in real time, enabling revenue authorities to prevent large losses of tax revenues through performing early identification of fraud within the VAT system.Comment: 14 pages, 5 figures, 3 table

    Towards Specificationless Monitoring of Provenance-Emitting Systems

    Get PDF
    Monitoring often requires insight into the monitored system as well as concrete specifications of expected behavior. More and more systems, however, provide information about their inner procedures by emitting provenance information in a W3C-standardized graph format. In this work, we present an approach to monitor such provenance data for anomalous behavior by performing spectral graph analysis on slices of the constructed provenance graph and by comparing the characteristics of each slice with those of a sliding window over recently seen slices. We argue that this approach not only simplifies the monitoring of heterogeneous distributed systems, but also enables applying a host of well-studied techniques to monitor such systems
    corecore