Retrieving Top-N Weighted Spatial k-cliques
Spatial data analysis is a classic yet important topic because of its wide range of applications. Recently, a neighbor graph of a set P of spatial points has often been employed as a spatial data analysis approach. This paper also considers a spatial neighbor graph and addresses a new problem, namely top-N weighted spatial k-clique retrieval. This problem searches for the N minimum-weight cliques consisting of k points in P, and it has important applications, such as community detection and co-location pattern mining. Recent spatial datasets have many points, and dealing with such big datasets efficiently is one of the main requirements of applications. A straightforward approach to solving our problem is to enumerate all k-cliques, which incurs O(n^k k^2) time. Since k ≥ 3, this approach cannot meet that requirement, so the result must be computed without enumerating unnecessary k-cliques. This paper achieves this challenging task and proposes a simple, practically efficient algorithm that returns the exact answer. We conduct experiments using two real million-scale spatial datasets, and the results show the efficiency of our algorithm, e.g., it can return the exact top-N result within 1 second when N ≤ 1000 and k ≤ 7.
Taniguchi R., Amagata D., Hara T. Retrieving Top-N Weighted Spatial k-cliques. Proceedings - 2022 IEEE International Conference on Big Data, Big Data 2022, 4952 (2022); https://doi.org/10.1109/BigData55660.2022.10021071
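The straightforward enumeration baseline mentioned in the abstract can be sketched as follows. This is only an illustrative sketch, not the paper's algorithm. It assumes a neighbor graph in which two points are adjacent when their Euclidean distance is at most a hypothetical `radius` parameter, and it assumes the weight of a clique is the sum of its edge lengths. The function name `top_n_k_cliques` and its parameters are made up for illustration.

```python
import heapq
import math
from itertools import combinations


def top_n_k_cliques(points, k, n_top, radius):
    """Brute-force baseline: test every k-subset and keep the n_top lightest cliques.

    Assumptions (not from the source): points are adjacent when their
    Euclidean distance is <= radius, and a clique's weight is the sum
    of its pairwise edge lengths.
    """
    scored = []
    # There are C(n, k) subsets, and each needs O(k^2) pairwise checks,
    # which is where the O(n^k k^2) cost comes from.
    for subset in combinations(range(len(points)), k):
        lengths = [
            math.dist(points[a], points[b])
            for a, b in combinations(subset, 2)
        ]
        if all(length <= radius for length in lengths):
            scored.append((sum(lengths), subset))
    # Return the n_top cliques with the smallest total weight.
    return heapq.nsmallest(n_top, scored)
```

Even for moderate k, the number of subsets grows too quickly for million-scale inputs, which is why the abstract calls for avoiding the enumeration of unnecessary k-cliques.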
Analysis of category co-occurrence in Wikipedia networks
Wikipedia has seen a huge expansion of content since its inception. Pages within this online
encyclopedia are organised by assigning them to one or more categories, where Wikipedia
maintains a manually constructed taxonomy graph that encodes the semantic relationship
between these categories. An alternative, called the category co-occurrence graph, can be
produced automatically by linking together categories that have pages in common. Properties
of the latter graph and its relationship to the former are the concern of this thesis.
The analytic framework, called t-component, is introduced to formalise the graphs and
discover category clusters connecting relevant categories together. The m-core, a cohesive
subgroup concept as a clustering model, is used to construct a subgraph depending on the
number of shared pages between the categories exceeding a given threshold t. The significance
of the clustering result of the m-core is validated using a permutation test. This is compared
to the k-core, another clustering model.
The Wikipedia category co-occurrence graphs are scale-free with a few category hubs, and
the majority of clusters are of size 2. All observed properties for the distribution of the largest
clusters of the category graphs obey power-laws with decay exponent averages around 1.
As the threshold t of the number of shared pages is increased, eventually a critical threshold
is reached when the largest cluster shrinks significantly in size. This phenomenon is only
exhibited for the m-core but not the k-core. Lastly, the clustering in the category graph
is shown to be consistent with the distance between categories in the taxonomy graph.
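The construction of a category co-occurrence graph can be sketched as below. This is a minimal sketch under stated assumptions, not the thesis's m-core or t-component procedure: it links two categories when the number of pages they share exceeds a threshold `t`, then reports connected components as clusters. The function name `cooccurrence_clusters` and the input layout (a mapping from page to its categories) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_clusters(page_categories, t):
    """Link categories sharing more than t pages; return connected components.

    page_categories: dict mapping a page name to an iterable of category names
    (a hypothetical input layout chosen for illustration).
    """
    # Count how many pages each pair of categories has in common.
    shared = defaultdict(int)
    for categories in page_categories.values():
        for a, b in combinations(sorted(set(categories)), 2):
            shared[(a, b)] += 1

    # Keep only edges whose shared-page count exceeds the threshold.
    adjacency = defaultdict(set)
    for (a, b), count in shared.items():
        if count > t:
            adjacency[a].add(b)
            adjacency[b].add(a)

    # Collect connected components with an iterative depth-first search.
    seen = set()
    clusters = []
    for start in adjacency:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        clusters.append(component)
    return clusters
```

Raising `t` removes weakly supported edges, so clusters can only split or disappear as the threshold grows.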
Advances in knowledge discovery and data mining Part II
19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II
Combating Threats to the Quality of Information in Social Systems
Many large-scale social systems such as Web-based social networks, online social media sites and Web-scale crowdsourcing systems have been growing rapidly, enabling millions of human participants to generate, share and consume content on a massive scale. This reliance on users can lead to many positive effects, including large-scale growth in the size and content in the community, bottom-up discovery of “citizen-experts”, serendipitous discovery of new resources beyond the scope of the system designers, and new social-based information search and retrieval algorithms. But the relative openness and reliance on users coupled with the widespread interest and growth of these social systems carries risks and raises growing concerns over the quality of information in these systems.
In this dissertation research, we focus on countering threats to the quality of information in self-managing social systems. Concretely, we identify three classes of threats to these systems: (i) content pollution by social spammers, (ii) coordinated campaigns for strategic manipulation, and (iii) threats to collective attention. To combat these threats, we propose three inter-related methods for detecting evidence of these threats, mitigating their impact, and improving the quality of information in social systems. We augment this three-fold defense with an exploration of their origins in “crowdturfing”, a sinister counterpart to the enormous positive opportunities of crowdsourcing. In particular, this dissertation research makes four unique contributions:
• The first contribution of this dissertation research is a framework for detecting and filtering social spammers and content polluters in social systems. To detect and filter individual social spammers and content polluters, we propose and evaluate a novel social honeypot-based approach.
• Second, we present a set of methods and algorithms for detecting coordinated campaigns in large-scale social systems. We propose and evaluate a content-driven framework for effectively linking free text posts with common “talking points” and extracting campaigns from large-scale social systems.
• Third, we present a dual study of the robustness of social systems to collective attention threats through both a data-driven modeling approach and deployment over a real system trace. We evaluate the effectiveness of countermeasures deployed based on the first moments of a bursting phenomenon in a real system.
• Finally, we study the underlying ecosystem of crowdturfing for engaging in each of the three threat types. We present a framework for “pulling back the curtain” on crowdturfers to reveal their underlying ecosystem on both crowdsourcing sites and social media
- …