On Graph Stream Clustering with Side Information
Graph clustering has become an important problem due to emerging applications
involving the web, social networks and bio-informatics. Recently, many such
applications generate data in the form of streams. Clustering massive, dynamic
graph streams is significantly challenging because of the complex structures of
graphs and computational difficulties of continuous data. Meanwhile, a large
volume of side information is associated with graphs, which can be of various
types. Examples include the properties of users in social network
activities, the meta attributes associated with web click graph streams and the
location information in mobile communication networks. Such attributes contain
extremely useful information and have the potential to improve the clustering
process, but are neglected by most recent graph stream mining techniques. In
this paper, we define a unified distance measure on both link structures and
side attributes for clustering. In addition, we propose a novel optimization
framework DMO, which can dynamically optimize the distance metric and make it
adapt to the newly received stream data. We further introduce a carefully
designed statistic, SGS(C), which consumes constant storage space as the
stream progresses. We demonstrate that the maintained statistics are
sufficient for both the clustering process and the distance optimization, and
that they scale to massive graphs with side attributes. We present
experimental results showing the advantages of the approach over the baselines
in graph stream clustering with both links and side information.
Comment: Full version of SIAM SDM 2013 paper
Cores and Other Dense Structures in Complex Networks
Complex networks are a powerful paradigm to model complex systems. Specific
network models, e.g., multilayer networks, temporal networks, and signed
networks, enrich the standard network representation with additional
information to better capture real-world phenomena. Despite the keen interest
in a variety of problems, algorithms, and analysis methods for these types of
network, the problem of extracting cores and dense structures still has
unexplored facets. In this work, we present advancements to the state of the
art by the introduction of novel definitions and algorithms for the extraction
of dense structures from complex networks, mainly cores. First, we define
core decomposition in multilayer networks together with a series of
applications built on top of it, i.e., the extraction of maximal multilayer
cores only, densest subgraph in multilayer networks, the speed-up of the
extraction of frequent cross-graph quasi-cliques, and the generalization of
community search to the multilayer setting. Then, we introduce the concept of
core decomposition in temporal networks; also in this case, we are interested
in the extraction of maximal temporal cores only. Finally, in the context of
discovering polarization in large-scale online data, we study the problem of
identifying polarized communities in signed networks. The proposed
methodologies are evaluated on a large variety of real-world networks against
naïve approaches, non-trivial baselines, and competing methods. In all
cases, they show effectiveness, efficiency, and scalability. Moreover, we
showcase the usefulness of our definitions in concrete applications and case
studies, i.e., the temporal analysis of contact networks, and the
identification of polarization in debate networks.
Comment: arXiv admin note: text overlap with arXiv:1812.0871
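As background for the abstract above, the classic core decomposition of an ordinary single-layer graph can be sketched in a few lines of Python. This is only the standard peeling idea, not the multilayer or temporal definitions that the work introduces; the function name `core_numbers` and the adjacency-dictionary input format are assumptions made for illustration, and the simple minimum search favours readability over the linear-time bucket implementation.

```python
def core_numbers(adj):
    """Return the core number of every node of an undirected simple graph.

    adj maps each node to the set of its neighbours (assumed symmetric).
    Peeling: repeatedly remove a node of minimum remaining degree; the
    largest minimum degree seen so far is the core number of that node.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        # Pick a node with the smallest degree among those not yet removed.
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        # Removing v lowers the degree of its remaining neighbours.
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# A triangle a-b-c with a pendant node d attached to a.
example = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cores = core_numbers(example)
```

In this example the triangle forms the 2-core, while the pendant node belongs only to the 1-core.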
A Survey on Index Support for Item Set Mining
It is very difficult to handle the huge amount of information stored in modern databases. Association rule mining is currently used to manage these databases, but it is a costly process that requires a significant amount of time and memory. Therefore, it is necessary to develop an approach that overcomes these difficulties. Suitable data structures and algorithms must be developed to perform item set mining effectively. An index includes all the characteristics potentially needed during the mining task, so the extraction can be executed with the help of the index, without accessing the database. A database index is a data structure that speeds up information retrieval operations on a database table at very low cost, in exchange for increased storage space. Using the index permits user interaction, in which the user can specify different attributes for item set extraction. The index also supports reuse: item sets can be mined from the same index with any support threshold. This paper surveys the index-support approaches for item set mining proposed by various authors.
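The idea of mining from an index rather than from the database can be illustrated with a small Python sketch. It assumes a simple vertical index (each item mapped to the set of transaction identifiers that contain it) and a brute-force candidate enumeration; neither is taken from any of the surveyed papers, and the names `build_index`, `support`, and `frequent_itemsets` are hypothetical.

```python
from itertools import combinations


def build_index(transactions):
    """Scan the database once and map each item to its transaction ids."""
    index = {}
    for tid, items in enumerate(transactions):
        for item in items:
            index.setdefault(item, set()).add(tid)
    return index


def support(index, itemset):
    """Count the transactions containing every item, using only the index."""
    tid_sets = [index.get(item, set()) for item in itemset]
    if not tid_sets:
        return 0
    return len(set.intersection(*tid_sets))


def frequent_itemsets(index, min_support, max_size=2):
    """Enumerate item sets up to max_size whose support meets the threshold."""
    items = sorted(index)
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            count = support(index, candidate)
            if count >= min_support:
                result[candidate] = count
    return result


transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
index = build_index(transactions)
# The same index answers queries for different support thresholds.
loose = frequent_itemsets(index, min_support=2)
strict = frequent_itemsets(index, min_support=3)
```

After `build_index` has run, both queries are answered from the index alone, which mirrors the reuse property described in the abstract.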
Scaling Up Network Analysis and Mining: Statistical Sampling, Estimation, and Pattern Discovery
Network analysis and graph mining play a prominent role in providing insights and studying phenomena across various domains, including social, behavioral, biological, transportation, communication, and financial domains. Across all these domains, networks arise as a natural and rich representation for data. Studying these real-world networks is crucial for solving numerous problems that lead to high-impact applications. Examples include identifying the behavior and interests of users in online social networks (e.g., viral marketing), monitoring and detecting virus outbreaks in human contact networks, predicting protein functions in biological networks, and detecting anomalous behavior in computer networks. A key characteristic of these networks is that their complex structure is massive and continuously evolving over time, which makes it challenging and computationally intensive to analyze, query, and model these networks in their entirety. In this dissertation, we propose sampling as well as fast, efficient, and scalable methods for network analysis and mining in both static and streaming graphs.
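To make the notion of sampling from a streaming graph concrete, here is a minimal Python sketch of uniform reservoir sampling over an edge stream. This is the generic textbook technique (Algorithm R), offered as an assumption about what a simple baseline could look like; it is not claimed to be one of the methods proposed in the dissertation, and the function name `reservoir_sample_edges` is hypothetical.

```python
import random


def reservoir_sample_edges(edge_stream, k, seed=None):
    """Keep a uniform random sample of at most k edges from a stream.

    Memory use is O(k) regardless of how many edges arrive.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, edge in enumerate(edge_stream):
        if i < k:
            # Fill the reservoir with the first k edges.
            reservoir.append(edge)
        else:
            # Edge i replaces a stored edge with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = edge
    return reservoir


# A synthetic stream of 1,000 edges, consumed lazily by the sampler.
stream = ((u, u + 1) for u in range(1000))
sample = reservoir_sample_edges(stream, k=10, seed=42)
```

The sampler never stores more than `k` edges, so it can be applied to streams that are too large to hold in memory.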
On Optimally Partitioning Variable-Byte Codes
The ubiquitous Variable-Byte encoding is one of the fastest compressed
representations for integer sequences. However, its compression ratio is usually
not competitive with other more sophisticated encoders, especially when the
integers to be compressed are small, which is the typical case for inverted
indexes. This paper shows that the compression ratio of Variable-Byte can be
improved by 2x by adopting a partitioned representation of the inverted lists.
This makes Variable-Byte surprisingly competitive in space with the best
bit-aligned encoders, hence disproving the folklore belief that Variable-Byte
is space-inefficient for inverted index compression. Despite the significant
space savings, we show that our optimization almost comes for free, given that:
we introduce an optimal partitioning algorithm that does not affect indexing
time because of its linear-time complexity; we show that the query processing
speed of Variable-Byte is preserved, with an extensive experimental analysis
and comparison with several other state-of-the-art encoders.
Comment: Published in IEEE Transactions on Knowledge and Data Engineering
(TKDE), 15 April 201
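For readers unfamiliar with the encoding discussed in this abstract, a minimal Python sketch of plain Variable-Byte coding applied to the gaps of a sorted inverted list is given here. It does not implement the partitioned representation that the paper proposes. The function names, the low-bits-first byte order, and the convention of setting the high bit on the final byte of each integer are assumptions chosen for illustration.

```python
def vbyte_encode(numbers):
    """Encode non-negative integers using 7 payload bits per byte.

    Assumed convention: the high bit is set on the LAST byte of each integer.
    """
    out = bytearray()
    for n in numbers:
        if n < 0:
            raise ValueError("Variable-Byte encodes non-negative integers only")
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, continuation byte
            n >>= 7
        out.append(n | 0x80)      # final byte, terminator bit set
    return bytes(out)


def vbyte_decode(data):
    """Invert vbyte_encode."""
    numbers = []
    value = 0
    shift = 0
    for byte in data:
        if byte & 0x80:
            value |= (byte & 0x7F) << shift
            numbers.append(value)
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return numbers


def to_gaps(postings):
    """Replace each sorted document id by its difference from the previous one."""
    gaps = []
    previous = 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild the document ids by prefix-summing the gaps."""
    postings = []
    total = 0
    for gap in gaps:
        total += gap
        postings.append(total)
    return postings


postings = [3, 130, 131, 1000]
encoded = vbyte_encode(to_gaps(postings))
decoded = from_gaps(vbyte_decode(encoded))
```

Every gap below 128 fits in a single byte, which is why small gaps are the favourable case for a byte-aligned code, while the byte is also the minimum cost per integer.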