FLEET: Butterfly Estimation from a Bipartite Graph Stream
We consider space-efficient single-pass estimation of the number of
butterflies, a fundamental bipartite graph motif, from a massive bipartite
graph stream where each edge represents a connection between entities in two
different partitions. We present a space lower bound for any streaming
algorithm that can estimate the number of butterflies accurately, as well as
FLEET, a suite of algorithms for accurately estimating the number of
butterflies in the graph stream. Estimates returned by the algorithms come with
provable guarantees on the approximation error, and experiments show good
tradeoffs between the space used and the accuracy of approximation. We also
present space-efficient algorithms for estimating the number of butterflies
within a sliding window of the most recent elements in the stream. While there
is a significant body of work on counting subgraphs such as triangles in a
unipartite graph stream, our work seems to be one of the few to tackle the case
of bipartite graph streams.
Comment: This is the author's version of the work. It is posted here by
permission of ACM for your personal use. Not for redistribution. The
definitive version was published in Seyed-Vahid Sanei-Mehri, Yu Zhang, Ahmet
Erdem Sariyuce and Srikanta Tirthapura. "FLEET: Butterfly Estimation from a
Bipartite Graph Stream". The 28th ACM International Conference on Information
and Knowledge Managemen
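To make the butterfly motif from the FLEET abstract concrete, the sketch below counts butterflies (2x2 bicliques) exactly on a small in-memory bipartite graph. It is not the FLEET streaming estimator; the function name `count_butterflies` and the edge format `(left, right)` are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations
from math import comb


def count_butterflies(edges):
    """Exact butterfly count for a bipartite graph given as (left, right) pairs.

    Illustrative baseline only: it stores the whole graph, unlike a
    space-efficient streaming estimator.
    """
    # Right-side neighbours of each left vertex (duplicate edges collapse).
    neighbors = defaultdict(set)
    for left, right in edges:
        neighbors[left].add(right)

    # For each pair of right vertices, count the left vertices adjacent to both.
    common = defaultdict(int)
    for right_set in neighbors.values():
        for a, b in combinations(sorted(right_set), 2):
            common[(a, b)] += 1

    # Any two shared left vertices plus the right pair form one butterfly.
    return sum(comb(c, 2) for c in common.values())
```

A complete bipartite graph with two vertices on each side contains exactly one butterfly, which gives a quick sanity check for the function.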
Extractive text summarisation using graph triangle counting approach: proposed method
With a growing quantity of automated text data, the construction of summarisation systems has become vital. Summarisation systems capture and condense the most important ideas of a document and help the user find and understand the main facts of the text more quickly and easily. An important class of such systems produces summaries made of extracts. This type of summary, called extractive summarisation, is created by choosing large significant fragments of the text without making any amendment to the original. One methodology for generating this type of summary uses graph theory. One field of graph theory, graph pruning or reduction, aims to find the best representation of the main graph with a smaller number of nodes and edges. In this paper, a graph reduction technique called the triangle counting approach is presented to choose the most vital sentences of the text. The first phase represents the text as a graph, where nodes are the sentences and edges are the similarities between the sentences. The second phase constructs the triangles, followed by a bit vector representation, and the final phase retrieves the sentences based on the values of the bit vector.
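The phases listed in the abstract above can be sketched roughly as follows. The abstract does not give the similarity measure, the edge threshold, or the exact bit vector rule, so everything concrete here is an assumption: word-level Jaccard similarity, a threshold of 0.3, and a bit set to 1 when a sentence belongs to at least one triangle. The function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Word-overlap similarity between two sentences (assumed measure)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def triangle_bit_vector(sentences, threshold=0.3):
    """Phases 1-2: build the sentence graph, find triangles, mark members."""
    n = len(sentences)
    adj = {i: set() for i in range(n)}
    for i, j in combinations(range(n), 2):
        if jaccard(sentences[i], sentences[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)

    bits = [0] * n
    for i, j, k in combinations(range(n), 3):
        if j in adj[i] and k in adj[i] and k in adj[j]:
            bits[i] = bits[j] = bits[k] = 1
    return bits


def summarise(sentences, threshold=0.3):
    """Final phase: keep the sentences whose bit is set, in original order."""
    bits = triangle_bit_vector(sentences, threshold)
    return [s for s, b in zip(sentences, bits) if b]
```

The triple loop is cubic in the number of sentences, which is acceptable for a single document but would need a neighbour-intersection approach for larger inputs.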
New probabilistic interest measures for association rules
Mining association rules is an important technique for discovering meaningful
patterns in transaction databases. Many different measures of interestingness
have been proposed for association rules. However, these measures fail to take
the probabilistic properties of the mined data into account. In this paper, we
start with presenting a simple probabilistic framework for transaction data
which can be used to simulate transaction data when no associations are
present. We use such data and a real-world database from a grocery outlet to
explore the behavior of confidence and lift, two popular interest measures used
for rule mining. The results show that confidence is systematically influenced
by the frequency of the items in the left-hand side of rules and that lift
performs poorly at filtering random noise in transaction data. Based on the
probabilistic framework we develop two new interest measures, hyper-lift and
hyper-confidence, which can be used to filter or order mined association rules.
The new measures show significantly better performance than lift for
applications where spurious rules are problematic
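For readers unfamiliar with the two standard measures the abstract evaluates, here is a minimal sketch of support, confidence, and lift for a rule lhs -> rhs over a list of transactions. It does not implement hyper-lift or hyper-confidence; the function name `rule_measures` and the set-based transaction format are assumptions for illustration.

```python
def rule_measures(transactions, lhs, rhs):
    """Return (support, confidence, lift) for the rule lhs -> rhs.

    transactions: iterable of item collections; lhs, rhs: item collections.
    """
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    n_lhs = sum(1 for t in baskets if lhs <= t)
    n_rhs = sum(1 for t in baskets if rhs <= t)
    n_both = sum(1 for t in baskets if (lhs | rhs) <= t)
    if n_lhs == 0 or n_rhs == 0:
        raise ValueError("lhs and rhs must each occur in at least one transaction")

    support = n_both / n               # P(lhs and rhs)
    confidence = n_both / n_lhs        # P(rhs | lhs)
    lift = confidence / (n_rhs / n)    # P(rhs | lhs) / P(rhs)
    return support, confidence, lift
```

A lift close to 1 indicates that the two sides co-occur about as often as expected under independence, which is why the measure is commonly used as a first filter on mined rules.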
Curvature of Co-Links Uncovers Hidden Thematic Layers in the World Wide Web
Beyond the information stored in pages of the World Wide Web, novel types of
"meta-information" are created when they connect to each other. This
information is a collective effect of independent users writing and linking
pages, hidden from the casual user. Accessing it and understanding the
inter-relation of connectivity and content in the WWW is a challenging problem.
We demonstrate here how thematic relationships can be located precisely by
looking only at the graph of hyperlinks, gleaning content and context from the
Web without having to read what is in the pages. We begin by noting that
reciprocal links (co-links) between pages signal a mutual recognition of
authors, and then focus on triangles containing such links, since triangles
indicate a transitive relation. The importance of triangles is quantified by
the clustering coefficient (Watts) which we interpret as a curvature
(Gromov, Bridson-Haefliger). This defines a Web-landscape whose connected
regions of high curvature characterize a common topic. We show experimentally
that reciprocity and curvature, when combined, accurately capture this
meta-information for a wide variety of topics. As an example of future
directions we analyze the neural network of C. elegans (White, Wood), using the
same methods.
Comment: 8 pages, 5 figures, expanded version of earlier submission with more
example
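The two ingredients named in the co-links abstract, reciprocal links and the clustering coefficient, can be illustrated with a short sketch. It keeps only reciprocal hyperlinks and then computes the local clustering coefficient of a page in the resulting undirected graph. The function names and the `(source, target)` edge format are assumptions, and the paper's full curvature-based analysis is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def colink_graph(directed_links):
    """Keep only reciprocal links (u -> v and v -> u) as undirected edges."""
    links = set(directed_links)
    adj = defaultdict(set)
    for u, v in links:
        if u != v and (v, u) in links:
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are themselves connected."""
    nbrs = adj.get(node, set())
    k = len(nbrs)
    if k < 2:
        return 0.0
    closed = sum(1 for a, b in combinations(nbrs, 2) if b in adj.get(a, set()))
    return closed / (k * (k - 1) / 2)
```

A page whose co-linked neighbours are all co-linked to each other scores 1.0, while a page with fewer than two co-links scores 0.0 by convention in this sketch.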
A Study on Privacy Preserving Data Publishing With Differential Privacy
In the era of digitization, it is important to preserve the privacy of the sensitive information around us, e.g., personal information, private user information held by social communication and video streaming sites and services, the salary information and structure of an organization, census and statistical data of a country, and so on. These data can be represented in different formats, such as numerical and categorical data, graph data, and tree-structured data. To prevent these data from being illegally exploited and to protect them from privacy threats, an efficient privacy model must be applied to the sensitive data. There have been a great number of studies on privacy-preserving data publishing over the last decades. Differential Privacy (DP) is one of the state-of-the-art methods for preserving the privacy of a database. However, applying DP to high-dimensional tabular data (numerical and categorical) is challenging in terms of required time, memory, and high-frequency computational units. A well-known solution is to reduce the dimension of the given database while keeping its originality and preserving the relations among all of its entities. In this thesis, we propose PrivFuzzy, a simple and flexible differentially private method that can publish differentially private data after reducing their original dimension with the help of fuzzy logic. Exploiting fuzzy mapping, PrivFuzzy can (1) reduce database columns and create a new low-dimensional correlated database, (2) inject noise into each attribute to ensure differential privacy on the newly created low-dimensional database, and (3) sample each entry in the database and release a synthesized database. The existing literature shows the difficulty of applying differential privacy to a high-dimensional dataset, which we overcome by proposing a novel fuzzy-based approach (PrivFuzzy).
By applying our novel fuzzy mapping technique, PrivFuzzy transforms a high-dimensional dataset into an equivalent low-dimensional one without losing any relationship within the dataset. Our experiments with real data and comparison with the existing privacy-preserving models, PrivBayes and PrivGene, show that our proposed approach PrivFuzzy outperforms existing solutions in terms of the strength of privacy preservation, simplicity, and improved utility.
Preserving the privacy of graph-structured data when part of it is made available is still one of the major problems in data privacy. Most existing models have tried to solve this issue with complex solutions that mix up signal and noise, which makes them ineffective in real-time use and practice. One state-of-the-art solution is to apply differential privacy to queries on graph data and its statistics. The challenge here is to reduce the error at the time of publishing the data, as the differential privacy mechanism adds a large amount of noise and introduces erroneous results, which reduces the utility of the data. In this thesis, we propose a novel Expectation Maximization (EM) based differentially private model for graph datasets. By applying the EM method iteratively in conjunction with the Laplace mechanism, our proposed private model applies differentially private noise to the results of several subgraph queries on a graph dataset. In addition, by selecting a maximal noise level, our proposed system can generate noisy results with the expected utility. Comparing with existing models on several subgraph counting queries, we claim that our proposed model generates much less noise than the existing models while achieving the expected utility and still preserving privacy.
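As background for the Laplace mechanism mentioned in the thesis abstract above, the sketch below adds Laplace noise with scale sensitivity / epsilon to a single subgraph count. It is only the textbook mechanism, not the EM-based model the thesis proposes. The function names are hypothetical, and the sensitivity of a given subgraph query is left as a caller-supplied parameter because it depends on the query and the privacy definition.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    # Each exponential has mean `scale`; their difference is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, sensitivity, epsilon, rng=None):
    """Release a count under epsilon-differential privacy (Laplace mechanism).

    sensitivity: assumed upper bound on how much one change to the graph
    can alter the count; the caller must supply a valid bound.
    """
    if epsilon <= 0 or sensitivity <= 0:
        raise ValueError("epsilon and sensitivity must be positive")
    rng = rng or random.Random()
    return true_count + laplace_noise(sensitivity / epsilon, rng)
```

Smaller values of epsilon give stronger privacy but a larger noise scale, which is the utility trade-off the abstract refers to.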
g-FSG Approach for Finding Frequent Sub Graph
Informally, a graph is a set of nodes, pairs of which might be connected by edges. In a wide array of disciplines, data can be intuitively cast into this format. For example, computer networks consist of routers/computers (nodes) and the links (edges) between them. Social networks consist of individuals and their interconnections (which could be business relationships or kinship or trust, etc.). Protein interaction networks link proteins which must work together to perform some particular biological function. Ecological food webs link species with predator-prey relationships. In these and many other fields, graphs are seemingly ubiquitous. The problems of detecting abnormalities (outliers) in a given graph and of generating synthetic but realistic graphs have received considerable attention recently. Both are tightly coupled to the problem of finding the distinguishing characteristics of real-world graphs, that is, the patterns that show up frequently in such graphs and can thus be considered as marks of realism. A good generator will create graphs which match these patterns. In this paper we present gFSG, a computationally efficient algorithm for finding frequent patterns corresponding to geometric subgraphs in a large collection of geometric graphs. gFSG is able to discover geometric subgraphs that can be rotation, scaling, and translation invariant, and it can accommodate inherent errors on the coordinates of the vertices.