12 research outputs found
The characteristics of cycle-nodes-ratio and its application to network classification
Cycles, which can be found in many different kinds of networks, make
problems more intractable, especially when dealing with dynamical
processes on networks. Tree networks, by contrast, contain no cycles; they
are simplifications that usually allow for analytical treatment. What has
been lacking, however, is a quantity describing the prevalence of cycles,
which determines how close a network is to a tree. We therefore introduce
the Cycle Nodes Ratio (CNR), the ratio of the number of nodes belonging to
cycles to the total number of nodes, and provide an algorithm to calculate
it. We study the CNR in both network models and real networks. The CNR
remains unchanged in differently sized Erd\"os-R\'enyi (ER) networks with
the same average degree, and increases with the average degree, exhibiting
a critical turning point. We give approximate analytical solutions for the
CNR in ER networks, which fit the simulations well. Furthermore, we
analyse the difference between the CNR and the two-core ratio (TCR), and
explore the critical phenomenon by analysing the giant component of
networks. Comparing the CNR in network models and real networks, we find
that the latter is generally smaller. Combined with a coarse-graining
method, the CNR can distinguish the structure of networks with high
average degree. We also apply the CNR to four different kinds of
transportation networks and to fungal networks, which give rise to
different zones of effect. Interestingly, the CNR proves very useful in
network recognition with machine learning. Comment: 27 pages, 16 figures,
3 tables
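The abstract distinguishes the CNR from the two-core ratio: a node can survive 2-core pruning (degree at least 2) while lying only on bridges between cycles, so TCR can exceed CNR. A minimal pure-Python sketch of both quantities (not the authors' algorithm; a node lies on a cycle iff it is an endpoint of a non-bridge edge, assuming a simple graph):

```python
def two_core(adj):
    """Iteratively strip nodes of degree < 2; return surviving nodes."""
    deg = {v: len(ns) for v, ns in adj.items()}
    removed = {v for v in adj if deg[v] < 2}
    stack = list(removed)
    while stack:
        v = stack.pop()
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                if deg[u] < 2:
                    removed.add(u)
                    stack.append(u)
    return set(adj) - removed

def bridges(adj):
    """Tarjan's bridge finding: edges whose removal disconnects the graph."""
    disc, low, out = {}, {}, set()
    timer = [0]
    def dfs(v, parent):
        disc[v] = low[v] = timer[0]; timer[0] += 1
        for u in adj[v]:
            if u == parent:
                continue
            if u in disc:
                low[v] = min(low[v], disc[u])
            else:
                dfs(u, v)
                low[v] = min(low[v], low[u])
                if low[u] > disc[v]:      # no back edge crosses (v, u)
                    out.add(frozenset((v, u)))
    for v in adj:
        if v not in disc:
            dfs(v, None)
    return out

def cnr(adj):
    """A node lies on a cycle iff some incident edge is not a bridge."""
    br = bridges(adj)
    on_cycle = {v for v in adj for u in adj[v] if frozenset((v, u)) not in br}
    return len(on_cycle) / len(adj)

# Two triangles joined through a middle node m: m survives the 2-core
# (degree 2) but lies on no cycle, so TCR = 1 while CNR = 6/7.
G = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "m"],
     "m": ["c", "d"], "d": ["m", "e", "f"], "e": ["d", "f"], "f": ["d", "e"]}
```

On the example graph, `two_core(G)` keeps all seven nodes while `cnr(G)` is 6/7, illustrating the gap between the two quantities.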
A Central Limit Theorem for Diffusion in Sparse Random Graphs
We consider bootstrap percolation and diffusion in sparse random graphs
with fixed degrees, constructed by the configuration model. Every node has
two states: it is either active or inactive. We assume that each node is
assigned a nonnegative integer threshold. The diffusion process is
initiated by the subset of nodes with threshold zero, which constitutes
the initially activated nodes, whereas every other node is inactive.
Subsequently, in each round, if an inactive node with threshold k has at
least k of its neighbours activated, then it also becomes active and
remains so forever. This is repeated until no more nodes become activated.
The main result of this paper provides a central limit theorem for the
final size of activated nodes. Namely, under suitable assumptions on the
degree and threshold distributions, we show that the final size of
activated nodes has asymptotically Gaussian fluctuations. Comment: 17 pages
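The round-based activation process described above is straightforward to simulate. A small hypothetical sketch (the paper proves limit theorems; this code only runs the diffusion to its final set):

```python
from collections import deque

def diffuse(adj, threshold):
    """Run threshold diffusion to completion; return the final active set.
    adj: node -> list of neighbours; threshold: node -> nonnegative int."""
    active = {v for v in adj if threshold[v] == 0}   # seeds
    hits = {v: 0 for v in adj}                       # active neighbours seen so far
    queue = deque(active)
    while queue:
        v = queue.popleft()
        for u in adj[v]:
            if u not in active:
                hits[u] += 1
                if hits[u] >= threshold[u]:          # threshold reached
                    active.add(u)
                    queue.append(u)
    return active

# Path 0-1-2-3: node 0 is a seed; nodes 1 and 2 activate in turn, but
# node 3 needs two active neighbours and has only one, so it stays inactive.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
theta = {0: 0, 1: 1, 2: 1, 3: 2}
```

Here `diffuse(adj, theta)` returns the activated set {0, 1, 2}; the "final size" studied in the paper is the cardinality of this set.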
K-Connected Cores Computation in Large Dual Networks
© 2018, The Author(s). Computing k-cores is a fundamental and important graph problem with applications in many areas, such as community detection, network visualization, and network topology analysis. Due to the complex relationships between different entities, dual graphs arise widely in applications. A dual graph contains a physical graph and a conceptual graph, both of which have the same vertex set. Given that there are no previous studies on the k-core in dual graphs, we formulate a k-connected core (k-CCO) model in dual graphs: a k-CCO is a k-core in the conceptual graph that is also connected in the physical graph. Given a dual graph and an integer k, we propose a polynomial-time algorithm for computing all k-CCOs. We also propose three algorithms for computing all maximum-connected cores (MCCOs), which are the existing k-CCOs such that no (k+1)-CCO exists. We further study a subgraph search problem: computing a k-CCO that contains a set of query vertices. We propose an index-based approach to efficiently answer the query for any given parameter k. We conduct extensive experiments on six real-world datasets and four synthetic datasets. The experimental results demonstrate the effectiveness and efficiency of our proposed algorithms.
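A k-CCO couples peeling in the conceptual graph with connectivity in the physical graph. One plausible fixpoint sketch (not the paper's algorithm, only an illustration of the model): repeatedly restrict to the conceptual k-core, split along physical connected components, and recurse until each part is both a conceptual k-core and physically connected.

```python
def k_core(adj, k, nodes):
    """k-core of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    deg = {v: sum(1 for u in adj.get(v, ()) if u in nodes) for v in nodes}
    removed = {v for v in nodes if deg[v] < k}
    stack = list(removed)
    while stack:
        v = stack.pop()
        for u in adj.get(v, ()):
            if u in nodes and u not in removed:
                deg[u] -= 1
                if deg[u] < k:
                    removed.add(u)
                    stack.append(u)
    return nodes - removed

def components(adj, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    nodes, out = set(nodes), []
    while nodes:
        comp, stack = set(), [next(iter(nodes))]
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(u for u in adj.get(v, ()) if u in nodes)
        nodes -= comp
        out.append(comp)
    return out

def k_ccos(conceptual, physical, k):
    """Vertex sets that are k-cores in `conceptual` and connected in `physical`."""
    pending = [set(conceptual) | set(physical)]
    result = []
    while pending:
        part = pending.pop()
        core = k_core(conceptual, k, part)
        if not core:
            continue
        comps = components(physical, core)
        if len(comps) == 1 and comps[0] == part:
            result.append(part)   # stable: conceptual k-core, physically connected
        else:
            pending.extend(comps)
    return result

# Conceptual triangle a,b,c with pendant d; physically, d is isolated,
# so the only 2-CCO is {a, b, c}.
conceptual = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
physical = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": []}
```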
Operationalizing anthropological theory: four techniques to simplify networks of co-occurring ethnographic codes
The use of data and algorithms in the social sciences allows for exciting progress, but also poses epistemological challenges. Operations that appear innocent and purely technical may profoundly influence final results. Researchers working with data can make their process less arbitrary and more accountable by making theoretically grounded methodological choices. We apply this approach to the problem of simplifying networks representing ethnographic corpora, in the interest of visual interpretation. Network nodes represent ethnographic codes, and their edges the co-occurrence of codes in a corpus. We introduce and discuss four techniques to simplify such networks and facilitate visual analysis. We show how the mathematical characteristics of each one align with an identifiable approach in sociology or anthropology: structuralism and post-structuralism; identifying the central concepts in a discourse; and discovering hegemonic and counter-hegemonic clusters of meaning. We then provide an example of how the four techniques complement each other in ethnographic analysis.
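A code co-occurrence network of the kind described can be built directly from coded documents. A minimal sketch (names hypothetical; weight thresholding shown here is only the simplest possible simplification, not one of the paper's four techniques):

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(documents):
    """documents: iterable of sets of ethnographic codes.
    Returns a Counter mapping unordered code pairs to co-occurrence counts."""
    edges = Counter()
    for codes in documents:
        for a, b in combinations(sorted(codes), 2):
            edges[(a, b)] += 1
    return edges

def simplify(edges, min_weight):
    """Keep only edges whose codes co-occur at least `min_weight` times."""
    return {pair: w for pair, w in edges.items() if w >= min_weight}

# Three coded passages; the pair (exchange, kinship) occurs only once
# and is dropped when thresholding at weight 2.
docs = [{"kinship", "ritual"},
        {"kinship", "ritual", "exchange"},
        {"exchange", "ritual"}]
```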
I/O Efficient Core Graph Decomposition at Web Scale
Core decomposition is a fundamental graph problem with a large number of
applications. Most existing approaches for core decomposition assume that the
graph is kept in memory of a machine. Nevertheless, many real-world graphs are
big and may not reside in memory. In the literature, there is only one work for
I/O efficient core decomposition that avoids loading the whole graph in memory.
However, this approach is not scalable to handle big graphs because it cannot
bound the memory size and may load most parts of the graph in memory. In
addition, this approach can hardly handle graph updates. In this paper, we
study I/O efficient core decomposition following a semi-external model, which
only allows node information to be loaded in memory. This model works well in
many web-scale graphs. We propose a semi-external algorithm and two optimized
algorithms for I/O efficient core decomposition using very simple data
structures and a simple data access model. To handle dynamic graph updates, we show that our
algorithm can be naturally extended to handle edge deletion. We also propose an
I/O efficient core maintenance algorithm to handle edge insertion, and an
improved algorithm to further reduce I/O and CPU cost by investigating some new
graph properties. We conduct extensive experiments on 12 real large graphs. Our
optimal algorithm significantly outperforms the existing I/O efficient algorithm
in terms of both processing time and memory consumption. In many
memory-resident graphs, our algorithms for both core decomposition and
maintenance can even outperform the in-memory algorithm due to the simple
structures and data access model used. Our algorithms are very scalable to
handle web-scale graphs. As an example, we are the first to handle a web graph
with 978.5 million nodes and 42.6 billion edges using less than 4.2 GB of memory.
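For reference, the in-memory baseline that semi-external algorithms compete with is a simple peeling computation of core numbers. A sketch using a lazy min-heap (hypothetical code, not the paper's algorithm):

```python
import heapq

def core_numbers(adj):
    """Core number of each node: the largest k such that the node survives
    in the k-core. Peel nodes in nondecreasing order of current degree."""
    deg = {v: len(ns) for v, ns in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    core, removed, k = {}, set(), 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:   # stale heap entry, skip
            continue
        k = max(k, d)                     # core number is monotone in peel order
        core[v] = k
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return core

# Triangle a-b-c with a pendant node d attached to a:
# the triangle has core number 2, the pendant has core number 1.
G = {"a": ["b", "c", "d"], "b": ["a", "c"], "c": ["a", "b"], "d": ["a"]}
```

The semi-external model in the paper keeps only such per-node information (degrees, core estimates) in memory and streams the edges from disk; this sketch shows the peeling logic that both settings share.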
Phase Transitions in Semidefinite Relaxations
Statistical inference problems arising within signal processing, data mining,
and machine learning naturally give rise to hard combinatorial optimization
problems. These problems become intractable when the dimensionality of the data
is large, as is often the case for modern datasets. A popular idea is to
construct convex relaxations of these combinatorial problems, which can be
solved efficiently for large scale datasets.
Semidefinite programming (SDP) relaxations are among the most powerful
methods in this family, and are surprisingly well-suited for a broad range of
problems where data take the form of matrices or graphs. It has been observed
several times that, when the `statistical noise' is small enough, SDP
relaxations correctly detect the underlying combinatorial structures.
In this paper we develop asymptotic predictions for several `detection
thresholds,' as well as for the estimation error above these thresholds. We
study some classical SDP relaxations for statistical problems motivated by
graph synchronization and community detection in networks. We map these
optimization problems to statistical mechanics models with vector spins, and
use non-rigorous techniques from statistical mechanics to characterize the
corresponding phase transitions. Our results clarify the effectiveness of SDP
relaxations in solving high-dimensional statistical problems. Comment: 71 pages, 24 pdf figures
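For concreteness, a standard example of the relaxations studied in this line of work: for $\mathbb{Z}_2$ synchronization or two-group community detection with data matrix $Y$, the combinatorial problem $\max_{x \in \{\pm 1\}^n} x^{\top} Y x$ is relaxed, via $X = x x^{\top}$, to the semidefinite program

```latex
\max_{X \in \mathbb{R}^{n \times n}} \; \langle Y, X \rangle
\qquad \text{subject to} \qquad X \succeq 0, \quad X_{ii} = 1 \;\; (i = 1, \dots, n),
```

which is convex and hence solvable in polynomial time; the detection thresholds in the paper describe when its solution recovers the hidden signs.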
The 2-Core of a Random Inhomogeneous Hypergraph
The k-core of a hypergraph is the unique maximal induced subgraph in which all vertices have degree at least k. We study the 2-core of a random hypergraph by probabilistic analysis of the following edge-removal rule: remove any vertices with degree less than 2, and remove all hyperedges incident to these vertices. Iterated, this process terminates with the 2-core. The hypergraph model studied is an inhomogeneous model, in which the expected degrees are not identical. The main result we prove is that as the number of vertices n tends to infinity, the number of hyperedges R in the 2-core obeys a limit law: R/n converges in probability to a non-random constant.
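The edge-removal rule in the abstract is easy to state in code. A small sketch of the peeling process (hypothetical names, for illustration only):

```python
from collections import Counter

def hypergraph_two_core(hyperedges):
    """Repeatedly delete vertices of degree < 2 together with all
    hyperedges incident to them; return the surviving hyperedges."""
    edges = [frozenset(e) for e in hyperedges]
    while True:
        deg = Counter(v for e in edges for v in e)
        weak = {v for e in edges for v in e if deg[v] < 2}
        if not weak:
            return edges
        edges = [e for e in edges if not (e & weak)]

# A "cycle" of 2-element hyperedges survives intact; the pendant edge
# {3, 4} is peeled away because vertex 4 has degree 1.
core = hypergraph_two_core([{1, 2}, {2, 3}, {3, 1}, {3, 4}])
```

The quantity studied in the paper is the limit of len(core)/n for random inhomogeneous hypergraphs as the vertex count n grows.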
Engineering Compressed Static Functions and Minimal Perfect Hash Functions
\emph{Static functions} are data structures meant to store arbitrary mappings from finite sets to integers: given a universe of items U, a set of n pairs (k, v) with k drawn from U, a static function retrieves v given k (usually in constant time). When every key is mapped to a different value, this function is called a \emph{perfect hash function}, and when the data structure yields an injective numbering of the n keys into {0, ..., n-1}, the mapping is called a \emph{minimal perfect hash function} (MPHF). Big data brought back one of the most critical challenges that computer scientists have been tackling during the last fifty years, that is, analyzing big amounts of data that do not fit in main memory. While for small keysets these mappings can be easily implemented using hash tables, this solution does not scale well to bigger sets. Static functions and MPHFs break the information-theoretical lower bound of storing the keyset because they are allowed to return \emph{any} value if the queried key is not in the original keyset. The classical construction techniques achieve space within a small constant factor of the optimum, always with constant access time. All these features make static functions and MPHFs powerful techniques when handling, for instance, large sets of strings, and they are essential building blocks of space-efficient data structures such as (compressed) full-text indexes, monotone MPHFs, Bloom-filter-like data structures, and prefix-search data structures. The biggest challenge of this construction technique involves lowering the multiplicative constants hidden inside the asymptotic space bounds while keeping construction times feasible.
In this thesis, we take advantage of recent results in random linear systems theory, regarding the ratio between the number of variables and the number of equations, and in perfect hash data structures, to achieve practical static functions with the lowest space bounds so far, and with construction time comparable to widely used techniques. The new results, however, require solving linear systems that demand more than the simple triangulation process used in current state-of-the-art solutions. The main challenge in making such structures usable is mitigating the cubic running time of Gaussian elimination at construction time. To this purpose, we introduce novel techniques based on \emph{broadword programming} and a heuristic derived from \emph{structured Gaussian elimination}. We obtain data structures that are significantly smaller than commonly used hypergraph-based constructions while maintaining or improving the lookup times and still providing feasible construction times. We then apply these improvements to another kind of structure: \emph{compressed static hash functions}. The theoretical construction technique for this kind of data structure uses variable-length prefix-free codes to encode the set of values. Adopting this solution, we can reduce the space usage of each element to (essentially) the entropy of the list of output values of the function. This, however, requires solving an even bigger linear system of equations, and the time required to build the structure increases. In this thesis, we present the first engineered implementation of compressed static hash functions. For example, we were able to store a function with geometrically distributed output using space per key close to the entropy of the output, independently of the keyset, with a construction time double that of a state-of-the-art non-compressed function, and similar lookup time. We can also store a function whose output follows a Zipfian distribution using far fewer bits per key than a non-compressed function would require, with a threefold increase in construction time and significantly faster lookups.
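The hypergraph-based constructions mentioned above store an array A and answer a query as A[h1(k)] XOR A[h2(k)] XOR A[h3(k)]; the resulting linear system is solved by peeling rather than full Gaussian elimination. A toy sketch of this classical scheme (hash function, segment sizes, and retry loop chosen only for illustration):

```python
import hashlib
from collections import Counter

def _positions(key, seed, seg):
    """Three array positions, one per disjoint segment of length `seg`."""
    h = hashlib.blake2b(f"{seed}:{key}".encode()).digest()
    return [i * seg + int.from_bytes(h[4*i:4*i + 4], "big") % seg
            for i in range(3)]

def build(pairs, seed, seg):
    """Solve A[p0] ^ A[p1] ^ A[p2] = v for every (key, v) by peeling:
    repeatedly remove a key owning a position no other live key touches.
    Returns the array A, or None if this seed is not peelable."""
    pos = {k: _positions(k, seed, seg) for k in pairs}
    count = Counter(p for ps in pos.values() for p in ps)
    alive, order = set(pairs), []
    while alive:
        k = next((k for k in alive if any(count[p] == 1 for p in pos[k])), None)
        if k is None:
            return None                      # nonempty 2-core: retry a new seed
        free = next(p for p in pos[k] if count[p] == 1)
        order.append((k, free))
        alive.remove(k)
        for p in pos[k]:
            count[p] -= 1
    A = [0] * (3 * seg)
    for k, free in reversed(order):          # back-substitution in reverse peel order
        p0, p1, p2 = pos[k]
        # XOR in the full triple; the old A[free] cancels itself out.
        A[free] = pairs[k] ^ A[p0] ^ A[p1] ^ A[p2] ^ A[free]
    return A

def query(A, key, seed, seg):
    p0, p1, p2 = _positions(key, seed, seg)
    return A[p0] ^ A[p1] ^ A[p2]

data = {"apple": 3, "pear": 7, "plum": 1, "fig": 5}
seed = next(s for s in range(1000) if build(data, s, 4) is not None)
A = build(data, seed, 4)
```

Note the defining property of static functions mentioned in the abstract: querying a key outside the original keyset returns an arbitrary value, which is exactly why the information-theoretic bound on storing the keyset can be beaten.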
Topics in Stochastic Analysis and Control
In this dissertation, problems in stochastic analysis and control are investigated, spanning mathematical finance, online learning, and mean field games. For mathematical finance, 1) a martingale optimal transport problem with bounded volatility is studied, which allows calibrating not only current observations (option prices) but also historical data (stock prices); see Chapter II, 2) the embedding problem in multiple dimensions is solved via excursion theory in probability; see Chapter III, 3) the size of the most stable subgraph of a random graph, the k-core, is determined by using branching processes; see Chapter IV. For online learning, 1) an unprecedented solution to the 4-expert problem with finite stopping is provided, via an explicit construction of the solution to a nonlinear partial differential equation; see Chapter V, 2) prediction problems with a limited adversary are studied using partial differential equation tools; see Chapters VI and VII. For mean field games, 1) the convergence phenomenon of the (N+1)-player Nash equilibrium is studied via the entropy solution to scalar conservation laws; see Chapter VIII, 2) infinite-horizon mean field type control and games are solved via McKean-Vlasov forward-backward stochastic differential equations; see Chapter IX. PhD thesis in Mathematics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/167918/1/zxmars_1.pd