39 research outputs found
Privacy and Anonymization of Neighborhoods in Multiplex Networks
Since the beginning of the digital age, the amount of available data on human behaviour has dramatically increased, along with the risk to the privacy of the represented subjects. Since the analysis of those data can bring advances to science, it is important to share them while preserving the subjects' anonymity. A significant portion of the available information can be modelled as networks, introducing an additional privacy risk related to the structure of the data themselves. For instance, in a social network, people can be uniquely identifiable because of the structure of their neighborhood, formed by the number of their friends and the connections between them. The neighborhood's structure is the target of an identity disclosure attack on released social network data, called a neighborhood attack. To mitigate this threat, algorithms to anonymize networks have been proposed. However, this problem has not been deeply studied on multiplex networks, which combine different social network data into a single representation. The multiplex network representation makes the neighborhood attack setting more complicated, and adds information that an attacker can use to re-identify subjects.
This thesis aims to understand how multiplex networks behave in terms of anonymization difficulty and neighborhood attack. We present two definitions of multiplex neighborhoods, and discuss how the fraction of nodes with unique neighborhoods can be affected.
Through analysis of network models, we study the variation of the uniqueness of neighborhoods in networks with different structure and characteristics. We show that the uniqueness of neighborhoods has a linear trend depending on the network size and average degree. If the network has a more random structure, the uniqueness decreases significantly when the network size increases. On the other hand, if the local structure is more pronounced, the uniqueness is not strongly influenced by the number of nodes. We also conduct a motif analysis to study the recurring patterns that can make social networks' neighborhoods less unique.
Lastly, we propose an algorithm to anonymize a pair of multiplex neighborhoods. This algorithm is the core building block that can be used in a method to prevent neighborhood attacks on multiplex networks.
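A minimal Python sketch of the neighborhood-uniqueness measurement described in this abstract, for a single-layer network only. The thesis's two multiplex neighborhood definitions are not stated here, so they are not modelled. The signature used (sorted degrees inside the subgraph induced on a node's neighbors) is an assumption chosen for brevity: it is a coarse invariant rather than a full isomorphism test, so the result is a lower bound on the fraction of nodes with a unique neighborhood. The function names and the toy graph are hypothetical.

```python
from collections import Counter


def ego_signature(adj, node):
    """Coarse invariant of a node's 1-neighborhood (assumption, not the
    thesis's definition): sorted degrees within the subgraph induced on
    the node's neighbors."""
    nbrs = adj[node]
    return tuple(sorted(len(adj[u] & nbrs) for u in nbrs))


def unique_fraction(adj):
    """Fraction of nodes whose signature is shared with no other node."""
    sigs = {v: ego_signature(adj, v) for v in adj}
    counts = Counter(sigs.values())
    unique = sum(1 for v in adj if counts[sigs[v]] == 1)
    return unique / len(adj)


# Hypothetical toy graph: a triangle a-b-c with a tail c-d-e.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

fraction = unique_fraction(adj)
print(fraction)
```

In the toy graph, nodes a and b share a signature, while c, d, and e each have a distinct one, so three of the five nodes are flagged as unique.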
Statistical Analysis and Spectral Methods for Signal-Plus-Noise Matrix Models
The singular value matrix decomposition plays a ubiquitous role in statistics and related fields. Myriad applications including clustering, classification, and dimensionality reduction involve studying and understanding the geometric structure of singular values and singular vectors.
Chapter 2 of this dissertation presents an initial analysis of local (e.g., entrywise) singular vector (resp., eigenvector) perturbations for signal-plus-noise matrix models. We obtain both deterministic and probabilistic upper bounds on singular vector perturbations that complement and in certain settings improve upon classical, well-established benchmark bounds in the literature. We then apply our tools and methods of analysis to problems involving (spike) principal subspace estimation for high-dimensional covariance matrices and network models exhibiting community structure. Subsequently, Chapter 3 obtains precise local eigenvector estimation results under stronger assumptions involving signal strength, probabilistic concentration, and homogeneity. We provide in silico simulation examples to illustrate our theoretical bounds and distributional limit theory. Chapter 4 transitions to the investigation of singular value (resp., eigenvalue) perturbations, still in the signal-plus-noise matrix model framework. There, our results are leveraged for the purpose of better understanding hypothesis testing and change-point detection in statistical random graph analysis. Chapter 5 builds upon recent joint analysis of singular (resp., eigen) values and vectors in order to investigate the asymptotic relationship between spectral embedding performance and underlying network structure for stochastic block model graphs.
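To make the distinction between entrywise and whole-vector perturbation concrete, the following Python sketch builds a rank-one symmetric signal, adds symmetric Gaussian noise, and compares the leading eigenvector of the noisy matrix with the true one in both the maximum-entry and Euclidean norms. It is only an illustration under assumed settings: the dimension, signal strength, noise level, and the use of power iteration are all choices made here, not the dissertation's models or methods.

```python
import math
import random


def leading_eigvec(mat, rng, iters=200):
    """Power iteration for the eigenvector of the dominant eigenvalue."""
    n = len(mat)
    vec = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        new = [sum(row[j] * vec[j] for j in range(n)) for row in mat]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    return vec


rng = random.Random(0)
n, strength, noise_sd = 30, 50.0, 0.1  # assumed toy parameters
u = [1.0 / math.sqrt(n)] * n           # true unit-norm signal eigenvector

# Observed matrix = rank-one signal + symmetric Gaussian noise.
observed = [[strength * u[i] * u[j] for j in range(n)] for i in range(n)]
for i in range(n):
    for j in range(i, n):
        e = rng.gauss(0.0, noise_sd)
        observed[i][j] += e
        if j != i:
            observed[j][i] += e

u_hat = leading_eigvec(observed, rng)
# Eigenvectors are defined up to sign; align the estimate with the truth.
if sum(a * b for a, b in zip(u, u_hat)) < 0:
    u_hat = [-x for x in u_hat]

diff = [a - b for a, b in zip(u, u_hat)]
l2_error = math.sqrt(sum(d * d for d in diff))
max_error = max(abs(d) for d in diff)
print(f"l2 error: {l2_error:.4f}, entrywise (max) error: {max_error:.4f}")
```

The entrywise error can never exceed the Euclidean error, which is why bounds on it carry finer, per-coordinate information.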
Complex systems approach to natural language
The review summarizes the main methodological concepts used in studying
natural language from the perspective of complexity science and documents their
applicability in identifying both universal and system-specific features of
language in its written representation. Three main complexity-related research
trends in quantitative linguistics are covered. The first part addresses the
issue of word frequencies in texts and demonstrates that taking punctuation
into consideration restores scaling whose violation in Zipf's law is often
observed for the most frequent words. The second part introduces methods
inspired by time series analysis, used in studying various kinds of
correlations in written texts. The related time series are generated on the
basis of text partition into sentences or into phrases between consecutive
punctuation marks. It turns out that these series develop features often found
in signals generated by complex systems, like long-range correlations or
(multi)fractal structures. Moreover, it appears that the distances between
punctuation marks comply with the discrete variant of the Weibull distribution.
In the third part, the application of the network formalism to natural language
is reviewed, particularly in the context of the so-called word-adjacency
networks. Parameters characterizing the topology of such networks can be used for
classification of texts, for example, from a stylometric perspective. The network
approach can also be applied to represent the organization of word
associations. The structure of word-association networks turns out to be
significantly different from that observed in random networks, revealing
genuine properties of language. Finally, punctuation seems to have a
significant impact not only on the language's information-carrying ability but
also on its key statistical properties, hence it is recommended to consider
punctuation marks on a par with words.
Comment: 113 pages, 49 figures
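The two punctuation-related measurements mentioned in this abstract can be sketched in a few lines of Python: a rank-frequency table in which punctuation marks are counted as tokens alongside words, and the word counts between consecutive punctuation marks. The tokenizer is a deliberately simple assumption (a regular expression that splits contractions such as "don't" into separate tokens), and the function names and sample text are hypothetical. No distribution fitting is attempted.

```python
import re
from collections import Counter

TOKEN = r"\w+|[^\w\s]"  # a word, or a single punctuation character


def rank_frequency(text, keep_punctuation=True):
    """Tokens sorted by decreasing frequency, optionally with punctuation."""
    pattern = TOKEN if keep_punctuation else r"\w+"
    tokens = re.findall(pattern, text.lower())
    return Counter(tokens).most_common()


def punctuation_gaps(text):
    """Number of words between consecutive punctuation marks."""
    gaps, run = [], 0
    for tok in re.findall(TOKEN, text):
        if re.fullmatch(r"\w+", tok):
            run += 1
        else:
            gaps.append(run)
            run = 0
    return gaps


sample = "The cat sat. The dog sat, and the bird sang."
with_punct = rank_frequency(sample)
without_punct = rank_frequency(sample, keep_punctuation=False)
gaps = punctuation_gaps(sample)
print(with_punct[:3], gaps)
```

On a real corpus, the rank-frequency table would be plotted on log-log axes with and without punctuation to compare the scaling of the most frequent tokens.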
Dynamic Treatment Regimes with Interference
Precision medicine describes healthcare in which patient-level data are used to inform treatment decisions. Within this framework, dynamic treatment regimes (DTRs) are sequences of decision rules that take individual patient information as input, and then output treatment recommendations. The primary purpose of DTR research is to estimate the optimal dynamic treatment regimes: the sequence of treatment rules that will optimize some pre-defined outcomes across a population. The focus of this thesis is on developing methods for estimating optimal DTRs in the presence of interference, where one patient’s outcome can be affected by others’ treatment. DTR estimation methods typically rely on the assumption of no interference. In many social network contexts, such as friendship or family networks, and for many health concerns, such as infectious diseases, this assumption is questionable. Moreover, the existing doubly robust regression-based DTR estimation methods are primarily focused on continuous outcomes. DTR estimation methods for binary or ordinal outcomes are more complicated due to less information being provided by these discrete outcomes. Consequently, very few DTR estimation methods focus on binary or ordinal outcomes, let alone methods when interference is present. To address these problems, for continuous outcomes, we directly establish novel interference-aware DTR estimation methods, and for binary or ordinal outcomes, we develop methods for DTR estimation first in cases without interference and then in ones affected by it.
This thesis contains three main components: (1) a doubly robust method to estimate the optimal DTRs for individuals where the treatments of their connected neighbours in the same social network are taken into account in the decision rules; (2) a doubly robust method to estimate the optimal DTRs for binary outcomes using sequential weighted generalized linear models; (3) a doubly robust method to estimate the optimal DTRs for ordinal outcomes in the presence of household interference. In (1), we study the DTR estimation method of dynamic weighted ordinary least squares (dWOLS), which boasts easy implementation and double robustness, but relies on the no-interference assumption. We define a network propensity function and build on it to establish an implementation of dWOLS that remains doubly robust under interference associated with network links. The method's properties are shown via simulation and applied to household pairs data from the Population Assessment of Tobacco and Health (PATH) Study. Building on the theory of dWOLS and our interference-aware version, we then develop DTR estimation methods for binary and ordinal outcomes, in particular methods for settings with interference. In (2), considering binary outcomes, we propose a new method for DTR estimation without interference, the dynamic weighted generalized linear model (dWGLM), which accommodates binary outcomes while offering relatively straightforward implementation and robustness to model misspecification. We introduce the method and its underlying theory, and illustrate both in an analysis of e-cigarette usage and smoking cessation, using the observational data from the PATH study. Finally, in (3), we further extend these regression-based DTR methods to the ordinal outcome case, and also propose a robust method, the dynamic weighted proportional odds model (dWPOM).
Moreover, in the presence of household interference, exploring the possible correlation between treatments in the same household, we investigate the covariate balancing weights, which rely on the joint propensity score, and methods for estimating the joint propensity score. Examining different types of balancing weights, we verify the double robustness of dWPOM with our adjusted weights via simulation studies. Lastly, we also illustrate dWPOM in the analysis of data from PATH. For each participant's household, we derive the household treatment configuration recommendations for achieving the best outcome of the pair: both individuals quit or attempt to quit smoking.
LIPIcs, Volume 261, ICALP 2023, Complete Volume
Unsupervised Structural Embedding Methods for Efficient Collective Network Mining
How can we align accounts of the same user across social networks? Can we identify the professional role of an email user from their patterns of communication? Can we predict the medical effects of chemical compounds from their atomic network structure? Many problems in graph data mining, including all of the above, are defined on multiple networks. The central element to all of these problems is cross-network comparison, whether at the level of individual nodes or entities in the network or at the level of entire networks themselves. To perform this comparison meaningfully, we must describe the entities in each network expressively in terms of patterns that generalize across the networks. Moreover, because the networks in question are often very large, our techniques must be computationally efficient.
In this thesis, we propose scalable unsupervised methods that embed nodes in vector space by mapping nodes with similar structural roles in their respective networks, even if they come from different networks, to similar parts of the embedding space. We perform network alignment by matching nodes across two or more networks based on the similarity of their embeddings, and refine this process by reinforcing the consistency of each node’s alignment with those of its neighbors. By characterizing the distribution of node embeddings in a graph, we develop graph-level feature vectors that are highly effective for graph classification. With principled sparsification and randomized approximation techniques, we make all our methods computationally efficient and able to scale to graphs with millions of nodes or edges. We demonstrate the effectiveness of structural node embeddings on industry-scale applications, and propose an extensive set of embedding evaluation techniques that lay the groundwork for further methodological development and application.
PhD, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/162895/1/mheimann_1.pd
LIPIcs, Volume 251, ITCS 2023, Complete Volume