11 research outputs found

    Off-Policy Evaluation of Probabilistic Identity Data in Lookalike Modeling

    Full text link
    We evaluate the impact of probabilistically-constructed digital identity data collected from Sep. to Dec. 2017 (approx.), in the context of Lookalike-targeted campaigns. The backbone of this study is a large set of probabilistically-constructed "identities", represented as small bags of cookies and mobile ad identifiers with associated metadata, that are likely all owned by the same underlying user. The identity data allows to generate "identity-based", rather than "identifier-based", user models, giving a fuller picture of the interests of the users underlying the identifiers. We employ off-policy techniques to evaluate the potential of identity-powered lookalike models without incurring the risk of allowing untested models to direct large amounts of ad spend or the large cost of performing A/B tests. We add to historical work on off-policy evaluation by noting a significant type of "finite-sample bias" that occurs for studies combining modestly-sized datasets and evaluation metrics involving rare events (e.g., conversions). We illustrate this bias using a simulation study that later informs the handling of inverse propensity weights in our analyses on real data. We demonstrate significant lift in identity-powered lookalikes versus an identity-ignorant baseline: on average ~70% lift in conversion rate. This rises to factors of ~(4-32)x for identifiers having little data themselves, but that can be inferred to belong to users with substantial data to aggregate across identifiers. This implies that identity-powered user modeling is especially important in the context of identifiers having very short lifespans (i.e., frequently churned cookies). Our work motivates and informs the use of probabilistically-constructed identities in marketing. It also deepens the canon of examples in which off-policy learning has been employed to evaluate the complex systems of the internet economy.Comment: Accepted by WSDM 201

    Clustering and Community Detection in Directed Networks: A Survey

    Full text link
    Networks (or graphs) appear as dominant structures in diverse domains, including sociology, biology, neuroscience and computer science. In most of the aforementioned cases graphs are directed - in the sense that there is directionality on the edges, making the semantics of the edges non symmetric. An interesting feature that real networks present is the clustering or community structure property, under which the graph topology is organized into modules commonly called communities or clusters. The essence here is that nodes of the same community are highly similar while on the contrary, nodes across communities present low similarity. Revealing the underlying community structure of directed complex networks has become a crucial and interdisciplinary topic with a plethora of applications. Therefore, naturally there is a recent wealth of research production in the area of mining directed graphs - with clustering being the primary method and tool for community detection and evaluation. The goal of this paper is to offer an in-depth review of the methods presented so far for clustering directed networks along with the relevant necessary methodological background and also related applications. The survey commences by offering a concise review of the fundamental concepts and methodological base on which graph clustering algorithms capitalize on. Then we present the relevant work along two orthogonal classifications. The first one is mostly concerned with the methodological principles of the clustering algorithms, while the second one approaches the methods from the viewpoint regarding the properties of a good cluster in a directed network. Further, we present methods and metrics for evaluating graph clustering results, demonstrate interesting application domains and provide promising future research directions.Comment: 86 pages, 17 figures. Physics Reports Journal (To Appear

    Graph-Based Machine Learning for Passive Network Reconnaissance within Encrypted Networks

    Get PDF
    Network reconnaissance identifies a network’s vulnerabilities to both prevent and mitigate the impact of cyber-attacks. The difficulty of performing adequate network reconnaissance has been exacerbated by the rising complexity of modern networks (e.g., encryption). We identify that the majority of network reconnaissance solutions proposed in literature are infeasible for widespread deployment in realistic modern networks. This thesis provides novel network reconnaissance solutions to address the limitations of the existing conventional approaches proposed in literature. The existing approaches are limited by their reliance on large, heterogeneous feature sets making them difficult to deploy under realistic network conditions. In contrast, we devise a bipartite graph-based representation to create network reconnaissance solutions that rely only on a single feature (e.g., the Internet protocol (IP) address field). We exploit a widely available feature set to provide network reconnaissance solutions that are scalable, independent of encryption, and deployable across diverse Internet (TCP/IP) networks. We design bipartite graph embeddings (BGE); a graph-based machine learning (ML) technique for extracting insight from the structural properties of the bipartite graph-based representation. BGE is the first known graph embedding technique designed explicitly for network reconnaissance. We validate the use of BGE through an evaluation of a university’s enterprise network. BGE is shown to provide insight into crucial areas of network reconnaissance (e.g., device characterisation, service prediction, and network visualisation). We design an extension of BGE to acquire insight within a private network. Private networks—such as a virtual private network (VPN)—have posed significant challenges for network reconnaissance as they deny direct visibility into their composition. Our extension of BGE provides the first known solution for inferring the composition of both the devices and applications acting behind diverse private networks. This thesis provides novel graph-based ML techniques for two crucial aims of network reconnaissance—device characterisation and intrusion detection. The techniques developed within this thesis provide unique cybersecurity solutions to both prevent and mitigate the impact of cyber-attacks.Thesis (Ph.D.) -- University of Adelaide, School of Electrical and Electronic Engineering , 202

    Algorithms For Community Identification In Complex Networks

    Get PDF
    First and foremost, I would like to extend my deepest gratitude to my advisor, Professor Narsingh Deo, for his excellent guidance and encouragement, and also for introducing me to this wonderful science of complex networks. Without his support this dissertation would not have been possible. I would also like to thank the members of my research committee, professors Charles Hughes, Ratan Guha, Mainak Chatterjee and Yue Zhao for their advice and guidance during the entire process. I am indebted to the faculty and the staff of the Department of Electrical Engineering and Computer Science for providing me the resources and environment to perform this research. I am grateful to my colleagues in the Parallel and Quantum computing lab for the stimulating discussions and support. I would also like to thank Dr. Hemant Balakrishnan and Dr. Sanjeeb Nanda for their valuable suggestions and guidance. My heartfelt thanks to my parents, Vasudevan and Raji, who have always been supportive of my decisions and encouraged me with their best wishes. I would also like to thank my sister Gomathy, for her words of care and affection during tough times. Special thanks to my friends in Orlando for being there when I needed the

    Methods for multilevel analysis and visualisation of geographical networks

    Get PDF

    Graph Analysis and Applications in Clustering and Content-based Image Retrieval

    Get PDF
    About 300 years ago, when studying Seven Bridges of Königsberg problem - a famous problem concerning paths on graphs - the great mathematician Leonhard Euler said, “This question is very banal, but seems to me worthy of attention”. Since then, graph theory and graph analysis have not only become one of the most important branches of mathematics, but have also found an enormous range of important applications in many other areas. A graph is a mathematical model that abstracts entities and the relationships between them as nodes and edges. Many types of interactions between the entities can be modeled by graphs, for example, social interactions between people, the communications between the entities in computer networks and relations between biological species. Although not appearing to be a graph, many other types of data can be converted into graphs by cer- tain operations, for example, the k-nearest neighborhood graph built from pixels in an image. Cluster structure is a common phenomenon in many real-world graphs, for example, social networks. Finding the clusters in a large graph is important to understand the underlying relationships between the nodes. Graph clustering is a technique that partitions nodes into clus- ters such that connections among nodes in a cluster are dense and connections between nodes in different clusters are sparse. Various approaches have been proposed to solve graph clustering problems. A common approach is to optimize a predefined clustering metric using different optimization methods. However, most of these optimization problems are NP-hard due to the discrete set-up of the hard-clustering. These optimization problems can be relaxed, and a sub-optimal solu- tion can be found. A different approach is to apply data clustering algorithms in solving graph clustering problems. With this approach, one must first find appropriate features for each node that represent the local structure of the graph. Limited Random Walk algorithm uses the random walk procedure to explore the graph and extracts ef- ficient features for the nodes. It incorporates the embarrassing parallel paradigm, thus, it can process large graph data efficiently using mod- ern high-performance computing facilities. This thesis gives the details of this algorithm and analyzes the stability issues of the algorithm. Based on the study of the cluster structures in a graph, we define the authenticity score of an edge as the difference between the actual and the expected number of edges that connect the two groups of the neighboring nodes of the two end nodes. Authenticity score can be used in many important applications, such as graph clustering, outlier detection, and graph data preprocessing. In particular, a data clus- tering algorithm that uses the authenticity scores on mutual k-nearest neighborhood graph achieves more reliable and superior performance comparing to other popular algorithms. This thesis also theoretically proves that this algorithm can asymptotically find the complete re- covery of the ground truth of the graphs that were generated by a stochastic r-block model. Content-based image retrieval (CBIR) is an important application in computer vision, media information retrieval, and data mining. Given a query image, a CBIR system ranks the images in a large image database by their “similarities” to the query image. However, because of the ambiguities of the definition of the “similarity”, it is very diffi- cult for a CBIR system to select the optimal feature set and ranking algorithm to satisfy the purpose of the query. Graph technologies have been used to improve the performance of CBIR systems in var- ious ways. In this thesis, a novel method is proposed to construct a visual-semantic graph—a graph where nodes represent semantic concepts and edges represent visual associations between concepts. The constructed visual-semantic graph not only helps the user to locate the target images quickly but also helps answer the questions related to the query image. Experiments show that the efforts of locating the target image are reduced by 25% with the help of visual-semantic graphs. Graph analysis will continue to play an important role in future data analysis. In particular, the visual-semantic graph that captures important and interesting visual associations between the concepts is worthyof further attention

    Algorithms for nonuniform networks

    Get PDF
    In this thesis, observations on structural properties of natural networks are taken as a starting point for developing efficient algorithms for natural instances of different graph problems. The key areas discussed are sampling, clustering, routing, and pattern mining for large, nonuniform graphs. The results include observations on structural effects together with algorithms that aim to reveal structural properties or exploit their presence in solving an interesting graph problem. Traditionally networks were modeled with uniform random graphs, assuming that each vertex was equally important and each edge equally likely to be present. Within the last decade, the approach has drastically changed due to the numerous observations on structural complexity in natural networks, many of which proved the uniform model to be inadequate for some contexts. This quickly lead to various models and measures that aim to characterize topological properties of different kinds of real-world networks also beyond the uniform networks. The goal of this thesis is to utilize such observations in algorithm design, in addition to empowering the process of network analysis. Knowing that a graph exhibits certain characteristics allows for more efficient storage, processing, analysis, and feature extraction. Our emphasis is on local methods that avoid resorting to information of the graph structure that is not relevant to the answer sought. For example, when seeking for the cluster of a single vertex, we compute it without using any global knowledge of the graph, iteratively examining the vicinity of the seed vertex. Similarly we propose methods for sampling and spanning-tree construction according to certain criteria on the outcome without requiring knowledge of the graph as a whole. Our motivation for concentrating on local methods is two-fold: one driving factor is the ever-increasing size of real-world problems, but an equally important fact is the nonuniformity present in many natural graph instances; properties that hold for the entire graph are often lost when only a small subgraph is examined.reviewe
    corecore