
    Online Spectral Clustering on Network Streams

    Graphs are an extremely useful representation of a wide variety of practical systems in data analysis. Recently, with the fast accumulation of stream data from various types of networks, significant research interest has arisen in spectral clustering for network streams (or evolving networks). Compared with the general spectral clustering problem, this new type of problem may impose additional requirements, such as short processing time, scalability in distributed computing environments, and temporal variation tracking. However, designing a spectral clustering method that satisfies these requirements is non-trivial. There are three major challenges for the new algorithm design. The first challenge is online clustering computation. Most of the existing spectral methods on evolving networks are off-line methods that use standard eigensystem solvers such as the Lanczos method and must recompute solutions from scratch at each time point. The second challenge is the parallelization of algorithms. Parallelizing such algorithms is non-trivial because standard eigensolvers are iterative and the number of iterations cannot be predetermined. The third challenge is the very limited existing work. In addition, the existing method has multiple limitations, such as computational inefficiency on large similarity changes, the lack of a sound theoretical basis, and the lack of an effective way to handle accumulated approximation errors and large data variations over time. In this thesis, we proposed a new online spectral graph clustering approach with a family of three novel spectrum approximation algorithms. Our algorithms incrementally update the eigenpairs in an online manner to improve the computational performance. Our approaches outperformed the existing method in computational efficiency and scalability while retaining competitive or even better clustering accuracy.
We derived our spectrum approximation techniques GEPT and EEPT through formal theoretical analysis. The well-established matrix perturbation theory forms a solid theoretical foundation for our online clustering method. We equipped our clustering method with a new metric to track accumulated approximation errors and measure the short-term temporal variation. The metric not only provides a balance between computational efficiency and clustering accuracy, but also offers a useful tool to adapt the online algorithm to unexpected drastic noise. In addition, we discussed our preliminary work on approximate graph mining with evolutionary process, non-stationary Bayesian Network structure learning from non-stationary time series data, and Bayesian Network structure learning with text priors imposed by non-parametric hierarchical topic modeling.
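
As an illustration of how matrix perturbation theory can update eigenpairs without re-running an eigensolver, the following is a minimal sketch of the textbook first-order perturbation update for a symmetric matrix. It is not the GEPT or EEPT algorithm from the thesis; the function name `first_order_update` and the small test matrix are assumptions made for illustration, and the formula assumes distinct eigenvalues and a small change `delta`.

```python
import numpy as np


def first_order_update(eigvals, eigvecs, delta):
    """Approximate the eigenpairs of A + delta from those of a symmetric A.

    Assumes (illustrative, not from the thesis): eigvecs has orthonormal
    columns, all eigenvalues are distinct, and delta is small and symmetric.
    """
    # projected[j, i] = x_j^T delta x_i
    projected = eigvecs.T @ delta @ eigvecs
    # Eigenvalue update: lambda_i + x_i^T delta x_i
    new_vals = eigvals + np.diag(projected)
    # gaps[j, i] = lambda_i - lambda_j; the diagonal is set to inf so that
    # the j == i term contributes zero to the eigenvector correction.
    gaps = eigvals[None, :] - eigvals[:, None]
    np.fill_diagonal(gaps, np.inf)
    coeffs = projected / gaps
    # Eigenvector update: x_i + sum_{j != i} coeffs[j, i] * x_j
    new_vecs = eigvecs + eigvecs @ coeffs
    new_vecs /= np.linalg.norm(new_vecs, axis=0)
    return new_vals, new_vecs


# Hypothetical 4x4 symmetric matrix with well-separated eigenvalues.
A = np.diag([1.0, 2.0, 4.0, 7.0]) + 0.1 * (np.ones((4, 4)) - np.eye(4))
# Small symmetric change, standing in for a similarity update at one time step.
delta = 1e-3 * (np.eye(4, k=1) + np.eye(4, k=-1))

vals, vecs = np.linalg.eigh(A)
approx_vals, approx_vecs = first_order_update(vals, vecs, delta)
exact_vals, _ = np.linalg.eigh(A + delta)
print(np.max(np.abs(approx_vals - exact_vals)))
```

The approximation degrades as `delta` grows or as eigenvalues move closer together, which is one reason an online method would need some way to track accumulated error and fall back to a full recomputation.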

    Learning Interpretable Features of Graphs and Time Series Data

    Graphs and time series are two of the most ubiquitous representations of modern data. Representation learning of real-world graphs and time-series data is a key component of downstream supervised and unsupervised machine learning tasks such as classification, clustering, and visualization. Because of the inherent high dimensionality, representation learning, i.e., low-dimensional vector-based embedding of graphs and time-series data, is very challenging. Learning interpretable features makes the roles of the features transparent and facilitates downstream analytics tasks, in addition to maximizing the performance of the downstream machine learning models. In this thesis, we leveraged tensor (multidimensional array) decomposition to generate an interpretable, low-dimensional feature space for graphs and time-series data from three domains: social networks, neuroscience, and heliophysics. We present the theoretical models and empirical results on node embedding of social networks, biomarker embedding on fMRI-based brain networks, and prediction and visualization of multivariate time-series-based flaring and non-flaring solar events.

    Network Sampling: From Static to Streaming Graphs

    Network sampling is integral to the analysis of social, information, and biological networks. Since many real-world networks are massive in size, continuously evolving, and/or distributed in nature, the network structure is often sampled in order to facilitate study. For these reasons, a more thorough and complete understanding of network sampling is critical to support the field of network science. In this paper, we outline a framework for the general problem of network sampling by highlighting the different objectives, population and units of interest, and classes of network sampling methods. In addition, we propose a spectrum of computational models for network sampling methods, ranging from the traditionally studied model based on the assumption of a static domain to a more challenging model that is appropriate for streaming domains. We design a family of sampling methods based on the concept of graph induction that generalize across the full spectrum of computational models (from static to streaming) while efficiently preserving many of the topological properties of the input graphs. Furthermore, we demonstrate how traditional static sampling algorithms can be modified for graph streams for each of the three main classes of sampling methods: node, edge, and topology-based sampling. Our experimental results indicate that our proposed family of sampling methods more accurately preserves the underlying properties of the graph for both static and streaming graphs. Finally, we study the impact of network sampling algorithms on the parameter estimation and performance evaluation of relational classification algorithms.
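
One way to picture sampling "based on the concept of graph induction" is to sample edges uniformly and then add back every edge of the original graph whose endpoints were both selected. The following is a minimal sketch under that reading, for a static edge list; it is not presented as the paper's exact algorithm, and the function name, parameters, and toy graph are hypothetical.

```python
import random


def edge_sample_with_induction(edges, num_seed_edges, seed=0):
    """Sample edges uniformly, then induce the subgraph on their endpoints.

    Illustrative assumption: `edges` is a list of distinct (u, v) tuples
    describing a static, undirected simple graph.
    """
    rng = random.Random(seed)
    # Step 1: uniform edge sampling without replacement.
    seed_edges = rng.sample(edges, num_seed_edges)
    # Step 2: collect every node touched by a sampled edge.
    nodes = {node for edge in seed_edges for node in edge}
    # Step 3 (induction): keep all original edges between selected nodes,
    # not only the edges that happened to be sampled.
    induced = [(u, v) for (u, v) in edges if u in nodes and v in nodes]
    return nodes, induced


# Hypothetical toy graph: a triangle with a short tail.
edges = [(1, 2), (2, 3), (3, 1), (3, 4), (4, 5), (5, 6)]
nodes, induced = edge_sample_with_induction(edges, num_seed_edges=2, seed=42)
print(sorted(nodes), induced)
```

The induction step is what distinguishes this from plain edge sampling: it can recover edges such as the third side of a triangle, which plain edge sampling would keep only by chance.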

    Mining subjectively interesting patterns in rich data

    Geometric, Feature-based and Graph-based Approaches for the Structural Analysis of Protein Binding Sites : Novel Methods and Computational Analysis

    In this thesis, protein binding sites are considered. To enable the extraction of information from the space of protein binding sites, these binding sites must be mapped onto a mathematical space. This can be done by mapping binding sites onto vectors, graphs, or point clouds. To impose a structure on this mathematical space, a distance measure is required, which is introduced in this thesis. This distance measure can then be used to extract information by means of data mining techniques.

    Graph-based, systems approach for detecting violent extremist radicalization trajectories and other latent behaviors, A

    2017 Summer. Includes bibliographical references. The number and lethality of violent extremist plots motivated by the Salafi-jihadist ideology have been growing for nearly the last decade in both the U.S. and Western Europe. While detecting the radicalization of violent extremists is a key component in preventing future terrorist attacks, it remains a significant challenge to law enforcement due to the issues of both scale and dynamics. Recent terrorist attack successes highlight the real possibility of missed signals from, or continued radicalization by, individuals whom the authorities had formerly investigated and even interviewed. Additionally, beyond the behavioral dynamics of a person of interest, investigators need to consider the behaviors and activities of social ties vis-à-vis the person of interest. We take a fundamentally systems-oriented approach to addressing these challenges by investigating the need for and feasibility of a radicalization detection system, a risk assessment assistance technology for law enforcement and intelligence agencies. The proposed system first mines public data and government databases for individuals who exhibit risk indicators for extremist violence, and then enables law enforcement to monitor those individuals at the scope and scale that is lawful, and to account for the dynamic indicative behaviors of the individuals and their associates rigorously and automatically. In this thesis, we first identify the operational deficiencies of current law enforcement and intelligence agency efforts, investigate the environmental conditions and stakeholders most salient to the development and operation of the proposed system, and address both programmatic and technical risks with several initial mitigating strategies. We codify this large effort into a radicalization detection system framework.
The main thrust of this effort is the investigation of the technological opportunities for the identification of individuals matching a radicalization pattern of behaviors in the proposed radicalization detection system. We frame our technical approach as a unique dynamic graph pattern matching problem, and develop a technology called INSiGHT (Investigative Search for Graph Trajectories) to help identify individuals or small groups with conforming subgraphs to a radicalization query pattern, and follow the match trajectories over time. INSiGHT is aimed at assisting law enforcement and intelligence agencies in monitoring and screening for those individuals whose behaviors indicate a significant risk for violence, and at allowing better prioritization of limited investigative resources. We demonstrated the performance of INSiGHT on a variety of datasets, including small synthetic radicalization-specific datasets, a real behavioral dataset of time-stamped radicalization indicators of recent U.S. violent extremists, and a large, real-world BlogCatalog dataset serving as a proxy for the type of intelligence or law enforcement data networks that could be utilized to track the radicalization of violent extremists. We also extended INSiGHT by developing a non-combinatorial neighbor matching technique to enable analysts to maintain visibility of potential collective threats and conspiracies and account for the role close social ties have in an individual's radicalization. This enhancement was validated on small, synthetic radicalization-specific datasets as well as the large BlogCatalog dataset with real social network connections and tagging behaviors for over 80K accounts. The results showed that our algorithm returned whole and partial subgraph matches that enabled analysts to gain and maintain visibility on neighbors' activities.
Overall, INSiGHT led to consistent, informed, and reliable assessments about those who pose a significant risk for some latent behavior in a variety of settings. Based upon these results, we maintain that INSiGHT is a feasible and useful supporting technology with the potential to optimize law enforcement investigative efforts and ultimately enable the prevention of individuals from carrying out extremist violence. Although the prime motivation of this research is the detection of violent extremist radicalization, we found that INSiGHT is applicable in detecting latent behaviors in other domains such as on-line student assessment and consumer analytics. This utility was demonstrated through experiments with real data. For on-line student assessment, we tested INSiGHT on a MOOC dataset of students and time-stamped on-line course activities to predict those students who persisted in the course. For consumer analytics, we tested the performance on a real, large proprietary consumer activities dataset from a home improvement retailer. Lastly, motivated by the desire to validate INSiGHT as a screening technology when ground truth is known, we developed a synthetic data generator of large population, time-stamped, individual-level consumer activities data consistent with an a priori project set designation (latent behavior). This contribution also sets the stage for future work in developing an analogous synthetic data generator for radicalization indicators to serve as a testbed for INSiGHT and other data mining algorithms.

    Multilayer Networks

    In most natural and engineered systems, a set of entities interact with each other in complicated patterns that can encompass multiple types of relationships, change in time, and include other types of complications. Such systems include multiple subsystems and layers of connectivity, and it is important to take such "multilayer" features into account to try to improve our understanding of complex systems. Consequently, it is necessary to generalize "traditional" network theory by developing (and validating) a framework and associated tools to study multilayer systems in a comprehensive fashion. The origins of such efforts date back several decades and arose in multiple disciplines, and now the study of multilayer networks has become one of the most important directions in network science. In this paper, we discuss the history of multilayer networks (and related concepts) and review the exploding body of work on such networks. To unify the disparate terminology in the large body of recent work, we discuss a general framework for multilayer networks, construct a dictionary of terminology to relate the numerous existing concepts to each other, and provide a thorough discussion that compares, contrasts, and translates between related notions such as multilayer networks, multiplex networks, interdependent networks, networks of networks, and many others. We also survey and discuss existing data sets that can be represented as multilayer networks. We review attempts to generalize single-layer-network diagnostics to multilayer networks. We also discuss the rapidly expanding research on multilayer-network models and notions like community structure, connected components, tensor decompositions, and various types of dynamical processes on multilayer networks. We conclude with a summary and an outlook. Comment: Working paper; 59 pages, 8 figures

    Detection and Reinforcement of Celiac Communities on Twitter Argentina

    Social networks have shown great growth in both their number of users and the content they generate. For example, Twitter is used as a means to gather support, express ideas and opinions on various topics, or interact with users with similar interests. In the latter case, the idea of community formation appears, that is, groups of users that are more closely related to each other than to the rest of the nodes in the network. In this work we propose the detection of the community of users in Argentina interested in celiac disease. We apply a series of techniques to detect and characterize them. In addition, we propose and use a methodology for the detection of the more influential and active nodes (users), showing how the community can be reinforced by the recommendation of some particular links. The results show that with only a low percentage of accepted recommendations the network becomes denser and the average distance between two users decreases quickly, thus improving the spread of information.
Affiliation: Giordano, Luis Andres. Universidad Nacional de Luján. Departamento de Ciencias Básicas; Argentina
Affiliation: Banchero, Santiago. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad Nacional de Luján. Departamento de Ciencias Básicas; Argentina
Affiliation: Cerny, Natacha. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad Nacional de Luján. Departamento de Ciencias Básicas; Argentina
Affiliation: de Marzi, Mauricio Cesar. Universidad Nacional de Luján. Departamento de Ciencias Básicas; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
Affiliation: Tolosa, Gabriel. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad Nacional de Luján. Departamento de Ciencias Básicas; Argentina