277 research outputs found

    Diagnosis of Autism Spectrum Disorder Based on Brain Network Clustering

    Developments in magnetic resonance imaging (MRI) provide a new non-invasive approach, functional MRI (fMRI), to study brain function. With the help of fMRI, I can build functional brain networks (FBNs) to model correlations of brain activity between cortical regions. Studies of brain diseases, including autism spectrum disorder (ASD), have analyzed alterations in the FBNs of patients. New biomarkers have been identified, and new theories and assumptions have been proposed on the pathology of brain diseases. Because traditional clinical ASD diagnosis instruments, which rely heavily on interviews and observations, can yield large variance, recent studies have started to combine machine learning methods with FBNs to classify ASD automatically. Such studies have achieved relatively good accuracy. However, most of them extract features from whole-brain networks, so the feature dimension can be high. High-dimensional features may cause overfitting and increase computational complexity. Therefore, I need a feature selection strategy that effectively reduces feature dimensions while maintaining good classification performance. In this study, I present a new feature selection strategy that extracts features from functional modules rather than from whole-brain networks. I will show that my strategy not only reduces feature dimensions but also improves the performance of automatic ASD classification. The study is divided into four stages: building FBNs, identifying functional modules, statistically analyzing modular alterations and, finally, training classifiers with modular features for automatic ASD classification. I first demonstrate the whole procedure for building FBNs from fMRI images. To identify functional modules, I propose a new network clustering algorithm based on joint non-negative matrix factorization.
Unlike traditional brain network clustering algorithms, which mostly operate on an average network of all subjects, my algorithm factorizes multiple brain networks simultaneously, because the clustering results should be valid not only on the average network but also on each individual network. I show that the modules I find are more valid in both views. I then statistically analyze the alterations in functional modules between the ASD and typically developed (TD) groups to determine which modules to extract features from. Several graph-theoretic indices are calculated to measure modular properties. I find significant alterations in two modules. With features from these two modules, I train several widely used classifiers and validate them on a real-world dataset. Classifiers trained on modular features perform better than those trained on whole-brain features, which demonstrates the effectiveness of my feature selection strategy.
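
The idea of factorizing several brain networks at once with one shared module factor can be illustrated with a small sketch. This is not the algorithm from the thesis, whose details the abstract does not give; it is a generic joint NMF with Lee-Seung multiplicative updates, where each network A_s is approximated as W_s H^T and H is shared across subjects. The function names `joint_nmf` and `toy_networks`, the toy data, and all parameter values are assumptions made for illustration.

```python
import numpy as np


def joint_nmf(networks, k, n_iter=500, seed=0, eps=1e-9):
    """Approximate every network A_s by W_s @ H.T with one shared H.

    networks: list of (n x n) non-negative connectivity matrices.
    Returns (H, W_list, errors); row i of H scores node i against k modules.
    """
    rng = np.random.default_rng(seed)
    n = networks[0].shape[0]
    H = rng.uniform(0.1, 1.0, size=(n, k))
    Ws = [rng.uniform(0.1, 1.0, size=(n, k)) for _ in networks]
    errors = []
    for _ in range(n_iter):
        # Subject-specific factors: standard multiplicative NMF update.
        for s, A in enumerate(networks):
            Ws[s] *= (A @ H) / (Ws[s] @ (H.T @ H) + eps)
        # Shared factor: numerator and denominator pool all subjects.
        num = sum(A.T @ W for A, W in zip(networks, Ws))
        den = H @ sum(W.T @ W for W in Ws) + eps
        H *= num / den
        errors.append(sum(np.linalg.norm(A - W @ H.T) ** 2
                          for A, W in zip(networks, Ws)))
    return H, Ws, errors


def toy_networks(n_subjects=3, seed=1):
    """Hypothetical data: two modules of four nodes plus small symmetric noise."""
    rng = np.random.default_rng(seed)
    block = np.kron(np.eye(2), np.ones((4, 4)))
    nets = []
    for _ in range(n_subjects):
        noise = rng.uniform(0.0, 0.05, size=block.shape)
        nets.append(block + (noise + noise.T) / 2)
    return nets


networks = toy_networks()
H, Ws, errors = joint_nmf(networks, k=2)
modules = H.argmax(axis=1)  # assign each node to its strongest module
print(modules)
```

Because H is estimated from all subjects jointly while each W_s stays subject-specific, the module assignment has to fit every individual network, not only their average.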

    Commonsense Properties from Query Logs and Question Answering Forums

    Commonsense knowledge about object properties, human behavior and general concepts is crucial for robust AI applications. However, automatic acquisition of this knowledge is challenging because of sparseness and bias in online sources. This paper presents Quasimodo, a methodology and tool suite for distilling commonsense properties from non-standard web sources. We devise novel ways of tapping into search-engine query logs and QA forums, and combining the resulting candidate assertions with statistical cues from encyclopedias, books and image tags in a corroboration step. Unlike prior work on commonsense knowledge bases, Quasimodo focuses on salient properties that are typically associated with certain objects or concepts. Extensive evaluations, including extrinsic use-case studies, show that Quasimodo provides better coverage than state-of-the-art baselines with comparable quality

    Tensor Learning for Recovering Missing Information: Algorithms and Applications on Social Media

    Real-time social systems like Facebook, Twitter, and Snapchat have been growing rapidly, producing exabytes of data in different views or aspects. Coupled with more and more GPS-enabled sharing of videos, images, blogs, and tweets that provide valuable information regarding “who”, “where”, “when” and “what”, these real-time human sensor data promise new research opportunities to uncover models of user behavior, mobility, and information sharing. These real-time dynamics in social systems usually come in multiple aspects, which can help us better understand the social interactions of the underlying network. However, these multi-aspect datasets are often raw and incomplete for various unpredictable or unavoidable reasons; for instance, API limitations and data sampling policies can lead to an incomplete (and often biased) perspective on these multi-aspect datasets. This missing data can raise serious concerns, such as biased estimates of the structural properties of the network and of the properties of information cascades in social networks. In order to recover missing values or information in social systems, we identify “4S” challenges: extreme sparsity of the observed multi-aspect datasets, adoption of rich side information that describes the similarities of entities, generation of robust models rather than limiting them to specific applications, and scalability of models to handle real large-scale datasets (billions of observed entries). With these challenges in mind, this dissertation aims to develop scalable and interpretable tensor-based frameworks, algorithms and methods for recovering missing information on social media. In particular, this dissertation makes four unique contributions. The first contribution is a scalable framework based on low-rank tensor learning in the presence of incomplete information.
Concretely, we formally define the problem of recovering the spatio-temporal dynamics of online memes and tackle it with a novel tensor-based factorization approach based on the alternating direction method of multipliers (ADMM), integrating the latent relationships derived from contextual information among locations, memes, and times. The second contribution is to evaluate the generalization of the proposed tensor learning framework and extend it to the recommendation problem. In particular, we develop a novel tensor-based approach to personalized expert recommendation that integrates both the latent relationships between homogeneous entities (e.g., users and users, experts and experts) and the relationships between heterogeneous entities (e.g., users and experts, topics and experts) from the geo-spatial, topical, and social contexts. The third contribution is to extend the proposed tensor learning framework to the user topical profiling problem. Specifically, we propose a tensor-based contextual regularization model embedded into a matrix factorization framework, which leverages the social, textual, and behavioral contexts across users to overcome the identified challenges. The fourth contribution is to scale up the proposed tensor learning framework to handle real large-scale datasets that are too big to fit in the main memory of a single machine. In particular, we propose a novel distributed tensor completion algorithm with trace-based regularization of the auxiliary information, based on ADMM under the proposed tensor learning framework, which is designed to scale up to real large-scale tensors (e.g., billions of entries) by efficiently computing auxiliary variables, minimizing intermediate data, and reducing the workload of updating new tensors.
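
To make "recovering missing entries with a low-rank tensor model" concrete, here is a minimal sketch. It is not the dissertation's ADMM-based method and uses no side information: it fits a rank-R CP model to the observed entries only, by plain gradient descent. The function `cp_complete`, the tensor sizes, the learning rate and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def cp_complete(X, mask, rank, lr=0.01, n_iter=2000, seed=0):
    """Fit X[i, j, k] ~ sum_r A[i, r] * B[j, r] * C[k, r] on observed entries.

    mask holds 1.0 where X is observed and 0.0 where it is missing.
    Returns the completed tensor and the loss recorded at each iteration.
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.uniform(0.1, 1.0, (I, rank))
    B = rng.uniform(0.1, 1.0, (J, rank))
    C = rng.uniform(0.1, 1.0, (K, rank))
    losses = []
    for _ in range(n_iter):
        X_hat = np.einsum("ir,jr,kr->ijk", A, B, C)
        R = mask * (X_hat - X)  # residual on observed entries only
        losses.append(0.5 * np.sum(R ** 2))
        # Gradients of the masked squared error with respect to each factor.
        gA = np.einsum("ijk,jr,kr->ir", R, B, C)
        gB = np.einsum("ijk,ir,kr->jr", R, A, C)
        gC = np.einsum("ijk,ir,jr->kr", R, A, B)
        A -= lr * gA
        B -= lr * gB
        C -= lr * gC
    return np.einsum("ir,jr,kr->ijk", A, B, C), losses


# Hypothetical example: a 5 x 5 x 5 rank-2 tensor with about 30% of entries hidden.
rng = np.random.default_rng(42)
true_factors = [rng.uniform(0.0, 1.0, (5, 2)) for _ in range(3)]
X = np.einsum("ir,jr,kr->ijk", *true_factors)
mask = (rng.uniform(size=X.shape) < 0.7).astype(float)
X_hat, losses = cp_complete(X, mask, rank=2)
print(losses[0], losses[-1])
```

Entries where `mask` is zero never enter the loss, so `X_hat` at those positions is the model's estimate of the missing values. Methods like the one described in the abstract add regularization from contextual information on top of this basic objective.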

    COMMUNITY DETECTION IN GRAPHS

    Thesis (Ph.D.) - Indiana University, Luddy School of Informatics, Computing, and Engineering/University Graduate School, 2020. Community detection has always been one of the fundamental research topics in graph mining. As a type of unsupervised or semi-supervised approach, community detection aims to explore the high-order closeness of nodes by leveraging the topological structure of the graph. By grouping similar nodes or edges into the same community while separating dissimilar ones into different communities, the graph structure can be revealed at a coarser resolution. This can benefit numerous applications, such as shopping recommendation and advertising in e-commerce, protein-protein interaction prediction in bioinformatics, and literature recommendation or scholarly collaboration in citation analysis. However, identifying communities is an ill-defined problem. Due to the No Free Lunch theorem [1], there is neither a gold standard that represents a perfect community partition nor a universal method that can detect satisfactory communities for all tasks on all types of graphs. To give a global view of this research topic, I summarize state-of-the-art community detection methods by categorizing them by graph type, research task and methodological framework. As academic work on community detection has grown rapidly in recent years, I focus in particular on state-of-the-art works published in the past decade, which may leave out some classic models published decades ago. In addition, three subtle community detection tasks are proposed and assessed in this dissertation. First, unlike general models that consider only graph structure, personalized community detection treats user need as auxiliary information to guide community detection. The result is fine-grained communities for the nodes that best match user needs and coarser-resolution communities for the remaining, less relevant nodes.
Second, graphs often suffer from sparse connectivity. Applying conventional models directly to such graphs may severely distort the quality of the generated communities. To tackle this problem, cross-graph techniques are used to propagate information from an external graph in support of community detection on the target graph. Third, graph community structure supports a natural language processing (NLP) task: depicting the intrinsic characteristics of nodes by generating node summaries with a text generative model. The contribution of this dissertation is threefold. First, a substantial body of research is reviewed and summarized under a well-defined taxonomy. Existing works on methods, evaluation and applications are all addressed in the literature review. Second, three novel community detection tasks are presented, and the associated models are proposed and evaluated against state-of-the-art baselines on various datasets. Third, the limitations of current works are pointed out and promising directions for future research are discussed.

    Temporally adaptive monitoring procedures with applications in enterprise cyber-security

    Due to the perpetual threat of cyber-attacks, enterprises must employ and develop new methods of detection as attack vectors evolve and advance. Enterprise computer networks produce a large volume and variety of data, including univariate data streams, time series and network graph streams. Motivated by cyber-security, this thesis develops adaptive monitoring tools for univariate and network graph data streams; however, they are not limited to this domain. In all domains, real data streams present several challenges for monitoring, including trend, periodicity and change points. Streams often also have high volume and frequency. To deal with the non-stationarity in the data, the methods applied must be adaptive. Adaptability is introduced in the proposed procedures throughout the thesis using forgetting factors, which weight the data according to recency. Secondly, the methods applied must be computationally fast, with a small or fixed computational burden and fixed storage requirements, for timely processing. Throughout this thesis, sequential or sliding window approaches are employed to achieve this. The first part of the thesis is centred around univariate monitoring procedures. A sequential adaptive parameter estimator is proposed using a Bayesian framework. This procedure is then extended to multiple change point detection, where, unlike existing change point procedures, the proposed method is capable of detecting abrupt changes in the presence of trend. We additionally present a time series model which combines the short-term and long-term behaviours of a series for improved anomaly detection. Unlike existing methods, which primarily focus on detecting point anomalies (extreme outliers), our method is also capable of detecting contextual anomalies, where the data deviates from persistent patterns of the series such as seasonality. Finally, a novel multi-type relational clustering methodology is proposed.
As multiple relations exist between the different entities within a network (computers, users and ports), multiple network graphs can be generated. We propose simultaneously clustering over all graphs to produce a single clustering for each entity using Non-Negative Matrix Tri-Factorisation. Through simplifications, the proposed procedure is fast and scalable for large network graphs. Additionally, this methodology is extended to graph streams. This thesis provides an assortment of tools for enterprise network monitoring with a focus on adaptability and scalability, making them suitable for intrusion detection and situational awareness.
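
The role of a forgetting factor in a sequential estimator can be shown with a small sketch. It is not the Bayesian estimator proposed in the thesis; it is the simplest recency-weighted running mean, where each past observation is down-weighted by a factor `lam` per time step. The function name `forgetting_factor_mean` and the example stream are assumptions for illustration.

```python
def forgetting_factor_mean(stream, lam=0.95):
    """Sequential mean in which an observation k steps old has weight lam**k.

    Only two numbers are stored (the estimate and the total weight), so the
    memory and per-observation cost stay fixed regardless of stream length.
    """
    if not 0.0 < lam <= 1.0:
        raise ValueError("lam must lie in (0, 1]")
    mean, weight = 0.0, 0.0
    estimates = []
    for x in stream:
        weight = lam * weight + 1.0      # discounted effective sample size
        mean += (x - mean) / weight      # recursive weighted-mean update
        estimates.append(mean)
    return estimates


# Hypothetical stream with an abrupt level shift from 0 to 10.
stream = [0.0] * 50 + [10.0] * 10
plain = forgetting_factor_mean(stream, lam=1.0)[-1]      # ordinary sample mean
adaptive = forgetting_factor_mean(stream, lam=0.8)[-1]   # tracks the new level
print(plain, adaptive)
```

With `lam=1.0` the update reduces to the ordinary running mean, which reacts slowly to the shift. With `lam < 1` older data fades out and the estimate moves towards the new level, at the price of higher variance.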

    Consistent community detection in uni-layer and multi-layer networks

    Over the last two decades, we have witnessed a massive explosion of our data collection abilities and the birth of a "big data" age. This has led to an enormous interest in statistical inference of a new type of complex data structure, a graph or network. The surge in interdisciplinary interest on statistical analysis of network data has been driven by applications in Neuroscience, Genetics, Social sciences, Computer science, Economics and Marketing. A network consists of a set of nodes or vertices, representing a set of entities, and a set of edges, representing the relations or interactions among the entities. Networks are flexible frameworks that can model many complex systems. In the majority of the network examples dealt with in the literature, the relations between nodes are assumed to be of the same type such as web page linkage, friendship, co-authorship or protein-protein interaction. However, the complex networks in many modern applications are often multi-layered in the sense that they consist of multiple types of edges/relations among a group of entities. Each of those different types of relations can be viewed as creating its own network, called a layer of the multi-layer network. Multi-layer networks are a more accurate representation of many complex systems since many entities in those systems are involved simultaneously in multiple interactions. In this dissertation we view multi-layer networks in the broad sense that includes multiple types of relations as well as multiple information sources on the same set of nodes (e.g., multiple trials or multiple subjects). The problem of detecting communities or clusters of nodes in a network has received considerable attention in literature. As with uni-layer networks, community detection is an important task in multi-layer networks. 
This dissertation aims to develop new methods and theory for community detection in both uni-layer and multi-layer networks that can be used to answer scientific questions from experimental data. For community detection in uni and multi-layer graphs, we take three approaches - (1) based on statistical random graph models, (2) based on maximizing quality functions, e.g., the modularity score and (3) based on spectral and matrix factorization methods. In Chapter 2 we consider two random graph models for community detection in multi-layer networks, the multi-layer stochastic block model (MLSBM) and a model with a restricted parameter space, the restricted multi-layer stochastic block model (RMLSBM). We derive consistency results for community assignments of the maximum likelihood estimators (MLEs) in both models where MLSBM is assumed to be the true model, and either the number of nodes or the number of types of edges or both grow. We compared MLEs in the two models among themselves and with other baseline approaches both theoretically and through simulations. We also derived minimax error rates and thresholds for achieving consistency of community detection in MLSBM, which were then used to show the advantage of the multi-layer model over a traditional alternative, the aggregate stochastic block model. In simulations RMLSBM is shown to have advantage over MLSBM when either the growth rate of the number of communities is high or the growth rate of the average degree of the component graphs in the multi-graph is low. A popular method of community detection in uni-layer networks is maximization of a partition quality function called modularity. In Chapter 3 we introduce several multi-layer network modularity measures based on different random graph null models, motivated by empirical observations from a diverse field of applications. 
In particular, we derive different modularities by defining the multi-layer configuration model, the multi-layer expected degree model and their various modifications as null models for multi-layer networks. These measures are then optimized to detect the optimal community assignment of nodes. We apply the methods to five real multi-layer networks: three social networks from the website Twitter, the complete neuronal network of the nematode C. elegans, and a classroom friendship network of 7th-grade students. In Chapter 4 we present a method based on the orthogonal symmetric non-negative matrix tri-factorization of the normalized Laplacian matrix for community detection in complex networks. While the exact factorization of a given order may not exist and is NP-hard to compute, we obtain an approximate factorization by solving an optimization problem. We establish the connection of the factors obtained through the factorization to a non-negative basis of an invariant subspace of the estimated matrix, drawing a parallel with spectral clustering. Using such a factorization for clustering in networks is motivated by analyzing a block-diagonal Laplacian matrix whose blocks represent the connected components of a graph. The method is shown to be consistent for community detection in graphs generated from the stochastic block model and the degree-corrected stochastic block model. Simulation results and real data analysis show the effectiveness of these methods in a wide variety of situations, including sparse and highly heterogeneous graphs where the usual spectral clustering is known to fail. Our method also performs better than the state of the art on popular benchmark network datasets, e.g., the political web blogs and the karate club data. In Chapter 5 we once again consider the problem of estimating a consensus community structure by combining information from multiple layers of a multi-layer network or multiple snapshots of a time-varying network.
Numerous methods have been proposed in the literature for the more general problem of multi-view clustering in the past decade based on the spectral clustering or a low-rank matrix factorization. As a general theme, these "intermediate fusion" methods involve obtaining a low column rank matrix by optimizing an objective function and then using the columns of the matrix for clustering. Such methods can be adapted for community detection in multi-layer networks with minimal modifications. However, the theoretical properties of these methods remain largely unexplored and most authors have relied on performance in synthetic and real data to assess the goodness of the procedures. In the absence of statistical guarantees on the objective functions, it is difficult to determine if the algorithms optimizing the objective will return a good community structure. We apply some of these methods for consensus community detection in multi-layer networks and investigate the consistency properties of the global optimizer of the objective functions under the multi-layer stochastic block model. We derive several new asymptotic results showing consistency of the intermediate fusion techniques along with the spectral clustering of mean adjacency matrix under a high dimensional setup where both the number of nodes and the number of layers of the multi-layer graph grow. We complement the asymptotic analysis with a thorough numerical study to compare the finite sample performance of the methods. Motivated by multi-subject and multi-trial experiments in neuroimaging studies, in Chapter 6 we develop a modeling framework for joint community detection in a group of related networks. The proposed model, which we call the random effects stochastic block model facilitates the study of group differences and subject specific variations in the community structure. 
In contrast to the previously proposed multi-layer stochastic block models, our model allows community memberships of nodes to vary in each component network or layer with a transition probability matrix, thus modeling the variation in community structure across a group of subjects or trials. We propose two methods to estimate the parameters of the model, a variational-EM algorithm and two non-parametric "two-step" methods based on spectral and matrix factorization respectively. We also develop several hypothesis tests with p-values obtained through resampling (permutation test) for differences in community structure in two groups of subjects both at the whole network level and node level. The methodology is applied to publicly available fMRI datasets from multi-subject experiments involving schizophrenia patients along with healthy controls. Our methods reveal an overall putative community structure representative of the groups as well as subject-specific variations within each group. Using our network level hypothesis tests we are able to ascertain statistically significant difference in community structure between the two groups, while our node level tests help determine the nodes that are driving the difference
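
One of the baselines mentioned above, spectral clustering of the mean adjacency matrix, is easy to sketch for the two-community case. This is a simplified illustration, not the dissertation's procedure: it averages the layers and splits nodes by the sign of the eigenvector belonging to the second-largest eigenvalue, where a general version would run k-means on the leading k eigenvectors. The function names and the toy layers are assumptions.

```python
import numpy as np


def two_community_spectral(layers):
    """Split nodes into two communities from the mean adjacency matrix.

    layers: list of symmetric (n x n) adjacency matrices on the same nodes.
    Returns an integer label (0 or 1) per node; which group gets which label
    is arbitrary because an eigenvector's sign is arbitrary.
    """
    mean_adj = np.mean(layers, axis=0)
    eigvals, eigvecs = np.linalg.eigh(mean_adj)  # eigenvalues in ascending order
    second = eigvecs[:, -2]                      # second-largest eigenvalue
    return (second > 0).astype(int)


def toy_layers():
    """Hypothetical data: two 4-cliques, one cross-community edge per layer."""
    base = np.kron(np.eye(2), np.ones((4, 4))) - np.eye(8)
    layer1, layer2 = base.copy(), base.copy()
    layer1[0, 4] = layer1[4, 0] = 1.0
    layer2[1, 5] = layer2[5, 1] = 1.0
    return [layer1, layer2]


labels = two_community_spectral(toy_layers())
print(labels)
```

Averaging halves the weight of an edge that appears in only one of the two layers, so layer-specific noise is damped while structure shared across layers is kept. That is the intuition behind using the mean adjacency matrix as a consensus estimate.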

    Apprentissage de représentation pour des données générées par des utilisateurs

    In this thesis, we study how representation learning methods can be applied to user-generated data. Our contributions cover three different applications but share a common denominator: the extraction of relevant user representations. Our first application is the item recommendation task, where recommender systems build user and item profiles out of past ratings reflecting user preferences and item characteristics. Nowadays, textual information is often available together with ratings, and we propose to use it to enrich the profiles extracted from the ratings. Our hope is to extract shared opinions and preferences from the textual content. The models we propose provide another opportunity: predicting the text a user would write on an item. Our second application is sentiment analysis and, in particular, polarity classification. Our idea is that recommender systems can be used for such a task. Recommender systems and traditional polarity classifiers operate on different time scales. We propose two hybridizations of these models: the former has better classification performance, the latter highlights a vocabulary of surprise in the texts of the reviews. The third and final application we consider is urban mobility. It takes place beyond the frontiers of the Internet, in the physical world. Using authentication logs of the subway users, logging the time and station at which users take the subway, we show that it is possible to extract robust temporal profiles.
In this thesis, we study how representation learning methods can be applied to user-generated data. Our contributions cover three different applications but share a common denominator: the extraction of relevant user representations. Our first application is the product recommendation task, where existing systems build user and item profiles that reflect the preferences of the former and the characteristics of the latter, using the history. Nowadays, a text often accompanies the rating, and we propose to use it to enrich the extracted profiles. Our hope is to extract from it a finer knowledge of user tastes. Using these models, we can predict the text a user will write about an item. Our second application is sentiment analysis and, in particular, polarity classification. Our idea is that recommender systems can be used for such a task. Recommender systems and traditional polarity classifiers operate on different time scales. We propose two hybridizations of these models: the first has better classification performance, the second exhibits a vocabulary of surprise. The third and final application we consider is urban mobility. It takes place beyond the frontiers of the Internet, in the physical world. We use the authentication logs of subway users, which record the time and origin station of trips, to characterize users by their usage and temporal habits.