258 research outputs found

    Dynamic Community Discovery Method Based on Phylogenetic Planted Partition in Temporal Networks

    Because most community discovery methods are designed from a static perspective, some community discovery algorithms cannot efficiently represent the full process of change in a dynamic network. This paper proposes a novel dynamic community discovery method (Phylogenetic Planted Partition Model, PPPM) for phylogenetic evolution. Firstly, the time dimension is introduced into the typical migration partition model, all states are treated as variables, and the observation equation is constructed. Secondly, the observation equation of the whole dynamic social network is taken as the constraint between the variables and the error function. Then, the quadratic form of the error function is minimized. Thirdly, the Levenberg–Marquardt (L–M) method is used to calculate the gradient of the error function, and the iteration is carried out. Finally, simulation experiments are carried out on artificial networks and real networks. The experimental results show that, compared with FaceNet, SBM + MLE, CLBM, and Pi-sCES, the proposed PPPM model improves accuracy by 5% and 3%, respectively. The experiments show that the proposed PPPM method is robust, reasonable, and effective. This method can also be applied to the general social networking community discovery field.

    Four algorithms to solve symmetric multi-type non-negative matrix tri-factorization problem

    In this paper, we consider the symmetric multi-type non-negative matrix tri-factorization problem (SNMTF), which attempts to factorize several symmetric non-negative matrices simultaneously. This can be considered a generalization of the classical non-negative matrix tri-factorization problem; it has a non-convex objective function, which is a multivariate sixth-degree polynomial, and a convex feasibility set. It is of special importance in data science, since it serves as a mathematical model for the fusion of different data sources in data clustering. We develop four methods to solve the SNMTF. They are based on four theoretical approaches known from the literature: the fixed point method (FPM), the block-coordinate descent with projected gradient (BCD), the gradient method with exact line search (GM-ELS) and the adaptive moment estimation method (ADAM). For each of these methods we offer a software implementation: for the former two methods we use Matlab and for the latter two Python with the TensorFlow library. We test these methods on three data-sets: one is a synthetic data-set we generated, while the other two represent real-life similarities between different objects. Extensive numerical results show that with sufficient computing time all four methods perform satisfactorily and ADAM most often yields the best mean square error (MSE). However, if the computation time is limited, FPM gives the best MSE because it shows the fastest convergence at the beginning. All data-sets and codes are publicly available on our GitLab profile.
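
As a rough illustration of the kind of problem this abstract describes, the sketch below minimizes ||R - G S G^T||_F^2 for a single symmetric non-negative matrix R, keeping G and S non-negative by projecting onto the non-negative orthant after each gradient step. It is a simplified stand-in, not the authors' implementation: the single-matrix objective, the joint (rather than block-coordinate) update, the step-halving rule, and all function names (`factorize`, `gradients`, `step`) are assumptions made for this example, and it uses only the Python standard library rather than Matlab or TensorFlow.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def loss(r, g, s):
    """Squared Frobenius norm of R - G S G^T."""
    approx = matmul(matmul(g, s), transpose(g))
    return sum((x - y) ** 2 for ra, rr in zip(approx, r) for x, y in zip(ra, rr))


def gradients(r, g, s):
    """Gradients of the loss with respect to G and S."""
    gt = transpose(g)
    approx = matmul(matmul(g, s), gt)
    e = [[x - y for x, y in zip(ra, rr)] for ra, rr in zip(approx, r)]
    a = matmul(matmul(e, g), transpose(s))   # E G S^T
    b = matmul(matmul(transpose(e), g), s)   # E^T G S
    grad_g = [[2 * (x + y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    grad_s = [[2 * x for x in row] for row in matmul(matmul(gt, e), g)]  # 2 G^T E G
    return grad_g, grad_s


def step(m, grad, lr):
    """Gradient step followed by projection onto the non-negative orthant."""
    return [[max(0.0, x - lr * d) for x, d in zip(rm, rg)] for rm, rg in zip(m, grad)]


def factorize(r, k, iters=200, lr=0.1, seed=0):
    """Projected gradient descent; the step size is halved until the loss does not increase."""
    rng = random.Random(seed)
    n = len(r)
    g = [[rng.random() for _ in range(k)] for _ in range(n)]
    s = [[rng.random() for _ in range(k)] for _ in range(k)]
    history = [loss(r, g, s)]
    for _ in range(iters):
        grad_g, grad_s = gradients(r, g, s)
        while lr > 1e-12:
            new_g, new_s = step(g, grad_g, lr), step(s, grad_s, lr)
            new_loss = loss(r, new_g, new_s)
            if new_loss <= history[-1]:
                g, s = new_g, new_s
                history.append(new_loss)
                break
            lr *= 0.5  # step too large: shrink and retry
    return g, s, history


# Toy symmetric similarity matrix with two obvious blocks (hypothetical data).
R = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
G, S, history = factorize(R, k=2)
print(history[0], history[-1])
```

Because a step is accepted only when it does not increase the loss, the recorded loss history is non-increasing by construction; how close the final value gets to zero depends on the initialization and iteration budget.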

    COMMUNITY DETECTION IN GRAPHS

    Thesis (Ph.D.) - Indiana University, Luddy School of Informatics, Computing, and Engineering/University Graduate School, 2020. Community detection has always been one of the fundamental research topics in graph mining. As a type of unsupervised or semi-supervised approach, community detection aims to explore high-order closeness between nodes by leveraging graph topological structure. By grouping similar nodes or edges into the same community while separating dissimilar ones into different communities, graph structure can be revealed at a coarser resolution. This can benefit numerous applications, such as user shopping recommendation and advertisement in e-commerce, protein-protein interaction prediction in bioinformatics, and literature recommendation or scholar collaboration in citation analysis. However, identifying communities is an ill-defined problem. Due to the No Free Lunch theorem [1], there is neither a gold standard that represents a perfect community partition nor a universal method that can detect satisfactory communities for all tasks under various types of graphs. To give a global view of this research topic, I summarize state-of-the-art community detection methods by categorizing them based on graph types, research tasks and methodology frameworks. As academic exploration of community detection has grown rapidly in recent years, I focus particularly on state-of-the-art works published in the latest decade, which may leave out some classic models published decades ago. Meanwhile, three subtle community detection tasks are proposed and assessed in this dissertation as well. First, apart from general models which consider only graph structures, personalized community detection considers user need as auxiliary information to guide community detection. As a result, nodes that better match user needs receive fine-grained communities, while the remaining, less relevant nodes receive coarser-resolution communities.
Second, graphs often suffer from sparse connectivity. Applying conventional models directly to such graphs may greatly distort the quality of the generated communities. To tackle this problem, cross-graph techniques are used to propagate external graph information in support of community detection on the target graph. Third, graph community structure supports a natural language processing (NLP) task that depicts intrinsic node characteristics by generating node summarizations via a text generative model. The contribution of this dissertation is threefold. First, a substantial body of research is reviewed and summarized under a well-defined taxonomy. Existing works on methods, evaluation and applications are all addressed in the literature review. Second, three novel community detection tasks are demonstrated, and associated models are proposed and evaluated by comparison with state-of-the-art baselines on various datasets. Third, the limitations of current works are pointed out and promising future research tracks are discussed as well.

    A Collaborative Filtering Probabilistic Approach for Recommendation to Large Homogeneous and Automatically Detected Groups

    In the collaborative filtering recommender systems (CFRS) field, recommendation to groups of users is mainly focused on established, occasional or random groups. These groups have a small number of users: relatives, friends, colleagues, etc. Our proposal deals with large numbers of automatically detected groups. Marketing and electronic commerce are typical targets of large homogeneous groups. Large groups present a major difficulty in terms of automatically achieving homogeneity, balanced size and accurate recommendations. We provide a method that combines diverse machine learning algorithms in an original way: homogeneous groups are detected by means of a clustering based on hidden factors instead of ratings. Predictions are made using a virtual user model, and virtual users are obtained by performing a hidden factors aggregation. Additionally, this paper selects the most appropriate dimensionality reduction for the stated RS aim. We conduct a set of experiments to capture the maximum cumulative deviation of the ratings information. Results show an improvement in recommendations made to large homogeneous groups. They also show the desirability of designing specific methods and algorithms to deal with automatically detected groups.

    TLGP: a flexible transfer learning algorithm for gene prioritization based on heterogeneous source domain

    Background: Gene prioritization (gene ranking) aims to obtain the centrality of genes, which is critical for cancer diagnosis and therapy since key genes correspond to biomarkers or drug targets. Great efforts have been devoted to the gene ranking problem by exploring the similarity between candidate and known disease-causing genes. However, when the number of disease-causing genes is limited, these approaches are largely inapplicable because of their low accuracy. In practice, the number of known disease-causing genes for cancers, particularly for rare cancers, is very limited. Therefore, there is a critical need to design effective and efficient algorithms for gene ranking with limited prior disease-causing genes. Results: In this study, we propose a transfer learning based algorithm for gene prioritization (called TLGP) in a cancer (target domain) without disease-causing genes by transferring knowledge from other cancers (source domain). The underlying assumption is that knowledge shared by similar cancers improves the accuracy of gene prioritization. Specifically, TLGP first quantifies the similarity between the target and source domain by calculating the affinity matrix for genes. Then, TLGP automatically learns a fusion network for the target cancer by fusing the affinity matrix, pathogenic genes and genomic data of source cancers. Finally, genes in the target cancer are prioritized. The experimental results indicate that the learnt fusion network is more reliable than the gene co-expression network, implying that transferring knowledge from other cancers improves the accuracy of network construction. Moreover, TLGP outperforms state-of-the-art approaches in terms of accuracy, improving it by at least 5%. Conclusion: The proposed model and method provide an effective and efficient strategy for gene ranking by integrating genomic data from various cancers.

    A Computational Framework for Learning from Complex Data: Formulations, Algorithms, and Applications

    Many real-world processes change dynamically over time. As a consequence, the observed complex data generated by these processes also evolve smoothly. For example, in computational biology, the expression data matrices are evolving, since gene expression controls are deployed sequentially during development in many biological processes. Investigations into the spatial and temporal gene expression dynamics are essential for understanding the regulatory biology governing development. In this dissertation, I mainly focus on two types of complex data: genome-wide spatial gene expression patterns in the model organism fruit fly and Allen Brain Atlas mouse brain data. I provide a framework to explore spatiotemporal regulation of gene expression during development. I develop an evolutionary co-clustering formulation to identify co-expressed domains and the associated genes simultaneously over different temporal stages using a mesh-generation pipeline. I also propose to employ deep convolutional neural networks as a multi-layer feature extractor to generate generic representations for gene expression pattern in situ hybridization (ISH) images. Furthermore, I employ a multi-task learning method to fine-tune the pre-trained models with labeled ISH images. My proposed computational methods are evaluated using synthetic data sets and real biological data sets, including the gene expression data from the fruit fly BDGP data sets and the Allen Developing Mouse Brain Atlas, in comparison with existing baseline methods. Experimental results indicate that the proposed representations, formulations, and methods are efficient and effective in annotating and analyzing large-scale biological data sets.

    Statistical Techniques for Exploratory Analysis of Structured Three-Way and Dynamic Network Data.

    In this thesis, I develop different techniques for the pattern extraction and visual exploration of a collection of data matrices. Specifically, I present methods to help home in on and visualize an underlying structure and its evolution over ordered (e.g., time) or unordered (e.g., experimental conditions) index sets. The first part of the thesis introduces a biclustering technique for such three-dimensional data arrays. This technique is capable of discovering potentially overlapping groups of samples and variables that evolve similarly with respect to a subset of conditions. To facilitate and enhance visual exploration, I introduce a framework that utilizes kernel smoothing to guide the estimation of bicluster responses over the array. In the second part of the thesis, I introduce two matrix factorization models. The first is a data integration model that decomposes the data into two factors: a basis common to all data matrices, and a coefficient matrix that varies for each data matrix. The second model is meant for visual clustering of nodes in dynamic network data, which often contains complex evolving structure. Hence, this approach is more flexible and additionally lets the basis evolve for each matrix in the array. Both models utilize a regularization within the framework of non-negative matrix factorization to encourage local smoothness of the basis and coefficient matrices, which improves interpretability and highlights the structural patterns underlying the data, while mitigating noise effects. I also address computational aspects of applying regularized non-negative matrix factorization models to large data arrays by presenting multiple algorithms, including an approximation algorithm based on alternating least squares. PhD, Statistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/99838/1/smankad_1.pd

    Learning with Attributed Networks: Algorithms and Applications

    Attributes, which delineate the properties of data, and connections, which describe the dependencies of data, are two essential components for characterizing most real-world phenomena. The synergy between these two principal elements yields a unique data representation: the attributed network. In many cases, people are inundated with vast amounts of data that can be structured into attributed networks, and their use has been attractive to researchers and practitioners in different disciplines. For example, in social media, users interact with each other and also post personalized content; in scientific collaboration, researchers cooperate and are distinguished from their peers by their unique research interests; in studies of complex diseases, rich gene expression data complement gene-regulatory networks. Clearly, attributed networks are ubiquitous and form a critical component of modern information infrastructure. Gaining deep insights from such networks requires a fundamental understanding of their unique characteristics and an awareness of the related computational challenges. My dissertation research aims to develop a suite of novel learning algorithms to understand, characterize, and gain actionable insights from attributed networks, to benefit high-impact real-world applications. In the first part of this dissertation, I mainly focus on developing learning algorithms for attributed networks in a static environment at two different levels: (i) attribute level - by designing feature selection algorithms to find high-quality features that are tightly correlated with the network topology; and (ii) node level - by presenting network embedding algorithms to learn discriminative node embeddings by preserving node proximity w.r.t. network topology structure and node attribute similarity.
As changes are essential components of attributed networks and the results of learning algorithms become stale over time, in the second part of this dissertation, I propose a family of online algorithms for attributed networks in a dynamic environment to continuously update the learning results on the fly. Moreover, application-aware learning algorithms, developed with a clear understanding of the application domains and their unique intents, are more desirable. As such, in the third part of this dissertation, I am also committed to advancing real-world applications on attributed networks by incorporating the objectives of external tasks into the learning process. Dissertation/Thesis. Doctoral Dissertation, Computer Science, 201

    TrustDL: Use of trust-based dictionary learning to facilitate recommendation in social networks

    Peer reviewed. Collaborative filtering (CF) is a widely applied method to perform recommendation tasks in a wide range of domains and applications. Dictionary learning (DL) models, which are highly important in CF-based recommender systems (RSs), are well represented by rating matrices. However, these methods alone do not resolve the cold start and data sparsity issues in RSs. We observed a significant improvement in rating results by adding trust information from the social network. For that purpose, we proposed a new dictionary learning technique based on trust information, called TrustDL, in which social network data were employed in the recommendation process based on structural details of the trust network. TrustDL sought to integrate the sources of information, including trust statements and ratings, into the recommendation model to mitigate both the cold start and data sparsity problems. It conducted dictionary learning and trust embedding simultaneously to predict unknown rating values. In this paper, the dictionary learning technique was integrated into rating learning, along with a trust consistency regularization term designed to offer a more accurate understanding of the feature representation. Moreover, partially identical trust embedding was developed, where users with similar rating sets could cluster together, and those with similar rating sets could be represented collaboratively. The proposed strategy appears significantly beneficial based on experiments conducted on four frequently used datasets: Epinions, Ciao, FilmTrust, and Flixster.