    Algorithms for partitioning problem

    Advisor: Eduardo Candido Xavier. Dissertation (master's), Universidade Estadual de Campinas, Instituto de Computação.
    Abstract: In this work we investigate partitioning problems over objects for which a similarity relation is defined. Instances of these problems can be represented by graphs in which vertices are objects and the similarity between two objects is a value associated with the edge connecting them. The objective is to partition the objects so that similar objects belong to the same subset. We study clustering algorithms for graphs, where clusters must be determined such that edges connecting vertices of different clusters have low weight while edges between vertices of the same cluster have high weight. Partitioning and clustering problems have applications in many areas, such as data mining, information retrieval, and computational biology. In the general case these problems are NP-hard. Our interest is in efficient algorithms (with polynomial time complexity) that generate good solutions, such as heuristics, metaheuristics, and approximation algorithms. We implemented the most promising of the algorithms studied and compared their results on computationally generated instances. Finally, we propose an algorithm based on the GRASP metaheuristic for the problem considered and show that, for the generated test instances, our algorithm achieves better results.
    Degree: Master's (Mestre em Ciência da Computação).
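
The clustering objective described in this abstract (high weight inside clusters, low weight between them) can be illustrated with a minimal sketch. The function name `partition_weights`, the edge representation, and the toy weights are assumptions made for illustration; they do not come from the dissertation, and the code only evaluates a given partition rather than searching for one.

```python
def partition_weights(weights, clusters):
    """Return (intra, inter): total edge weight inside and across clusters.

    weights:  dict mapping frozenset({u, v}) -> similarity weight (assumed format)
    clusters: list of disjoint vertex sets covering every vertex in `weights`
    """
    # Map each vertex to the index of the cluster that contains it.
    label = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    intra = inter = 0.0
    for edge, w in weights.items():
        u, v = tuple(edge)
        if label[u] == label[v]:
            intra += w  # edge inside a cluster: should be heavy
        else:
            inter += w  # edge crossing clusters: should be light
    return intra, inter


# Toy instance (hypothetical data): a path a-b-c-d-e with one weak edge c-d.
weights = {
    frozenset({"a", "b"}): 0.9,
    frozenset({"b", "c"}): 0.8,
    frozenset({"c", "d"}): 0.1,
    frozenset({"d", "e"}): 0.7,
}
good = partition_weights(weights, [{"a", "b", "c"}, {"d", "e"}])
bad = partition_weights(weights, [{"a", "b"}, {"c", "d", "e"}])
print(good, bad)
```

Cutting at the weak edge c-d leaves more weight inside clusters and less across them than cutting at b-c, which is the preference the abstract states.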

    Bipartite graph for topic extraction

    This article presents a bipartite graph propagation method that can be applied to different unsupervised machine learning tasks, such as topic extraction and clustering. We introduce the objectives and hypotheses that motivate the use of a graph-based method, and we give the intuition behind the proposed Bipartite Graph Propagation algorithm. The contribution of this study is a new method that makes it easier to use heuristic knowledge to discover topics in textual data than is possible in the traditional mathematical formalism based on Latent Dirichlet Allocation (LDA). Initial experiments demonstrate that our Bipartite Graph Propagation algorithm returns good results in a static context (offline algorithm). Our research now focuses on large amounts of data and on the dynamic context (online algorithm).
    Funding: São Paulo Research Foundation (FAPESP), project number 2011/23689-9.

    A Naïve Bayes model based on overlapping groups for link prediction in online social networks

    Link prediction in online social networks is useful in numerous applications, mainly for recommendation. Recently, different approaches have used information about friendship groups to increase link prediction accuracy. Nevertheless, these approaches do not consider the different roles that common neighbors may play in the different overlapping groups to which they belong. In this paper, we propose a new approach that uses structural information from overlapping groups to build a naïve Bayes model. From this proposal, we derive three different measures based on common neighbors. We perform experiments with both unsupervised and supervised link prediction strategies, taking the link imbalance problem into account. We compare sixteen measures on four well-known online social networks: Flickr, LiveJournal, Orkut and YouTube. Results show that our proposals help to improve link prediction accuracy.
    Funding: São Paulo Research Foundation (FAPESP), grants 2013/12191-5, 2011/21880-3, 2011/23689-9 and 2011/22749-8.
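
Since the proposed measures are derived from common neighbors, a minimal sketch of the plain common-neighbors baseline may help fix ideas. This is not the naïve Bayes model of the paper and it ignores overlapping groups entirely; the function names and the toy network are assumptions for illustration.

```python
def common_neighbors(adj, u, v):
    """Number of neighbors shared by u and v in an undirected graph."""
    return len(adj[u] & adj[v])


def rank_candidate_links(adj):
    """Score every unlinked pair and sort by descending score, then by name."""
    nodes = sorted(adj)
    scores = {}
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if v not in adj[u]:  # only pairs that are not already linked
                scores[(u, v)] = common_neighbors(adj, u, v)
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy undirected network (hypothetical data), stored as neighbor sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
ranking = rank_candidate_links(adj)
print(ranking)
```

In an unsupervised strategy the top-ranked pairs are taken as predicted links; in a supervised one, scores like this would typically serve as features for a classifier.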

    Propagation in bipartite graphs for topic extraction in stream of textual data

    Handling large amounts of data is a requirement for modern text mining algorithms. In some applications documents are published constantly, which entails a high long-term storage cost. It is therefore necessary to create methods that adapt easily to a streaming setting and that analyze the data in a single pass without requiring costly storage. Another requirement is that the approach be able to exploit heuristics in order to improve the quality of the results. Several models for the automatic extraction of latent information from a collection of documents have been proposed in the literature, among which probabilistic topic models are prominent. Probabilistic topic models achieve good practical results and have been extended into several models that include different types of information. However, correctly describing these models, deriving them, and then obtaining the appropriate inference algorithm are difficult tasks that require a rigorous mathematical treatment of the operations performed in the discovery of latent dimensions. Thus, developing a simple and efficient method for latent dimension discovery requires a suitable representation of the data. The hypothesis of this thesis is that, by representing documents as bipartite graphs, one can address machine learning problems of discovering latent patterns in relations between objects, for example between documents and words, in a simple and intuitive way.
    To validate this hypothesis, we developed a framework based on the label propagation algorithm using the bipartite graph representation. The framework, called PBG (Propagation in Bipartite Graph), was initially applied in the unsupervised context to a static collection of documents. Then a semi-supervised version was proposed, which needs only a small number of labeled documents for the transductive classification task. Finally, it was applied in the dynamic context, in which a stream of textual documents was considered. Comparative analyses were performed, and the results indicated that PBG is a viable and competitive alternative for tasks in the unsupervised and semi-supervised contexts.
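
To give a concrete feel for label propagation over a document-term bipartite graph in the transductive setting, here is a minimal generic sketch. It is not the PBG framework: the update rule (weighted averaging in both directions with labeled documents clamped), the function name `propagate_labels`, and the toy data are all assumptions chosen for illustration.

```python
def propagate_labels(doc_terms, seed_labels, classes, iterations=10):
    """Generic label propagation on a bipartite document-term graph.

    doc_terms:   dict doc -> dict term -> edge weight (every doc has >= 1 term)
    seed_labels: dict doc -> class, for the few labeled documents
    Returns a dict doc -> predicted class.
    """
    # Labeled documents start with a one-hot score; the rest start at zero.
    doc_scores = {
        d: {c: 1.0 if seed_labels.get(d) == c else 0.0 for c in classes}
        for d in doc_terms
    }
    for _ in range(iterations):
        # Step 1: documents -> terms (weighted average of document scores).
        term_scores, term_totals = {}, {}
        for d, terms in doc_terms.items():
            for t, w in terms.items():
                scores = term_scores.setdefault(t, {c: 0.0 for c in classes})
                for c in classes:
                    scores[c] += w * doc_scores[d][c]
                term_totals[t] = term_totals.get(t, 0.0) + w
        for t, scores in term_scores.items():
            for c in classes:
                scores[c] /= term_totals[t]
        # Step 2: terms -> unlabeled documents; labeled ones stay clamped.
        for d, terms in doc_terms.items():
            if d in seed_labels:
                continue
            total = sum(terms.values())
            doc_scores[d] = {
                c: sum(w * term_scores[t][c] for t, w in terms.items()) / total
                for c in classes
            }
    return {d: max(classes, key=lambda c: doc_scores[d][c]) for d in doc_terms}


# Toy collection (hypothetical data): two labeled and two unlabeled documents.
doc_terms = {
    "d1": {"graph": 2, "vertex": 1},
    "d2": {"topic": 2, "word": 1},
    "d3": {"graph": 1, "edge": 1, "vertex": 1},
    "d4": {"topic": 1, "word": 1},
}
predicted = propagate_labels(doc_terms, {"d1": "graphs", "d2": "text"},
                             ["graphs", "text"])
print(predicted)
```

The unlabeled documents receive their class through shared terms only; no document-document similarity is ever computed, which is the appeal of the bipartite representation described above.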

    Coarsening effects on k-partite network classification

    Growing data sizes pose challenges for storage and computational processing time in semi-supervised models, making their practical application difficult; researchers have explored reduced versions of networks as a potential solution. Real-world networks contain diverse types of vertices and edges, which leads to the use of k-partite network representations. However, existing methods primarily reduce unipartite networks with a single type of vertex and edge. We develop a new coarsening method that is applicable to k-partite networks and maintains classification performance. An empirical analysis of hundreds of thousands of synthetically generated networks demonstrates the promise of coarsening techniques for solving the storage and processing problems of large networks. The findings indicate that the proposed coarsening algorithm achieved significant improvements in storage efficiency and classification runtime, even with modest reductions in the number of vertices, yielding storage savings of over one-third and classification twice as fast; furthermore, the classification performance metrics exhibited low variation on average.
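
As a rough illustration of what coarsening one layer of a multi-layer network can look like, the sketch below merges vertices of a single layer that have exactly the same neighbor set. This matching rule is a deliberately simple assumption and is not the coarsening algorithm proposed in the paper; the function name and data are hypothetical.

```python
from collections import defaultdict


def coarsen_by_neighborhood(adj, layer):
    """Merge vertices of `layer` whose neighbor sets are identical.

    adj:   dict vertex -> set of neighbors in the other layer(s)
    layer: iterable of vertices from one partition
    Returns a dict super-vertex (tuple of members) -> shared neighbor set.
    """
    groups = defaultdict(list)
    for v in sorted(layer):
        groups[frozenset(adj[v])].append(v)
    return {tuple(members): set(neigh) for neigh, members in groups.items()}


# Toy bipartite network (hypothetical data): documents linked to terms.
adj = {
    "d1": {"t1", "t2"},
    "d2": {"t1", "t2"},
    "d3": {"t3"},
}
coarse = coarsen_by_neighborhood(adj, ["d1", "d2", "d3"])
print(coarse)
```

Here three document vertices shrink to two super-vertices while every edge endpoint in the other layer is preserved, which hints at why storage can drop even with modest vertex reductions.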

    Inductive model generation for text classification using a bipartite heterogeneous network

    Algorithms for numeric data classification have been applied to text classification. Usually the vector space model is used to represent text collections. Characteristics of this representation, such as sparsity and high dimensionality, sometimes impair the quality of general-purpose classifiers. Networks can be used to represent text collections instead, avoiding the high sparsity and making it possible to model relationships among the different objects that compose a text collection. Such network-based representations can improve the quality of classification results. One of the simplest ways to represent a text collection as a network is a bipartite heterogeneous network, composed of objects that represent documents connected to objects that represent terms. Bipartite heterogeneous networks do not require the computation of similarities or relations among objects and can be used to model any type of text collection. Given these advantages, in this article we present a text classifier that builds a classification model from the structure of a bipartite heterogeneous network. The algorithm, referred to as IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), induces a classification model by assigning weights to the objects that represent terms, for each class of the text collection. An empirical evaluation using a large number of text collections from different domains shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM, and Naive Bayes algorithms.
    Funding: São Paulo Research Foundation (FAPESP) of Brazil (Grant Nos. 2011/12823-6, 2011/23689-9, and 2011/19850-9).
    A preliminary version of the paper was published in the Proceedings of ICDM 201
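
The claim that a bipartite document-term network needs no similarity computation can be made concrete with a short sketch that builds one directly from raw text. Whitespace tokenization, term-frequency edge weights, and the name `build_bipartite_network` are assumptions for illustration; the sketch covers only the representation, not the IMBHN weight induction.

```python
from collections import Counter, defaultdict


def build_bipartite_network(documents):
    """Build a document-term bipartite network from raw text.

    documents: dict doc_id -> text
    Returns (doc_to_terms, term_to_docs); each edge weight is a term frequency.
    """
    doc_to_terms = {}
    term_to_docs = defaultdict(dict)
    for doc_id, text in documents.items():
        counts = Counter(text.lower().split())  # naive whitespace tokenizer
        doc_to_terms[doc_id] = dict(counts)
        for term, freq in counts.items():
            term_to_docs[term][doc_id] = freq
    return doc_to_terms, dict(term_to_docs)


# Toy collection (hypothetical data).
documents = {
    "d1": "graph clustering of graph data",
    "d2": "text data mining",
}
doc_to_terms, term_to_docs = build_bipartite_network(documents)
print(term_to_docs["data"])
```

Every edge comes from a single pass over each document, so the cost is linear in the collection size, with no pairwise document comparison.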