7 research outputs found

    Bipartite graph for topic extraction

    This article presents a bipartite graph propagation method applicable to unsupervised machine-learning tasks such as topic extraction and clustering. We introduce the objectives and hypotheses that motivate the use of graph-based methods and give the intuition behind the proposed Bipartite Graph Propagation Algorithm. The contribution of this study is a new method that allows heuristic knowledge to be used to discover topics in textual data more easily than is possible in the traditional mathematical formalism based on Latent Dirichlet Allocation (LDA). Initial experiments demonstrate that the Bipartite Graph Propagation Algorithm returns good results in a static context (offline algorithm). Our current research focuses on large volumes of data and dynamic contexts (online algorithm).
    Funding: São Paulo Research Foundation (FAPESP), project number 2011/23689-9.
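The abstract gives only the intuition of the propagation scheme. As a hedged illustration of how label propagation over a document-term bipartite graph can surface topics, the sketch below seeds two terms with topic labels and alternately pushes topic mass between the two node sets; the toy corpus, the seed terms, and the seed-clamping rule are all assumptions for illustration, not the authors' exact algorithm.

```python
import numpy as np

# Toy document-term matrix (rows: documents, cols: terms); weights are term counts.
terms = ["goal", "match", "team", "stock", "market", "price"]
A = np.array([
    [2, 1, 1, 0, 0, 0],   # doc 0: sports-like
    [1, 2, 0, 0, 0, 0],   # doc 1: sports-like
    [0, 0, 0, 2, 1, 1],   # doc 2: finance-like
    [0, 0, 0, 1, 2, 1],   # doc 3: finance-like
], dtype=float)

n_topics = 2
# Heuristic seed knowledge: term "goal" -> topic 0, term "stock" -> topic 1.
seed = np.zeros((len(terms), n_topics))
seed[terms.index("goal"), 0] = 1.0
seed[terms.index("stock"), 1] = 1.0

# Row-normalized propagation operators (doc -> term side and term -> doc side).
D = A / A.sum(axis=1, keepdims=True)
T = A.T / A.T.sum(axis=1, keepdims=True)

f_term = seed.copy()
for _ in range(20):
    f_doc = D @ f_term                   # push topic mass to documents
    f_term = T @ f_doc                   # push it back to terms
    is_seed = seed.sum(axis=1) > 0
    f_term[is_seed] = seed[is_seed]      # clamp seed terms to their labels

doc_topics = f_doc.argmax(axis=1)
print(doc_topics)   # docs 0,1 land on topic 0; docs 2,3 on topic 1
```

Because the two term blocks are disjoint in this toy matrix, the propagation separates the documents cleanly; real corpora would need the heuristic seeds the abstract alludes to.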

    Network-based data classification: combining k-associated optimal graphs and high-level prediction

    Background: Traditional data classification techniques usually divide the data space into sub-spaces, each representing a class. Such a division is carried out considering only physical attributes of the training data (e.g., distance, similarity, or distribution). This approach is called low-level classification. A network- or graph-based approach, on the other hand, can capture spatial, functional, and topological relations among data, providing so-called high-level classification. Network-based algorithms usually consist of two steps: network construction and classification. Although complex-network measures are employed in the classification step to capture patterns in the input data, the network-formation step is critical and not well explored. Some construction techniques, such as the k-nearest-neighbors (KNN) and ε-radius rules, consider only strictly local information of the data and, moreover, depend on parameters that are not easy to set.
    Methods: We propose a network-based classification technique, named high-level classification on K-associated optimal graph (HL-KAOG), combining the K-associated optimal graph and high-level prediction. The network-construction algorithm is therefore non-parametric and considers both local and global information of the training data. In addition, since the proposed technique combines low-level and high-level terms, it classifies data not only by physical features but also by checking the conformity of the test instance to the formation pattern of each class component. Computer simulations are conducted to assess the effectiveness of the proposed technique.
    Results: The results show that a larger weight on the high-level term is required to obtain correct classification when the data set contains a complex, well-defined pattern. In this case, we also show that traditional classification algorithms are unable to identify those patterns. Moreover, computer simulations on real-world data sets show that HL-KAOG and support vector machines provide similar results, and both outperform well-known techniques such as decision trees and k-nearest neighbors.
    Conclusions: The proposed technique works with a very reduced number of parameters and obtains good predictive performance in comparison with traditional techniques. In addition, combining high-level and low-level terms over network components allows greater exploration of patterns in data sets.
    Funding: São Paulo State Research Foundation (FAPESP); Brazilian National Council for Scientific and Technological Development (CNPq).
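To make the low-level/high-level combination concrete, here is a hedged sketch of the general idea: a distance-based (low-level) score is blended with a network-conformity (high-level) score measured as how little a class's k-NN graph structure changes when the test instance is inserted. The blend weight `lam`, the use of the average clustering coefficient as the conformity measure, and the synthetic data are all assumptions for illustration; the paper's K-associated optimal graph and its specific high-level measures are not reproduced here.

```python
import numpy as np
import networkx as nx

def knn_graph(X, k=2):
    """Build an undirected k-nearest-neighbor graph over the rows of X."""
    G = nx.Graph()
    G.add_nodes_from(range(len(X)))
    for i in range(len(X)):
        d = np.linalg.norm(X - X[i], axis=1)
        for j in np.argsort(d)[1:k + 1]:   # skip self at distance 0
            G.add_edge(i, int(j))
    return G

def hybrid_score(X_class, x, lam=0.5, k=2):
    """Blend a low-level (distance) term with a high-level (conformity) term."""
    d = np.linalg.norm(X_class - x, axis=1)
    low = 1.0 / (1.0 + d.min())                  # closer -> higher score
    c0 = nx.average_clustering(knn_graph(X_class, k))
    c1 = nx.average_clustering(knn_graph(np.vstack([X_class, x]), k))
    high = 1.0 - abs(c1 - c0)                    # small structural change -> conforms
    return (1 - lam) * low + lam * high

rng = np.random.default_rng(0)
class_a = rng.normal(0.0, 0.3, size=(10, 2))     # blob around the origin
class_b = rng.normal(3.0, 0.3, size=(10, 2))     # blob around (3, 3)
x = np.array([0.1, 0.0])                         # clearly inside class A
scores = {"A": hybrid_score(class_a, x), "B": hybrid_score(class_b, x)}
print(max(scores, key=scores.get))               # class A wins
```

Raising `lam` shifts weight toward the structural-conformity term, mirroring the paper's finding that complex, well-defined class patterns require a larger high-level contribution.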

    Inductive Model Generation for Text Categorization Using a Bipartite Heterogeneous Network


    From the Occam's Razor to a simple, efficient and robust text categorization approach

    Advisors: Akebo Yamakami, Tiago Agostinho de Almeida. Doctoral thesis, Universidade Estadual de Campinas, Faculdade de Engenharia Elétrica e de Computação.
    Abstract: Text categorization has received much attention in recent years because of the ever-increasing volume of textual information. For large numbers of documents, manual classification is tiresome, tedious, time-consuming, and often impractical, making computational methods attractive for this task. Many of the available methods suffer from a high computational burden or from the curse of dimensionality, undermining their applicability in real scenarios. To overcome these limitations, we propose a simple, fast, scalable, and efficient multiclass classification method based on the minimum description length principle, named MDLText. Its learning is fast and incremental, and the method is robust enough to avoid overfitting, which is highly desirable for real, dynamic, online, and large-scale problems. Experiments performed on real, public, large-scale datasets, followed by statistical analyses, indicate that MDLText provides an excellent trade-off between predictive capability and computational cost. Motivated by these results, we propose a generalized method, named MDLClass, that also encompasses non-textual problems. Like MDLText, this extension is simple and fast, and can be applied to binary and multiclass classification problems. Statistical analyses show that MDLClass is equivalent to most state-of-the-art classification methods.
    Doctorate in Electrical Engineering, Automation. Funding: CNPq, grant 141089/2013-0.
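The core idea of an MDL-based classifier, choosing the class whose model compresses the document into the fewest bits, can be sketched as below. This is a simplified toy with Laplace-smoothed per-class code lengths, not the exact MDLText formulation; the two-class spam/ham corpus and the class names are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

class MDLSketch:
    """Toy MDL-style text classifier: pick the class whose term model
    gives the document the shortest code length in bits."""

    def __init__(self):
        self.term_counts = defaultdict(Counter)  # class -> term frequencies
        self.totals = Counter()                  # class -> total term count
        self.vocab = set()

    def fit(self, doc, label):
        # Incremental learning: a single counting pass per document.
        for t in doc.lower().split():
            self.term_counts[label][t] += 1
            self.totals[label] += 1
            self.vocab.add(t)

    def code_length(self, doc, label):
        V = len(self.vocab)
        bits = 0.0
        for t in doc.lower().split():
            # Laplace-smoothed code length: -log2 P(term | class).
            p = (self.term_counts[label][t] + 1) / (self.totals[label] + V)
            bits += -math.log2(p)
        return bits

    def predict(self, doc):
        return min(self.term_counts, key=lambda c: self.code_length(doc, c))

clf = MDLSketch()
clf.fit("free offer win prize now", "spam")
clf.fit("meeting agenda project report", "ham")
print(clf.predict("win a free prize"))   # the spam model compresses it best
```

Because training is pure counting, the model updates one document at a time, which is the incremental property the abstract highlights for online, large-scale settings.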

    A framework for dynamic heterogeneous information networks change discovery based on knowledge engineering and data mining methods

    Information networks are data structures used to model interactions in social and living phenomena. They can be homogeneous or heterogeneous, and static or dynamic, depending on the type and nature of the relations between the network entities. Static, homogeneous, and heterogeneous networks have been widely studied in data mining, but recently there has been renewed interest in the analysis of dynamic heterogeneous information networks (DHINs) because of the rich temporal, structural, and semantic information hidden in this kind of network. The heterogeneity and dynamicity of real-time networks offer many prospects, as well as many challenges, for data mining. Substantial research has been undertaken on the exploration of entities and the identification of their links in heterogeneous networks. However, work on the formal construction and change mining of heterogeneous information networks is still in its infancy because of their complex structure and rich semantics. Past work on change discovery in dynamic heterogeneous networks has used cluster-based methods and frequent-pattern-mining techniques. These methods work only on small datasets and fail to support fast, parallel processing of big data; moreover, cluster-based approaches capture only structural changes, while pattern mining captures only the semantic characteristics of changes in a dynamic network.
    Another interesting but challenging problem that past studies have not considered is extracting knowledge from these semantically richer networks subject to user-specific constraints. This study develops a new change-mining system, ChaMining, to investigate dynamic heterogeneous network data, using knowledge engineering with semantic-web technologies and data mining to overcome the problems of previous techniques. Such a system is important in academia as well as in real-life applications that support decision-making based on temporal network-data patterns. This research designs a novel framework, ChaMining, to (i) find relational patterns in dynamic networks, locally and globally, by employing domain ontologies; (ii) extract knowledge from these semantically richer networks based on user-specific (meta-path) constraints; (iii) cluster the relational data patterns based on the structural properties of nodes in the dynamic network; and (iv) detect changes in dynamic heterogeneous networks through a hybrid approach combining knowledge engineering, temporal rule mining, and clustering. The evidence presented in this research shows that the proposed framework and methods work efficiently on benchmark big dynamic heterogeneous datasets. The empirical results contribute to a better understanding of the rich semantics of DHINs and of how to mine them using the proposed hybrid approach. The framework has been evaluated against six previous dynamic change-detection algorithms and frameworks, and it performs well in detecting both microscopic and macroscopic human-understandable changes. The number of change patterns extracted by this approach was higher than in previous approaches, which helps reduce information loss.
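One building block the abstract relies on, user-specified meta-path constraints over a dynamic heterogeneous network, can be sketched by enumerating meta-path instances in two temporal snapshots and diffing them to surface changes. The Author/Paper/Venue schema, the node names, and the set-difference notion of "change" below are illustrative assumptions, not the ChaMining framework itself.

```python
import networkx as nx

def meta_path_instances(G, meta_path):
    """Enumerate node sequences whose types match a meta-path,
    e.g. ["Author", "Paper", "Venue"] (hypothetical schema)."""
    def extend(path):
        if len(path) == len(meta_path):
            yield tuple(path)
            return
        for nxt in G.neighbors(path[-1]):
            if G.nodes[nxt]["type"] == meta_path[len(path)] and nxt not in path:
                yield from extend(path + [nxt])
    for n, data in G.nodes(data=True):
        if data["type"] == meta_path[0]:
            yield from extend([n])

def snapshot():
    """Base heterogeneous network shared by both time snapshots."""
    G = nx.Graph()
    G.add_node("alice", type="Author")
    G.add_node("p1", type="Paper")
    G.add_node("kdd", type="Venue")
    G.add_edges_from([("alice", "p1"), ("p1", "kdd")])
    return G

G_t0 = snapshot()
G_t1 = snapshot()
G_t1.add_node("p2", type="Paper")                 # a new paper appears at t1
G_t1.add_edges_from([("alice", "p2"), ("p2", "kdd")])

mp = ["Author", "Paper", "Venue"]                 # user-specific constraint
before = set(meta_path_instances(G_t0, mp))
after = set(meta_path_instances(G_t1, mp))
print(after - before)                             # the newly emerged path instance
```

Diffing instance sets per meta-path yields semantically labeled changes (here, "alice published a new KDD paper"), which is the kind of human-understandable change pattern the framework aims to report.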