8 research outputs found

    Music classification by transductive learning using bipartite heterogeneous networks

    The popularization of music distribution in electronic format has increased the amount of music with incomplete metadata. Incomplete metadata can hamper important tasks such as music and artist recommendation. In this scenario, transductive classification can be used to classify the whole dataset from just a few labeled instances. Transductive classification is usually performed through label propagation, in which data are represented as networks and the examples propagate their labels through their connections. Similarity-based networks are usually applied to model data as networks. However, this kind of representation requires the definition of parameters, which significantly affect the classification accuracy, and presents a high cost due to the computation of similarities among all dataset instances. In contrast, bipartite heterogeneous networks have appeared as an alternative to similarity-based networks in text mining applications. In these networks, the words are connected to the documents in which they occur. Thus, no parameters or additional costs are needed to generate such networks. In this paper, we propose the use of the bipartite network representation to perform transductive classification of music, using a bag-of-frames approach to describe music signals. We demonstrate that the proposed approach outperforms other music classification approaches when few labeled instances are available. Funding: Sao Paulo Research Foundation (FAPESP) (grants 2011/12823-6, 2012/50714-7, 2013/26151-5, and 2014/08996-0).
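
The parameter-free network construction described in this abstract can be illustrated with a minimal Python sketch. It assumes each track has already been reduced to a list of discrete codeword ids (for instance by quantizing frame-level features), a step the abstract does not specify. The function name `build_bipartite_network`, the ids, and the use of occurrence counts as edge weights are hypothetical and not taken from the paper.

```python
from collections import defaultdict

def build_bipartite_network(collection):
    """Link every item to the features it contains, and every feature to its items.

    `collection` maps an item id (a track or a document) to a list of discrete
    features (codewords or words). Edge weights are occurrence counts (assumption).
    """
    item_to_feature = defaultdict(dict)
    feature_to_item = defaultdict(dict)
    for item, features in collection.items():
        for feature in features:
            weight = item_to_feature[item].get(feature, 0) + 1
            item_to_feature[item][feature] = weight
            feature_to_item[feature][item] = weight
    return dict(item_to_feature), dict(feature_to_item)

# Hypothetical toy data: two tracks described by codeword ids.
tracks = {
    "track_a": ["cw3", "cw7", "cw3"],
    "track_b": ["cw7", "cw9"],
}
item_edges, feature_edges = build_bipartite_network(tracks)
```

No pairwise similarity between tracks is computed at any point, which is the property the abstract contrasts with similarity-based networks.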

    Representation of textual document collections through association rules

    The number of textual documents available in digital format is very large and keeps growing. Text Mining techniques are becoming essential to manage and extract knowledge from large textual document collections. To use these techniques, the textual documents need to be represented in an appropriate format that allows the construction of a model representing the knowledge embedded in them. Most Text Mining research uses the bag-of-words approach to represent textual document collections. This representation uses each word in a collection as a feature, ignoring word order, punctuation, and structural information, and it is characterized by high dimensionality and data sparsity. On the other hand, most concepts are composed of more than one word, such as Artificial Intelligence, Neural Network, and Text Mining. The approaches that generate features composed of more than one word suffer from other problems besides those of the bag-of-words, such as the generation of features with little meaning and a much higher dimensionality. An approach to represent textual documents, named bag-of-related-words, was proposed in this master's thesis. The proposed approach generates features composed of related words using association rules. The association rules are expected to identify relationships among the words of a document and to reduce the dimensionality, since only the words that occur or co-occur above a frequency threshold are used to generate rules. Different ways to map a document into transactions to allow the extraction of association rules are analyzed. Different objective interest measures applied to the association rules to generate more meaningful features and to reduce the number of features are also analyzed.
    To evaluate how much the bag-of-related-words representation can aid the management of and knowledge extraction from textual document collections, as well as the interpretability of the results, three groups of experiments were carried out: 1) textual document classification, to analyze the predictive power of the bag-of-related-words features; 2) textual document clustering, to analyze the quality of the clusters obtained with the bag-of-related-words representation and consequently to help obtain the structure of a topic hierarchy; and 3) topic hierarchy building and evaluation by domain experts. All results and dimensionalities were compared to the bag-of-words representation. The results showed that the features of the bag-of-related-words representation have a predictive power as good as that of the bag-of-words features. The quality of the textual document clustering was also as good as with the bag-of-words. The evaluation of the topic hierarchies by domain experts presented better results with the bag-of-related-words representation in all the questions analyzed.

    Text automatic classification through machine learning based on networks

    A massive amount of textual data, such as e-mails, reports, articles, and posts on social networks or blogs, is generated and stored every day. Processing, organizing, and managing this amount of text manually requires considerable human effort and is often impossible in practice. Moreover, the manual extraction of the knowledge embedded in textual data is also infeasible due to the large amount of text. Thus, computational techniques that require little human intervention and allow the organization, management, and knowledge extraction from large amounts of text have gained attention in recent years and have been applied in academia, companies, and organizations. Prominent among these techniques is automatic text classification, in which labels (identifiers of predefined categories) are assigned to texts or portions of texts. A viable way to perform automatic text classification is through machine learning algorithms, which are able to learn, generalize, or extract patterns from the classes of text collections based on the content and labels of the texts. There are three types of machine learning for automatic classification: (i) supervised inductive, in which only labeled documents are considered to induce a classification model, and this model is used to classify new documents; (ii) semi-supervised transductive, in which all known unlabeled documents are classified based on some labeled documents; and (iii) semi-supervised inductive, in which labeled and unlabeled documents are considered to induce a classification model that is then used to classify new documents.
    Regardless of the type of learning, the texts of a collection must be represented in a structured format to be interpreted by the algorithms. Usually, the texts are represented in a vector space model, in which each text is represented by a vector and each dimension of the vector corresponds to a term or feature of the text collection. Algorithms based on the vector space model assume that texts, terms, or features are independent, and this assumption can degrade the classification performance. Networks can be used as an alternative to vector space model representations. Networks allow the representation of relations among the entities of a text collection, such as documents and terms. This type of representation allows the extraction of patterns that are hardly extracted by algorithms based on the vector space model, and can thus increase the classification performance. Moreover, text collections can be represented by networks composed of different types of entities and relations, which allows different characteristics of the collections to be captured. However, some challenges must be solved before machine learning algorithms and network-based representations can be combined to perform automatic text classification effectively. The main challenges addressed in this doctoral project are (i) the development of network-based representations that can be generated efficiently and that also allow efficient learning; (ii) the development of networks that represent different types of entities and relations; (iii) the development of networks that can represent texts written in different languages and about different domains; and (iv) the development of efficient learning algorithms that make better use of the network-based representations and increase the classification performance.
    In this doctoral project we proposed and developed methods to represent text collections as networks, considering different types of entities and relations and allowing the representation of texts written in any language or from any domain. We also proposed and developed supervised inductive, semi-supervised transductive, and semi-supervised inductive learning algorithms to interpret and learn from the proposed network-based representations, since no algorithms were found in the literature to handle certain types of relations considered in this thesis. In addition, the proposed algorithms attempt to obtain a higher classification performance and a faster classification than the existing network-based algorithms. In this doctoral thesis we present (i) an extensive empirical evaluation demonstrating the benefits of network-based representations for text classification in comparison with the vector space model; (ii) the impact of combining different types of relations in a single network; and (iii) that the proposed network-based algorithms are able to surpass the classification performance of traditional and state-of-the-art algorithms, considering both supervised and semi-supervised learning. The solutions proposed in this doctoral project proved to be useful and advisable for many applications involving the classification of texts from different domains, with different characteristics, or with different numbers of labeled documents.

    A parameter-free label propagation algorithm using bipartite heterogeneous networks for text classification

    A bipartite heterogeneous network is one of the simplest ways to represent a textual document collection. In this case, the network consists of two types of vertices, representing documents and terms, and links connecting terms to the documents. Transductive algorithms are usually applied to perform classification of networked objects. This type of classification is usually applied when few labeled examples are available, which may be worthwhile in practical situations. Nevertheless, existing transductive algorithms require users to set several parameters that significantly affect the classification accuracy. In this paper, we propose a parameter-free algorithm for transductive classification of textual data, referred to as LPBHN (Label Propagation using Bipartite Heterogeneous Networks). LPBHN uses a bipartite heterogeneous network to perform the classification task. The proposed algorithm presents accuracy equivalent to or higher than that of state-of-the-art algorithms for transductive classification in heterogeneous or homogeneous networks.
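
Label propagation over a document-term bipartite network can be sketched in a few lines of Python. This is a generic averaging scheme written for illustration, not the published LPBHN algorithm, whose update rules the abstract does not give. The function `propagate_labels`, the fixed iteration count, and the toy collection are assumptions.

```python
def propagate_labels(doc_terms, seed_labels, classes, iterations=50):
    """doc_terms: doc -> {term: positive weight}; seed_labels: doc -> class."""
    def one_hot(label):
        return {c: 1.0 if c == label else 0.0 for c in classes}

    def weighted_mean(neighbors, scores):
        total = sum(neighbors.values())
        return {c: sum(w * scores[n][c] for n, w in neighbors.items()) / total
                for c in classes}

    # Invert the edges so each term can look up the documents it occurs in.
    term_docs = {}
    for doc, terms in doc_terms.items():
        for term, weight in terms.items():
            term_docs.setdefault(term, {})[doc] = weight

    uniform = {c: 1.0 / len(classes) for c in classes}
    doc_scores = {d: one_hot(seed_labels[d]) if d in seed_labels else dict(uniform)
                  for d in doc_terms}
    for _ in range(iterations):
        # Terms take the weighted average of their documents' class scores ...
        term_scores = {t: weighted_mean(docs, doc_scores)
                       for t, docs in term_docs.items()}
        # ... and unlabeled documents take the weighted average of their terms.
        for doc, terms in doc_terms.items():
            if doc not in seed_labels and terms:   # labeled documents stay clamped
                doc_scores[doc] = weighted_mean(terms, term_scores)
    return {d: max(classes, key=lambda c: s[c]) for d, s in doc_scores.items()}

# Hypothetical collection: two labeled and two unlabeled documents.
docs = {
    "d1": {"goal": 2, "match": 1},
    "d2": {"stock": 2, "bond": 1},
    "d3": {"goal": 1, "match": 1},
    "d4": {"stock": 1, "bond": 2},
}
predicted = propagate_labels(docs, {"d1": "sports", "d2": "finance"},
                             ["sports", "finance"])
```

The only inputs are the network and the seed labels; the iteration count here stands in for a convergence check and is not a tuning parameter of the kind the abstract refers to.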

    Inductive model generation for text classification using a bipartite heterogeneous network

    Algorithms for numeric data classification have been applied to text classification. Usually the vector space model is used to represent text collections. The characteristics of this representation, such as sparsity and high dimensionality, sometimes impair the quality of general-purpose classifiers. Networks can be used to represent text collections, avoiding the high sparsity and making it possible to model relationships among the different objects that compose a text collection. Such network-based representations can improve the quality of the classification results. One of the simplest ways to represent textual collections by a network is through a bipartite heterogeneous network, which is composed of objects that represent the documents connected to objects that represent the terms. Bipartite heterogeneous networks do not require the computation of similarities or relations among the objects and can be used to model any type of text collection. Given the advantages of representing text collections through bipartite heterogeneous networks, in this article we present a text classifier that builds a classification model using the structure of a bipartite heterogeneous network. This algorithm, referred to as IMBHN (Inductive Model Based on Bipartite Heterogeneous Network), induces a classification model by assigning weights to the objects that represent the terms for each class of the text collection. An empirical evaluation using a large number of text collections from different domains shows that the proposed IMBHN algorithm produces significantly better results than the k-NN, C4.5, SVM, and Naive Bayes algorithms. Funding: São Paulo Research Foundation (FAPESP) of Brazil (Grant Nos. 2011/12823-6, 2011/23689-9, and 2011/19850-9). A preliminary version of the paper was published in the Proceedings of ICDM 201
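
Assigning a weight to each term for each class can be illustrated with a simple error-correction sketch. The least-mean-squares style update, the learning rate, the stopping threshold, and the names `train_term_weights` and `classify` are all assumptions made for illustration; the abstract does not describe how IMBHN actually computes its weights.

```python
from collections import defaultdict

def train_term_weights(docs, labels, classes, learning_rate=0.1, epochs=100):
    """docs: doc -> {term: frequency}; labels: doc -> class."""
    weights = defaultdict(lambda: {c: 0.0 for c in classes})
    for _ in range(epochs):
        total_error = 0.0
        for doc, terms in docs.items():
            for c in classes:
                # Class score of a document: frequency-weighted sum of term weights.
                predicted = sum(f * weights[t][c] for t, f in terms.items())
                target = 1.0 if labels[doc] == c else 0.0
                error = target - predicted
                total_error += error * error
                # Move each connected term's weight in the direction that reduces the error.
                for t, f in terms.items():
                    weights[t][c] += learning_rate * f * error
        if total_error < 1e-6:      # stop once the squared error is negligible
            break
    return weights

def classify(terms, weights, classes):
    # Terms never seen during training contribute nothing to the score.
    return max(classes, key=lambda c: sum(f * weights[t][c]
                                          for t, f in terms.items() if t in weights))

# Hypothetical labeled collection.
train_docs = {"d1": {"goal": 2, "match": 1}, "d2": {"stock": 2, "bond": 1}}
train_labels = {"d1": "sports", "d2": "finance"}
topic_classes = ["sports", "finance"]
model = train_term_weights(train_docs, train_labels, topic_classes)
```

Unlike the transductive setting, the learned term weights form a model that can score documents that were not in the network during training, for example `classify({"goal": 1}, model, topic_classes)`.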