10 research outputs found

    Multilabel Classification Based on Graph Neural Networks

    Get PDF
    Typical Laplacian embedding focuses on building Laplacian matrices prior to minimizing weights of connected graph components. However, for multilabel problems, it is difficult to determine such Laplacian graphs owing to multiple relations between vertices. Unlike typical approaches that require precomputed Laplacian matrices, this chapter presents a new method for automatically constructing Laplacian graphs during Laplacian embedding. By using trace minimization techniques, the topology of the Laplacian graph can be learned from input data, subsequently creating robust Laplacian embedding and influencing graph convolutional networks. Experiments on different open datasets with clean data and Gaussian noise were carried out. The noise level ranged from 6% to 12% of the maximum value of each dataset. Eleven different multilabel classification algorithms were used as the baselines for comparison. To verify the performance, three evaluation metrics specific to multilabel learning are proposed because multilabel learning is much more complicated than traditional single-label settings; each sample can be associated with multiple labels. The experimental results show that the proposed method performed better than the baselines, even when the data were contaminated by noise. The findings indicate that the proposed method is reliably robust against noise

    Multi-Label Quantification

    Get PDF
    The work of A. Moreo and F. Sebastiani has been supported by the SoBigData++ project, funded by the European Commission (Grant 871042) under the H2020 Programme INFRAIA-2019-1, by the AI4Media project, funded by the European Commission (Grant 951911) under the H2020 Programme ICT-48-2020, and by the SoBigData.it and FAIR projects funded by the Italian Ministry of University and Research under the NextGenerationEU program; the authors’ opinions do not necessarily reflect those of the funding agencies. The work of M. Francisco has been supported by the FPI 2017 predoctoral programme, from the Spanish Ministry of Economy and Competitiveness (MINECO), grant BES-2017-081202.Quantification, variously called supervised prevalence estimation or learning to quantify, is the supervised learning task of generating predictors of the relative frequencies (a.k.a. prevalence values) of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and, to a lesser extent, single-label multiclass problems, the multi-label setting (i.e., the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored. A straightforward solution to the multi-label quantification problem could simply consist of recasting the problem as a set of independent binary quantification problems. Such a solution is simple but naïve, since the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing the relative frequency of one class could be of help in determining the prevalence of other related classes. We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label solutions outperform the naïve approaches by a large margin. The code to reproduce all our experiments is available online.SoBigData++ project, funded by the European Commission (Grant 871042) under the H2020 Programme INFRAIA-2019-1AI4Media project, funded by the European Commission (Grant 951911) under the H2020 Programme ICT-48-2020SoBigData.it and FAIR projects funded by the Italian Ministry of University and Research under the NextGenerationEU programPI 2017 predoctoral programme, from the Spanish Ministry of Economy and Competitiveness (MINECO), grant BES-2017-08120

    A study on the statistical evaluation of classifiers

    Get PDF
    Statistical significance analysis, based on hypothesis tests, is a common approach for comparing classifiers. However, many studies oversimplify this analysis by simply checking the condition p-value < 0.05, ignoring important concepts such as the effect size and the statistical power of the test. This problem is so worrying that the American Statistical Association has taken a strong stand on the subject, noting that although the p-value is a useful statistical measure, it has been abusively used and misinterpreted. This work highlights problems caused by the misuse of hypothesis tests and shows how the effect size and the power of the test can provide important information for better decision-making. To investigate these issues, we perform empirical studies with different classifiers and 50 datasets, using the Student’s t-test and the Wilcoxon test to compare classifiers. The results show that an isolated p-value analysis can lead to wrong conclusions and that the evaluation of the effect size and the power of the test contributes to a more principled decision-making

    Cost-sensitive Label Embedding for Multi-label Classification

    No full text
    在解決多標籤分類(Multi-label Classification)的方法當中,標籤嵌入法(Label Embedding)是一系列常見且重要的演算法。標籤嵌入法同時考量全部標籤之間的關係,並得到更好的分類表現。不同的多標籤分類應用通常會使用不同的成本計算方式來衡量演算法的分類表現,但是目前的標籤嵌入演算法都只考慮少數幾種、甚至只有一種成本計算方式,因此這些演算法在未考慮到的成本計算方式上的表現就會不如預期。為了解決這個問題,本論文提出了成本導向標籤嵌入演算法(Cost-sensitive Label Embedding with Multidimensional Scaling)。此演算法利用多維標度法(Multidimensional Scaling)找出標籤的嵌入向量(Embedded Vectos),再將成本計算方式的資訊嵌入到嵌入向量彼此之間的距離,最終的預測結果會透過所預測的嵌入向量以及我們設計的最近距離解碼函數(Nearest-neighbor Decoding Function)得出。由於嵌入向量隱含了成本計算方式的資訊,所以預測結果是成本導向的,在不同的成本計算方式上都能夠得到好的分類表現。雖然嵌入向量彼此之間的距離是對稱的,不過所提出的演算法不但能夠處理對稱式成本計算方式,也能處理非對稱式成本計算方式。除此之外,我們為所提出的演算法的提供了理論保證,最後再用大量的實驗結果來驗證所提出的演算法的確能夠在不同的成本計算方式上表現得比現有的標籤嵌入法以及成本導向演算法還要好。Label embedding (LE) is an important family of multi-label classification algorithms that digest the label information jointly for better performance. Different real-world applications evaluate performance by different cost functions of interest. Current LE algorithms often aim to optimize one specific cost function, but they can suffer from bad performance with respect to other cost functions. In this paper, we resolve the performance issue by proposing a novel cost-sensitive LE algorithm that takes the cost function of interest into account. The proposed algorithm, cost-sensitive label embedding with multidimensional scaling (CLEMS), approximates the cost information with the distances of the embedded vectors using the classic multidimensional scaling approach for manifold learning. CLEMS is able to deal with both symmetric and asymmetric cost functions, and effectively makes cost-sensitive decisions by nearest-neighbor decoding within the embedded vectors. Theoretical results justify that CLEMS achieves the cost-sensitivity and extensive experimental results demonstrate that CLEMS is significantly better than a wide spectrum of existing LE algorithms and state-of-the-art cost-sensitive algorithms across different cost functions

    Data Mining

    Get PDF
    The availability of big data due to computerization and automation has generated an urgent need for new techniques to analyze and convert big data into useful information and knowledge. Data mining is a promising and leading-edge technology for mining large volumes of data, looking for hidden information, and aiding knowledge discovery. It can be used for characterization, classification, discrimination, anomaly detection, association, clustering, trend or evolution prediction, and much more in fields such as science, medicine, economics, engineering, computers, and even business analytics. This book presents basic concepts, ideas, and research in data mining

    Multi-label Rule Learning

    Get PDF
    Research on multi-label classification is concerned with developing and evaluating algorithms that learn a predictive model for the automatic assignment of data points to a subset of predefined class labels. This is in contrast to traditional classification settings, where individual data points cannot be assigned to more than a single class. As many practical use cases demand a flexible categorization of data, where classes must not necessarily be mutually exclusive, multi-label classification has become an established topic of machine learning research. Nowadays, it is used for the assignment of keywords to text documents, the annotation of multimedia files, such as images, videos, or audio recordings, as well as for diverse applications in biology, chemistry, social network analysis, or marketing. During the past decade, increasing interest in the topic has resulted in a wide variety of different multi-label classification methods. Following the principles of supervised learning, they derive a model from labeled training data, which can afterward be used to obtain predictions for yet unseen data. Besides complex statistical methods, such as artificial neural networks, symbolic learning approaches have not only been shown to provide state-of-the-art performance in many applications but are also a common choice in safety-critical domains that demand human-interpretable and verifiable machine learning models. In particular, rule learning algorithms have a long history of active research in the scientific community. They are often argued to meet the requirements of interpretable machine learning due to the human-legible representation of learned knowledge in terms of logical statements. This work presents a modular framework for implementing multi-label rule learning methods. It does not only provide a unified view of existing rule-based approaches to multi-label classification, but also facilitates the development of new learning algorithms. Two novel instantiations of the framework are investigated to demonstrate its flexibility. Whereas the first one relies on traditional rule learning techniques and focuses on interpretability, the second one is based on a generalization of the gradient boosting framework and focuses on predictive performance rather than the simplicity of models. Motivated by the increasing demand for highly scalable learning algorithms that are capable of processing large amounts of training data, this work also includes an extensive discussion of algorithmic optimizations and approximation techniques for the efficient induction of rules. As the novel multi-label classification methods that are presented in this work can be viewed as instantiations of the same framework, they can both benefit from most of these principles. Their effectiveness and efficiency are compared to existing baselines experimentally
    corecore