
    A study of hierarchical and flat classification of proteins

    Automatic classification of proteins using machine learning is an important problem that has received significant attention in the literature. One feature of this problem is that expert-defined hierarchies of protein classes exist and can potentially be exploited to improve classification performance. In this article we investigate empirically whether this is the case for two such hierarchies. We compare multi-class classification techniques that exploit the information in those class hierarchies and those that do not, using logistic regression, decision trees, bagged decision trees, and support vector machines as the underlying base learners. In particular, we compare hierarchical and flat variants of ensembles of nested dichotomies. The latter have been shown to deliver strong classification performance in multi-class settings. We present experimental results for synthetic, fold recognition, enzyme classification, and remote homology detection data. Our results show that exploiting the class hierarchy improves performance on the synthetic data, but not in the case of the protein classification problems. Based on this, we recommend that strong flat multi-class methods be used as a baseline to establish the benefit of exploiting class hierarchies in this area.
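
    The abstract centers on ensembles of nested dichotomies. As a purely illustrative sketch (not the authors' code), one member of such an ensemble can be built by recursively splitting the class set at random and training a binary base learner at each internal node; here scikit-learn's LogisticRegression stands in for the base learners mentioned above, and all class and method names are hypothetical.

    import numpy as np
    from sklearn.base import clone
    from sklearn.linear_model import LogisticRegression

    class NestedDichotomy:
        """One random nested dichotomy: recursively split the class set in two
        and train a binary classifier at each internal node."""
        def __init__(self, base=None, rng=None):
            self.base = base if base is not None else LogisticRegression(max_iter=1000)
            self.rng = rng if rng is not None else np.random.default_rng()

        def fit(self, X, y, classes=None):
            self.classes_ = np.array(sorted(set(y))) if classes is None else classes
            if len(self.classes_) == 1:
                return self                                  # leaf: a single class remains
            perm = self.rng.permutation(self.classes_)
            split = self.rng.integers(1, len(perm))
            self.left_, self.right_ = perm[:split], perm[split:]
            mask = np.isin(y, self.left_)
            self.clf_ = clone(self.base).fit(X, mask.astype(int))   # label 1 = left group
            self.left_child_ = NestedDichotomy(self.base, self.rng).fit(X[mask], y[mask], self.left_)
            self.right_child_ = NestedDichotomy(self.base, self.rng).fit(X[~mask], y[~mask], self.right_)
            return self

        def predict_proba_dict(self, x):
            """Return {class: probability} for a single sample x (1-D array)."""
            if len(self.classes_) == 1:
                return {self.classes_[0]: 1.0}
            p_left = self.clf_.predict_proba(x.reshape(1, -1))[0, 1]
            probs = {c: p_left * p for c, p in self.left_child_.predict_proba_dict(x).items()}
            probs.update({c: (1 - p_left) * p for c, p in self.right_child_.predict_proba_dict(x).items()})
            return probs

    An ensemble of nested dichotomies would train several such trees with different random class splits and average their class probability estimates; the hierarchical variants studied in the paper would presumably constrain the splits to follow the expert-defined class hierarchy instead of choosing them at random.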

    Coupling different methods for overcoming the class imbalance problem

    Many classification problems must deal with imbalanced datasets where one class – the majority class – outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including one-class and dimension-reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature; as a consequence, a fair comparison with previously published approaches can be reported. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357.
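
    As one concrete illustration of the kind of strategy evaluated here (not the proposed ensemble of ensembles, whose code is available at the link above), a simple under-sampling bagging ensemble trains each member on all minority-class samples plus a random subset of the remaining samples. The sketch below assumes integer-encoded class labels, and all names and parameters are illustrative.

    import numpy as np
    from sklearn.base import clone
    from sklearn.tree import DecisionTreeClassifier

    def balanced_bagging(X, y, n_members=10, base=None, seed=0):
        """Train an ensemble in which every member sees a class-balanced subsample."""
        rng = np.random.default_rng(seed)
        base = base if base is not None else DecisionTreeClassifier()
        classes, counts = np.unique(y, return_counts=True)
        minority = classes[np.argmin(counts)]
        min_idx = np.flatnonzero(y == minority)
        maj_idx = np.flatnonzero(y != minority)
        members = []
        for _ in range(n_members):
            sub = rng.choice(maj_idx, size=len(min_idx), replace=False)   # under-sample the rest
            idx = np.concatenate([min_idx, sub])
            members.append(clone(base).fit(X[idx], y[idx]))
        return members

    def predict_majority_vote(members, X):
        """Combine the members by a per-sample majority vote (integer labels assumed)."""
        votes = np.stack([m.predict(X) for m in members]).astype(int)
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)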

    Transductive hyperspectral image classification: toward integrating spectral and relational features via an iterative ensemble system

    Remotely sensed hyperspectral image classification is a very challenging task due to the spatial correlation of the spectral signature and the high cost of true sample labeling. In light of this, the collective inference paradigm allows us to manage the spatial correlation between spectral responses of neighboring pixels, as interacting pixels are labeled simultaneously. The transductive inference paradigm allows us to reduce the inference error for the given set of unlabeled data, as sparsely labeled pixels are learned by accounting for both labeled and unlabeled information. In this paper, both these paradigms contribute to the definition of a spectral-relational classification methodology for imagery data. We propose a novel algorithm to assign a class to each pixel of a sparsely labeled hyperspectral image. It integrates the spectral information and the spatial correlation through an ensemble system. For every pixel of a hyperspectral image, spatial neighborhoods are constructed and used to build application-specific relational features. Classification is performed with an ensemble comprising a classifier learned by considering the available spectral information (associated with the pixel) and the classifiers learned by considering the extracted spatio-relational information (associated with the spatial neighborhoods). The more reliable labels predicted by the ensemble are fed back to the labeled part of the image. Experimental results highlight the importance of the spectral-relational strategy for the accurate transductive classification of hyperspectral images and validate the proposed algorithm.
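
    A minimal sketch of the iterative spectral-relational idea described above: one classifier is learned on per-pixel spectra, another on neighbourhood-averaged (relational) features, and the most confident ensemble predictions are fed back into the labelled set. The window size, confidence threshold, and choice of LogisticRegression are illustrative assumptions, not the paper's settings.

    import numpy as np
    from scipy.ndimage import uniform_filter
    from sklearn.linear_model import LogisticRegression

    def relational_features(cube, size=3):
        """Average every spectral band over a size x size spatial window."""
        return np.stack([uniform_filter(cube[..., b], size=size)
                         for b in range(cube.shape[-1])], axis=-1)

    def transductive_classify(cube, labels, n_iter=5, conf=0.9):
        """cube: (H, W, bands) image; labels: (H, W) array with -1 for unlabelled pixels."""
        h, w, bands = cube.shape
        spec = cube.reshape(-1, bands)
        rel = relational_features(cube).reshape(-1, bands)
        y = labels.ravel().copy()
        for _ in range(n_iter):
            known = y >= 0
            clf_s = LogisticRegression(max_iter=1000).fit(spec[known], y[known])
            clf_r = LogisticRegression(max_iter=1000).fit(rel[known], y[known])
            proba = (clf_s.predict_proba(spec) + clf_r.predict_proba(rel)) / 2
            pred, p_max = proba.argmax(axis=1), proba.max(axis=1)
            newly = (~known) & (p_max >= conf)          # confident unlabelled pixels
            if not newly.any():
                break
            y[newly] = clf_s.classes_[pred[newly]]      # feed reliable labels back
        y[y < 0] = clf_s.classes_[pred[y < 0]]          # label any remaining pixels
        return y.reshape(h, w)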

    An academic review: applications of data mining techniques in finance industry

    With the development of Internet techniques, data volumes are doubling every two years, faster than predicted by Moore's Law. Big Data analytics has therefore become particularly important for enterprise business. Modern computational technologies provide effective tools to help understand this hugely accumulated data and leverage it to gain insights into the finance industry. Because there are no physical products to manufacture in the finance industry, data has become the most valuable asset of financial organisations, and actionable business insights depend on it. This is where data mining techniques come to the rescue, providing access to the right information at the right time. These techniques are used by the finance industry in areas such as fraud detection, intelligent forecasting, credit rating, loan management, customer profiling, money laundering detection, marketing, and prediction of price movements, to name a few. This work surveys the research on data mining techniques applied to the finance industry from 2010 to 2015. The review finds that stock prediction and credit rating have received the most attention from researchers, compared to loan prediction, money laundering, and time series prediction. Due to the dynamics, uncertainty, and variety of the data, nonlinear mapping techniques have been studied more deeply than linear techniques. It has also been shown that hybrid methods are more accurate in prediction, closely followed by neural network techniques. This survey provides an overview of applications of data mining techniques in the finance industry and a summary of methodologies for researchers in this area. In particular, it offers a useful starting point for beginners who want to work in the field of computational finance.

    Using ensembles for accurate modelling of manufacturing processes in an IoT data-acquisition solution

    The development of complex real-time platforms for the Internet of Things (IoT) opens up a promising future for the diagnosis and optimization of machining processes. Many issues still have to be solved before IoT platforms can be profitable for small workshops with very flexible workloads and workflows. The main obstacles relate to sensor implementation, IoT architecture, and data processing and analysis. In this research, the use of different machine-learning techniques is proposed for the extraction of different kinds of information from an IoT platform connected to a machining center working under real industrial conditions in a workshop. The aim is to evaluate which algorithmic technique might be the best for building accurate prediction models for one of the main demands of workshops: the optimization of machining processes. This evaluation, completed under real industrial conditions, includes very limited information on the machining workload of the machining center and unbalanced datasets. The strategy is validated for the classification of the state of the machining center and of its working mode, and for the prediction of the thermal evolution of the main machine-tool motors: the axis motors and the milling head motor. The results show the superiority of ensembles for both classification problems under analysis and all four regression problems. In particular, Rotation Forest-based ensembles turned out to have the best performance in the experiments for all the metrics under study. The models are accurate enough to provide useful conclusions applicable to current industrial practice, such as improvements in machine programming to avoid cutting conditions that might greatly reduce tool lifetime and damage machine components. This research was supported by projects TIN2015-67534-P (MINECO/FEDER, UE) of the Ministerio de Economía y Competitividad of the Spanish Government and projects CCTT1/17/BU/0003 and BU085P17 (JCyL/FEDER, UE) of the Junta de Castilla y León, all of them co-financed through European Union FEDER funds.
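
    Since Rotation Forest performed best here, the hedged sketch below illustrates its core idea: each tree is trained on features rotated by PCA fitted on disjoint random feature subsets. It omits the method's class and instance subsampling, assumes more training samples than features per subset, and uses illustrative names; it is not the implementation used in the paper.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.tree import DecisionTreeClassifier

    class SimpleRotationForest:
        def __init__(self, n_trees=10, n_subsets=3, seed=0):
            self.n_trees, self.n_subsets = n_trees, n_subsets
            self.rng = np.random.default_rng(seed)

        def _rotation(self, X):
            """Block-diagonal rotation built from PCA on disjoint feature subsets."""
            n_features = X.shape[1]
            subsets = np.array_split(self.rng.permutation(n_features), self.n_subsets)
            R = np.zeros((n_features, n_features))
            for sub in subsets:
                pca = PCA().fit(X[:, sub])
                R[np.ix_(sub, sub)] = pca.components_.T      # rotate within the subset
            return R

        def fit(self, X, y):
            self.members_ = []
            for _ in range(self.n_trees):
                R = self._rotation(X)
                self.members_.append((R, DecisionTreeClassifier().fit(X @ R, y)))
            self.classes_ = self.members_[0][1].classes_
            return self

        def predict(self, X):
            proba = sum(tree.predict_proba(X @ R) for R, tree in self.members_)
            return self.classes_[proba.argmax(axis=1)]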

    Stochastic functional descent for learning Support Vector Machines

    We present a novel method for learning Support Vector Machines (SVMs) in the online setting. Our method is generally applicable in that it handles online learning of binary, multiclass, and structural SVMs in a unified view. The SVM learning problem consists of optimizing a convex objective function that is composed of two parts: the hinge loss and quadratic regularization. To date, the predominant family of approaches for online SVM learning has been gradient-based methods, such as Stochastic Gradient Descent (SGD). Unfortunately, we note that there are two drawbacks in such approaches: first, gradient-based methods are based on a local linear approximation to the function being optimized, but since the hinge loss is piecewise-linear and nonsmooth, this approximation can be ill-behaved. Second, existing online SVM learning approaches share the same problem formulation with batch SVM learning methods, and they all need to tune a fixed global regularization parameter by cross validation. On the one hand, global regularization is ineffective in handling local irregularities encountered in the online setting; on the other hand, even though the learning problem for a particular global regularization parameter value may be efficiently solved, repeatedly solving for a wide range of values can be costly. We intend to tackle these two problems with our approach. To address the first problem, we propose to perform implicit online update steps to optimize the hinge loss, as opposed to explicit (or gradient-based) updates that utilize subgradients to perform local linearization. Regarding the second problem, we propose to enforce local regularization that is applied to individual classifier update steps, rather than having a fixed global regularization term. Our theoretical analysis suggests that our classifier update steps progressively optimize the structured hinge loss, with the rate controlled by a sequence of regularization parameters; setting these parameters is analogous to setting the stepsizes in gradient-based methods. In addition, we give sufficient conditions for the algorithm's convergence. Experimentally, our online algorithm can match the optimal classification performance given by other state-of-the-art online SVM learning methods, as well as batch learning methods, after only one or two passes over the training data. More importantly, our algorithm can attain these results without doing cross validation, while all other methods must perform time-consuming cross validation to determine the optimal choice of the global regularization parameter.
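
    To make the contrast concrete, the sketch below compares an explicit subgradient step for the regularized hinge loss with the classical closed-form implicit (proximal) step for the binary hinge loss, in which a per-step parameter C_t plays the role of the local regularization discussed above. This is the standard passive-aggressive-style update, shown purely as an illustration of implicit updates; it is not the paper's algorithm.

    import numpy as np

    def sgd_hinge_step(w, x, y, lr, lam):
        """Explicit step: subgradient of the hinge loss plus global L2 regularization."""
        margin = y * w.dot(x)
        grad = lam * w - (y * x if margin < 1 else 0.0)
        return w - lr * grad

    def implicit_hinge_step(w, x, y, C_t):
        """Implicit step: argmin_v 0.5 * ||v - w||^2 + C_t * hinge(v; x, y), in closed form."""
        loss = max(0.0, 1.0 - y * w.dot(x))
        tau = loss / (x.dot(x) + 1.0 / (2.0 * C_t))    # step size never overshoots the loss
        return w + tau * y * x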