
    Classifying sequences by the optimized dissimilarity space embedding approach: a case study on the solubility analysis of the E. coli proteome

    We evaluate a version of the recently proposed classification system named Optimized Dissimilarity Space Embedding (ODSE) that operates in the input space of sequences of generic objects. The ODSE system was originally presented as a classification system for patterns represented as labeled graphs. However, since ODSE is founded on the dissimilarity space representation of the input data, the classifier can be easily adapted to any input domain where it is possible to define a meaningful dissimilarity measure. Here we demonstrate the effectiveness of the ODSE classifier for sequences by considering an application dealing with the recognition of the solubility degree of the Escherichia coli proteome. Solubility, or analogously aggregation propensity, is an important property of protein molecules, which is intimately related to the mechanisms underlying the chemico-physical process of folding. Each protein of our dataset is initially associated with a solubility degree and is represented as a sequence of symbols denoting the 20 amino acid residues. The computational results obtained here, which, we stress, were achieved with no context-dependent tuning of the ODSE system, confirm the validity and generality of the ODSE-based approach for structured data classification. Comment: 10 pages, 49 references.
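The dissimilarity space representation mentioned in this abstract can be illustrated with a minimal sketch. It is not the ODSE system itself, which the abstract does not specify in detail. The sketch assumes edit (Levenshtein) distance as the dissimilarity measure and a hand-picked prototype set; the function names and the toy sequences are hypothetical and are not taken from the paper's dataset.

```python
from typing import List, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (assumed dissimilarity)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def embed(seq: Sequence, prototypes: List[Sequence]) -> List[int]:
    """Map a sequence to the vector of its dissimilarities to each prototype."""
    return [edit_distance(seq, p) for p in prototypes]


def nearest_neighbor_label(query_vec, train_vecs, train_labels):
    """1-NN in the dissimilarity space, using squared Euclidean distance."""
    def sq_dist(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))
    best = min(range(len(train_vecs)),
               key=lambda i: sq_dist(query_vec, train_vecs[i]))
    return train_labels[best]


# Hypothetical toy data: short residue strings with made-up labels.
prototypes = ["MKV", "MKL", "AAA"]
train = ["MKV", "AAG"]
labels = ["soluble", "insoluble"]
train_vecs = [embed(s, prototypes) for s in train]
predicted = nearest_neighbor_label(embed("MKI", prototypes), train_vecs, labels)
```

Once every sequence is a fixed-length vector, any standard vector classifier can be used in place of the 1-NN rule shown here.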

    Recent Advances in Transfer Learning for Cross-Dataset Visual Recognition: A Problem-Oriented Perspective

    This paper takes a problem-oriented perspective and presents a comprehensive review of transfer learning methods, both shallow and deep, for cross-dataset visual recognition. Specifically, it categorises cross-dataset recognition into seventeen problems based on a set of carefully chosen data and label attributes. Such a problem-oriented taxonomy has allowed us to examine how different transfer learning approaches tackle each problem and how well each problem has been researched to date. The comprehensive problem-oriented review of the advances in transfer learning has revealed not only the challenges in transfer learning for visual recognition, but also the problems (e.g. eight of the seventeen problems) that have been scarcely studied. This survey presents not only an up-to-date technical review for researchers, but also a systematic approach and a reference for a machine learning practitioner to categorise a real problem and to look up a possible solution accordingly.

    One-class classifiers based on entropic spanning graphs

    One-class classifiers offer valuable tools to assess the presence of outliers in data. In this paper, we propose a design methodology for one-class classifiers based on entropic spanning graphs. Our approach also allows non-numeric data to be processed by means of an embedding procedure. The spanning graph is learned on the embedded input data and the resulting partition of vertices defines the classifier. The final partition is derived by exploiting a criterion based on mutual information minimization. Here, we compute the mutual information by using a convenient formulation provided in terms of the α-Jensen difference. Once training is completed, in order to associate a confidence level with the classifier decision, a graph-based fuzzy model is constructed. The fuzzification process is based only on topological information of the vertices of the entropic spanning graph. As such, the proposed one-class classifier is also suitable for data characterized by complex geometric structures. We provide experiments on well-known benchmarks containing both feature vectors and labeled graphs. In addition, we apply the method to the protein solubility recognition problem by considering several representations for the input samples. Experimental results demonstrate the effectiveness and versatility of the proposed method with respect to other state-of-the-art approaches. Comment: Extended and revised version of the paper "One-Class Classification Through Mutual Information Minimization" presented at the 2016 IEEE IJCNN, Vancouver, Canada.

    Mapping microarray gene expression data into dissimilarity spaces for tumor classification

    Microarray gene expression data sets usually contain a large number of genes, but a small number of samples. In this article, we present a two-stage classification model that combines feature selection with the dissimilarity-based representation paradigm. In the preprocessing stage, the ReliefF algorithm is used to generate a subset of top-ranked genes; in the learning/classification stage, the samples represented by the previously selected genes are mapped into a dissimilarity space, which is then used to construct a classifier capable of separating the classes more easily than a feature-based model. The ultimate aim of this paper is not to find the best subset of genes, but to analyze the performance of the dissimilarity-based models by means of a comprehensive collection of experiments for the classification of microarray gene expression data. To this end, we compare the classification results of an artificial neural network, a support vector machine and Fisher's linear discriminant classifier built on the feature (gene) space with those built on the dissimilarity space when varying the number of genes selected by ReliefF, using eight different microarray databases. The results show that the dissimilarity-based classifiers systematically outperform the feature-based models. In addition, classification through the proposed representation appears to be more robust (i.e. less sensitive to the number of genes) than classification with the conventional feature-based representation.

    Multiple Instance Learning: A Survey of Problem Characteristics and Applications

    Multiple instance learning (MIL) is a form of weakly supervised learning where training instances are arranged in sets, called bags, and a label is provided for the entire bag. This formulation is gaining interest because it naturally fits various problems and allows weakly labeled data to be leveraged. Consequently, it has been used in diverse application fields such as computer vision and document classification. However, learning from bags raises important challenges that are unique to MIL. This paper provides a comprehensive survey of the characteristics which define and differentiate the types of MIL problems. Until now, these problem characteristics have not been formally identified and described. As a result, the variations in performance of MIL algorithms from one data set to another are difficult to explain. In this paper, MIL problem characteristics are grouped into four broad categories: the composition of the bags, the types of data distribution, the ambiguity of instance labels, and the task to be performed. Methods specialized to address each category are reviewed. Then, the extent to which these characteristics manifest themselves in key MIL application areas is described. Finally, experiments are conducted to compare the performance of 16 state-of-the-art MIL methods on selected problem characteristics. This paper provides insight into how the problem characteristics affect MIL algorithms, recommendations for future benchmarking and promising avenues for research.
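The bag-level labeling described in this abstract can be made concrete with a small sketch. The abstract does not state which bag-labeling rule is used, so the sketch adopts the common "standard MIL assumption" from the wider literature, under which a bag is positive if at least one of its instances is positive. The instance scores, the threshold and the function name are hypothetical.

```python
from typing import Iterable


def bag_label(instance_scores: Iterable[float], threshold: float = 0.5) -> int:
    """Label a bag under the standard MIL assumption (an assumption here).

    Returns 1 if any instance score reaches the threshold, otherwise 0.
    An empty bag is labeled 0.
    """
    return int(any(score >= threshold for score in instance_scores))


# Hypothetical bags: each bag is a list of per-instance positive-class scores.
bags = {
    "bag_a": [0.1, 0.2, 0.9],  # one strongly positive instance
    "bag_b": [0.1, 0.3, 0.4],  # no instance reaches the threshold
}
bag_predictions = {name: bag_label(scores) for name, scores in bags.items()}
```

Only the bag label is observed during training in this setting; which instance triggered a positive bag remains unknown, which is the ambiguity of instance labels that the survey lists among its categories.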

    One-Class Classification: Taxonomy of Study and Review of Techniques

    One-class classification (OCC) algorithms aim to build classification models when the negative class is either absent, poorly sampled or not well defined. This unique situation constrains the learning of efficient classifiers, since the class boundary must be defined using only knowledge of the positive class. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper we present a unified view of the general problem of OCC by presenting a taxonomy of study for OCC problems, which is based on the availability of training data, the algorithms used and the application domains. We further delve into each of the categories of the proposed taxonomy and present a comprehensive literature review of the OCC algorithms, techniques and methodologies with a focus on their significance, limitations and applications. We conclude our paper by discussing some open research problems in the field of OCC and present our vision for future research. Comment: 24 pages + 11 pages of references, 8 figures.
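To illustrate what defining a class boundary from positive examples alone can look like, here is a deliberately simple sketch. It is not one of the techniques reviewed in the paper. It assumes numeric feature vectors and a spherical boundary around the centroid of the training data; the class name `CentroidOneClass` and the `quantile` parameter are hypothetical.

```python
import math
from typing import List, Sequence


class CentroidOneClass:
    """Toy one-class model: accept points within a learned radius of the centroid."""

    def __init__(self, quantile: float = 0.95):
        self.quantile = quantile  # fraction of training points kept inside
        self.center: List[float] = []
        self.radius = 0.0

    def _dist(self, x: Sequence[float]) -> float:
        return math.sqrt(sum((a - c) ** 2 for a, c in zip(x, self.center)))

    def fit(self, X: List[Sequence[float]]) -> "CentroidOneClass":
        n, d = len(X), len(X[0])
        # Centroid of the positive (target) class only.
        self.center = [sum(x[k] for x in X) / n for k in range(d)]
        dists = sorted(self._dist(x) for x in X)
        # Radius is the distance at the chosen quantile of training points.
        idx = max(0, min(n - 1, math.ceil(self.quantile * n) - 1))
        self.radius = dists[idx]
        return self

    def predict(self, x: Sequence[float]) -> bool:
        """True if x is accepted as belonging to the target class."""
        return self._dist(x) <= self.radius


# Hypothetical positive-only training data.
model = CentroidOneClass().fit([[0, 0], [1, 0], [0, 1], [1, 1]])
inside = model.predict([0.5, 0.5])
outside = model.predict([5.0, 5.0])
```

No negative examples are used at any point; the only tunable choice is how much of the positive training data the boundary should enclose.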

    Advances in Sensors, Big Data and Machine Learning in Intelligent Animal Farming

    Animal production (e.g., milk, meat, and eggs) provides valuable protein for humans and animals. However, animal production is facing several challenges worldwide, such as environmental impacts and animal welfare/health concerns. In animal farming operations, accurate and efficient monitoring of animal information and behavior can help analyze the health and welfare status of animals and identify sick or abnormal individuals at an early stage to reduce economic losses and protect animal welfare. In recent years, there has been growing interest in animal welfare. At present, sensors, big data, machine learning, and artificial intelligence are used to improve management efficiency, reduce production costs, and enhance animal welfare. Although these technologies still have challenges and limitations, their application and exploration in animal farms will greatly promote the intelligent management of farms. Therefore, this Special Issue will collect original papers with novel contributions based on technologies such as sensors, big data, machine learning, and artificial intelligence to study animal behavior monitoring and recognition, environmental monitoring, health evaluation, etc., to promote intelligent and accurate animal farm management.

    Efficient mixture model for clustering of sparse high dimensional binary data

    Clustering is one of the fundamental tools for preliminary analysis of data. While most clustering methods are designed for continuous data, sparse high-dimensional binary representations have become very popular in various domains such as text mining and cheminformatics. The application of classical clustering tools to this type of data usually proves to be very inefficient, both in terms of computational complexity and in terms of the utility of the results. In this paper we propose a mixture model, SparseMix, for clustering of sparse high-dimensional binary data, which connects model-based with centroid-based clustering. Every group is described by a representative and a probability distribution modeling dispersion from this representative. In contrast to classical mixture models based on the EM algorithm, SparseMix is specially designed for the processing of sparse data, can be efficiently realized by an on-line Hartigan optimization algorithm, and describes every cluster by the most representative vector. We have performed extensive experimental studies on various types of data, which confirmed that SparseMix builds partitions with a higher compatibility with reference grouping than related methods. Moreover, the constructed representatives often better reveal the internal structure of the data.

    Similarity-based methods for machine diagnosis

    This work presents a data-driven condition-based maintenance system based on similarity-based modeling (SBM) for automatic machinery fault diagnosis. The proposed system provides information about the equipment's current state (degree of anomaly), and returns a set of exemplars that can be employed to describe the current state in a sparse fashion, which can be examined by the operator to decide on the action to be taken. The system is modular and data-agnostic, enabling its use with different equipment and data sources with small modifications. The main contributions of this work are: the extensive study of the proposition and use of multiclass SBM on different databases, either as a stand-alone classification method or in combination with an off-the-shelf classifier; novel methods for selecting prototypes for the SBM models; the use of new similarity functions; and a new production-ready fault detection service. These contributions achieved the goal of increasing the SBM models' performance in a fault classification scenario while reducing their computational complexity. The proposed system was evaluated on three different databases, achieving higher or similar performance when compared with previous works on the same databases. Comparisons with other methods are shown for the recently developed Machinery Fault Database (MaFaulDa) and for the Case Western Reserve University (CWRU) bearing database. The proposed techniques increase the generalization power of the similarity model and of the associated classifier, reaching accuracies of 98.5% on MaFaulDa and 98.9% on the CWRU database. These results indicate that the proposed approach based on SBM is worth further investigation.