33 research outputs found

    On the optimal usage of labelled examples in semi-supervised multi-class classification problems

    In recent years, the performance of semi-supervised learning has been investigated theoretically. However, most of this theoretical work has focussed on binary classification problems. In this paper, we take it a step further by extending the work of Castelli and Cover [1] [2] to the multi-class paradigm. In particular, we consider the key problem in semi-supervised learning of classifying an unseen instance x into one of K different classes, using a training dataset sampled from a mixture density distribution and composed of l labelled records and u unlabelled examples. Even when the mixture is assumed to be identifiable and infinitely many unlabelled examples are available, labelled records are still needed to determine the K decision regions. We therefore first investigate the minimum number of labelled examples needed to accomplish that task. Then, we propose an optimal multi-class learning algorithm which generalises the optimal procedure proposed in the literature for binary problems. Finally, we use this generalisation to study the probability of error when the binary class constraint is relaxed.

    Efficient learning of decomposable models with a bounded clique size

    Learning probability distributions from data is a ubiquitous problem in Statistics and Artificial Intelligence. Over the last decades, several algorithms have been proposed to learn probability distributions based on decomposable models, owing to their advantageous theoretical properties. Some of these algorithms can be used to search for a maximum likelihood decomposable model with a given maximum clique size, k, which controls the complexity of the model. Unfortunately, the problem of learning a maximum likelihood decomposable model given a maximum clique size is NP-hard for k > 2. In this work, we propose a family of algorithms which approximates this problem with a computational complexity of O(k · n^2 log n) in the worst case, where n is the number of random variables involved. The structures of the decomposable models that solve the maximum likelihood problem are called maximal k-order decomposable graphs. Our proposals, called fractal trees, construct a sequence of maximal i-order decomposable graphs, for i = 2, ..., k, in k − 1 steps. At each step, the algorithms follow a divide-and-conquer strategy based on the particular features of this type of structure. Additionally, we propose a prune-and-graft procedure which transforms a maximal k-order decomposable graph into another one with higher likelihood. We have implemented two particular fractal tree algorithms, called parallel fractal tree and sequential fractal tree. These algorithms can be considered a natural extension of Chow and Liu’s algorithm from k = 2 to arbitrary values of k. Both algorithms have been compared against other efficient approaches in artificial and real domains, and they have shown competitive behavior on the maximum likelihood problem. Owing to their low computational complexity, they are especially recommended for high-dimensional domains.
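
    As a point of reference for the k = 2 base case mentioned in the abstract, the following is a minimal sketch of Chow and Liu's algorithm, not of the fractal tree algorithms themselves. It assumes discrete data given as one list of values per variable; the function names `mutual_information` and `chow_liu_tree` are illustrative and do not come from the paper. It scores every pair of variables by empirical mutual information and keeps a maximum-weight spanning tree (here built with Kruskal's method).

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two discrete samples."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(a,b) * log(p(a,b) / (p(a) p(b))), written in terms of raw counts.
    return sum((c / n) * math.log(c * n / (cx[a] * cy[b]))
               for (a, b), c in cxy.items())


def chow_liu_tree(columns):
    """Return the edges of a maximum-weight spanning tree over the variables.

    `columns` is a list with one list of observed values per variable.
    """
    n_vars = len(columns)
    # Score all pairs and sort by decreasing mutual information.
    edges = sorted(
        ((mutual_information(columns[i], columns[j]), i, j)
         for i, j in combinations(range(n_vars), 2)),
        reverse=True,
    )
    parent = list(range(n_vars))  # union-find forest for Kruskal's method

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # adding the edge does not create a cycle
            parent[ri] = rj
            tree.append((i, j))
    return tree


# Toy data: variable 1 copies variable 0, variable 2 is independent of both.
x0 = [0, 0, 0, 0, 1, 1, 1, 1]
x1 = list(x0)
x2 = [0, 1, 0, 1, 0, 1, 0, 1]
tree = chow_liu_tree([x0, x1, x2])
```

    In this toy example the strongly dependent pair (0, 1) is always selected, and the remaining variable is attached by an arbitrary zero-weight edge.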

    Triku: a feature selection method based on nearest neighbors for single-cell data

    Background: Feature selection is a relevant step in the analysis of single-cell RNA sequencing datasets. Most of the current feature selection methods are based on general univariate descriptors of the data such as the dispersion or the percentage of zeros. Despite the use of correction methods, the generality of these feature selection methods biases the genes selected towards highly expressed genes, instead of the genes defining the cell populations of the dataset. Results: Triku is a feature selection method that favors genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the k-nearest neighbor graph. The expression of these genes is higher than the expected expression if the k cells were chosen at random. Triku efficiently recovers cell populations present in artificial and biological benchmarking datasets, based on adjusted Rand index, normalized mutual information, supervised classification, and silhouette coefficient measurements. Additionally, gene sets selected by triku are more likely to be related to relevant Gene Ontology terms and contain fewer ribosomal and mitochondrial genes. Conclusion: Triku is developed in Python 3 and is available at https://github.com/alexmascension/triku.
    This work was supported by grants from Instituto de Salud Carlos III (AC17/00012 and PI19/01621), cofunded by the European Union (European Regional Development Fund/European Science Foundation, Investing in your future) and the 4D-HEALING project (ERA-Net program EracoSysMed, JTC-2 2017); Diputación Foral de Gipuzkoa, and the Department of Economic Development and Infrastructures of the Basque Government (KK-2019/00006, KK-2019/00093); European Union FET project Circular Vision (H2020-FETOPEN, Project 899417), Ministry of Science and Innovation of Spain; and PID2020-119715GB-I00 funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”. A.M.A. was supported by a Basque Government Postgraduate Diploma fellowship (PRE_2020_2_0081), and O.I.S. was supported by a Postgraduate Diploma fellowship from la Caixa Foundation (identification document 100010434; code LCF/BQ/IN18/11660065).

    Approaching Sentiment Analysis by Using Semi-supervised Learning of Multidimensional Classifiers

    Sentiment Analysis is defined as the computational study of opinions, sentiments and emotions expressed in text. Within this broad field, most work has focused on either Sentiment Polarity classification, where a text is classified as having positive or negative sentiment, or Subjectivity classification, in which a text is classified as being subjective or objective. In both problems there is only a single variable to be predicted. In this paper, we consider instead a real-world problem in which the attitude of the author is characterised by three different (but related) target variables: Subjectivity, Sentiment Polarity, and Will to Influence. For that reason, the common (uni-dimensional) approaches used in this area yield suboptimal solutions to this problem. To bridge this gap, we propose, for the first time, the use of the novel multi-dimensional classification paradigm in the Sentiment Analysis domain. This methodology joins the different target variables in the same classification task so as to take advantage of the potential statistical relations between them. In addition, to exploit the huge amount of unlabelled information available nowadays in this context, we propose the extension of the multi-dimensional classification framework to the semi-supervised domain. Experimental results show that our semi-supervised multi-dimensional approach outperforms the most common Sentiment Analysis approaches. We conclude that our approach improves recognition rates for this problem and, by extension, could be considered for future Sentiment Analysis problems.

    Microarray analysis of autoimmune diseases by machine learning procedures

    Microarray-based global gene expression profiling, combined with sophisticated statistical algorithms, is providing new insights into the pathogenesis of autoimmune diseases. We have applied a novel statistical technique for gene selection based on machine learning approaches to analyze microarray expression data gathered from patients with systemic lupus erythematosus (SLE) and primary antiphospholipid syndrome (PAPS), two autoimmune diseases of unknown genetic origin that share many common features. The methodology included a combination of three data discretization policies, a consensus gene selection method, and a multivariate correlation measurement. A set of 150 genes was found to discriminate SLE and PAPS patients from healthy individuals. Statistical validations demonstrate the relevance of this gene set from both a univariate and a multivariate perspective. Moreover, functional characterization of these genes identified an interferon-regulated gene signature, consistent with previous reports. It also revealed the existence of other regulatory pathways, including those regulated by PTEN, TNF, and BCL-2, which are altered in SLE and PAPS. Remarkably, a significant number of these genes carry E2F binding motifs in their promoters, suggesting a role for E2F in the regulation of autoimmunity.

    A review of estimation of distribution algorithms in bioinformatics

    Evolutionary search algorithms have become an essential asset in the algorithmic toolbox for solving high-dimensional optimization problems across a broad range of bioinformatics problems. Genetic algorithms, the most well-known and representative evolutionary search technique, have been the subject of the major part of such applications. Estimation of distribution algorithms (EDAs) offer a novel evolutionary paradigm that constitutes a natural and attractive alternative to genetic algorithms. They make use of a probabilistic model, learnt from the promising solutions, to guide the search process. In this paper, we set out a basic taxonomy of EDA techniques, underlining the nature and complexity of the probabilistic model of each EDA variant. We review a set of innovative works that make use of EDA techniques to solve challenging bioinformatics problems, emphasizing the EDA paradigm's potential for further research in this domain.
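
    To make the sample, select, re-estimate loop described in this abstract concrete, here is a minimal sketch of one of the simplest EDA variants, a univariate marginal model over bit strings (often called UMDA). The toy objective (counting ones), the function name `umda_onemax`, the population sizes, and the clamping of probabilities to [0.05, 0.95] are all illustrative assumptions and are not taken from the review.

```python
import random


def umda_onemax(n_bits=20, pop_size=100, n_selected=50, generations=30, seed=0):
    """Maximise the number of ones in a bit string with a univariate EDA."""
    rng = random.Random(seed)
    probs = [0.5] * n_bits  # probabilistic model: one independent marginal per bit
    best = [0] * n_bits
    for _ in range(generations):
        # 1. Sample a population from the current probabilistic model.
        population = [[1 if rng.random() < p else 0 for p in probs]
                      for _ in range(pop_size)]
        # 2. Keep the promising solutions (truncation selection on fitness).
        population.sort(key=sum, reverse=True)
        selected = population[:n_selected]
        if sum(selected[0]) > sum(best):
            best = selected[0]
        # 3. Re-estimate the model from the selected solutions.
        #    Clamping keeps some diversity (an assumption of this sketch).
        probs = [min(0.95, max(0.05, sum(ind[i] for ind in selected) / n_selected))
                 for i in range(n_bits)]
    return best, probs


best, probs = umda_onemax()
```

    More expressive EDA variants replace the independent marginals in step 3 with richer probabilistic models that capture dependencies between variables; the overall loop stays the same.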

    Probabilistic graphical models for supervised classification using estimation based on spherical Gaussian kernels

    The naive Bayes classifier has been shown to perform surprisingly well in supervised classification even though it assumes that the predictor variables are conditionally independent given the class, an assumption that generally does not hold. The tree-augmented Bayesian network classifier relaxes this strong assumption by allowing dependencies between predictor variables, and therefore performs better than naive Bayes in certain domains. Many classifiers based on Bayesian networks (naive Bayes, tree-augmented Bayesian network, k-dependence Bayesian network, semi-naive Bayes, ...) handle only discrete variables, even though many real-world domains include continuous variables. There are three options for estimating the density functions of continuous variables: 1. Discretize the continuous variables, with the consequent loss of information. 2. Approximate the density function of the data with a parametric estimate (usually Gaussian), with the consequent estimation error if the true distribution differs from the chosen parametric distribution. 3. Approximate the density with a non-parametric estimate (kernels, ...). Non-parametric estimation is more flexible than parametric estimation, since it fits most density functions reasonably better. This work presents the flexible conditional network paradigm. It is not intended as an in-depth study of the new paradigm, but as its introduction for supervised classification. The paradigm uses kernel-based estimation to model the density of the continuous variables. The flexible conditional network can be understood as an extension of the Bayesian network and conditional Gaussian network paradigms, since it allows a more flexible and accurate estimate of the density function of the variables.
    As a practical example, we include the adaptation of the tree-augmented Bayesian network algorithm of Friedman et al. (1997) to flexible conditional networks. This adaptation can be regarded as the extension of the flexible Bayes classifier of John and Langley (1995), in the same way that the tree-augmented Bayesian network is an extension of naive Bayes. In addition, to lay the groundwork for our line of research, we propose an estimator for the mutual information between two multidimensional continuous variables whose density is based on kernels.
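
    The kernel-based density estimate that this abstract builds on can be sketched as a naive Bayes classifier whose per-class, per-variable densities are averages of Gaussian kernels, in the spirit of the flexible Bayes classifier it cites. This is only an illustration of that starting point, not of the flexible conditional network or its tree-augmented adaptation. The function names, the fixed bandwidth `h`, and the toy data are assumptions made for the example.

```python
import math


def gaussian_kernel_density(x, samples, h):
    """Kernel density estimate at x: the mean of Gaussians centred on each sample."""
    norm = 1.0 / (len(samples) * h * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / h) ** 2) for s in samples)


def flexible_naive_bayes_predict(x, data_by_class, h=0.5):
    """Predict the class of x using kernel density estimates per class and variable.

    `data_by_class` maps each class label to a list of training rows.
    `h` is a fixed bandwidth, chosen arbitrarily for this sketch.
    """
    total = sum(len(rows) for rows in data_by_class.values())
    scores = {}
    for label, rows in data_by_class.items():
        log_score = math.log(len(rows) / total)  # class prior
        for j, value in enumerate(x):
            column = [row[j] for row in rows]
            # Naive Bayes: variables are treated as independent given the class.
            # The tiny constant avoids log(0) when the density underflows.
            log_score += math.log(gaussian_kernel_density(value, column, h) + 1e-300)
        scores[label] = log_score
    return max(scores, key=scores.get)


# Toy data: two well-separated classes in two continuous variables.
training = {
    "a": [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0)],
    "b": [(3.0, 3.1), (2.9, 3.2), (3.2, 2.8)],
}
label_near_a = flexible_naive_bayes_predict((0.1, 0.0), training)
label_near_b = flexible_naive_bayes_predict((3.0, 3.0), training)
```

    In practice the bandwidth strongly affects the estimate and would normally be selected from the data rather than fixed.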

    On applying Supervised Classification techniques in medicine

    This paper presents an overview of the Supervised Classification techniques that can be applied in medicine. Supervised Classification belongs to the Machine Learning area, and many paradigms have been used to develop Decision Support Systems that can help the physician in the diagnosis task. Different families of classifiers can be distinguished based on the model used to make the final classification: Classification Rules, Decision Trees, Instance-Based Learning and Bayesian Classifiers are presented in this paper. These techniques have been extended to many research and application fields, and some examples from the medical world are presented for each paradigm.