6 research outputs found

    Building Gene Expression Profile Classifiers with a Simple and Efficient Rejection Option in R

    Background: The collection of gene expression profiles from DNA microarrays and their analysis with pattern recognition algorithms is a powerful technology applied to several biological problems. Common pattern recognition systems classify samples by assigning them to a set of known classes. However, in a clinical diagnostics setup, novel and unknown classes (new pathologies) may appear, and one must be able to reject those samples that do not fit the trained model. The problem of implementing a rejection option in a multi-class classifier has not been widely addressed in the statistical literature. Gene expression profiles represent a critical case study since they suffer from the curse of dimensionality, which negatively affects the reliability of both traditional rejection models and more recent approaches such as one-class classifiers. Results: This paper presents a set of empirical decision rules that can be used to implement a rejection option in a set of multi-class classifiers widely used for the analysis of gene expression profiles. In particular, we focus on the classifiers implemented in the R Language and Environment for Statistical Computing (R for short in the remainder of this paper). The main contribution of the proposed rules is their simplicity, which enables easy integration with available data analysis environments. Since tuning the parameters involved in the definition of a rejection model is often a complex and delicate task, in this paper we exploit an evolutionary strategy to automate this process. This allows the final user to maximize the rejection accuracy with minimum manual intervention. Conclusions: This paper shows how simple decision rules can support the use of complex machine learning algorithms in real experimental setups. The proposed approach is almost completely automated and is therefore a good candidate for integration into data analysis flows in labs where the machine learning expertise required to tune traditional classifiers might not be available.
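
The abstract above does not spell out its decision rules, so the following is only a minimal sketch of what a rejection option on top of a multi-class classifier can look like. It assumes the classifier already returns one posterior probability per class. The function name `classify_with_rejection` and the thresholds `min_confidence` and `min_margin` are hypothetical and are not taken from the paper. A sample is rejected when the best class is not confident enough, or when it is not clearly separated from the runner-up.

```python
from typing import Dict, Optional


def classify_with_rejection(
    posteriors: Dict[str, float],
    min_confidence: float = 0.7,  # hypothetical threshold, not from the paper
    min_margin: float = 0.2,      # hypothetical threshold, not from the paper
) -> Optional[str]:
    """Return the predicted class label, or None when the sample is rejected."""
    if not posteriors:
        return None

    # Rank classes from the most to the least probable.
    ranked = sorted(posteriors.items(), key=lambda item: item[1], reverse=True)
    best_label, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0

    # Rule 1: the winning class must be confident enough on its own.
    if best_score < min_confidence:
        return None

    # Rule 2: the winning class must clearly beat the second-best class.
    if best_score - runner_up_score < min_margin:
        return None

    return best_label
```

In a setup like the one the abstract describes, the two thresholds would be the parameters handed to an automated tuning procedure (the paper uses an evolutionary strategy) rather than fixed by hand.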

    A cDNA Microarray Gene Expression Data Classifier for Clinical Diagnostics Based on Graph Theory

    Despite great advances in discovering cancer molecular profiles, the proper application of microarray technology to routine clinical diagnostics is still a challenge. Current practices in the classification of microarray data show two main limitations: the reliability of the training data sets used to build the classifiers, and the classifiers' performance, especially when the sample to be classified does not belong to any of the available classes. In this case, state-of-the-art algorithms usually produce a high rate of false positives that, in real diagnostic applications, is unacceptable. To address this problem, this paper presents a new cDNA microarray data classification algorithm that is based on graph theory and is able to overcome most of the limitations of known classification methodologies. The classifier works by analyzing gene expression data organized in an innovative graph-based data structure, where vertices correspond to genes and edges to gene expression relationships. To demonstrate the novelty of the proposed approach, the authors present an experimental performance comparison between the proposed classifier and several state-of-the-art classification algorithms.

    Joint sub-classifiers one class classification model for avian influenza outbreak detection

    H5N1 avian influenza outbreak detection is a significant issue for early warning of epidemics. This paper proposes a domain knowledge-based joint one-class classification model for avian influenza outbreak detection. Instead of focusing on manipulations of the one-class classification model, we delve into the one-class avian influenza dataset, divide it into sub-classes using domain knowledge, train a classifier for each sub-class, and unify the results of the individual classifiers. The proposed joint method solves the one-class classification and feature selection problems together. The experimental results demonstrate that the proposed joint model outperforms the standard one-class classification model on the animal avian influenza dataset. © 2011 Imperial College Press
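
The divide, train and unify steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' model: each sub-class classifier here is a simple centroid-plus-radius one-class model, and the results are unified with a logical OR (a sample counts as belonging to the target class if any sub-classifier accepts it). The function names, the sub-class labels and the unification rule are all hypothetical, since the abstract does not specify them.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]
CentroidModel = Tuple[List[float], float]  # (centroid, acceptance radius)


def fit_centroid_model(samples: List[Point]) -> CentroidModel:
    """Fit a toy one-class model: the centroid and the largest training distance."""
    dims = len(samples[0])
    centroid = [sum(s[d] for s in samples) / len(samples) for d in range(dims)]
    radius = max(math.dist(s, centroid) for s in samples)
    return centroid, radius


def fit_joint_model(subclasses: Dict[str, List[Point]]) -> Dict[str, CentroidModel]:
    """Train one sub-classifier per sub-class (the split itself comes from domain knowledge)."""
    return {name: fit_centroid_model(samples) for name, samples in subclasses.items()}


def is_target(models: Dict[str, CentroidModel], sample: Point) -> bool:
    """Unify the sub-classifiers: accept if at least one of them accepts the sample."""
    return any(
        math.dist(sample, centroid) <= radius
        for centroid, radius in models.values()
    )


# Hypothetical sub-classes; real ones would be defined by domain experts.
training_data = {
    "group_a": [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    "group_b": [(10.0, 10.0), (11.0, 10.0)],
}
joint_model = fit_joint_model(training_data)
```

With this split, a point lying between the two groups is rejected by both sub-classifiers. A single one-class model fitted on all the data at once would need a boundary wide enough to cover both groups and could accept that point.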

    Application of transcriptomics and proteomics as a complementary evaluation of foods through multivariate analysis

    Thesis (doctorate) - Universidade Federal de Santa Catarina, Centro de Ciências Agrárias, Programa de Pós-Graduação em Ciência dos Alimentos, Florianópolis, 2014. Abstract: The increasing presence of new food products on the market stimulates ever more discussion related to food safety. Each country or region has its own laws for releasing new foods for consumption, but there is an international consensus regarding the regulations for the safety of consumption of these products, including foods derived from recombinant DNA technology. The concept of substantial equivalence has been used for this purpose, based on the fact that foods already on the market are accepted as safe for consumption and serve as a basis for comparison through analyses of specific components recognized as toxic. These targeted analyses are convenient, but they are rather limited because they search for the presence of only a few previously known elements. Thus, more comprehensive approaches, such as transcriptomic and proteomic analyses, have been proposed for food safety evaluation. Because these analyses generate a large amount of data, multivariate statistical approaches have been suggested for the interpretation of results. In this study, the application of multivariate statistical tools to data from transcriptomic (microarray) and proteomic (two-dimensional electrophoresis) techniques was evaluated, with a view to their use as a complementary tool in the safety assessment of novel foods. For that, five varieties of potatoes recognized as safe for consumption (Biogold, Fontane, Innovator, Lady Rosetta and Maris Piper) were analyzed. Transcriptome analysis showed that it was possible to classify the samples using one-class SIMCA. Two scenarios, each containing a set of five classifiers, were developed, and in each scenario two independent samples known to be safe, but analyzed at different times, were tested (thus including technical variability in the test). In each set of classifiers, the test samples that were most often classified as not belonging to the models (i.e. not classified as safe) represent the samples with higher technical variability, given that they were grown and analyzed at different times from those used to construct the classifiers. In contrast, the samples that were recognized as safe by most classifiers have lower technical variability. In addition, proteomic analysis using two-dimensional electrophoresis was performed on these samples. Immobilized pH gradient (IPG) strips of two different lengths, 13 and 24 cm, all in the pH 4-7 range, were used, and the resulting datasets, representing the percentage of spot volume (with missing values either replaced or simply removed), were visualized by principal component analysis (PCA). Clear separation of the varieties was already visible in the first two principal components of the dataset containing replaced missing values. These results reveal the possibility of building classification tools through broad profiling techniques such as transcriptomics and proteomics, thus exploring a complementary approach to food safety assessment. To improve the work, analyzing a larger number of samples will enable more accurate results, thereby including a high level of technical variability in the construction of the classifiers. In this way, it will be possible to reproduce real food safety assessment situations on a small scale.