25 research outputs found

    Chi-square-based scoring function for categorization of MEDLINE citations

    Objectives: Text categorization has been used in biomedical informatics to identify documents containing relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to determine the likelihood that a MEDLINE citation covers a genetically relevant topic. Methods: Our procedure requires the construction of a genetic and a nongenetic domain document corpus. We used the MeSH descriptors assigned to MEDLINE citations for this categorization task. We compared the frequencies of MeSH descriptors between the two corpora by applying the chi-square test. A MeSH descriptor was considered a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all the citations, with the highest scores given to citations containing MeSH descriptors typical of the genetic domain. Results: Validation was done on a set of 734 manually annotated MEDLINE citations. The method achieved a predictive accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method by comparing it to three machine learning algorithms (support vector machines, decision trees, naïve Bayes). Although the differences were not statistically significant, the results showed that our chi-square scoring performs as well as the compared machine learning algorithms. Conclusions: We suggest that chi-square scoring is an effective solution to help categorize MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for the gene symbol disambiguation process. Comment: 34 pages, 2 figures
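The scoring idea in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact scoring formula, so the code below makes assumptions: each descriptor receives the Pearson chi-square statistic of a 2x2 contingency table (no continuity correction), signed positive when the descriptor is relatively more frequent in the genetic corpus, and a citation's score is the sum of the signed weights of its descriptors. The function names and the toy corpora are hypothetical.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_docs, nongenetic_docs):
    """Signed chi-square weight per descriptor (assumes both corpora are non-empty)."""
    n_gen, n_non = len(genetic_docs), len(nongenetic_docs)
    # Document frequencies: each descriptor is counted once per citation.
    gen_counts = Counter(d for doc in genetic_docs for d in set(doc))
    non_counts = Counter(d for doc in nongenetic_docs for d in set(doc))
    weights = {}
    for desc in set(gen_counts) | set(non_counts):
        a, c = gen_counts[desc], non_counts[desc]
        chi2 = chi_square_2x2(a, n_gen - a, c, n_non - c)
        # Positive indicator: relatively more frequent in the genetic corpus.
        positive = a / n_gen > c / n_non
        weights[desc] = chi2 if positive else -chi2
    return weights


def score_citation(descriptors, weights):
    """Assumed aggregation: sum of signed weights; unseen descriptors count as 0."""
    return sum(weights.get(d, 0.0) for d in set(descriptors))


# Toy corpora (invented for illustration only).
genetic = [{"Genes", "Mutation"}, {"Genes", "Humans"}, {"Mutation", "Humans"}]
nongenetic = [{"Humans", "Surgery"}, {"Surgery"}, {"Humans", "Diet"}]
weights = descriptor_weights(genetic, nongenetic)
ranked = sorted(
    [{"Genes", "Mutation"}, {"Humans"}, {"Surgery", "Diet"}],
    key=lambda doc: score_citation(doc, weights),
    reverse=True,
)
```

Citations whose descriptors are typical of the genetic corpus end up at the top of `ranked`; a threshold on the score would then turn the ranking into a binary categorization.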

    Minería de Datos: Conceptos y Tendencias [Data Mining: Concepts and Trends]

    Nowadays, data mining (DM) is increasingly capturing the attention of companies. It is still uncommon to hear phrases such as "we should segment our customers using DM tools", "DM will increase customer satisfaction", or "the competition is using DM to gain market share". However, everything indicates that sooner rather than later data mining will be used by society at least as heavily as statistics is today. So what is data mining, and what benefits does it bring? How can this technology influence the way companies and society in general solve their everyday problems? What technologies are behind data mining? What is the life cycle of a typical data mining project? This article attempts to answer these questions through an introduction to data mining: its definition, examples of problems that can be solved with data mining, data mining tasks, the techniques used, and finally the challenges and trends in data mining

    Aplicação de Multiclassificadores Heterogêneos no Reconhecimento de Classes Estruturais de Proteínas [Application of Heterogeneous Multi-classifiers to the Recognition of Protein Structural Classes]

    Protein fold recognition is one of the main open problems in molecular biology and an important approach to discovering protein structures without relying on the similarity of their sequences. In this context, computational tools, especially Machine Learning (ML) techniques, have become essential alternatives for tackling this problem, given the large volume of data involved. This work presents the results obtained by applying different heterogeneous multi-classifier systems (Stacking, StackingC and Vote), using distinct types of base classifiers (Decision Trees, K-Nearest Neighbors, Naive Bayes, Support Vector Machines and Neural Networks), to the task of predicting protein structural classes

    Hierarchical cost-sensitive algorithms for genome-wide gene function prediction

    In this work we propose new ensemble methods for the hierarchical classification of gene functions. Our methods exploit the hierarchical relationships between the classes in different ways: each ensemble node is trained "locally", according to its position in the hierarchy; moreover, in the evaluation phase the set of predicted annotations is built so as to minimize a global loss function defined over the hierarchy. We also address the problem of sparsity of annotations by introducing a cost-sensitive parameter that controls the precision-recall trade-off. Experiments with the model organism S. cerevisiae, using the FunCat taxonomy and 7 biomolecular data sets, reveal a significant advantage of our techniques over "flat" and cost-insensitive hierarchical ensembles

    Random subspace ensembles for the bio-molecular diagnosis of tumors.

    The bio-molecular diagnosis of malignancies, based on DNA microarray biotechnologies, is a difficult learning task because of the high dimensionality and low cardinality of the data. Many supervised learning techniques, among them support vector machines (SVMs), have been tried, often together with feature selection methods to reduce the dimensionality of the data. In this paper we investigate an alternative approach based on random subspace ensemble methods. The high dimensionality of the data is reduced by randomly sampling subsets of features (gene expression levels), and accuracy is improved by aggregating the resulting base classifiers. Our experiments, in the area of the diagnosis of malignancies at the bio-molecular level, show the effectiveness of the proposed approach
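The random subspace idea described in this abstract (sample feature subsets at random, train one base classifier per subset, aggregate the predictions) can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the abstract does not say which base learner or aggregation rule the ensemble uses, so the sketch substitutes a nearest-centroid classifier and majority voting, and the function names, parameters, and toy data are hypothetical.

```python
import random
from collections import Counter


def train_centroids(X, y, features):
    """Per-class mean vector restricted to the selected feature indices."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, f in enumerate(features):
            acc[k] += row[f]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(centroids, features, row):
    """Label of the closest class centroid (squared Euclidean distance)."""
    def dist(center):
        return sum((row[f] - center[k]) ** 2 for k, f in enumerate(features))
    return min(centroids, key=lambda label: dist(centroids[label]))


def fit_random_subspace(X, y, n_models=25, subspace_size=2, seed=0):
    """Train one base classifier per randomly sampled feature subset."""
    rng = random.Random(seed)
    n_features = len(X[0])
    models = []
    for _ in range(n_models):
        features = rng.sample(range(n_features), subspace_size)
        models.append((features, train_centroids(X, y, features)))
    return models


def predict_ensemble(models, row):
    """Aggregate the base classifiers by majority vote."""
    votes = Counter(predict_centroid(c, f, row) for f, c in models)
    return votes.most_common(1)[0][0]


# Toy "expression profiles" with 6 features (invented for illustration only).
X = [[0, 0, 0, 0, 0, 0], [1, 1, 1, 1, 1, 1],
     [10, 10, 10, 10, 10, 10], [11, 11, 11, 11, 11, 11]]
y = ["normal", "normal", "tumor", "tumor"]
models = fit_random_subspace(X, y, n_models=15, subspace_size=2, seed=42)
label = predict_ensemble(models, [10.5] * 6)
```

Each base classifier sees only `subspace_size` of the features, which is what reduces the dimensionality; the vote across models is what recovers accuracy.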

    Weighted True Path Rule: a multilabel hierarchical algorithm for gene function prediction

    The genome-wide hierarchical classification of gene functions, using biomolecular data from high-throughput biotechnologies, is one of the central topics in bioinformatics and functional genomics. In this paper we present a multilabel hierarchical algorithm inspired by the "true path rule" that governs both the Gene Ontology and the Functional Catalogue (FunCat). In particular, we propose an enhanced version of the True Path Rule (TPR) algorithm that lets us control the flow of information between the classifiers of the hierarchical ensemble, and thereby tune the precision/recall characteristics of the overall hierarchical classification system. Results with the model organism S. cerevisiae show that the proposed method significantly improves on the basic version of the TPR algorithm, as well as on the Hierarchical Top-down and Flat ensembles
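The abstract does not state the update rule of the weighted TPR algorithm, so the following Python sketch only illustrates one plausible form of the idea and should not be read as the authors' method. It assumes a tree with a single root, a local probability per class from some base classifier, a weight `w` that balances a node's local prediction against the average score of its positively predicted children, and a top-down pass in which a negative prediction blocks all descendants. All names, the threshold, and the toy FunCat-like codes are hypothetical.

```python
def weighted_tpr(children, local_prob, root, w=0.5, threshold=0.5):
    """Return (consensus scores, predicted classes) for a class tree.

    children:   dict mapping a class to the list of its child classes
    local_prob: dict mapping a class to its local classifier probability
    w:          weight of the local prediction (w=1.0 ignores the children)
    """
    consensus = {}

    def bottom_up(node):
        # Positive child predictions flow upward and reinforce the parent.
        child_scores = [bottom_up(ch) for ch in children.get(node, [])]
        positive = [s for s in child_scores if s > threshold]
        if positive:
            consensus[node] = (w * local_prob[node]
                               + (1 - w) * sum(positive) / len(positive))
        else:
            consensus[node] = local_prob[node]
        return consensus[node]

    bottom_up(root)

    predicted = set()

    def top_down(node):
        # True path rule: a class is predicted only if all ancestors are.
        if consensus[node] > threshold:
            predicted.add(node)
            for ch in children.get(node, []):
                top_down(ch)

    top_down(root)
    return consensus, predicted


# Toy hierarchy with FunCat-style codes (invented for illustration only).
children = {"01": ["01.01", "01.02"], "01.01": ["01.01.03"]}
local_prob = {"01": 0.4, "01.01": 0.9, "01.02": 0.2, "01.01.03": 0.8}
scores, labels = weighted_tpr(children, local_prob, root="01", w=0.5)
```

In this toy case the root's local probability (0.4) is below the threshold, but the confident descendants lift its consensus score above it, so the whole path is predicted. Setting `w=1.0` removes that upward flow and yields no positive predictions, which shows how a single weight can trade recall against precision.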