25 research outputs found
Chi-square-based scoring function for categorization of MEDLINE citations
Objectives: Text categorization has been used in biomedical informatics for
identifying documents containing relevant topics of interest. We developed a
simple method that uses a chi-square-based scoring function to determine the
likelihood of MEDLINE citations containing genetically relevant topics. Methods: Our
procedure requires construction of a genetic and a nongenetic domain document
corpus. We used MeSH descriptors assigned to MEDLINE citations for this
categorization task. We compared frequencies of MeSH descriptors between the two
corpora using the chi-square test. A MeSH descriptor was considered to be a
positive indicator if its relative observed frequency in the genetic domain
corpus was greater than its relative observed frequency in the nongenetic
domain corpus. The output of the proposed method is a list of scores for all
the citations, with the highest score given to those citations containing MeSH
descriptors typical for the genetic domain. Results: Validation was done on a
set of 734 manually annotated MEDLINE citations. It achieved predictive
accuracy of 0.87 with 0.69 recall and 0.64 precision. We evaluated the method
by comparing it to three machine learning algorithms (support vector machines,
decision trees, naïve Bayes). Although the differences were not statistically
significant, the results showed that our chi-square scoring performs as
well as the compared machine learning algorithms. Conclusions: We suggest that the
chi-square scoring is an effective solution to help categorize MEDLINE
citations. The algorithm is implemented in the BITOLA literature-based
discovery support system as a preprocessor for the gene symbol disambiguation
process. Comment: 34 pages, 2 figures
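The scoring idea in this abstract can be illustrated with a minimal Python sketch. It is not the BITOLA implementation: the abstract does not say how descriptor statistics are combined into a citation score, so summing signed chi-square values is an assumption, and the names `chi_square_2x2`, `descriptor_weights`, and `score_citation` are hypothetical.

```python
from collections import Counter


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def descriptor_weights(genetic_docs, nongenetic_docs):
    """Signed chi-square weight per descriptor (both corpora must be non-empty).

    Each document is a list of MeSH descriptors. The sign is positive when the
    descriptor is relatively more frequent in the genetic corpus.
    """
    n_gen, n_non = len(genetic_docs), len(nongenetic_docs)
    gen_counts = Counter(d for doc in genetic_docs for d in set(doc))
    non_counts = Counter(d for doc in nongenetic_docs for d in set(doc))
    weights = {}
    for desc in set(gen_counts) | set(non_counts):
        a, c = gen_counts[desc], non_counts[desc]
        chi2 = chi_square_2x2(a, n_gen - a, c, n_non - c)
        sign = 1.0 if a / n_gen > c / n_non else -1.0
        weights[desc] = sign * chi2
    return weights


def score_citation(descriptors, weights):
    """Assumed aggregation: sum of the signed weights of a citation's descriptors."""
    return sum(weights.get(d, 0.0) for d in set(descriptors))


# Toy corpora, invented purely for illustration.
genetic = [["Genes", "Humans"], ["Genes", "Mutation"], ["Genes"]]
nongenetic = [["Humans", "Surgery"], ["Surgery"], ["Humans"]]
weights = descriptor_weights(genetic, nongenetic)
```

Citations would then be ranked by `score_citation`, so that those carrying descriptors typical of the genetic corpus come first.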
Data Mining: Concepts and Trends
Nowadays, data mining (DM) is increasingly capturing the attention of businesses. It is still
uncommon to hear phrases such as “we should segment our customers using DM tools”, “DM
will increase customer satisfaction”, or “the competition is using DM to gain market share”.
However, everything suggests that sooner rather than later data mining will be used by society at least as
widely as statistics is today. So what is data mining, and what benefits does it bring?
How can this technology help solve the everyday problems of businesses and society in
general? What technologies lie behind data mining? What is the life cycle of a typical
data mining project? This article attempts to answer these questions through an introduction to data
mining: its definition, examples of problems that can be solved with data mining, data mining
tasks, the techniques used, and finally challenges and trends in data mining
ERROR ESTIMATION METHODS
SLIDE PRESENTATION ONLY
Comparing predictions made by a prediction model, clinical score, and physicians: Pediatric asthma exacerbations in the emergency department
Background: Asthma exacerbations are one of the most common medical reasons for children to be brought to the hospital emergency department (ED). Various prediction models have been proposed to support diagnosis of exacerbations and evaluation of their severity. Objectives: First, to evaluate prediction models constructed from data using machine learning techniques and to select the best performing model. Second, to compare predictions from the selected model with predictions from the Pediatric Respiratory Assessment Measure (PRAM) score, and predictions made by ED physicians.
Design: A two-phase study conducted in the ED of an academic pediatric hospital. In phase 1, data collected prospectively using paper forms were used to construct and evaluate five prediction models, and the best performing model was selected. In phase 2, data collected prospectively using a mobile system were used to compare the predictions of the selected prediction model with those from PRAM and ED physicians.
Measurements: Area under the receiver operating characteristic curve and accuracy in phase 1; accuracy, sensitivity, specificity, positive and negative predictive values in phase 2.
Results: In phase 1, prediction models were derived from a data set of 240 patients and evaluated using 10-fold cross-validation. A naive Bayes (NB) model demonstrated the best performance and was selected for phase 2. Evaluation in phase 2 was conducted on data from 82 patients. Predictions made by the NB model were less accurate than those of the PRAM score and physicians (accuracy of 70.7%, 73.2% and 78.0% respectively); however, according to McNemar’s test, it is not possible to conclude that the differences between predictions are statistically significant.
Conclusion: Both the PRAM score and the NB model were less accurate than physicians. The NB model can handle incomplete patient data and as such may complement the PRAM score. However, it requires further research to improve its accuracy
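The paired comparison mentioned in the results can be sketched as follows. This is an illustrative exact (binomial) form of McNemar's test in Python. The abstract does not state which variant the authors used, so the exact form is an assumption, and `mcnemar_exact` is a hypothetical helper name.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar test on paired correctness flags.

    correct_a[i] and correct_b[i] say whether predictors A and B were right
    on patient i. Only discordant pairs (one right, one wrong) carry
    information about a difference between the two predictors.
    """
    b = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    c = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis each discordant pair favors A with probability 0.5.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Invented flags for six patients, for illustration only.
model_ok = [True, True, True, False, False, True]
physician_ok = [True, False, False, True, False, True]
b, c, p_value = mcnemar_exact(model_ok, physician_ok)
```

With few discordant pairs the p-value stays large, which is one way a difference in raw accuracy can fail to reach statistical significance on a small sample.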
Application of Heterogeneous Multiclassifiers to the Recognition of Protein Structural Classes
Protein fold recognition is one of the main open problems in molecular biology and an important approach to discovering protein structures without relying on the similarity of their sequences. In this context, computational tools, especially Machine Learning (ML) techniques, have become essential alternatives for addressing this problem, given the large volume of data involved. This work presents the results obtained by applying different heterogeneous multiclassifier systems (Stacking, StackingC and Vote), using distinct types of base classifiers (Decision Trees, K-Nearest Neighbors, Naive Bayes, Support Vector Machines and Neural Networks), to the task of predicting protein structural classes
Hierarchical cost-sensitive algorithms for genome-wide gene function prediction
In this work we propose new ensemble methods for the hierarchical classification of gene functions. Our methods exploit the hierarchical relationships between the classes in different ways: each ensemble node is trained “locally”, according to its position in the hierarchy; moreover, in the evaluation phase the set of predicted annotations is built so
as to minimize a global loss function defined over the hierarchy. We also
address the problem of sparsity of annotations by introducing a cost-
sensitive parameter that allows control of the precision-recall trade-off.
Experiments with the model organism S. cerevisiae, using the FunCat
taxonomy and 7 biomolecular data sets, reveal a significant advantage of
our techniques over “flat” and cost-insensitive hierarchical ensembles
Random subspace ensembles for the bio-molecular diagnosis of tumors.
The bio-molecular diagnosis of malignancies, based on DNA
microarray biotechnologies, is a difficult learning task, because of the
high dimensionality and low cardinality of the data. Many supervised
learning techniques, among them support vector machines (SVMs), have
been tried, often together with feature selection methods to reduce the
dimensionality of the data. In this paper we investigate an alternative
approach based on random subspace ensemble methods. The high dimensionality
of the data is reduced by randomly sampling subsets of
features (gene expression levels), and accuracy is improved by aggregating
the resulting base classifiers. Our experiments, in the area of the
diagnosis of malignancies at bio-molecular level, show the effectiveness
of the proposed approach
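The random subspace idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a nearest-centroid rule stands in for the base classifier (the abstract does not specify one), aggregation is by majority vote, and all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def train_centroids(X, y, features):
    """Per-class mean vector, restricted to the sampled feature indices."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, f in enumerate(features):
            acc[i] += row[f]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict_centroid(centroids, row, features):
    """Label of the closest centroid (squared Euclidean distance)."""
    def dist(center):
        return sum((row[f] - center[i]) ** 2 for i, f in enumerate(features))
    return min(centroids, key=lambda label: dist(centroids[label]))


def fit_random_subspace(X, y, n_models=25, subspace_size=2, seed=0):
    """Train each base learner on a random subset of the features."""
    rng = random.Random(seed)
    n_features = len(X[0])
    ensemble = []
    for _ in range(n_models):
        features = rng.sample(range(n_features), subspace_size)
        ensemble.append((features, train_centroids(X, y, features)))
    return ensemble


def predict_ensemble(ensemble, row):
    """Aggregate the base learners by majority vote."""
    votes = Counter(predict_centroid(c, row, f) for f, c in ensemble)
    return votes.most_common(1)[0][0]


# Toy "expression levels", invented for illustration.
X = [[0.0, 0.1, 0.0, 0.2], [0.1, 0.0, 0.2, 0.0],
     [5.0, 5.1, 5.0, 4.9], [5.1, 5.0, 4.8, 5.0]]
y = ["normal", "normal", "tumor", "tumor"]
ensemble = fit_random_subspace(X, y)
```

Each base learner sees only `subspace_size` features, which is how the approach reduces dimensionality without an explicit feature selection step.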
Weighted True Path Rule: a multilabel hierarchical algorithm for gene function prediction
The genome-wide hierarchical classification of gene functions, using biomolecular data from high-throughput biotechnologies, is one of the central topics in bioinformatics and functional genomics. In this paper we present a multilabel hierarchical algorithm inspired by the “true path rule” that governs both the Gene Ontology and the Functional Catalogue (FunCat). In particular we propose an enhanced version of
the True Path Rule (TPR) algorithm, by which we can control the flow of information between the classifiers of the hierarchical ensemble, making it possible to tune the precision/recall characteristics of the overall hierarchical classification system. Results with the model organism S. cerevisiae show that the proposed method significantly improves on the basic version of the TPR algorithm, as well as on the Hierarchical Top-down and Flat ensembles
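As a rough illustration of how a weight could control the flow of information in such a hierarchical ensemble, here is a Python sketch. The abstract does not give the update rule, so everything below is an assumption rather than the authors' algorithm: the hierarchy is taken to be a tree with a single root, a node's consensus score mixes its local score with the mean score of its positive children through a hypothetical weight `w`, and a negative decision at a node is propagated to its descendants.

```python
def weighted_tpr(children, local_scores, root, w=0.5, threshold=0.5):
    """Bottom-up weighted consensus followed by top-down consistency.

    children: dict mapping a class to the list of its child classes.
    local_scores: dict mapping each class to its local classifier score.
    w: assumed weight on the local score; 1 - w goes to the positive children.
    """
    consensus = {}

    def bottom_up(node):
        kids = children.get(node, [])
        for kid in kids:
            bottom_up(kid)
        positive = [consensus[k] for k in kids if consensus[k] > threshold]
        if positive:
            consensus[node] = (w * local_scores[node]
                               + (1 - w) * sum(positive) / len(positive))
        else:
            consensus[node] = local_scores[node]

    labels = {}

    def top_down(node, parent_positive):
        # A class can be positive only if its parent is positive.
        labels[node] = parent_positive and consensus[node] > threshold
        for kid in children.get(node, []):
            top_down(kid, labels[node])

    bottom_up(root)
    top_down(root, True)
    return consensus, labels


# Toy FunCat-like tree with invented scores.
tree = {"01": ["01.01", "01.02"], "01.01": ["01.01.03"]}
scores = {"01": 0.4, "01.01": 0.6, "01.02": 0.2, "01.01.03": 0.9}
consensus, labels = weighted_tpr(tree, scores, "01", w=0.5)
```

In this toy example, lowering `w` lets confident descendants rescue a weakly scored ancestor, which tends to favor recall, while `w=1.0` ignores the children and makes the top-down pass more restrictive.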