
    Chi-square-based scoring function for categorization of MEDLINE citations

    Objectives: Text categorization has been used in biomedical informatics to identify documents on relevant topics of interest. We developed a simple method that uses a chi-square-based scoring function to estimate the likelihood that a MEDLINE citation covers a genetically relevant topic. Methods: Our procedure requires the construction of a genetic and a nongenetic domain document corpus. We used the MeSH descriptors assigned to MEDLINE citations for this categorization task. We compared the frequencies of MeSH descriptors between the two corpora by applying the chi-square test. A MeSH descriptor was considered a positive indicator if its relative observed frequency in the genetic domain corpus was greater than its relative observed frequency in the nongenetic domain corpus. The output of the proposed method is a list of scores for all citations, with the highest scores given to citations containing MeSH descriptors typical of the genetic domain. Results: Validation was done on a set of 734 manually annotated MEDLINE citations. The method achieved a predictive accuracy of 0.87, with 0.69 recall and 0.64 precision. We evaluated the method by comparing it to three machine learning algorithms (support vector machines, decision trees, naïve Bayes). Although the differences were not statistically significant, the results showed that our chi-square scoring performs as well as the compared machine learning algorithms. Conclusions: We suggest that chi-square scoring is an effective aid for categorizing MEDLINE citations. The algorithm is implemented in the BITOLA literature-based discovery support system as a preprocessor for the gene symbol disambiguation process. Comment: 34 pages, 2 figures
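
The scoring scheme described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: corpus representation (sets of MeSH descriptors per citation) and function names are assumptions.

```python
from collections import Counter

def chi_square(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom

def descriptor_scores(genetic_docs, nongenetic_docs):
    """Score each MeSH descriptor; keep it as a positive indicator only when
    its relative frequency is higher in the genetic corpus."""
    g = Counter(d for doc in genetic_docs for d in doc)
    ng = Counter(d for doc in nongenetic_docs for d in doc)
    n_g, n_ng = len(genetic_docs), len(nongenetic_docs)
    scores = {}
    for desc in set(g) | set(ng):
        a, c = g[desc], ng[desc]        # documents containing the descriptor
        b, d = n_g - a, n_ng - c        # documents lacking it
        if a / n_g > c / n_ng:          # positive indicator for the genetic domain
            scores[desc] = chi_square(a, b, c, d)
    return scores

def score_citation(mesh_descriptors, scores):
    """Citation score: sum of chi-square scores of its positive indicators."""
    return sum(scores.get(d, 0.0) for d in mesh_descriptors)
```

Citations rich in genetic-domain descriptors accumulate high scores, while descriptors more frequent in the nongenetic corpus contribute nothing.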

    Exploring the KEER knowledge landscape over the past decade

    The aim of this paper was to systematically explore the knowledge landscape of papers presented at KEER conferences over the last decade. We collected all papers published in conference proceedings between 2010 and 2020. We (i) used a text mining pipeline to extract, clean, and normalize keywords from the Title and Abstract fields, and (ii) created a co-occurrence network reflecting the relationships between keywords. The network was then characterized at different levels of granularity (static analysis vs. time-slice analysis, and whole-network vs. node-level analysis). The exploratory analysis showed a stable expansion of the network over time. The cluster structure revealed several groups of keywords that did not change over time and that reflected both domain-specific and method-specific topics of research in Kansei engineering.
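
Step (ii), building a keyword co-occurrence network, can be sketched as below. This is a minimal illustration of the general technique, not the paper's pipeline; the input format (one keyword list per paper) is an assumption.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_network(papers):
    """Build a weighted keyword co-occurrence network.

    Nodes are keywords (weighted by document frequency); an edge's weight
    is the number of papers in which the two keywords appear together."""
    nodes = Counter()
    edges = Counter()
    for keywords in papers:
        ks = sorted(set(keywords))          # dedupe within a paper, fix edge order
        nodes.update(ks)
        edges.update(combinations(ks, 2))   # every unordered keyword pair
    return nodes, edges
```

The resulting edge counter can be handed directly to a graph library (e.g. as a weighted edge list) for the clustering and time-slice analyses described above.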

    Quantifying the consistency of scientific databases

    Science is a social process with far-reaching impact on our modern society. In recent years, for the first time, we are able to scientifically study science itself. This is enabled by the massive amounts of data on scientific publications that are increasingly becoming available. The data are contained in several databases, such as Web of Science or PubMed, maintained by various public and private entities. Unfortunately, these databases are not always consistent, which considerably hinders this study. Relying on the powerful framework of complex networks, we conduct a systematic analysis of the consistency among six major scientific databases. We find that identifying a single "best" database is far from easy. Nevertheless, our results indicate appreciable differences in the mutual consistency of the databases, which we interpret as recipes for future bibliometric studies. Comment: 20 pages, 5 figures, 4 tables
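
One simple way to quantify mutual consistency between databases is the Jaccard overlap of their record sets. This is only an illustrative proxy, and is an assumption on my part: the paper itself works with a complex-network framework rather than plain set overlap.

```python
def jaccard(a, b):
    """Jaccard index of two record sets (e.g. sets of DOIs)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def pairwise_consistency(databases):
    """Mutual-consistency matrix for named databases, each given as a set
    of record identifiers; 1.0 means identical coverage."""
    names = sorted(databases)
    return {(x, y): jaccard(databases[x], databases[y])
            for x in names for y in names}
```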

    Exploring the Knowledge Landscape of Escherichia coli Research: A Scientometric Overview

    Escherichia coli (E. coli) has the hallmark of being the most extensively studied organism. This is shown by the thousands of articles published since its discovery by T. Escherich in 1885. On the other hand, very little is known about the intellectual landscape of E. coli research: for example, how the trend of publications on E. coli has evolved over time, and which scientific topics have been the focus of researchers' interest. In this chapter, we present the results of a large-scale scientometric analysis of about 100,000 bibliographic records from PubMed over the period 1981–2021. To examine the evolution of research topics over time, we divided the dataset into four intervals of equal width. We created co-occurrence networks from keywords indexed in the Medical Subject Headings vocabulary and systematically examined the structure and evolution of scientific knowledge about E. coli. The extracted research topics were visualized in strategic diagrams and qualitatively characterized in terms of their maturity and cohesion.
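
A strategic diagram places each keyword cluster by two coordinates: centrality (links to the rest of the network, read as maturity/importance) and density (strength of internal links, read as cohesion). A minimal sketch of computing those coordinates from a weighted edge list, with illustrative names and a simplified definition of both measures:

```python
def strategic_coordinates(edges, clusters):
    """Coordinates for a strategic diagram.

    edges:    {(u, v): weight} weighted co-occurrence edges
    clusters: {name: set_of_keywords}
    Returns {name: (centrality, density)} where centrality is the total
    weight of edges leaving the cluster and density is the mean weight
    of edges internal to it."""
    coords = {}
    for name, members in clusters.items():
        internal = [w for (u, v), w in edges.items()
                    if u in members and v in members]
        external = sum(w for (u, v), w in edges.items()
                       if (u in members) != (v in members))  # exactly one endpoint inside
        density = sum(internal) / len(internal) if internal else 0.0
        coords[name] = (external, density)
    return coords
```

Plotting clusters on these two axes, split at the medians, yields the usual four quadrants (motor, basic, niche, and emerging/declining themes).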

    Meta-analysis: Its role in psychological methodology

    Meta-analysis refers to the statistical analysis of a large collection of independent observations for the purpose of integrating results. The main objectives of this article are to define meta-analysis as a method of data integration, to draw attention to some particularities of its use, and to encourage researchers to use meta-analysis in their work. The benefits of meta-analysis include more effective exploitation of existing data from independent sources and a contribution to more powerful domain knowledge. It may also serve as a support tool for generating new research hypotheses. The idea of combining results of independent studies addressing the same research question dates back to the sixteenth century. Meta-analysis was reinvented in 1976 by Glass, to refute the conclusion of an eminent colleague, Eysenck, that psychotherapy was essentially ineffective. We review some major historical landmarks of meta-analysis and its statistical background. We present in great detail the concept of the effect size measure, the problem of heterogeneity, and the two models used to combine individual effect sizes (the fixed and the random effects model). Two visualization techniques, the forest plot and the funnel plot, are demonstrated. We developed RMetaWeb, a simple and fast web application to conduct meta-analysis online. RMetaWeb is the first web-based meta-analysis application and is based entirely on the R software environment for statistical computing and graphics.
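
The two pooling models mentioned above can be sketched in a few lines: inverse-variance weighting for the fixed-effect model, and the DerSimonian-Laird estimate of between-study variance for the random-effects model. This is a standard textbook formulation, not code from RMetaWeb.

```python
def fixed_effect(effects, variances):
    """Inverse-variance weighted fixed-effect pooled estimate and its variance."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    return pooled, 1.0 / sum(weights)

def dersimonian_laird_tau2(effects, variances):
    """Between-study variance tau^2 (heterogeneity), DerSimonian-Laird method."""
    w = [1.0 / v for v in variances]
    pooled, _ = fixed_effect(effects, variances)
    q = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - df) / c)                                 # truncate at zero

def random_effects(effects, variances):
    """Random-effects pooled estimate: add tau^2 to each within-study variance."""
    tau2 = dersimonian_laird_tau2(effects, variances)
    return fixed_effect(effects, [v + tau2 for v in variances])[0]
```

When the studies are homogeneous (tau^2 estimated as 0), the two models coincide; with heterogeneity, the random-effects weights are flatter and the pooled estimate moves toward the unweighted mean.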

    Item Response Theory Modeling for Microarray Gene Expression Data

    The high dimensionality of global gene expression profiles, where the number of variables (genes) is very large compared to the number of observations (samples), presents challenges that affect the generalizability and applicability of microarray analysis. Latent variable modeling offers a promising approach to dealing with high-dimensional microarray data. The latent variable model is based on a few latent variables that capture most of the gene expression information. Here, we describe how to accomplish a reduction in dimension by a latent variable methodology, which can greatly reduce the number of features used to characterize microarray data. We propose a general latent variable framework for prediction of predefined classes of samples using gene expression profiles from microarray experiments. The framework consists of (i) selection of a smaller number of genes that are most differentially expressed between samples, (ii) dimension reduction using hierarchical clustering, where each cluster partition is identified as a latent variable, (iii) discretization of the gene expression matrix, (iv) fitting the Rasch item response model to the genes in each cluster partition to estimate the expression of the latent variable, and (v) construction of a prediction model with latent variables as covariates to study the relationship between latent variables and phenotype. Two different microarray data sets are used to illustrate the general framework of the approach. We show that the predictive performance of our method is comparable to the current best approach based on an all-gene space. The method is general and can be applied to other high-dimensional data problems.
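
Steps (iii) and (iv) can be illustrated roughly as below. The binarization follows a common convention (above/below the gene's median); the latent score is a deliberately crude stand-in for the Rasch person parameter (a smoothed logit of the proportion of cluster genes "on" in a sample), not a real item-response fit.

```python
import math
from statistics import median

def discretize(expr_matrix):
    """Binarize expression: per gene (row), 1 if above that gene's median
    across samples, else 0."""
    meds = [median(row) for row in expr_matrix]
    return [[1 if x > m else 0 for x in row]
            for row, m in zip(expr_matrix, meds)]

def latent_score(binary_cluster, sample_idx):
    """Crude latent-variable score for one sample: smoothed logit of the
    fraction of the cluster's genes that are 'on' in that sample."""
    on = sum(row[sample_idx] for row in binary_cluster)
    p = (on + 0.5) / (len(binary_cluster) + 1.0)   # keep p away from 0 and 1
    return math.log(p / (1 - p))
```

In the full framework these scores (one per cluster and sample) would be replaced by proper Rasch estimates and then used as covariates in the phenotype prediction model of step (v).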

    Knowledge discovery and data mining in psychology: Using decision trees to predict the Sensation Seeking Scale score

    Knowledge discovery from data is an interdisciplinary research field combining technology and knowledge from the domains of statistics, databases, machine learning, and artificial intelligence. Data mining is the most important part of the knowledge discovery process. The objective of this paper is twofold. The first objective is to point out the qualitative shift in research methodology due to evolving knowledge discovery technology. The second objective is to introduce the technique of decision trees to psychological domain experts. We illustrate the utility of decision trees on a prediction model of sensation seeking. Prediction of the Zuckerman's Sensation Seeking Scale (SSS-V) score was based on the bundle of Eysenck's personality traits and Pavlovian temperament properties. Predictors were operationalized on the basis of the Eysenck Personality Questionnaire (EPQ) and the Slovenian adaptation of the Pavlovian Temperament Survey (SVTP). The standard statistical technique of multiple regression was used as a baseline method to evaluate the decision tree methodology. The multiple regression model was the most accurate in terms of predictive accuracy. However, decision trees could serve as a powerful general method for initial exploratory data analysis, data visualization, and knowledge discovery.
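
For readers new to the technique, a regression tree of the kind used for score prediction can be sketched in a few lines: recursively split on the feature/threshold pair that most reduces the outcome variance, and predict the mean at each leaf. This is a generic minimal implementation, not the software used in the paper.

```python
from statistics import mean, pvariance

def grow(X, y, depth=2):
    """Grow a regression tree: choose the split minimizing the summed
    within-child variance; leaves predict the mean outcome."""
    if depth == 0 or len(set(y)) == 1:
        return mean(y)                                  # leaf
    best = None
    for j in range(len(X[0])):                          # each feature
        for t in sorted({row[j] for row in X})[:-1]:    # each candidate threshold
            li = [i for i, row in enumerate(X) if row[j] <= t]
            ri = [i for i, row in enumerate(X) if row[j] > t]
            cost = (len(li) * pvariance([y[i] for i in li]) +
                    len(ri) * pvariance([y[i] for i in ri]))
            if best is None or cost < best[0]:
                best = (cost, j, t, li, ri)
    if best is None:                                    # no valid split left
        return mean(y)
    _, j, t, li, ri = best
    return (j, t,
            grow([X[i] for i in li], [y[i] for i in li], depth - 1),
            grow([X[i] for i in ri], [y[i] for i in ri], depth - 1))

def predict(tree, row):
    """Descend split nodes (feature, threshold, left, right) to a numeric leaf."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree
```

Unlike a regression equation, the fitted tree reads as a sequence of explicit if-then rules over the predictors, which is what makes it attractive for exploratory analysis.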