600 research outputs found

    Enhancing Twitter Data Analysis with Simple Semantic Filtering: Example in Tracking Influenza-Like Illnesses

    Full text link
    Systems that exploit publicly available user generated content such as Twitter messages have been successful in tracking seasonal influenza. We developed a novel filtering method for Influenza-Like-Illnesses (ILI)-related messages using 587 million messages from Twitter micro-blogs. We first filtered messages based on syndrome keywords from the BioCaster Ontology, an extant knowledge model of laymen's terms. We then filtered the messages according to semantic features such as negation, hashtags, emoticons, humor and geography. The data covered 36 weeks for the US 2009 influenza season from 30th August 2009 to 8th May 2010. Results showed that our system achieved the highest Pearson correlation coefficient of 98.46% (p-value<2.2e-16), an improvement of 3.98% over the previous state-of-the-art method. The results indicate that simple NLP-based enhancements to existing approaches to mine Twitter data can increase the value of this inexpensive resource.Comment: 10 pages, 5 figures, IEEE HISB 2012 conference, Sept 27-28, 2012, La Jolla, California, U

    Doubly Optimized Calibrated Support Vector Machine (DOC-SVM): an algorithm for joint optimization of discrimination and calibration.

    Get PDF
    Historically, probabilistic models for decision support have focused on discrimination, e.g., minimizing the ranking error of predicted outcomes. Unfortunately, these models ignore another important aspect, calibration, which indicates the magnitude of correctness of model predictions. Using discrimination and calibration simultaneously can be helpful for many clinical decisions. We investigated tradeoffs between these goals, and developed a unified maximum-margin method to handle them jointly. Our approach called, Doubly Optimized Calibrated Support Vector Machine (DOC-SVM), concurrently optimizes two loss functions: the ridge regression loss and the hinge loss. Experiments using three breast cancer gene-expression datasets (i.e., GSE2034, GSE2990, and Chanrion's datasets) showed that our model generated more calibrated outputs when compared to other state-of-the-art models like Support Vector Machine (p=0.03, p=0.13, and p&lt;0.001) and Logistic Regression (p=0.006, p=0.008, and p&lt;0.001). DOC-SVM also demonstrated better discrimination (i.e., higher AUCs) when compared to Support Vector Machine (p=0.38, p=0.29, and p=0.047) and Logistic Regression (p=0.38, p=0.04, and p&lt;0.0001). DOC-SVM produced a model that was better calibrated without sacrificing discrimination, and hence may be helpful in clinical decision making

    Grid multi-category response logistic models.

    Get PDF
    BackgroundMulti-category response models are very important complements to binary logistic models in medical decision-making. Decomposing model construction by aggregating computation developed at different sites is necessary when data cannot be moved outside institutions due to privacy or other concerns. Such decomposition makes it possible to conduct grid computing to protect the privacy of individual observations.MethodsThis paper proposes two grid multi-category response models for ordinal and multinomial logistic regressions. Grid computation to test model assumptions is also developed for these two types of models. In addition, we present grid methods for goodness-of-fit assessment and for classification performance evaluation.ResultsSimulation results show that the grid models produce the same results as those obtained from corresponding centralized models, demonstrating that it is possible to build models using multi-center data without losing accuracy or transmitting observation-level data. Two real data sets are used to evaluate the performance of our proposed grid models.ConclusionsThe grid fitting method offers a practical solution for resolving privacy and other issues caused by pooling all data in a central site. The proposed method is applicable for various likelihood estimation problems, including other generalized linear models

    Ranking Medical Subject Headings using a factor graph model.

    Get PDF
    Automatically assigning MeSH (Medical Subject Headings) to articles is an active research topic. Recent work demonstrated the feasibility of improving the existing automated Medical Text Indexer (MTI) system, developed at the National Library of Medicine (NLM). Encouraged by this work, we propose a novel data-driven approach that uses semantic distances in the MeSH ontology for automated MeSH assignment. Specifically, we developed a graphical model to propagate belief through a citation network to provide robust MeSH main heading (MH) recommendation. Our preliminary results indicate that this approach can reach high Mean Average Precision (MAP) in some scenarios

    Finding Related Publications: Extending the Set of Terms Used to Assess Article Similarity.

    Get PDF
    Recommendation of related articles is an important feature of the PubMed. The PubMed Related Citations (PRC) algorithm is the engine that enables this feature, and it leverages information on 22 million citations. We analyzed the performance of the PRC algorithm on 4584 annotated articles from the 2005 Text REtrieval Conference (TREC) Genomics Track data. Our analysis indicated that the PRC highest weighted term was not always consistent with the critical term that was most directly related to the topic of the article. We implemented term expansion and found that it was a promising and easy-to-implement approach to improve the performance of the PRC algorithm for the TREC 2005 Genomics data and for the TREC 2014 Clinical Decision Support Track data. For term expansion, we trained a Skip-gram model using the Word2Vec package. This extended PRC algorithm resulted in higher average precision for a large subset of articles. A combination of both algorithms may lead to improved performance in related article recommendations
    corecore