
    Predicting Disease Outbreaks using a Support Vector Machine model

    The purpose of this research is to create an efficient way of detecting disease outbreaks from news articles using Support Vector Machines (SVMs). An SVM is a supervised machine learning method used for classification and regression problems. The role of the SVM in this project is to “learn” to distinguish between news articles that may indicate a disease outbreak and those that do not. A series of health-related articles from the World Health Organization is parsed by a Java program, producing one feature vector per article. A basic negation detection algorithm is built into the parser to detect which words are negated, improving vector accuracy. An SVM model is trained on approximately 63% of the resulting vectors, while the remainder is used to test the model's accuracy. The findings of this research may be useful for other projects aiming to develop systems for predicting and preventing disease outbreaks.
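
    As a rough illustration of the pipeline the abstract describes (the original used a custom Java parser), the following Python sketch marks negated tokens, vectorizes articles, trains an SVM on roughly 63% of the data, and tests on the rest. The articles, labels, and negation cue list are invented for the example; scikit-learn is assumed.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    NEGATION_CUES = {"no", "not", "never", "without"}

    def mark_negations(text):
        """Prefix the token that follows a negation cue with 'NOT_'."""
        out, negate = [], False
        for tok in text.lower().split():
            out.append("NOT_" + tok if negate else tok)
            negate = tok in NEGATION_CUES
        return " ".join(out)

    # Toy stand-ins for parsed WHO articles: 1 = possible outbreak, 0 = not
    articles = [
        "confirmed cases of cholera reported in the region",
        "officials report an outbreak of avian influenza",
        "new ebola cases confirmed by health authorities",
        "no new cases of influenza were reported this week",
        "the vaccination campaign proceeded without incident",
        "routine inspection found no evidence of disease",
    ]
    labels = [1, 1, 1, 0, 0, 0]

    X = TfidfVectorizer(preprocessor=mark_negations).fit_transform(articles)
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, train_size=0.63, stratify=labels, random_state=0)

    model = SVC(kernel="linear").fit(X_train, y_train)
    print("test accuracy:", model.score(X_test, y_test))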

    Next Move in Movement Disorders (NEMO): Developing a computer-aided classification tool for hyperkinetic movement disorders

    Introduction: Our aim is to develop a novel approach to hyperkinetic movement disorder classification that combines clinical information, electromyography, accelerometry and video in a computer-aided classification tool. We see this as the next step towards rapid and accurate phenotype classification, the cornerstone of both the diagnostic and treatment process. Methods and analysis: The Next Move in Movement Disorders (NEMO) study is a cross-sectional study at the Expertise Centre Movement Disorders Groningen, University Medical Centre Groningen. It comprises patients with single and mixed phenotype movement disorders. Single phenotype groups will first include dystonia, myoclonus and tremor, followed by chorea, tics, ataxia and spasticity. Mixed phenotypes are myoclonus-dystonia, dystonic tremor, myoclonus ataxia and jerky/tremulous functional movement disorders. Groups will contain 20 patients, or 40 healthy participants. The gold standard for inclusion is interobserver agreement on the phenotype among three independent clinical experts. Electromyography, accelerometry and three-dimensional video data will be recorded while participants perform a set of movement tasks chosen by a team of specialists to elicit movement disorders. These data will serve as input for the machine learning algorithm. Labels for supervised learning are provided by the expert-based classification, allowing the algorithm to learn to predict the output label for new input data. Methods using manually engineered features based on existing clinical knowledge will be used, as well as deep learning methods that can detect relevant and possibly new features. Finally, we will employ visual analytics to visualise how the classification algorithm arrives at its decisions. Ethics and dissemination: Ethical approval has been obtained from the relevant local ethics committee. The NEMO study is designed to pioneer the application of machine learning to movement disorder classification. We expect to publish articles in multiple related fields of research, and patients will be informed of important results via patient associations and press releases.
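
    To make the "manually engineered features" idea concrete, here is a small, hypothetical Python example of one such feature: the dominant oscillation frequency of an accelerometry trace, which can help separate, for example, tremor from other hyperkinetic phenotypes. This is not the NEMO code; the sampling rate and signal are invented.

    import numpy as np

    def dominant_frequency(signal, fs):
        """Return the frequency (Hz) carrying the most spectral power."""
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
        power = np.abs(np.fft.rfft(signal - signal.mean())) ** 2
        return freqs[np.argmax(power)]

    fs = 100.0                                   # assumed sampling rate (Hz)
    t = np.arange(0, 5, 1 / fs)
    trace = np.sin(2 * np.pi * 5 * t) + 0.3 * np.random.randn(t.size)
    print(dominant_frequency(trace, fs))         # ~5.0 Hz, a tremor-like rate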

    Applications of Machine Learning Methods in Health Outcomes Research: Heart Failure in Women

    There is robust evidence that heart failure (HF) is associated with substantial mortality, morbidity, poor health-related quality of life, healthcare utilization, and economic burden. Previous research has revealed sex differences in the epidemiology, etiology, and disease burden of HF. However, research on HF among women, especially postmenopausal women, is limited. To fill this knowledge gap, the three related aims of this dissertation were to: (1) identify knowledge gaps in HF research among women, especially postmenopausal women, using unsupervised machine learning methods and big data (i.e., articles published in PubMed); (2) identify emerging predictors (i.e., polypharmacy and some prescription medications) of incident HF among postmenopausal women using supervised machine learning methods; and (3) identify leading predictors of HF-related emergency room use among postmenopausal women using supervised machine learning methods with data from a large commercial insurance claims database in the United States. In the first aim, non-negative matrix factorization algorithms were used to cluster HF articles by primary topic. Clusters were independently validated and labeled by three investigators familiar with HF research. The most understudied area among women was atrial fibrillation; among postmenopausal women, it was stress-induced cardiomyopathy. For the second and third aims, a retrospective cohort design and Optum’s de-identified Clinformatics® Data Mart Database (Optum, Eden Prairie, MN), a health insurance claims database, were used. In the second aim, multivariable logistic regression and three classification machine learning algorithms (cross-validated logistic regression (CVLR), random forest (RF), and eXtreme Gradient Boosting (XGBoost)) were used to identify predictors of incident HF among postmenopausal women. The associations of the leading predictors with incident HF were explored with an interpretable machine learning technique, SHapley Additive exPlanations (SHAP). The eight leading predictors of incident HF consistent across all models were: older age, arrhythmia, polypharmacy, Medicare, chronic obstructive pulmonary disease (COPD), coronary artery disease, hypertension, and chronic kidney disease. Some prescription medications, such as sulfonylureas and antibiotics other than fluoroquinolones, predicted incident HF in some machine learning algorithms. In the third aim, a random forest algorithm was used to identify predictors of HF-related emergency room use among postmenopausal women, and interpretable machine learning techniques were used to explain the associations of the leading predictors. The random forest algorithm had high predictive accuracy in the test dataset (area under the curve: 94%, sensitivity: 93%, specificity: 77%, accuracy: 81%). The number of HF-related emergency room visits at baseline, fragmented care, age, insurance type (Health Maintenance Organization), and coronary artery disease were the top five predictors of HF-related emergency room use among postmenopausal women. Partial dependence plots suggested positive associations of the top predictors with HF-related emergency room use, while insurance type was negatively associated with it. Findings from this dissertation suggest that machine learning algorithms can achieve comparable or better predictive accuracy than traditional statistical models.
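
    A hedged sketch of the modelling pattern described for the third aim: a random forest classifier evaluated by AUC and then explained with SHAP. Synthetic data stands in for the claims database, and the shap package is assumed; nothing here reproduces the dissertation's actual features or results.

    import numpy as np
    import shap
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 1000
    X = rng.normal(size=(n, 3))        # stand-ins for predictors, e.g. age
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    print("AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))

    # SHAP values quantify each feature's contribution to each prediction
    explainer = shap.TreeExplainer(rf)
    shap_values = explainer.shap_values(X_te)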

    Word2Vec-Based Self-Training Naive Bayes for Indonesian News Categorization

    News, one kind of information needed in daily life, is now widely available on the internet. News websites often categorize their articles by topic to help users access the news more easily, and document classification is widely used to do this automatically. However, the available labeled training data are often insufficient for a machine to create a good model, because data annotation requires considerable cost and time to produce a sufficient quantity of labeled examples. A semi-supervised algorithm addresses this problem by using both labeled and unlabeled data to create the classification model. This paper proposes a semi-supervised news classification system using the Self-Training Naive Bayes algorithm. The feature used in this text classification is the Word2Vec Skip-Gram model, which is widely used in computational linguistics and text mining research as a method of word representation. Word2Vec is used as a feature because it captures the semantic meaning of words, which benefits this classification task. The data used in this paper consist of 29,587 news documents from Indonesian online news websites. The Self-Training Naive Bayes algorithm achieved a highest F1-score of 94.17%.
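
    A minimal sketch of the approach, assuming gensim for the skip-gram Word2Vec model and scikit-learn's SelfTrainingClassifier for the self-training loop; documents are represented as the average of their word vectors, and unlabeled documents are marked with -1. The tiny corpus below is invented, and Gaussian Naive Bayes stands in for whichever Naive Bayes variant the paper used.

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.naive_bayes import GaussianNB
    from sklearn.semi_supervised import SelfTrainingClassifier

    docs = [["pemilu", "presiden", "partai"],   # politics, labeled
            ["gol", "pertandingan", "liga"],    # sport, labeled
            ["menteri", "kabinet", "partai"],   # unlabeled
            ["skor", "liga", "juara"]]          # unlabeled
    labels = np.array([0, 1, -1, -1])           # -1 marks unlabeled docs

    # sg=1 selects the Skip-Gram architecture mentioned in the abstract
    w2v = Word2Vec(sentences=docs, vector_size=50, sg=1, min_count=1, seed=0)

    def doc_vector(tokens):
        """Average the Word2Vec vectors of a document's tokens."""
        return np.mean([w2v.wv[t] for t in tokens], axis=0)

    X = np.vstack([doc_vector(d) for d in docs])
    clf = SelfTrainingClassifier(GaussianNB()).fit(X, labels)
    print(clf.predict(X))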

    Strategies for Improving Semi-automated Topic Classification of Media and Parliamentary Documents

    Since 1995, the techniques and capacity to store new electronic data and make it available to many people have become a common good. Since then, organizations such as research institutes, universities, libraries, and private companies (e.g., Google) have started to scan older documents and make them electronically available as well. This has generated many new research opportunities across academic disciplines, and the use of software to analyze large datasets has become an important part of research in the social sciences. Most academics rely on human-coded datasets, both in qualitative and quantitative research. However, with the increasing number of datasets and the complexity of the questions scholars pose to them, the quest for more efficient and effective methods is now on the agenda. One of the most common techniques of content analysis is the Boolean keyword search method. To find certain topics in a dataset, the researcher first creates a list of keywords combined with Boolean operators (AND, OR, etc.). The keys are usually grouped in families, and the entire list of keys and groups is called the ontology. The keywords are then searched in the dataset, retrieving all documents that contain them. The online newspaper database LexisNexis provides users with such a Boolean search method. However, Boolean keyword search is not always satisfying in terms of reliability and validity, which is why social scientists often rely on hand-coding. Two projects that do so are the Congressional Bills Project (www.congressionalbills.org) and the Policy Agendas Project (www.policyagendas.org). They developed a topic codebook and coded various sources, such as State of the Union speeches, bills, and newspaper articles. Continuously improving automated coding techniques and the increasing number of agenda-setting projects (especially in European countries), however, have made the use of automated coding software both a feasible option and a necessity.
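
    A toy Python illustration of the Boolean keyword method described above: an "ontology" of keyword groups where a document matches a topic if it contains any keyword of the group (OR) and all of its required keywords (AND). The topics and keywords are invented for the example.

    ontology = {
        "health": {"any": {"hospital", "disease", "medicare"}},
        "defence": {"any": {"army", "military"}, "all": {"budget"}},
    }

    def topics_for(document, ontology):
        """Return the topics whose keyword rules the document satisfies."""
        words = set(document.lower().split())
        matched = []
        for topic, rule in ontology.items():
            any_ok = not rule.get("any") or words & rule["any"]
            all_ok = rule.get("all", set()) <= words
            if any_ok and all_ok:
                matched.append(topic)
        return matched

    print(topics_for("The military budget was debated in parliament", ontology))
    # -> ['defence']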

    Counterfactual Risk Minimization: Learning from Logged Bandit Feedback

    We develop a learning principle and an efficient algorithm for batch learning from logged bandit feedback. This learning setting is ubiquitous in online systems (e.g., ad placement, web search, recommendation), where an algorithm makes a prediction (e.g., an ad ranking) for a given input (e.g., a query) and observes bandit feedback (e.g., user clicks on the presented ads). We first address the counterfactual nature of the learning problem through propensity scoring. Next, we prove generalization error bounds that account for the variance of the propensity-weighted empirical risk estimator. These constructive bounds give rise to the Counterfactual Risk Minimization (CRM) principle. We show how CRM can be used to derive a new learning method, Policy Optimizer for Exponential Models (POEM), for learning stochastic linear rules for structured output prediction. We present a decomposition of the POEM objective that enables efficient stochastic gradient optimization. POEM is evaluated on several multi-label classification problems, showing substantially improved robustness and generalization performance compared to the state of the art.
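
    A small numpy sketch of the quantities behind CRM, using invented data: the inverse-propensity-scored (IPS) estimate of a new policy's risk from logged bandit feedback, plus the variance-based penalty that distinguishes the CRM objective from plain empirical risk minimization. The penalty weight lambda is illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    losses = rng.uniform(size=n)           # loss observed for the logged action
    log_prop = rng.uniform(0.1, 1.0, n)    # logging policy's action propensity
    new_prop = rng.uniform(0.0, 1.0, n)    # new policy's prob. of that action

    weights = new_prop / log_prop          # importance weights
    ips_risk = np.mean(losses * weights)   # counterfactual risk estimate

    # CRM: penalize the risk estimate by its sampling variability
    lam = 0.5
    var = np.var(losses * weights, ddof=1)
    crm_objective = ips_risk + lam * np.sqrt(var / n)
    print(ips_risk, crm_objective)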