6,957 research outputs found

    PadChest: A large chest x-ray image dataset with multi-label annotated reports

    We present a labeled, large-scale, high-resolution chest x-ray dataset for the automated exploration of medical images along with their associated reports. This dataset includes more than 160,000 images obtained from 67,000 patients that were interpreted and reported by radiologists at Hospital San Juan (Spain) from 2009 to 2017, covering six different position views and additional information on image acquisition and patient demography. The reports were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. Of these reports, 27% were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms. The generated labels were then validated on an independent test set, achieving a 0.93 Micro-F1 score. To the best of our knowledge, this is one of the largest public chest x-ray databases suitable for training supervised models concerning radiographs, and the first to contain radiographic reports in Spanish. The PadChest dataset can be downloaded from http://bimcv.cipf.es/bimcv-projects/padchest/
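    The Micro-F1 score quoted in the abstract above pools true positives, false positives and false negatives across all reports and labels before computing F1. The following is a minimal sketch of that calculation for multi-label data; the function name and the example labels are hypothetical and are not taken from PadChest or its evaluation code.

```python
from typing import List, Set


def micro_f1(gold: List[Set[str]], predicted: List[Set[str]]) -> float:
    """Micro-averaged F1: pool TP, FP and FN over all items and labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = fn = 0
    for g, p in zip(gold, predicted):
        tp += len(g & p)  # labels predicted and present in gold
        fp += len(p - g)  # labels predicted but absent from gold
        fn += len(g - p)  # gold labels that were missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical labels for three reports (illustration only).
gold = [{"cardiomegaly"}, {"pneumonia", "pleural effusion"}, {"normal"}]
pred = [{"cardiomegaly"}, {"pneumonia"}, {"normal", "nodule"}]
print(micro_f1(gold, pred))
```

    Because the counts are pooled, frequent labels weigh more heavily in the score than rare ones, which matters for a taxonomy with many infrequent findings.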

    Research on Keywords Variations in Linguistics Based on TF-IDF and N-gram

    The rapid development of natural language processing (NLP) holds great promise for bridging the divide among languages. One of its main innovative applications is the use of large-scale data to explore the historical trend of a subject. However, since Saussure pioneered modern linguistics, relatively little research has examined how the field itself has varied over time in a way that comprehensively reveals linguistic trends. To trace the changes in linguistic research hotspots, we use a dataset of more than 30,000 linguistics-related publications with their titles from the Web of Science and apply NLP techniques to the data consisting of their keywords and publication years. We find that the co-occurrence relationships between keywords, n-grams, and their relationship with publication years can effectively present changes in linguistic research themes. This research is intended to provide further insights and new methods that can be applied in linguistics and related disciplines.
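    As a rough illustration of the kind of analysis the abstract above describes, the sketch below scores keyword n-grams per publication year with a plain TF-IDF weighting, treating each year as one document. It is an assumption-laden sketch, not the authors' pipeline: the function names, the whitespace tokenization and the sample records are all hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return the contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_by_year(records: Iterable[Tuple[int, str]], n: int = 2) -> Dict[int, Dict[str, float]]:
    """Score each n-gram per year: relative frequency times log inverse year frequency."""
    counts: Dict[int, Counter] = defaultdict(Counter)
    for year, text in records:
        counts[year].update(ngrams(text.lower().split(), n))
    num_years = len(counts)
    year_freq: Counter = Counter()  # in how many years each n-gram occurs
    for c in counts.values():
        year_freq.update(c.keys())
    scores: Dict[int, Dict[str, float]] = {}
    for year, c in counts.items():
        total = sum(c.values())
        scores[year] = {
            gram: (k / total) * math.log(num_years / year_freq[gram])
            for gram, k in c.items()
        }
    return scores


# Hypothetical (year, keyword string) records, not from the Web of Science data.
records = [
    (1990, "generative grammar"),
    (1990, "generative grammar theory"),
    (2020, "corpus linguistics"),
    (2020, "generative grammar"),
]
scores = tfidf_by_year(records, n=2)
print(scores[2020])
```

    An n-gram that appears in every year receives a score of zero under this weighting, so the highest-scoring terms for a year are the ones that distinguish it from other years.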

    Extracting Scales of Measurement Automatically from Biomedical Text with Special Emphasis on Comparative and Superlative Scales

    This thesis addresses the topic "Extracting Scales of Measurement Automatically from Biomedical Text with Special Emphasis on Comparative and Superlative Scales." Comparison sentences, considered as a critical part of scales of measurement, play a highly significant role in gathering information from a large number of biomedical research papers. A comparison sentence is defined as any sentence that contains two or more entities that are being compared. This thesis discusses several types of comparison sentences, such as gradable and non-gradable comparisons. The main goal is to extract comparison sentences automatically from the full text of biomedical articles. To that end, the thesis presents a Java program that analyzes biomedical text and identifies comparison sentences by matching the sentences in the text against 37 syntactic and semantic features. These features help to extract comparative sentences from any biomedical text. Two machine learning techniques are used with the 37 roles to assess the curated dataset. The results of this study are compared with earlier studies.

    Sentiment Polarity Classification of Comments on Korean News Articles Using Feature Reweighting

    Comments on online news articles generally contain subjective sentiments or opinions about the article. The original text of the article therefore has an important influence on recognizing and classifying the sentiment of its comments. Based on this observation, this thesis proposes a feature reweighting method that uses the original article text and a sentiment lexicon, and then uses that method to propose a binary sentiment classification method for comments on Korean news articles. The reweighting method uses several feature sets: the sentiment words contained in the comments, features related to the sentiment lexicon and the body of the news article, and the category information of the news article. The sentiment lexicon here is a Korean sentiment lexicon; because none had been made public, it was built from an existing English sentiment lexicon. The proposed binary sentiment classification uses machine learning. Machine learning generally requires a training corpus, and sentiment classification in particular requires a corpus tagged with positive or negative sentiment. Because no public Korean sentiment corpus was available either, the corpus was built directly. The machine learning methods used are Naïve Bayes, k-NN, and SVM, and the feature selection methods are Document Frequency, the χ² statistic, and Information Gain. The results confirm that the sentiment words contained in a comment and the body of the article it responds to are highly effective features for sentiment classification.
    Contents: Chapter 1 Introduction; Chapter 2 Related Works (2.1 Sentiment Classification; 2.2 Feature Weighting in Vector Space Model; 2.3 Feature Extraction and Selection; 2.4 Classifiers; 2.5 Accuracy Measures); Chapter 3 Feature Reweighting (3.1 Feature extraction in Korean; 3.2 Feature Reweighting Methods; 3.3 Examples of Feature Reweighting Methods); Chapter 4 Sentiment Polarity Classification System (4.1 Model Generation; 4.2 Sentiment Polarity Classification); Chapter 5 Data Preparation (5.1 Korean Sentiment Corpus; 5.2 Korean Sentiment Lexicon); Chapter 6 Experiments (6.1 Experimental Environment; 6.2 Experimental Results); Chapter 7 Conclusions and Future Works; Bibliography; Acknowledgments
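    The χ² statistic listed among the feature selection methods in the preceding abstract is commonly computed per term and class from a 2×2 contingency table of document counts. The sketch below uses that standard formulation; the function name, argument names and example counts are hypothetical and do not come from the thesis.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square score of a term for a class from a 2x2 document-count table.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: a row or column is empty
    return n * (a * d - c * b) ** 2 / denom


# Hypothetical counts: a term seen in 30 of 50 positive and 10 of 50 negative comments.
print(chi_square(30, 10, 20, 40))
```

    Terms are then ranked by this score and only the top-scoring ones are kept as features; a term distributed identically across both classes scores zero.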

    Empirical Methodology for Crowdsourcing Ground Truth

    The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, there is ambiguity in the data, as well as a multitude of perspectives of the information examples. We present an empirically derived methodology for efficiently gathering ground truth data in a diverse set of use cases covering a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth metrics that capture inter-annotator disagreement. We show that measuring disagreement is essential for acquiring a high quality ground truth. We achieve this by comparing the quality of the data aggregated with CrowdTruth metrics with majority vote, over a set of diverse crowdsourcing tasks: Medical Relation Extraction, Twitter Event Identification, News Event Extraction and Sound Interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators. Comment: in publication at the Semantic Web Journal

    Natural language processing

    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems (text summarization, information extraction, information retrieval, etc.), including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) evaluation of NLP systems.