15 research outputs found

    Detecting hospital-acquired infections: A document classification approach using support vector machines and gradient tree boosting

    Hospital-acquired infections pose a significant risk to patient health, and their surveillance adds to the workload of hospital staff. Our overall aim is to build a surveillance system that reliably detects all patient records that potentially include hospital-acquired infections, reducing the burden of having hospital staff check patient records manually. This study applies text classification with support vector machines and gradient tree boosting to the problem. Neither method had previously been applied to detecting hospital-acquired infections in Swedish patient records, and in our experiments both give encouraging results. Gradient tree boosting yields the best result, with 93.7 percent recall, 79.7 percent precision and an 85.7 percent F1 score when stemming is used. We show that simple preprocessing techniques and parameter tuning can lead to high recall (our aim when screening patient records) with appropriate precision for this task. Peer reviewed.

    Negation Scope Delimitation in Clinical Text Using Three Approaches: NegEx, PyConTextNLP and SynNeg

    Negation detection is a key component in clinical information extraction systems, as health record text contains reasoning in which the physician excludes different diagnoses by negating them. Many systems for negation detection rely on negation cues (e.g. not), but only a few studies have investigated whether the syntactic structure of the sentences can be used to determine the scope of these cues. In this paper, we compare three systems for negation detection in Swedish clinical text (NegEx, PyConTextNLP and SynNeg), which take different approaches to determining the scope of negation cues. NegEx uses the distance between the cue and the disease, PyConTextNLP relies on a list of conjunctions that limit the scope of a cue, and in SynNeg the boundaries of the sentence units, provided by a syntactic parser, limit the scope of the cues. The three systems produced similar results, detecting negation with an F-score of around 80%, but using a parser had advantages when handling longer, complex sentences or short sentences with contradictory statements.
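
    The two simpler scope rules described above can be illustrated with a toy sketch. This is not the code of NegEx or PyConTextNLP: the cue list, the conjunction list, the window size of 4 tokens and the English example sentence are all assumptions made for illustration (the study itself works on Swedish clinical text), and only single-token terms are handled.

```python
import re

# Hypothetical word lists, chosen only for this illustration.
NEGATION_CUES = {"no", "not", "without", "denies"}
SCOPE_BREAKERS = {"but", "however", "although"}


def tokenize(sentence):
    """Lowercase the sentence and split it into word tokens."""
    return re.findall(r"\w+", sentence.lower())


def negated_by_distance(tokens, term, window=4):
    """Distance rule: the term is negated if a cue occurs
    within `window` tokens before it."""
    for i, tok in enumerate(tokens):
        if tok == term:
            preceding = tokens[max(0, i - window):i]
            if any(t in NEGATION_CUES for t in preceding):
                return True
    return False


def negated_by_conjunction(tokens, term):
    """Conjunction rule: a cue's scope runs until the next
    scope-breaking conjunction."""
    in_scope = False
    for tok in tokens:
        if tok in NEGATION_CUES:
            in_scope = True
        elif tok in SCOPE_BREAKERS:
            in_scope = False
        elif tok == term and in_scope:
            return True
    return False


tokens = tokenize("Patient has no fever but reports cough")
for term in ("fever", "cough"):
    print(term,
          negated_by_distance(tokens, term),
          negated_by_conjunction(tokens, term))
```

    In this example the distance rule also marks "cough" as negated, because the cue "no" still falls inside the 4-token window, while the conjunction rule stops the scope at "but". A parser-based approach such as the one the abstract attributes to SynNeg would instead take the scope boundary from syntactic sentence units, which this sketch does not attempt.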

    Citation Estimation Method Using Abstracts of Research Data Articles: Using Abstracts of Scientific Data Articles as an Example

    With the arrival of the open science era, efforts to publish and reuse research data have become widespread. Looking ahead to the use of published or shared research data in interdisciplinary research, methods of writing abstracts that are easy to understand for researchers from other fields are also expected to be developed. Journals specializing in research data have emerged that, like conventional academic journals, follow the format of abstract, main text and references. In this study, we focus on the abstracts of "Scientific Data", a journal specializing in research data. We examine how each part of speech influences the reuse of research data through a multiple regression analysis relating the number of occurrences of each part of speech, the number of words and the number of keywords in the English abstract to the number of citations of the research data article. Based on these results, we set the explanatory variables to the numbers of nouns, verbs, other parts of speech, words and keywords in the abstract, set the objective variable to the number of citations, and used machine learning to develop a classifier that predicts the number of citations. We hope this will lead to a discussion of the points to consider when writing abstracts for published research data, with a view to future reuse of that data.

    Calculating Prevalence of Comorbidity and Comorbidity Combinations with Diabetes in Hospital Care in Sweden Using a Health Care Record Database

    We studied the prevalence of comorbidity in the Stockholm EPR corpus, which contains almost 600,000 patients from 900 clinics, using the ICD-10 codes assigned to each patient record. The proportion of patients with a valid ICD-10 code was 83.0%, and 41.5% of these had at least one comorbidity. The most frequent comorbidity combination with type 2 diabetes was essential hypertension (43.1%). Our approach seems feasible for large-scale analysis of diagnostic codes in EPR databases. Keywords: comorbidity, chronic disease, ICD-10, medical records systems, computerized medical record, Sweden

    Citation Estimation Method Using Abstracts of Research Data Articles: A Focus on Scientific Data

    With the trend of open science, efforts have been made to openly utilize research data. Considering the use of shared research data for interdisciplinary research, it is pertinent to develop a way of writing abstracts that is accessible to researchers in different fields. In this study, we focus on abstracts from Scientific Data, a journal specializing in research data. We examine the influence of each part of speech on the utilization of research data through multiple regression analysis of the number of occurrences of each part of speech, the number of words and index keywords in the abstract, and the number of citations of the research data article. Based on these results, we set the explanatory variables as the numbers of nouns, verbs, other parts of speech, words and index keywords in the abstract. We then developed a classifier to estimate the number of citations using machine learning. An analysis of the relationship between the number of citations and index keywords was also conducted.
    Kai Naoto, Yoshihisa Tomoki, et al. Citation Estimation Method Using Abstracts of Research Data Articles: A Focus on Scientific Data. Lecture Notes on Data Engineering and Communications Technologies 189, 1 (2023); https://doi.org/10.1007/978-3-031-46970-1_1

    Terminology Expansion with Prototype Embeddings: Extracting Symptoms of Urinary Tract Infection from Clinical Text

    Many natural language processing applications rely on the availability of domain-specific terminologies containing synonyms. To that end, semi-automatic methods for extracting additional synonyms of a given concept from corpora are useful, especially in low-resource domains and noisy genres such as clinical text, where nonstandard language use and misspellings are prevalent. In this study, prototype embeddings based on seed words were used to create representations for (i) specific urinary tract infection (UTI) symptoms and (ii) UTI symptoms in general. Four word embedding methods and two phrase detection methods were evaluated using clinical data from Karolinska University Hospital. The results show that prototype embeddings can effectively capture semantic information related to UTI symptoms. Using prototype embeddings for specific UTI symptoms led to the extraction of more symptom terms than using prototype embeddings for UTI symptoms in general. Overall, 142 additional UTI symptom terms were identified, an increase of more than 100% over the initial seed set. The mean average precision across all UTI symptoms was 0.51, and as high as 0.86 for one specific UTI symptom. This study provides a cost-effective solution to terminology expansion with small amounts of labeled data.
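
    As a rough illustration of the idea of a prototype embedding, the sketch below assumes that a prototype is the mean of the seed-word vectors and that candidate terms are ranked by cosine similarity to it. The abstract confirms neither choice, and the 3-dimensional vectors and the vocabulary are invented for the example rather than taken from the study.

```python
import math

# Invented toy vectors; real embeddings would be learned from a corpus.
EMBEDDINGS = {
    "dysuria": [0.9, 0.1, 0.0],
    "burning": [0.8, 0.2, 0.1],
    "stinging": [0.85, 0.15, 0.05],
    "fracture": [0.0, 0.1, 0.9],
    "cast": [0.1, 0.0, 0.8],
}


def prototype(seed_words):
    """Assumed definition: element-wise mean of the seed-word vectors."""
    vectors = [EMBEDDINGS[w] for w in seed_words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(seed_words):
    """Rank every non-seed word by similarity to the prototype."""
    proto = prototype(seed_words)
    scored = [(word, cosine(proto, vec))
              for word, vec in EMBEDDINGS.items()
              if word not in seed_words]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank_candidates(["dysuria", "burning"])
for word, score in ranked:
    print(f"{word}: {score:.2f}")
```

    With these toy vectors, "stinging" ranks first because it lies closest to the mean of the two seed words. In a semi-automatic workflow, a human reviewer would then accept or reject the top-ranked candidates.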
