7,761 research outputs found

    Identification of clinical characteristics of large patient cohorts through analysis of free text physician notes

    Thesis (S.M.)--Harvard-MIT Division of Health Sciences and Technology, 2005. Includes bibliographical references (p. 31-33).

    Background: A number of important applications in medicine and biomedical research, including quality-of-care surveillance and identification of prospective study subjects, require identification of large cohorts of patients with specific clinical characteristics. Currently used conventional techniques are either labor-intensive or imprecise, while natural language processing-based applications are relatively slow and expensive.

    Specific Aims: In this thesis we describe the design and formal evaluation of PACT, a suite of rapid, accurate, and easily portable software tools for identifying patients with specific clinical characteristics through analysis of the text of physician notes in the electronic medical record.

    Methods: The PACT algorithm is based on sentence-level semantic analysis. The major steps involve identification of word tags (e.g. the name of the disease, or of medications used exclusively to treat it) specific to the clinical characteristics in the sentences of the physician notes. Sentences containing a word tag together with a negative qualifier (e.g. "rule out diabetes") are excluded from consideration. PACT can also identify quantitative (e.g. blood pressure, height, weight) and semi-quantitative (e.g. compliance with medical treatment) clinical characteristics. PACT performance was evaluated against blinded manual chart review (the "gold standard") and currently used computational methods (analysis of billing data).

    Results: Evaluation of PACT demonstrated it to be rapid and highly accurate. PACT processed 6.5 to 8.8 × 10⁵ notes/hour (1.0 to 1.4 GB of text/hour). When compared to the gold standard of manual chart review, PACT sensitivity ranged (depending on the patient characteristic being extracted from the notes) from 74 to 100%, and specificity from 86 to 100%. The κ statistic for agreement between PACT and manual chart review ranged from 0.67 to 1.0 and in most cases exceeded 0.75, indicating excellent agreement. PACT accuracy substantially exceeded that of currently used techniques (billing data analysis). Finally, an index of patient non-compliance with physician recommendations computed by PACT was shown to correlate with the frequency of annual Emergency Department visits: patients in the highest quartile for the index of non-compliance had 50% as many annual visits as patients in the lowest quartile.

    Conclusion: PACT is a rapid, precise, and easily portable suite of software tools for extracting focused clinical information from free-text clinical documents. It compares favorably with the computational techniques currently available for this purpose (where they exist). It represents an important advance in the field, and we plan to develop this concept further to improve its performance and functionality. by Alexander Turchin. S.M.
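The core filtering step described in the Methods above (keep sentences that mention a characteristic-specific word tag, drop those that also carry a negative qualifier) can be sketched as follows. This is an illustrative reconstruction, not PACT's actual code: the sentence splitter, tag lists, and negative-qualifier lexicon here are assumptions chosen for the example.

```python
import re

# Illustrative negative qualifiers; PACT's real lexicon is not given in the abstract.
NEGATIVE_QUALIFIERS = ["rule out", "no evidence of", "denies", "negative for"]

def find_tagged_sentences(note: str, word_tags: list[str]) -> list[str]:
    """Return sentences mentioning a word tag and carrying no negative qualifier."""
    # Naive splitter: break after sentence-final punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", note)
    hits = []
    for sentence in sentences:
        lowered = sentence.lower()
        if any(tag in lowered for tag in word_tags):
            # Exclude sentences like "rule out diabetes", as in the PACT description.
            if not any(neg in lowered for neg in NEGATIVE_QUALIFIERS):
                hits.append(sentence)
    return hits

note = ("Patient has type 2 diabetes. Will rule out diabetes insipidus. "
        "Started metformin today.")
print(find_tagged_sentences(note, ["diabetes", "metformin"]))
# → ['Patient has type 2 diabetes.', 'Started metformin today.']
```

The substring-based matching keeps the sketch fast and dictionary-free, which is in the spirit of the throughput figures reported above, though a production system would need tokenization and scoped negation handling.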

    MultiLegalSBD: A Multilingual Legal Sentence Boundary Detection Dataset

    Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), with incorrectly split sentences heavily degrading the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, given the complex and varied sentence structures used. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in 6 languages. Our experimental results indicate that the performance of existing SBD models is subpar on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, demonstrating state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
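To see why legal text is hard for SBD, consider a minimal rule-based splitter. The sketch below is not one of the paper's models (those are CRF, BiLSTM-CRF, and transformer based); it only illustrates the failure mode: a naive period-based split breaks on abbreviations like "Art.", so a baseline must at least carry an abbreviation list. The abbreviation set here is a tiny illustrative sample.

```python
import re

# Small illustrative sample of legal abbreviations that end in a period.
ABBREVIATIONS = {"art.", "no.", "para.", "sec.", "v."}

def naive_sbd(text: str) -> list[str]:
    """Split after sentence-final punctuation, re-joining known abbreviations."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences, buffer = [], ""
    for part in parts:
        buffer = f"{buffer} {part}".strip() if buffer else part
        last_token = buffer.split()[-1].lower()
        # If the fragment ends in a known abbreviation, it is not a boundary.
        if last_token not in ABBREVIATIONS:
            sentences.append(buffer)
            buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences

text = "Under Art. 6 the claim fails. The court dismissed the appeal."
print(naive_sbd(text))
# → ['Under Art. 6 the claim fails.', 'The court dismissed the appeal.']
```

Hand-maintained lists like this do not transfer across the 6 languages in the dataset, which is exactly why learned monolingual and multilingual models are evaluated instead.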


    Ontology-Based Clinical Information Extraction Using SNOMED CT

    Extracting and encoding clinical information captured in unstructured clinical documents with standard medical terminologies is vital to enable secondary use of clinical data from practice. SNOMED CT is the most comprehensive medical ontology, with broad types of concepts and detailed relationships, and it has been widely used for many clinical applications. However, few studies have investigated the use of SNOMED CT in clinical information extraction. In this dissertation research, we developed a fine-grained information model based on SNOMED CT and built novel information extraction systems to recognize clinical entities, identify their relations, and encode them to SNOMED CT concepts. Our evaluation shows that such ontology-based information extraction systems using SNOMED CT can achieve state-of-the-art performance, indicating their potential in clinical natural language processing.

    Large language models in biomedical natural language processing: benchmarks, baselines, and recommendations

    Biomedical literature is growing rapidly, making it challenging to curate and extract knowledge manually. Biomedical natural language processing (BioNLP) techniques that can automatically extract information from biomedical literature help alleviate this burden. Recently, Large Language Models (LLMs), such as GPT-3 and GPT-4, have gained significant attention for their impressive performance. However, their effectiveness in BioNLP tasks and their impact on method development and downstream users remain understudied. This pilot study (1) establishes the baseline performance of GPT-3 and GPT-4 in both zero-shot and one-shot settings on eight BioNLP datasets across four applications: named entity recognition, relation extraction, multi-label document classification, and semantic similarity and reasoning; (2) examines the errors produced by the LLMs and categorizes them into three types: missingness, inconsistencies, and unwanted artificial content; and (3) provides suggestions for using LLMs in BioNLP applications. We make the datasets, baselines, and results publicly available to the community via https://github.com/qingyu-qc/gpt_bionlp_benchmark.
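The zero-shot versus one-shot distinction in the study above comes down to whether the prompt includes a labeled demonstration before the query. The sketch below assembles a one-shot prompt for the named entity recognition task; the wording and format are assumptions for illustration, not the prompts actually used in the benchmark, and no model call is made here.

```python
def build_one_shot_ner_prompt(example_text: str, example_entities: list[str],
                              query_text: str) -> str:
    """Assemble a one-shot NER prompt: one labeled demonstration, then the query."""
    return (
        "Extract all disease mentions from the text.\n\n"
        # The single demonstration is what makes this "one-shot";
        # dropping these two lines yields the zero-shot variant.
        f"Text: {example_text}\n"
        f"Diseases: {', '.join(example_entities)}\n\n"
        f"Text: {query_text}\n"
        "Diseases:"
    )

prompt = build_one_shot_ner_prompt(
    "The patient was diagnosed with asthma.", ["asthma"],
    "History of hypertension and gout.")
print(prompt)
```

Ending the prompt at "Diseases:" leaves the model to complete the entity list, which is the usual completion-style setup for this kind of benchmark.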