3 research outputs found

    Hierarchical Classification System for Breast Cancer Specimen Report (HCSBC) -- an end-to-end model for characterizing severity and diagnosis

    Automated classification of cancer pathology reports can extract information from unstructured reports and assign each report to structured diagnosis and severity categories. Such a system can reduce the burden of populating tumor registries, support registration for clinical trials, and help build large datasets for deep learning model development using true pathologic ground truth. However, breast pathology reports can be difficult to categorize because of their high linguistic variability and the wide variety of potential diagnoses (>50). Existing NLP models focus primarily on classifying primary breast cancer types (e.g. IDC, DCIS, ILC) and tumor characteristics, and ignore rare cancer subtypes. We therefore developed a hierarchical hybrid transformer-based pipeline (59 labels), the Hierarchical Classification System for Breast Cancer Specimen Report (HCSBC), which uses transformer-based, context-preserving NLP techniques, and compared it with several state-of-the-art ML and DL models. We trained the model on the EUH data and evaluated its performance on two external datasets, MGH and Mayo Clinic. We publicly release the code and a live application under a Huggingface Spaces repository.

    Named Entity Recognition in Electronic Health Records: A Methodological Review

    Objectives: A substantial portion of the data contained in Electronic Health Records (EHR) is unstructured, often appearing as free text. This format restricts its potential utility in clinical decision-making. Named entity recognition (NER) methods address the challenge of extracting pertinent information from unstructured text. The aim of this study was to outline the current NER methods and trace their evolution from 2011 to 2022. Methods: We conducted a methodological literature review of NER methods, with a focus on distinguishing the classification models, the types of tagging systems, and the languages employed in various corpora. Results: Several methods have been documented for automatically extracting relevant information from EHRs using natural language processing techniques such as NER and relation extraction (RE). These methods can automatically extract concepts, events, attributes, and other data, as well as the relationships between them. Most NER studies conducted thus far have used corpora in English or Chinese. Additionally, Bidirectional Encoder Representations from Transformers (BERT) combined with the BIO tagging scheme is the most frequently reported classification approach. We found only a limited number of papers on the implementation of NER or RE tasks in EHRs within a specific clinical domain. Conclusions: EHRs play a pivotal role in gathering clinical information and could serve as the primary source for automated clinical decision support systems. However, the creation of new corpora from EHRs in specific clinical domains is essential to facilitate the swift development of NER and RE models applied to EHRs for use in clinical practice.
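
    As a rough illustration of the BIO tagging scheme mentioned in the abstract above, the following minimal sketch converts entity spans into per-token tags (B = beginning of an entity, I = inside, O = outside). The sentence, the span positions, and the labels SYMPTOM and DRUG are hypothetical and are not taken from the reviewed papers.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Turn entity spans (start, end_exclusive, label) over token indices into BIO tags."""
    tags = ["O"] * len(tokens)  # every token starts outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"  # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"  # remaining tokens of the same entity
    return tags


# Hypothetical example sentence and annotations (assumptions for illustration only).
tokens = ["Patient", "denies", "chest", "pain", "after", "taking", "aspirin", "."]
spans = [(2, 4, "SYMPTOM"), (6, 7, "DRUG")]
tags = bio_tags(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

    A token classifier such as a BERT-based model would then be trained to predict one such tag per token; that training step is not shown here.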

    Classification of Computed Tomography Findings of the Chest Based on Deep Learning

    Background: Computed tomography of the chest is a common and important examination in radiology. The results of a CT examination are presented in a report text that does not follow a fixed structure, and so far there is no categorization of report texts, although this would make everyday clinical practice considerably easier. To extract structured data from chest CT report texts, three different deep learning models for natural language processing (NLP) were developed. Methods: An annotated dataset of 5,950 report texts from chest CT examinations (including CT examinations for pulmonary artery embolism) was created for training three deep learning models, and the report texts were screened for the occurrence of 21 different findings. An AWD-LSTM and two transformer architectures (BERT and DistilBERT) were used to classify the reports. The classification performance of the models was then assessed using accuracy, sensitivity, positive predictive value (precision), F1 score, and AUC. Results: All three models achieved high metrics, which varied between the different findings. The accuracy for every finding reached >0.96 for the AWD-LSTM, >0.89 for BERT, and >0.87 for DistilBERT. The metrics rose with increasing prevalence of the respective finding. Conclusion: Using three deep learning models (AWD-LSTM, BERT, DistilBERT) and a relatively small dataset of report texts, several computer-aided classification systems for chest CT reports were developed that were able to identify the findings independently. The models can now be applied to all chest CT reports, and the extracted labels can be used for further tasks.
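
    The per-finding metrics named in the abstract above (accuracy, sensitivity, precision, F1) can be computed from binary labels as in the following minimal sketch. The label vectors are made up for illustration and are not data from the study; the function name `binary_metrics` is likewise an assumption. The example uses a low-prevalence finding to show one reason why metrics can depend on prevalence: accuracy stays high even when the single positive case is missed.

```python
from typing import Dict, List


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity, precision and F1 for one binary finding."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Guard against division by zero when a class is never present or never predicted.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "precision": precision, "f1": f1}


# Hypothetical rare finding: 1 positive report out of 10, and the model predicts none.
rare_true = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
rare_pred = [0] * 10
rare = binary_metrics(rare_true, rare_pred)

# Hypothetical balanced finding with one miss and one false alarm.
balanced = binary_metrics([1, 1, 0, 0], [1, 0, 0, 1])

print(rare)
print(balanced)
```

    In this made-up case the rare finding scores 0.9 accuracy with zero sensitivity, which is why accuracy is usually read together with sensitivity, precision, and F1 for low-prevalence findings.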