12,178 research outputs found

    Machine learning model for clinical named entity recognition

    Named entity recognition (NER) is the most widely used NLP task for extracting important concepts (named entities) from clinical notes. The literature shows that several researchers have used machine learning models extensively for clinical NER. Medical named entity recognition and normalization are the most fundamental of the medical data mining tasks. Medical NER differs from general NER in several ways. The huge number of alternate spellings and synonyms causes vocabulary sizes to explode, which reduces the efficiency of medical dictionaries. Entities often consist of long sequences of tokens, making their boundaries harder to detect exactly. Notes written by clinicians are less structured and use minimal grammar with cryptic shorthand, which poses challenges for named entity recognition. Generally, NER systems are either rule based or pattern based. The rules and patterns are not generalizable because of the diverse writing styles of clinicians. Systems that use a machine learning based approach to resolve these issues focus on choosing effective features for building the classifier. In this work, a machine learning based approach has been used to extract the clinical data in a required manner

    Content Recognition and Context Modeling for Document Analysis and Retrieval

    The nature and scope of available documents are changing significantly in many areas of document analysis and retrieval as complex, heterogeneous collections become accessible to virtually everyone via the web. The increasing level of diversity presents a great challenge for document image content categorization, indexing, and retrieval. Meanwhile, the processing of documents with unconstrained layouts and complex formatting often requires effective leveraging of broad contextual knowledge. In this dissertation, we first present a novel approach for document image content categorization, using a lexicon of shape features. Each lexical word corresponds to a scale and rotation invariant local shape feature that is generic enough to be detected repeatably and is segmentation free. A concise, structurally indexed shape lexicon is learned by clustering and partitioning feature types through graph cuts. Our idea finds successful application in several challenging tasks, including content recognition of diverse web images and language identification on documents composed of mixed machine printed text and handwriting. Second, we address two fundamental problems in signature-based document image retrieval. Facing continually increasing volumes of documents, detecting and recognizing unique, evidentiary visual entities (e.g., signatures and logos) provides a practical and reliable supplement to the OCR recognition of printed text. We propose a novel multi-scale framework to detect and segment signatures jointly from document images, based on the structural saliency under a signature production model. We formulate the problem of signature retrieval in the unconstrained setting of geometry-invariant deformable shape matching and demonstrate state-of-the-art performance in signature matching and verification. Third, we present a model-based approach for extracting relevant named entities from unstructured documents.
In a wide range of applications that require structured information from diverse, unstructured document images, processing OCR text does not give satisfactory results due to the absence of linguistic context. Our approach enables learning of inference rules collectively based on contextual information from both page layout and text features. Finally, we demonstrate the importance of mining general web user behavior data for improving document ranking and other aspects of the web search experience. The context of web user activities reveals their preferences and intents, and we emphasize the analysis of individual user sessions for creating aggregate models. We introduce a novel algorithm for estimating web page and web site importance, and discuss its theoretical foundation based on an intentional surfer model. We demonstrate that our approach significantly improves large-scale document retrieval performance.

    On the Use of Parsing for Named Entity Recognition

    Parsing is a core natural language processing technique that can be used to obtain the structure underlying sentences in human languages. Named entity recognition (NER) is the task of identifying the entities that appear in a text. NER is a challenging natural language processing task that is essential to extract knowledge from texts in multiple domains, ranging from financial to medical. It is intuitive that the structure of a text can be helpful to determine whether or not a certain portion of it is an entity and if so, to establish its concrete limits. However, parsing has been a relatively little-used technique in NER systems, since most of them have chosen to consider shallow approaches to deal with text. In this work, we study the characteristics of NER, a task that is far from being solved despite its long history; we analyze the latest advances in parsing that make its use advisable in NER settings; we review the different approaches to NER that make use of syntactic information; and we propose a new way of using parsing in NER based on casting parsing itself as a sequence labeling task. This work has been funded by MINECO, AEI and FEDER of UE through the ANSWER-ASAP project (TIN2017-85160-C2-1-R); and by Xunta de Galicia through a Competitive Reference Group grant (ED431C 2020/11). CITIC, as Research Center of the Galician University System, is funded by the Consellería de Educación, Universidade e Formación Profesional of the Xunta de Galicia through the European Regional Development Fund (ERDF/FEDER) with 80%, the Galicia ERDF 2014-20 Operational Programme, and the remaining 20% from the Secretaría Xeral de Universidades (Ref. ED431G 2019/01). Carlos Gómez-Rodríguez has also received funding from the European Research Council (ERC), under the European Union’s Horizon 2020 research and innovation programme (FASTPARSE, Grant No. 714150).

    Learning to match names across languages

    We report on research on matching names in different scripts across languages. We explore two trainable approaches based on comparing pronunciations. The first, a cross-lingual approach, uses an automatic name-matching program that exploits rules based on phonological comparisons of the two languages carried out by humans. The second, a monolingual approach, relies only on automatic comparison of the phonological representations of each pair. Alignments produced by each approach are fed to a machine learning algorithm. Results show that the monolingual approach yields machine-learning based comparison of person names in English and Chinese with an F-measure of over 97.0.

    Named Entity Recognition in Chinese Clinical Text

    Objective: Named entity recognition (NER) is one of the fundamental tasks in natural language processing (NLP). In the medical domain, there have been a number of studies on NER in English clinical notes; however, very limited NER research has been done on clinical notes written in Chinese. The goal of this study is to develop corpora, methods, and systems for NER in Chinese clinical text. Materials and methods: To study entities in Chinese clinical text, we started by building annotated clinical corpora in Chinese. We developed an NER annotation guideline in Chinese by extending the one used in the 2010 i2b2 NLP challenge. We randomly selected 400 admission notes and 400 discharge summaries from Peking Union Medical College Hospital (PUMCH) in China. For each note, four types of entities (clinical problems, procedures, labs, and medications) were annotated according to the developed guideline. In addition, an annotation tool was developed to assist two MD students in annotating Chinese clinical documents. A comparison of entity distribution between Chinese and English clinical notes (646 English and 400 Chinese discharge summaries) was performed using the annotated corpora, to identify the important features for NER. In the NER study, two-thirds of the 400 notes were used for training the NER systems and one-third were used for testing. We investigated the effects of different types of features including bag-of-characters, word segmentation, part-of-speech, and section information, with different machine learning (ML) algorithms including Conditional Random Fields (CRF), Support Vector Machines (SVM), Maximum Entropy (ME), and Structural Support Vector Machines (SSVM) on the Chinese clinical NER task. All classifiers were trained on the training dataset and evaluated on the test set, and micro-averaged precision, recall, and F-measure were reported.
Results: Our evaluation on the independent test set showed that most types of features were beneficial to Chinese NER systems, although the improvements were limited. By combining word segmentation and section information, the system achieved the highest performance, indicating that these two types of features are complementary to each other. When the same types of optimized features were used, CRF and SSVM outperformed SVM and ME. More specifically, SSVM reached the highest performance among the four algorithms, with F-measures of 93.51% and 90.01% for admission notes and discharge summaries respectively. Conclusions: In this study, we created large annotated datasets of Chinese admission notes and discharge summaries and then systematically evaluated different types of features (e.g., syntactic, semantic, and segmentation information) and four ML algorithms including CRF, SVM, SSVM, and ME for clinical NER in Chinese. To the best of our knowledge, this is one of the earliest comprehensive efforts in Chinese clinical NER research and we believe it will provide valuable insights to NLP research in Chinese clinical text. Our results suggest that both word segmentation and section information improve NER in Chinese clinical text, and SSVM, a recent sequential labelling algorithm, outperformed CRF and other classification algorithms. Our best system achieved F-measures of 90.01% and 93.52% on Chinese discharge summaries and admission notes, respectively, indicating a promising start on Chinese NLP research.
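The micro-averaged precision, recall, and F-measure reported in the abstract above can be illustrated with a minimal sketch. It assumes entities are represented as (start, end, type) tuples and that a prediction counts as correct only on an exact match of span and type; the abstract does not state the matching criterion, and the function name `micro_prf` and the toy spans are hypothetical.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall, and F-measure over a set of notes.

    gold, pred: lists with one set per note; each set holds
    (start, end, entity_type) tuples. Exact span-and-type matching
    is an assumption made for this sketch.
    """
    tp = fp = fn = 0
    # Pool the counts over all notes before computing the ratios;
    # this pooling is what makes the average "micro".
    for gold_spans, pred_spans in zip(gold, pred):
        tp += len(gold_spans & pred_spans)   # predicted and correct
        fp += len(pred_spans - gold_spans)   # predicted but not in gold
        fn += len(gold_spans - pred_spans)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Invented toy annotations for two notes (not data from the study).
gold = [{(0, 2, "problem"), (5, 6, "medication")}, {(1, 3, "lab")}]
pred = [{(0, 2, "problem"), (5, 7, "medication")}, {(1, 3, "lab"), (4, 5, "procedure")}]
precision, recall, f_measure = micro_prf(gold, pred)
```

Because counts are pooled across notes, entity types with many mentions weigh more heavily in the result than rare ones, unlike a macro average taken per type.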

    Natural Language Processing Using Neighbour Entropy-based Segmentation

    In natural language processing (NLP) of Chinese hazard text collected in the process of hazard identification, Chinese word segmentation (CWS) is the first step in extracting meaningful information from such semi-structured Chinese texts. This paper proposes a new neighbor entropy-based segmentation (NES) model for CWS. The model considers the segmentation benefits of neighbor entropies, adopting the concept of "neighbor" from optimization research. It is defined by the benefit ratio of text segmentation, including benefits and losses of combining the segmentation unit with more information than other popular statistical models. In the experiments performed, together with the maximum-based segmentation algorithm, the NES model achieves 99.3% precision, 98.7% recall, and a 99.0% F-measure for text segmentation; these results are higher than those of existing tools based on seven other popular statistical models. Results show that the NES model is a valid CWS model, especially for segmentation tasks that require longer character sequences. The text corpus comes from the Beijing Municipal Administration of Work Safety and was recorded in the fourth quarter of 2018.
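The neighbor entropies that the NES abstract builds on can be sketched as follows. This is only the generic left/right neighbor (branching) entropy statistic commonly used in unsupervised word segmentation, not the paper's benefit-ratio model, whose exact definition the abstract does not give. The function names and the toy hazard-style strings are invented for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a Counter of neighbor characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def neighbor_entropies(corpus, candidate):
    """Return (left_entropy, right_entropy) of a candidate string.

    A candidate whose neighboring characters vary a lot (high entropy)
    behaves like a free-standing word; low entropy on one side suggests
    the candidate is a fragment of a longer unit.
    """
    left, right = Counter(), Counter()
    for text in corpus:
        start = text.find(candidate)
        while start != -1:
            if start > 0:
                left[text[start - 1]] += 1      # character just before
            end = start + len(candidate)
            if end < len(text):
                right[text[end]] += 1           # character just after
            start = text.find(candidate, start + 1)
    return entropy(left), entropy(right)


# Invented toy corpus (not from the Beijing work-safety data).
corpus = ["安全帽破损", "未戴安全帽", "安全帽缺失", "安全网破损"]
left_h, right_h = neighbor_entropies(corpus, "安全")
left_h_long, right_h_long = neighbor_entropies(corpus, "安全帽")
```

In this toy corpus the right-hand entropy of "安全" is lower than that of "安全帽", because "安全" is usually followed by the same character. Statistics of this kind are what a segmentation model can use to favor the longer unit.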