Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches
This work investigates multiple approaches to Named Entity Recognition (NER)
for text in Electronic Health Record (EHR) data. In particular, we look into
the application of (i) rule-based, (ii) deep learning and (iii) transfer
learning systems for the task of NER on brain imaging reports with a focus on
records from patients with stroke. We explore the strengths and weaknesses of
each approach, develop rules and train on a common dataset, and evaluate each
system's performance on common test sets of Scottish radiology reports from two
sources (brain imaging reports in ESS -- Edinburgh Stroke Study data collected
by NHS Lothian as well as radiology reports created in NHS Tayside). Our
comparison shows that a hand-crafted system is the most accurate way to
automatically label EHR, but machine learning approaches can provide a feasible
alternative where resources for a manual system are not readily available.
Comment: 8 pages, presented at HealTAC 2019, Cardiff, 24-25/04/2019
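The rule-based approach compared above can be sketched as a gazetteer matched with regular expressions; the terms below are illustrative stand-ins, not the lexicon used for the ESS or Tayside reports:

```python
import re

# Minimal rule-based NER sketch: a gazetteer of stroke-related terms
# (hypothetical examples) matched case-insensitively with regex.
GAZETTEER = {
    "finding": ["infarct", "haemorrhage", "ischaemia"],
    "location": ["frontal lobe", "basal ganglia", "cerebellum"],
}

def tag_entities(text):
    """Return sorted (start, end, label, surface) spans for gazetteer matches."""
    spans = []
    for label, terms in GAZETTEER.items():
        for term in terms:
            for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.I):
                spans.append((m.start(), m.end(), label, m.group(0)))
    return sorted(spans)

spans = tag_entities("Acute infarct in the left frontal lobe; no haemorrhage.")
```

A hand-crafted system of this kind trades recall on unseen phrasings for precision on the terms its authors enumerate, which matches the accuracy-versus-effort trade-off the abstract reports.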
Domain and Language Independent Feature Extraction for Statistical Text Categorization
A generic system for text categorization is presented which uses a
representative text corpus to adapt the processing steps: feature extraction,
dimension reduction, and classification. Feature extraction automatically
learns features from the corpus by reducing actual word forms using statistical
information of the corpus and general linguistic knowledge. The dimension of
the feature vector is then reduced by a linear transformation, keeping the essential
information. The classification principle is a minimum least square approach
based on polynomials. The described system can be readily adapted to new
domains or new languages. In application, the system is reliable, fast, and
runs completely automatically. It is shown that the text categorizer works
successfully both on text generated by document image analysis (DIA) and on
ground-truth data.
Comment: 12 pages, TeX file, 9 Postscript figures, uses epsf.sty
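The pipeline described above can be sketched end to end on toy data (documents, labels, and the two-component reduction below are invented for illustration): term-count features, a linear dimension reduction, and a minimum-least-squares fit against one-hot class targets.

```python
import numpy as np

# Toy corpus standing in for the adaptation corpus described above.
docs = ["stroke brain scan", "brain scan report",
        "invoice total amount", "invoice amount due"]
labels = np.array([0, 0, 1, 1])          # 0 = clinical, 1 = business
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# Dimension reduction: project onto the top-2 right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:2].T

# Minimum-least-squares classifier against one-hot targets
# (a degree-1 instance of a polynomial classifier).
Y = np.eye(2)[labels]
W, *_ = np.linalg.lstsq(Z, Y, rcond=None)
pred = np.argmax(Z @ W, axis=1)
```

Because every step is learned from the corpus itself, swapping in documents from a new domain or language retrains the whole chain without code changes, which is the portability claim the abstract makes.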
NeuroNER: an easy-to-use program for named-entity recognition based on neural networks
Named-entity recognition (NER) aims at identifying entities of interest in a
text. Artificial neural networks (ANNs) have recently been shown to outperform
existing NER systems. However, ANNs remain challenging to use for non-expert
users. In this paper, we present NeuroNER, an easy-to-use named-entity
recognition tool based on ANNs. Users can annotate entities using a graphical
web-based user interface (BRAT): the annotations are then used to train an ANN,
which in turn predicts entities' locations and categories in new texts. NeuroNER
makes this annotation-training-prediction flow smooth and accessible to anyone.
Comment: The first two authors contributed equally to this work
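The flow above starts from standoff annotations (BRAT-style entity type plus character offsets). A minimal sketch of turning one annotated sentence into the BIO token labels a neural tagger trains on (text, spans, and labels below are invented, not NeuroNER's internals):

```python
# Hypothetical standoff annotations: (start, end, label) character spans.
text = "Aspirin treats headache"
spans = [(0, 7, "DRUG"), (15, 23, "SYMPTOM")]

def to_bio(text, spans):
    """Convert character-offset spans to per-token BIO labels."""
    tokens, labels, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)      # locate token in the raw text
        pos = start + len(tok)
        label = "O"
        for s, e, lab in spans:
            if start == s:
                label = "B-" + lab        # token opens an entity
            elif s < start < e:
                label = "I-" + lab        # token continues an entity
        tokens.append(tok)
        labels.append(label)
    return list(zip(tokens, labels))

bio = to_bio(text, spans)
```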
OCR Error Correction Using Character Correction and Feature-Based Word Classification
This paper explores the use of a learned classifier for post-OCR text
correction. Experiments with the Arabic language show that this approach, which
integrates a weighted confusion matrix and a shallow language model, corrects
the vast majority of segmentation and recognition errors, the most frequent
error types in our dataset.
Comment: Proceedings of the 12th IAPR International Workshop on Document
Analysis Systems (DAS2016), Santorini, Greece, April 11-14, 2016
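The combination described above can be sketched as candidate generation from a character-confusion matrix followed by rescoring with a shallow (here unigram) language model; the confusion entries and lexicon probabilities are toy assumptions, not the paper's learned weights:

```python
# Hypothetical confusion matrix: OCR string -> plausible true strings.
CONFUSIONS = {"rn": ["m"], "cl": ["d"], "1": ["l", "i"]}
# Hypothetical unigram language model over a tiny lexicon.
LEXICON = {"modern": 0.6, "modem": 0.3, "clean": 0.1}

def candidates(word):
    """Yield the word plus variants with one confusion substituted."""
    yield word
    for wrong, rights in CONFUSIONS.items():
        i = word.find(wrong)
        if i >= 0:
            for r in rights:
                yield word[:i] + r + word[i + len(wrong):]

def correct(word):
    """Pick the candidate with the highest language-model probability."""
    return max(candidates(word), key=lambda w: LEXICON.get(w, 0.0))

fixed = correct("rnodern")   # the classic rn -> m OCR confusion
```

A weighted version would multiply each candidate's confusion probability into the score rather than relying on the lexicon alone.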
Clinical Information Extraction via Convolutional Neural Network
We report an implementation of a clinical information extraction tool that
leverages a deep neural network to annotate event spans and their attributes from
raw clinical notes and pathology reports. Our approach uses context words and
their part-of-speech tags and shape information as features. We then employ a
temporal (1D) convolutional neural network to learn hidden feature
representations. Finally, we use Multilayer Perceptron (MLP) to predict event
spans. The empirical evaluation demonstrates that our approach significantly
outperforms baselines.
Comment: arXiv admin note: text overlap with arXiv:1408.5882 by other authors
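The feature-learning step above can be sketched as a valid 1D convolution over a sentence's token embeddings followed by ReLU and max-pooling over time; the weights and inputs below are random stand-ins, not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, k, f = 7, 5, 3, 4            # tokens, embedding dim, kernel width, filters
X = rng.normal(size=(T, d))        # one sentence: T token embeddings
W = rng.normal(size=(k, d, f))     # convolution kernel

def conv1d(X, W):
    """Valid 1D convolution over time, then ReLU and max-pooling over positions."""
    T, d = X.shape
    k, _, f = W.shape
    out = np.stack([np.einsum("kd,kdf->f", X[t:t + k], W)
                    for t in range(T - k + 1)])           # (T-k+1, f)
    return np.maximum(out, 0.0).max(axis=0)               # pooled feature vector

feat = conv1d(X, W)
```

The pooled vector is what a downstream MLP, as in the abstract, would consume to predict event spans.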
Improving Document Clustering by Eliminating Unnatural Language
Technical documents contain a fair amount of unnatural language, such as
tables, formulas, and pseudo-code. Unnatural language is an important source
of confusion for existing NLP tools. This paper presents an effective method
of distinguishing unnatural language from natural language, and evaluates the
impact of unnatural language detection on NLP tasks such as document
clustering. We view this problem as an information extraction task and build a
multiclass classification model identifying unnatural language components into
four categories. First, we create a new annotated corpus by collecting slides
and papers in various formats, PPT, PDF, and HTML, where unnatural language
components are annotated into four categories. We then explore features
available from plain text to build a statistical model that can handle any
format as long as it is converted into plain text. Our experiments show that
removing unnatural language components yields an absolute improvement of up to
15% in document clustering. Our corpus and tool are publicly available.
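Plain-text features of the kind the abstract describes can be sketched with simple character statistics; the features and threshold below are illustrative heuristics, not the paper's trained multiclass model:

```python
def features(line):
    """Character-level statistics computable from plain text of any source format."""
    n = max(len(line), 1)
    toks = line.split()
    return {
        "symbol_ratio": sum(not c.isalnum() and not c.isspace() for c in line) / n,
        "digit_ratio": sum(c.isdigit() for c in line) / n,
        "mean_token_len": sum(map(len, toks)) / len(toks) if toks else 0.0,
    }

def looks_unnatural(line, symbol_thresh=0.2):
    """Crude binary heuristic: heavy symbol use suggests formula/code/table content."""
    return features(line)["symbol_ratio"] > symbol_thresh

flag_formula = looks_unnatural("E = m*c**2 + sum(x_i^2)")
flag_prose = looks_unnatural("Technical documents contain unnatural language.")
```

A trained classifier would replace the threshold with a model over such features, extended to the four categories mentioned above.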
Analysis of Multilingual Sequence-to-Sequence speech recognition systems
This paper investigates the applications of various multilingual approaches
developed in conventional hidden Markov model (HMM) systems to
sequence-to-sequence (seq2seq) automatic speech recognition (ASR). On a set
composed of Babel data, we first show the effectiveness of multi-lingual
training with stacked bottle-neck (SBN) features. Then we explore various
architectures and training strategies of multi-lingual seq2seq models based on
CTC-attention networks including combinations of output layer, CTC and/or
attention component re-training. We also investigate the effectiveness of
language-transfer learning in a very low resource scenario when the target
language is not included in the original multi-lingual training data.
Interestingly, we found multilingual features superior to multilingual models,
and this finding suggests that we can efficiently combine the benefits of the
HMM system with the seq2seq system through these multilingual feature
techniques.
Comment: arXiv admin note: text overlap with arXiv:1810.0345
Recurrent Neural Network Method in Arabic Words Recognition System
The recognition of unconstrained handwriting continues to be a difficult task
for computers despite active research for several decades. This is because
handwritten text poses great challenges, such as character and word
segmentation, character recognition, variation between handwriting styles,
differing character sizes, the absence of font constraints, and varying
background clarity. This paper primarily discusses online handwriting
recognition methods for Arabic words, a script widely used across the Middle
East and North Africa. Because the characters within an Arabic word are
connected, segmenting an Arabic word is very difficult. We apply a recurrent
neural network to online handwritten Arabic word recognition. The key
innovation is a recently introduced recurrent neural network objective function
known as connectionist temporal classification. The system consists of an
advanced recurrent neural network with an output layer designed for sequence
labeling, partially combined with a probabilistic language model. Experimental
results show that unconstrained Arabic words achieve recognition rates of about
79%, significantly higher than the roughly 70% achieved by a previously
developed hidden Markov model based recognition system.
Comment: 6 Pages, 5 Figures, Vol. 3, Issue 11, pages 43-4
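The connectionist temporal classification objective mentioned above rests on a collapse function that maps a frame-level label path to an output string by merging repeated labels and dropping blanks. A minimal sketch (the label strings are illustrative):

```python
BLANK = "-"   # the CTC blank symbol

def ctc_collapse(path):
    """Map a frame-level label path to its output: merge repeats, drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

word = ctc_collapse("ss-aa-l-aa-mm")
```

Training sums the probability of every path that collapses to the target transcription, which is what lets the network learn sequence labeling without pre-segmented characters.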
Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models
Many business documents processed in modern NLP and IR pipelines are visually
rich: in addition to text, their semantics can also be captured by visual
traits such as layout, format, and fonts. We study the problem of information
extraction from visually rich documents (VRDs) and present a model that
combines the power of large pre-trained language models and graph neural
networks to efficiently encode both textual and visual information in business
documents. We further introduce new fine-tuning objectives to improve in-domain
unsupervised fine-tuning and to better utilize large amounts of unlabeled
in-domain data. We experiment on real-world invoice and resume data sets and show that
the proposed method outperforms strong text-based RoBERTa baselines by 6.3%
absolute F1 on invoices and 4.7% absolute F1 on resumes. When evaluated in a
few-shot setting, our method requires up to 30x less annotation data than the
baseline to achieve the same level of performance at ~90% F1.
Comment: 10 pages, to appear in SIGIR 2020 Industry Track
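The graph side of the model above can be sketched as nodes for text boxes with edges linking spatially adjacent boxes, so a graph network can mix textual and layout signals; the boxes and distance threshold below are invented for illustration:

```python
# Hypothetical text boxes from a visually rich document (e.g. an invoice),
# each with a text string and a 2D position.
boxes = [
    {"id": 0, "text": "Invoice No.", "x": 10,  "y": 10},
    {"id": 1, "text": "12345",       "x": 120, "y": 10},
    {"id": 2, "text": "Total",       "x": 10,  "y": 200},
    {"id": 3, "text": "$99.00",      "x": 120, "y": 200},
]

def layout_edges(boxes, max_dist=150):
    """Connect box pairs whose positions lie within max_dist (Euclidean)."""
    edges = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            if ((a["x"] - b["x"]) ** 2 + (a["y"] - b["y"]) ** 2) ** 0.5 <= max_dist:
                edges.append((a["id"], b["id"]))
    return edges

edges = layout_edges(boxes)
```

Linking "Invoice No." to its value "12345" by proximity, rather than by token order alone, is exactly the layout signal a pure text baseline like RoBERTa cannot see.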
Overlay Text Extraction From TV News Broadcast
The text data present in overlaid bands convey brief descriptions of news
events in broadcast videos. The process of text extraction becomes challenging
as overlay text is presented in widely varying formats and often with animation
effects. We note that existing edge density based methods are well suited for
our application on account of their simplicity and speed of operation. However,
these methods are sensitive to thresholds and have high false positive rates.
In this paper, we present a contrast enhancement based preprocessing stage for
overlay text detection and a parameter free edge density based scheme for
efficient text band detection. The second contribution of this paper is a novel
approach for multiple text region tracking with a formal identification of all
possible detection failure cases. The tracking stage enables us to establish
the temporal presence of text bands and their linking over time. The third
contribution is the adoption of Tesseract OCR for the specific task of overlay
text recognition using web news articles. The proposed approach is tested and
found superior on news videos acquired from three Indian English television
news channels along with benchmark datasets.
Comment: Published in INDICON 201
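The edge-density band detection described above can be sketched by counting edge pixels per row and flagging rows whose density exceeds the frame average, a parameter-light variant in the spirit of the paper's scheme; the synthetic edge map below is made up:

```python
import numpy as np

# Synthetic binary edge map: a dense horizontal band where a text strip would be.
edge_map = np.zeros((10, 20), dtype=int)
edge_map[6:8, 2:18] = 1

def text_band_rows(edge_map):
    """Return indices of rows whose edge density exceeds the frame mean."""
    density = edge_map.mean(axis=1)                 # per-row edge density
    return np.flatnonzero(density > density.mean()).tolist()

rows = text_band_rows(edge_map)
```

Using the frame's own mean as the cut-off avoids the fixed thresholds that the abstract identifies as the weakness of earlier edge-density methods.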