Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches
This work investigates multiple approaches to Named Entity Recognition (NER)
for text in Electronic Health Record (EHR) data. In particular, we look into
the application of (i) rule-based, (ii) deep learning and (iii) transfer
learning systems for the task of NER on brain imaging reports with a focus on
records from patients with stroke. We explore the strengths and weaknesses of
each approach, develop rules and train on a common dataset, and evaluate each
system's performance on common test sets of Scottish radiology reports from two
sources (brain imaging reports in ESS -- Edinburgh Stroke Study data collected
by NHS Lothian as well as radiology reports created in NHS Tayside). Our
comparison shows that a hand-crafted system is the most accurate way to
automatically label EHR, but machine learning approaches can provide a feasible
alternative where resources for a manual system are not readily available.
Comment: 8 pages, presented at HealTAC 2019, Cardiff, 24-25/04/2019
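The rule-based approach compared above can be sketched as a gazetteer matched with regular expressions; the terms below are illustrative stand-ins, not the lexicon used for the ESS or Tayside reports:

```python
import re

# Minimal rule-based NER sketch: a gazetteer of stroke-related terms
# (hypothetical examples) matched case-insensitively with regex.
GAZETTEER = {
    "finding": ["infarct", "haemorrhage", "ischaemia"],
    "location": ["frontal lobe", "basal ganglia", "cerebellum"],
}

def tag_entities(text):
    """Return sorted (start, end, label, surface) spans for gazetteer matches."""
    spans = []
    for label, terms in GAZETTEER.items():
        for term in terms:
            for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.I):
                spans.append((m.start(), m.end(), label, m.group(0)))
    return sorted(spans)

spans = tag_entities("Acute infarct in the left frontal lobe; no haemorrhage.")
```

A hand-crafted system of this kind trades recall on unseen phrasings for precision on the terms its authors enumerate, which matches the accuracy-versus-effort trade-off the abstract reports.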
Domain and Language Independent Feature Extraction for Statistical Text Categorization
A generic system for text categorization is presented which uses a
representative text corpus to adapt the processing steps: feature extraction,
dimension reduction, and classification. Feature extraction automatically
learns features from the corpus by reducing actual word forms using statistical
information of the corpus and general linguistic knowledge. The dimension of
the feature vector is then reduced by a linear transformation, keeping the essential
information. The classification principle is a minimum least square approach
based on polynomials. The described system can be readily adapted to new
domains or new languages. In application, the system is reliable, fast, and
runs completely automatically. It is shown that the text categorizer works
successfully both on text generated by document image analysis (DIA) and on
ground-truth data.
Comment: 12 pages, TeX file, 9 Postscript figures, uses epsf.sty
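The pipeline described above can be sketched end to end on toy data (documents, labels, and the two-component reduction below are invented for illustration): term-count features, a linear dimension reduction, and a minimum-least-squares fit against one-hot class targets.

```python
import numpy as np

# Toy corpus standing in for the adaptation corpus described above.
docs = ["stroke brain scan", "brain scan report",
        "invoice total amount", "invoice amount due"]
labels = np.array([0, 0, 1, 1])          # 0 = clinical, 1 = business
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

# Dimension reduction: project onto the top-2 right singular vectors.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
Z = X @ Vt[:2].T

# Minimum-least-squares classifier against one-hot targets
# (a degree-1 instance of a polynomial classifier).
Y = np.eye(2)[labels]
W, *_ = np.linalg.lstsq(Z, Y, rcond=None)
pred = np.argmax(Z @ W, axis=1)
```

Because every step is learned from the corpus itself, swapping in documents from a new domain or language retrains the whole chain without code changes, which is the portability claim the abstract makes.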
NeuroNER: an easy-to-use program for named-entity recognition based on neural networks
Named-entity recognition (NER) aims at identifying entities of interest in a
text. Artificial neural networks (ANNs) have recently been shown to outperform
existing NER systems. However, ANNs remain challenging to use for non-expert
users. In this paper, we present NeuroNER, an easy-to-use named-entity
recognition tool based on ANNs. Users can annotate entities using a graphical
web-based user interface (BRAT): the annotations are then used to train an ANN,
which in turn predicts entities' locations and categories in new texts. NeuroNER
makes this annotation-training-prediction flow smooth and accessible to anyone.
Comment: The first two authors contributed equally to this work
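The flow above starts from standoff annotations (BRAT-style entity type plus character offsets). A minimal sketch of turning one annotated sentence into the BIO token labels a neural tagger trains on (text, spans, and labels below are invented, not NeuroNER's internals):

```python
# Hypothetical standoff annotations: (start, end, label) character spans.
text = "Aspirin treats headache"
spans = [(0, 7, "DRUG"), (15, 23, "SYMPTOM")]

def to_bio(text, spans):
    """Convert character-offset spans to per-token BIO labels."""
    tokens, labels, pos = [], [], 0
    for tok in text.split():
        start = text.index(tok, pos)      # locate token in the raw text
        pos = start + len(tok)
        label = "O"
        for s, e, lab in spans:
            if start == s:
                label = "B-" + lab        # token opens an entity
            elif s < start < e:
                label = "I-" + lab        # token continues an entity
        tokens.append(tok)
        labels.append(label)
    return list(zip(tokens, labels))

bio = to_bio(text, spans)
```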
OCR Error Correction Using Character Correction and Feature-Based Word Classification
This paper explores the use of a learned classifier for post-OCR text
correction. Experiments with the Arabic language show that this approach, which
integrates a weighted confusion matrix and a shallow language model, corrects
the vast majority of segmentation and recognition errors, the most frequent
error types in our dataset.
Comment: Proceedings of the 12th IAPR International Workshop on Document
Analysis Systems (DAS2016), Santorini, Greece, April 11-14, 2016
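The combination described above can be sketched as candidate generation from a character-confusion matrix followed by rescoring with a shallow (here unigram) language model; the confusion entries and lexicon probabilities are toy assumptions, not the paper's learned weights:

```python
# Hypothetical confusion matrix: OCR string -> plausible true strings.
CONFUSIONS = {"rn": ["m"], "cl": ["d"], "1": ["l", "i"]}
# Hypothetical unigram language model over a tiny lexicon.
LEXICON = {"modern": 0.6, "modem": 0.3, "clean": 0.1}

def candidates(word):
    """Yield the word plus variants with one confusion substituted."""
    yield word
    for wrong, rights in CONFUSIONS.items():
        i = word.find(wrong)
        if i >= 0:
            for r in rights:
                yield word[:i] + r + word[i + len(wrong):]

def correct(word):
    """Pick the candidate with the highest language-model probability."""
    return max(candidates(word), key=lambda w: LEXICON.get(w, 0.0))

fixed = correct("rnodern")   # the classic rn -> m OCR confusion
```

A weighted version would multiply each candidate's confusion probability into the score rather than relying on the lexicon alone.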
Clinical Information Extraction via Convolutional Neural Network
We report an implementation of a clinical information extraction tool that
leverages a deep neural network to annotate event spans and their attributes from
raw clinical notes and pathology reports. Our approach uses context words and
their part-of-speech tags and shape information as features. We then employ a
temporal (1D) convolutional neural network to learn hidden feature
representations. Finally, we use Multilayer Perceptron (MLP) to predict event
spans. The empirical evaluation demonstrates that our approach significantly
outperforms baselines.
Comment: arXiv admin note: text overlap with arXiv:1408.5882 by other authors
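The feature-learning step above can be sketched as a valid 1D convolution over a sentence's token embeddings followed by ReLU and max-pooling over time; the weights and inputs below are random stand-ins, not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, k, f = 7, 5, 3, 4            # tokens, embedding dim, kernel width, filters
X = rng.normal(size=(T, d))        # one sentence: T token embeddings
W = rng.normal(size=(k, d, f))     # convolution kernel

def conv1d(X, W):
    """Valid 1D convolution over time, then ReLU and max-pooling over positions."""
    T, d = X.shape
    k, _, f = W.shape
    out = np.stack([np.einsum("kd,kdf->f", X[t:t + k], W)
                    for t in range(T - k + 1)])           # (T-k+1, f)
    return np.maximum(out, 0.0).max(axis=0)               # pooled feature vector

feat = conv1d(X, W)
```

The pooled vector is what a downstream MLP, as in the abstract, would consume to predict event spans.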
Improving Document Clustering by Eliminating Unnatural Language
Technical documents contain a fair amount of unnatural language, such as
tables, formulas, and pseudo-code. Unnatural language is an important source
of confusion for existing NLP tools. This paper presents an effective method
of distinguishing unnatural language from natural language, and evaluates the
impact of unnatural language detection on NLP tasks such as document
clustering. We view this problem as an information extraction task and build a
multiclass classification model identifying unnatural language components into
four categories. First, we create a new annotated corpus by collecting slides
and papers in various formats, PPT, PDF, and HTML, where unnatural language
components are annotated into four categories. We then explore features
available from plain text to build a statistical model that can handle any
format as long as it is converted into plain text. Our experiments show that
removing unnatural language components yields an absolute improvement of up to
15% in document clustering. Our corpus and tool are publicly available.
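Plain-text features of the kind the abstract describes can be sketched with simple character statistics; the features and threshold below are illustrative heuristics, not the paper's trained multiclass model:

```python
def features(line):
    """Character-level statistics computable from plain text of any source format."""
    n = max(len(line), 1)
    toks = line.split()
    return {
        "symbol_ratio": sum(not c.isalnum() and not c.isspace() for c in line) / n,
        "digit_ratio": sum(c.isdigit() for c in line) / n,
        "mean_token_len": sum(map(len, toks)) / len(toks) if toks else 0.0,
    }

def looks_unnatural(line, symbol_thresh=0.2):
    """Crude binary heuristic: heavy symbol use suggests formula/code/table content."""
    return features(line)["symbol_ratio"] > symbol_thresh

flag_formula = looks_unnatural("E = m*c**2 + sum(x_i^2)")
flag_prose = looks_unnatural("Technical documents contain unnatural language.")
```

A trained classifier would replace the threshold with a model over such features, extended to the four categories mentioned above.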
Analysis of Multilingual Sequence-to-Sequence speech recognition systems
This paper investigates the applications of various multilingual approaches
developed in conventional hidden Markov model (HMM) systems to
sequence-to-sequence (seq2seq) automatic speech recognition (ASR). On a set
composed of Babel data, we first show the effectiveness of multi-lingual
training with stacked bottle-neck (SBN) features. Then we explore various
architectures and training strategies of multi-lingual seq2seq models based on
CTC-attention networks including combinations of output layer, CTC and/or
attention component re-training. We also investigate the effectiveness of
language-transfer learning in a very low resource scenario when the target
language is not included in the original multi-lingual training data.
Interestingly, we found multilingual features superior to multilingual models,
and this finding suggests that we can efficiently combine the benefits of the
HMM system with the seq2seq system through these multilingual feature
techniques.
Comment: arXiv admin note: text overlap with arXiv:1810.0345
Recurrent Neural Network Method in Arabic Words Recognition System
The recognition of unconstrained handwriting continues to be a difficult task
for computers despite active research for several decades. This is because
handwritten text poses great challenges, such as character and word
segmentation, character recognition, variation between handwriting styles,
differing character sizes, the absence of font constraints, and varying
background clarity. This paper primarily discusses online handwriting
recognition methods for Arabic words, a script widely used across the Middle
East and North Africa. Because the characters within an Arabic word are
connected, segmenting an Arabic word is very difficult. We apply a recurrent
neural network to online handwritten Arabic word recognition. The key
innovation is a recently introduced recurrent neural network objective function
known as connectionist temporal classification. The system consists of an
advanced recurrent neural network with an output layer designed for sequence
labeling, partially combined with a probabilistic language model. Experimental
results show that unconstrained Arabic words achieve recognition rates of about
79%, significantly higher than the roughly 70% achieved by a previously
developed hidden Markov model based recognition system.
Comment: 6 Pages, 5 Figures, Vol. 3, Issue 11, pages 43-4
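The connectionist temporal classification objective mentioned above rests on a collapse function that maps a frame-level label path to an output string by merging repeated labels and dropping blanks. A minimal sketch (the label strings are illustrative):

```python
BLANK = "-"   # the CTC blank symbol

def ctc_collapse(path):
    """Map a frame-level label path to its output: merge repeats, drop blanks."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

word = ctc_collapse("ss-aa-l-aa-mm")
```

Training sums the probability of every path that collapses to the target transcription, which is what lets the network learn sequence labeling without pre-segmented characters.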
Robust Layout-aware IE for Visually Rich Documents with Pre-trained Language Models
Many business documents processed in modern NLP and IR pipelines are visually
rich: in addition to text, their semantics can also be captured by visual
traits such as layout, format, and fonts. We study the problem of information
extraction from visually rich documents (VRDs) and present a model that
combines the power of large pre-trained language models and graph neural
networks to efficiently encode both textual and visual information in business
documents. We further introduce new fine-tuning objectives to improve in-domain
unsupervised fine-tuning and to better utilize large amounts of unlabeled
in-domain data. We experiment on real-world invoice and resume data sets and show that
the proposed method outperforms strong text-based RoBERTa baselines by 6.3%
absolute F1 on invoices and 4.7% absolute F1 on resumes. When evaluated in a
few-shot setting, our method requires up to 30x less annotation data than the
baseline to achieve the same level of performance at ~90% F1.
Comment: 10 pages, to appear in SIGIR 2020 Industry Track
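The graph side of the model above can be sketched as nodes for text boxes with edges linking spatially adjacent boxes, so a graph network can mix textual and layout signals; the boxes and distance threshold below are invented for illustration:

```python
# Hypothetical text boxes from a visually rich document (e.g. an invoice),
# each with a text string and a 2D position.
boxes = [
    {"id": 0, "text": "Invoice No.", "x": 10,  "y": 10},
    {"id": 1, "text": "12345",       "x": 120, "y": 10},
    {"id": 2, "text": "Total",       "x": 10,  "y": 200},
    {"id": 3, "text": "$99.00",      "x": 120, "y": 200},
]

def layout_edges(boxes, max_dist=150):
    """Connect box pairs whose positions lie within max_dist (Euclidean)."""
    edges = []
    for i, a in enumerate(boxes):
        for b in boxes[i + 1:]:
            if ((a["x"] - b["x"]) ** 2 + (a["y"] - b["y"]) ** 2) ** 0.5 <= max_dist:
                edges.append((a["id"], b["id"]))
    return edges

edges = layout_edges(boxes)
```

Linking "Invoice No." to its value "12345" by proximity, rather than by token order alone, is exactly the layout signal a pure text baseline like RoBERTa cannot see.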
Overlay Text Extraction From TV News Broadcast
The text data present in overlaid bands convey brief descriptions of news
events in broadcast videos. The process of text extraction becomes challenging
as overlay text is presented in widely varying formats and often with animation
effects. We note that existing edge density based methods are well suited for
our application on account of their simplicity and speed of operation. However,
these methods are sensitive to thresholds and have high false positive rates.
In this paper, we present a contrast enhancement based preprocessing stage for
overlay text detection and a parameter free edge density based scheme for
efficient text band detection. The second contribution of this paper is a novel
approach for multiple text region tracking with a formal identification of all
possible detection failure cases. The tracking stage enables us to establish
the temporal presence of text bands and their linking over time. The third
contribution is the adoption of Tesseract OCR for the specific task of overlay
text recognition using web news articles. The proposed approach is tested and
found superior on news videos acquired from three Indian English television
news channels along with benchmark datasets.
Comment: Published in INDICON 201
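The edge-density band detection described above can be sketched by counting edge pixels per row and flagging rows whose density exceeds the frame average, a parameter-light variant in the spirit of the paper's scheme; the synthetic edge map below is made up:

```python
import numpy as np

# Synthetic binary edge map: a dense horizontal band where a text strip would be.
edge_map = np.zeros((10, 20), dtype=int)
edge_map[6:8, 2:18] = 1

def text_band_rows(edge_map):
    """Return indices of rows whose edge density exceeds the frame mean."""
    density = edge_map.mean(axis=1)                 # per-row edge density
    return np.flatnonzero(density > density.mean()).tolist()

rows = text_band_rows(edge_map)
```

Using the frame's own mean as the cut-off avoids the fixed thresholds that the abstract identifies as the weakness of earlier edge-density methods.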