Geo-Text Data and Data-Driven Geospatial Semantics
Many datasets nowadays contain links between geographic locations and natural
language texts. These links can be geotags, such as geotagged tweets or
geotagged Wikipedia pages, in which location coordinates are explicitly
attached to texts. These links can also be place mentions, such as those in
news articles, travel blogs, or historical archives, in which texts are
implicitly connected to the mentioned places. This kind of data is referred to
as geo-text data. The availability of large amounts of geo-text data brings
both challenges and opportunities. On the one hand, it is challenging to
automatically process this kind of data due to the unstructured texts and the
complex spatial footprints of some places. On the other hand, geo-text data
offers unique research opportunities through the rich information contained in
texts and the special links between texts and geography. As a result, geo-text
data facilitates various studies especially those in data-driven geospatial
semantics. This paper discusses geo-text data and related concepts. With a
focus on data-driven research, this paper systematically reviews a large number
of studies that have discovered multiple types of knowledge from geo-text data.
Based on the literature review, a generalized workflow is extracted and key
challenges for future work are discussed.
Comment: Geography Compass, 201
Automatic Extraction of Causal Relations from Natural Language Texts: A Comprehensive Survey
Automatic extraction of cause-effect relationships from natural language
texts is a challenging open problem in Artificial Intelligence. Most of the
early attempts at its solution used manually constructed linguistic and
syntactic rules on small and domain-specific data sets. However, with the
advent of big data, the availability of affordable computing power and the
recent popularization of machine learning, the paradigm to tackle this problem
has slowly shifted. Machines are now expected to learn generic causal
extraction rules from labelled data with minimal supervision, in a
domain-independent manner. In this paper, we provide a comprehensive survey of
causal relation extraction techniques from both paradigms, and analyse their
relative strengths and weaknesses, with recommendations for future work.
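The shift the survey describes, from hand-written rules to learned extractors, starts from surface patterns like the following. This is a minimal sketch of the early rule-based paradigm, with toy connective patterns of my own rather than any surveyed system's actual rules:

```python
import re

# Toy patterns for a few explicit causal connectives, in the spirit of
# early rule-based systems. They only catch simple "X <connective> Y" forms.
PATTERNS = [
    re.compile(r"(?P<cause>[\w ]+?) (?:causes|leads to|results in) (?P<effect>[\w ]+)"),
    re.compile(r"(?P<effect>[\w ]+?) (?:is caused by|results from) (?P<cause>[\w ]+)"),
]

def extract_causal_pairs(sentence):
    """Return (cause, effect) pairs matched by the surface patterns."""
    pairs = []
    text = sentence.lower().rstrip(".")
    for pattern in PATTERNS:
        for match in pattern.finditer(text):
            pairs.append((match.group("cause").strip(), match.group("effect").strip()))
    return pairs
```

Such patterns are brittle, domain-specific, and miss implicit causality, which is precisely the weakness that motivated the shift to learned extractors the survey covers.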
Intelligent Word Embeddings of Free-Text Radiology Reports
Radiology reports are a rich resource for advancing deep learning
applications in medicine by leveraging the large volume of data continuously
being updated, integrated, and shared. However, there are significant
challenges as well, largely due to the ambiguity and subtlety of natural
language. We propose a hybrid strategy that combines semantic-dictionary
mapping and word2vec modeling for creating dense vector embeddings of free-text
radiology reports. Our method leverages the benefits of both
semantic-dictionary mapping as well as unsupervised learning. Using the vector
representation, we automatically classify the radiology reports into three
classes denoting confidence in the diagnosis of intracranial hemorrhage by the
interpreting radiologist. We performed experiments with varying hyperparameter
settings of the word embeddings and a range of different classifiers. The best
performance was a weighted precision of 88% and a weighted recall of
90%. Our work offers the potential to leverage unstructured electronic health
record data by allowing direct analysis of narrative clinical notes.
Comment: AMIA Annual Symposium 201
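The hybrid strategy of dictionary mapping followed by embedding can be sketched as follows. The dictionary entries and 3-d vectors below are illustrative stand-ins, not the paper's actual vocabulary or a trained word2vec model:

```python
# Hypothetical semantic dictionary mapping report phrasings to a canonical term.
SEMANTIC_DICT = {"bleed": "hemorrhage", "bleeding": "hemorrhage", "ich": "hemorrhage"}

# Toy 3-d "word2vec" vectors; a real system would train these on a report corpus.
VECTORS = {
    "no": [0.1, 0.9, 0.0],
    "acute": [0.7, 0.2, 0.1],
    "hemorrhage": [0.9, 0.1, 0.3],
}

def embed_report(text):
    """Normalize tokens via the dictionary, then average their word vectors
    to obtain a dense report embedding for a downstream classifier."""
    tokens = [SEMANTIC_DICT.get(t, t) for t in text.lower().split()]
    known = [VECTORS[t] for t in tokens if t in VECTORS]
    if not known:
        return [0.0] * 3
    return [sum(dim) / len(known) for dim in zip(*known)]
```

The dictionary step collapses surface variants ("bleed", "ICH") onto one concept before embedding, which is the benefit the abstract claims for combining dictionary mapping with unsupervised learning.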
Detecting and Extracting Events from Text Documents
Events of various kinds are mentioned and discussed in text documents,
whether they are books, news articles, blogs or microblog feeds. The paper
starts by giving an overview of how events are treated in linguistics and
philosophy. We follow this discussion by surveying how events and associated
information are handled computationally. In particular, we look at how textual
documents can be mined to extract events and ancillary information, which these
days is done mostly through the application of various machine learning
techniques. We also discuss applications of event detection and extraction
systems, particularly in summarization, in the medical domain and in the
context of Twitter posts. We end the paper with a discussion of challenges and
future directions.
Comment: This is work in progress. Please email [email protected] with any
comments for improvement.
A self-attention based deep learning method for lesion attribute detection from CT reports
In radiology, radiologists not only detect lesions in medical images but
also describe them with various attributes such as their type, location, size,
shape, and intensity. While these lesion attributes are rich and useful in many
downstream clinical applications, how to extract them from radiology reports
has been less studied. This paper outlines a novel deep learning
method to automatically extract attributes of lesions of interest from the
clinical text. Unlike classical CNN models, we integrate a multi-head
self-attention mechanism to handle long-distance information in the sentence
and to jointly correlate different portions of the sentence representation
subspaces in parallel. Evaluation on an in-house corpus
demonstrates that our method can achieve high performance with 0.848 in
precision, 0.788 in recall, and 0.815 in F-score. The new method and
constructed corpus will enable us to build automatic systems with a
higher-level understanding of the radiological world.
Comment: 5 pages, 2 figures, accepted by the 2019 IEEE International Conference
on Healthcare Informatics (ICHI 2019).
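The core attention computation such a model builds on can be sketched as follows. This is a single head with identity Q/K/V projections for illustration only; the multi-head mechanism described above applies several learned projections in parallel and concatenates their outputs:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with Q = K = V = X.
    Each token's output is a weighted average of all token vectors,
    which is how long-distance information is pulled into one position."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
        weights = softmax(scores)  # each token attends to every token
        out.append([sum(w * v[j] for w, v in zip(weights, X)) for j in range(d)])
    return out
```

Because every token attends to every other token in one step, distance in the sentence does not attenuate the signal the way it does in a fixed-width CNN receptive field.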
Extracting Fairness Policies from Legal Documents
The machine learning community has recently been exploring the implications of
bias and fairness with respect to AI applications. The definition of fairness
for such applications varies with their domain of application. The policies
governing the use of such machine learning systems in a given context are
defined by the constitutional laws of nations and the regulatory policies
enforced by the organizations involved in their usage. Fairness-related laws
and policies are often spread across large documents like constitutions,
agreements, and organizational regulations. These legal documents contain long,
complex sentences in order to achieve rigorousness and robustness. Automatic
extraction of fairness policies, or in general of any specific kind of policy,
from a large legal corpus can be very useful for the study of bias and fairness
in the context of AI applications.
We attempted to automatically extract fairness policies from publicly
available law documents using two approaches based on semantic relatedness. The
experiments reveal how classical WordNet-based similarity and vector-based
similarity differ in addressing this task. We show that similarity based on
word vectors beats the classical approach by a large margin, whereas other
vector representations of senses and sentences fail to even match the classical
baseline. Further, we present a thorough error analysis and reasoning to
explain the results, with appropriate examples from the dataset for deeper
insight.
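The vector-based side of such a comparison reduces to cosine similarity over embeddings. A minimal sketch follows, with toy vectors standing in for pretrained word embeddings; the terms and values are illustrative, not from the paper's dataset:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy vectors standing in for pretrained word embeddings.
EMB = {
    "fairness": [0.9, 0.1, 0.2],
    "discrimination": [0.8, 0.2, 0.3],
    "tax": [0.1, 0.9, 0.1],
}

def rank_by_relatedness(query, terms):
    """Rank candidate terms by cosine similarity to the query embedding."""
    q = EMB[query]
    return sorted(terms, key=lambda t: cosine(EMB[t], q), reverse=True)
```

A WordNet-based alternative would instead score pairs by path distance in the taxonomy, which is the classical baseline the vector approach is reported to beat.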
ML-Net: multi-label classification of biomedical texts with deep neural networks
In multi-label text classification, each textual document can be assigned
one or more labels. Due to this nature, the multi-label text
classification task is often considered to be more challenging compared to the
binary or multi-class text classification problems. As an important task with
broad applications in biomedicine such as assigning diagnosis codes, a number
of different computational methods (e.g. training and combining binary
classifiers for each label) have been proposed in recent years. However, many
suffered from modest accuracy and efficiency, with only limited success in
practical use. We propose ML-Net, a novel deep learning framework, for
multi-label classification of biomedical texts. As an end-to-end system, ML-Net
combines a label prediction network with an automated label count prediction
mechanism to output an optimal set of labels by leveraging both predicted
confidence score of each label and the contextual information in the target
document. We evaluate ML-Net on three independent, publicly-available corpora
in two kinds of text genres: biomedical literature and clinical notes. For
evaluation, example-based measures such as precision, recall and f-measure are
used. ML-Net is compared with several competitive machine learning baseline
models. Our benchmarking results show that ML-Net compares favorably to the
state-of-the-art methods in multi-label classification of biomedical texts.
ML-Net is also shown to be robust when evaluated on different text genres in
biomedicine. Unlike traditional machine learning methods, ML-Net does not
require human effort in feature engineering and is a highly efficient and
scalable approach for tasks with a large set of labels (there is no need to
build individual classifiers for each separate label). Finally, ML-Net is able
to dynamically estimate the label count based on the document context in a more
systematic and accurate manner.
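The idea of pairing per-label confidence scores with a label count can be sketched as follows. Here the count k is simply given, whereas ML-Net predicts it with a dedicated network; the label names are hypothetical:

```python
def top_k_labels(scores, k):
    """Output the k labels with the highest predicted confidence.

    This mimics the structure of pairing a label prediction network
    (the per-label scores) with a label count prediction mechanism
    (the value of k) to emit an optimal label set per document."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])
```

Predicting k per document, rather than applying one global score threshold, is what lets the label set size adapt to the document's context.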
Accurate Scene Text Detection through Border Semantics Awareness and Bootstrapping
This paper presents a scene text detection technique that exploits
bootstrapping and text border semantics for accurate localization of texts in
scenes. A novel bootstrapping technique is designed which samples multiple
'subsections' of a word or text line and accordingly relieves the constraint of
limited training data effectively. At the same time, the repeated sampling of
text 'subsections' improves the consistency of the predicted text feature maps
which is critical in predicting a single complete instead of multiple broken
boxes for long words or text lines. In addition, a semantics-aware text border
detection technique is designed which produces four types of text border
segments for each scene text. With semantics-aware text borders, scene texts
can be localized more accurately by regressing text pixels around the ends of
words or text lines instead of all text pixels, which often leads to
inaccurate localization when dealing with long words or text lines. Extensive
experiments demonstrate the effectiveness of the proposed techniques, and
superior performance is obtained on several public datasets, e.g. an f-score of
80.1 on MSRA-TD500 and of 67.1 on ICDAR2017-RCTW.
Comment: 14 pages, 8 figures, accepted by ECCV 201
Hierarchical RNN for Information Extraction from Lawsuit Documents
Every lawsuit document contains information about the party's claim, the
court's analysis, the decision, and more, all of which is helpful for
understanding the case better and predicting the judge's decision on similar
cases in the future. However, extracting this information from the document is
difficult because the language is complicated and the sentences vary in length.
We treat this problem as a sequence labeling task, and this paper presents the
first research to extract relevant information from civil lawsuit documents in
China with a hierarchical RNN framework.
Comment: IMECS201
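A sequence labeler's output is typically decoded from BIO tags into labeled spans. The sketch below shows that decoding step with hypothetical tag names, not the paper's actual label set:

```python
def bio_to_spans(tokens, tags):
    """Convert per-token BIO tags from a sequence labeler into
    (label, text) spans, e.g. the court's decision sentence."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:  # close any open span
                spans.append((label, " ".join(tokens[start:i])))
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == label:
            continue  # span continues
        else:
            if start is not None:
                spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
    if start is not None:  # close a span that runs to the end
        spans.append((label, " ".join(tokens[start:])))
    return spans
```

The RNN (hierarchical over sentences and tokens, in the paper's case) produces the tag sequence; this deterministic decoding then recovers the extracted passages.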
Text Summarization in the Biomedical Domain
This chapter gives an overview of recent advances in the field of biomedical
text summarization. Different types of challenges are introduced, and methods
are discussed concerning the type of challenge that they address. Biomedical
literature summarization is explored as a leading trend in the field, and some
future lines of work are pointed out. Underlying methods of recent
summarization systems are briefly explained and the most significant evaluation
results are mentioned. The primary purpose of this chapter is to review the
most significant research efforts made in the current decade toward new methods
of biomedical text summarization. As the main parts of this chapter, current
trends are discussed and new challenges are introduced.