A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques
The amount of text that is generated every day is increasing dramatically.
This tremendous volume of mostly unstructured text cannot be simply processed
and perceived by computers. Therefore, efficient and effective techniques and
algorithms are required to discover useful patterns. Text mining is the task of
extracting meaningful information from text, which has gained significant
attention in recent years. In this paper, we describe several of the most
fundamental text mining tasks and techniques, including text pre-processing,
classification and clustering. Additionally, we briefly explain text mining in
biomedical and health care domains.
Comment: the format of some references has been updated
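To make the pre-processing, classification and clustering steps named above concrete, here is a minimal illustrative sketch in Python with scikit-learn. The toy documents and labels are hypothetical, and this pipeline is only one common way to realize these steps, not the specific techniques surveyed in the paper.

```python
# Minimal text mining sketch: pre-processing (TF-IDF), classification, clustering.
# The documents and labels below are hypothetical toy data for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

docs = [
    "patient shows elevated blood glucose",
    "tumor biopsy reveals malignant cells",
    "glucose levels respond to insulin therapy",
    "radiation therapy targets the tumor site",
]
labels = ["diabetes", "cancer", "diabetes", "cancer"]  # hypothetical classes

# Pre-processing: lowercase, tokenize, drop stop words, weight terms by TF-IDF.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(docs)

# Classification: supervised assignment of documents to known categories.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(vectorizer.transform(["insulin controls blood glucose"])))

# Clustering: unsupervised grouping of documents without labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters)
```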
Scientific Discovery by Machine Intelligence: A New Avenue for Drug Research
The majority of big data is unstructured and of this majority the largest
chunk is text. While data mining techniques are well developed and standardized
for structured, numerical data, the realm of unstructured data is still largely
unexplored. The general focus lies on information extraction, which attempts to
retrieve known information from text. The Holy Grail, however, is knowledge
discovery, where machines are expected to unearth entirely new facts and
relations that were not previously known by any human expert. Indeed,
understanding the meaning of text is often considered one of the main
characteristics of human intelligence. The ultimate goal of semantic artificial
intelligence is to devise software that can understand the meaning of free
text, at least in the practical sense of providing new, actionable information
condensed out of a body of documents. As a stepping stone on the road to this
vision I will introduce a totally new approach to drug research, namely that of
identifying relevant information by employing a self-organizing semantic engine
to text mine large repositories of biomedical research papers, a technique
pioneered by Merck with the InfoCodex software. I will describe the methodology
and a first successful experiment for the discovery of new biomarkers and
phenotypes for diabetes and obesity on the basis of PubMed abstracts, public
clinical trials and Merck internal documents. The reported approach shows much
promise and has the potential to fundamentally impact pharmaceutical research,
both as a way to shorten the time-to-market of novel drugs and as a means of
recognizing dead ends early.
Fuzzy Approach Topic Discovery in Health and Medical Corpora
The majority of medical documents and electronic health records (EHRs) are in
text format, which poses a challenge for data processing and finding relevant
documents. Looking for ways to automatically retrieve the enormous amount of
health and medical knowledge has always been an intriguing topic. Powerful
methods have been developed in recent years to make the text processing
automatic. One of the popular approaches to retrieving information based on
discovering the themes in health and medical corpora is topic modeling; however,
this approach still needs new perspectives. In this research we describe fuzzy
latent semantic analysis (FLSA), a novel approach to topic modeling from a fuzzy
perspective. FLSA can handle the redundancy issue in health and medical corpora
and provides a new method to estimate the number of topics. The quantitative
evaluations show that FLSA produces superior performance and features compared
to latent Dirichlet allocation (LDA), the most popular topic model.
Comment: 12 pages, International Journal of Fuzzy Systems, 201
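The abstract does not spell out the FLSA algorithm, so the following sketch only illustrates the general idea of fuzzy (soft) topic assignment over a latent semantic space: documents are projected with truncated SVD and receive graded membership in every topic rather than a single hard label. The toy corpus, the topic count, and the softmax-over-distances membership rule are illustrative assumptions, not the paper's method.

```python
# Illustrative fuzzy topic assignment over a latent semantic space (not the paper's FLSA).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

docs = [  # hypothetical toy corpus
    "diabetes patients monitor blood glucose daily",
    "insulin therapy lowers blood glucose",
    "obesity increases the risk of diabetes",
    "clinical notes describe tumor progression",
]

# Latent semantic space: TF-IDF followed by truncated SVD (classic LSA).
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
Z = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)

# Topic centers from k-means; fuzzy memberships via a softmax over negative distances.
k = 2
centers = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z).cluster_centers_
dist = np.linalg.norm(Z[:, None, :] - centers[None, :, :], axis=2)
membership = np.exp(-dist) / np.exp(-dist).sum(axis=1, keepdims=True)

print(np.round(membership, 2))  # each row sums to 1: graded topic membership per document
```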
Text2Node: a Cross-Domain System for Mapping Arbitrary Phrases to a Taxonomy
Electronic health record (EHR) systems are used extensively throughout the
healthcare domain. However, data interchangeability between EHR systems is
limited due to the use of different coding standards across systems. Existing
methods of mapping between coding standards, based on manual mapping by human
experts, dictionary mapping, symbolic NLP and classification, are unscalable and
cannot accommodate large-scale EHR datasets.
In this work, we present Text2Node, a cross-domain mapping system capable of
mapping medical phrases to concepts in a large taxonomy (such as SNOMED CT).
The system is designed to generalize from a limited set of training samples and
map phrases to elements of the taxonomy that are not covered by training data.
As a result, our system is scalable, robust to wording variants between coding
systems and can output highly relevant concepts when no exact concept exists in
the target taxonomy. Text2Node operates in three main stages: first, the
lexicon is mapped to word embeddings; second, the taxonomy is vectorized using
node embeddings; and finally, the mapping function is trained to connect the
two embedding spaces. We compared multiple algorithms and architectures for
each stage of the training, including GloVe and FastText word embeddings, CNN
and Bi-LSTM mapping functions, and node2vec for node embeddings. We confirmed
the robustness and generalisation properties of Text2Node by mapping ICD-9-CM
Diagnosis phrases to SNOMED CT and by zero-shot training at comparable
accuracy.
This system is a novel methodological contribution to the task of normalizing
and linking phrases to a taxonomy, advancing data interchangeability in
healthcare. When applied, the system can use electronic health records to
generate an embedding that incorporates taxonomical medical knowledge to
improve clinical predictive models.
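As an illustration of the three-stage idea (word embeddings for the lexicon, node embeddings for the taxonomy, and a trained mapping between the two spaces), here is a deliberately simplified sketch. Phrases are embedded by averaging word vectors, the mapping is a linear least-squares fit rather than the CNN or Bi-LSTM mappers compared in the paper, and all vectors and taxonomy nodes are random hypothetical stand-ins for real GloVe/FastText and node2vec embeddings.

```python
# Simplified Text2Node-style mapping: phrase embedding -> linear map -> nearest taxonomy node.
# All embeddings below are random stand-ins for trained word/node embeddings.
import numpy as np

rng = np.random.default_rng(0)
word_dim, node_dim = 50, 32

word_vecs = {w: rng.normal(size=word_dim) for w in
             ["acute", "myocardial", "infarction", "type", "2", "diabetes", "mellitus"]}
node_names = ["Myocardial infarction", "Diabetes mellitus type 2", "Hypertension"]
node_vecs = rng.normal(size=(len(node_names), node_dim))  # stand-in for node2vec output

def embed_phrase(phrase):
    # Stage 1: map the lexicon to word embeddings and average them per phrase.
    vecs = [word_vecs[w] for w in phrase.lower().split() if w in word_vecs]
    return np.mean(vecs, axis=0)

# Stage 3: fit the mapping function (here ordinary least squares) on a few training pairs.
train_phrases = ["acute myocardial infarction", "type 2 diabetes mellitus"]
train_targets = node_vecs[[0, 1]]
P = np.stack([embed_phrase(p) for p in train_phrases])
W, *_ = np.linalg.lstsq(P, train_targets, rcond=None)

# Inference: embed a new phrase, project it, and return the nearest node by cosine similarity.
query = embed_phrase("diabetes mellitus type 2") @ W
sims = node_vecs @ query / (np.linalg.norm(node_vecs, axis=1) * np.linalg.norm(query))
print(node_names[int(np.argmax(sims))])
```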
A Study of Recent Contributions on Information Extraction
This paper reports on modern approaches in Information Extraction (IE) and
its two main sub-tasks of Named Entity Recognition (NER) and Relation
Extraction (RE). Basic concepts and the most recent approaches in this area are
reviewed, which mainly include Machine Learning (ML) based approaches and the
more recent trend toward Deep Learning (DL) based methods.
Unsupervised Extraction of Phenotypes from Cancer Clinical Notes for Association Studies
The recent adoption of Electronic Health Records (EHRs) by health care
providers has introduced an important source of data that provides detailed and
highly specific insights into patient phenotypes over large cohorts. These
datasets, in combination with machine learning and statistical approaches,
generate new opportunities for research and clinical care. However, many
methods require the patient representations to be in structured formats, while
the information in the EHR is often locked in unstructured texts designed for
human readability. In this work, we develop the methodology to automatically
extract clinical features from clinical narratives from large EHR corpora
without the need for prior knowledge. We consider medical terms and sentences
appearing in clinical narratives as atomic information units. We propose an
efficient clustering strategy suitable for the analysis of large text corpora
and utilize the clusters to represent information about the patient
compactly. To demonstrate the utility of our approach, we perform an
association study of clinical features with somatic mutation profiles from
4,007 cancer patients and their tumors. We apply the proposed algorithm to a
dataset consisting of about 65 thousand documents with a total of about 3.2
million sentences. We identify 341 significant statistical associations between
the presence of somatic mutations and clinical features. We annotate these
associations according to their novelty and report several known associations.
We also propose 32 testable hypotheses for which the underlying biological
mechanism appears plausible but is not yet known. These results illustrate
that the automated discovery of clinical features is possible and that the joint
analysis of clinical and genetic datasets can generate appealing new
hypotheses.
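A heavily simplified sketch of the two ingredients described above: clustering sentence vectors into candidate clinical features, then testing each feature for association with a somatic mutation using Fisher's exact test. The toy sentence vectors, patient assignments and mutation labels are hypothetical; the paper's actual clustering strategy and corpus are far larger.

```python
# Toy version: cluster sentence vectors into candidate clinical features, then test
# each feature for association with a somatic mutation (Fisher's exact test).
import numpy as np
from sklearn.cluster import KMeans
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)
n_patients, n_sent_per_patient, dim, n_features = 40, 5, 16, 3

# Hypothetical sentence embeddings and the patient each sentence came from.
sent_vecs = rng.normal(size=(n_patients * n_sent_per_patient, dim))
sent_patient = np.repeat(np.arange(n_patients), n_sent_per_patient)
mutated = rng.integers(0, 2, size=n_patients).astype(bool)  # hypothetical mutation status

# Cluster sentences; each cluster acts as one automatically discovered clinical feature.
cluster_id = KMeans(n_clusters=n_features, n_init=10, random_state=0).fit_predict(sent_vecs)

for c in range(n_features):
    # A patient "has" feature c if any of their sentences falls into cluster c.
    has_feature = np.zeros(n_patients, dtype=bool)
    has_feature[np.unique(sent_patient[cluster_id == c])] = True
    table = [[np.sum(has_feature & mutated), np.sum(has_feature & ~mutated)],
             [np.sum(~has_feature & mutated), np.sum(~has_feature & ~mutated)]]
    _, p_value = fisher_exact(table)
    print(f"feature {c}: p = {p_value:.3f}")
```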
Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis
The past decade has seen an explosion in the amount of digital information
stored in electronic health records (EHR). While primarily designed for
archiving patient clinical information and administrative healthcare tasks,
many researchers have found secondary use of these records for various clinical
informatics tasks. Over the same period, the machine learning community has
seen widespread advances in deep learning techniques, which also have been
successfully applied to the vast amount of EHR data. In this paper, we review
these deep EHR systems, examining architectures, technical aspects, and
clinical applications. We also identify shortcomings of current techniques and
discuss avenues of future research for EHR-based deep learning.
Comment: Accepted for publication in the Journal of Biomedical and Health
Informatics: http://ieeexplore.ieee.org/abstract/document/8086133
Toward Interpretable Topic Discovery via Anchored Correlation Explanation
Many predictive tasks, such as diagnosing a patient based on their medical
chart, are ultimately defined by the decisions of human experts. Unfortunately,
encoding experts' knowledge is often time consuming and expensive. We propose a
simple way to use fuzzy and informal knowledge from experts to guide discovery
of interpretable latent topics in text. The underlying intuition of our
approach is that latent factors should be informative about both correlations
in the data and a set of relevance variables specified by an expert.
Mathematically, this approach is a combination of the information bottleneck
and Total Correlation Explanation (CorEx). We give a preliminary evaluation of
Anchored CorEx, showing that it produces more coherent and interpretable topics
on two distinct corpora.
Comment: presented at the 2016 ICML Workshop on #Data4Good: Machine Learning in
Social Good Applications, New York, N
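For readers who want to try anchored topic discovery, the open-source corextopic package implements (Anchored) CorEx. The snippet below assumes that package's API as of recent releases and uses a hypothetical toy corpus and anchor words, so treat it as a sketch rather than the authors' exact experimental setup.

```python
# Sketch of anchored topic discovery with the corextopic package (pip install corextopic).
# Corpus and anchor words are hypothetical; API assumed from recent corextopic releases.
from sklearn.feature_extraction.text import CountVectorizer
from corextopic import corextopic as ct

docs = [
    "patient reports chest pain and shortness of breath",
    "blood glucose remains high despite insulin",
    "persistent cough and shortness of breath noted",
    "diabetes management includes diet and insulin",
]

vectorizer = CountVectorizer(stop_words="english", binary=True)
X = vectorizer.fit_transform(docs)
words = list(vectorizer.get_feature_names_out())

# Anchors inject fuzzy expert knowledge: nudge topic 0 toward respiratory terms and
# topic 1 toward diabetes terms, while CorEx still explains correlations in the data.
model = ct.Corex(n_hidden=2, seed=0)
model.fit(X, words=words, anchors=[["breath", "cough"], ["insulin", "glucose"]],
          anchor_strength=3)

for i, topic in enumerate(model.get_topics()):
    print(i, [w for w, *_ in topic])
```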
Relation Extraction : A Survey
With the advent of the Internet, large amount of digital text is generated
everyday in the form of news articles, research publications, blogs, question
answering forums and social media. It is important to develop techniques for
extracting information automatically from these documents, as a lot of important
information is hidden within them. This extracted information can be used to
improve access and management of knowledge hidden in large text corpora.
Several applications such as Question Answering, Information Retrieval would
benefit from this information. Entities such as persons and organizations form
the most basic units of information. Occurrences of entities in a sentence
are often linked through well-defined relations; e.g., occurrences of a person
and an organization in a sentence may be linked through relations such as "employed
at". The task of Relation Extraction (RE) is to identify such relations
automatically. In this paper, we survey several important supervised,
semi-supervised and unsupervised RE techniques. We also cover the paradigms of
Open Information Extraction (OIE) and Distant Supervision. Finally, we describe
some of the recent trends in the RE techniques and possible future research
directions. This survey would be useful for three kinds of readers: i)
Newcomers in the field who want to quickly learn about RE; ii) Researchers who
want to know how the various RE techniques have evolved over time and what the
possible future research directions are; and iii) Practitioners who just need to
know which RE technique works best in various settings.
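As a minimal illustration of the supervised RE setting surveyed here, the sketch below trains a bag-of-words classifier on the text between two entity mentions to predict a relation label such as "employed_at". The training snippets and labels are hypothetical toy data; real systems would use the feature-rich or neural approaches covered in the survey.

```python
# Toy supervised relation extraction: classify the words between two entity mentions.
# Snippets between a PERSON and an ORGANIZATION mention, with hypothetical labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

train = [
    ("works as an engineer at", "employed_at"),
    ("is employed by", "employed_at"),
    ("was born near the offices of", "no_relation"),
    ("visited the headquarters of", "no_relation"),
]
texts, labels = zip(*train)

# Unigram and bigram features over the connecting context, then a linear classifier.
clf = make_pipeline(CountVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

# Predict the relation holding between two new mentions from their connecting context.
print(clf.predict(["joined the research team at"]))
```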
Scientific Article Summarization Using Citation-Context and Article's Discourse Structure
We propose a summarization approach for scientific articles which takes
advantage of citation-context and the document discourse model. While citations
have been previously used in generating scientific summaries, they lack the
related context from the referenced article and therefore do not accurately
reflect the article's content. Our method overcomes the problem of
inconsistency between the citation summary and the article's content by
providing context for each citation. We also leverage the inherent scientific
article's discourse for producing better summaries. We show that our proposed
method effectively improves over existing summarization approaches (greater
than 30% improvement over the best-performing baseline) in terms of
ROUGE scores on the TAC2014 scientific summarization dataset. While the
dataset we use for evaluation is in the biomedical domain, most of our
approaches are general and therefore adaptable to other domains.
Comment: EMNLP 201
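To illustrate the core intuition of scoring article sentences by their relevance to citation contexts, here is a small extractive sketch: sentences from the referenced article are ranked by cosine similarity to the citation contexts and the top ones form the summary. The sentences and citation contexts are hypothetical, and the discourse-model component of the actual method is omitted.

```python
# Extractive sketch: rank article sentences by similarity to citation contexts.
# Article sentences and citation contexts below are hypothetical toy examples.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

article_sentences = [
    "We introduce a new attention mechanism for sequence models.",
    "Experiments were run on three benchmark datasets.",
    "The attention mechanism improves accuracy by a large margin.",
    "Future work includes extending the model to multilingual settings.",
]
citation_contexts = [
    "Their attention mechanism significantly improved accuracy.",
    "The proposed attention model outperformed prior sequence models.",
]

vectorizer = TfidfVectorizer(stop_words="english")
S = vectorizer.fit_transform(article_sentences)
C = vectorizer.transform(citation_contexts)

# Score each article sentence by its mean similarity to all citation contexts.
scores = cosine_similarity(S, C).mean(axis=1)
top = np.argsort(scores)[::-1][:2]
summary = [article_sentences[i] for i in sorted(top)]
print("\n".join(summary))
```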