An enhanced computational feature selection method for medical synonym identification via bilingualism and multi-corpus training
Medical synonym identification has been an important part of medical natural
language processing (NLP). However, in the field of Chinese medical synonym
identification, there are problems like low precision and low recall rate. To
solve the problem, in this paper, we propose a method for identifying Chinese
medical synonyms. We first selected 13 features including Chinese and English
features. Then we studied the synonym identification results of each feature
alone and different combinations of the features. Through the comparison among
identification results, we present an optimal combination of features for
Chinese medical synonym identification. Experiments show that our selected
features achieve 97.37% precision, 96.00% recall, and a 97.33% F1 score.
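As background for the reported metrics, a minimal sketch of how precision, recall, and F1 are computed from raw counts; the counts below are hypothetical, not the paper's data:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from raw true-positive / false-positive /
    false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts for a synonym-pair classifier (not the paper's data):
p, r, f1 = precision_recall_f1(tp=96, fp=3, fn=4)
print(round(p, 4), round(r, 4), round(f1, 4))
```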
Improving Document Clustering by Eliminating Unnatural Language
Technical documents contain a fair amount of unnatural language, such as
tables, formulas, and pseudo-code. Unnatural language can be an important
source of confusion for existing NLP tools. This paper presents an effective method
of distinguishing unnatural language from natural language, and evaluates the
impact of unnatural language detection on NLP tasks such as document
clustering. We view this problem as an information extraction task and build a
multiclass classification model that classifies unnatural language components into
four categories. First, we create a new annotated corpus by collecting slides
and papers in various formats, PPT, PDF, and HTML, where unnatural language
components are annotated into four categories. We then explore features
available from plain text to build a statistical model that can handle any
format as long as it is converted into plain text. Our experiments show that
removing unnatural language components yields an absolute improvement of up to
15% in document clustering. Our corpus and tool are publicly available.
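As an illustration of the task (not the paper's statistical model), even a crude symbol-ratio heuristic separates some unnatural-language lines from prose:

```python
def looks_unnatural(line: str, threshold: float = 0.3) -> bool:
    """Heuristic stand-in for a learned classifier: flag a line as
    unnatural language when non-alphabetic symbols dominate it."""
    if not line.strip():
        return False
    symbols = sum(1 for ch in line if not (ch.isalpha() or ch.isspace()))
    return symbols / len(line) > threshold

doc = [
    "We evaluate clustering quality on technical documents.",
    "| col1 | col2 | col3 |",       # table row
    "f(x) = x**2 + 3*x - 7",        # formula
    "The method improves accuracy.",
]
natural = [ln for ln in doc if not looks_unnatural(ln)]
print(natural)
```

A real system, like the one described above, would learn such distinctions from an annotated corpus rather than rely on one fixed ratio.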
iParaphrasing: Extracting Visually Grounded Paraphrases via an Image
A paraphrase is a restatement of the meaning of a text in other words.
Paraphrases have been studied to enhance the performance of many natural
language processing tasks. In this paper, we propose a novel task iParaphrasing
to extract visually grounded paraphrases (VGPs), which are different phrasal
expressions describing the same visual concept in an image. These extracted
VGPs have the potential to improve language and image multimodal tasks such as
visual question answering and image captioning. Modeling the similarity
between VGPs is the key to iParaphrasing. We apply various existing methods as
well as propose a novel neural network-based method with image attention, and
report the results of the first attempt toward iParaphrasing.
Comment: COLING 201
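A common baseline for modeling VGP similarity is cosine similarity between phrase embeddings; a minimal sketch with toy vectors (the embeddings are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two phrase-embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy embeddings for two phrases describing the same visual concept:
a_man = [0.9, 0.1, 0.3]
the_guy = [0.8, 0.2, 0.4]
print(round(cosine(a_man, the_guy), 3))
```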
Text Classification Algorithms: A Survey
In recent years, there has been an exponential growth in the number of
complex documents and texts that require a deeper understanding of machine
learning methods to be able to accurately classify texts in many applications.
Many machine learning approaches have achieved impressive results in natural
language processing. The success of these learning algorithms relies on their
capacity to understand complex models and non-linear relationships within data.
However, finding suitable structures, architectures, and techniques for text
classification is a challenge for researchers. This paper gives a brief
overview of text classification algorithms, covering different text feature
extraction methods, dimensionality reduction methods, existing algorithms and
techniques, and evaluation methods. Finally, the limitations of each technique
and its applications to real-world problems are discussed.
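A typical first stage in the pipelines such a survey covers is feature extraction; a minimal TF-IDF sketch using only the standard library (not tied to any particular surveyed system):

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal TF-IDF feature extraction: term frequency scaled by the
    log inverse document frequency of each term."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * math.log(n / df[t])
                        for t in tf})
    return vectors

docs = ["the cat sat", "the dog barked", "the cat purred"]
vecs = tfidf(docs)
print(vecs[0])  # "the" occurs everywhere, so its weight is zero
```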
Phonetic-enriched Text Representation for Chinese Sentiment Analysis with Reinforcement Learning
The Chinese pronunciation system offers two characteristics that distinguish
it from other languages: deep phonemic orthography and intonation variations.
We are the first to argue that these two important properties can play a major
role in Chinese sentiment analysis. In particular, we propose two effective
features to encode phonetic information. We then develop a Disambiguate
Intonation for Sentiment Analysis (DISA) network based on reinforcement
learning, which disambiguates the intonation of each Chinese character
(pinyin) so that a precise phonetic representation of Chinese is learned. Furthermore, we
also fuse phonetic features with textual and visual features in order to mimic
the way humans read and understand Chinese text. Experimental results on five
different Chinese sentiment analysis datasets show that the inclusion of
phonetic features significantly and consistently improves the performance of
textual and visual representations, and outperforms state-of-the-art Chinese
character-level representations.
dpUGC: Learn Differentially Private Representation for User Generated Contents
This paper first proposes a simple yet efficient generalized approach for
applying differential privacy to text representations (i.e., word embeddings).
Building on it, we propose a user-level approach to learning personalized,
differentially private word embedding models on user-generated content (UGC).
To the best of our knowledge, this is the first work on learning user-level
differentially private word embedding models from text for sharing. The
proposed approaches protect individuals from re-identification and, in
particular, provide a better trade-off between privacy and data utility on
shared UGC data. The experimental
results show that the trained embedding models are applicable for the classic
text analysis tasks (e.g., regression). Moreover, the proposed approaches to
learning differentially private embedding models are both framework- and
data-independent, which facilitates deployment and sharing. The source code is
available at https://github.com/sonvx/dpText.
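For intuition, a generic sketch of the Laplace mechanism applied to an embedding vector; this illustrates differential privacy in general, not dpUGC's actual training procedure:

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample from a Laplace(0, scale) distribution via the inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def privatize(vec, sensitivity=1.0, epsilon=1.0, seed=0):
    """Laplace mechanism: adding Laplace(sensitivity/epsilon) noise to each
    coordinate gives epsilon-differential privacy for a release with the
    given L1 sensitivity. A generic sketch, not dpUGC's exact procedure."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in vec]

embedding = [0.12, -0.53, 0.88]
print(privatize(embedding, epsilon=2.0))
```

Smaller epsilon means stronger privacy but noisier (less useful) vectors, which is the privacy/utility trade-off the abstract refers to.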
Personalized sentence generation using generative adversarial networks with author-specific word usage
Author-specific word usage is a vital feature that lets readers perceive the
writing style of an author. In this work, a personalized sentence generation
method based on generative adversarial networks (GANs) is proposed to address
this issue. Frequently used function words and content words are incorporated
not only as input features but also as sentence-structure constraints for GAN
training. For generating sentences on topics chosen by the user, Named Entity
Recognition (NER) information from the input words is also used in network
training. We compared the proposed method with existing GAN-based sentence
generation methods, and the experimental results showed that sentences
generated by our method are more similar to the original sentences of the same
author under objective evaluations such as BLEU and SimHash score.
Comment: slightly changed version of the paper accepted to the CICLing 2019
conference
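Since SimHash serves here as an objective similarity score, a minimal sketch of a standard 64-bit SimHash (not necessarily the exact variant used in the evaluation):

```python
import hashlib

def simhash(tokens, bits=64):
    """64-bit SimHash fingerprint: similar token sequences yield fingerprints
    with small Hamming distance. md5 is used only as a stable token hash."""
    weights = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

s1 = simhash("the quick brown fox jumps".split())
s2 = simhash("the quick brown fox leaps".split())   # one word changed
s3 = simhash("completely unrelated sentence here now".split())
print(hamming(s1, s2), hamming(s1, s3))
```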
Machine Learning with World Knowledge: The Position and Survey
Machine learning has become pervasive in multiple domains, impacting a wide
variety of applications, such as knowledge discovery and data mining, natural
language processing, information retrieval, computer vision, social and health
informatics, ubiquitous computing, etc. Two essential problems of machine
learning are how to generate features and how to acquire labels for machines to
learn. In particular, labeling large amounts of data for each domain-specific
problem can be very time-consuming and costly. This has become a key obstacle to
making learning protocols realistic in applications. In this paper, we will
discuss how to use the existing general-purpose world knowledge to enhance
machine learning processes, by enriching the features or reducing the labeling
work. We start from the comparison of world knowledge with domain-specific
knowledge, and then introduce three key problems in using world knowledge in
learning processes, i.e., explicit and implicit feature representation,
inference for knowledge linking and disambiguation, and learning with direct or
indirect supervision. Finally, we discuss future directions for this research
topic.
Location reference identification from tweets during emergencies: A deep learning approach
Twitter is increasingly used during crises to communicate with officials and
to support rescue and relief operations in real time. The geographical location
of the event, as well as of the users, is vitally important in such scenarios.
Identifying geographic locations is challenging because the location metadata
fields, such as the user location and the place name of a tweet, are not
reliable. Extracting location information from tweet text is difficult, as the
text contains a lot of non-standard English, grammatical errors, spelling
mistakes, non-standard abbreviations, and so on. This research
aims to extract location words used in the tweet using a Convolutional Neural
Network (CNN) based model. We achieved the exact matching score of 0.929,
Hamming loss of 0.002, and -score of 0.96 for the tweets related to the
earthquake. Our model was able to extract even three- to four-word long
location references which is also evident from the exact matching score of over
92\%. The findings of this paper can help in early event localization,
emergency situations, real-time road traffic management, localized
advertisement, and in various location-based services
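The two reported metrics can be computed as follows; the per-token labels below are toy values, not the paper's data:

```python
def exact_match(y_true, y_pred):
    """Fraction of tweets whose full label sequence is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual token labels that are wrong."""
    wrong = total = 0
    for t, p in zip(y_true, y_pred):
        wrong += sum(a != b for a, b in zip(t, p))
        total += len(t)
    return wrong / total

# Toy per-token labels (1 = location word) for three tweets:
gold = [[0, 1, 1, 0], [0, 0, 1], [1, 0]]
pred = [[0, 1, 1, 0], [0, 1, 1], [1, 0]]
print(exact_match(gold, pred), round(hamming_loss(gold, pred), 3))
```

Exact match is the stricter metric: one wrong token label invalidates the whole tweet, while Hamming loss counts only that single token.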
A Case Study on the Impact of Similarity Measure on Information Retrieval based Software Engineering Tasks
Information Retrieval (IR) plays a pivotal role in diverse Software
Engineering (SE) tasks, e.g., bug localization and triaging, code retrieval,
requirements analysis, etc. The choice of similarity measure is the core
component of an IR technique. The performance of any IR method critically
depends on selecting an appropriate similarity measure for the given
application domain. Since different SE tasks operate on different document
types like bug reports, software descriptions, source code, etc. that often
contain non-standard domain-specific vocabulary, it is essential to understand
which similarity measures work best for different SE documents.
This paper presents two case studies on the effect of different similarity
measures on various SE documents w.r.t. two tasks: (i) project recommendation,
finding similar GitHub projects, and (ii) bug localization, retrieving the
buggy source file(s) corresponding to a bug report. These tasks contain a diverse
combination of textual (i.e. description, readme) and code (i.e. source code,
API, import package) artifacts. We observe that the performance of IR models
varies when applied to different artifact types. We find that, in general, the
context-aware models achieve better performance on textual artifacts. In
contrast, simple keyword-based bag-of-words models perform better on code
artifacts. On the other hand, the probabilistic ranking model BM25 performs
better on a mixture of text and code artifacts.
We further investigate how such an informed choice of similarity measure
impacts the performance of SE tools. In particular, we analyze two previously
proposed tools for project recommendation and bug localization tasks, which
leverage diverse software artifacts, and observe that an informed choice of
similarity measure indeed leads to improved performance of the existing SE
tools.
Comment: 22 pages, on submission
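For reference, a minimal sketch of the BM25 ranking the study evaluates, using standard Okapi BM25 with common defaults (k1 = 1.2, b = 0.75) and a toy mixed text/code corpus:

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query
    (the probabilistic ranking model discussed above)."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = ["parse bug report".split(),
          "import java util list".split(),
          "bug in parser module".split()]
query = "bug parser".split()
ranked = sorted(corpus, key=lambda d: bm25_score(query, d, corpus), reverse=True)
print(ranked[0])
```

The term-frequency saturation (k1) and length normalization (b) are what let BM25 cope with the very different token statistics of text and code artifacts.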