An enhanced computational feature selection method for medical synonym identification via bilingualism and multi-corpus training
Medical synonym identification has been an important part of medical natural
language processing (NLP). However, in the field of Chinese medical synonym
identification, there are problems like low precision and low recall rate. To
solve the problem, in this paper, we propose a method for identifying Chinese
medical synonyms. We first selected 13 features including Chinese and English
features. Then we studied the synonym identification results of each feature
alone and different combinations of the features. Through the comparison among
identification results, we present an optimal combination of features for
Chinese medical synonym identification. Experiments show that our selected
features achieve 97.37% precision, 96.00% recall, and a 97.33% F1 score.
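As background for the reported metrics, a minimal sketch of how precision, recall, and F1 are computed from raw counts; the counts below are hypothetical, not the paper's data:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from raw true-positive / false-positive /
    false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical counts for a synonym-pair classifier (not the paper's data):
p, r, f1 = precision_recall_f1(tp=96, fp=3, fn=4)
print(round(p, 4), round(r, 4), round(f1, 4))
```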
Improving Document Clustering by Eliminating Unnatural Language
Technical documents contain a fair amount of unnatural language, such as
tables, formulas, and pseudo-code. Unnatural language can be an important
source of confusion for existing NLP tools. This paper presents an effective method
of distinguishing unnatural language from natural language, and evaluates the
impact of unnatural language detection on NLP tasks such as document
clustering. We view this problem as an information extraction task and build a
multiclass classification model that classifies unnatural language components into
four categories. First, we create a new annotated corpus by collecting slides
and papers in various formats, PPT, PDF, and HTML, where unnatural language
components are annotated into four categories. We then explore features
available from plain text to build a statistical model that can handle any
format as long as it is converted into plain text. Our experiments show that
removing unnatural language components yields an absolute improvement of up to
15% in document clustering. Our corpus and tool are publicly available.
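As an illustration of the task (not the paper's statistical model), even a crude symbol-ratio heuristic separates some unnatural-language lines from prose:

```python
def looks_unnatural(line: str, threshold: float = 0.3) -> bool:
    """Heuristic stand-in for a learned classifier: flag a line as
    unnatural language when non-alphabetic symbols dominate it."""
    if not line.strip():
        return False
    symbols = sum(1 for ch in line if not (ch.isalpha() or ch.isspace()))
    return symbols / len(line) > threshold

doc = [
    "We evaluate clustering quality on technical documents.",
    "| col1 | col2 | col3 |",       # table row
    "f(x) = x**2 + 3*x - 7",        # formula
    "The method improves accuracy.",
]
natural = [ln for ln in doc if not looks_unnatural(ln)]
print(natural)
```

A real system, like the one described above, would learn such distinctions from an annotated corpus rather than rely on one fixed ratio.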
iParaphrasing: Extracting Visually Grounded Paraphrases via an Image
A paraphrase is a restatement of the meaning of a text in other words.
Paraphrases have been studied to enhance the performance of many natural
language processing tasks. In this paper, we propose a novel task iParaphrasing
to extract visually grounded paraphrases (VGPs), which are different phrasal
expressions describing the same visual concept in an image. These extracted
VGPs have the potential to improve language and image multimodal tasks such as
visual question answering and image captioning. Modeling the similarity
between VGPs is the key to iParaphrasing. We apply various existing methods as
well as propose a novel neural network-based method with image attention, and
report the results of the first attempt toward iParaphrasing.
Comment: COLING 201
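A common baseline for modeling VGP similarity is cosine similarity between phrase embeddings; a minimal sketch with toy vectors (the embeddings are invented for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two phrase-embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Toy embeddings for two phrases describing the same visual concept:
a_man = [0.9, 0.1, 0.3]
the_guy = [0.8, 0.2, 0.4]
print(round(cosine(a_man, the_guy), 3))
```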
Text Classification Algorithms: A Survey
In recent years, there has been an exponential growth in the number of
complex documents and texts that require a deeper understanding of machine
learning methods to be able to accurately classify texts in many applications.
Many machine learning approaches have achieved impressive results in natural
language processing. The success of these learning algorithms relies on their
capacity to understand complex models and non-linear relationships within data.
However, finding suitable structures, architectures, and techniques for text
classification is a challenge for researchers. This paper gives a brief
overview of text classification algorithms, covering different text feature
extraction methods, dimensionality reduction methods, existing algorithms and
techniques, and evaluation methods. Finally, the limitations of each technique
and its applications to real-world problems are discussed.
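A typical first stage in the pipelines such a survey covers is feature extraction; a minimal TF-IDF sketch using only the standard library (not tied to any particular surveyed system):

```python
import math
from collections import Counter

def tfidf(docs):
    """Minimal TF-IDF feature extraction: term frequency scaled by the
    log inverse document frequency of each term."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * math.log(n / df[t])
                        for t in tf})
    return vectors

docs = ["the cat sat", "the dog barked", "the cat purred"]
vecs = tfidf(docs)
print(vecs[0])  # "the" occurs everywhere, so its weight is zero
```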
Phonetic-enriched Text Representation for Chinese Sentiment Analysis with Reinforcement Learning
The Chinese pronunciation system offers two characteristics that distinguish
it from other languages: deep phonemic orthography and intonation variations.
We are the first to argue that these two important properties can play a major
role in Chinese sentiment analysis. In particular, we propose two effective
features to encode phonetic information. We then develop a Disambiguate
Intonation for Sentiment Analysis (DISA) network based on reinforcement
learning, which disambiguates the intonation of each Chinese character
(pinyin) so that a precise phonetic representation of Chinese is learned. Furthermore, we
also fuse phonetic features with textual and visual features in order to mimic
the way humans read and understand Chinese text. Experimental results on five
different Chinese sentiment analysis datasets show that the inclusion of
phonetic features significantly and consistently improves the performance of
textual and visual representations, and outperforms state-of-the-art Chinese
character-level representations.
dpUGC: Learn Differentially Private Representation for User Generated Contents
This paper first proposes a simple yet efficient generalized approach for
applying differential privacy to text representations (i.e., word embeddings).
Building on it, we propose a user-level approach to learning personalized,
differentially private word embedding models on user-generated content (UGC).
To the best of our knowledge, this is the first work on learning user-level
differentially private word embedding models from text for sharing. The
proposed approaches protect individuals from re-identification and, in
particular, provide a better trade-off between privacy and data utility on
shared UGC data. The experimental
results show that the trained embedding models are applicable for the classic
text analysis tasks (e.g., regression). Moreover, the proposed approaches to
learning differentially private embedding models are both framework- and
data-independent, which facilitates deployment and sharing. The source code is
available at https://github.com/sonvx/dpText.
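For intuition, a generic sketch of the Laplace mechanism applied to an embedding vector; this illustrates differential privacy in general, not dpUGC's actual training procedure:

```python
import math
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample from a Laplace(0, scale) distribution via the inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def privatize(vec, sensitivity=1.0, epsilon=1.0, seed=0):
    """Laplace mechanism: adding Laplace(sensitivity/epsilon) noise to each
    coordinate gives epsilon-differential privacy for a release with the
    given L1 sensitivity. A generic sketch, not dpUGC's exact procedure."""
    rng = random.Random(seed)
    scale = sensitivity / epsilon
    return [x + laplace_noise(scale, rng) for x in vec]

embedding = [0.12, -0.53, 0.88]
print(privatize(embedding, epsilon=2.0))
```

Smaller epsilon means stronger privacy but noisier (less useful) vectors, which is the privacy/utility trade-off the abstract refers to.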
Personalized sentence generation using generative adversarial networks with author-specific word usage
Author-specific word usage is a vital feature that lets readers perceive the
writing style of an author. In this work, a personalized sentence generation
method based on generative adversarial networks (GANs) is proposed to address
this issue. Frequently used function words and content words are incorporated
not only as input features but also as sentence-structure constraints for GAN
training. For generating sentences on topics chosen by the user, Named Entity
Recognition (NER) information from the input words is also used in network
training. We compared the proposed method with existing GAN-based sentence
generation methods, and the experimental results showed that sentences
generated by our method are more similar to the original sentences of the same
author under objective evaluations such as BLEU and SimHash score.
Comment: slightly changed version of the paper accepted to the CICLing 2019
conference
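Since SimHash serves here as an objective similarity score, a minimal sketch of a standard 64-bit SimHash (not necessarily the exact variant used in the evaluation):

```python
import hashlib

def simhash(tokens, bits=64):
    """64-bit SimHash fingerprint: similar token sequences yield fingerprints
    with small Hamming distance. md5 is used only as a stable token hash."""
    weights = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, w in enumerate(weights) if w > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

s1 = simhash("the quick brown fox jumps".split())
s2 = simhash("the quick brown fox leaps".split())   # one word changed
s3 = simhash("completely unrelated sentence here now".split())
print(hamming(s1, s2), hamming(s1, s3))
```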
Machine Learning with World Knowledge: The Position and Survey
Machine learning has become pervasive in multiple domains, impacting a wide
variety of applications, such as knowledge discovery and data mining, natural
language processing, information retrieval, computer vision, social and health
informatics, ubiquitous computing, etc. Two essential problems of machine
learning are how to generate features and how to acquire labels for machines to
learn. In particular, labeling large amounts of data for each domain-specific
problem can be very time-consuming and costly. This has become a key obstacle to
making learning protocols realistic in applications. In this paper, we will
discuss how to use the existing general-purpose world knowledge to enhance
machine learning processes, by enriching the features or reducing the labeling
work. We start from the comparison of world knowledge with domain-specific
knowledge, and then introduce three key problems in using world knowledge in
learning processes, i.e., explicit and implicit feature representation,
inference for knowledge linking and disambiguation, and learning with direct or
indirect supervision. Finally, we discuss future directions for this research
topic.
Location reference identification from tweets during emergencies: A deep learning approach
Twitter is increasingly used during crises to communicate with officials and
to support rescue and relief operations in real time. The geographical location
of the event, as well as of the users, is vitally important in such scenarios.
Identifying geographic locations is challenging because the location metadata
fields, such as the user location and the place name of a tweet, are not
reliable. Extracting location information from tweet text is difficult, as the
text contains a lot of non-standard English, grammatical errors, spelling
mistakes, non-standard abbreviations, and so on. This research
aims to extract location words used in the tweet using a Convolutional Neural
Network (CNN) based model. We achieved the exact matching score of 0.929,
Hamming loss of 0.002, and -score of 0.96 for the tweets related to the
earthquake. Our model was able to extract even three- to four-word long
location references which is also evident from the exact matching score of over
92\%. The findings of this paper can help in early event localization,
emergency situations, real-time road traffic management, localized
advertisement, and in various location-based services
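The two reported metrics can be computed as follows; the per-token labels below are toy values, not the paper's data:

```python
def exact_match(y_true, y_pred):
    """Fraction of tweets whose full label sequence is predicted exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hamming_loss(y_true, y_pred):
    """Fraction of individual token labels that are wrong."""
    wrong = total = 0
    for t, p in zip(y_true, y_pred):
        wrong += sum(a != b for a, b in zip(t, p))
        total += len(t)
    return wrong / total

# Toy per-token labels (1 = location word) for three tweets:
gold = [[0, 1, 1, 0], [0, 0, 1], [1, 0]]
pred = [[0, 1, 1, 0], [0, 1, 1], [1, 0]]
print(exact_match(gold, pred), round(hamming_loss(gold, pred), 3))
```

Exact match is the stricter metric: one wrong token label invalidates the whole tweet, while Hamming loss counts only that single token.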
A Case Study on the Impact of Similarity Measure on Information Retrieval based Software Engineering Tasks
Information Retrieval (IR) plays a pivotal role in diverse Software
Engineering (SE) tasks, e.g., bug localization and triaging, code retrieval,
requirements analysis, etc. The choice of similarity measure is the core
component of an IR technique. The performance of any IR method critically
depends on selecting an appropriate similarity measure for the given
application domain. Since different SE tasks operate on different document
types like bug reports, software descriptions, source code, etc. that often
contain non-standard domain-specific vocabulary, it is essential to understand
which similarity measures work best for different SE documents.
This paper presents two case studies on the effect of different similarity
measures on various SE documents w.r.t. two tasks: (i) project recommendation,
finding similar GitHub projects, and (ii) bug localization, retrieving the
buggy source file(s) corresponding to a bug report. These tasks contain a diverse
combination of textual (i.e. description, readme) and code (i.e. source code,
API, import package) artifacts. We observe that the performance of IR models
varies when applied to different artifact types. We find that, in general, the
context-aware models achieve better performance on textual artifacts. In
contrast, simple keyword-based bag-of-words models perform better on code
artifacts. On the other hand, the probabilistic ranking model BM25 performs
better on a mixture of text and code artifacts.
We further investigate how such an informed choice of similarity measure
impacts the performance of SE tools. In particular, we analyze two previously
proposed tools for project recommendation and bug localization tasks, which
leverage diverse software artifacts, and observe that an informed choice of
similarity measure indeed leads to improved performance of the existing SE
tools.
Comment: 22 pages, on submission
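For reference, a minimal sketch of the BM25 ranking the study evaluates, using standard Okapi BM25 with common defaults (k1 = 1.2, b = 0.75) and a toy mixed text/code corpus:

```python
import math
from collections import Counter

def bm25_score(query, doc, corpus, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized document for a tokenized query
    (the probabilistic ranking model discussed above)."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    tf = Counter(doc)
    score = 0.0
    for term in query:
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1)
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = ["parse bug report".split(),
          "import java util list".split(),
          "bug in parser module".split()]
query = "bug parser".split()
ranked = sorted(corpus, key=lambda d: bm25_score(query, d, corpus), reverse=True)
print(ranked[0])
```

The term-frequency saturation (k1) and length normalization (b) are what let BM25 cope with the very different token statistics of text and code artifacts.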