Exploiting Multiple Embeddings for Chinese Named Entity Recognition
Identifying the named entities mentioned in text would enrich many downstream semantic applications. However, due to the predominant use of colloquial language in microblogs, named entity recognition (NER) in Chinese microblogs suffers significant performance deterioration compared with NER on formal Chinese corpora. In this paper, we propose a simple yet effective neural framework, named ME-CNER, to derive character-level embeddings for NER in Chinese text. A character embedding is derived with rich semantic information harnessed at multiple granularities, ranging from the radical and character levels to the word level. The experimental results demonstrate that the proposed approach achieves a large performance improvement on the Weibo dataset and comparable performance on the MSRA news dataset, at lower computational cost than the existing state-of-the-art alternatives.
Comment: accepted at CIKM 201
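As a rough illustration of the multi-granularity idea in the abstract above, the sketch below concatenates radical-, character- and word-level vectors into one character representation. The lookup tables, dimensions and example word are invented for the demonstration; they are not ME-CNER's actual components.

```python
import numpy as np

# Hypothetical sketch: compose a character representation from three
# granularities (radical, character, word). Tables and dimensions are
# illustrative placeholders, not the paper's trained embeddings.

RADICAL_DIM, CHAR_DIM, WORD_DIM = 4, 8, 8
rng = np.random.default_rng(0)

radical_table = {"氵": rng.standard_normal(RADICAL_DIM)}
char_table = {"河": rng.standard_normal(CHAR_DIM)}
word_table = {"河流": rng.standard_normal(WORD_DIM)}

def char_embedding(char, radical, word):
    """Concatenate radical-, character- and word-level vectors for one character."""
    return np.concatenate([
        radical_table[radical],   # sub-character (radical) information
        char_table[char],         # the character itself
        word_table[word],         # the word containing the character
    ])

vec = char_embedding("河", "氵", "河流")
print(vec.shape)  # (20,) = 4 + 8 + 8
```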
A word-building method based on neural network for text classification
Text classification is a foundational task in many natural language processing applications. Traditional text classifiers take words as the basic units and run a pre-training process (such as word2vec) to generate word vectors as a first step. However, none of them consider the information contained in word structure, which has been shown to be helpful for text classification. In this paper, we propose a word-building method based on a neural network model that decomposes a Chinese word into a sequence of radicals and learns structure information from these radical-level features, which is a key difference from existing models. A convolutional neural network is applied to extract structure information of words from the radical sequence to generate a word vector, and a long short-term memory network is applied to generate the sentence vector for prediction. The experimental results show that our model outperforms existing models on a Chinese dataset. Our model is also applicable to English, where a word can be decomposed down to the character level, which demonstrates its excellent generalisation ability; the experimental results show that our model also outperforms others on an English dataset.
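The radical-CNN-plus-LSTM pipeline described above can be sketched very loosely as follows. The dimensions are made up, the "convolution" is a plain matrix product over two-radical windows with max pooling, and a single tanh-RNN cell stands in for the LSTM; none of these are the paper's actual trained components.

```python
import numpy as np

# Toy sketch of the pipeline: a 1-D convolution over radical embeddings
# yields a word vector; a simplified recurrent update aggregates word
# vectors into a sentence vector. All parameters are random placeholders.

rng = np.random.default_rng(1)
RAD_DIM, WORD_DIM = 6, 5

def word_vector(radical_seq, kernel):
    """Convolve a (length, RAD_DIM) radical sequence and max-pool over positions."""
    windows = [radical_seq[i:i + 2].ravel() for i in range(len(radical_seq) - 1)]
    feats = np.array([kernel @ w for w in windows])   # (length-1, WORD_DIM)
    return feats.max(axis=0)                          # max pooling -> (WORD_DIM,)

kernel = rng.standard_normal((WORD_DIM, 2 * RAD_DIM))
radicals = rng.standard_normal((3, RAD_DIM))          # a word with 3 radicals
wv = word_vector(radicals, kernel)

# Simplified recurrence (a tanh RNN cell standing in for the LSTM):
W_h = rng.standard_normal((WORD_DIM, WORD_DIM)) * 0.1
W_x = rng.standard_normal((WORD_DIM, WORD_DIM)) * 0.1
h = np.zeros(WORD_DIM)
for x in [wv, wv]:                                    # a two-word "sentence"
    h = np.tanh(W_h @ h + W_x @ x)                    # sentence state update
print(h.shape)  # (5,)
```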
End-to-end Neural Information Retrieval
In recent years we have witnessed many successes of neural networks in the information
retrieval community with lots of labeled data. Yet it remains unknown whether the same
techniques can be easily adapted to search social media posts where the text is much
shorter. In addition, we find that most neural information retrieval models are compared
against weak baselines. In this thesis, we build an end-to-end neural information retrieval
system using two toolkits: Anserini and MatchZoo. In addition, we also propose a novel
neural model to capture the relevance of short and varied tweet text, named MP-HCNN.
With the information retrieval toolkit Anserini, we build a reranking architecture based
on various traditional information retrieval models (QL, QL+RM3, BM25, BM25+RM3),
including a strong pseudo-relevance feedback baseline: RM3. With the neural network
toolkit MatchZoo, we offer an empirical study of a number of popular neural network
ranking models (DSSM, CDSSM, KNRM, DUET, DRMM). Experiments on datasets from
the TREC Microblog Tracks and the TREC Robust Retrieval Track show that most
existing neural network models cannot beat a simple language model baseline. However, DRMM provides a significant improvement over the pseudo-relevance feedback baseline (BM25+RM3) on the Robust04 dataset, and DUET, DRMM and MP-HCNN can provide significant improvements over the baseline (QL+RM3) on the microblog datasets. Further detailed analyses suggest that searching social media and searching news articles exhibit several different characteristics that require customized model design, shedding light on future directions.
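The two-stage reranking architecture described above can be illustrated with a toy example. BM25 below is a textbook implementation, and a simple term-overlap function stands in for the second-stage scorer; it is not MP-HCNN or any MatchZoo model, and the documents are invented.

```python
import math
from collections import Counter

# Minimal sketch of a rerank pipeline: a first-stage term-matching
# retriever (BM25) selects candidates, then a second-stage scorer
# reorders them. The second stage here is a placeholder overlap count.

docs = ["neural ranking for tweets", "bm25 is a strong baseline",
        "neural networks rank tweets well"]

def bm25_scores(query, docs, k1=1.2, b=0.75):
    tokenized = [d.split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in tokenized) / N
    df = Counter(t for d in tokenized for t in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        s = 0.0
        for q in query.split():
            if q not in tf:
                continue
            idf = math.log(1 + (N - df[q] + 0.5) / (df[q] + 0.5))
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def rerank(query, docs, top_k=2):
    scores = bm25_scores(query, docs)
    # Stage 1: keep the top_k BM25 candidates.
    first = sorted(range(len(docs)), key=lambda i: -scores[i])[:top_k]
    # Stage 2: rescore candidates (placeholder for a neural model).
    overlap = lambda i: len(set(query.split()) & set(docs[i].split()))
    return sorted(first, key=overlap, reverse=True)

order = rerank("neural tweets", docs)
print(order)
```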
Detecting Traffic Information From Social Media Texts With Deep Learning Approaches
Mining traffic-relevant information from social media data has become an emerging topic due to the real-time and ubiquitous nature of social media. In this paper, we focus on a specific problem in social media mining: extracting traffic-relevant microblogs from Sina Weibo, a Chinese microblogging platform. We transform it into a machine learning problem of short-text classification. First, we apply the continuous bag-of-words model to learn word embedding representations from a data set of three billion microblogs. Compared with the traditional one-hot vector representation of words, word embeddings can capture semantic similarity between words and have proved effective in natural language processing tasks. Next, we propose using convolutional neural networks (CNNs), long short-term memory (LSTM) models and their combination, LSTM-CNN, to extract traffic-relevant microblogs with the learned word embeddings as inputs. We compare the proposed methods with competitive approaches, including a support vector machine (SVM) based on bag-of-n-gram features, an SVM based on word vector features, and a multi-layer perceptron based on word vector features. Experiments show the effectiveness of the proposed deep learning approaches.
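The advantage of word embeddings over one-hot vectors claimed above can be shown in a few lines. The embedding values here are hand-picked to make the point; the paper instead learns them with the continuous bag-of-words model on billions of microblogs.

```python
import numpy as np

# Illustrative contrast between one-hot and dense embedding
# representations. Values are contrived so the two traffic-related
# words end up close; a trained CBOW model would learn this.

vocab = ["congestion", "jam", "banana"]

one_hot = np.eye(len(vocab))              # one-hot: all pairs orthogonal
emb = np.array([[0.9, 0.1],               # "congestion"
                [0.8, 0.2],               # "jam" (semantically close)
                [0.1, 0.9]])              # "banana" (unrelated)

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cos(one_hot[0], one_hot[1]))   # 0.0 -> one-hot carries no similarity
print(cos(emb[0], emb[1]))           # near 1 -> embeddings capture it
```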
News Text Classification Based on an Improved Convolutional Neural Network
With the explosive growth of Internet news media and the disorganized state of news texts, this paper puts forward an automatic classification model for news based on a Convolutional Neural Network (CNN). In the model, Word2vec is first merged with Latent Dirichlet Allocation (LDA) to generate an effective text feature representation. An attention mechanism is then combined with the model, assigning higher attention probabilities to key features to achieve accurate judgments. The results show that the precision, recall and F1 value of the model reach 96.4%, 95.9% and 96.2% respectively, which indicates that the improved CNN, through a unique framework, can extract deep semantic features of the text and provides strong support for establishing an efficient and accurate news text classification model.
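The attention mechanism mentioned above can be sketched as softmax-weighted pooling of feature vectors, so that informative features dominate the text representation. The dimensions and the attention query vector below are illustrative placeholders, not the paper's trained parameters.

```python
import numpy as np

# Sketch of attention pooling: score each feature vector against a
# query vector, softmax the scores into weights, and take the weighted
# sum as the text representation. All values are random placeholders.

rng = np.random.default_rng(2)
feats = rng.standard_normal((4, 6))        # 4 convolutional feature vectors
u = rng.standard_normal(6)                 # attention query (learned in practice)

scores = feats @ u                                 # relevance score per feature
weights = np.exp(scores) / np.exp(scores).sum()    # softmax attention weights
pooled = weights @ feats                           # weighted sum -> representation

print(weights.round(3), pooled.shape)
```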
Char-RNN and Active Learning for Hashtag Segmentation
We explore the abilities of a character-level recurrent neural network (char-RNN) for hashtag segmentation. Our approach to the task is the following: we generate a synthetic training dataset from frequent n-grams that satisfy predefined morpho-syntactic patterns, avoiding any manual annotation. An active learning strategy limits the training dataset by selecting an informative training subset. The approach does not require any language-specific settings and is evaluated on two languages that differ in their degree of inflection.
Comment: to appear in Cicling201
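To make the task above concrete, the sketch below segments a hashtag into words. A small hand-written word list plus dynamic programming stands in for the trained char-RNN, which would instead predict split points character by character; the word list is invented for the demonstration.

```python
# Hypothetical sketch of hashtag segmentation (the task, not the
# char-RNN model): find a split of the tag into known words.

WORDS = {"no", "place", "like", "home", "nop"}

def segment(tag):
    """Return one segmentation of `tag` into known words, or None."""
    n = len(tag)
    back = [None] * (n + 1)   # back[i] = start index of a word ending at i
    back[0] = 0
    for i in range(1, n + 1):
        for j in range(i):
            if back[j] is not None and tag[j:i] in WORDS:
                back[i] = j
                break
    if back[n] is None:
        return None
    out, i = [], n
    while i > 0:              # walk the back-pointers to recover the words
        out.append(tag[back[i]:i])
        i = back[i]
    return list(reversed(out))

print(segment("noplacelikehome"))
```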