Web News Documents Clustering in Indonesian Language Using Singular Value Decomposition-Principal Component Analysis (SVDPCA) and Ant Algorithms
Ant-based document clustering is a clustering method that measures text document similarity based on the shortest path between nodes (trial phase) and determines the optimal clusters from the sequence of document similarities (dividing phase). The trial phase of the Ant algorithm takes a very long time to build document vectors because of the high-dimensional Document-Term Matrix (DTM). In this paper, we propose a document clustering method that optimizes dimension reduction using Singular Value Decomposition-Principal Component Analysis (SVDPCA) and Ant algorithms. SVDPCA reduces the dimensionality of the DTM by converting the term frequencies of the conventional DTM into principal-component scores of a Document-PC Matrix (DPCM). The Ant algorithm then clusters documents using the vector space model built on the reduced DPCM. Experimental results on 506 news documents in Indonesian demonstrated that the proposed method reduced dimensionality by up to 99.7%, sped up the execution time of the trial phase, and maintained a best F-measure of 0.88 (88%).
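A minimal sketch of the dimension-reduction step described above, assuming a scikit-learn-style pipeline (the Ant-based trial and dividing phases are not reproduced here): a frequency-based DTM is projected onto a few principal components via truncated SVD, yielding the much smaller DPCM on which document similarities can be computed. The documents and number of components are illustrative.

```python
# Hedged sketch: project a high-dimensional freq-term DTM onto principal
# components (truncated SVD), producing a low-dimensional Document-PC Matrix.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "berita politik nasional hari ini",      # placeholder Indonesian news texts
    "hasil pertandingan sepak bola liga",
    "pasar saham dan ekonomi menguat",
]

dtm = CountVectorizer().fit_transform(docs)   # high-dimensional freq-term DTM
svd = TruncatedSVD(n_components=2)            # keep a handful of PCs (assumed)
dpcm = svd.fit_transform(dtm)                 # score-pc Document-PC Matrix

# Pairwise similarities on the DPCM would then feed the Ant algorithm's
# shortest-path (trial) and cluster-division (dividing) phases.
sims = cosine_similarity(dpcm)
print(dpcm.shape, sims.shape)
```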
Bridging Semantic Gaps between Natural Languages and APIs with Word Embedding
Developers increasingly rely on text matching tools to analyze the relation
between natural language words and APIs. However, semantic gaps, namely textual
mismatches between words and APIs, negatively affect these tools. Previous
studies have transformed words or APIs into low-dimensional vectors for
matching; however, inaccurate results were obtained due to the failure of
modeling words and APIs simultaneously. To resolve this problem, two main
challenges are to be addressed: the acquisition of massive words and APIs for
mining and the alignment of words and APIs for modeling. Therefore, this study
proposes Word2API to effectively estimate relatedness of words and APIs.
Word2API collects millions of commonly used words and APIs from code
repositories to address the acquisition challenge. Then, a shuffling strategy
is used to transform related words and APIs into tuples to address the
alignment challenge. Using these tuples, Word2API models words and APIs
simultaneously. Word2API outperforms baselines on relatedness estimation by 10%-49.6% in terms of precision and NDCG. Word2API is also effective in solving typical software tasks, e.g., query expansion and API document linking. A simple system with Word2API-expanded queries recommends up to 21.4%
more related APIs for developers. Meanwhile, Word2API improves comparison
algorithms by 7.9%-17.4% in linking questions in Question&Answer communities to
API documents.
Comment: Accepted by IEEE Transactions on Software Engineering.
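A hedged sketch of the shuffling idea: words from a method description and the APIs it invokes are shuffled into a single tuple, and a standard word-embedding model (gensim's Word2Vec is used here as an assumed stand-in for the paper's exact setup) then sees words and APIs in shared contexts; relatedness is estimated as cosine similarity between embeddings. The word/API tokens below are illustrative.

```python
# Hedged sketch of the word-API shuffling strategy feeding a word-embedding model.
import random
from gensim.models import Word2Vec

pairs = [
    (["read", "text", "file"],
     ["java.io.FileReader.read", "java.io.BufferedReader.readLine"]),
    (["parse", "json", "string"],
     ["org.json.JSONObject.getString"]),
]

sentences = []
for words, apis in pairs:
    tup = words + apis
    random.shuffle(tup)           # shuffling strategy: mix words and APIs in one tuple
    sentences.append(tup)

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)
# Relatedness of a natural-language word and an API = embedding cosine similarity.
print(model.wv.similarity("read", "java.io.FileReader.read"))
```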
User Profile Based Research Paper Recommendation
We design a recommender system for research papers based on topic-modeling.
The user's feedback on the results is used to make the results more relevant the
next time they fire a query. The user's needs are understood by observing the
change in the themes that the user shows a preference for over time.
Comment: Work in progress. arXiv admin note: text overlap with arXiv:1611.0482
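A minimal sketch of one plausible reading of this approach, assuming LDA topic distributions over papers and a user profile built from positively rated papers; the corpus, feedback, and profile-update rule are invented for illustration.

```python
# Hedged sketch: topic-model paper representations plus a feedback-driven profile.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

papers = ["neural ranking for search", "topic models for text analysis",
          "ant colony optimisation for clustering", "word embeddings for retrieval"]
X = CountVectorizer().fit_transform(papers)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
theta = lda.fit_transform(X)                 # per-paper topic distributions

liked = [1, 3]                               # papers the user gave positive feedback on
profile = theta[liked].mean(axis=0)          # user profile = mean topic mixture

scores = theta @ profile                     # rank all papers against the profile
print(np.argsort(-scores))                   # most relevant papers first
```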
Integrating Lexical and Temporal Signals in Neural Ranking Models for Searching Social Media Streams
Time is an important relevance signal when searching streams of social media
posts. The distribution of document timestamps from the results of an initial
query can be leveraged to infer the distribution of relevant documents, which
can then be used to rerank the initial results. Previous experiments have shown
that kernel density estimation is a simple yet effective implementation of this
idea. This paper explores an alternative approach to mining temporal signals
with recurrent neural networks. Our intuition is that neural networks provide a
more expressive framework to capture the temporal coherence of neighboring
documents in time. To our knowledge, we are the first to integrate lexical and
temporal signals in an end-to-end neural network architecture, in which
existing neural ranking models are used to generate query-document similarity
vectors that feed into a bidirectional LSTM layer for temporal modeling. Our
results are mixed: existing neural models for document ranking alone yield
limited improvements over simple baselines, but the integration of lexical and
temporal signals yields significant improvements over competitive temporal baselines.
Comment: SIGIR 2017 Workshop on Neural Information Retrieval (Neu-IR'17), August 7-11, 2017, Shinjuku, Tokyo, Japan.
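A hedged sketch of the integration described above, with invented dimensions: an existing neural ranking model is assumed to emit a query-document similarity vector per candidate, the candidates are ordered by timestamp, and a bidirectional LSTM rescores each one using its temporal neighbourhood.

```python
# Hedged sketch: BiLSTM over time-ordered query-document similarity vectors.
import torch
import torch.nn as nn

sim_dim, hidden = 8, 16          # size of similarity vectors and LSTM state (assumed)

class TemporalReranker(nn.Module):
    def __init__(self):
        super().__init__()
        self.bilstm = nn.LSTM(sim_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)      # one relevance score per document

    def forward(self, sim_vectors):                # (batch, num_docs_in_time_order, sim_dim)
        h, _ = self.bilstm(sim_vectors)
        return self.score(h).squeeze(-1)           # (batch, num_docs)

model = TemporalReranker()
fake_batch = torch.randn(2, 20, sim_dim)           # 2 queries, 20 time-ordered candidates
print(model(fake_batch).shape)                     # torch.Size([2, 20])
```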
Understanding the Logical and Semantic Structure of Large Documents
Current language understanding approaches focus on small documents, such as
newswire articles, blog posts, product reviews and discussion forum entries.
Understanding and extracting information from large documents like legal
briefs, proposals, technical manuals and research articles is still a
challenging task. We describe a framework that can analyze a large document and
help people locate particular information within it. We aim
to automatically identify and classify semantic sections of documents and
assign consistent and human-understandable labels to similar sections across
documents. A key contribution of our research is modeling the logical and
semantic structure of an electronic document. We apply machine learning
techniques, including deep learning, in our prototype system. We also make
available a dataset of information about a collection of scholarly articles
from the arXiv eprints collection that includes a wide range of metadata for
each article, including a table of contents, section labels, section
summarizations and more. We hope that this dataset will be a useful resource
for the machine learning and NLP communities in information retrieval,
content-based question answering and language modeling.
Comment: 10 pages, 15 figures and 6 tables.
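A minimal sketch of the section-labelling sub-task, using a TF-IDF plus logistic-regression classifier as a hedged stand-in for the paper's deep-learning prototype; the section snippets and labels are invented.

```python
# Hedged sketch: assign human-understandable labels to document sections.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sections = ["we collected ten thousand articles and annotated each one",
            "prior work on document structure analysis includes rule-based systems",
            "our model achieves higher accuracy than the baseline on all splits"]
labels = ["dataset", "related work", "results"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(sections, labels)
print(clf.predict(["previous studies have examined section labelling"]))
```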
Unsupervised Identification of Study Descriptors in Toxicology Research: An Experimental Study
Identifying and extracting data elements such as study descriptors in
publication full texts is a critical yet manual and labor-intensive step
required in a number of tasks. In this paper we address the question of
identifying data elements in an unsupervised manner. Specifically, provided a
set of criteria describing specific study parameters, such as species, route of
administration, and dosing regimen, we develop an unsupervised approach to
identify text segments (sentences) relevant to the criteria. A binary
classifier trained to identify publications that met the criteria performs
better when trained on the candidate sentences than when trained on sentences
randomly picked from the text, supporting the intuition that our method is able
to accurately identify study descriptors.
Comment: Ninth International Workshop on Health Text Mining and Information Analysis at EMNLP 2018.
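A hedged sketch of the unsupervised selection step: sentences from a publication are scored against a textual criterion by vector similarity, and the top-scoring sentences become candidate study-descriptor segments. TF-IDF cosine similarity is used here purely for illustration; the criterion and sentences are invented.

```python
# Hedged sketch: rank sentences by similarity to a study-parameter criterion.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

criterion = "route of administration oral gavage dosing regimen"
sentences = ["Animals received the compound by oral gavage daily for 14 days.",
             "Body weights were recorded weekly.",
             "The test species was the Sprague-Dawley rat."]

vec = TfidfVectorizer().fit([criterion] + sentences)
scores = cosine_similarity(vec.transform([criterion]), vec.transform(sentences))[0]
print(sentences[int(np.argmax(scores))])   # most relevant candidate sentence
```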
Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News
Fake news is nowadays an issue of pressing concern, given its recent rise
as a potential threat to high-quality journalism and well-informed public
discourse. The Fake News Challenge (FNC-1) was organized in 2017 to encourage
the development of machine learning-based classification systems for stance
detection (i.e., for identifying whether a particular news article agrees,
disagrees, discusses, or is unrelated to a particular news headline), thus
helping in the detection and analysis of possible instances of fake news. This
article presents a new approach to tackle this stance detection problem, based
on the combination of string similarity features with a deep neural
architecture that leverages ideas previously advanced in the context of
learning efficient text representations, document classification, and natural
language inference. Specifically, we use bi-directional Recurrent Neural
Networks, together with max-pooling over the temporal/sequential dimension and
neural attention, for representing (i) the headline, (ii) the first two
sentences of the news article, and (iii) the entire news article. These
representations are then combined/compared, complemented with similarity
features inspired by other FNC-1 approaches, and passed to a final layer that
predicts the stance of the article towards the headline. We also explore the
use of external sources of information, specifically large datasets of sentence
pairs originally proposed for training and evaluating natural language
inference methods, in order to pre-train specific components of the neural
network architecture (e.g., the RNNs used for encoding sentences). The obtained
results attest to the effectiveness of the proposed ideas and show that our
model, particularly when considering pre-training and the combination of neural
representations together with similarity features, slightly outperforms the
previous state of the art.
Comment: Accepted for publication in the special issue of the ACM Journal of Data and Information Quality (ACM JDIQ) on Combating Digital Misinformation and Disinformation.
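A hedged sketch of the overall architecture, with invented dimensions and the attention mechanism omitted for brevity: a shared bidirectional GRU encoder with max-pooling over time represents the headline and the article, the two representations are combined/compared, concatenated with hand-crafted similarity features, and a final layer predicts one of the four FNC-1 stances.

```python
# Hedged sketch: BiRNN encoders + similarity features for FNC-1 stance detection.
import torch
import torch.nn as nn

vocab, emb_dim, hid, n_sim_feats = 5000, 50, 64, 4   # all sizes assumed

class StanceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        self.out = nn.Linear(4 * 2 * hid + n_sim_feats, 4)   # agree/disagree/discuss/unrelated

    def encode(self, tokens):                        # (batch, seq_len) of token ids
        h, _ = self.rnn(self.emb(tokens))
        return h.max(dim=1).values                   # max-pooling over the sequence

    def forward(self, headline, body, sim_feats):
        h, b = self.encode(headline), self.encode(body)
        combined = torch.cat([h, b, torch.abs(h - b), h * b, sim_feats], dim=-1)
        return self.out(combined)

model = StanceModel()
logits = model(torch.randint(0, vocab, (2, 12)),     # 2 headlines
               torch.randint(0, vocab, (2, 200)),    # 2 article bodies
               torch.randn(2, n_sim_feats))          # hand-crafted similarity features
print(logits.shape)                                  # torch.Size([2, 4])
```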
Using Neural Generative Models to Release Synthetic Twitter Corpora with Reduced Stylometric Identifiability of Users
We present a method for generating synthetic versions of Twitter data using
neural generative models. The goal is protecting individuals in the source data
from stylometric re-identification attacks while still releasing data that
carries research value. Specifically, we generate tweet corpora that maintain
user-level word distributions by augmenting the neural language models with
user-specific components. We compare our approach to two standard text data
protection methods: redaction and iterative translation. We evaluate the three
methods on measures of risk and utility. We define risk following the
stylometric models of re-identification, and we define utility based on two
general word distribution measures and two common text analysis research tasks.
We find that neural models are able to significantly lower risk over previous
methods with little cost to utility. We also demonstrate that the neural models
allow data providers to actively control the risk-utility trade-off through
model tuning parameters. This work presents promising results for a new tool that addresses the problem of privacy in free text and enables sharing of social media data in a way that respects privacy and is ethically responsible.
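A minimal sketch of "augmenting the neural language models with user-specific components", assuming one simple realisation: a per-user embedding is concatenated to every word embedding fed to an LSTM language model so that generation can be conditioned on the user. All sizes are illustrative; the paper's actual architecture may differ.

```python
# Hedged sketch: user-conditioned LSTM language model for synthetic tweet generation.
import torch
import torch.nn as nn

vocab, n_users, w_dim, u_dim, hid = 8000, 100, 64, 16, 128   # illustrative sizes

class UserLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, w_dim)
        self.user_emb = nn.Embedding(n_users, u_dim)
        self.lstm = nn.LSTM(w_dim + u_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, tokens, user_ids):             # tokens: (batch, seq), user_ids: (batch,)
        u = self.user_emb(user_ids).unsqueeze(1).expand(-1, tokens.size(1), -1)
        x = torch.cat([self.word_emb(tokens), u], dim=-1)   # user vector at every step
        h, _ = self.lstm(x)
        return self.out(h)                           # next-token logits per position

model = UserLM()
print(model(torch.randint(0, vocab, (4, 20)), torch.randint(0, n_users, (4,))).shape)
```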
Neural Network Architecture for Credibility Assessment of Textual Claims
Text articles with false claims, especially news, have recently become a
growing problem for Internet users. These articles are in wide circulation and
readers face difficulty discerning fact from fiction. Previous work on
credibility assessment has focused on factual analysis and linguistic features.
The task's main challenge is the distinction between the features of true and
false articles. In this paper, we propose a novel approach called Credibility
Outcome (CREDO) which aims at scoring the credibility of an article in an open
domain setting.
CREDO consists of different modules for capturing various features
responsible for the credibility of an article. These features include
credibility of the article's source and author, semantic similarity between the
article and related credible articles retrieved from a knowledge base, and
sentiments conveyed by the article. A neural network architecture learns the
contribution of each of these modules to the overall credibility of an article.
Experiments on the Snopes dataset reveal that CREDO outperforms
state-of-the-art approaches based on linguistic features.
Comment: Best Paper Award at the 19th International Conference on Computational Linguistics and Intelligent Text Processing, March 2018, Hanoi, Vietnam.
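A hedged sketch of how the module outputs might be combined, assuming each CREDO module emits a scalar feature (source credibility, author credibility, semantic similarity to retrieved credible articles, sentiment); a small feed-forward network then learns each module's contribution to the overall credibility score. The feature values are invented.

```python
# Hedged sketch: learn the contribution of per-module features to credibility.
import torch
import torch.nn as nn

modules = ["source", "author", "semantic_similarity", "sentiment"]  # assumed scalar features

combiner = nn.Sequential(
    nn.Linear(len(modules), 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),               # overall credibility score in [0, 1]
)

features = torch.tensor([[0.9, 0.7, 0.8, 0.1]])   # one article's module outputs (invented)
print(combiner(features))
```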
Deep Neural Networks for Query Expansion using Word Embeddings
Query expansion is a method for alleviating the vocabulary mismatch problem
present in information retrieval tasks. Previous works have shown that terms
selected for query expansion by traditional methods such as pseudo-relevance
feedback are not always helpful to the retrieval process. In this paper, we
show that this is also true for more recently proposed embedding-based query
expansion methods. We then introduce an artificial neural network classifier to
predict the usefulness of query expansion terms. This classifier uses term word
embeddings as inputs. Experiments on four TREC newswire and web
collections show that using terms selected by the classifier for expansion
significantly improves retrieval performance when compared to competitive
baselines. The results are also shown to be more robust than the baselines.
Comment: 8 pages, 1 figure.
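A minimal sketch of the expansion-term classifier, under the assumption that each candidate term is represented by its word embedding concatenated with an aggregate query embedding and scored for usefulness by a small feed-forward network; dimensions and inputs are illustrative.

```python
# Hedged sketch: predict whether a candidate expansion term helps retrieval.
import torch
import torch.nn as nn

emb_dim = 100                                     # word-embedding size (assumed)

classifier = nn.Sequential(
    nn.Linear(2 * emb_dim, 64),                   # [term embedding ; query embedding]
    nn.ReLU(),
    nn.Linear(64, 2),                             # useful vs. not useful
)

term_emb = torch.randn(16, emb_dim)               # 16 candidate expansion terms (random stand-ins)
query_emb = torch.randn(1, emb_dim).expand(16, -1)
logits = classifier(torch.cat([term_emb, query_emb], dim=-1))
print(logits.argmax(dim=-1))                      # predicted usefulness labels
```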
- …