Web News Documents Clustering in Indonesian Language Using Singular Value Decomposition-Principal Component Analysis (SVDPCA) and Ant Algorithms
Ant-based document clustering is a clustering method that measures text document similarity based on the shortest path between nodes (trial phase) and determines the optimal clusters from the sequence of document similarities (dividing phase). The trial phase of the Ant algorithm takes a very long time to build document vectors because of the high-dimensional Document-Term Matrix (DTM). In this paper, we propose a document clustering method that optimizes dimension reduction using Singular Value Decomposition-Principal Component Analysis (SVDPCA) and Ant algorithms. SVDPCA reduces the dimensionality of the DTM by converting the term frequencies of the conventional DTM into principal-component scores of a Document-PC Matrix (DPCM). The Ant algorithm then clusters documents using the vector space model built on the reduced DPCM. Experimental results on 506 news documents in Indonesian demonstrated that the proposed method reduced dimensionality by up to 99.7%, sped up the execution time of the trial phase, and maintained a best F-measure of 0.88 (88%).
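A minimal sketch of the dimension-reduction step described above, assuming a scikit-learn-style pipeline (the Ant-based trial and dividing phases are not reproduced here): a frequency-based DTM is projected onto a few principal components via truncated SVD, yielding the much smaller DPCM on which document similarities can be computed. The documents and number of components are illustrative.

```python
# Hedged sketch: project a high-dimensional freq-term DTM onto principal
# components (truncated SVD), producing a low-dimensional Document-PC Matrix.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "berita politik nasional hari ini",      # placeholder Indonesian news texts
    "hasil pertandingan sepak bola liga",
    "pasar saham dan ekonomi menguat",
]

dtm = CountVectorizer().fit_transform(docs)   # high-dimensional freq-term DTM
svd = TruncatedSVD(n_components=2)            # keep a handful of PCs (assumed)
dpcm = svd.fit_transform(dtm)                 # score-pc Document-PC Matrix

# Pairwise similarities on the DPCM would then feed the Ant algorithm's
# shortest-path (trial) and cluster-division (dividing) phases.
sims = cosine_similarity(dpcm)
print(dpcm.shape, sims.shape)
```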
Bridging Semantic Gaps between Natural Languages and APIs with Word Embedding
Developers increasingly rely on text matching tools to analyze the relation
between natural language words and APIs. However, semantic gaps, namely textual
mismatches between words and APIs, negatively affect these tools. Previous
studies have transformed words or APIs into low-dimensional vectors for
matching; however, inaccurate results were obtained due to the failure of
modeling words and APIs simultaneously. To resolve this problem, two main
challenges are to be addressed: the acquisition of massive words and APIs for
mining and the alignment of words and APIs for modeling. Therefore, this study
proposes Word2API to effectively estimate relatedness of words and APIs.
Word2API collects millions of commonly used words and APIs from code
repositories to address the acquisition challenge. Then, a shuffling strategy
is used to transform related words and APIs into tuples to address the
alignment challenge. Using these tuples, Word2API models words and APIs
simultaneously. Word2API outperforms baselines on relatedness estimation by 10%-49.6% in terms of precision and NDCG. Word2API is also effective in solving typical software tasks, e.g., query expansion and API document linking. A simple system with Word2API-expanded queries recommends up to 21.4%
more related APIs for developers. Meanwhile, Word2API improves comparison
algorithms by 7.9%-17.4% in linking questions in Question&Answer communities to
API documents.
Comment: Accepted by IEEE Transactions on Software Engineering.
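A hedged sketch of the shuffling idea: words from a method description and the APIs it invokes are shuffled into a single tuple, and a standard word-embedding model (gensim's Word2Vec is used here as an assumed stand-in for the paper's exact setup) then sees words and APIs in shared contexts; relatedness is estimated as cosine similarity between embeddings. The word/API tokens below are illustrative.

```python
# Hedged sketch of the word-API shuffling strategy feeding a word-embedding model.
import random
from gensim.models import Word2Vec

pairs = [
    (["read", "text", "file"],
     ["java.io.FileReader.read", "java.io.BufferedReader.readLine"]),
    (["parse", "json", "string"],
     ["org.json.JSONObject.getString"]),
]

sentences = []
for words, apis in pairs:
    tup = words + apis
    random.shuffle(tup)           # shuffling strategy: mix words and APIs in one tuple
    sentences.append(tup)

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, epochs=50)
# Relatedness of a natural-language word and an API = embedding cosine similarity.
print(model.wv.similarity("read", "java.io.FileReader.read"))
```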
User Profile Based Research Paper Recommendation
We design a recommender system for research papers based on topic-modeling.
The user's feedback on the results is used to make the results more relevant the
next time they fire a query. The user's needs are understood by observing the
change in the themes that the user shows a preference for over time.
Comment: Work in progress. arXiv admin note: text overlap with arXiv:1611.0482
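A minimal sketch of one plausible reading of this approach, assuming LDA topic distributions over papers and a user profile built from positively rated papers; the corpus, feedback, and profile-update rule are invented for illustration.

```python
# Hedged sketch: topic-model paper representations plus a feedback-driven profile.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

papers = ["neural ranking for search", "topic models for text analysis",
          "ant colony optimisation for clustering", "word embeddings for retrieval"]
X = CountVectorizer().fit_transform(papers)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
theta = lda.fit_transform(X)                 # per-paper topic distributions

liked = [1, 3]                               # papers the user gave positive feedback on
profile = theta[liked].mean(axis=0)          # user profile = mean topic mixture

scores = theta @ profile                     # rank all papers against the profile
print(np.argsort(-scores))                   # most relevant papers first
```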
Integrating Lexical and Temporal Signals in Neural Ranking Models for Searching Social Media Streams
Time is an important relevance signal when searching streams of social media
posts. The distribution of document timestamps from the results of an initial
query can be leveraged to infer the distribution of relevant documents, which
can then be used to rerank the initial results. Previous experiments have shown
that kernel density estimation is a simple yet effective implementation of this
idea. This paper explores an alternative approach to mining temporal signals
with recurrent neural networks. Our intuition is that neural networks provide a
more expressive framework to capture the temporal coherence of neighboring
documents in time. To our knowledge, we are the first to integrate lexical and
temporal signals in an end-to-end neural network architecture, in which
existing neural ranking models are used to generate query-document similarity
vectors that feed into a bidirectional LSTM layer for temporal modeling. Our
results are mixed: existing neural models for document ranking alone yield
limited improvements over simple baselines, but the integration of lexical and
temporal signals yields significant improvements over competitive temporal baselines.
Comment: SIGIR 2017 Workshop on Neural Information Retrieval (Neu-IR'17), August 7-11, 2017, Shinjuku, Tokyo, Japan.
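A hedged sketch of the integration described above, with invented dimensions: an existing neural ranking model is assumed to emit a query-document similarity vector per candidate, the candidates are ordered by timestamp, and a bidirectional LSTM rescores each one using its temporal neighbourhood.

```python
# Hedged sketch: BiLSTM over time-ordered query-document similarity vectors.
import torch
import torch.nn as nn

sim_dim, hidden = 8, 16          # size of similarity vectors and LSTM state (assumed)

class TemporalReranker(nn.Module):
    def __init__(self):
        super().__init__()
        self.bilstm = nn.LSTM(sim_dim, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)      # one relevance score per document

    def forward(self, sim_vectors):                # (batch, num_docs_in_time_order, sim_dim)
        h, _ = self.bilstm(sim_vectors)
        return self.score(h).squeeze(-1)           # (batch, num_docs)

model = TemporalReranker()
fake_batch = torch.randn(2, 20, sim_dim)           # 2 queries, 20 time-ordered candidates
print(model(fake_batch).shape)                     # torch.Size([2, 20])
```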
Understanding the Logical and Semantic Structure of Large Documents
Current language understanding approaches focus on small documents, such as
newswire articles, blog posts, product reviews and discussion forum entries.
Understanding and extracting information from large documents like legal
briefs, proposals, technical manuals and research articles is still a
challenging task. We describe a framework that can analyze a large document and
help people locate particular information within it. We aim
to automatically identify and classify semantic sections of documents and
assign consistent and human-understandable labels to similar sections across
documents. A key contribution of our research is modeling the logical and
semantic structure of an electronic document. We apply machine learning
techniques, including deep learning, in our prototype system. We also make
available a dataset of information about a collection of scholarly articles
from the arXiv eprints collection that includes a wide range of metadata for
each article, including a table of contents, section labels, section
summarizations and more. We hope that this dataset will be a useful resource
for the machine learning and NLP communities in information retrieval,
content-based question answering and language modeling.
Comment: 10 pages, 15 figures and 6 tables.
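A minimal sketch of the section-labelling sub-task, using a TF-IDF plus logistic-regression classifier as a hedged stand-in for the paper's deep-learning prototype; the section snippets and labels are invented.

```python
# Hedged sketch: assign human-understandable labels to document sections.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sections = ["we collected ten thousand articles and annotated each one",
            "prior work on document structure analysis includes rule-based systems",
            "our model achieves higher accuracy than the baseline on all splits"]
labels = ["dataset", "related work", "results"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(sections, labels)
print(clf.predict(["previous studies have examined section labelling"]))
```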
Unsupervised Identification of Study Descriptors in Toxicology Research: An Experimental Study
Identifying and extracting data elements such as study descriptors in
publication full texts is a critical yet manual and labor-intensive step
required in a number of tasks. In this paper we address the question of
identifying data elements in an unsupervised manner. Specifically, provided a
set of criteria describing specific study parameters, such as species, route of
administration, and dosing regimen, we develop an unsupervised approach to
identify text segments (sentences) relevant to the criteria. A binary
classifier trained to identify publications that met the criteria performs
better when trained on the candidate sentences than when trained on sentences
randomly picked from the text, supporting the intuition that our method is able
to accurately identify study descriptors.
Comment: Ninth International Workshop on Health Text Mining and Information Analysis at EMNLP 2018.
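A hedged sketch of the unsupervised selection step: sentences from a publication are scored against a textual criterion by vector similarity, and the top-scoring sentences become candidate study-descriptor segments. TF-IDF cosine similarity is used here purely for illustration; the criterion and sentences are invented.

```python
# Hedged sketch: rank sentences by similarity to a study-parameter criterion.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

criterion = "route of administration oral gavage dosing regimen"
sentences = ["Animals received the compound by oral gavage daily for 14 days.",
             "Body weights were recorded weekly.",
             "The test species was the Sprague-Dawley rat."]

vec = TfidfVectorizer().fit([criterion] + sentences)
scores = cosine_similarity(vec.transform([criterion]), vec.transform(sentences))[0]
print(sentences[int(np.argmax(scores))])   # most relevant candidate sentence
```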
Combining Similarity Features and Deep Representation Learning for Stance Detection in the Context of Checking Fake News
Fake news is nowadays an issue of pressing concern, given its recent rise
as a potential threat to high-quality journalism and well-informed public
discourse. The Fake News Challenge (FNC-1) was organized in 2017 to encourage
the development of machine learning-based classification systems for stance
detection (i.e., for identifying whether a particular news article agrees,
disagrees, discusses, or is unrelated to a particular news headline), thus
helping in the detection and analysis of possible instances of fake news. This
article presents a new approach to tackle this stance detection problem, based
on the combination of string similarity features with a deep neural
architecture that leverages ideas previously advanced in the context of
learning efficient text representations, document classification, and natural
language inference. Specifically, we use bi-directional Recurrent Neural
Networks, together with max-pooling over the temporal/sequential dimension and
neural attention, for representing (i) the headline, (ii) the first two
sentences of the news article, and (iii) the entire news article. These
representations are then combined/compared, complemented with similarity
features inspired by other FNC-1 approaches, and passed to a final layer that
predicts the stance of the article towards the headline. We also explore the
use of external sources of information, specifically large datasets of sentence
pairs originally proposed for training and evaluating natural language
inference methods, in order to pre-train specific components of the neural
network architecture (e.g., the RNNs used for encoding sentences). The obtained
results attest to the effectiveness of the proposed ideas and show that our
model, particularly when considering pre-training and the combination of neural
representations together with similarity features, slightly outperforms the
previous state of the art.
Comment: Accepted for publication in the special issue of the ACM Journal of Data and Information Quality (ACM JDIQ) on Combating Digital Misinformation and Disinformation.
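A hedged sketch of the overall architecture, with invented dimensions and the attention mechanism omitted for brevity: a shared bidirectional GRU encoder with max-pooling over time represents the headline and the article, the two representations are combined/compared, concatenated with hand-crafted similarity features, and a final layer predicts one of the four FNC-1 stances.

```python
# Hedged sketch: BiRNN encoders + similarity features for FNC-1 stance detection.
import torch
import torch.nn as nn

vocab, emb_dim, hid, n_sim_feats = 5000, 50, 64, 4   # all sizes assumed

class StanceModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(vocab, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid, batch_first=True, bidirectional=True)
        self.out = nn.Linear(4 * 2 * hid + n_sim_feats, 4)   # agree/disagree/discuss/unrelated

    def encode(self, tokens):                        # (batch, seq_len) of token ids
        h, _ = self.rnn(self.emb(tokens))
        return h.max(dim=1).values                   # max-pooling over the sequence

    def forward(self, headline, body, sim_feats):
        h, b = self.encode(headline), self.encode(body)
        combined = torch.cat([h, b, torch.abs(h - b), h * b, sim_feats], dim=-1)
        return self.out(combined)

model = StanceModel()
logits = model(torch.randint(0, vocab, (2, 12)),     # 2 headlines
               torch.randint(0, vocab, (2, 200)),    # 2 article bodies
               torch.randn(2, n_sim_feats))          # hand-crafted similarity features
print(logits.shape)                                  # torch.Size([2, 4])
```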
Using Neural Generative Models to Release Synthetic Twitter Corpora with Reduced Stylometric Identifiability of Users
We present a method for generating synthetic versions of Twitter data using
neural generative models. The goal is protecting individuals in the source data
from stylometric re-identification attacks while still releasing data that
carries research value. Specifically, we generate tweet corpora that maintain
user-level word distributions by augmenting the neural language models with
user-specific components. We compare our approach to two standard text data
protection methods: redaction and iterative translation. We evaluate the three
methods on measures of risk and utility. We define risk following the
stylometric models of re-identification, and we define utility based on two
general word distribution measures and two common text analysis research tasks.
We find that neural models are able to significantly lower risk over previous
methods with little cost to utility. We also demonstrate that the neural models
allow data providers to actively control the risk-utility trade-off through
model tuning parameters. This work presents promising results for a new tool that addresses the problem of privacy in free text and enables sharing of social media data in a way that respects privacy and is ethically responsible.
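A minimal sketch of "augmenting the neural language models with user-specific components", assuming one simple realisation: a per-user embedding is concatenated to every word embedding fed to an LSTM language model so that generation can be conditioned on the user. All sizes are illustrative; the paper's actual architecture may differ.

```python
# Hedged sketch: user-conditioned LSTM language model for synthetic tweet generation.
import torch
import torch.nn as nn

vocab, n_users, w_dim, u_dim, hid = 8000, 100, 64, 16, 128   # illustrative sizes

class UserLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, w_dim)
        self.user_emb = nn.Embedding(n_users, u_dim)
        self.lstm = nn.LSTM(w_dim + u_dim, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, tokens, user_ids):             # tokens: (batch, seq), user_ids: (batch,)
        u = self.user_emb(user_ids).unsqueeze(1).expand(-1, tokens.size(1), -1)
        x = torch.cat([self.word_emb(tokens), u], dim=-1)   # user vector at every step
        h, _ = self.lstm(x)
        return self.out(h)                           # next-token logits per position

model = UserLM()
print(model(torch.randint(0, vocab, (4, 20)), torch.randint(0, n_users, (4,))).shape)
```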
Neural Network Architecture for Credibility Assessment of Textual Claims
Text articles with false claims, especially news, have recently become a
growing problem for Internet users. These articles are in wide circulation and
readers face difficulty discerning fact from fiction. Previous work on
credibility assessment has focused on factual analysis and linguistic features.
The task's main challenge is the distinction between the features of true and
false articles. In this paper, we propose a novel approach called Credibility
Outcome (CREDO) which aims at scoring the credibility of an article in an open
domain setting.
CREDO consists of different modules for capturing various features
responsible for the credibility of an article. These features include
credibility of the article's source and author, semantic similarity between the
article and related credible articles retrieved from a knowledge base, and
sentiments conveyed by the article. A neural network architecture learns the
contribution of each of these modules to the overall credibility of an article.
Experiments on the Snopes dataset reveal that CREDO outperforms
state-of-the-art approaches based on linguistic features.
Comment: Best Paper Award at the 19th International Conference on Computational Linguistics and Intelligent Text Processing, March 2018, Hanoi, Vietnam.
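A hedged sketch of how the module outputs might be combined, assuming each CREDO module emits a scalar feature (source credibility, author credibility, semantic similarity to retrieved credible articles, sentiment); a small feed-forward network then learns each module's contribution to the overall credibility score. The feature values are invented.

```python
# Hedged sketch: learn the contribution of per-module features to credibility.
import torch
import torch.nn as nn

modules = ["source", "author", "semantic_similarity", "sentiment"]  # assumed scalar features

combiner = nn.Sequential(
    nn.Linear(len(modules), 8),
    nn.ReLU(),
    nn.Linear(8, 1),
    nn.Sigmoid(),               # overall credibility score in [0, 1]
)

features = torch.tensor([[0.9, 0.7, 0.8, 0.1]])   # one article's module outputs (invented)
print(combiner(features))
```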
Deep Neural Networks for Query Expansion using Word Embeddings
Query expansion is a method for alleviating the vocabulary mismatch problem
present in information retrieval tasks. Previous works have shown that terms
selected for query expansion by traditional methods such as pseudo-relevance
feedback are not always helpful to the retrieval process. In this paper, we
show that this is also true for more recently proposed embedding-based query
expansion methods. We then introduce an artificial neural network classifier to
predict the usefulness of query expansion terms. This classifier uses term word
embeddings as inputs. Experiments on four TREC newswire and web
collections show that using terms selected by the classifier for expansion
significantly improves retrieval performance when compared to competitive
baselines. The results are also shown to be more robust than the baselines.
Comment: 8 pages, 1 figure.
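A minimal sketch of the expansion-term classifier, under the assumption that each candidate term is represented by its word embedding concatenated with an aggregate query embedding and scored for usefulness by a small feed-forward network; dimensions and inputs are illustrative.

```python
# Hedged sketch: predict whether a candidate expansion term helps retrieval.
import torch
import torch.nn as nn

emb_dim = 100                                     # word-embedding size (assumed)

classifier = nn.Sequential(
    nn.Linear(2 * emb_dim, 64),                   # [term embedding ; query embedding]
    nn.ReLU(),
    nn.Linear(64, 2),                             # useful vs. not useful
)

term_emb = torch.randn(16, emb_dim)               # 16 candidate expansion terms (random stand-ins)
query_emb = torch.randn(1, emb_dim).expand(16, -1)
logits = classifier(torch.cat([term_emb, query_emb], dim=-1))
print(logits.argmax(dim=-1))                      # predicted usefulness labels
```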
- …