1,421 research outputs found
Extending Topic Models With Syntax and Semantics Relationships
Probabilistic topic modeling is a powerful tool for uncovering the hidden thematic structure of documents. These hidden structures are useful for extracting document concepts and for other data mining tasks, such as information retrieval. Latent Dirichlet allocation (LDA) is a generative probabilistic topic model for collections of discrete data such as text corpora. LDA represents documents as bags of words, neglecting important document structure. In this work, we propose three extended LDA models that incorporate the syntactic and semantic structure of text documents into probabilistic topic models.
Our first proposed topic model enriches text documents with collapsed typed dependency relations to effectively capture syntactic and semantic dependencies between words. This representation has several benefits. It captures relations between both consecutive and nonconsecutive words of a document, and the labels of the collapsed typed dependency relations help to eliminate less important relations, e.g., those involving prepositions. Moreover, in this thesis, we introduce a method that enforces topic similarity among conceptually similar words, which leads to more coherent topic distributions over words.
Our second and third proposed generative topic models incorporate term importance into the latent topic variables by boosting the probability of important terms and correspondingly decreasing the probability of less important ones, to better reflect the themes of documents. In essence, we assign weights to terms using corpus-level and document-level approaches, and we incorporate term importance through a nonuniform base measure for an asymmetric prior over topic-term distributions in the LDA framework. This leads to better estimates for important terms that occur less frequently in documents. Experimental studies show the effectiveness of our work across a variety of text mining applications.
Furthermore, we employ our topic models to build a personalized content-based news recommender system. Our proposed recommender system eases reading and navigation through online newspapers. In essence, the recommender system acts as a filter, delivering only news articles likely to be relevant to a user. This recommender system has been used by The Globe and Mail, one of Canada's most authoritative news outlets, featuring national and international news.
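One way to picture the corpus-level term weighting described above is as a TF-IDF-style weight vector that, after normalization, serves as the nonuniform base measure of an asymmetric Dirichlet prior over topic-term distributions. The sketch below is a hypothetical illustration of that idea, not the thesis' exact weighting scheme; the concentration factor 0.1 is an assumption.

```python
import math
from collections import Counter

def term_weights(docs):
    """Corpus-level TF-IDF-style weights, normalized to sum to 1, usable as
    a nonuniform base measure for an asymmetric Dirichlet prior over
    topic-term distributions (illustrative, not the thesis' exact scheme)."""
    n_docs = len(docs)
    tf, df = Counter(), Counter()
    for doc in docs:
        tf.update(doc)
        df.update(set(doc))
    # Smoothed IDF; the tiny epsilon keeps every term's mass strictly positive.
    raw = {w: tf[w] * math.log((1 + n_docs) / (1 + df[w])) + 1e-9 for w in tf}
    total = sum(raw.values())
    return {w: v / total for w, v in raw.items()}

docs = [["topic", "model", "text"], ["topic", "inference"], ["text", "corpus", "text"]]
weights = term_weights(docs)
# Scaling by a concentration parameter turns the measure into Dirichlet
# pseudo-counts (the eta hyperparameter in common LDA implementations):
eta = {w: 0.1 * len(weights) * p for w, p in weights.items()}
```

Terms that are frequent but concentrated in few documents receive more prior mass, which is one plausible reading of "better estimates for important terms that occur less frequently."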
A Dependency-Based Neural Network for Relation Classification
Previous research on relation classification has verified the effectiveness
of using dependency shortest paths or subtrees. In this paper, we further
explore how to make full use of the combination of this dependency
information. We first propose a new structure, termed augmented dependency path
(ADP), which is composed of the shortest dependency path between two entities
and the subtrees attached to the shortest path. To exploit the semantic
representation behind the ADP structure, we develop dependency-based neural
networks (DepNN): a recursive neural network designed to model the subtrees,
and a convolutional neural network to capture the most important features on
the shortest path. Experiments on the SemEval-2010 dataset show that our
proposed method achieves state-of-the-art results.
Comment: This preprint is the full version of a short paper accepted at the
annual meeting of the Association for Computational Linguistics (ACL) 2015
(Beijing, China).
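The shortest dependency path at the core of the ADP structure can be recovered with a plain breadth-first search over the dependency tree treated as an undirected graph. The sketch below uses an invented toy edge list; a real pipeline would take the edges from a dependency parser.

```python
from collections import deque

def shortest_dep_path(edges, src, dst):
    """BFS over an (undirected) dependency tree to recover the shortest
    dependency path between two entity head words. The edge list and word
    labels here are hypothetical stand-ins for parser output."""
    adj = {}
    for head, dep in edges:
        adj.setdefault(head, []).append(dep)
        adj.setdefault(dep, []).append(head)
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            break
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    path, node = [], dst
    while node is not None:
        path.append(node)
        node = prev[node]
    return path[::-1]

# Toy tree: burned -> child (nsubj), burned -> configuration (dobj),
#           child -> speaker (nmod)
edges = [("burned", "child"), ("burned", "configuration"), ("child", "speaker")]
print(shortest_dep_path(edges, "speaker", "configuration"))
# ['speaker', 'child', 'burned', 'configuration']
```

In the paper's setting, the subtrees hanging off each word on this path would then be modeled by the recursive network, while the path itself feeds the convolutional network.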
Unsupervised Extraction of Representative Concepts from Scientific Literature
This paper studies the automated categorization and extraction of scientific
concepts from titles of scientific articles, in order to gain a deeper
understanding of their key contributions and facilitate the construction of a
generic academic knowledgebase. Towards this goal, we propose an unsupervised,
domain-independent, and scalable two-phase algorithm to type and extract key
concept mentions into aspects of interest (e.g., Techniques, Applications,
etc.). In the first phase of our algorithm we propose PhraseType, a
probabilistic generative model which exploits textual features and limited POS
tags to broadly segment text snippets into aspect-typed phrases. We extend this
model to simultaneously learn aspect-specific features and identify academic
domains in multi-domain corpora, since the two tasks mutually enhance each
other. In the second phase, we propose an approach based on adaptor grammars to
extract fine-grained concept mentions from the aspect-typed phrases without the
need for any external resources or human effort, in a purely data-driven
manner. We apply our technique to study literature from diverse scientific
domains and show significant gains over state-of-the-art concept extraction
techniques. We also present a qualitative analysis of the results obtained.
Comment: Published as a conference paper at CIKM 201
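A very loose intuition for the first-phase segmentation into candidate concept phrases is POS-pattern chunking: runs of adjectives followed by nouns often form concept mentions. The sketch below is an assumption-laden stand-in, not the paper's PhraseType model, and the pattern choice is purely illustrative.

```python
import re

def chunk_candidates(tagged):
    """Toy noun-phrase chunking over (word, POS) pairs — a simplistic
    stand-in for segmenting text snippets into candidate concept phrases.
    The adjective*-noun+ pattern is an assumption, not PhraseType."""
    # Encode the tag sequence as one character per token so a regex can scan it.
    code = "".join(
        "N" if tag.startswith("NN") else "J" if tag.startswith("JJ") else "O"
        for _, tag in tagged
    )
    return [
        " ".join(word for word, _ in tagged[m.start():m.end()])
        for m in re.finditer(r"J*N+", code)
    ]

tagged = [("deep", "JJ"), ("neural", "JJ"), ("networks", "NNS"),
          ("improve", "VBP"), ("relation", "NN"), ("classification", "NN")]
print(chunk_candidates(tagged))
# ['deep neural networks', 'relation classification']
```

The paper's actual approach goes further: it types such phrases into aspects (Techniques, Applications, etc.) probabilistically and then uses adaptor grammars to carve out fine-grained mentions.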
Stop Clickbait: Detecting and Preventing Clickbaits in Online News Media
Most of the online news media outlets rely heavily on the revenues generated
from the clicks made by their readers, and due to the presence of numerous such
outlets, they need to compete with each other for reader attention. To attract
the readers to click on an article and subsequently visit the media site, the
outlets often come up with catchy headlines accompanying the article links,
which lure the readers to click on the link. Such headlines are known as
clickbaits. While these baits may trick the readers into clicking, in the long
run, clickbaits usually don't live up to the expectation of the readers, and
leave them disappointed.
In this work, we attempt to automatically detect clickbaits and then build a
browser extension which warns the readers of different media sites about the
possibility of being baited by such headlines. The extension also offers each
reader an option to block clickbaits she doesn't want to see. Then, using such
reader choices, the extension automatically blocks similar clickbaits during
her future visits. We run extensive offline and online experiments across
multiple media sites and find that the proposed clickbait detection and the
personalized blocking approaches perform very well, achieving 93% accuracy in
detecting and 89% accuracy in blocking clickbaits.
Comment: 2016 IEEE/ACM International Conference on Advances in Social Networks
Analysis and Mining (ASONAM).
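To make the detection task concrete, here is a tiny hand-crafted feature scorer over a headline. It is a hypothetical toy, not the paper's trained classifier (which uses a much richer linguistic feature set and supervised learning); the feature names and word lists are assumptions.

```python
def clickbait_score(headline):
    """Average of four toy binary features commonly associated with
    clickbait-style headlines. Purely illustrative; the real detector is
    a supervised classifier over many more features."""
    words = headline.split()
    features = {
        "starts_with_number": words[0].isdigit(),               # "17 Things ..."
        "second_person": any(w.lower() in {"you", "your"} for w in words),
        "forward_reference": any(w.lower() in {"this", "these", "here"}
                                 for w in words),
        "exclamation": headline.endswith("!"),
    }
    return sum(features.values()) / len(features)

print(clickbait_score("17 Things You Won't Believe About This Trick"))
print(clickbait_score("Parliament passes the annual budget bill"))
```

A browser extension like the one described would threshold such a score (from the real model) to warn the reader, then personalize by learning from the headlines she chooses to block.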
Creating a Semantic Graph from Wikipedia
With the continued need to organize and automate the use of data, solutions are needed to transform unstructured text into structured information. By treating dependency grammar functions as programming language functions, this process produces property maps which connect entities (people, places, events) with snippets of information. These maps are used to construct a semantic graph. Using Wikipedia as input, a large graph of information is produced representing a section of history. The resulting graph allows a user to quickly browse a topic and view the interconnections between entities across history.
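The end product described above can be sketched as an adjacency-list graph populated from (entity, relation, entity) property maps. The class and the triples below are invented for illustration; the paper's actual property maps are extracted from dependency parses of Wikipedia text.

```python
class SemanticGraph:
    """Minimal adjacency-list graph built from (entity, relation, entity)
    triples — a sketch of the kind of structure the pipeline produces.
    The example triples are hypothetical."""
    def __init__(self):
        self.edges = {}

    def add(self, subj, relation, obj):
        """Record one property-map entry as a labeled directed edge."""
        self.edges.setdefault(subj, []).append((relation, obj))

    def neighbors(self, entity):
        """All (relation, entity) pairs reachable in one hop."""
        return self.edges.get(entity, [])

g = SemanticGraph()
g.add("Marie Curie", "born_in", "Warsaw")
g.add("Marie Curie", "awarded", "Nobel Prize in Physics")
g.add("Warsaw", "capital_of", "Poland")
print(g.neighbors("Marie Curie"))
```

Browsing a topic then amounts to walking outward from an entity node along its labeled edges.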
Matrix Factorization with Knowledge Graph Propagation for Unsupervised Spoken Language Understanding
Spoken dialogue systems (SDS) typically require a predefined semantic ontology to train a spoken language understanding (SLU) module. In addition to the annotation cost, a key challenge for designing such an ontology is to define a coherent slot set while considering their complex relations. This paper introduces a novel matrix factorization (MF) approach to learn latent feature vectors for utterances and semantic elements without the need of corpus annotations. Specifically, our model learns the semantic slots for a domain-specific SDS in an unsupervised fashion, and carries out semantic parsing using latent MF techniques. To further consider the global semantic structure, such as inter-word and inter-slot relations, we augment the latent MF-based model with a knowledge graph propagation model based on a slot-based semantic graph and a word-based lexical graph. Our experiments show that the proposed MF approaches produce better SLU models that are able to predict semantic slots and word patterns taking into account their relations and domain-specificity in a joint manner.
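The generic machinery behind this abstract is low-rank matrix factorization of an utterance-by-semantic-element matrix, with unobserved cells filled in by the learned factors. The sketch below is plain SGD factorization under that reading — it does not include the paper's knowledge graph propagation, and the toy matrix is invented.

```python
import random

def factorize(M, k=2, steps=500, lr=0.05, seed=0):
    """Plain SGD matrix factorization M ~ U V^T over observed cells only
    (None marks an unobserved utterance-slot cell). A generic sketch of
    latent MF, not the paper's exact objective."""
    rng = random.Random(seed)
    n, m = len(M), len(M[0])
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(m)]
    for _ in range(steps):
        for i in range(n):
            for j in range(m):
                if M[i][j] is None:          # skip unobserved cells
                    continue
                err = M[i][j] - sum(U[i][f] * V[j][f] for f in range(k))
                for f in range(k):           # gradient step on both factors
                    U[i][f], V[j][f] = (U[i][f] + lr * err * V[j][f],
                                        V[j][f] + lr * err * U[i][f])
    return U, V

# Rows: utterances; columns: candidate semantic slots; None = unobserved.
M = [[1, 1, None], [1, None, 0], [None, 1, 0]]
U, V = factorize(M)
# The learned factors impute the missing utterance-slot cells:
pred = sum(U[0][f] * V[2][f] for f in range(len(U[0])))
```

In the paper, the factorized matrix additionally encodes graph-propagated word and slot relations, so the imputed cells reflect global semantic structure rather than co-occurrence alone.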