On the Role of Text Preprocessing in Neural Network Architectures: An Evaluation Study on Text Categorization and Sentiment Analysis
Text preprocessing is often the first step in the pipeline of a Natural
Language Processing (NLP) system, with potential impact in its final
performance. Despite its importance, text preprocessing has not received much
attention in the deep learning literature. In this paper we investigate the
impact of simple text preprocessing decisions (particularly tokenizing,
lemmatizing, lowercasing and multiword grouping) on the performance of a
standard neural text classifier. We perform an extensive evaluation on standard
benchmarks from text categorization and sentiment analysis. While our
experiments show that a simple tokenization of input text is generally
adequate, they also highlight significant degrees of variability across
preprocessing techniques. This reveals the importance of paying attention to
this usually-overlooked step in the pipeline, particularly when comparing
different models. Finally, our evaluation provides insights into the best
preprocessing practices for training word embeddings.
Comment: Blackbox EMNLP 2018. 7 pages
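The four preprocessing decisions the paper compares can be sketched as a small pipeline. This is a minimal illustration, not the paper's actual setup: the lemma table and multiword list below are toy stand-ins for real linguistic resources.

```python
# Toy sketch of the four preprocessing decisions: tokenization, lowercasing,
# lemmatization, and multiword grouping. LEMMAS and MULTIWORDS are made-up
# stand-ins for real resources (e.g., a WordNet lemmatizer, a phrase list).
import re

LEMMAS = {"ran": "run", "cats": "cat", "better": "good"}   # toy lemma table
MULTIWORDS = [("new", "york"), ("machine", "learning")]    # toy multiword list

def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

def lowercase(tokens):
    return [t.lower() for t in tokens]

def lemmatize(tokens):
    return [LEMMAS.get(t, t) for t in tokens]

def group_multiwords(tokens):
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in MULTIWORDS:
            out.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = tokenize("The cats ran around New York.")
print(group_multiwords(lemmatize(lowercase(tokens))))
# → ['the', 'cat', 'run', 'around', 'new_york', '.']
```

Each stage is optional and composable, which is what makes it easy to evaluate the preprocessing variants independently.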
From Word to Sense Embeddings: A Survey on Vector Representations of Meaning
Over the past years, distributed semantic representations have proved to be
effective and flexible keepers of prior knowledge to be integrated into
downstream applications. This survey focuses on the representation of meaning.
We start from the theoretical background behind word vector space models and
highlight one of their major limitations: the meaning conflation deficiency,
which arises from representing a word with all its possible meanings as a
single vector. Then, we explain how this deficiency can be addressed through a
transition from the word level to the more fine-grained level of word senses
(in its broader acceptation) as a method for modelling unambiguous lexical
meaning. We present a comprehensive overview of the wide range of techniques in
the two main branches of sense representation, i.e., unsupervised and
knowledge-based. Finally, this survey covers the main evaluation procedures and
applications for this type of representation, and provides an analysis of four
of its important aspects: interpretability, sense granularity, adaptability to
different domains and compositionality.
Comment: 46 pages, 8 figures. Published in Journal of Artificial Intelligence Research
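The meaning conflation deficiency the survey describes can be shown with a toy example: a single word vector that averages two sense vectors sits close to neither sense. All vectors below are hypothetical two-dimensional values chosen purely for illustration.

```python
# Toy illustration of the meaning conflation deficiency: one vector for "bank"
# (the average of two hypothetical sense vectors) is diluted with respect to
# both senses. All vector values here are made up for illustration.
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

bank_finance = [0.9, 0.1]   # hypothetical vector for the financial sense
bank_river   = [0.1, 0.9]   # hypothetical vector for the river sense
bank_word    = [(a + b) / 2 for a, b in zip(bank_finance, bank_river)]  # conflated

money = [1.0, 0.0]
print(cos(bank_finance, money))  # high: the matching sense aligns with "money"
print(cos(bank_word, money))     # lower: pulled away by the river sense
```

Moving from word vectors to sense vectors, as the survey discusses, removes exactly this averaging effect.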
A Unified Multilingual Semantic Representation of Concepts
Semantic representation lies at the core of several applications in Natural Language Processing. However, most existing semantic representation techniques cannot be used effectively for the representation of individual word senses. We put forward a novel multilingual concept representation, called MUFFIN, which not only enables accurate representation of word senses in different languages, but also provides multiple advantages over existing approaches. MUFFIN represents a given concept in a unified semantic space irrespective of the language of interest, enabling cross-lingual comparison of different concepts. We evaluate our approach on two evaluation benchmarks, semantic similarity and Word Sense Disambiguation, reporting state-of-the-art performance on several standard datasets.
NASARI: a Novel Approach to a Semantically-Aware Representation of Items
The semantic representation of individual word senses and concepts is of fundamental importance to several applications in Natural Language Processing. To date, concept modeling techniques have in the main based their representation either on lexicographic resources, such as WordNet, or on encyclopedic resources, such as Wikipedia. We propose a vector representation technique that combines the complementary knowledge of both these types of resource. Thanks to its use of explicit semantics combined with a novel cluster-based dimensionality reduction and an effective weighting scheme, our representation attains state-of-the-art performance on multiple datasets in two standard benchmarks: word similarity and sense clustering. We are releasing our vector representations at http://lcl.uniroma1.it/nasari/
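The idea of an explicit, weighted semantic vector for a concept can be sketched with a tf-idf-style scheme: dimensions are words, and words specific to one concept outweigh words shared across concepts. The tiny corpus and the smoothed idf weighting below are stand-ins, not the paper's actual resources (WordNet and Wikipedia) or its lexical-specificity weighting.

```python
# Hedged sketch of a weighted, interpretable "semantic vector" for a concept,
# built from text associated with it. The two-document corpus and the
# tf-idf-style weighting are toy stand-ins for the paper's actual method.
import math
from collections import Counter

corpus = {  # hypothetical text associated with each concept
    "bank_finance": "bank money loan deposit interest money account",
    "bank_river":   "bank river water shore erosion flow",
}

def semantic_vector(concept):
    tf = Counter(corpus[concept].split())
    n_docs = len(corpus)
    vec = {}
    for word, count in tf.items():
        df = sum(1 for text in corpus.values() if word in text.split())
        vec[word] = count * math.log((1 + n_docs) / (1 + df))  # smoothed idf
    return vec

vec = semantic_vector("bank_finance")
# "money" (frequent, concept-specific) outweighs "bank" (shared across concepts)
print(vec["money"] > vec["bank"])  # → True
```

The appeal of explicit dimensions is interpretability: each weight can be read as "how characteristic this word is of this concept".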
Embeddings for Word Sense Disambiguation: An Evaluation Study
Recent years have seen a dramatic growth in the popularity of word embeddings mainly owing to their ability to capture semantic information from massive amounts of textual content. As a result, many tasks in Natural Language Processing have tried to take advantage of the potential of these distributional models. In this work, we study how word embeddings can be used in Word Sense Disambiguation, one of the oldest tasks in Natural Language Processing and Artificial Intelligence. We propose different methods through which word embeddings can be leveraged in a state-of-the-art supervised WSD system architecture, and perform a deep analysis of how different parameters affect performance. We show how a WSD system that makes use of word embeddings alone, if designed properly, can provide significant performance improvement over a state-of-the-art WSD system that incorporates several standard WSD features
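One simple way word embeddings can feed a supervised WSD system is to represent a target word's context as the average of its neighbors' vectors and compare it against per-sense centroids. This is a minimal sketch of that strategy; the embeddings, sense keys, and centroids below are made-up toy values, not the paper's trained models.

```python
# Sketch of context averaging for WSD: the context of the target word is the
# mean of its neighbors' embeddings; we pick the sense whose centroid is
# closest. All vectors and sense keys are hypothetical toy values.
import math

EMB = {  # toy word embeddings
    "money": [1.0, 0.0], "loan": [0.9, 0.1],
    "river": [0.0, 1.0], "water": [0.1, 0.9],
}
SENSE_CENTROIDS = {  # toy per-sense centroids (as if learned from labeled data)
    "bank%finance": [0.95, 0.05],
    "bank%river":   [0.05, 0.95],
}

def avg(vectors):
    return [sum(x) / len(vectors) for x in zip(*vectors)]

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def disambiguate(context_words):
    ctx = avg([EMB[w] for w in context_words if w in EMB])
    return max(SENSE_CENTROIDS, key=lambda s: cos(SENSE_CENTROIDS[s], ctx))

print(disambiguate(["money", "loan"]))   # → bank%finance
print(disambiguate(["river", "water"]))  # → bank%river
```

In a full system this context vector would be one feature among several, alongside standard WSD features such as surrounding words and part-of-speech tags.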
SensEmbed: Learning sense embeddings for word and relational similarity
Word embeddings have recently gained considerable popularity for modeling words in different Natural Language Processing (NLP) tasks including semantic similarity measurement. However, notwithstanding their success, word embeddings are by their very nature unable to capture polysemy, as different meanings of a word are conflated into a single representation. In addition, their learning process usually relies on massive corpora only, preventing them from taking advantage of structured knowledge. We address both issues by proposing a multifaceted approach that transforms word embeddings to the sense level and leverages knowledge from a large semantic network for effective semantic similarity measurement. We evaluate our approach on word similarity and relational similarity frameworks, reporting state-of-the-art performance on multiple datasets
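Once embeddings live at the sense level, a natural similarity strategy is to compare the closest pair of senses of two words rather than their single conflated vectors. The sketch below illustrates that idea with hypothetical toy sense vectors; it is not the paper's full method, which also draws on a large semantic network.

```python
# Sketch of sense-level similarity: score two words by the most similar pair
# of their sense vectors. The sense inventories and vectors are toy values.
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

senses = {  # hypothetical sense vectors per word
    "bank":  [[0.9, 0.1], [0.1, 0.9]],   # finance sense, river sense
    "money": [[1.0, 0.1]],
    "river": [[0.1, 1.0]],
}

def similarity(w1, w2):
    return max(cos(s1, s2) for s1 in senses[w1] for s2 in senses[w2])

print(similarity("bank", "money"))  # high: the finance sense matches
print(similarity("bank", "river"))  # also high: the river sense matches
```

With a single conflated vector, "bank" could not be highly similar to both "money" and "river" at once; per-sense comparison recovers both relations.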
Essays on Information Flows and Auction Outcomes in Business-to-Business Market: Theoretical and Empirical Evidence
In this dissertation, I present three separate essays in the context of Business-to-Business (B2B) auctions; in each I introduce a complex problem regarding the impact of information flows on auction performance that has not been addressed by prior auction literature.
The first two essays (Chapters 1 and 2) are empirical studies in the context of online secondary-market B2B auctions, while the third essay (Chapter 3) is a theoretical investigation that contributes to the B2B procurement auction literature. The findings from this dissertation have managerial implications for how and when auctioneers can improve the efficiency or success of their operations.
B2B auctions are a new type of venture that has begun to shape how industries of all types trade goods, and online B2B auctions have become particularly popular for industrial procurement and liquidation purposes. By using online B2B auctions, companies can benefit by creating competition when auctioning off goods or contracts to business customers. B2B procurement auctions, where the buyer runs an auction to procure goods and services from suppliers, have been documented as saving firms millions of dollars by lowering the cost of procurement. On the other hand, B2B auctions are also commonly used by sellers in secondary markets to liquidate left-over goods to business buyers in a timely fashion.
To maximize revenues in either industrial procurement or secondary-market settings, auctioneers should understand how auction participants behave and react to the available market information and auction design. Auctioneers can then use this knowledge to improve the performance of their B2B auctions by choosing the right auction design or strategies.
In the first essay, I investigate how an online B2B secondary-market auction environment can provide several sources of information that bidders can use to form their bids. One such information set that has been relatively understudied in the literature pertains to reference prices available to the bidder from other concurrent and comparable auctions. I examine how reference prices from such auctions affect bidding behavior in the focal auction, conditional on bidder type, using longitudinal data on auctions and bids for more than 4,000 B2B auctions collected from a large liquidator firm in North America.
In the second essay, I report on the results of a field experiment that I carried out on the secondary-market auction site of another of the nation's largest B2B wholesale liquidators. The design of this field experiment, run on an iPad marketplace, is aimed at understanding how (i) the starting price of an auction and (ii) the number of concurrent auctions for a specific (model, quality) pair, i.e., the supply of that product, interact to affect the auction's final price. I also explore how a seller should manage product differentiation so that she auctions off the right mix and supply of products at reasonable starting prices.
Finally, in the last essay, I study a norm used in many procurement auctions in which buyers grant the 'Right of First Refusal' (ROFR) to a favored supplier. Under ROFR, the favored supplier sees the bids of all other participating suppliers and has the opportunity to match the (current) winning bid. I verify the conventional wisdom that ROFR increases the buyer's procurement cost in a single-auction setting. With a looming second auction in the future (with the same participating suppliers), however, I show that the buyer lowers his procurement cost by granting the ROFR to a supplier. The analytical findings of this essay highlight the critical role of information flows and the timing of information release in procurement auctions with ROFR.
De-Conflated Semantic Representations
One major deficiency of most semantic representation techniques is that they usually model a word type as a single point in the semantic space, hence conflating all the meanings that the word can have. Addressing this issue by learning distinct representations for individual meanings of words has been the subject of several research studies in the past few years. However, the generated sense representations are either not linked to any sense inventory or are unreliable for infrequent word senses. We propose a technique that tackles these problems by de-conflating the representations of words based on the deep knowledge that can be derived from a semantic network. Our approach provides multiple advantages in comparison to the previous approaches, including its high coverage and the ability to generate accurate representations even for infrequent word senses. We carry out evaluations on six datasets across two semantic similarity tasks and report state-of-the-art results on most of them
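The de-conflation idea can be sketched as pulling a word's vector toward the vectors of words linked to a particular sense in a semantic network, yielding one vector per sense. The embeddings, the related-word lists, and the mixing weight below are toy stand-ins, not the paper's actual procedure or resources.

```python
# Hedged sketch of de-conflation: a sense vector is the word's vector biased
# toward the vectors of words related to that sense in a semantic network.
# All embeddings, sense keys, and the alpha weight are made-up toy values.
EMB = {
    "bank": [0.5, 0.5], "money": [1.0, 0.0], "loan": [0.9, 0.1],
    "river": [0.0, 1.0], "shore": [0.1, 0.9],
}
RELATED = {  # words linked to each sense in a (toy) semantic network
    "bank%finance": ["money", "loan"],
    "bank%river":   ["river", "shore"],
}

def avg(vectors):
    return [sum(x) / len(vectors) for x in zip(*vectors)]

def deconflate(word, sense, alpha=0.5):
    bias = avg([EMB[w] for w in RELATED[sense]])
    return [alpha * a + (1 - alpha) * b for a, b in zip(EMB[word], bias)]

print(deconflate("bank", "bank%finance"))  # shifted toward money/loan
print(deconflate("bank", "bank%river"))    # shifted toward river/shore
```

Because the bias words come from a semantic network rather than corpus frequency, this style of approach can produce a vector even for senses that are rare or absent in the training corpus, which is the coverage advantage the abstract emphasizes.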
Modeling the fake news challenge as a cross-level stance detection task
The 2017 Fake News Challenge Stage 1, a shared task for stance detection of news articles and claims pairs, has received a lot of attention in recent years [3]. The provided dataset is highly unbalanced, with a skewed distribution towards unrelated samples - that is, randomly generated pairs of news and claims belonging to different topics. This imbalance favored systems which performed particularly well in classifying those noisy samples, something which does not require a deep semantic understanding.
In this paper, we propose a simple architecture based on conditional encoding, carefully designed to model the internal structure of a news article and its relations with a claim. We demonstrate that our model, which only leverages information from word embeddings, can outperform a system based on a large number of hand-engineered features, which replicates one of the winning systems at the Fake News Challenge [6], in the stance detection of the related samples
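Conditional encoding can be sketched as: encode the claim first, then initialize the article encoder with the claim's final hidden state, so the article is read in the context of the claim. The tiny Elman-style cell and fixed weights below are toy stand-ins for the trained recurrent encoders the paper uses.

```python
# Sketch of conditional encoding for claim/article stance detection: the
# article encoder's initial hidden state is the claim encoder's final state.
# The per-dimension Elman-style cell and fixed weights are toy stand-ins.
import math

def rnn_encode(vectors, h0, w_in=0.5, w_rec=0.5):
    # h_t = tanh(w_in * x_t + w_rec * h_{t-1}), applied per dimension
    h = h0
    for x in vectors:
        h = [math.tanh(w_in * xi + w_rec * hi) for xi, hi in zip(x, h)]
    return h

claim   = [[1.0, 0.0], [0.8, 0.2]]           # toy claim word vectors
article = [[0.1, 0.9], [0.0, 1.0]]           # toy article word vectors

h_claim = rnn_encode(claim, h0=[0.0, 0.0])
h_article = rnn_encode(article, h0=h_claim)  # conditional: seeded with the claim
print(h_article)
```

The resulting article representation differs from an unconditional encoding of the same article, which is what lets the model judge the article's stance relative to the claim rather than in isolation.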