Dish Discovery via Word Embeddings on Restaurant Reviews

Abstract

This paper proposes a novel framework for automatic dish discovery via word embeddings on restaurant reviews. We collect a dataset of user reviews from Yelp and parse the reviews to extract dish words. We then use the processed reviews as training texts to learn the embedding vectors of words via the skip-gram model. A nearest-neighbor-style score function is proposed to rank the dishes based on their learned representations. We briefly analyze the preliminary experiments and present a web-based visualization at http://clip.csie.org/yelp/.

Keywords

dish discovery, word embeddings, dish-word extraction

Copyright held by the author(s). RecSys 2016 Poster Proceedings, September 15-19, 2016, Boston, USA.

Background

With the growth of social media, corporations such as Yelp have accumulated a large amount of user-generated content (UGC). In the literature, several studies have sought to surface the critical information hidden in this content.

Methodology

Our methodology consists of three parts: 1) dish-word recognition, 2) word embedding learning, and 3) dish score calculation. As alluded to earlier, UGC usually contains a degree of noise and varied language usage; therefore, extracting dish names from user reviews is a complicated task. For example, we observe in the dataset that users tend not to write the full name of a dish in their reviews; instead, they often write only the last word or the last two words of the name. To grapple with this issue, we use regular expressions (regexps) to extract dish names from the user reviews. However, this gives rise to another issue: a dish in one restaurant may share its name with dishes in other restaurants, which induces ambiguity and lowers the accuracy of matching the correct dish. We therefore attach each dish name to its restaurant name to resolve the ambiguity.
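The extraction step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the dish patterns and the `restaurant::dish` key format are hypothetical, standing in for the paper's regexp vocabulary of short (one- or two-word) dish names and its restaurant-name disambiguation.

```python
import re

# Hypothetical dish vocabulary: the last one or two words of full dish
# names, reflecting the paper's observation that users rarely write the
# full name of a dish in a review.
DISH_PATTERNS = [r"pad thai", r"ramen", r"tacos?"]

# One alternation over the vocabulary; \b keeps matches on word boundaries.
DISH_RE = re.compile(r"\b(" + "|".join(DISH_PATTERNS) + r")\b", re.IGNORECASE)

def extract_dishes(review: str, restaurant: str) -> list[str]:
    """Return dish tokens found in a review, keyed by restaurant name.

    Attaching the restaurant name resolves the ambiguity of identical
    dish names served at different restaurants.
    """
    return [f"{restaurant}::{m.group(1).lower()}" for m in DISH_RE.finditer(review)]

print(extract_dishes("Loved the Pad Thai and the taco platter.", "ThaiHouse"))
```

The disambiguated tokens (e.g. `ThaiHouse::pad thai`) can then be treated as single words when the review corpus is fed to the skip-gram model.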
We then use the collection of processed reviews as training texts to learn an embedding for each word in the reviews via a continuous-space language model, the skip-gram model. After the training phase, each word (including every dish) is represented by an n-dimensional vector, called the embedding of the word. Inspired by the k-nearest-neighbors algorithm, we define the score of every dish d as

    score(d) = \sum_{i=1}^{m} \lambda_i \cdot \frac{1}{i} \sum_{j=1}^{i} \| w_d - w_{s_j} \|,

where m is the total number of positive sentiment words considered and λ_i (i = 1, ..., m) is a weighting parameter. In addition, s_i denotes the i-th-nearest positive sentiment word to the given dish d, and w_d, w_{s_i} ∈ R^n are the vector representations of the dish d and the sentiment word s_i, respectively. In the extreme case (1) of λ_m = 1 and λ_i = 0 for i = 1, ..., m − 1, this score function implements the average Euclidean distance between a dish and all the positive sentiment words; in case (2), λ_1 = 1 and λ_i = 0 for i = 2, ..., m, the score is the distance between the dish and its closest positive sentiment word.

Experiments

Our preliminary experiments involve a real-world restaurant review dataset collected from the Yelp Data Challenge.
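The score function above can be sketched as follows. This is a sketch under assumptions, not the authors' code: the toy embedding vectors stand in for skip-gram output, and the score is reconstructed as a λ-weighted combination of the average distances to the i nearest positive sentiment words, which is consistent with the two limiting cases described in the text.

```python
import numpy as np

# Toy pre-trained embeddings standing in for skip-gram output (assumed values).
emb = {
    "ThaiHouse::pad thai": np.array([0.9, 0.1]),
    "delicious":           np.array([1.0, 0.0]),
    "tasty":               np.array([0.8, 0.2]),
    "amazing":             np.array([0.0, 1.0]),
}
positive_words = ["delicious", "tasty", "amazing"]  # m = 3

def dish_score(dish: str, lambdas: list[float]) -> float:
    """Weighted nearest-neighbor score: lower means closer to positive sentiment.

    For each i, average the Euclidean distances to the i nearest positive
    sentiment words, then combine these averages with weights lambda_i.
    """
    d = emb[dish]
    dists = sorted(np.linalg.norm(d - emb[s]) for s in positive_words)
    return sum(lam * np.mean(dists[: i + 1]) for i, lam in enumerate(lambdas))

# Case (1): lambda_m = 1 -> average distance to all positive sentiment words.
print(dish_score("ThaiHouse::pad thai", [0.0, 0.0, 1.0]))
# Case (2): lambda_1 = 1 -> distance to the single closest positive word.
print(dish_score("ThaiHouse::pad thai", [1.0, 0.0, 0.0]))
```

Ranking dishes by this score (ascending, since it is a distance) then surfaces the dishes whose embeddings lie closest to positive sentiment words in the learned space.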
