
    Unsupervised Aspect Discovery from Online Consumer Reviews

    The success of online review websites has produced an overwhelming number of online consumer reviews, which have become an important tool for consumers deciding whether to purchase a product. This growth has created a need for applications that present the information in a meaningful way. Such applications often rely on domain-specific semantic lexicons, which are expensive and time-consuming to build. This thesis proposes an unsupervised approach to product aspect discovery in online consumer reviews. We apply a two-step hierarchical clustering process: we first cluster terms by the semantic similarity of their contexts, and then by the similarity of the hypernyms of the cluster members. The method also includes a process for assigning a class label to each cluster. An experiment then shows how the proposed methods can be used to measure aspect-based sentiment. The methods are evaluated on a set of 157,865 reviews from a major commercial website; the two-step clustering process increases cluster F-scores over a single round of clustering. Finally, the proposed methods are compared to a state-of-the-art topic modelling approach by Titov and McDonald (2008).
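The two-step clustering described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the average-link merging rule, the cosine and Jaccard similarity measures, the thresholds, and the toy `contexts` and `hypernyms` tables are all assumptions made for the example. In practice the hypernyms would come from a lexical resource rather than a hand-written dictionary.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def jaccard(a, b):
    """Jaccard overlap between two sets."""
    return len(a & b) / len(a | b) if a | b else 0.0


def agglomerate(items, similarity, threshold):
    """Merge the most similar pair of clusters until none reaches `threshold`."""
    clusters = [frozenset([item]) for item in items]
    while len(clusters) > 1:
        (a, b), best = max(
            (((a, b), similarity(a, b)) for a, b in combinations(clusters, 2)),
            key=lambda scored: scored[1],
        )
        if best < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters


def two_step(contexts, hypernyms, t_context, t_hypernym):
    # Step 1: cluster terms by the average cosine similarity of their contexts.
    def context_sim(a, b):
        pairs = [(x, y) for x in a for y in b]
        return sum(cosine(contexts[x], contexts[y]) for x, y in pairs) / len(pairs)

    step1 = agglomerate(sorted(contexts), context_sim, t_context)

    # Step 2: cluster the step-1 clusters by the overlap of their members' hypernyms.
    def hypernym_sim(a, b):
        ha = set().union(*(hypernyms.get(t, set()) for c in a for t in c))
        hb = set().union(*(hypernyms.get(t, set()) for c in b for t in c))
        return jaccard(ha, hb)

    step2 = agglomerate(step1, hypernym_sim, t_hypernym)
    return [frozenset(t for c in group for t in c) for group in step2]


# Toy data (invented for illustration): context-word counts and hypernyms per term.
contexts = {
    "screen": Counter({"bright": 3, "resolution": 2}),
    "display": Counter({"bright": 2, "resolution": 3}),
    "monitor": Counter({"large": 2, "stand": 1}),
    "battery": Counter({"charge": 3, "lasts": 2}),
    "charger": Counter({"charge": 2, "lasts": 1}),
}
hypernyms = {
    "screen": {"display_device"},
    "display": {"display_device"},
    "monitor": {"display_device"},
    "battery": {"power_supply"},
    "charger": {"power_supply"},
}

clusters = two_step(contexts, hypernyms, t_context=0.5, t_hypernym=0.5)
for cluster in clusters:
    print(sorted(cluster))
```

In this toy data, "monitor" shares no context words with "screen" or "display", so the first step leaves it on its own; the hypernym step is what pulls it into the same cluster, which is the kind of gain a second round of clustering is meant to provide.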

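    Diversité et recommandation : une investigation sur l'apport de la fouille d'opinions pour la distinction d'articles d'opinion dans une controverse médiatique (Diversity and recommendation: an investigation into the contribution of opinion mining to distinguishing opinion articles in a media controversy)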

    Web-based reading services such as Google News and Yahoo! News have become increasingly popular with the growth of online news consumption. To help users cope with information overload, these platforms build recommender systems and personalization techniques into their search engines. These systems help users find content that matches their personal interests and tastes, using their browsing history and past behavior as a basis for recommendations. However, recommender systems can limit the diversity of thought and the range of political perspectives that circulate within the informational environment. As a consequence, relevant ideas and questions may not be seen, debatable assumptions may be taken as facts, and overspecialized recommendations may reinforce confirmation bias, special interests, tribalism, and extremist opinions. When the informational environment is insufficiently diverse, open inquiry, dialogue and constructive disagreement are lost and, as a result, public discourse degrades overall. Studies in artificial intelligence that try to solve the diversity problem for news recommender systems face several questions, including how to represent digital texts in the vector model from a set of statistically discriminating words, and how to develop a statistical measure that maximizes the difference between the similar articles proposed to the user by the recommendation process. Studies based on opinion mining techniques tackle the diversity problem differently, by automatically detecting differences of perspective between news articles on the same topic during the recommendation process. In this latter approach, the vector representation of digital texts relies on a set of words associated with the expression of opinion, such as adjectives or emotion words. However, those techniques are less effective at detecting differences of opinion in a publicly argued debate, because the expression of opinion in political discussion is not necessarily linked to the journalist's subjectivity or emotions.
The aims of our research are (1) to systematize and validate an opinion mining method that helps identify divergent opinions within a controversy in the press and (2) to explore the applicability of this method to a news recommender system. We treat controversy as a type of opinion debate in the press in which at least two camps are explicitly opposed in how they see and understand a question of importance to their community. Our research raises questions about how to define opinion in this context and discusses the relevance of drawing on discursive and enunciative theories in opinion mining. The experimental corpus comprises 495 opinion articles about the 2012 student protest in Quebec against the increase in tuition fees announced by the government of Liberal Premier Jean Charest. Articles were classified into two categories, ETUD and GOUV, representing the two types of opinion that dominated the debate: those that favored the students and the continuation of the strike, and those that favored the government and criticized the student movement. Methodologically, our research builds on previous studies that apply techniques from corpus linguistics to opinion mining, as well as on concepts from François Rastier's Interpretative Semantics. Our research systematizes the steps of this approach, advocating a contrastive and interpretative description of the corpus in order to identify and interpret the specific words that distinguish the types of opinion to be classified. This allows us to select textual features that are interpretable and that describe the enunciative phenomena studied in the corpus; these features are then used to represent the digital texts in the vector model. The approach proposed in these previous works was validated on the press corpus built for the experiment.
The results show that selecting 447 textual features through an interpretative approach to the corpus performs better for the automatic classification of the opinion articles than selecting a set of words without regard to linguistic factors tied to the corpus. Our research also evaluated the possibility of applying this approach to a news recommender system, by studying the chronological evolution of the vocabulary in the corpus. We show that features selected at the beginning of the controversy effectively predict the opinion of articles published later, suggesting that the selection of interpretable features can benefit a news recommender system that proposes opinion articles from a media controversy.
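The chronological evaluation described in this abstract (select features early in the controversy, then classify later articles) might look something like the sketch below. Everything concrete here is an assumption for illustration: the six `FEATURES`, the toy article texts and dates, the cutoff, and the nearest-centroid classifier are invented and are not taken from the thesis, which reports 447 interpretively selected features and does not name its classifier in the abstract. Only the ETUD and GOUV labels come from the source.

```python
from collections import Counter
from math import sqrt

# Hypothetical hand-selected features; stand-ins for the interpretively chosen ones.
FEATURES = ["strike", "solidarity", "accessibility", "law", "order", "taxpayers"]


def vectorize(text, features=FEATURES):
    """Represent a text as counts of the selected features only."""
    counts = Counter(text.lower().split())
    return [counts[f] for f in features]


def cosine(u, v):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def train(labelled_texts):
    """Nearest-centroid model: one mean feature vector per label."""
    by_label = {}
    for text, label in labelled_texts:
        by_label.setdefault(label, []).append(vectorize(text))
    return {
        label: [sum(col) / len(vectors) for col in zip(*vectors)]
        for label, vectors in by_label.items()
    }


def predict(model, text):
    """Return the label whose centroid is most similar to the text's vector."""
    vector = vectorize(text)
    return max(model, key=lambda label: cosine(model[label], vector))


# Invented toy articles as (date, text, label); not drawn from the actual corpus.
articles = [
    ("2012-02-20", "students defend accessibility and solidarity during the strike", "ETUD"),
    ("2012-03-05", "the strike shows solidarity for accessibility", "ETUD"),
    ("2012-03-10", "taxpayers expect law and order on campus", "GOUV"),
    ("2012-03-18", "order must return and taxpayers cannot pay more", "GOUV"),
    ("2012-05-02", "solidarity grows as the strike continues", "ETUD"),
    ("2012-05-20", "the law protects taxpayers and restores order", "GOUV"),
]

CUTOFF = "2012-04-01"  # ISO dates compare correctly as strings
early = [(text, label) for date, text, label in articles if date < CUTOFF]
late = [(text, label) for date, text, label in articles if date >= CUTOFF]

model = train(early)
predictions = [predict(model, text) for text, _ in late]
accuracy = sum(p == label for p, (_, label) in zip(predictions, late)) / len(late)
print(predictions, accuracy)
```

Restricting the vector to a fixed feature list is what makes the temporal test meaningful: the model never sees vocabulary that only appears after the cutoff, so any accuracy on later articles comes from features that were already available early on.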