    IRIT at TREC Knowledge Base Acceleration 2013: Cumulative Citation Recommendation Task

    This paper describes the IRIT lab's participation in the Cumulative Citation Recommendation task of the TREC 2013 Knowledge Base Acceleration track. In this task, participants implement a system that detects "vital" documents, i.e., documents a human would want to cite when updating the Wikipedia article for the target entity. Our approach consists of two steps. First, for each topic (entity), we retrieve a set of potentially relevant documents containing at least one mention of the entity. These documents are then classified with a supervised learning algorithm to identify which ones are vital. We submitted three runs using different combinations of features. The results obtained are presented and discussed.
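
    The two-step pipeline lends itself to a compact sketch. The following is a hypothetical illustration, not the IRIT system: the entity, the toy training data and the plain tf-idf features are stand-ins for the richer feature combinations used in the submitted runs.

```python
# Hypothetical sketch of the two-step CCR pipeline (not the IRIT system):
# step 1 keeps documents that mention the entity, step 2 classifies them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def mentions(doc: str, entity: str) -> bool:
    """Step 1: cheap filter keeping documents with at least one entity mention."""
    return entity.lower() in doc.lower()

# Toy labeled data: (document, is_vital) pairs for one target entity.
entity = "Acme Corp"
train = [("Acme Corp announced a merger today.", 1),
         ("A list of companies including Acme Corp.", 0)]

vec = TfidfVectorizer()
X = vec.fit_transform([d for d, _ in train])
clf = LogisticRegression().fit(X, [y for _, y in train])

stream = ["Acme Corp names a new CEO.", "Weather report for Tuesday."]
candidates = [d for d in stream if mentions(d, entity)]        # step 1
vital = [d for d in candidates
         if clf.predict(vec.transform([d]))[0] == 1]           # step 2
print(vital)
```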

    Novelty Detection by Latent Semantic Indexing

    As a new topic in text mining, novelty detection is a natural extension of information retrieval systems, or search engines. By refining raw search results, filtering out old news and keeping only the novel messages, it relieves users of information overload. One of the difficulties in novelty detection is the inherent ambiguity of language, which is the carrier of information. Among the sources of ambiguity, synonymy proves to be a notable factor. To address this issue, previous studies mainly employed WordNet, a lexical database which can be perceived as a thesaurus. Rather than borrowing a dictionary, we proposed a statistical approach employing Latent Semantic Indexing (LSI) to learn semantic relationships automatically from language resources. An immediate problem in applying LSI, which involves matrix factorization, is that the dataset in novelty detection is dynamic and changes constantly. To imitate a real-world scenario, texts are ranked in chronological order and examined one by one. Each text is only compared with those having appeared earlier, while later ones remain unknown. As a result, the data matrix starts as a one-row vector representing the first report, and a new row is added at the bottom every time we read a new document. Such a changing dataset makes it hard to employ matrix methods directly. Although LSI has long been acknowledged as an effective text mining method when considering semantic structure, it had never been used in novelty detection, nor had other statistical treatments. We tried to change this situation by introducing an external text source to build the latent semantic space, onto which the incoming news vectors were projected. We used the Reuters-21578 dataset and the TREC data as sources of latent semantic information. Topics were divided by year and type in order to take the differences between them into account. Results showed that LSI, though very effective in traditional information retrieval tasks, brought only a slight improvement in performance for some data types. The extent of the improvement depended on the similarity between the news data and the external information. Probing the co-occurrence matrix attributed this limited performance to the unique features of microblogs: their short sentence lengths and restricted vocabulary made it very hard to recover and exploit latent semantic information via traditional data structures.
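
    The projection-based workflow described above can be sketched compactly. The snippet below is a minimal illustration under assumed data and an illustrative similarity threshold, not the paper's implementation: the latent space is fit on an external corpus, each incoming document is projected into it, and a document is flagged as novel when its maximum cosine similarity to all earlier documents stays below the threshold.

```python
# Minimal sketch (assumptions, not the paper's code) of novelty detection
# with LSI: fit the latent space on an external corpus, project each
# incoming document, and flag it as novel if it is dissimilar from all
# earlier ones.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

external_corpus = ["oil prices rose sharply", "the central bank cut rates",
                   "a storm hit the coast"]       # stand-in for Reuters-21578
stream = ["oil prices went up again", "crude oil became more expensive"]

vec = TfidfVectorizer()
lsi = TruncatedSVD(n_components=2)                # latent dimension is a free choice
lsi.fit(vec.fit_transform(external_corpus))

seen, threshold = [], 0.8                         # threshold is illustrative
for doc in stream:
    v = lsi.transform(vec.transform([doc]))[0]
    v /= np.linalg.norm(v) + 1e-12                # unit norm => dot product = cosine
    novel = all(float(s @ v) < threshold for s in seen)
    print("NOVEL" if novel else "OLD", "->", doc)
    seen.append(v)
```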

    Detecting Vital Documents in Massive Data Streams

    Existing knowledge bases, including Wikipedia, are typically written and maintained by a group of volunteer editors. Meanwhile, numerous web documents are being published, partly due to the popularization of online news and social media. Some of these web documents, called "vital documents", contain novel information that should be taken into account when updating articles of the knowledge bases. However, it is practically impossible for the editors to manually monitor all the relevant web documents. Consequently, there is a considerable time lag between an edit to a knowledge base and the publication dates of such vital documents. This paper proposes a real-time framework for detecting web documents containing novel information in massive document streams. The framework consists of a two-step filter using statistical language models. Further, the framework is implemented on Apache Storm, a distributed and fault-tolerant real-time computation system, in order to process the large number of web documents. On a publicly available web document data set, the TREC KBA Stream Corpus, the validity of the proposed framework is demonstrated in terms of detection performance and processing time.
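
    As a rough illustration of the two-step idea, the sketch below keeps documents that mention a target entity and then scores them with a log-likelihood ratio between a "vital" unigram language model and a background model. The toy corpora, the add-one smoothing and the zero decision threshold are assumptions for the sake of the example, not the paper's design.

```python
# Hedged sketch of a two-step filter with unigram language models
# (a simplification of the framework described above, not its code).
import math
from collections import Counter

def unigram_lm(docs):
    """Add-one-smoothed unigram probabilities learned from a document set."""
    counts = Counter(w for d in docs for w in d.lower().split())
    total = sum(counts.values())
    vocab_size = len(counts)
    return lambda w: (counts[w] + 1) / (total + vocab_size + 1)

vital_lm = unigram_lm(["ceo resigns", "company acquired in merger"])     # toy data
background_lm = unigram_lm(["the weather is mild", "a quiet day in town"])

def llr(doc):
    """Step 2: log-likelihood ratio of the document under the two models."""
    return sum(math.log(vital_lm(w) / background_lm(w)) for w in doc.lower().split())

doc = "company ceo resigns after merger"
if "company" in doc:                         # step 1: entity-mention filter
    print("vital" if llr(doc) > 0.0 else "not vital")
```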

    Event summarization on social media stream: retrospective and prospective tweet summarization

    User-generated content on social media such as Twitter often carries the latest news before traditional media, allowing users both to get a retrospective summary of an event and to be updated in a timely fashion whenever a new development occurs. However, while social media is a valuable source of information, it can also be overwhelming given the volume and velocity of published information. To shield users from irrelevant and redundant posts, retrospective summarization and prospective notification (real-time summarization) were introduced as two complementary information-seeking tasks on document streams. The former aims to select a list of relevant and non-redundant tweets that capture "what happened"; in the latter, systems monitor the live post stream and push relevant and novel notifications as soon as possible. Our work falls within this framework and focuses on developing tweet summarization approaches for the two aforementioned scenarios. It aims at providing summaries that capture the key aspects of the event of interest, helping users to efficiently acquire information and follow the development of long-running events on social media. Tweet summarization nevertheless faces many challenges that stem from, on the one hand, the high volume, velocity and variety of the published information and, on the other hand, the quality of tweets, which can vary significantly. In prospective notification, the core task is relevance and novelty detection in real time; a system may push new updates immediately or may trade timeliness for higher notification quality. Our contributions address these levels. First, we introduce the Word Similarity Extended Boolean Model (WSEBM), a relevance model that does not rely on stream statistics and takes advantage of a word embedding model. We use word similarity instead of traditional term-weighting techniques, which mitigates the shortness and word-mismatch issues in tweets: the intuition is that a context-aware similarity measure such as word2vec can match different words carrying the same meaning, offsetting vocabulary mismatch when computing the similarity between a tweet and a topic. Second, we compute the novelty score of an incoming tweet against all words of the tweets already pushed to the user, instead of using pairwise tweet-to-tweet comparison; this novelty detection method scales better and reduces execution time, which suits real-time tweet filtering. Third, we propose an adaptive learning-to-filter approach that leverages social signals as well as query-dependent features; to overcome the issue of setting a relevance threshold, we use a binary classifier that predicts the relevance of the incoming tweet. In addition, we show the gain that can be achieved by taking advantage of ongoing relevance feedback. Finally, we adopt a real-time push strategy and show that the proposed approach achieves promising performance in terms of quality (relevance and novelty) at a low latency cost, whereas state-of-the-art approaches tend to trade latency for higher quality. This thesis also explores a novel approach to generating a retrospective summary that follows a different paradigm from the majority of state-of-the-art methods. We consider summary generation as an optimization problem that takes into account topical and temporal diversity. Tweets are filtered and incrementally clustered into two cluster types, namely topical clusters based on content similarity and temporal clusters based on publication time. Summary generation is formulated as an integer linear program in which the unknown variables are binary, the objective function is maximized, and the constraints ensure that at most one post per cluster is selected, within a predefined summary length limit.
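
    The selection step of the retrospective summarizer can be written down directly as an integer linear program. The sketch below uses the PuLP solver with made-up scores, lengths and cluster assignments; the thesis does not prescribe a particular solver, so this is only one possible realization of the formulation described above.

```python
# Hedged sketch of the summary-selection ILP: maximize total tweet score,
# pick at most one tweet per cluster, and respect a length budget.
import pulp

scores   = [0.9, 0.7, 0.8, 0.4]        # relevance scores (assumed given)
lengths  = [18, 25, 20, 15]            # tweet lengths in words
clusters = [0, 0, 1, 1]                # cluster id of each tweet
budget   = 40                          # maximum summary length in words
n = len(scores)

prob = pulp.LpProblem("tweet_summary", pulp.LpMaximize)
x = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(n)]

prob += pulp.lpSum(scores[i] * x[i] for i in range(n))          # objective
for c in set(clusters):                                         # one per cluster
    prob += pulp.lpSum(x[i] for i in range(n) if clusters[i] == c) <= 1
prob += pulp.lpSum(lengths[i] * x[i] for i in range(n)) <= budget

prob.solve()
print([i for i in range(n) if x[i].value() == 1])               # selected tweets
```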

    Filtrage et agrégation d'informations vitales relatives à des entités

    Nowadays, knowledge bases such as Wikipedia and DBpedia are the main sources of information on a wide variety of entities (an entity is a thing that can be distinctly identified, such as a person, an organization, a product, an event, etc.). However, updating these sources with new information related to a given entity is done manually by contributors, with a significant latency, particularly if that entity is not popular. A system that analyzes documents as they are published on the Web, in order to filter important information about entities, would likely accelerate the updating of these knowledge bases. In this thesis, we are interested in filtering timely and relevant information, called vital information, concerning entities. We aim to answer the following two questions: (1) How to detect whether a document is vital (i.e., it provides timely, relevant information) for an entity? and (2) How to extract vital information from these documents to build a temporal summary of the entity that can serve as a reference for updating the corresponding knowledge base entry? Regarding the first question, we proposed two methods. The first is fully supervised and based on a vitality language model. The second measures the freshness of the temporal expressions in a document to decide its vitality. Concerning the second question, we proposed a method that selects sentences based on the presence of trigger words automatically retrieved from the knowledge already represented in the knowledge base (such as the descriptions of similar entities). We carried out our experiments on the TREC Stream Corpus 2013 and 2014, with 1.2 billion documents and different types of entities (persons, organizations, facilities and events). For the vital-document filtering approaches, we conducted our experiments in the context of the Knowledge Base Acceleration (KBA) task for the years 2013 and 2014; our method based on leveraging the temporal expressions in documents obtained good results, outperforming the best participating system in KBA 2013. To evaluate the contributions concerning the extraction of vital information about entities, we relied on the experimental framework of the Temporal Summarization (TS) task, and we showed that our generated temporal summaries help reduce the latency of knowledge base updates.
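
    The freshness-based vitality method can be illustrated with a toy computation: extract the temporal expressions of a document and measure how many of them fall close to its publication date. The regex-based date extraction and the seven-day window below are illustrative stand-ins for a real temporal tagger and a tuned threshold, not the thesis's actual procedure.

```python
# Toy sketch of the freshness idea: call a document vital if most of its
# temporal expressions lie within a window around its publication date.
import re
from datetime import date

DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")    # stand-in for a temporal tagger

def freshness(text, pub_date, window_days=7):
    dates = [date(int(y), int(m), int(d)) for y, m, d in DATE_RE.findall(text)]
    if not dates:
        return 0.0
    fresh = [d for d in dates if abs((pub_date - d).days) <= window_days]
    return len(fresh) / len(dates)

doc = "On 2013-03-02 the facility reopened; it had closed on 2009-05-01."
print(freshness(doc, date(2013, 3, 4)))             # 0.5 -> one of two dates is fresh
```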

    Combining granularity-based topic-dependent and topic-independent evidences for opinion detection

    Opinion mining, a sub-discipline within information retrieval (IR) and computational linguistics, refers to the computational techniques for extracting, classifying, understanding, and assessing the opinions expressed in various online sources such as news articles, social media comments, and other user-generated content. It is also known by many other terms, such as opinion finding, opinion detection, sentiment analysis, sentiment classification, polarity detection, etc. Defined in a more specific and simpler context, opinion mining is the task of retrieving opinions on an issue as expressed by the user in the form of a query. There are many problems and challenges associated with the field, and in this thesis we focus on several of them. One of the major challenges is to find opinions that specifically concern the given topic (query). A document may contain information about many topics at once, and may contain opinionated text about each of them or about only a few; it is therefore important to select the segments of the document that are relevant to the topic, together with their corresponding opinions. We address this problem at two levels of granularity: sentences and passages. In our sentence-level approach, we use WordNet semantic relations to find the association between topic and opinion. In our passage-level approach, we use a more robust IR model, namely language modeling, to tackle the same problem. The basic idea behind both contributions is that a document containing more opinionated, topic-relevant text segments (sentences or passages) is more opinionated than a document with fewer such segments. Most machine-learning-based opinion mining approaches are domain-dependent, i.e., their performance varies from one domain to another; a domain- or topic-independent approach is more generalizable and can maintain its effectiveness across domains, but such approaches usually suffer from poor performance. Developing an approach that is both effective and generalizable is a major challenge in opinion mining, and our contributions include an approach that uses simple heuristic functions to find opinionated documents. Entity-based opinion mining, which aims to identify the entities relevant to a given topic and to extract the opinions associated with them from a set of text documents, is becoming very popular in the IR community; however, identifying entities and determining their relevance is already a difficult task. We propose a system that takes into account both the current news article and relevant prior articles in order to detect the most important entities in the current news. In addition, we present our framework for opinion analysis and related tasks, based on content evidence and social evidence from the blogosphere, covering opinion finding, opinion prediction and multidimensional review ranking; this early contribution lays the groundwork for our future work. Our evaluations use the TREC Blog 2006 collection and the TREC 2004 Novelty track collection, and were mostly carried out within the framework of the TREC Blog track.
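
    As an illustration of the sentence-level topic-opinion association, the sketch below counts sentences that are both topic-relevant (through WordNet synonym expansion, via NLTK) and subjective (through a toy opinion lexicon). It is a simplified stand-in for the thesis's method; the lexicon, tokenization and scoring rule are assumptions.

```python
# Illustrative sketch (not the thesis system): a document's opinion score
# is the number of sentences that are both topic-relevant and subjective.
import re
from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

OPINION_WORDS = {"great", "terrible", "love", "hate", "disappointing"}  # toy lexicon

def expand(query_terms):
    """Query terms plus their WordNet synonyms."""
    terms = set(query_terms)
    for w in query_terms:
        for syn in wn.synsets(w):
            terms.update(lemma.name().lower() for lemma in syn.lemmas())
    return terms

def opinion_score(sentences, query_terms):
    """Count sentences containing both a topic term and an opinion word."""
    topic = expand(query_terms)
    score = 0
    for s in sentences:
        words = set(re.findall(r"[a-z]+", s.lower()))
        if words & topic and words & OPINION_WORDS:
            score += 1
    return score

doc = ["I love this phone.", "The display has a screen.", "Shipping was terrible."]
print(opinion_score(doc, ["phone"]))    # -> 1 (only the first sentence matches both)
```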

    Filtering News from Document Streams: Evaluation Aspects and Modeled Stream Utility

    Events like hurricanes, earthquakes, or accidents can impact a large number of people. Not only are people in the immediate vicinity of the event affected; concerns about their well-being are shared by the local government and well-wishers across the world. The latest information about news events can help governments and aid agencies make informed decisions on providing necessary support, security and relief. The general public receives news updates via dedicated news feeds or broadcasts and, lately, via social media services like Facebook or Twitter. Retrieving the latest information about newsworthy events from the world-wide web is thus of importance to a large section of society. As new content on a multitude of topics is continuously being published on the web, specific event-related information needs to be filtered from the resulting stream of documents. We present in this thesis a user-centric evaluation measure for evaluating systems that filter news-related information from document streams. Our proposed evaluation measure, Modeled Stream Utility (MSU), models users accessing information from a stream of sentences produced by a news update filtering system. The user model allows for simulating a large number of users with different characteristic stream-browsing behaviors. Through simulation, MSU estimates the utility of a system for an average user browsing a stream of sentences. Our results show that system performance is sensitive to a user population's stream-browsing behavior and that existing evaluation metrics correspond to very specific types of user behavior. To evaluate systems that filter sentences from a document stream, we need a set of judged sentences. This judged set is a subset of all the sentences returned by all systems, and is typically constructed by pooling together the highest-quality sentences, as determined by the scores each system assigns to its sentences. Sentences in the pool are manually assessed, and the resulting set of judged sentences is then used to compute system performance metrics. In this thesis, we investigate the effect that including duplicates of judged sentences in the judged set has on system performance evaluation. We also develop an alternative pooling methodology that, given the MSU user model, selects sentences for pooling based on the probability of a sentence being read by the modeled users. Our research lays the foundation for interesting future work on utilizing user models in different aspects of the evaluation of stream filtering systems: the MSU measure enables the incorporation of different user models, and its applicability could be extended through calibration based on observed user behavior.
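
    The user-simulation idea behind MSU can be conveyed with a toy Monte-Carlo sketch. The persistence-based stopping model, the gains and all numbers below are illustrative assumptions, not the published metric definition: each simulated user walks down the pushed sentence stream, stops with some probability after every sentence, and accumulates relevance gain along the way.

```python
# Toy Monte-Carlo sketch of user simulation over a sentence stream
# (an illustration of the concept, not the official MSU metric).
import random

def simulated_utility(gains, persistence=0.8, n_users=10000, seed=0):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_users):
        for g in gains:                 # gains in stream (chronological) order
            total += g
            if rng.random() > persistence:
                break                   # this simulated user stops reading here
    return total / n_users              # utility for the "average" user

stream_gains = [1.0, 0.0, 0.5, 1.0]     # per-sentence relevance (assumed judged)
print(simulated_utility(stream_gains))
```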

    Diversité et systÚme de recommandation : application à une plateforme de blogs à fort trafic

    Recommender systems aim at automatically providing objects related to users' interests. These tools are increasingly used on content platforms to help users access information. In this context, a user's interests can be modeled from the visited content and/or the user's actions (clicks, comments, etc.). However, these interests cannot be modeled for an unknown user (the cold-start issue). Modeling is therefore complex, and recommendations are often far from the user's real interests. In addition, existing approaches are generally not able to guarantee good performance on high-traffic platforms hosting a significant volume of data. To obtain more relevant recommendations for each user, we propose a recommender system model that builds a list of recommendations aiming to cover a large range of interests, even when only little information about the user is available. The model is based on diversity: it uses different interest measures and an aggregation function to build the final set of recommendations. We demonstrate the value of our approach using reference collections and through a user study. Finally, we evaluate our model on the OverBlog platform to validate its scalability in an industrial context.
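
    The aggregation of several interest measures into one diverse list can be sketched with a simple Borda-style vote. The measures and items below are invented for illustration and do not reflect the production OverBlog system or the thesis's exact aggregation function.

```python
# Hedged sketch of diversity by aggregation: several selection measures
# each rank the candidates, and a Borda-style vote builds the final list.
from collections import defaultdict

def aggregate(rankings, k=3):
    """rankings: list of item lists, best first. Returns top-k by Borda score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for pos, item in enumerate(ranking):
            scores[item] += len(ranking) - pos      # higher rank => more points
    return sorted(scores, key=scores.get, reverse=True)[:k]

by_popularity = ["post_a", "post_b", "post_c", "post_d"]   # illustrative measures
by_recency    = ["post_d", "post_c", "post_a", "post_b"]
by_similarity = ["post_b", "post_a", "post_d", "post_c"]

print(aggregate([by_popularity, by_recency, by_similarity]))
```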

    Diversité et systÚme de recommandation : application à une plateforme de blogs à fort trafic (convention CIFRE n°20091274)

    Recommender systems aim at automatically providing objects related to users' interests. These tools are increasingly used on content platforms to help users access information. In this context, a user's interests can be modeled from the visited content and/or the user's actions (clicks, comments, etc.). However, these interests cannot be modeled for an unknown user (the cold-start issue). Modeling is therefore complex, and recommendations are often far from the user's real interests. In addition, existing approaches are generally not able to guarantee good performance on high-traffic platforms hosting a significant volume of data. To obtain more relevant recommendations for each user, we propose a recommender system model that builds a list of recommendations aiming to cover a large range of interests, even when only little information about the user is available. The model is based on diversity: it uses different interest measures and an aggregation function to build the final set of recommendations. We demonstrate the value of our approach using reference collections and through a user study. Finally, we evaluate our model on the OverBlog platform to validate its scalability in an industrial context.

    Promoting user engagement and learning in search tasks by effective document representation

    Much research in information retrieval (IR) focuses on optimisation of the rank of relevant retrieval results for single-shot ad hoc IR tasks. Relatively little research has been carried out on supporting and promoting user engagement within search tasks. We seek to improve the user experience through enhanced document snippets, presented during the search process to promote user engagement with retrieved information. The primary role of document snippets within search has traditionally been to indicate the potential relevance of retrieved items to the user's information need. Beyond the relevance of an item, it is generally not possible to infer the contents of individual ranked results just by reading the current snippets. We hypothesise that the creation of richer document snippets and summaries, and effective presentation of this information to users, will promote effective search and greater user engagement, and support emerging areas such as learning through search. We generate document summaries for a given query by extracting the top relevant sentences from retrieved documents. The creation of these summaries goes beyond existing snippet creation methods by comparing content between documents, so as to take novelty into account when selecting content for inclusion in individual document summaries. Further, we investigate the readability of the generated summaries, with the overall goal of generating snippets that not only help a user to identify document relevance, but are also designed to increase the user's understanding and knowledge of a topic gained while inspecting the snippets. We perform a task-based user study to record users' interactions, search behaviour and feedback, and to evaluate the effectiveness of our snippets using qualitative and quantitative measures. In our user study, we found that the richer snippets generated in this work improved the user experience and topical knowledge, and helped users to learn about the topic effectively.
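
    The between-document novelty comparison can be illustrated with an MMR-style selection that uses word-overlap similarity: sentences are picked for their relevance to the query, discounted by their similarity to sentences already chosen. This is a simplified sketch under assumed similarity and weighting choices, not the thesis pipeline.

```python
# Minimal sketch of novelty-aware snippet selection (an MMR-style
# illustration with Jaccard word-overlap similarity).
def words(s):
    return set(s.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def build_snippet(sentences, query, chosen_so_far, size=2, lam=0.7):
    """Greedily pick `size` sentences: relevant to the query, novel w.r.t.
    sentences already chosen for this or earlier snippets."""
    q = words(query)
    picked = []
    for _ in range(size):
        def mmr(s):
            rel = jaccard(words(s), q)
            red = max((jaccard(words(s), words(c))
                       for c in chosen_so_far + picked), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max((s for s in sentences if s not in picked), key=mmr, default=None)
        if best is None:
            break
        picked.append(best)
    return picked

doc = ["solar power capacity grew fast", "panels became cheaper",
       "solar power capacity grew fast last year"]
print(build_snippet(doc, "solar power growth", chosen_so_far=[]))
```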