    INEX Tweet Contextualization Task: Evaluation, Results and Lesson Learned

    Microblogging platforms such as Twitter are increasingly used for online client and market analysis. This motivated the proposal of a new Tweet Contextualization track at the CLEF INEX lab. The objective of this task was to help a user understand a tweet by providing a short explanatory summary (500 words). This summary should be built automatically using resources like Wikipedia, by extracting relevant passages and aggregating them into a coherent summary. Over the four years the task ran, results showed that the best systems combine NLP techniques with more traditional methods. More precisely, the best performing systems combine passage retrieval, sentence segmentation and scoring, named entity recognition, part-of-speech (POS) analysis, anaphora detection, a diversity content measure, and sentence reordering. This paper provides a full summary report on the four-year task. While the yearly overviews focused on system results, in this paper we provide a detailed report on the approaches proposed by the participants, which can be considered the state of the art for this task. As an important outcome of the four-year competition, we also describe the open-access resources that have been built and collected. The evaluation measures for automatic summarization designed in DUC or MUC were not appropriate for evaluating tweet contextualization; we explain why and describe in detail the LogSim measure used to evaluate the informativeness of the produced contexts or summaries. Finally, we mention the lessons we learned that are worth considering when designing such a task.
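    The extract-and-aggregate pipeline described in the abstract (passage retrieval, sentence segmentation and scoring, selection under a 500-word budget, sentence reordering) can be sketched in a deliberately simplified form. The term-overlap scorer below is a hypothetical stand-in for the richer features (NER, POS analysis, anaphora detection, diversity measures) used by the actual participating systems.

```python
import re
from collections import Counter

def contextualize(tweet, passages, budget=500):
    """Toy tweet-contextualization pipeline: segment passages into
    sentences, score each sentence by term overlap with the tweet,
    greedily select sentences under a word budget, then reorder the
    selection by original position to keep the summary coherent."""
    tweet_terms = Counter(re.findall(r"\w+", tweet.lower()))

    # Sentence segmentation (naive split on sentence-final punctuation).
    sentences = []
    for p in passages:
        sentences.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", p) if s.strip())

    # Score: how many tweet-term occurrences the sentence covers.
    def score(sent):
        terms = Counter(re.findall(r"\w+", sent.lower()))
        return sum(min(terms[t], tweet_terms[t]) for t in tweet_terms)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)

    # Greedy selection of relevant sentences under the word budget.
    chosen, words = [], 0
    for i in ranked:
        n = len(sentences[i].split())
        if score(sentences[i]) > 0 and words + n <= budget:
            chosen.append(i)
            words += n

    # Sentence reordering: restore original document order.
    return " ".join(sentences[i] for i in sorted(chosen))
```

    Real systems replace the scorer with learned relevance models and handle coherence explicitly; this sketch only mirrors the overall pipeline shape.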

    Vers une représentation du contexte thématique en Recherche d'Information (Towards a Representation of Topical Context in Information Retrieval)

    When searching for information within knowledge bases or document collections, humans use an information retrieval system (IRS). So that it can retrieve documents containing relevant information, users have to provide the IRS with a representation of their information need. Nowadays, this representation of the information need is composed of a small set of keywords, often referred to as the query. A few words may however not be sufficient to accurately and effectively represent the complete cognitive state of a human with respect to her initial information need. A query may not contain sufficient information if the user is searching for a topic in which she is not confident at all. Hence, without some kind of context, the IRS could simply miss nuances or details that the user did not or could not provide in the query. In this thesis, we explore and propose various statistical, automatic, and unsupervised methods for representing the topical context of the query. More specifically, we aim to identify the latent concepts of a query without involving the user in the process or requiring explicit feedback. We experiment with using and combining several general information sources representing the main types of information we deal with on a daily basis while browsing the Web. We also leverage probabilistic topic models (such as Latent Dirichlet Allocation) in a pseudo-relevance feedback setting. Besides, we propose a method for jointly estimating the number of latent concepts of a query and the set of pseudo-relevant feedback documents that is most suitable for modeling these concepts. We evaluate our approaches using four large TREC test collections. In the appendix of this thesis, we also propose an approach for contextualizing short messages which leverages both information retrieval and automatic summarization techniques.
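    The pseudo-relevance feedback setting that the thesis builds on can be illustrated with a minimal sketch. The `retrieve` function, the feedback-set size `k_docs`, and the frequency-based term scoring are hypothetical simplifications; in the thesis, probabilistic topic models such as LDA stand in for this naive term counting when modeling the latent concepts of the query.

```python
import re
from collections import Counter

def prf_expand(query, retrieve, k_docs=3, k_terms=5):
    """Toy pseudo-relevance feedback: run an initial retrieval for the
    query, treat the top-k ranked documents as (pseudo-)relevant, and
    append their most frequent non-query terms to the query as latent
    topical context. `retrieve` is any function mapping a query string
    to a ranked list of document strings."""
    top = retrieve(query)[:k_docs]
    query_terms = set(re.findall(r"\w+", query.lower()))

    # Count candidate expansion terms in the pseudo-relevant set,
    # skipping query terms and very short tokens.
    counts = Counter()
    for doc in top:
        for t in re.findall(r"\w+", doc.lower()):
            if t not in query_terms and len(t) > 2:
                counts[t] += 1

    expansion = [t for t, _ in counts.most_common(k_terms)]
    return query + " " + " ".join(expansion)
```

    The jointly estimated quantities studied in the thesis (number of latent concepts, size of the feedback set) correspond here to choosing `k_terms` and `k_docs`, which this sketch simply fixes by hand.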

    Geographic information extraction from texts

    A large volume of unstructured text containing valuable geographic information is available online. This information, provided implicitly or explicitly, is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although considerable progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss recent advances, new ideas, and concepts, but also to identify research gaps in geographic information extraction.