3 research outputs found

    Design, implementation and experiment of a YeSQL Web Crawler

    We describe a novel, "focusable", scalable, distributed web crawler based on GNU/Linux and PostgreSQL, which we designed to be easily extensible and have released under a GNU public licence. We also report a first use case: an analysis of Twitter streams about the 2012 French presidential election and the URLs they contain.

    B!SON: A Tool for Open Access Journal Recommendation

    Finding a suitable open access journal in which to publish scientific work is a complex task: researchers have to navigate a constantly growing number of journals, institutional agreements with publishers, funders’ conditions and the risk of predatory publishers. To help with these challenges, we introduce a web-based journal recommendation system called B!SON. It is developed based on a systematic requirements analysis, is built on open data, gives publisher-independent recommendations and works across domains. It suggests open access journals based on the title, abstract and references provided by the user. The recommendation quality has been evaluated using a large test set of 10,000 articles. Development by two German scientific libraries ensures the longevity of the project.

    Vers une représentation du contexte thématique en Recherche d'Information

    When searching for information within knowledge bases or document collections, humans use an information retrieval system (IRS). So that it can retrieve documents containing relevant information, users have to provide the IRS with a representation of their information need. Nowadays, this representation is composed of a small set of keywords, often referred to as the "query". A few words may, however, not be sufficient to accurately and effectively represent the complete cognitive state of a human with respect to her initial information need. A query may not contain sufficient information if the user is searching for a topic she is not at all confident about. Hence, without some kind of context, the IRS could simply miss nuances or details that the user did not or could not provide in the query. In this thesis, we explore and propose various statistical, automatic and unsupervised methods for representing the topical context of the query. More specifically, we aim to identify the latent concepts of a query without involving the user in the process or requiring explicit feedback. We experiment with using and combining several general information sources representing the main types of information we deal with on a daily basis while browsing the Web. We also leverage probabilistic topic models (such as Latent Dirichlet Allocation) in a pseudo-relevance feedback setting. In addition, we propose a method for jointly estimating the number of latent concepts of a query and the set of pseudo-relevant feedback documents best suited to modeling these concepts. We evaluate our approaches using four large TREC test collections. In the appendix of this thesis, we also propose an approach for contextualizing short messages that leverages both information retrieval and automatic summarization techniques.