3 research outputs found

Design, implementation and experiment of a YeSQL Web Crawler

Author: Deveaud Romain
Francony Jean-Marc
Joulin Pierre
Para Françoise
SanJuan-Ibekwe Eric
Publication date: 01/08/2012

We describe a novel, "focusable", scalable, distributed web crawler based on GNU/Linux and PostgreSQL, which we designed to be easily extensible and have released under a GNU public licence. We also report a first use case: an analysis of Twitter streams about the 2012 French presidential elections and the URLs they contain.

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

B!SON: A Tool for Open Access Journal Recommendation

Author: Entrup Elias
Eppelin Anita
Ewerth Ralph
Hartwig Josephine
Hoppe Anett
Tullney Marco
Wohlgemuth Michael
Publication venue: Heidelberg : Springer
Publication date: 01/01/2022

Finding a suitable open access journal to publish scientific work is a complex task: researchers have to navigate a constantly growing number of journals, institutional agreements with publishers, funders’ conditions and the risk of predatory publishers. To help with these challenges, we introduce a web-based journal recommendation system called B!SON. It was developed from a systematic requirements analysis, is built on open data, gives publisher-independent recommendations and works across domains. It suggests open access journals based on the title, abstract and references provided by the user. The recommendation quality has been evaluated using a large test set of 10,000 articles. Development by two German scientific libraries ensures the longevity of the project.

Repositorium für Naturwissenschaften und Technik

Vers une représentation du contexte thématique en Recherche d'Information

Author: BELLOT Patrice
DEVEAUD Romain
SANJUAN Eric
Publication date: 01/01/2013

When searching for information within knowledge bases or document collections, humans use an information retrieval system (IRS). So that it can retrieve documents containing relevant information, users have to provide the IRS with a representation of their information need. Nowadays, this representation of the information need is composed of a small set of keywords often referred to as the "query". A few words may however not be sufficient to accurately and effectively represent the complete cognitive state of a human with respect to her initial information need. A query may not contain sufficient information if the user is searching for some topic in which she is not confident at all. Hence, without some kind of context, the IRS could simply miss some nuances or details that the user did not or could not provide in the query. In this thesis, we explore and propose various statistical, automatic and unsupervised methods for representing the topical context of the query. More specifically, we aim to identify the latent concepts of a query without involving the user in the process or requiring explicit feedback. We experiment with using and combining several general information sources representing the main types of information we deal with on a daily basis while browsing the Web. We also leverage probabilistic topic models (such as Latent Dirichlet Allocation) in a pseudo-relevance feedback setting. In addition, we propose a method that jointly estimates the number of latent concepts of a query and the set of pseudo-relevant feedback documents most suitable for modeling these concepts. We evaluate our approaches using four large TREC test collections. In the appendix of this thesis, we also propose an approach for contextualizing short messages which leverages both information retrieval and automatic summarization techniques.

OpenGrey Repository