326 research outputs found

    Building user interest profiles from wikipedia clusters

    Users of search systems are often reluctant to build explicit profiles of their search interests, so building user profiles automatically is an important research area for personalized search. One difficulty is finding a knowledge system that provides broad coverage of user search interests. In this work, we describe a method for building category-ID-based user profiles from a user's historical search data. Our approach makes significant use of Wikipedia as an external knowledge resource.

    Fast redshift clustering with the Baire (ultra) metric

    The Baire metric induces an ultrametric on a dataset and is of linear computational complexity, contrasted with the standard quadratic time agglomerative hierarchical clustering algorithm. We apply the Baire distance to spectrometric and photometric redshifts from the Sloan Digital Sky Survey using, in this work, about half a million astronomical objects. We want to know how well the (more costly to determine) spectrometric redshifts can predict the (more easily obtained) photometric redshifts, i.e. we seek to regress the spectrometric on the photometric redshifts, and we develop a clusterwise nearest neighbor regression procedure for this. Comment: 14 pages, 6 figures
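
A minimal sketch of the idea behind the Baire distance, not the authors' code. It assumes the common convention that two values sharing their first k decimal digits are at distance 2^-k, that values lie in [0, 1), and that a fixed precision of four digits is enough; the function names are hypothetical. Grouping values by a shared digit prefix takes a single pass over the data, which is where the linear cost comes from.

```python
from collections import defaultdict
from decimal import Decimal, ROUND_DOWN


def digit_string(x, precision=4):
    """Return the first `precision` decimal digits of x (assumed in [0, 1)), truncated."""
    step = Decimal(10) ** -precision
    q = Decimal(str(x)).quantize(step, rounding=ROUND_DOWN)
    return f"{q:f}".split(".")[1]


def baire_distance(x, y, precision=4):
    """Distance 2**-k, where k is the length of the longest common digit prefix."""
    a, b = digit_string(x, precision), digit_string(y, precision)
    k = 0
    for da, db in zip(a, b):
        if da != db:
            break
        k += 1
    return 2.0 ** -k


def cluster_by_prefix(values, k, precision=4):
    """Bucket values by their first k digits in one pass over the data."""
    clusters = defaultdict(list)
    for v in values:
        clusters[digit_string(v, precision)[:k]].append(v)
    return dict(clusters)


redshifts = [0.1234, 0.1256, 0.1311, 0.2]  # made-up illustrative values
clusters = cluster_by_prefix(redshifts, k=2)
```

All members of one bucket at prefix length k are within 2^-k of each other, so increasing k refines the partition and yields a hierarchy without any pairwise distance computation.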

    Determining the Characteristic Vocabulary for a Specialized Dictionary using Word2vec and a Directed Crawler

    Specialized dictionaries are used to understand concepts in specific domains, especially where those concepts are not part of the general vocabulary or have meanings that differ from ordinary language. The first step in creating a specialized dictionary is detecting the characteristic vocabulary of the domain in question. Classical methods for detecting this vocabulary involve gathering a domain corpus, calculating statistics on the terms found there, and then comparing these statistics to a background or general language corpus. Terms which are found significantly more often in the specialized corpus than in the background corpus are candidates for the characteristic vocabulary of the domain. Here we present two tools, a directed crawler and a distributional semantics package, that can be used together, circumventing the need for a background corpus. Both tools are available on the web.
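
The classical corpus-comparison method described above can be sketched as follows. This illustrates only the background-corpus baseline, not the authors' crawler or Word2vec tools; the add-one smoothing, the `min_ratio` threshold, and the toy token lists are assumptions made for the example.

```python
from collections import Counter


def characteristic_terms(domain_tokens, background_tokens, min_ratio=2.0):
    """Rank terms by how much more frequent they are in the domain corpus."""
    dom = Counter(domain_tokens)
    bg = Counter(background_tokens)
    n_dom = sum(dom.values())
    n_bg = sum(bg.values())
    vocab_size = len(set(dom) | set(bg))

    scored = []
    for term, count in dom.items():
        # Add-one smoothing keeps unseen background terms from dividing by zero.
        p_dom = (count + 1) / (n_dom + vocab_size)
        p_bg = (bg[term] + 1) / (n_bg + vocab_size)
        ratio = p_dom / p_bg
        if ratio >= min_ratio:
            scored.append((term, ratio))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


domain = "the tort of negligence and the tort of trespass".split()
background = "the cat and the dog sat on the mat of the house".split()
candidates = characteristic_terms(domain, background)
```

On real corpora a significance test such as log-likelihood would normally replace the raw ratio; the ratio is used here only to keep the sketch short.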

    Supporting polyrepresentation in a quantum-inspired geometrical retrieval framework

    The relevance of a document has many facets, going beyond the usual topical one, which have to be considered to satisfy a user's information need. Multiple representations of documents, like user-given reviews or the actual document content, can give evidence towards certain facets of relevance. In this respect polyrepresentation of documents, where such evidence is combined, is a crucial concept for estimating the relevance of a document. In this paper, we discuss how a geometrical retrieval framework inspired by quantum mechanics can be extended to support polyrepresentation. We show by example how different representations of a document can be modelled in a Hilbert space, similar to physical systems known from quantum mechanics. We further illustrate how these representations are combined by means of the tensor product to support polyrepresentation, and discuss the case where representations of documents are not independent from a user's point of view. Besides giving a principled framework for polyrepresentation, this approach has the potential to capture and formalise the complex interdependent relationships that the different representations can have with each other.
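
A small numerical sketch of combining two representations with the tensor product. It assumes real-valued unit vectors in two dimensions and measures relevance as the squared projection onto a "relevant" direction; the vectors, names, and numbers are hypothetical and are not taken from the paper.

```python
import math


def tensor(u, v):
    """Tensor (Kronecker) product of two vectors, as a flat list."""
    return [a * b for a in u for b in v]


def inner(u, v):
    """Inner product of two real vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


def relevance_probability(state, relevant):
    """Squared projection of a unit state vector onto a unit 'relevant' vector."""
    return inner(state, relevant) ** 2


# Hypothetical unit vectors for two representations of one document.
content = [0.6, 0.8]
reviews = [1 / math.sqrt(2), 1 / math.sqrt(2)]

# Hypothetical 'relevant' direction in each representation space.
relevant_content = [1.0, 0.0]
relevant_reviews = [1.0, 0.0]

p_content = relevance_probability(content, relevant_content)
p_reviews = relevance_probability(reviews, relevant_reviews)

# Combined system: tensor product of states and of relevant directions.
p_combined = relevance_probability(
    tensor(content, reviews), tensor(relevant_content, relevant_reviews)
)
```

For product states like these, the combined probability equals the product of the individual ones. A combined state that cannot be written as a single tensor product would not factorize this way, which is one way to picture representations that are not independent.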

    Text Classification: A Sequential Reading Approach

    We propose to model the text classification process as a sequential decision process. In this process, an agent learns to classify documents into topics while reading the document sentences sequentially, and learns to stop as soon as enough information has been read to decide. The proposed algorithm is based on modeling text classification as a Markov Decision Process and learns by using Reinforcement Learning. Experiments on four different classical mono-label corpora show that the proposed approach performs comparably to classical SVM approaches for large training sets, and better for small training sets. In addition, the model automatically adapts its reading process to the quantity of training information provided. Comment: ECIR201

    A language model for jointly ranking relevant entities in a heterogeneous information network

    In this paper, we propose a new model, called BibRank, which aims to jointly rank heterogeneous resources (documents and authors) of a bibliographic network according to their degree of relevance to a query. The model uses the principle of propagating entity scores, considering both the structure of the network and the topic of the query. In addition, the model introduces two indicators of topical proximity between connected entities, depending on the type of the entities linked. For relations between homogeneous entities, the indicator detects marginal citations, while for relations between heterogeneous entities it uses two sources of evidence: the topic of the document and the expertise of the author. Experiments conducted on the CiteSeerX bibliographic network show the effectiveness of the proposed ranking model.

    A social model for Literature Access: Towards a weighted social network of authors

    This paper presents a novel retrieval approach for literature access based on social network analysis. We investigate a social model in which authors are the main entities and relationships are extracted from co-author and citation links. We define a weighting model for social relationships that takes into account the authors' positions in the social network and their mutual collaborations. The assigned weights express influence, knowledge transfer and shared interest between authors. Furthermore, we estimate document relevance by combining the document-query similarity and the document's social importance derived from its authors. To evaluate the effectiveness of our model, we conduct a series of experiments on a scientific document dataset that includes textual content and social data extracted from the academic social network CiteULike. Final results show that the proposed model improves retrieval effectiveness and outperforms traditional and social information retrieval baselines.
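
One simple way to combine a document-query similarity with a social importance score is a linear interpolation, sketched below. The abstract does not state how the two scores are combined, so the interpolation, the `alpha` parameter, and the toy scores are all assumptions made for illustration.

```python
def combined_score(text_similarity, social_importance, alpha=0.5):
    """Linear interpolation of two scores; alpha weights the textual evidence."""
    return alpha * text_similarity + (1 - alpha) * social_importance


def rank(documents, alpha=0.5):
    """Sort documents by combined score, highest first."""
    return sorted(
        documents,
        key=lambda d: combined_score(d["sim"], d["social"], alpha),
        reverse=True,
    )


# Made-up scores, assumed to be normalized to [0, 1].
documents = [
    {"id": "a", "sim": 0.9, "social": 0.1},
    {"id": "b", "sim": 0.6, "social": 0.8},
]

balanced = [d["id"] for d in rank(documents, alpha=0.5)]
text_only = [d["id"] for d in rank(documents, alpha=1.0)]
```

With equal weights the socially important document "b" is ranked first, while a text-only ranking prefers "a", which shows how the social signal can reorder results.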