598 research outputs found

    Web Document Models for Web Information Retrieval

    http://www.emse.fr/OSWIR05/2005-oswir-p19-beigbeder.pdf
    International audience. Different Web document models in relation to the hypertext nature of the Web are presented. The Web graph is the best-known and most widely used data extracted from the Web hypertext. The ways it has been used in works related to information retrieval are surveyed. Finally, some considerations about the integration of these works in a Web search engine are presented.

    Monte Carlo Methods for Top-k Personalized PageRank Lists and Name Disambiguation

    We study the problem of quickly detecting top-k Personalized PageRank lists. This problem has a number of important applications, such as finding local cuts in large graphs, estimating similarity distance, and name disambiguation. In particular, we apply our results to construct efficient algorithms for the person name disambiguation problem. We argue that two observations are important when finding top-k Personalized PageRank lists. First, it is crucial to detect the top-k most important neighbours of a node quickly, while the exact order within the top-k list and the exact PageRank values are far less crucial. Second, a small number of wrong elements does not really degrade the quality of a top-k list, and tolerating them can lead to significant computational savings. Based on these two key observations, we propose Monte Carlo methods for fast detection of top-k Personalized PageRank lists. We provide a performance evaluation of the proposed methods and supply stopping criteria. We then apply the methods to the person name disambiguation problem. The resulting algorithm achieved second place in the WePS 2010 competition.
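The idea of trading exactness for speed can be illustrated with a minimal Monte Carlo sketch. This is not the authors' algorithm: it assumes a plain adjacency-list graph, a fixed number of walks instead of the stopping criteria the abstract mentions, and hypothetical names (`monte_carlo_topk_ppr`, `damping`, `num_walks`). Each walk starts at the seed node and continues with probability `damping`; normalized visit counts serve as rough Personalized PageRank estimates.

```python
import random
from collections import Counter


def monte_carlo_topk_ppr(graph, seed_node, k, damping=0.85, num_walks=10000, rng=None):
    """Estimate the top-k Personalized PageRank list of seed_node.

    graph: dict mapping each node to a list of out-neighbours (assumed format).
    Returns a list of (node, estimated_score) pairs, highest score first.
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    visits = Counter()
    for _ in range(num_walks):
        node = seed_node
        while True:
            visits[node] += 1
            out_neighbours = graph.get(node, [])
            # The walk ends at a dangling node or, with probability
            # 1 - damping, at any node (the "restart" event).
            if not out_neighbours or rng.random() > damping:
                break
            node = rng.choice(out_neighbours)
    total = sum(visits.values())
    return [(node, count / total) for node, count in visits.most_common(k)]


# Toy graph: node "d" links to "a" but cannot be reached from "a".
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"], "d": ["a"]}
top2 = monte_carlo_topk_ppr(toy_graph, "a", k=2)
```

Because only membership in the top-k list matters, a modest number of walks is often enough in practice; the exact scores returned here are noisy estimates.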

    An Efficient Clustering System for the Measure of Page (Document) Authoritativeness

    A collection of documents D1 in a search result R1 is a cluster if, for a given query Q1, all the documents in D1 are similar in some way and dissimilar to those of another collection, say D2. It follows that, given a new query Q2, the search result R2 may be an intersection or a union of documents from D1, D2 or further collections, forming D3. Within each of these collections (D1, D2, D3, etc.), however, one or two pages will be more relevant than the rest to the query that retrieved them. Such a page is regarded as more ‘authoritative’ than the others. In a query context, therefore, a given search result has pages of authority. The most important measure of a search engine’s efficiency is the quality of its search results. This work seeks to cluster search results so that retrieved documents are easier to match to the user’s need, by attaching a page authority value (pav) to each page. We developed a classifier that sits between supervised and unsupervised learning, is computationally feasible, and produces the most authoritative pages. A novel searching and clustering engine was developed that uses several measure-factors, such as anchor text, proximity, page rank, and features of neighbors, to rate the retrieved pages. Documents or corpora of known measures from the Text Retrieval Conference (TREC), the Initiative for the Evaluation of XML Retrieval (INEX) and the Reuters collection were fed into our system, which was evaluated comparatively against existing search engines (Google, VIVISIMO and Wikipedia). We got very impressive results based on our evaluation. Additionally, our system can attach a value, the pav, to every searched and classified page to indicate its relevance relative to the others. A document is a good match to a query if the document model is likely to generate the query, which will in turn happen if the document contains the query words often. This approach thus provides a different realization of some of the basic ideas for document ranking, which can be applied through a few acceptable rules: number of occurrences, document zone and relevance measures. The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users' horizons, they are often frustrating and consume precious time. We have made available a better page ranker that does not depend heavily on weights imposed by the page developer but considers factors both within and outside the target page. Although the system is still experimental and has been tested only on research collections, the user can easily pick out his or her relevant pages from the first ten search results. Keywords: page authoritativeness, page rank, search results, clustering algorithm, web crawling
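The statement that a document matches a query when its document model is likely to generate the query describes the standard query-likelihood approach. The sketch below is a generic illustration rather than the system described in the abstract: it assumes unigram models, Jelinek-Mercer smoothing with a hypothetical mixing weight `lam`, and pre-tokenized documents. It does not compute the pav or use anchor text, proximity or neighbor features.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection_counts, collection_len, lam=0.5):
    """Log-probability that the document's unigram model generates the query.

    Smoothing mixes the document model with the collection model so that a
    query word missing from the document does not zero out the score.
    """
    doc_counts = Counter(doc)
    doc_len = len(doc)
    score = 0.0
    for term in query:
        p_doc = doc_counts[term] / doc_len if doc_len else 0.0
        p_coll = collection_counts[term] / collection_len
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen in the whole collection
        score += math.log(p)
    return score


def rank_documents(query, docs, lam=0.5):
    """Return document indices ordered from best to worst match."""
    collection_counts = Counter(term for doc in docs for term in doc)
    collection_len = sum(len(doc) for doc in docs)
    scores = [
        query_log_likelihood(query, doc, collection_counts, collection_len, lam)
        for doc in docs
    ]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


docs = [
    ["page", "rank", "page", "authority"],
    ["web", "crawling", "cluster"],
    ["page", "cluster", "search"],
]
ranking = rank_documents(["page", "authority"], docs)
```

A document that contains the query words more often receives a higher score, which is the behaviour the abstract describes.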

    Personalized PageRank with Node-dependent Restart

    Personalized PageRank is an algorithm to classify the importance of web pages on a user-dependent basis. We introduce two generalizations of Personalized PageRank with node-dependent restart. The first generalization is based on the proportion of visits to nodes before the restart, whereas the second is based on the probability of the node visited just before the restart. In the original case of constant restart probability, the two measures coincide. We discuss interesting particular cases of restart probabilities and restart distributions. We show that both generalizations of Personalized PageRank have an elegant expression connecting the so-called direct and reverse Personalized PageRanks, which yields a symmetry property of these Personalized PageRanks.
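The two measures can be illustrated numerically. The sketch below is one reading of the abstract, not the paper's formulation: it assumes a random walk that, at node i, restarts with probability `restart_prob[i]` to a node drawn from `restart_dist` and otherwise follows a uniformly chosen out-link. The first measure is taken as the long-run fraction of visits to each node, and the second as the probability that a restart is triggered from each node. All names are hypothetical, and every node is assumed to have at least one out-link.

```python
def restart_pageranks(out_links, restart_prob, restart_dist, iterations=500):
    """Return (visit_fractions, restart_locations) for a walk with node-dependent restart.

    out_links[i]: list of nodes reachable from node i (must be non-empty).
    restart_prob[i]: probability of restarting when the walk is at node i.
    restart_dist[j]: probability that a restart lands on node j.
    """
    n = len(out_links)
    pi = list(restart_dist)
    for _ in range(iterations):  # power iteration towards the stationary law
        nxt = [0.0] * n
        restart_mass = 0.0
        for i in range(n):
            restart_mass += pi[i] * restart_prob[i]
            share = pi[i] * (1.0 - restart_prob[i]) / len(out_links[i])
            for j in out_links[i]:
                nxt[j] += share
        pi = [nxt[j] + restart_mass * restart_dist[j] for j in range(n)]
    # A restart happens at node i with weight proportional to pi[i] * restart_prob[i].
    weights = [pi[i] * restart_prob[i] for i in range(n)]
    total = sum(weights)
    return pi, [w / total for w in weights]


links = [[1, 2], [2], [0]]
uniform = [1 / 3, 1 / 3, 1 / 3]
visits_const, restarts_const = restart_pageranks(links, [0.2, 0.2, 0.2], uniform)
visits_var, restarts_var = restart_pageranks(links, [0.1, 0.5, 0.9], uniform)
```

With a constant restart probability the two returned vectors agree, consistent with the abstract; with node-dependent values they generally differ.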

    The contribution of data mining to information science

    The information explosion is a serious challenge for current information institutions. Data mining, the search for valuable information in large volumes of data, is one way to meet this challenge. In the past several years, data mining has made a significant contribution to the field of information science. This paper examines the impact of data mining by reviewing existing applications, including personalized environments, electronic commerce, and search engines. For each of these three types of application, we discuss how data mining can enhance its functions. The reader is expected to gain an overview of the state-of-the-art research associated with these applications. Furthermore, we identify the limitations of current work and raise several directions for future research.

    WISER: A Semantic Approach for Expert Finding in Academia based on Entity Linking

    We present WISER, a new semantic search engine for expert finding in academia. Our system is unsupervised and combines classical language modeling techniques, based on textual evidence, with the Wikipedia Knowledge Graph, via entity linking. WISER indexes each academic author through a novel profiling technique which models her expertise with a small, labeled and weighted graph drawn from Wikipedia. Nodes in this graph are the Wikipedia entities mentioned in the author's publications, whereas the weighted edges express the semantic relatedness among these entities, computed via textual and graph-based relatedness functions. Every node is also labeled with a relevance score, which models the pertinence of the corresponding entity to the author's expertise and is computed by means of a proper random-walk calculation over that graph, and with a latent vector representation, which is learned via entity and other kinds of structural embeddings derived from Wikipedia. At query time, experts are retrieved by combining classic document-centric approaches, which exploit the occurrences of query terms in the author's documents, with a novel set of profile-centric scoring strategies, which compute the semantic relatedness between the author's expertise and the query topic via the above graph-based profiles. The effectiveness of our system is established through a large-scale experimental test on a standard dataset for this task. We show that WISER achieves better performance than all the other competitors, thus proving the effectiveness of modelling an author's profile via our "semantic" graph of entities. Finally, we comment on the use of WISER for indexing and profiling the whole research community within the University of Pisa, and on its application to technology transfer in our University.

    Context based multimedia information retrieval


    USING SOCIAL ANNOTATIONS TO IMPROVE WEB SEARCH

    Web-based tagging systems, which include social bookmarking systems such as Delicious, have become increasingly popular. These systems allow participants to annotate or tag web resources. This research examined the use of social annotations to improve the quality of web searches. The research involved three components. First, social annotations were used to index resources. Two annotation-based indexing methods were proposed: annotation-based indexing and full-text-with-annotation indexing. Second, social annotations were used to improve search result ranking. Six annotation-based ranking methods were proposed: Popularity Count, Propagate Popularity Count, Query Weighted Popularity Count, Query Weighted Propagate Popularity Count, Match Tag Count and Normalized Match Tag Count. Third, social annotations were used to both index and rank resources. The results of the first experiment suggested that both static features and similarity features should be considered when using social annotations to re-rank search results. The results of the second experiment showed that using annotations alone as an index of resources may not be a good idea. Since social annotations can be viewed as a high-level description of the content, combining them with the content of a resource can add further important concepts to that resource. Finally, the results of the third experiment confirmed that combining annotation-based ranking of search results with annotation-based augmentation of the resource index produced a promising ranking of search results. This showed that social annotations can benefit web search.

    The structure of broad topics on the web
