7 research outputs found

    Utility analysis for topically biased PageRank

    An Efficient Clustering System for the Measure of Page (Document) Authoritativeness

    A collection of documents D1 from a search result R1 is a cluster if all the documents in D1 are similar in some way and dissimilar to another collection, say D2, for a given query Q1. This implies that, given a new query Q2, the search result R2 may be an intersection or a union of documents from D1, D2 or more, forming D3. Within these collections (D1, D2, D3, etc.), however, one or two pages will be more relevant than the rest to the query that retrieves them. Such a page is regarded as more ‘authoritative’ than the others. In a query context, therefore, a given search result contains pages of authority. The most important measure of a search engine’s efficiency is the quality of its search results. This work seeks to cluster search results so that retrieved documents are easier to match with the user’s need, by attaching a page authority value (pav) to each page. We developed a classifier that sits between supervised and unsupervised learning, is computationally feasible, and returns the most authoritative pages. A novel searching and clustering engine was developed that uses several measure-factors, such as anchor text, proximity, page rank, and features of neighbors, to rate the pages retrieved. Documents or corpora of known measures from the Text Retrieval Conference (TREC), the Initiative for the Evaluation of XML Retrieval (INEX) and the Reuters Collection were fed into our system and evaluated comparatively with existing search engines (Google, VIVISIMO and Wikipedia). Our evaluation gave very impressive results. Additionally, our system can add a value, the pav, to every searched and classified page to indicate its relevance relative to the others. A document is a good match to a query if the document model is likely to generate the query, which in turn happens if the document contains the query words often.
This approach thus provides a different realization of some of the basic ideas for document ranking, which can be applied through a few acceptable rules: number of occurrences, document zone and relevance measures. The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users' horizons, they are often frustrating and consume precious time. We have made available a better page ranker that does not depend heavily on weights imposed by the page developer but considers the actual factors within and outside the target page. Although the system is still experimental and has been tested on research collections, the user can extract his or her relevant pages with ease from the first ten search results listed. Keywords: page authoritativeness, page rank, search results, clustering algorithm, web crawling
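
The statement that a document matches a query when its document model is likely to generate the query can be illustrated with a small query-likelihood sketch. This is not the authors' system: the function name `query_log_likelihood`, the choice of Jelinek-Mercer smoothing, the value `lam=0.5`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, collection, lam=0.5):
    """Log P(query | doc) under a unigram document model.

    Assumption: Jelinek-Mercer smoothing mixes the document model with a
    collection model so that a missing query term does not zero the score.
    """
    doc_tf = Counter(doc)
    coll_tf = Counter(collection)
    doc_len = len(doc)
    coll_len = len(collection)
    score = 0.0
    for term in query:
        p_doc = doc_tf[term] / doc_len if doc_len else 0.0
        p_coll = coll_tf[term] / coll_len if coll_len else 0.0
        p = lam * p_doc + (1.0 - lam) * p_coll
        if p == 0.0:
            # The term occurs nowhere in the collection.
            return float("-inf")
        score += math.log(p)
    return score


# Hypothetical toy collection, tokenized by whitespace.
docs = {
    "d1": "pagerank ranks web pages by link structure".split(),
    "d2": "clustering groups similar documents together".split(),
}
collection = [term for doc in docs.values() for term in doc]
query = "pagerank link".split()

# Rank documents by how likely each document model is to generate the query.
ranked = sorted(
    docs,
    key=lambda name: query_log_likelihood(query, docs[name], collection),
    reverse=True,
)
```

A document that contains the query words often receives a higher `p_doc` for each term and therefore a higher score, which is the intuition the abstract describes.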

    Web Page Classification and Hierarchy Adaptation

    Semantic enrichment of knowledge sources supported by domain ontologies

    This thesis introduces a novel conceptual framework to support the creation of knowledge representations based on enriched Semantic Vectors, using the classical vector space model approach extended with ontological support. One of the primary research challenges addressed here relates to the process of formalization and representation of document contents, where most existing approaches are limited and only take into account the explicit, word-based information in the document. This research explores how traditional knowledge representations can be enriched through incorporation of implicit information derived from the complex relationships (semantic associations) modelled by domain ontologies, in addition to the information presented in documents. The relevant achievements pursued by this thesis are the following: (i) conceptualization of a model that enables the semantic enrichment of knowledge sources supported by domain experts; (ii) development of a method for extending the traditional vector space, using domain ontologies; (iii) development of a method to support ontology learning, based on the discovery of new ontological relations expressed in non-structured information sources; (iv) development of a process to evaluate the semantic enrichment; (v) implementation of a proof-of-concept, named SENSE (Semantic Enrichment kNowledge SourcEs), which makes it possible to validate the ideas established under the scope of this thesis; (vi) publication of several scientific articles and support for 4 master dissertations carried out by the department of Electrical and Computer Engineering from FCT/UNL.
It is worth mentioning that the work developed under the semantic referential covered by this thesis has reused relevant achievements from European research projects, in order to address approaches which are considered scientifically sound and coherent and to avoid “reinventing the wheel”. European research projects: CoSpaces (IST-5-034245), CRESCENDO (FP7-234344) and MobiS (FP7-318452).

    Exploiting links and text structure on the Web : a quantitative approach to improving search quality

    [no abstract]

    Utility analysis for topically biased PageRank

    PageRank is known to be an efficient metric for computing general document importance on the Web. While it is commonly used as a one-size-fits-all measure, the ability to produce topically biased ranks has not yet been explored in detail. In particular, it was still unclear at what granularity of “topic” the computation of biased page ranks makes sense. In this paper we present the results of a thorough quantitative and qualitative analysis of biasing PageRank on Open Directory categories. We show that the MAP quality of Biased PageRank generally increases with the ODP level up to a certain point, thus supporting the use of more specialized categories to bias PageRank on, in order to improve topic-specific search.
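
As a rough illustration of what biasing PageRank on a topic means, the sketch below runs a power iteration in which the random jump lands only on pages from a chosen topic set instead of on all pages uniformly. It is a minimal sketch under stated assumptions, not the procedure evaluated in the paper: the function name `biased_pagerank`, the damping factor 0.85, the iteration count, and the four-page toy graph are hypothetical, and `topic_pages` merely stands in for the pages listed under one Open Directory category.

```python
def biased_pagerank(links, topic_pages, damping=0.85, iterations=50):
    """Power iteration for PageRank with a topic-restricted teleport vector.

    links: dict mapping a page to the list of pages it links to.
    topic_pages: set of pages that receive all of the teleport probability
    (assumption: uniform weight over the topic set).
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    topic = [n for n in nodes if n in topic_pages]
    if not topic:
        raise ValueError("topic_pages must contain at least one page in the graph")

    teleport = {n: (1.0 / len(topic) if n in topic_pages else 0.0) for n in nodes}
    rank = dict(teleport)

    for _ in range(iterations):
        # Random jump: probability mass goes only to topic pages.
        new_rank = {n: (1.0 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = links.get(n, [])
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: redistribute its mass via the teleport vector.
                for t in nodes:
                    new_rank[t] += damping * rank[n] * teleport[t]
        rank = new_rank
    return rank


# Hypothetical toy link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

on_a = biased_pagerank(links, topic_pages={"a"})
on_d = biased_pagerank(links, topic_pages={"d"})
```

Changing the topic set changes the ranking: page `d` has no inbound links, so it keeps a nonzero score only when the jump is biased toward it. A finer-grained category corresponds to a smaller, more specific `topic_pages` set.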