2,791 research outputs found

    Explorations in tag suggestion and query expansion

    The query used in a search system is only an approximation to the user’s true information need, and as a result, many factors can reduce the quality of search results. One is query ambiguity, which causes searchers with different needs to issue the same query. For example, for the query java, some users may want to find a Java tutorial while others may want to download Java software. Other factors include vocabulary mismatch and a lack of knowledge regarding the contents of the document collection. In any case, many users benefit from assistance in forming a good query. As a result, some commercial services provide query suggestions for many queries. In this paper, we propose a Tag Suggestion System that takes advantage of tags associated with query results to expand a searcher’s query. Since not every web page is associated with existing tags, we first build an auto-tagging system which can assign multiple tags to web pages, including news, blogs, etc. The current system contains the 140 most popular tags in del.icio.us, with high precision performance. A small user study is performed to evaluate anecdotally the performance of our Tag Suggestion System, showing better quality than the query suggestion mechanisms provided by Yahoo! and Google. The result pages of expanded queries generated by the Tag Suggestion System are also significantly better than those of the original Google system.

    Multimedia question answering

    Ph.D. (Doctor of Philosophy)

    Concept-based Interactive Query Expansion Support Tool (CIQUEST)

    This report describes a three-year project (2000-03) undertaken in the Information Studies Department at The University of Sheffield and funded by Resource, The Council for Museums, Archives and Libraries. The overall aim of the research was to provide user support for query formulation and reformulation in searching large-scale textual resources including those of the World Wide Web. More specifically the objectives were: to investigate and evaluate methods for the automatic generation and organisation of concepts derived from retrieved document sets, based on statistical methods for term weighting; and to conduct user-based evaluations on the understanding, presentation and retrieval effectiveness of concept structures in selecting candidate terms for interactive query expansion. The TREC test collection formed the basis for the seven evaluative experiments conducted in the course of the project. These formed four distinct phases in the project plan. In the first phase, a series of experiments was conducted to investigate further techniques for concept derivation and hierarchical organisation and structure. The second phase was concerned with user-based validation of the concept structures. Results of phases 1 and 2 informed the design of the test system, and the user interface was developed in phase 3. The final phase entailed a user-based summative evaluation of the CiQuest system. The main findings demonstrate that concept hierarchies can effectively be generated from sets of retrieved documents and displayed to searchers in a meaningful way. The approach provides the searcher with an overview of the contents of the retrieved documents, which in turn facilitates the viewing of documents and selection of the most relevant ones. Concept hierarchies are a good source of terms for query expansion and can improve precision. The extraction of descriptive phrases as an alternative source of terms was also effective. With respect to presentation, cascading menus were easy to browse for selecting terms and for viewing documents. In conclusion, the project dissemination programme and future work are outlined.

    Nanotechnology Publications and Patents: A Review of Social Science Studies and Search Strategies

    This paper provides a comprehensive review of more than 120 social science studies in nanoscience and technology, all of which analyze publication and patent data. We conduct a comparative analysis of bibliometric search strategies that these studies use to harvest publication and patent data related to nanoscience and technology. We implement these strategies on 2006 publication data and find that Mogoutov and Kahane (2007), with their evolutionary lexical query search strategy, extract the highest number of records from the Web of Science. The strategies of Glanzel et al. (2003), Noyons et al. (2003), Porter et al. (2008) and Mogoutov and Kahane (2007) produce very similar ranking tables of the top ten nanotechnology subject areas and the top ten most prolific countries and institutions.
    Keywords: nanotechnology, research and development, productivity, publications, patents, bibliometric analysis, search strategy

    Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval

    Where previous reviews on content-based image retrieval emphasize what can be seen in an image to bridge the semantic gap, this survey considers what people tag about an image. A comprehensive treatise of three closely linked problems, i.e., image tag assignment, refinement, and tag-based image retrieval, is presented. While existing works vary in terms of their targeted tasks and methodology, they rely on the key functionality of tag relevance, i.e., estimating the relevance of a specific tag with respect to the visual content of a given image and its social context. By analyzing what information a specific method exploits to construct its tag relevance function and how such information is exploited, this paper introduces a taxonomy to structure the growing literature, understand the ingredients of the main works, clarify their connections and differences, and recognize their merits and limitations. For a head-to-head comparison between the state-of-the-art methods, a new experimental protocol is presented, with training sets containing 10k, 100k and 1m images and an evaluation on three test sets, contributed by various research groups. Eleven representative works are implemented and evaluated. Putting all this together, the survey aims to provide an overview of the past and foster progress for the near future.
    Comment: to appear in ACM Computing Surveys

    From Frequency to Meaning: Vector Space Models of Semantics

    Computers understand very little of the meaning of human language. This profoundly limits our ability to give instructions to computers, the ability of computers to explain their actions to us, and the ability of computers to analyse and process text. Vector space models (VSMs) of semantics are beginning to address these limits. This paper surveys the use of VSMs for semantic processing of text. We organize the literature on VSMs according to the structure of the matrix in a VSM. There are currently three broad classes of VSMs, based on term-document, word-context, and pair-pattern matrices, yielding three classes of applications. We survey a broad range of applications in these three categories and we take a detailed look at a specific open source project in each category. Our goal in this survey is to show the breadth of applications of VSMs for semantics, to provide a new perspective on VSMs for those who are already familiar with the area, and to provide pointers into the literature for those who are less familiar with the field.
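    The term-document class of VSM mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the surveyed paper: the helper names (`term_document_matrix`, `cosine`, `column`) and the three toy documents are assumptions made for illustration, and raw term counts stand in for whatever weighting a real system would use.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Return (vocab, matrix), where matrix[i][j] counts term i in document j."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def column(matrix, j):
    """Extract the vector for document j (one column of the matrix)."""
    return [row[j] for row in matrix]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy corpus: two documents on one topic, one on another.
docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "stocks fell on monday",
]
vocab, matrix = term_document_matrix(docs)
sim_01 = cosine(column(matrix, 0), column(matrix, 1))
sim_02 = cosine(column(matrix, 0), column(matrix, 2))
print(sim_01, sim_02)
```

    In this sketch the two documents that share most of their terms score higher than the unrelated pair, which is the basic intuition behind comparing documents as columns of a term-document matrix.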

    Semantic and pragmatic characterization of learning objects

    Doctoral thesis. Informatics Engineering. Universidade do Porto. Faculdade de Engenharia. 201

    Doctor of Philosophy

    Dissertation. Visualization has emerged as an effective means to quickly obtain insight from raw data. While simple computer programs can generate simple visualizations, and while there has been constant progress in sophisticated algorithms and techniques for generating insightful pictorial descriptions of complex data, the process of building visualizations remains a major bottleneck in data exploration. In this thesis, we present the main design and implementation aspects of VisTrails, a system designed around the idea of transparently capturing the exploration process that leads to a particular visualization. In particular, VisTrails explores the idea of provenance management in visualization systems: keeping extensive metadata about how the visualizations were created and how they relate to one another. This thesis presents the provenance data model in VisTrails, which can be easily adopted by existing visualization systems and libraries. This lightweight model entirely captures the exploration process of the user, and it can be seen as an electronic analogue of the scientific notebook. The provenance metadata collected during the creation of pipelines can be reused to suggest similar content in related visualizations and guide semi-automated changes. This thesis presents the idea of building visualizations by analogy in a system that allows users to change many visualizations at once, without requiring them to interact with the visualization specifications. It then proposes techniques to help users construct pipelines by consensus, automatically suggesting completions based on a database of previously created pipelines. By presenting these predictions in a carefully designed interface, users can create visualizations and other data products more efficiently because they can augment their normal work patterns with the suggested completions. VisTrails leverages the workflow specifications to identify and avoid redundant operations. 
This optimization is especially useful while exploring multiple visualizations. When variations of the same pipeline need to be executed, substantial speedups can be obtained by caching the results of overlapping subsequences of the pipelines. We present the design decisions behind the execution engine, and how it easily supports the execution of arbitrary third-party modules. These specifications also facilitate the reproduction of previous results. We will present a description of an infrastructure that makes the workflows a complete description of the computational processes, including information necessary to identify and install necessary system libraries. In an environment where effective visualization and data analysis tasks combine many different software packages, this infrastructure can mean the difference between being able to replicate published results and getting lost in a sea of software dependencies and missing libraries. The thesis concludes with a discussion of the system architecture, design decisions and lessons learned in VisTrails. This discussion is meant to clarify the issues present in creating a system based around a provenance tracking engine, and should help implementors decide how to best incorporate these notions into their own systems.
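    The caching idea in this abstract (reusing results when variations of a pipeline overlap) can be sketched as follows. This is not VisTrails code: the `PrefixCache` class, the step names, and the simplification to shared prefixes of linear pipelines are all assumptions for illustration, and the sketch assumes that a step name uniquely identifies an operation together with its parameters.

```python
class PrefixCache:
    """Hypothetical cache keyed by (input, step names so far) for linear pipelines."""

    def __init__(self):
        self._results = {}
        self.executed = 0  # number of steps actually run (cache misses)

    def run(self, steps, source):
        """Run a list of (name, function) steps on a hashable source value."""
        key = (source,)
        value = source
        for name, fn in steps:
            key = key + (name,)
            if key in self._results:
                # This prefix was computed by an earlier pipeline: reuse it.
                value = self._results[key]
            else:
                value = fn(value)
                self._results[key] = value
                self.executed += 1
        return value


# Hypothetical steps; two pipeline variants share their first two steps.
double = ("double", lambda x: x * 2)
add_one = ("add_one", lambda x: x + 1)
square = ("square", lambda x: x * x)
negate = ("negate", lambda x: -x)

cache = PrefixCache()
first = cache.run([double, add_one, square], 3)
second = cache.run([double, add_one, negate], 3)
print(first, second, cache.executed)
```

    Running the second variant executes only its final step, because the shared prefix is already cached. A real workflow engine would need a more robust signature than a step name, for example one that covers module parameters and upstream connections.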