
    Human-competitive automatic topic indexing

    Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications, and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor-intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data, and applies across different domains and languages.
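The two-stage design the abstract describes (select candidate topics, then rank them by their properties) can be sketched as follows. The hand-set scoring here is only an illustrative stand-in: the thesis combines many statistical, semantic and encyclopedic features with a learned model, none of which are reproduced here.

```python
import re
from collections import Counter

def candidate_phrases(text, max_len=3):
    """Stage 1: collect word n-grams (up to max_len words) as candidate topics."""
    words = re.findall(r"[a-z]+", text.lower())
    return [" ".join(words[i:i + n])
            for n in range(1, max_len + 1)
            for i in range(len(words) - n + 1)]

def rank_topics(text, top_k=3):
    """Stage 2: rank candidates by simple statistical properties
    (frequency, position of first occurrence, phrase length)."""
    counts = Counter(candidate_phrases(text))
    lowered = text.lower()
    def score(phrase):
        pos = lowered.find(phrase)
        first_pos = pos / len(lowered) if pos >= 0 else 1.0
        length_bonus = len(phrase.split())   # favour multi-word topics
        return counts[phrase] * length_bonus * (1.0 - first_pos)
    return sorted(counts, key=score, reverse=True)[:top_k]
```

In a trained system the `score` function would be replaced by a classifier or regressor fitted on example documents with human-assigned topics, which is what makes the approach portable across domains and languages.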

    Focused image search in the social Web

    Recently, social multimedia-sharing websites, which allow users to upload, annotate, and share online photo or video collections, have become increasingly popular. The user tags or annotations constitute the new multimedia metadata. We present an image search system that exploits both the textual and the visual information of images. First, we use focused crawling and DOM-tree-based web data extraction methods to extract image textual features from social networking image collections. Second, we propose the concept of visual words to handle the image's visual content for fast indexing and searching. We also develop several user-friendly search options that allow users to query the index using words and image feature descriptions (visual words). The developed image search system tries to bridge the gap between scalable industrial image search engines, which are based on keyword search, and the slower content-based image retrieval systems developed mostly in the academic field and designed to search on image content only. We have implemented a working prototype by crawling and indexing over 16,056 images from flickr.com, one of the most popular image-sharing websites. Our experimental results on the working prototype confirm the efficiency and effectiveness of the methods we propose.
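The visual-words idea quantizes each image descriptor to its nearest cluster centre, so images can be indexed and searched like text documents via an inverted index. A minimal sketch, assuming a precomputed vocabulary of centroids (a real system would learn these, e.g. with k-means over many descriptors, and use far higher-dimensional features):

```python
def nearest_word(vec, vocabulary):
    """Quantize a feature vector to its nearest 'visual word' (centroid index)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)), key=lambda i: dist2(vec, vocabulary[i]))

def build_index(images, vocabulary):
    """Inverted index: visual word id -> set of image ids containing it."""
    index = {}
    for image_id, descriptors in images.items():
        for d in descriptors:
            index.setdefault(nearest_word(d, vocabulary), set()).add(image_id)
    return index

def search(query_descriptors, index, vocabulary):
    """Rank image ids by how many query visual words they share."""
    votes = {}
    for d in query_descriptors:
        for image_id in index.get(nearest_word(d, vocabulary), ()):
            votes[image_id] = votes.get(image_id, 0) + 1
    return sorted(votes, key=votes.get, reverse=True)
```

Because lookups touch only the posting lists of the query's visual words, search cost scales with the vocabulary hits rather than with the whole collection, which is what makes the approach faster than pairwise content-based matching.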

    Information extraction from the web using a search engine


    Identifying Synonymous Terms in Preparation for Technology Mining

    In this research, the development of a 'concept-clumping algorithm' designed to improve the clustering of technical concepts is demonstrated. The algorithm first identifies a list of technically relevant noun phrases from a cleaned extracted list and then applies a rule-based algorithm that identifies synonymous terms based on shared words in each term. An assessment found that the algorithm has an 89–91% precision rate, successfully moves technically important terms higher in the term frequency list, and improves the technical specificity of term clusters.
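The shared-word rule can be sketched as a union-find clumping over extracted noun phrases. This is a deliberately crude version: merging on any shared word over-merges when terms share only generic words, and the paper's rule-based algorithm is presumably more selective than this.

```python
def clump_terms(terms):
    """Group noun phrases that share at least one word, using union-find.
    A simplified stand-in for the paper's rule-based synonym matching."""
    parent = list(range(len(terms)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    word_sets = [set(t.lower().split()) for t in terms]
    for i in range(len(terms)):
        for j in range(i + 1, len(terms)):
            if word_sets[i] & word_sets[j]:   # shared-word rule
                union(i, j)
    groups = {}
    for i, t in enumerate(terms):
        groups.setdefault(find(i), []).append(t)
    return list(groups.values())
```

Summing the frequencies within each clump is what moves a concept's combined count, and hence its rank, higher in the term frequency list.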

    Source Code Retrieval using Case Based Reasoning

    Formal verification of source code has been extensively used in the past few years in order to create dependable software systems. However, although formal languages like Spec# or JML are getting more and more popular, the set of verified implementations is very small and only growing slowly. Our work aims to automate some of the steps involved in writing specifications and their implementations by reusing existing verified programs. That is, for a given implementation we seek to retrieve similar verified code and then reapply the specification that accompanies that code. In this thesis, I present the retrieval system that is part of the Arís (Analogical Reasoning for reuse of Implementation & Specification) project. The overall methodology of the Arís project is very similar to Case-Based Reasoning (CBR) and its parent discipline of Analogical Reasoning (AR), centered on the activities of solution retrieval and reuse. CBR's retrieval phase is achieved using semantic and structural characteristics of source code: API calls are used as semantic anchors, and characteristics of conceptual graphs are used to express the structure of implementations. Finally, we transfer the knowledge (i.e. the formal specification) between the input implementation and the retrieved code artefacts to produce a specification for the given implementation. The evaluation results are promising, and our experiments show that the proposed approach has real potential for generating formal specifications by reusing past solutions.
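The semantic-anchor half of this retrieval (API calls as features) can be sketched as a set-similarity ranking; the structural matching over conceptual graphs, which Arís also uses, is omitted here, and the Jaccard measure is an assumed stand-in rather than the thesis's actual similarity function.

```python
def jaccard(a, b):
    """Similarity of two sets of API calls (the 'semantic anchors')."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def retrieve(query_calls, corpus):
    """Rank verified programs by API-call overlap with the query program.
    corpus: mapping of program name -> set of API calls it makes."""
    return sorted(corpus,
                  key=lambda name: jaccard(query_calls, corpus[name]),
                  reverse=True)
```

The top-ranked verified program is then the candidate whose accompanying formal specification gets adapted and transferred to the unspecified input implementation.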
