83,919 research outputs found

    LSHTC: A Benchmark for Large-Scale Text Classification

    Full text link
    LSHTC is a series of challenges which aims to assess the performance of classification systems in large-scale classification in a a large number of classes (up to hundreds of thousands). This paper describes the dataset that have been released along the LSHTC series. The paper details the construction of the datsets and the design of the tracks as well as the evaluation measures that we implemented and a quick overview of the results. All of these datasets are available online and runs may still be submitted on the online server of the challenges

    Developing information architecture through records management classification techniques

    Get PDF
    Purpose – This work aims to draw attention to information retrieval philosophies and techniques allied to the records management profession, advocating a wider professional consideration of a functional approach to information management, in this instance in the development of information architecture. Design/methodology/approach – The paper draws from a hypothesis originally presented by the author that advocated a viewpoint whereby the application of records management techniques, traditionally applied to develop business classification schemes, was offered as an additional solution to organising information resources and services (within a university intranet), where earlier approaches, notably subject- and administrative-based arrangements, were found to be lacking. The hypothesis was tested via work-based action learning and is presented here as an extended case study. The paper also draws on evidence submitted to the Joint Information Systems Committee in support of the Abertay University's application for consideration for the JISC award for innovation in records and information management. Findings – The original hypothesis has been tested in the workplace. Information retrieval techniques, allied to records management (functional classification), were the main influence in the development of pre- and post-coordinate information retrieval systems to support a wider information architecture, where the subject approach was found to be lacking. Their use within the workplace has since been extended. Originality/value – The paper advocates that the development of information retrieval as a discipline should include a wider consideration of functional classification, as this alternative to the subject approach is largely ignored in mainstream IR works

    Naive Bayes vs. Decision Trees vs. Neural Networks in the Classification of Training Web Pages

    Get PDF
    Web classification has been attempted through many different technologies. In this study we concentrate on the comparison of Neural Networks (NN), NaĂŻve Bayes (NB) and Decision Tree (DT) classifiers for the automatic analysis and classification of attribute data from training course web pages. We introduce an enhanced NB classifier and run the same data sample through the DT and NN classifiers to determine the success rate of our classifier in the training courses domain. This research shows that our enhanced NB classifier not only outperforms the traditional NB classifier, but also performs similarly as good, if not better, than some more popular, rival techniques. This paper also shows that, overall, our NB classifier is the best choice for the training courses domain, achieving an impressive F-Measure value of over 97%, despite it being trained with fewer samples than any of the classification systems we have encountered

    Automated annotation of landmark images using community contributed datasets and web resources

    Get PDF
    A novel solution to the challenge of automatic image annotation is described. Given an image with GPS data of its location of capture, our system returns a semantically-rich annotation comprising tags which both identify the landmark in the image, and provide an interesting fact about it, e.g. "A view of the Eiffel Tower, which was built in 1889 for an international exhibition in Paris". This exploits visual and textual web mining in combination with content-based image analysis and natural language processing. In the first stage, an input image is matched to a set of community contributed images (with keyword tags) on the basis of its GPS information and image classification techniques. The depicted landmark is inferred from the keyword tags for the matched set. The system then takes advantage of the information written about landmarks available on the web at large to extract a fact about the landmark in the image. We report component evaluation results from an implementation of our solution on a mobile device. Image localisation and matching oers 93.6% classication accuracy; the selection of appropriate tags for use in annotation performs well (F1M of 0.59), and it subsequently automatically identies a correct toponym for use in captioning and fact extraction in 69.0% of the tested cases; finally the fact extraction returns an interesting caption in 78% of cases

    Construction of a Pragmatic Base Line for Journal Classifications and Maps Based on Aggregated Journal-Journal Citation Relations

    Full text link
    A number of journal classification systems have been developed in bibliometrics since the launch of the Citation Indices by the Institute of Scientific Information (ISI) in the 1960s. These systems are used to normalize citation counts with respect to field-specific citation patterns. The best known system is the so-called "Web-of-Science Subject Categories" (WCs). In other systems papers are classified by algorithmic solutions. Using the Journal Citation Reports 2014 of the Science Citation Index and the Social Science Citation Index (n of journals = 11,149), we examine options for developing a new system based on journal classifications into subject categories using aggregated journal-journal citation data. Combining routines in VOSviewer and Pajek, a tree-like classification is developed. At each level one can generate a map of science for all the journals subsumed under a category. Nine major fields are distinguished at the top level. Further decomposition of the social sciences is pursued for the sake of example with a focus on journals in information science (LIS) and science studies (STS). The new classification system improves on alternative options by avoiding the problem of randomness in each run that has made algorithmic solutions hitherto irreproducible. Limitations of the new system are discussed (e.g. the classification of multi-disciplinary journals). The system's usefulness for field-normalization in bibliometrics should be explored in future studies.Comment: accepted for publication in the Journal of Informetrics, 20 July 201
    • …
    corecore