
    Searching and Stopping: An Analysis of Stopping Rules and Strategies

    Searching naturally involves stopping points, both at a query level (how far down the ranked list should I go?) and at a session level (how many queries should I issue?). Understanding when searchers stop has been of much interest to the community because it is fundamental to how we evaluate search behaviour and performance. Research has shown that searchers find it difficult to formalise stopping criteria, and typically resort to their intuition of what is "good enough". While various heuristics and stopping criteria have been proposed, little work has investigated how well they perform, and whether searchers actually conform to any of these rules. In this paper, we undertake the first large-scale study of stopping rules, investigating how they influence overall session performance, and which rules best match actual stopping behaviour. Our work focuses on stopping at the query level in the context of ad-hoc topic retrieval, where searchers undertake search tasks within a fixed time period. We show that stopping strategies based upon the disgust or frustration point rules - both of which capture a searcher's tolerance to non-relevance - typically (i) result in the best overall performance and (ii) provide the closest approximation to actual searcher behaviour, although a fixed-depth approach also performs remarkably well. Findings from this study have implications for how we build measures and how we conduct simulations of search behaviours.
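
    A minimal sketch of how two such rules can be operationalised, assuming a ranked list is available as binary relevance judgments: one common formulation of the frustration point rule stops after a fixed number of non-relevant results have been seen, while the fixed-depth baseline always examines the top k results. The tolerance, depth, and example data below are illustrative and not taken from the paper.

```python
# Illustrative sketch (not the authors' code): two stopping rules applied to a
# ranked list of binary relevance judgments (1 = relevant, 0 = non-relevant).
def frustration_point_stop(relevance, tolerance=3):
    """Stop once `tolerance` non-relevant results have been seen in total."""
    non_relevant_seen = 0
    for depth, rel in enumerate(relevance, start=1):
        if not rel:
            non_relevant_seen += 1
            if non_relevant_seen >= tolerance:
                return depth              # stopping depth in the ranked list
    return len(relevance)                 # examined everything without stopping early

def fixed_depth_stop(relevance, depth=10):
    """Baseline rule: always examine the top `depth` results."""
    return min(depth, len(relevance))

judged = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]              # invented example list
print(frustration_point_stop(judged, tolerance=3))   # -> 6
print(fixed_depth_stop(judged, depth=5))             # -> 5
```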

    Machine Learning in Automated Text Categorization

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. Comment: Accepted for publication in ACM Computing Surveys.
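
    As a hedged illustration of the inductive approach described above, the sketch below maps the survey's three problems (document representation, classifier construction, classifier evaluation) onto a tiny pipeline. The use of scikit-learn, the tf-idf/naive-Bayes choices, and the toy corpus are assumptions made for demonstration, not taken from the survey.

```python
# Minimal sketch of learning a text classifier from pre-classified documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_docs = [
    "the striker scored a late goal",          # sports
    "the midfielder was sent off",             # sports
    "the central bank raised interest rates",  # finance
    "the stock index fell after the report",   # finance
]
train_labels = ["sports", "sports", "finance", "finance"]

# Document representation (tf-idf weights) + classifier construction.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Classifier evaluation would normally use a held-out, pre-classified test set.
print(model.predict(["the keeper saved a penalty"]))   # likely ['sports']
```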

    One-Class Classification: Taxonomy of Study and Review of Techniques

    One-class classification (OCC) algorithms aim to build classification models when the negative class is either absent, poorly sampled or not well defined. This unique situation constrains the learning of efficient classifiers by defining the class boundary using only knowledge of the positive class. The OCC problem has been considered and applied under many research themes, such as outlier/novelty detection and concept learning. In this paper we present a unified view of the general problem of OCC by presenting a taxonomy of study for OCC problems, based on the availability of training data, the algorithms used, and the application domains. We further delve into each category of the proposed taxonomy and present a comprehensive literature review of OCC algorithms, techniques and methodologies, with a focus on their significance, limitations and applications. We conclude by discussing some open research problems in the field of OCC and present our vision for future research. Comment: 24 pages + 11 pages of references, 8 figures.
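
    As a loose illustration of the OCC setting, the sketch below fits a model on positive examples only and then decides whether new points belong to that class. One-Class SVM is just one of the techniques such reviews cover; scikit-learn, the synthetic data, and the parameter values are assumptions for demonstration.

```python
# Hedged sketch: train only on the positive class, then accept/reject queries.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
positive_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))  # positives only

occ = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05)
occ.fit(positive_train)

queries = np.array([[0.1, -0.2],    # near the positive class
                    [6.0, 6.0]])    # far from anything seen in training
print(occ.predict(queries))          # +1 = accepted as positive, -1 = rejected as outlier
```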

    Probabilistic Memory Model for Visual Images Categorization

    Over the past decades, numerous memory models have been proposed, focusing mainly on how spoken words are studied, whereas models of how visual images are studied are still limited. In this study, we propose a probabilistic memory model (PMM) for visual image categorization which is able to mimic the workings of the human brain during image storage and retrieval. First, in the learning phase, the visual images are represented by feature vectors extracted with a convolutional neural network (CNN), and each feature component is assumed to conform to a Gaussian distribution and may be incompletely copied with a certain probability or randomly produced in accordance with an exponential distribution. Then, in the test phase, the likelihood ratio between the test image and each studied image is calculated based on probabilistic inference theory, and an odds value in favor of an old item over a new one is obtained from all likelihood values. Finally, if the odds value is above a certain threshold, the Bayesian decision rule is applied for image classification. Experimental results on two benchmark image datasets demonstrate that the proposed PMM can perform well on categorization tasks for both studied and non-studied images.
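
    The old/new decision step can be illustrated very loosely as follows, assuming stored traces are CNN feature vectors and simplifying the paper's generative assumptions (copy probability, exponential noise) to plain Gaussian likelihoods; every constant, distribution, and data point below is invented for the sketch.

```python
# Loose numerical sketch of an old-vs-new decision via likelihood ratios;
# this is NOT the paper's exact model, only a simplified stand-in.
import numpy as np
from scipy.stats import norm

def old_vs_new_odds(test_vec, stored_traces, sigma_old=0.5, sigma_new=2.0):
    """Average likelihood ratio of 'test matches this trace' vs 'test is new'."""
    ratios = []
    for trace in stored_traces:
        ll_old = norm.logpdf(test_vec, loc=trace, scale=sigma_old).sum()
        ll_new = norm.logpdf(test_vec, loc=0.0, scale=sigma_new).sum()
        ratios.append(np.exp(ll_old - ll_new))
    return float(np.mean(ratios))

rng = np.random.default_rng(1)
traces = rng.normal(size=(5, 8))                    # stand-in for stored CNN features
probe_old = traces[0] + rng.normal(scale=0.1, size=8)
probe_new = rng.normal(scale=2.0, size=8)

threshold = 1.0
print(old_vs_new_odds(probe_old, traces) > threshold)   # expected: True (judged "old")
print(old_vs_new_odds(probe_new, traces) > threshold)   # expected: False (judged "new")
```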

    Meaning-focused and Quantum-inspired Information Retrieval

    In recent years, quantum-based methods have been integrated promisingly with traditional procedures in information retrieval (IR) and natural language processing (NLP). Inspired by our research on the identification and application of quantum structures in cognition, more specifically our work on the representation of concepts and their combinations, we put forward a 'quantum meaning based' framework for structured query retrieval in text corpora and standardized testing corpora. This scheme for IR rests on two basic notions: (i) 'entities of meaning', e.g., concepts and their combinations, and (ii) traces of such entities of meaning, which is how documents are regarded in this approach. The meaning content of these 'entities of meaning' is reconstructed by solving an 'inverse problem' in the quantum formalism, consisting of reconstructing the full states of the entities of meaning from their collapsed states identified as traces in relevant documents. The advantages with respect to traditional approaches, such as Latent Semantic Analysis (LSA), are discussed by means of concrete examples. Comment: 11 pages.
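
    One possible, heavily simplified reading of the 'inverse problem' sketched above is given below in standard quantum-formalism notation: documents are treated as collapsed traces, and the full state of the entity of meaning is reconstructed as a superposition over those traces. The notation is illustrative only and is not taken from the paper.

```latex
% Illustrative reading, not the authors' notation: |d_i> denotes the trace
% (collapsed state) left by an entity of meaning in document d_i, and the
% full state |\psi> is recovered as a normalised superposition over traces.
\[
  |\psi\rangle \;=\; \sum_{i} c_i\, |d_i\rangle ,
  \qquad \sum_{i} |c_i|^{2} = 1 .
\]
% Under this reading, the match between a new document d and the entity of
% meaning could be scored by the overlap |<d|\psi>|^2.
```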

    Using Information Filtering in Web Data Mining Process

    The web service-oriented Grid is becoming a standard for achieving loosely coupled distributed computing, and Grid services can easily be specified with web-service based interfaces. In this paper we first envisage a realistic Grid market with players such as end-users, brokers and service providers participating co-operatively with the aim of meeting requirements and earning profit. End-users wish to use the functionality of Grid services while paying the minimum possible price, or a price confined within a specified budget; brokers aim to maximise profit whilst establishing an SLA (Service Level Agreement) and satisfying end-user needs, while at the same time resisting the volatility of service execution time and availability. Service providers aim to develop price models, based on end-user or broker demands, that will maximise their profit. We focus on developing stochastic approaches to end-user workflow scheduling that provide QoS guarantees by establishing an SLA. We also develop a novel 2-stage stochastic programming technique that aims at establishing an SLA with end-users regarding the satisfaction of their workflow QoS requirements. Finally, we develop a scheduling (workload allocation) technique based on linear programming that embeds the negotiated workflow QoS into the program and models Grid services as generalised queues. This technique is shown to outperform existing scheduling techniques that do not rely on real-time performance information.
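
    A toy sketch of the workload-allocation idea follows, assuming jobs are split across Grid services so that total price is minimised subject to per-service capacity. The costs, capacities, demand, and the use of scipy.optimize.linprog are illustrative assumptions; the paper's formulation additionally embeds negotiated QoS and models services as generalised queues.

```python
# Toy linear-programming sketch of workload allocation across Grid services;
# numbers are invented and the model is far simpler than the paper's.
from scipy.optimize import linprog

cost_per_job = [1.0, 1.8, 2.5]   # price charged per job by each service provider
capacity     = [40, 60, 100]     # max jobs each service can take in the period
total_jobs   = 120               # workload the broker must place

result = linprog(
    c=cost_per_job,                              # minimise total price
    A_eq=[[1, 1, 1]], b_eq=[total_jobs],         # every job must be allocated
    bounds=[(0, cap) for cap in capacity],       # respect per-service capacity
    method="highs",
)
print(result.x)     # e.g. [40., 60., 20.]: cheapest services are filled first
print(result.fun)   # total cost of the allocation
```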