166 research outputs found

    Data-driven evaluation metrics for heterogeneous search engine result pages

    Evaluation metrics for search typically assume items are homogeneous. However, in the context of web search, this assumption does not hold. Modern search engine result pages (SERPs) are composed of a variety of item types (e.g., news, web, entity, etc.), and their influence on browsing behavior is largely unknown. In this paper, we perform a large-scale empirical analysis of popular web search queries and investigate how different item types influence how people interact on SERPs. We then infer a user browsing model given people’s interactions with SERP items – creating a data-driven metric based on item type. We show that the proposed metric leads to more accurate estimates of: (1) total gain, (2) total time spent, and (3) stopping depth – without requiring extensive parameter tuning or a priori relevance information. These results suggest that item heterogeneity should be accounted for when developing metrics for SERPs. While many open questions remain concerning the applicability and generalizability of data-driven metrics, they do serve as a formal mechanism to link observed user behaviors directly to how performance is measured. From this approach, we can draw new insights regarding the relationship between behavior and performance – and design data-driven metrics based on real user behavior rather than using metrics reliant on some hypothesized model of user browsing behavior.

    cwl_eval : An evaluation tool for information retrieval

    We present a tool (“cwl_eval”) which unifies many metrics typically used to evaluate information retrieval systems using test collections. In the C/W/L framework metrics are specified via a single function which can be used to derive a number of related measurements: Expected Utility per item, Expected Total Utility, Expected Cost per item, Expected Total Cost, and Expected Depth. The C/W/L framework brings together several independent approaches for measuring the quality of a ranked list, and provides a coherent user model-based framework for developing measures based on utility (gain) and cost. Here we outline the C/W/L measurement framework; describe the cwl_eval architecture; and provide examples of how to use it. We provide implementations of a number of recent metrics, including Time Biased Gain, U-Measure, Bejewelled Measure, and the Information Foraging Based Measure, as well as previous metrics such as Precision, Average Precision, Discounted Cumulative Gain, Rank-Biased Precision, and INST. By providing state-of-the-art and traditional metrics within the same framework, we promote a standardised approach to evaluating search effectiveness.

    Measuring the utility of search engine result pages : an information foraging based measure

    Web Search Engine Result Pages (SERPs) are complex responses to queries, containing many heterogeneous result elements (web results, advertisements, and specialised “answers”) positioned in a variety of layouts. This poses numerous challenges when trying to measure the quality of a SERP because standard measures were designed for homogeneous ranked lists. In this paper, we aim to measure the utility and cost of SERPs. To ground this work we adopt the C/W/L framework which enables a direct comparison between different measures in the same units of measurement, i.e. expected (total) utility and cost. Within this framework, we propose a new measure based on information foraging theory, which can account for the heterogeneity of elements, through different costs, and which naturally motivates the development of a user stopping model that adapts behaviour depending on the rate of gain. This directly connects models of how people search with how we measure search, providing a number of new dimensions in which to investigate and evaluate user behaviour and performance. We perform an analysis over 1000 popular queries issued to a major search engine, and report the aggregate utility experienced by users over time. Then, in a comparison against common measures, we show that the proposed foraging based measure provides a more accurate reflection of the utility and of observed behaviours (stopping rank and time spent).

    A Meta-Evaluation of C/W/L/A Metrics: System Ranking Similarity, System Ranking Consistency and Discriminative Power

    Recently, Moffat et al. proposed an analytic framework, namely C/W/L/A, for offline evaluation metrics. This framework allows information retrieval (IR) researchers to design evaluation metrics through the flexible combination of user browsing models and user gain aggregations. However, the statistical stability of C/W/L/A metrics with different aggregations has not yet been investigated. In this study, we investigate the statistical stability of C/W/L/A metrics from the perspective of: (1) the system ranking similarity among aggregations, (2) the system ranking consistency of aggregations and (3) the discriminative power of aggregations. More specifically, we combined various aggregation functions with the browsing models of Precision, Discounted Cumulative Gain (DCG), Rank-Biased Precision (RBP), INST, Average Precision (AP) and Expected Reciprocal Rank (ERR), examining their performance in terms of system ranking similarity, system ranking consistency and discriminative power on two offline test collections. Our experimental results suggest that, in terms of system ranking consistency and discriminative power, the aggregation function of expected rate of gain (ERG) performs outstandingly, while the aggregation function of maximum relevance usually performs poorly. The results also suggest that Precision, DCG, RBP, INST and AP with their canonical aggregations all perform favourably in system ranking consistency and discriminative power; for ERR, however, replacing its canonical aggregation with ERG can further strengthen the discriminative power while still producing a system ranking similar to that of the canonical version.

    Evaluating Interpolation and Extrapolation Performance of Neural Retrieval Models

    A retrieval model should not only interpolate the training data but also extrapolate well to the queries that are different from the training data. While neural retrieval models have demonstrated impressive performance on ad-hoc search benchmarks, we still know little about how they perform in terms of interpolation and extrapolation. In this paper, we demonstrate the importance of separately evaluating the two capabilities of neural retrieval models. Firstly, we examine existing ad-hoc search benchmarks from the two perspectives. We investigate the distribution of training and test data and find a considerable overlap in query entities, query intent, and relevance labels. This finding implies that the evaluation on these test sets is biased toward interpolation and cannot accurately reflect the extrapolation capacity. Secondly, we propose a novel evaluation protocol to separately evaluate the interpolation and extrapolation performance on existing benchmark datasets. It resamples the training and test data based on query similarity and utilizes the resampled dataset for training and evaluation. Finally, we leverage the proposed evaluation protocol to comprehensively revisit a number of widely-adopted neural retrieval models. Results show models perform differently when moving from interpolation to extrapolation. For example, representation-based retrieval models perform almost as well as interaction-based retrieval models in terms of interpolation but not extrapolation. Therefore, it is necessary to separately evaluate both interpolation and extrapolation performance and the proposed resampling method serves as a simple yet effective evaluation tool for future IR studies. Comment: CIKM 2022 Full Paper

    The Game Situation: An object-based game analysis framework

    Identity Creation and World-Building Through Discourse in Video Game Narratives

    An exploration of how characters develop identity through language use in a video game’s narrative, specifically in video games that do not allow for players to make narrative-altering choices. The concept of literacies is used to create a critical framework through which to view video games themselves as a literacy, as well as to view them as a discourse. This thesis analyzes how dialogue is used to develop and showcase gender, sexuality, personality, and moral identity in characters in and out of the player’s control, as well as how dialogue builds the narrative world in which these characters exist. This thesis includes two case studies, on the Borderlands video game series and the Halo video game series, approaching each series from a perspective that showcases how its unique world and characters are created through language. Each series has a specific facet to its narrative that is examined in depth; in the Borderlands series, the importance of storytelling to world-building and moral identity, and in the Halo series, the significance of speech itself as an act. Additionally, Destiny is used as a key supplement to these case studies, as it bridges the gap between the identity- and world-building methods used by Halo and Borderlands. Destiny also incorporates the player’s engagement in an online multiplayer universe, creating a unique type of discourse between the player and the game. Finally, as paratexts affect character identity and world-building within each series, the concept of paratexts as they connect to literacy and narrative is also examined as an important facet of identity- and world-building.

    Examining the Effects of Casual Video Gameplay as an Intervention to Alleviate Symptoms of Depression on both Subjective and Objective Measures

    Depression can be a debilitating illness that affects more than 300 million people worldwide. Although there are successful treatments for depression with pharmaceuticals and behavioral approaches such as psychotherapy, these approaches are often very costly and may carry a stigma of treatment for some individuals. The purpose of this dissertation study was to compare results of previously collected data that examine whether a prescribed regimen of casual videogame play (CVG) could reduce symptoms associated with depression. This dissertation specifically focused on comparing results of a study group and a comparison group on the self-report instrument, the Patient Health Questionnaire-9 (PHQ-9), as well as objectively measured changes in alpha wave electroencephalogram (EEG) data. Participants in the original study were screened for depression using the PHQ-9. There were a total of 57 participants who met the study inclusion criteria. Each participant who met the inclusion criteria was then randomized into either the comparison group (n=29) or the study group (n=28). Study group participants were prescribed to play one of three CVGs three times per week (with 24 hours between each session) for 30 minutes each session, over a 1-month period. Comparison group participants reviewed the National Institute of Mental Health's webpage on depression during a pre-test and a post-test session. The participants in this group did not engage in any intervention over the one-month period of time between the pre-test and post-test sessions. A repeated-measures analysis of covariance (ANCOVA) was completed to examine three research questions between subjects at Time 1 and Time 3 to compare changes in depression symptoms on both subjective, self-report (PHQ-9) and objective alpha wave EEG measures. The CVGs used as the intervention factor were either Peggle, Bejeweled or Bookworm Adventure.
Study analysis revealed significant decreases in depression symptoms reported in the study group on the PHQ-9 self-report scale. Results on the objective EEG alpha wave measure revealed changes that were not statistically significant. Potential reasons for the non-significant findings along with recommendations for future research are also discussed. This study concluded that a prescribed regimen of CVG may have potential as an intervention to help reduce symptoms of depression as measured on the PHQ-9 scale. Further research should consider examining intricacies of CVG play as a potential intervention to address symptoms related to depression. Findings also revealed that while EEG findings were not statistically significant, participants' self-report responses were significant and may underscore the importance of individuals' subjective feelings in the therapeutic process.

    Dissecting developer policy violating apps: Characterization and detection
