
    Generalized Team Draft Interleaving

    Interleaving is an online evaluation method that compares two ranking functions by mixing their results and interpreting the users' click feedback. An important property of an interleaving method is its sensitivity, i.e. the ability to obtain reliable comparison outcomes with few user interactions. Several methods have been proposed so far to improve interleaving sensitivity, which can be roughly divided into two areas: (a) methods that optimize the credit assignment function (how the click feedback is interpreted), and (b) methods that achieve higher sensitivity by controlling the interleaving policy (how often a particular interleaved result page is shown). In this paper, we propose an interleaving framework that generalizes the previously studied interleaving methods in two aspects. First, it achieves a higher sensitivity by performing a joint data-driven optimization of the credit assignment function and the interleaving policy. Second, we formulate the framework to be general w.r.t. the search domain where the interleaving experiment is deployed, so that it can be applied in domains with grid-based presentation, such as image search. In order to simplify the optimization, we additionally introduce a stratified estimate of the experiment outcome. This stratification is also useful on its own, as it reduces the variance of the outcome and thus increases the interleaving sensitivity. We perform an extensive experimental study using large-scale document and image search datasets obtained from a commercial search engine. The experiments show that our proposed framework achieves marked improvements in sensitivity over effective baselines on both datasets.
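For reference, the classic Team Draft Interleaving that this framework generalizes can be sketched as below. It is the baseline only: a fixed coin-flip policy and a fixed "count each team's clicks" credit assignment. The paper instead optimizes the credit function and the policy jointly from data, which this sketch does not attempt. The function names `team_draft_interleave` and `team_draft_credit` are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple


def team_draft_interleave(
    ranking_a: Sequence[str],
    ranking_b: Sequence[str],
    length: int,
    rng: random.Random,
) -> Tuple[List[str], Dict[str, str]]:
    """Classic Team Draft: return the interleaved list and a doc -> team ('A'/'B') map."""
    interleaved: List[str] = []
    team: Dict[str, str] = {}
    picks = {"A": 0, "B": 0}
    rankings = {"A": ranking_a, "B": ranking_b}
    while len(interleaved) < length:
        # Highest-ranked document of each ranker that has not been placed yet
        nxt = {t: next((d for d in r if d not in team), None) for t, r in rankings.items()}
        if nxt["A"] is None and nxt["B"] is None:
            break  # both rankers are exhausted
        # The team with fewer picks drafts next; a tie is settled by a coin flip
        if picks["A"] != picks["B"]:
            turn = "A" if picks["A"] < picks["B"] else "B"
        else:
            turn = "A" if rng.random() < 0.5 else "B"
        if nxt[turn] is None:  # that ranker ran out, so the other one drafts
            turn = "B" if turn == "A" else "A"
        doc = nxt[turn]
        interleaved.append(doc)
        team[doc] = turn
        picks[turn] += 1
    return interleaved, team


def team_draft_credit(team: Dict[str, str], clicked: Sequence[str]) -> int:
    """Per-impression outcome: +1 if A's team got more clicks, -1 if B's did, 0 on a tie."""
    a = sum(1 for d in clicked if team.get(d) == "A")
    b = sum(1 for d in clicked if team.get(d) == "B")
    return (a > b) - (a < b)


# Usage: interleave two rankings and score a single click on the top result
shown, team = team_draft_interleave(["d1", "d2", "d3"], ["d2", "d1", "d4"], 4, random.Random(0))
outcome = team_draft_credit(team, clicked=[shown[0]])
```

Summing these per-impression outcomes over many users gives the experiment's overall preference; sensitivity is how few impressions it takes for that sum to become reliable.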

    Sensitive and Scalable Online Evaluation with Theoretical Guarantees

    Multileaved comparison methods generalize interleaved comparison methods to provide a scalable approach for comparing ranking systems based on regular user interactions. Such methods enable the increasingly rapid research and development of search engines. However, existing multileaved comparison methods that provide reliable outcomes do so by degrading the user experience during evaluation. Conversely, current multileaved comparison methods that maintain the user experience cannot guarantee correctness. Our contribution is two-fold. First, we propose a theoretical framework for systematically comparing multileaved comparison methods using the notions of considerateness, which concerns maintaining the user experience, and fidelity, which concerns reliable, correct outcomes. Second, we introduce a novel multileaved comparison method, Pairwise Preference Multileaving (PPM), that performs comparisons based on document-pair preferences, and prove that it is considerate and has fidelity. We show empirically that, compared to previous multileaved comparison methods, PPM is more sensitive to user preferences and scalable with the number of rankers being compared.
    Comment: CIKM 2017, Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
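The abstract does not spell out PPM's preference-inference rules or its bias correction, so the sketch below substitutes the common "click > skip above" heuristic to illustrate the general idea of scoring several rankers from document-pair preferences on one multileaved page. It is an illustration of pairwise-preference scoring under that assumption, not PPM itself; `infer_preferences` and `score_rankers` are hypothetical names.

```python
from typing import Dict, List, Sequence, Set, Tuple


def infer_preferences(shown: Sequence[str], clicked: Set[str]) -> List[Tuple[str, str]]:
    """'Click > skip above': each clicked doc beats every unclicked doc shown above it."""
    prefs: List[Tuple[str, str]] = []
    for i, doc in enumerate(shown):
        if doc in clicked:
            prefs.extend((doc, above) for above in shown[:i] if above not in clicked)
    return prefs


def score_rankers(
    rankers: Dict[str, Sequence[str]], prefs: List[Tuple[str, str]]
) -> Dict[str, int]:
    """+1 for each inferred (better, worse) pair a ranker orders correctly, -1 for each it inverts."""
    scores: Dict[str, int] = {}
    for name, ranking in rankers.items():
        pos = {d: r for r, d in enumerate(ranking)}
        total = 0
        for better, worse in prefs:
            # Simplification: pairs involving a doc the ranker did not return are skipped
            if better in pos and worse in pos:
                total += 1 if pos[better] < pos[worse] else -1
        scores[name] = total
    return scores


# Usage: one multileaved page, three rankers, a single click on the third result
prefs = infer_preferences(["d1", "d2", "d3"], clicked={"d3"})
scores = score_rankers(
    {"r1": ["d3", "d1", "d2"], "r2": ["d1", "d2", "d3"], "r3": ["d1", "d3", "d2"]},
    prefs,
)
# scores -> {'r1': 2, 'r2': -2, 'r3': 0}
```

Because every ranker is judged on the same inferred pairs, adding a ranker only adds one more pass over `prefs`, which is the kind of scalability with the number of rankers that multileaving aims for.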

    JWalk: a tool for lazy, systematic testing of Java classes by design introspection and user interaction

    Popular software testing tools, such as JUnit, allow frequent retesting of modified code; yet the manually created test scripts are often seriously incomplete. A unit-testing tool called JWalk has therefore been developed to address the need for systematic unit testing within the context of agile methods. The tool operates directly on the compiled code for Java classes and uses a new lazy method for inducing the changing design of a class on the fly. This is achieved partly through introspection, using Java’s reflection capability, and partly through interaction with the user, constructing and saving test oracles on the fly. Predictive rules reduce the number of oracle values that must be confirmed by the tester. Without human intervention, JWalk performs bounded exhaustive exploration of the class’s method protocols and may be directed to explore the space of algebraic constructions, or the intended design state-space of the tested class. With some human interaction, JWalk performs up to the equivalent of fully automated state-based testing, from a specification that was acquired incrementally.
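JWalk itself targets compiled Java classes through Java's reflection API. As a rough analogue of "bounded exhaustive exploration of method protocols", the Python sketch below uses the standard `inspect` module to discover a class's public zero-argument methods, then runs every method sequence up to a fixed depth on a fresh instance and records each outcome as a candidate oracle value for a tester to confirm. It omits JWalk's predictive rules, its algebraic and state-space strategies, and any argument synthesis; `public_methods`, `explore`, and the `Counter` class are hypothetical.

```python
import inspect
import itertools
from typing import Dict, List, Tuple, Type


def public_methods(cls: Type) -> List[str]:
    """Introspect the class for public methods that take no arguments besides self."""
    names = []
    for name, fn in inspect.getmembers(cls, predicate=inspect.isfunction):
        if not name.startswith("_") and list(inspect.signature(fn).parameters) == ["self"]:
            names.append(name)
    return sorted(names)


def explore(cls: Type, depth: int) -> Dict[Tuple[str, ...], object]:
    """Run every method sequence of length 1..depth on a fresh instance; record the last result."""
    methods = public_methods(cls)
    observations: Dict[Tuple[str, ...], object] = {}
    for n in range(1, depth + 1):
        for seq in itertools.product(methods, repeat=n):
            obj = cls()  # each protocol starts from a freshly constructed object
            try:
                result = None
                for name in seq:
                    result = getattr(obj, name)()
                observations[seq] = result
            except Exception as exc:  # an exception is itself an observable outcome
                observations[seq] = f"raised {type(exc).__name__}"
    return observations


class Counter:
    """Hypothetical class under test."""

    def __init__(self):
        self.n = 0

    def inc(self):
        self.n += 1
        return self.n

    def dec(self):
        if self.n == 0:
            raise ValueError("underflow")
        self.n -= 1
        return self.n

    def value(self):
        return self.n


# Usage: 3 methods explored to depth 2 gives 3 + 9 = 12 candidate oracle values
oracles = explore(Counter, depth=2)
```

The number of sequences grows as the sum of m^n over n = 1..depth for m methods, which is why a bound on depth (and, in JWalk, rules that predict outcomes instead of asking the tester) is needed to keep the exploration practical.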