    What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries

    We analyze the question queries submitted to a large commercial web search engine to get insights about what people ask, and to better tailor the search results to the users’ needs. Based on a dataset of about one billion question queries submitted during the year 2012, we investigate askers’ querying behavior with the support of automatic query categorization. While the importance of question queries is likely to increase, at present they only make up 3–4% of the total search traffic. Since questions are such a small part of the query stream and are more likely to be unique than shorter queries, clickthrough information is typically rather sparse. Thus, query categorization methods based on the categories of clicked web documents do not work well for questions. As an alternative, we propose a robust question query classification method that uses the labeled questions from a large community question answering (CQA) platform as a training set. The resulting classifier is then transferred to the web search questions. Even though questions on CQA platforms tend to differ from web search questions, our categorization method proves competitive with strong baselines with respect to classification accuracy. To show the scalability of our proposed method, we apply the classifiers to about one billion question queries and discuss the trade-offs between performance and accuracy that different classification models offer. Our findings reveal what people ask a search engine and also how this contrasts with behavior on a CQA platform.

    A Route Confidence Evaluation Method for Reliable Hierarchical Text Categorization

    Hierarchical Text Categorization (HTC) is becoming increasingly important with the rapidly growing amount of text data available on the World Wide Web. Among the different strategies proposed to cope with HTC, the Local Classifier per Node (LCN) approach attains good performance by mirroring the underlying class hierarchy while enforcing a top-down strategy in the testing step. However, the problem of embedding hierarchical information (parent-child relationships) to improve the performance of HTC systems still remains open. A confidence evaluation method for a selected route in the hierarchy is proposed to evaluate the reliability of the final candidate labels in an HTC system. To exploit the information embedded in the hierarchy, weight factors are used to account for the importance of each level. An acceptance/rejection strategy in the top-down decision-making process is proposed, which improves the overall categorization accuracy by rejecting a small percentage of samples, i.e., those with a low reliability score. Experimental results on the Reuters benchmark dataset (RCV1-v2) confirm the effectiveness of the proposed method compared to other state-of-the-art HTC methods.
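
The route-confidence idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything here is an assumption made for illustration: the confidence of a route is taken to be a weighted average of per-level classifier scores, and the names `route_confidence`, `accept_or_reject`, the weights, and the threshold are hypothetical. The labels in the usage note are only example category codes.

```python
from typing import List, Optional, Sequence, Tuple


def route_confidence(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of the per-level scores along one top-down route.

    ASSUMPTION: the paper's real combination rule is not stated in the
    abstract; a normalized weighted sum is used here purely as an example.
    """
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and equally long")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * s for w, s in zip(weights, scores)) / total


def accept_or_reject(
    route: Sequence[Tuple[str, float]],
    weights: Sequence[float],
    threshold: float,
) -> Tuple[Optional[List[str]], float]:
    """Return (labels, confidence); labels is None when the route is rejected.

    `route` lists (label, classifier score) pairs from the top level down.
    `threshold` is a hypothetical tuning parameter.
    """
    confidence = route_confidence([score for _, score in route], weights)
    labels = [label for label, _ in route]
    return (labels if confidence >= threshold else None), confidence
```

For instance, `accept_or_reject([("CCAT", 0.9), ("C15", 0.6), ("C151", 0.3)], [3, 2, 1], 0.5)` weights the top level most heavily and accepts the route, whereas a stricter threshold such as 0.8 would reject the same sample.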

    Evaluating Multilingual Gisting of Web Pages

    We describe a prototype system for multilingual gisting of Web pages, and present an evaluation methodology based on the notion of gisting as decision support. This evaluation paradigm is straightforward, rigorous, permits fair comparison of alternative approaches, and should easily generalize to evaluation in other situations where the user is faced with decision-making on the basis of information in restricted or alternative form. Comment: 7 pages, uses psfig and aaai style

    Embedding Feature Selection for Large-scale Hierarchical Classification

    Large-scale Hierarchical Classification (HC) involves datasets consisting of thousands of classes and millions of training instances with high-dimensional features, posing several big data challenges. Feature selection, which aims to select the subset of discriminant features, is an effective strategy for dealing with the large-scale HC problem. It speeds up the training process, reduces the prediction time and minimizes the memory requirements by compressing the total size of the learned model weight vectors. The majority of studies have also shown feature selection to be competent and successful in improving the classification accuracy by removing irrelevant features. In this work, we investigate various filter-based feature selection methods for dimensionality reduction to solve the large-scale HC problem. Our experimental evaluation on text and image datasets with varying distributions of features, classes and instances shows up to 3x order of speed-up on massive datasets and up to 45% less memory requirements for storing the weight vectors of the learned model without any significant loss (improvement for some datasets) in the classification accuracy. Source code: https://cs.gmu.edu/~mlbio/featureselection. Comment: IEEE International Conference on Big Data (IEEE BigData 2016)
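
As a rough illustration of what a filter-based feature selection step looks like, the sketch below scores each term independently of any classifier and keeps the top-scoring ones. The abstract does not say which filters were evaluated, so the chi-square statistic used here is an assumption chosen as a common example, and the function names `chi2_scores` and `select_top_k` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Set


def chi2_scores(docs: Sequence[Set[str]], labels: Sequence[str]) -> Dict[str, float]:
    """Score each term by its largest one-vs-rest chi-square value over all classes.

    ASSUMPTION: chi-square is used only as a familiar example of a filter
    criterion; it is not claimed to be the method studied in the paper.
    """
    n = len(docs)
    class_count = Counter(labels)   # documents per class
    term_count = Counter()          # documents containing each term
    joint = Counter()               # documents containing term t with class c
    for terms, label in zip(docs, labels):
        for term in terms:
            term_count[term] += 1
            joint[(term, label)] += 1

    scores: Dict[str, float] = {}
    for term, n_t in term_count.items():
        best = 0.0
        for cls, n_c in class_count.items():
            a = joint[(term, cls)]   # term present, class cls
            b = n_t - a              # term present, other classes
            c = n_c - a              # term absent, class cls
            d = n - n_t - c          # term absent, other classes
            denom = n_t * (n - n_t) * n_c * (n - n_c)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Keep the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]
```

A term that occurs in every document gets a score of zero and is dropped first, which is the sense in which a filter removes irrelevant features before any model is trained.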

    Is social categorization spatially organized in a “Mental Line”? Empirical evidences for spatial bias in intergroup differentiation

    Social categorization is the differentiation between the self and others and between one’s own group and other groups, and it is such a natural and spontaneous process that we are often not aware of it. The way in which the brain organizes social categorization remains an unresolved issue. We present three experiments investigating the hypothesis that social categories are mentally ordered from left to right on an ingroup–outgroup continuum when membership is salient. To substantiate our hypothesis, we consider empirical evidence from two areas of psychology: research on differences in the processing of ingroups and outgroups, and research on the effects of spatial biases on the processing of quantitative information (e.g., time; numbers), which appears to be arranged from left to right on a small–large continuum, an effect known as the spatial-numerical association of response codes (SNARC). In Experiments 1 and 2 we tested the hypothesis that when membership of a social category is activated, people implicitly locate ingroup categories to the left of a mental line whereas outgroup categories are located on the far right of the same mental line. This spatial organization persists even when stimuli are presented on one of the two sides of the screen and their (explicit) position is spatially incompatible with the implicit mental spatial organization of social categories (Experiment 3). Overall, the results indicate that ingroups and outgroups are processed differently. The results are discussed with respect to social categorization theory, spatial agency bias (i.e., the effect observed in Western cultures whereby the agent of an action is mentally represented on the left and the recipient on the right), and the SNARC effect.