What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries
We analyze the question queries submitted to a large commercial web search engine to get insights about what people ask, and to better tailor the search results to the users’ needs. Based on a dataset of about one billion question queries submitted during the year 2012, we investigate askers’ querying behavior with the support of automatic query categorization. While the importance of question queries is likely to increase, at present they only make up 3–4% of the total search traffic. Since questions are such a small part of the query stream and are more likely to be unique than shorter queries, clickthrough information is typically rather sparse. Thus, query categorization methods based on the categories of clicked web documents do not work well for questions. As an alternative, we propose a robust question query classification method that uses the labeled questions from a large community question answering platform (CQA) as a training set. The resulting classifier is then transferred to the web search questions. Even though questions on CQA platforms tend to be different from web search questions, our categorization method proves competitive with strong baselines with respect to classification accuracy. To show the scalability of our proposed method, we apply the classifiers to about one billion question queries and discuss the trade-offs between performance and accuracy that different classification models offer. Our findings reveal what people ask a search engine and also how this contrasts with behavior on a CQA platform.
A Route Confidence Evaluation Method for Reliable Hierarchical Text Categorization
Hierarchical Text Categorization (HTC) is becoming increasingly important
with the rapidly growing amount of text data available in the World Wide Web.
Among the different strategies proposed to cope with HTC, the Local Classifier
per Node (LCN) approach attains good performance by mirroring the underlying
class hierarchy while enforcing a top-down strategy in the testing step.
However, the problem of embedding hierarchical information (parent-child
relationship) to improve the performance of HTC systems still remains open. A
confidence evaluation method for a selected route in the hierarchy is proposed
to evaluate the reliability of the final candidate labels in an HTC system. In
order to exploit the information embedded in the hierarchy, weight
factors are used to account for the importance of each level. An
acceptance/rejection strategy in the top-down decision making process is
proposed, which improves the overall categorization accuracy by rejecting a small
percentage of samples, i.e., those with a low reliability score. Experimental
results on the Reuters benchmark dataset (RCV1-v2) confirm the effectiveness
of the proposed method compared to other state-of-the-art HTC methods.
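The route-confidence idea in the abstract above can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the weighted mean, the function names `route_confidence` and `accept_route`, the weights, and the threshold below are all assumptions made for illustration, not the authors' method.

```python
from typing import Sequence


def route_confidence(level_scores: Sequence[float],
                     level_weights: Sequence[float]) -> float:
    """Weighted mean of per-level classifier confidences along one route.

    Assumption: a weighted mean is used purely for illustration; the
    abstract does not say how the level weights are combined.
    """
    if not level_scores or len(level_scores) != len(level_weights):
        raise ValueError("need one weight per level on a non-empty route")
    total_weight = sum(level_weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(s * w for s, w in zip(level_scores, level_weights))
    return weighted / total_weight


def accept_route(level_scores: Sequence[float],
                 level_weights: Sequence[float],
                 threshold: float) -> bool:
    """Accept the top-down decision only if the route is reliable enough."""
    return route_confidence(level_scores, level_weights) >= threshold


# Hypothetical three-level hierarchy; upper levels are weighted more heavily.
weights = [3.0, 2.0, 1.0]
confident_route = [0.9, 0.8, 0.7]  # per-level confidences, root to leaf
shaky_route = [0.5, 0.4, 0.9]      # weak upper-level decisions

print(accept_route(confident_route, weights, threshold=0.6))
print(accept_route(shaky_route, weights, threshold=0.6))
```

Weighting the upper levels more heavily is one plausible design choice, since an error near the root misroutes every decision below it; other weightings are equally possible.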
Evaluating Multilingual Gisting of Web Pages
We describe a prototype system for multilingual gisting of Web pages, and
present an evaluation methodology based on the notion of gisting as decision
support. This evaluation paradigm is straightforward, rigorous, permits fair
comparison of alternative approaches, and should easily generalize to
evaluation in other situations where the user is faced with decision-making on
the basis of information in restricted or alternative form.
Comment: 7 pages, uses psfig and aaai style
Embedding Feature Selection for Large-scale Hierarchical Classification
Large-scale Hierarchical Classification (HC) involves datasets consisting of
thousands of classes and millions of training instances with high-dimensional
features posing several big data challenges. Feature selection that aims to
select the subset of discriminant features is an effective strategy to deal
with large-scale HC problem. It speeds up the training process, reduces the
prediction time and minimizes the memory requirements by compressing the total
size of learned model weight vectors. The majority of studies have also shown
feature selection to be competent and successful in improving the
classification accuracy by removing irrelevant features. In this work, we
investigate various filter-based feature selection methods for dimensionality
reduction to solve the large-scale HC problem. Our experimental evaluation on
text and image datasets with varying distribution of features, classes and
instances shows up to 3x order of speed-up on massive datasets and up to 45% less
memory requirements for storing the weight vectors of learned model without any
significant loss (improvement for some datasets) in the classification
accuracy. Source Code: https://cs.gmu.edu/~mlbio/featureselection.
Comment: IEEE International Conference on Big Data (IEEE BigData 2016)
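As a rough illustration of filter-based feature selection as described in the abstract above, the sketch below scores each feature independently of any classifier and keeps the top k. The abstract does not name the filter criteria it evaluates, so the chi-square statistic, the flat binary-label setting, and the helper names `chi_square`, `select_top_k`, and `project` are assumptions chosen for brevity.

```python
from typing import List, Sequence


def chi_square(feature: Sequence[int], labels: Sequence[int]) -> float:
    """Chi-square statistic of one binary feature against binary labels."""
    n = len(labels)
    pairs = list(zip(feature, labels))
    a = sum(1 for x, y in pairs if x == 1 and y == 1)  # present, positive
    b = sum(1 for x, y in pairs if x == 1 and y == 0)  # present, negative
    c = sum(1 for x, y in pairs if x == 0 and y == 1)  # absent, positive
    d = sum(1 for x, y in pairs if x == 0 and y == 0)  # absent, negative
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # constant feature or single-class labels carry no signal
    return n * (a * d - b * c) ** 2 / denom


def select_top_k(rows: Sequence[Sequence[int]],
                 labels: Sequence[int], k: int) -> List[int]:
    """Return the column indices of the k highest-scoring features."""
    n_features = len(rows[0])
    scores = [chi_square([row[j] for row in rows], labels)
              for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: (-scores[j], j))
    return sorted(ranked[:k])


def project(rows: Sequence[Sequence[int]],
            keep: Sequence[int]) -> List[List[int]]:
    """Drop all columns except the selected ones."""
    return [[row[j] for j in keep] for row in rows]


# Hypothetical toy data: 4 documents, 3 binary features, binary labels.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 0, 0]
kept = select_top_k(X, y, k=1)
print(kept, project(X, kept))
```

Because the scores are computed once, before training, a reduced feature set of this kind shrinks every weight vector learned afterwards, which is the source of the memory and speed savings the abstract reports.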
Is social categorization spatially organized in a “Mental Line”? Empirical evidences for spatial bias in intergroup differentiation
Social categorization is the differentiation between the self and others and between one’s own group and other groups and it is such a natural and spontaneous process that often we are not aware of it. The way in which the brain organizes social categorization remains an unresolved issue. We present three experiments investigating the hypothesis that social categories are mentally ordered from left to right on an ingroup–outgroup continuum when membership is salient. To substantiate our hypothesis, we consider empirical evidence from two areas of psychology: research on differences in processing of ingroups and outgroups and research on the effects of spatial biases on processing of quantitative information (e.g., time; numbers) which appears to be arranged from left to right on a small–large continuum, an effect known as the spatial-numerical association of response codes (SNARC). In Experiments 1 and 2 we tested the hypothesis that when membership of a social category is activated, people implicitly locate ingroup categories to the left of a mental line whereas outgroup categories are located on the far right of the same mental line. This spatial organization persists even when stimuli are presented on one of the two sides of the screen and their (explicit) position is spatially incompatible with the implicit mental spatial organization of social categories (Experiment 3). Overall the results indicate that ingroups and outgroups are processed differently. The results are discussed with respect to social categorization theory, spatial agency bias, i.e., the effect observed in Western cultures whereby the agent of an action is mentally represented on the left and the recipient on the right, and the SNARC effect.