Analysis of a very large web search engine query log
In this paper we present an analysis of an AltaVista Search Engine query log consisting of approximately 1 billion entries for search requests over a period of six weeks. This represents almost 285 million user sessions, each an attempt to fill a single information need. We present an analysis of individual queries, query duplication, and query sessions. We also present results of a correlation analysis of the log entries, studying the interaction of terms within queries. Our data supports the conjecture that web users differ significantly from the user assumed in the standard information retrieval literature. Specifically, we show that web users type in short queries, mostly look at the first 10 results only, and seldom modify the query. This suggests that traditional information retrieval techniques may not work well for answering web search requests. The correlation analysis showed that the most highly correlated items are constituents of phrases. This result indicates it may be useful for search engines to consider search terms as parts of phrases even if the user did not explicitly specify them as such
What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries
We analyze the question queries submitted to a large commercial web search engine to get insights about what people ask, and to better tailor the search results to the users’ needs. Based on a dataset of about one billion question queries submitted during the year 2012, we investigate askers’ querying behavior with the support of automatic query categorization. While the importance of question queries is likely to increase, at present they only make up 3–4% of the total search traffic. Since questions are such a small part of the query stream and are more likely to be unique than shorter queries, clickthrough information is typically rather sparse. Thus, query categorization methods based on the categories of clicked web documents do not work well for questions. As an alternative, we propose a robust question query classification method that uses the labeled questions from a large community question answering platform (CQA) as a training set. The resulting classifier is then transferred to the web search questions. Even though questions on CQA platforms tend to be different from web search questions, our categorization method proves competitive with strong baselines with respect to classification accuracy. To show the scalability of our proposed method we apply the classifiers to about one billion question queries and discuss the trade-offs between performance and accuracy that different classification models offer. Our findings reveal what people ask a search engine and also how this contrasts with behavior on a CQA platform
Deriving query suggestions for site search
Modern search engines have been moving away from simplistic interfaces that aimed at satisfying a user's need with a single-shot query. Interactive features are now integral parts of web search engines. However, generating good query modification suggestions remains a challenging issue. Query log analysis is one of the major strands of work in this direction. Although much research has been performed on query logs collected on the web as a whole, query log analysis to enhance search on smaller and more focused collections has attracted less attention, despite its increasing practical importance. In this article, we report on a systematic study of different query modification methods applied to a substantial query log collected on a local website that already uses an interactive search engine. We conducted experiments in which we asked users to assess the relevance of potential query modification suggestions that have been constructed using a range of log analysis methods and different baseline approaches. The experimental results demonstrate the usefulness of log analysis to extract query modification suggestions. Furthermore, our experiments demonstrate that a more fine-grained approach than grouping search requests into sessions allows for extraction of better refinement terms from query log files. © 2013 ASIS&T
Searching the intranet: Corporate users and their queries
By examining the log files from a corporate intranet search engine, we have analysed the actual web searching
behaviour of real users in a real business environment. While building on previous research on public search engines, we apply an alternative session definition that we argue is more appropriate. Our results regarding session length, query construction and result page viewing confirm some of the findings from similar studies carried out on public search engines but further our understanding of web searching by presenting details on corporate users’ activities. In particular, we suggest that search sessions are shorter than previously suggested, search queries have fewer terms than observed for public search engines, and the number of examined result pages is smaller than reported in other research. More research on how corporate intranet users search for information is needed
Cross Validation Of Neural Network Applications For Automatic New Topic Identification
There are recent studies in the literature on automatic topic-shift identification in Web search engine user sessions; however most of this work applied their topic-shift identification algorithms on data logs from a single search engine. The purpose of this study is to provide the cross-validation of an artificial neural network application to automatically identify topic changes in a web search engine user session by using data logs of different search engines for training and testing the neural network. Sample data logs from the Norwegian search engine FAST (currently owned by Overture) and Excite are used in this study. Findings of this study suggest that it could be possible to identify topic shifts and continuations successfully on a particular search engine user session using neural networks that are trained on a different search engine data log
Query Chains: Learning to Rank from Implicit Feedback
This paper presents a novel approach for using clickthrough data to learn
ranked retrieval functions for web search results. We observe that users
searching the web often perform a sequence, or chain, of queries with a similar
information need. Using query chains, we generate new types of preference
judgments from search engine logs, thus taking advantage of user intelligence
in reformulating queries. To validate our method we perform a controlled user
study comparing generated preference judgments to explicit relevance judgments.
We also implemented a real-world search engine to test our approach, using a
modified ranking SVM to learn an improved ranking function from preference
data. Our results demonstrate significant improvements in the ranking given by
the search engine. The learned rankings outperform both a static ranking
function and one trained without considering query chains. Comment: 10 page
The egalitarian effect of search engines
Search engines have become key media for our scientific, economic, and social
activities by enabling people to access information on the Web in spite of its
size and complexity. On the down side, search engines bias the traffic of users
according to their page-ranking strategies, and some have argued that they
create a vicious cycle that amplifies the dominance of established and already
popular sites. We show that, contrary to these prior claims and our own
intuition, the use of search engines actually has an egalitarian effect. We
reconcile theoretical arguments with empirical evidence showing that the
combination of retrieval by search engines and search behavior by users
mitigates the attraction of popular pages, directing more traffic toward less
popular sites, even in comparison to what would be expected from users randomly
surfing the Web. Comment: 9 pages, 8 figures, 2 appendices. The final version of this e-print
has been published in Proc. Natl. Acad. Sci. USA 103(34), 12684-12689
(2006), http://www.pnas.org/cgi/content/abstract/103/34/1268
Efficient Diversification of Web Search Results
In this paper we analyze the efficiency of various search results
diversification methods. While efficacy of diversification approaches has been
deeply investigated in the past, response time and scalability issues have been
rarely addressed. A unified framework for studying performance and feasibility
of result diversification solutions is thus proposed. First we define a new
methodology for detecting when, and how, query results need to be diversified.
To this purpose, we rely on the concept of "query refinement" to estimate the
probability of a query to be ambiguous. Then, relying on this novel ambiguity
detection method, we deploy and compare on a standard test set, three different
diversification methods: IASelect, xQuAD, and OptSelect. While the first two
are recent state-of-the-art proposals, the latter is an original algorithm
introduced in this paper. We evaluate both the efficiency and the effectiveness
of our approach against its competitors by using the standard TREC Web
diversification track testbed. Results show that OptSelect is able to run two
orders of magnitude faster than the two other state-of-the-art approaches and
to obtain comparable figures in diversification effectiveness. Comment: VLDB201
Spatio-textual indexing for geographical search on the web
Many web documents refer to specific geographic localities and many
people include geographic context in queries to web search engines. Standard
web search engines treat the geographical terms in the same way as other terms.
This can result in failure to find relevant documents that refer to the place of
interest using alternative related names, such as those of included or nearby
places. This can be overcome by associating text indexing with spatial indexing
methods that exploit geo-tagging procedures to categorise documents with
respect to geographic space. We describe three methods for spatio-textual
indexing based on multiple spatially indexed text indexes, attaching spatial
indexes to the document occurrences of a text index, and merging text index
access results with results of access to a spatial index of documents. These
schemes are compared experimentally with a conventional text index search
engine, using a collection of geo-tagged web documents, and are shown to be
able to compete in speed and storage performance with pure text indexing