103 research outputs found

The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives

Authors: Fröbe Maik; Gienapp Lukas; Hagen Matthias; Potthast Martin; Reimer Jan Heinrich; Scells Harrisen; Schmidt Sebastian; Stein Benno
Publication date: 31/07/2023

The Archive Query Log (AQL) is a previously unused, comprehensive query log collected at the Internet Archive over the last 25 years. Its first version includes 356 million queries, 166 million search result pages, and 1.7 billion search results across 550 search providers. Although many query logs have been studied in the literature, the search providers that own them generally do not publish their logs to protect user privacy and vital business data. Of the few query logs publicly available, none combines size, scope, and diversity. The AQL is the first to do so, enabling research on new retrieval models and (diachronic) search engine analyses. Provided in a privacy-preserving manner, it promotes open research as well as more transparency and accountability in the search industry. Comment: SIGIR 2023 resource paper, 13 pages

arXiv.org e-Print Archive

Query recommendation in the information domain of children

Authors: Bilal; Gao; Haveliwala
Publication venue: Wiley

Crossref

Query Suggestion and Data Fusion in Contextual Disambiguation

Authors: Cucerzan S.; Friedman J. H.; Mihalkova L.; Shaw J. A.
Publication venue: Association for Computing Machinery (ACM)

Crossref

Recommended from our members

A user-centred approach to information retrieval

Author: Aloteibi Saad
Publication venue: University of Cambridge
Publication date: 18/12/2020

A user model is a fundamental component in user-centred information retrieval systems. It enables personalization of a user's search experience. The development of such a model involves three phases: collecting information about each user, representing such information, and integrating the model into a retrieval application. Progress in this area is typically met with privacy and scalability challenges that hinder the ability to synthesize collective knowledge from each user's search behaviour. In this thesis, I propose a framework that addresses each of these three phases. The proposed framework is based on social role theory from the social science literature and at the centre of this theory is the concept of a social position. A social position is a label for a group of users with similar behavioural patterns. Examples of such positions are traveller, patient, movie fan, and computer scientist. In this thesis, a social position acts as a label for users who are expected to have similar interests. The proposed framework does not require real users' data; rather it uses the web as a resource to model users. The proposed framework offers a data-driven and modular design for each of the three phases of building a user model. First, I present an approach to identify social positions from natural language sentences. I formulate this task as a binary classification task and develop a method to enumerate candidate social positions. The proposed classifier achieves an accuracy score of 85.8%, which indicates that social positions can be identified with good accuracy. Through an inter-annotator agreement study, I further show a reasonable level of agreement between users when identifying social positions. Second, I introduce a novel topic modelling-based approach to represent each social position as a multinomial distribution over words. This approach estimates a topic from a document collection for each position. 
To construct such a collection for a particular position, I propose a seeding algorithm that extracts a set of terms relevant to the social position. Coherence-based evaluation shows that the proposed approach learns significantly more coherent representations when compared with a relevance modelling baseline. Third, I present a diversification approach based on the proposed framework. Diversification algorithms aim to return a result list for a search query that would potentially satisfy users with diverse information needs. I propose to identify social positions that are relevant to a search query. These positions act as an implicit representation of the many possible interpretations of the search query. Then, relevant positions are provided to a diversification technique that proportionally diversifies results based on each social position's importance. I evaluate my approach using four test collections provided by the diversity task of the Text REtrieval Conference (TREC) web tracks for 2009, 2010, 2011, and 2012. Results demonstrate that my proposed diversification approach is effective and provides statistically significant improvements over various implicit diversification approaches. Fourth, I introduce a session-based search system under the framework of learning to rank. Such a system aims to improve the retrieval performance for a search query using previous user interactions during the search session. I present a method to match a search session to its most relevant social positions based on the session's interaction data. I then suggest identifying related sessions from query logs that are likely to be issued by users with similar information needs. Novel learning features are then estimated from the session's social positions, related sessions, and interaction data. I evaluate the proposed system using four test collections from the TREC session track. 
This approach achieves state-of-the-art results compared with effective session-based search systems. I demonstrate that such a strong performance is mainly attributed to features that are derived from social positions' data.

Apollo (Cambridge)

Synsets improve short text clustering for search support: combining LDA and WordNet

Author: Wang Xiaomeng
Publication venue: University of North Carolina at Chapel Hill
Publication date: 01/01/2018

In this study, I propose a short text clustering approach that uses WordNet as an external resource to cluster documents from corpus.byu.edu. Experimental results show that the approach considerably improves clustering performance. The factors that influence the performance of the topic model are the total number of documents, the distribution of Synsets among topics, and the word overlap between the query’s Synsets. Performance is also affected by Synsets that are missing from WordNet. Finally, I outline how clustering approaches can generate ranked query suggestions to disambiguate a query. Combined with the Synsets of the query, text document clustering can provide an effective way to disambiguate a user's search query by organizing a large set of search results into a small number of groups labeled with Synsets from WordNet. Master of Science in Information Science

Carolina Digital Repository

Exploratory information searching in the enterprise: a study of user satisfaction and task performance.

Authors: Burnett Simon; Cleverley Paul H.; Muir Laura
Publication venue: Wiley
Publication date: 23/09/2015

No prior research has been identified that investigates the causal factors for workplace exploratory search task performance. The impact of user, task, and environmental factors on user satisfaction and task performance was investigated through a mixed methods study with 26 experienced information professionals using enterprise search in an oil and gas enterprise. Some participants found 75% of high-value items, others found none, with an average of 27%. No association was found between self-reported search expertise and task performance, with a tendency for many participants to overestimate their search expertise. Successful searchers may have more accurate mental models of both search systems and the information space. Organizations may not have effective exploratory search task performance feedback loops, a lack of learning. This may be caused by management bias towards technology, not capability, a lack of systems thinking. Furthermore, organizations may not “know” they “don't know” their true level of search expertise, a lack of knowing. A metamodel is presented identifying the causal factors for workplace exploratory search task performance. Semistructured qualitative interviews with search staff from the defense, pharmaceutical, and aerospace sectors indicate the potential transferability of the finding that organizations may not know their search expertise levels.

Open Access Institutional Repository at Robert Gordon University

The Use of Social Tags in Text and Image Searching on the Web.

Author: Kim Yong-Mi

In recent years, tags have become a standard feature on a diverse range of sites on the Web, accompanying blog posts, photos, videos, and online news stories. Tags are descriptive terms attached to Internet resources. Despite the rapid adoption of tagging, how people use tags during the search process is not well understood. There is little empirical data on the use and perceptions of tags created by those other than the searcher. Previous research on tags focused on the motivations and behaviors of taggers, although non-taggers represent a larger proportion of Web users than taggers. This study examines how people use tags, created by others, during the search process. Forty-eight subjects were each assigned four search tasks in a within-subjects study. Subjects searched for text documents and images in a controlled laboratory setting, using information retrieval interfaces differing in their incorporation of tags. User behavior and perception data were collected through search logs and interviews. Both direct and indirect uses of tags across the search process were examined. Tags are used directly when they are clicked on, resulting in a new query, while tags are used indirectly when used for judgments of relevance or to obtain additional terms for query reformulation. Tags increased interactions with the information retrieval system, as subjects issued more queries and saw more search results when using the tagged interface. For both text and image searches, tags were used for query reformulation, predictive judgment, and evaluative judgment of relevance. Subjects interacted most frequently with tags on the search results page, using them for query reformulation and predictive judgment. Tags were more likely to be used for predictive judgment in text searches than in image searches. Subjects’ understanding of tags focused on the role of tags in search, especially findability through a search engine. 
Tags were not uniformly perceived as being user-generated; site owners and automatic generation were mentioned as sources of tags. Several implications for the design of search interfaces and presentation of tags to support information interactions are discussed in the conclusion. Ph.D., Information, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/89816/1/kimym_1.pd

Deep Blue Documents at the University of Michigan

DIR 2011: Dutch-Belgian Information Retrieval Workshop Amsterdam

Authors: Boscarino C.; de Rijke M.; Hofmann K.; Jijkoun V.; Meij E.; Weerkamp W.
Publication venue: University of Amsterdam, Information and Language Processing group
Publication date: 01/01/2011

International Migration, Integration and Social Cohesion online publications

Named entity recognition and classification in search queries

Author: Alasiry Areej Mohammed

Named Entity Recognition and Classification is the task of extracting instances of different entity classes, such as person, location, or company, from text. This task has recently been applied to web search queries in order to better understand their semantics, where a search query consists of linguistic units that users submit to a search engine to convey their search need. Discovering and analysing the linguistic units comprising a search query enables search engines to reveal and meet users' search intents. As a result, recent research has concentrated on analysing the constituent units comprising search queries. However, since search queries are short, unstructured, and ambiguous, an approach to detect and classify named entities is presented in this thesis, in which queries are augmented with the text snippets of search results for search queries. The thesis makes the following contributions: 1. A novel method for detecting candidate named entities in search queries, which utilises both query grammatical annotation and query segmentation. 2. A novel method to classify the detected candidate entities into a set of target entity classes, by using a seed expansion approach; the method presented exploits the representation of the sets of contextual clues surrounding the entities in the snippets as vectors in a common vector space. 3. An exploratory analysis of three main categories of search refiners: nouns, verbs, and adjectives, that users often incorporate in entity-centric queries in order to further refine the entity-related search results. 4. A taxonomy of named entities derived from a search engine query log. By using a large commercial query log, experimental evidence is provided that the work presented herein is competitive with the existing research in the field of entity recognition and classification in search queries.