The Archive Query Log: Mining Millions of Search Result Pages of Hundreds of Search Engines from 25 Years of Web Archives
The Archive Query Log (AQL) is a previously unused, comprehensive query log
collected at the Internet Archive over the last 25 years. Its first version
includes 356 million queries, 166 million search result pages, and 1.7 billion
search results across 550 search providers. Although many query logs have been
studied in the literature, the search providers that own them generally do not
publish their logs to protect user privacy and vital business data. Of the few
query logs publicly available, none combines size, scope, and diversity. The
AQL is the first to do so, enabling research on new retrieval models and
(diachronic) search engine analyses. Provided in a privacy-preserving manner,
it promotes open research as well as more transparency and accountability in
the search industry. Comment: SIGIR 2023 resource paper, 13 pages
A user-centred approach to information retrieval
A user model is a fundamental component in user-centred information retrieval systems. It enables personalization of a user's search experience. The development of such a model involves three phases: collecting information about each user, representing such information, and integrating the model into a retrieval application. Progress in this area is typically met with privacy and scalability challenges that hinder the ability to synthesize collective knowledge from each user's search behaviour. In this thesis, I propose a framework that addresses each of these three phases. The proposed framework is based on social role theory from the social science literature and at the centre of this theory is the concept of a social position. A social position is a label for a group of users with similar behavioural patterns. Examples of such positions are traveller, patient, movie fan, and computer scientist. In this thesis, a social position acts as a label for users who are expected to have similar interests. The proposed framework does not require real users' data; rather it uses the web as a resource to model users.
The proposed framework offers a data-driven and modular design for each of the three phases of building a user model. First, I present an approach to identify social positions from natural language sentences. I formulate this task as a binary classification task and develop a method to enumerate candidate social positions. The proposed classifier achieves an accuracy score of 85.8%, which indicates that social positions can be identified with good accuracy. Through an inter-annotator agreement study, I further show a reasonable level of agreement between users when identifying social positions.
Second, I introduce a novel topic modelling-based approach to represent each social position as a multinomial distribution over words. This approach estimates a topic from a document collection for each position. To construct such a collection for a particular position, I propose a seeding algorithm that extracts a set of terms relevant to the social position. Coherence-based evaluation shows that the proposed approach learns significantly more coherent representations when compared with a relevance modelling baseline.
Third, I present a diversification approach based on the proposed framework. Diversification algorithms aim to return a result list for a search query that would potentially satisfy users with diverse information needs. I propose to identify social positions that are relevant to a search query. These positions act as an implicit representation of the many possible interpretations of the search query. Then, relevant positions are provided to a diversification technique that proportionally diversifies results based on each social position's importance. I evaluate my approach using four test collections provided by the diversity task of the Text REtrieval Conference (TREC) web tracks for 2009, 2010, 2011, and 2012. Results demonstrate that my proposed diversification approach is effective and provides statistically significant improvements over various implicit diversification approaches.
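The abstract does not name the diversification technique that consumes the relevant positions, so the following is only a minimal sketch of one well-known proportional scheme (a PM-2-style greedy re-ranking), shown here as an assumption rather than as the thesis's method. The function name, the `doc_scores` and `position_weights` structures, and the `lam` trade-off parameter are all hypothetical.

```python
from typing import Dict, List


def proportional_diversify(
    doc_scores: Dict[str, Dict[str, float]],
    position_weights: Dict[str, float],
    k: int,
    lam: float = 0.5,
) -> List[str]:
    """Greedy PM-2-style re-ranking (illustrative assumption, not the thesis's code).

    doc_scores[d][p]    -- estimated relevance of document d to social position p
    position_weights[p] -- importance of position p for the query
    """
    seats = {p: 0.0 for p in position_weights}  # share of the ranking each position holds so far
    remaining = set(doc_scores)
    ranking: List[str] = []

    while remaining and len(ranking) < k:
        # Sainte-Lague quotient: important positions with few seats score highest.
        quotients = {
            p: position_weights[p] / (2.0 * seats[p] + 1.0) for p in position_weights
        }
        # Iterating in sorted order makes tie-breaking deterministic.
        target = max(sorted(quotients), key=quotients.get)

        def gain(doc: str) -> float:
            scores = doc_scores[doc]
            own = quotients[target] * scores.get(target, 0.0)
            others = sum(
                quotients[p] * scores.get(p, 0.0) for p in quotients if p != target
            )
            return lam * own + (1.0 - lam) * others

        best = max(sorted(remaining), key=gain)
        ranking.append(best)
        remaining.remove(best)

        # Split the seat just filled among positions in proportion to the document's relevance.
        total = sum(doc_scores[best].get(p, 0.0) for p in position_weights)
        if total > 0:
            for p in position_weights:
                seats[p] += doc_scores[best].get(p, 0.0) / total

    return ranking


# Hypothetical query with two relevant positions of unequal importance.
docs = {
    "t1": {"traveller": 0.9},
    "t2": {"traveller": 0.8},
    "t3": {"traveller": 0.7},
    "p1": {"patient": 0.9},
}
weights = {"traveller": 2.0, "patient": 1.0}
top3 = proportional_diversify(docs, weights, k=3)
```

In this toy input, the less important position still receives a place in the top three because its quotient overtakes the dominant position's once the latter has been served.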
Fourth, I introduce a session-based search system under the framework of learning to rank. Such a system aims to improve the retrieval performance for a search query using previous user interactions during the search session. I present a method to match a search session to its most relevant social positions based on the session's interaction data. I then suggest identifying related sessions from query logs that are likely to be issued by users with similar information needs. Novel learning features are then estimated from the session's social positions, related sessions, and interaction data. I evaluate the proposed system using four test collections from the TREC session track. This approach achieves state-of-the-art results compared with effective session-based search systems. I demonstrate that this strong performance is mainly attributed to features that are derived from social positions' data.
Synsets improve short text clustering for search support: combining LDA and WordNet
This study proposes a short text clustering approach that uses WordNet as an external resource to cluster documents from corpus.byu.edu. Experimental results show that the approach substantially improves clustering performance. The factors that influence the performance of the topic model are the total number of documents, the distribution of Synsets among topics, and the word overlap between the query’s Synsets. Performance is also affected by Synsets that are missing from WordNet. Finally, the study outlines how clustering approaches could generate ranked query suggestions to disambiguate a query. Combined with the Synsets of the query, text document clustering can provide an effective way to disambiguate a user's search query by organizing a large set of search results into a small number of groups labeled with Synsets from WordNet. Master of Science in Information Science
Exploratory information searching in the enterprise: a study of user satisfaction and task performance.
No prior research has been identified that investigates the causal factors for workplace exploratory search task performance. The impact of user, task, and environmental factors on user satisfaction and task performance was investigated through a mixed methods study with 26 experienced information professionals using enterprise search in an oil and gas enterprise. Some participants found 75% of high-value items, others found none, with an average of 27%. No association was found between self-reported search expertise and task performance, with a tendency for many participants to overestimate their search expertise. Successful searchers may have more accurate mental models of both search systems and the information space. Organizations may not have effective exploratory search task performance feedback loops, a lack of learning. This may be caused by management bias towards technology, not capability, a lack of systems thinking. Furthermore, organizations may not “know” they “don't know” their true level of search expertise, a lack of knowing. A metamodel is presented identifying the causal factors for workplace exploratory search task performance. Semistructured qualitative interviews with search staff from the defense, pharmaceutical, and aerospace sectors indicate the potential transferability of the finding that organizations may not know their search expertise levels.
The Use of Social Tags in Text and Image Searching on the Web.
In recent years, tags have become a standard feature on a diverse range of sites on the Web, accompanying blog posts, photos, videos, and online news stories. Tags are descriptive terms attached to Internet resources. Despite the rapid adoption of tagging, how people use tags during the search process is not well understood. There is little empirical data on the use and perceptions of tags created by those other than the searcher. Previous research on tags focused on the motivations and behaviors of taggers, although non-taggers represent a larger proportion of Web users than taggers. This study examines how people use tags, created by others, during the search process.
Forty-eight subjects were each assigned four search tasks in a within-subjects study. Subjects searched for text documents and images in a controlled laboratory setting, using information retrieval interfaces differing in their incorporation of tags. User behavior and perception data were collected through search logs and interviews. Both direct and indirect uses of tags across the search process were examined. Tags are used directly when they are clicked on, resulting in a new query, while tags are used indirectly when used for judgments of relevance or to obtain additional terms for query reformulation.
Tags increased interactions with the information retrieval system, as subjects issued more queries and saw more search results when using the tagged interface. For both text and image searches, tags were used for query reformulation, predictive judgment, and evaluative judgment of relevance. Subjects interacted most frequently with tags on the search results page, using them for query reformulation and predictive judgment. Tags were more likely to be used for predictive judgment in text searches than in image searches. Subjects’ understanding of tags focused on the role of tags in search, especially findability through a search engine. Tags were not uniformly perceived as being user-generated; site owners and automatic generation were mentioned as sources of tags. Several implications for the design of search interfaces and presentation of tags to support information interactions are discussed in the conclusion. Ph.D., Information, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/89816/1/kimym_1.pd
Named entity recognition and classification in search queries
Named Entity Recognition and Classification is the task of extracting from text instances of
different entity classes such as person, location, or company. This task has recently been
applied to web search queries in order to better understand their semantics, where a search
query consists of linguistic units that users submit to a search engine to convey their search
need. Discovering and analysing the linguistic units comprising a search query enables search
engines to reveal and meet users' search intents. As a result, recent research has concentrated
on analysing the constituent units comprising search queries. However, search queries are
short, unstructured, and ambiguous. This thesis therefore presents an approach to detect and
classify named entities in which queries are augmented with the text snippets of their
search results.
The thesis makes the following contributions:
1. A novel method for detecting candidate named entities in search queries, which utilises
both query grammatical annotation and query segmentation.
2. A novel method to classify the detected candidate entities into a set of target entity
classes, by using a seed expansion approach; the method presented exploits the representation
of the sets of contextual clues surrounding the entities in the snippets as vectors
in a common vector space.
3. An exploratory analysis of three main categories of search refiners: nouns, verbs, and
adjectives, that users often incorporate in entity-centric queries in order to further refine
the entity-related search results.
4. A taxonomy of named entities derived from a search engine query log.
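Contribution 2 can be illustrated with a small sketch. It assumes a bag-of-words representation of the contextual clues, cosine similarity, and a fixed threshold, none of which is specified in the abstract. All function names, the threshold value, and the toy snippets are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def context_vector(snippets: Iterable[str], entity: str) -> Counter:
    """Count the snippet words surrounding an entity, excluding the entity's own tokens."""
    entity_tokens = set(entity.lower().split())
    vec: Counter = Counter()
    for snippet in snippets:
        for token in snippet.lower().split():
            if token not in entity_tokens:
                vec[token] += 1
    return vec


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def seed_expand(
    seed_vectors: Dict[str, List[Counter]],
    candidates: Dict[str, Counter],
    threshold: float = 0.3,  # assumed cut-off, not taken from the thesis
    max_rounds: int = 5,
) -> Dict[str, str]:
    """Label candidates by similarity to class profiles, folding new members back in."""
    # Each class profile starts as the sum of its seed entities' context vectors.
    profiles = {label: sum(vecs, Counter()) for label, vecs in seed_vectors.items()}
    labels: Dict[str, str] = {}
    for _ in range(max_rounds):
        newly: Dict[str, str] = {}
        for entity in sorted(candidates):
            if entity in labels:
                continue
            sims = {lab: cosine(candidates[entity], profiles[lab]) for lab in profiles}
            best = max(sorted(sims), key=sims.get)
            if sims[best] >= threshold:
                newly[entity] = best
        if not newly:
            break
        # Expansion step: accepted entities enrich their class profile for the next round.
        for entity, label in newly.items():
            labels[entity] = label
            profiles[label] = profiles[label] + candidates[entity]
    return labels


# Toy snippets, invented for illustration only.
seeds = {
    "location": [context_vector(["cheap hotels in paris", "paris weather forecast"], "paris")],
    "company": [context_vector(["microsoft stock price", "microsoft careers jobs"], "microsoft")],
}
cands = {
    "london": context_vector(["london hotels cheap", "weather in london"], "london"),
    "google": context_vector(["google stock price today"], "google"),
    "asdf": context_vector(["asdf qwerty"], "asdf"),
}
result = seed_expand(seeds, cands)
```

Candidates whose context shares no words with any class profile stay unlabelled, which is why a threshold is used instead of always assigning the nearest class.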
By using a large commercial query log, experimental evidence is provided that the work
presented herein is competitive with the existing research in the field of entity recognition and
classification in search queries.