16 research outputs found
Querylog-based assessment of retrievability bias in a large newspaper corpus
Bias in the retrieval of documents can directly influence the information access of a digital library. In the worst case, systematic favoritism for a certain type of document can render other parts of the collection invisible to users. This potential bias can be evaluated by measuring the retrievability of all documents in a collection. Previous evaluations have been performed on TREC collections using simulated query sets. The question remains, however, how representative this approach is of more realistic settings. To address this question, we investigate the effectiveness of the retrievability measure using a large digitized newspaper corpus, featuring two characteristics that distinguish our experiments from previous studies: (1) compared to TREC collections, our collection contains noise originating from OCR processing, historical spelling and use of language; and (2) instead of simulated queries, we use queries drawn from a real query log.
Retrievability in an Integrated Retrieval System: An Extended Study
Retrievability measures the influence a retrieval system has on access to information in a given collection of items. This measure can support the evaluation of a search system and the insights drawn from it. In this paper, we investigate retrievability in an integrated search system consisting of items from various categories, particularly focusing on datasets, publications and variables in a real-life Digital Library (DL). The traditional metrics, the Lorenz curve and the Gini coefficient, are employed to visualize the diversity in the retrievability scores of the three retrievable document types (datasets, publications, and variables). Our results show a significant popularity bias, with certain items being retrieved more often than others. In particular, certain datasets are more likely to be retrieved than other datasets in the same category, whereas the retrievability scores of items from the variable or publication category are more evenly distributed. Overall, the distribution of document retrievability is more diverse for datasets than for publications and variables.
Comment: To appear in International Journal on Digital Libraries (IJDL). arXiv admin note: substantial text overlap with arXiv:2205.0093
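The Lorenz curve and Gini coefficient used in this study can be computed directly from a set of retrievability scores. A minimal sketch of the Gini computation (the function name and formula variant are our own, not taken from the paper):

```python
def gini(scores):
    """Gini coefficient of a list of retrievability scores.
    0 = perfectly equal access; values near 1 = access concentrated
    on a few items (strong retrievability bias)."""
    xs = sorted(scores)
    n = len(xs)
    total = sum(xs)
    if total == 0:
        return 0.0
    # Standard closed form over the ascending-sorted scores.
    cum = sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs))
    return cum / (n * total)
```

Plotting the cumulative share of scores against the cumulative share of documents (the Lorenz curve) visualizes the same inequality that this coefficient summarizes.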
Competent Men and Warm Women: Gender Stereotypes and Backlash in Image Search Results
There is much concern about algorithms that underlie
information services and the view of the world they
present. We develop a novel method for examining the
content and strength of gender stereotypes in image
search, inspired by the trait adjective checklist method.
We compare the gender distribution in photos retrieved by
Bing for the query “person” and for queries based on 68
character traits (e.g., “intelligent person”) in four regional
markets. Photos of men are more often retrieved for
“person,” as compared to women. As predicted, photos of
women are more often retrieved for warm traits (e.g.,
“emotional”) whereas agentic traits (e.g., “rational”) are
represented by photos of men. A backlash effect, where
stereotype-incongruent individuals are penalized, is
observed. However, backlash is more prevalent for
“competent women” than “warm men.” Results underline
the need to understand how and why biases enter search
algorithms and at which stages of the engineering process.
Abstract Images Have Different Levels of Retrievability Per Reverse Image Search Engine
Much computer vision research has focused on natural images, but technical
documents typically consist of abstract images, such as charts, drawings,
diagrams, and schematics. How well do general web search engines discover
abstract images? Recent advancements in computer vision and machine learning
have led to the rise of reverse image search engines. Where conventional search
engines accept a text query and return a set of document results, including
images, a reverse image search accepts an image as a query and returns a set of
images as results. This paper evaluates how well common reverse image search
engines discover abstract images. We conducted an experiment leveraging images
from Wikimedia Commons, a website known to be well indexed by Baidu, Bing,
Google, and Yandex. We measure how difficult an image is to find again
(retrievability), what percentage of images returned are relevant (precision),
and the average number of results a visitor must review before finding the
submitted image (mean reciprocal rank). When trying to discover the same image
again among similar images, Yandex performs best. When searching for pages
containing a specific image, Google and Yandex outperform the others when
discovering photographs, with precision scores of 0.8191 and 0.8297,
respectively. In both of these cases, Google and Yandex perform better with
natural images than with abstract ones, achieving a difference in retrievability
as high as 54% between images in these categories. These results affect anyone
applying common web search engines to search for technical documents that use
abstract images.
Comment: 20 pages; 7 figures; to be published in the proceedings of the Drawings and abstract Imagery: Representation and Analysis (DIRA) Workshop from ECCV 202
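The mean reciprocal rank reported in this study can be sketched as follows; the function name and input shape are illustrative assumptions, not the paper's implementation:

```python
def mean_reciprocal_rank(runs):
    """MRR over a set of queries.
    runs: list of (ranked_results, relevant_item) pairs.
    A query whose relevant item is absent contributes 0."""
    total = 0.0
    for results, relevant in runs:
        if relevant in results:
            # Reciprocal of the 1-based rank of the first relevant hit.
            total += 1.0 / (results.index(relevant) + 1)
    return total / len(runs)
```

A higher MRR means a visitor reviews fewer results, on average, before finding the submitted image again.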
Analyzing the influence of bigrams on retrieval bias and effectiveness
Prior work on using retrievability measures in the evaluation of information retrieval (IR) systems has laid out the foundations for investigating the relationship between retrieval effectiveness and retrieval bias. While various factors influencing bias have been examined, no work has examined the impact of using bigrams within the index on retrieval bias. Intuitively, how the documents are represented, and what terms they contain, will influence whether they are retrievable or not. In this paper, we investigate how the bias of a system changes depending on whether the documents are represented using unigrams, bigrams or both. Our analysis of three different retrieval models on three TREC collections shows that a bigram-only representation results in the lowest bias compared to a unigram-only representation, but at the expense of retrieval effectiveness. However, combining both representations reduces the overall bias while also increasing effectiveness. These findings suggest that, when configuring and indexing the collection, the bag-of-words approach (unigrams) should be augmented with bigrams to create better and fairer retrieval systems.
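The combined unigram-plus-bigram document representation described above can be illustrated with a small tokenisation sketch (the `a_b` joining convention and function name are assumptions, not the paper's indexing pipeline):

```python
def index_terms(text, use_unigrams=True, use_bigrams=True):
    """Index terms for a unigram, bigram, or combined representation."""
    tokens = text.lower().split()
    terms = []
    if use_unigrams:
        terms.extend(tokens)
    if use_bigrams:
        # Adjacent token pairs, joined so they index as single terms.
        terms.extend(f"{a}_{b}" for a, b in zip(tokens, tokens[1:]))
    return terms
```

Under the combined setting, a document contributes both its individual words and its word pairs to the index, which is the representation the study found to be both more effective and less biased.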
Investigating User Perception of Gender Bias in Image Search
There is growing evidence that search engines produce results that are socially biased, reinforcing a view of the world that aligns with prevalent social stereotypes. One means to promote greater transparency of search algorithms - which are typically complex and proprietary - is to raise user awareness of biased result sets. However, to date, little is known concerning how users perceive bias in search results, and the degree to which their perceptions differ and/or might be predicted based on user attributes. One particular area of search that has recently gained attention, and forms the focus of this study, is image retrieval and gender bias. We conduct a controlled experiment via crowdsourcing using participants recruited from three countries to measure the extent to which workers perceive a given image result set to be subjective or objective. Demographic information about the workers, along with measures of sexism, are gathered and analysed to investigate whether (gender) biases in the image search results can be detected. Amongst other findings, the results confirm that sexist people are less likely to detect and report gender biases in image search results.
Fairness and transparency throughout a digital humanities workflow: Challenges and recommendations
How can we achieve sufficient levels of transparency and fairness for (humanities) research
based on historical newspapers? Which concrete measures should be taken by data providers
such as libraries, research projects and individual researchers? We approach these questions
from the vantage point that digitised newspapers are complex sources with a high degree
of heterogeneity caused by a long chain of processing steps, ranging from, e.g., digitisation
policies and copyright restrictions to the evolving performance of tools for their enrichment,
such as OCR or article segmentation. Overall, we emphasise the need for careful documentation
of data processing and research practices, and the acknowledgement of support from institutions
and collaborators.
The relationship between retrievability bias and retrieval performance
A long-standing problem in the domain of Information Retrieval (IR) has been the influence of biases within an IR system on the ranked results presented to a user. Retrievability is an IR evaluation measure which provides a means to assess the level of bias present in a system by evaluating how \emph{easily} documents in the collection can be found by the IR system in place. Retrievability is intrinsically related to retrieval performance because a document needs to be retrieved before it can be judged relevant. It is therefore reasonable to expect that lowering the level of bias present within a system could lead to improvements in retrieval performance. In this thesis, we undertake an investigation of the nature of the relationship between classical retrieval performance and retrievability bias. We explore the interplay between the two as we alter different aspects of the IR system in an attempt to investigate the \emph{Fairness Hypothesis}: that a system which is fairer (i.e. exerts the least amount of retrievability bias) performs better.
To investigate the relationship between retrievability bias and retrieval performance we utilise a set of 6 standard TREC collections (3 news and 3 web) and a suite of standard retrieval models. We investigate this relationship by looking at four main aspects of the retrieval process using this set of TREC collections to also explore how generalisable the findings are. We begin by investigating how the retrieval model used relates to both bias and performance by issuing a large set of queries to a set of common retrieval models. We find a general trend where using a retrieval model that is evaluated to be more \emph{fair} (i.e. less biased) leads to improved performance over less fair systems. Hinting that providing documents with a more equal opportunity for access can lead to better retrieval performance.
Following on from our first study, we investigate how bias and performance are affected by tuning the length normalisation of several parameterised retrieval models. We explore the space of the length normalisation parameters of BM25, PL2 and Language Modelling. We find that tuning these parameters often leads to a trade-off between performance and bias, such that minimising bias will often not equate to maximising performance when traditional TREC performance measures are used. However, we find that measures which account for document length and users' stopping strategies tend to evaluate the least biased settings as also the maximum (or near-maximum) performing settings, indicating that the Fairness Hypothesis holds.
Following this, we investigate the impact that query length has on retrievability bias. We issue various automatically generated query sets to the system to see whether longer or shorter queries tend to influence the level of bias associated with the system. We find that longer queries tend to reduce bias, possibly because longer queries often lead to more documents being retrieved, but the reductions in bias show diminishing returns: beyond two-term queries, each additional term reduces bias by significantly less.
Finally, we build on our work by employing some fielded retrieval models. We look at typical fielding, where the field relevance scores are computed individually then combined, and compare it with an enhanced version of fielding, where fields are weighted and combined then scored. We see that there are inherent biases against particular documents in the former model, especially in cases where a field is empty and as such see the latter tends to both perform better and also lower bias when compared with the former.
In this thesis, we have examined several different ways in which performance and bias can be related. We conclude that while the Fairness Hypothesis has its merits, it is not a universally applicable idea. We further note that the method used to compute bias does not distinguish between positive and negative biases, and this influences our results. We do, however, support the idea that reducing the bias of a system by eliminating biases that are known to be negative should result in improvements in system performance.
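The retrievability measure at the centre of this thesis is typically computed under a cumulative model: r(d) counts how many queries retrieve document d within a rank cutoff c across a large query set. A minimal sketch under that assumption (names are ours, not the thesis's code):

```python
def retrievability(doc_ids, run_results, cutoff=100):
    """r(d) under the cumulative model: the number of queries that
    retrieve d within the top `cutoff` ranks."""
    r = {d: 0 for d in doc_ids}
    for ranked in run_results:      # one ranked result list per query
        for d in ranked[:cutoff]:
            if d in r:
                r[d] += 1
    return r
```

Feeding the resulting r(d) values into a Gini coefficient gives a single bias score for the system, which is how the fairness of different configurations is compared.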
Expanding the Usage of Web Archives by Recommending Archived Webpages Using Only the URI
Web archives are a window to view past versions of webpages. When a user requests a webpage on the live Web, such as http://tripadvisor.com/where_to_travel/, the webpage may not be found, which results in a HyperText Transfer Protocol (HTTP) 404 response. The user then may search for the webpage in a Web archive, such as the Internet Archive. Unfortunately, if this page had never been archived, the user will not be able to view the page, nor will the user gain any information on other webpages that have similar content in the archive, such as the archived webpage http://classy-travel.net. Similarly, if the user requests the webpage http://hokiesports.com/football/ from the Internet Archive, the user will only find the requested webpage, and the user will not gain any information on other webpages that have similar content in the archive, such as the archived webpage http://techsideline.com. In this research, we will build a model for selecting and ranking possible recommended webpages at a Web archive. This is to enhance both HTTP 404 responses and HTTP 200 responses by surfacing webpages in the archive that the user may not know existed. First, we detect semantics in the requested Uniform Resource Identifier (URI). Next, we classify the URI using an ontology, such as DMOZ or any website directory. Finally, we filter and rank candidates based on several features, such as archival quality, webpage popularity, temporal similarity, and content similarity. We measure the performance of each step using different techniques, including calculating the F1 measure for the different tokenization methods and for the classification. We tested the model using human evaluation to determine if we could classify and find recommendations for a sample of requests from the Internet Archive's Wayback Machine access log. Overall, when selecting the full categorization, reviewers agreed with 80.3% of the recommendations, which is much higher than "do not agree" and "I do not know".
But when selecting the first level only, reviewers agreed with only 25.5% of the recommendations. This indicates that deep-level categorization improves the performance of finding relevant recommendations.
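The F1 measure used to evaluate the tokenization and classification steps is the harmonic mean of precision and recall; a minimal sketch:

```python
def f1(precision, recall):
    """F1 score: harmonic mean of precision and recall.
    Returns 0.0 when both are zero to avoid division by zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because it is a harmonic mean, F1 rewards methods that balance the two components: a tokenizer with perfect precision but zero recall still scores 0.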