Increasing the Efficiency of High-Recall Information Retrieval
The goal of high-recall information retrieval (HRIR) is to find all,
or nearly all, relevant documents while maintaining reasonable assessment effort.
Achieving high recall is a key problem in the use of applications such as
electronic discovery, systematic review, and construction of test collections for
information retrieval tasks. State-of-the-art HRIR systems commonly rely on iterative relevance feedback in which
human assessors continually assess machine learning-selected documents.
The relevance of the assessed documents is then fed back to
the machine learning model to improve its ability to select the next set of
potentially relevant documents for assessment. In many instances, thousands of human assessments might be required to achieve high recall. These assessments represent the main cost of such HRIR
applications. Therefore, their effectiveness in achieving high recall
is limited by their reliance on human input when assessing the relevance of
documents. In this thesis, we test different methods to improve the effectiveness and
efficiency of finding relevant documents using a state-of-the-art HRIR
system. With regard to effectiveness, we try to build a machine-learned
model that retrieves relevant documents more accurately.
For efficiency, we try to help human assessors make
relevance assessments more easily and quickly via our HRIR system.
Furthermore, we try to establish a stopping criterion for the
assessment process so as to avoid excessive assessment.
In particular, we hypothesize that total assessment effort to achieve high
recall can be reduced by using shorter document excerpts
(e.g., extractive summaries) in place of full documents for the assessment of
relevance and using a high-recall retrieval system based on continuous active
learning (CAL). In order to test this hypothesis, we implemented a
high-recall retrieval system based on a state-of-the-art implementation of CAL. This system could display
either full documents or short document excerpts for relevance assessment.
A search engine was also integrated into our system to provide
assessors the option of conducting interactive search and judging.
We conducted a simulation study, and separately, a 50-person controlled user study to test our hypothesis.
The results of the simulation study show that judging even a single
extracted sentence for relevance feedback may be adequate for CAL
to achieve high recall. The results of the controlled user study
confirmed that human assessors were able to find
a significantly larger number of relevant documents within limited time when they used the
system with paragraph-length document excerpts as opposed to full documents.
In addition, we found that allowing participants to compose and execute their
own search queries did not improve their ability to find relevant
documents and, by some measures, impaired performance.
Moreover, integrating sampling methods with active
learning can yield accurate estimates of the number of relevant documents, and thus avoid excessive assessments.
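The iterative relevance-feedback loop described above can be sketched as follows. This is a minimal illustration, not the thesis's actual system: the scoring model (term overlap with documents already judged relevant) is a stand-in for the machine-learned classifier, and the batch size and judging function are hypothetical.

```python
# Minimal sketch of a continuous active learning (CAL) loop for
# high-recall retrieval. Term-overlap scoring stands in for the
# machine-learned model; `judge` stands in for the human assessor.

def score(doc, relevant_terms):
    """Score a document by its term overlap with known-relevant documents."""
    return len(set(doc.split()) & relevant_terms)

def cal_loop(docs, judge, seed_terms, batch_size=1):
    """Repeatedly present the top-scored unjudged documents for assessment
    and feed each judgment back into the model, until all docs are judged."""
    relevant_terms = set(seed_terms)
    judged, found = set(), []
    while len(judged) < len(docs):
        # Select the highest-scoring unjudged documents for assessment.
        batch = sorted((i for i in range(len(docs)) if i not in judged),
                       key=lambda i: score(docs[i], relevant_terms),
                       reverse=True)[:batch_size]
        for i in batch:
            judged.add(i)
            if judge(docs[i]):  # human relevance assessment
                found.append(i)
                # Relevance feedback: enrich the model with the new judgment.
                relevant_terms |= set(docs[i].split())
    return found
```

In a real CAL system the loop would also apply a stopping criterion (for example, a sample-based estimate of the remaining relevant documents) rather than exhausting the collection.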
Overview of the CLEF 2018 Consumer Health Search Task
This paper details the collection, systems and evaluation
methods used in the CLEF 2018 eHealth Evaluation Lab, Consumer
Health Search (CHS) task (Task 3). This task investigates the effectiveness of search engines in providing access to medical information present
on the Web for people with little or no medical knowledge. The task
aims to foster advances in the development of search technologies for
Consumer Health Search by providing resources and evaluation methods
to test and validate search systems. Built upon the 2013-17 series
of CLEF eHealth Information Retrieval tasks, the 2018 task considers
both mono- and multilingual retrieval, embracing the Text REtrieval
Conference (TREC)-style evaluation process with a shared collection of
documents and queries, the contribution of runs from participants and
the subsequent formation of relevance assessments and evaluation of the
participants' submissions.
For this year, the CHS task uses a new Web corpus and a new set of
queries compared to the previous years. The new corpus consists of Web
pages acquired from CommonCrawl, and the new set of queries consists of 50 queries issued by the general public to the Health on the Net
(HON) search services. We then manually translated the 50 queries into
French, German, and Czech, and obtained English query variations of
the 50 original queries.
A total of 7 teams from 7 different countries participated in the 2018 CHS
task: CUNI (Czech Republic), IMS Unipd (Italy), MIRACL (Tunisia),
QUT (Australia), SINAI (Spain), UB-Botswana (Botswana), and UEvora
(Portugal).
On Design and Evaluation of High-Recall Retrieval Systems for Electronic Discovery
High-recall retrieval is an information retrieval task model where the goal is to
identify, for human consumption, all, or as many as practicable, documents relevant to
a particular information need.
This thesis investigates the ways in which one can evaluate high-recall retrieval
systems and explores several design considerations that should be accounted for when designing
such systems for electronic discovery.
The primary contribution of this work is a framework for conducting high-recall retrieval
experimentation in a controlled and repeatable way.
This framework builds upon lessons learned from similar tasks to facilitate the use
of retrieval systems on collections that cannot be distributed due to the sensitivity
or privacy of the material contained within.
Accordingly, a Web API is used to distribute document collections,
information needs, and corresponding relevance assessments in a one-document-at-a-time manner.
Validation is conducted through the successful deployment of this architecture in the 2015 TREC
Total Recall track over the live Web and in controlled environments.
Using the runs submitted to the Total Recall track and other test collections, we explore the
efficacy of a variety of new and existing effectiveness measures to high-recall retrieval tasks.
We find that summarizing the trade-off between recall and the effort required to attain that
recall is a non-trivial task and that several measures are sensitive to properties of the test
collections themselves.
We conclude that the gain curve, a de facto standard, and its variants are the most robust to
variations in test collection properties for the evaluation of high-recall systems.
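The gain curve summarizes exactly the recall-versus-effort trade-off discussed above. A minimal sketch, assuming recall is computed after each assessment in the order the system presented documents (the thesis's exact plotting conventions are not specified here):

```python
# Sketch of a gain curve for a high-recall retrieval run: cumulative
# recall as a function of review effort. `judgments` is the relevance
# (True/False) of each document in presentation order; `total_relevant`
# is the number of relevant documents in the collection.

def gain_curve(judgments, total_relevant):
    """Return recall after each assessment (effort = index + 1)."""
    found, curve = 0, []
    for rel in judgments:
        found += rel  # True counts as 1
        curve.append(found / total_relevant)
    return curve
```

A run that front-loads relevant documents produces a curve that rises steeply and flattens early; comparing curves directly avoids committing to a single effort cutoff the way fixed-rank measures do.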
This thesis also explores the effect that non-authoritative, surrogate assessors can have
when training machine learning algorithms.
Contrary to popular thought, we find that surrogate assessors appear to be inferior
to authoritative assessors due to differences of opinion rather than innate inferiority in
their ability to identify relevance.
Furthermore, we show that several techniques for diversifying and liberalizing a surrogate
assessor's conception of relevance can yield substantial improvement in the surrogate
and, in some cases, rival the authority.
Finally, we present the results of a user study conducted to investigate the effect that
three archetypal high-recall retrieval systems have on judging behaviour.
Compared to using random and uncertainty sampling, selecting documents for training using relevance sampling significantly decreases the probability that
a user will identify a given document as relevant.
On the other hand, no substantial difference between the test conditions is observed in the time taken to render
such assessments.
General Index to Research Notes for: A History of Blacks in Kentucky, Part II
This index, general in nature, is organized under seventeen larger topics: Camp Nelson; Slavery; Slave Hiring; Free Blacks; Underground Railroad - Fugitives; Post-Civil War Living Conditions; Society & Culture - Medical Care; Professions - Employment; Freedmen's Bureau; Civil Rights; Politics; Recreation; Population; Segregation - Changes in the 1890s; Civil War; Education; Religion.
Under these general headings, there are numerous subtopics. The research notes are numbered and presented in numerical order, and they are searchable by note numbers, names, dates, events, and topics (occasional hand-written numbers may not appear in searches). There are no missing notes, but there are occasional missing numbers, especially near the end of the research notes. The areas of missing numbers are between notes 4099 and 5000 (the largest group), between 6325-A and 6412, between 7148-D and 7158, and between 7328-B and 7379. My hope is that historians at all levels interested in African American history will find my research notes helpful in their research.
Part I contains introduction, index and cards 1 to 1676 - http://digitalcommons.wku.edu/hist_ky/1/
Part II contains cards 1677 to 3737 - http://digitalcommons.wku.edu/hist_ky/2
Part III contains cards 3738 to 6155-B - http://digitalcommons.wku.edu/hist_ky/3
Part IV contains cards 6155-C to 7429-B - http://digitalcommons.wku.edu/hist_ky/
GVSU Press Releases, 1973
A compilation of press releases for the year 1973 submitted by University Communications (formerly News & Information Services) to news agencies concerning the people, places, and events related to Grand Valley State University.