15 research outputs found
Denmark's Participation in the Search Engine TREC COVID-19 Challenge: Lessons Learned about Searching for Precise Biomedical Scientific Information on COVID-19
This report describes the participation of two Danish universities,
University of Copenhagen and Aalborg University, in the international search
engine competition on COVID-19 (the 2020 TREC-COVID Challenge) organised by the
U.S. National Institute of Standards and Technology (NIST) and its Text
Retrieval Conference (TREC) division. The aim of the competition was to find
the best search engine strategy for retrieving precise biomedical scientific
information on COVID-19 from the largest, at that point in time, dataset of
curated scientific literature on COVID-19 -- the COVID-19 Open Research Dataset
(CORD-19). CORD-19 was the result of a call to action to the tech community by
the U.S. White House in March 2020, and was shortly thereafter posted on Kaggle
as an AI competition by the Allen Institute for AI, the Chan Zuckerberg
Initiative, Georgetown University's Center for Security and Emerging
Technology, Microsoft, and the National Library of Medicine at the US National
Institutes of Health. CORD-19 contained over 200,000 scholarly articles (of
which more than 100,000 were with full text) about COVID-19, SARS-CoV-2, and
related coronaviruses, gathered from curated biomedical sources. The TREC-COVID
challenge asked for the best way to (a) retrieve accurate and precise
scientific information, in response to some queries formulated by biomedical
experts, and (b) rank this information decreasingly by its relevance to the
query.
In this document, we describe the TREC-COVID competition setup, our
participation in it, and our resulting reflections and lessons learned about
state-of-the-art technology when faced with the acute task of retrieving
precise scientific information from a rapidly growing corpus of literature, in
response to highly specialised queries, in the middle of a pandemic.
Semantic Modelling of Citation Contexts for Context-Aware Citation Recommendation
Contents
The four CSV files are the data used for the evaluation in:
Saier T., Färber M. (2020) Semantic Modelling of Citation Contexts for Context-Aware Citation Recommendation. In: Advances in Information Retrieval. ECIR 2020. Lecture Notes in Computer Science, vol 12035.
DOI: 10.1007/978-3-030-45439-5_15
Code: github.com/IllDepence/ecir2020
The evaluation was conducted in a citation re-prediction setting.
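In broad terms, citation re-prediction hides a known citation and checks whether the recommender ranks the true cited document near the top of its candidate ranking. A minimal sketch of such an evaluation loop, with a placeholder scoring function standing in for the actual model (the function names and the toy scorer are illustrative, not part of the released code):

```python
def recall_at_k(ranked_ids, true_id, k):
    """1 if the true cited document appears in the top-k ranking, else 0."""
    return int(true_id in ranked_ids[:k])

def evaluate(samples, candidates, score_fn, k=10):
    """samples: (citation_context, true_cited_doc_id) pairs.

    score_fn(context, candidate_id) is a placeholder for the model's
    relevance score; higher means more likely to be the cited document.
    """
    hits = 0
    for context, true_id in samples:
        # Rank all candidate documents by model score, best first.
        ranked = sorted(candidates, key=lambda c: score_fn(context, c), reverse=True)
        hits += recall_at_k(ranked, true_id, k)
    return hits / len(samples)
```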
CSV Format
7 columns, divided by \u241E:

1. cited document ID
   - for *_nomarker.csv: citation marker position ambiguous
   - for *_withmarker.csv: citation marker position at 'MAINCIT' in the citation context
2. adjacent cited document IDs
   - only given in citrec_unarxive_*.csv
   - divided by \u241F
   - order matches the 'CIT' markers in the citation context
3. citing document ID
4. citation context
5. MAG field of study IDs
   - divided by \u241F
6. predicate:argument tuples generated based on PredPatt
   - JSON
7. noun phrases
   - for *_nomarker.csv: divided by \u241F
   - for *_withmarker.csv: divided by \u241D into (1) the noun phrases and (2) the noun phrase directly preceding the citation marker
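The column layout above can be read with plain string splitting on the three separator characters. A hypothetical sketch, assuming the column order listed above; the function name is illustrative and the handling of \u241F inside the first noun-phrase group is an assumption, not stated in the format description:

```python
import json

COL_SEP = "\u241E"    # separates the 7 columns
ITEM_SEP = "\u241F"   # separates items within a column
GROUP_SEP = "\u241D"  # separates noun-phrase groups (withmarker files only)

def parse_row(line: str, withmarker: bool = True) -> dict:
    """Parse one row of a citrec_*.csv file into a dict (illustrative)."""
    (cited_id, adjacent_ids, citing_id, context,
     fos_ids, predpatt_json, noun_col) = line.rstrip("\n").split(COL_SEP)
    if withmarker:
        groups = noun_col.split(GROUP_SEP)
        # assumption: phrases inside the first group are \u241F-separated
        noun_phrases = groups[0].split(ITEM_SEP)
        np_before_marker = groups[1] if len(groups) > 1 else ""
    else:
        noun_phrases = noun_col.split(ITEM_SEP)
        np_before_marker = None
    return {
        "cited_doc_id": cited_id,
        "adjacent_cited_doc_ids": adjacent_ids.split(ITEM_SEP) if adjacent_ids else [],
        "citing_doc_id": citing_id,
        "citation_context": context,
        "mag_fos_ids": fos_ids.split(ITEM_SEP) if fos_ids else [],
        "predpatt_tuples": json.loads(predpatt_json) if predpatt_json else None,
        "noun_phrases": noun_phrases,
        "np_before_marker": np_before_marker,
    }
```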
Data Sources
citrec_unarxive_cs_withmarker.csv
  data set: unarXive
    Paper DOI: 10.1007/s11192-020-03382-z
    Data DOI: 10.5281/zenodo.2553522
  filter:
    - citing doc from computer science
    - cited doc is cited at least 5 times
citrec_mag_cs_en.csv
  data set: Microsoft Academic Graph (MAG)
    Paper DOI: 10.1145/2740908.2742839
  filter:
    - citing doc from computer science and in English
    - citing doc's abstract given in MAG
    - cited doc is cited at least 50 times
citrec_refseer.csv
  data set: RefSeer
    Paper URL: ojs.aaai.org/index.php/AAAI/article/view/9528
    Data URL: psu.app.box.com/v/refseer
  filter:
    - title, venue, venue type, abstract, and year not NULL for citing and cited docs
citrec_acl-arc_withmarker.csv
  data set: ACL ARC
    Paper URL: aclanthology.org/L08-1005
    Data URL: acl-arc.comp.nus.edu.sg/
  filter:
    - cited doc has a DBLP ID
Paper Citation
@inproceedings{Saier2020ECIR,
  author    = {Tarek Saier and
               Michael F{\"{a}}rber},
  title     = {{Semantic Modelling of Citation Contexts for Context-aware Citation Recommendation}},
  booktitle = {Proceedings of the 42nd European Conference on Information Retrieval},
  pages     = {220--233},
  year      = {2020},
  month     = apr,
  doi       = {10.1007/978-3-030-45439-5_15},
}
Citation recommendation: approaches and datasets
Citation recommendation describes the task of recommending citations for a given text. Due to the overload of published scientific works in recent years on the one hand, and the need to cite the most appropriate publications when writing scientific texts on the other hand, citation recommendation has emerged as an important research topic. In recent years, several approaches and evaluation data sets have been presented. However, to the best of our knowledge, no literature survey has been conducted explicitly on citation recommendation. In this article, we give a thorough introduction to automatic citation recommendation research. We then present an overview of the approaches and data sets for citation recommendation and identify differences and commonalities using various dimensions. Last but not least, we shed light on the evaluation methods and outline general challenges in the evaluation and how to meet them. We restrict ourselves to citation recommendation for scientific publications, as this document type has been studied the most in this area. However, many of the observations and discussions included in this survey are also applicable to other types of text, such as news articles and encyclopedic articles.
Three real-world datasets and neural computational models for classification tasks in patent landscaping
Patent Landscaping, one of the central tasks of intellectual property management, includes selecting and grouping patents according to user-defined technical or application-oriented criteria. While recent transformer-based models have been shown to be effective for classifying patents into taxonomies such as CPC or IPC, there is as yet little research on how to support real-world Patent Landscape Studies (PLSs) using natural language processing methods. With this paper, we release three labeled datasets for PLS-oriented classification tasks covering two diverse domains. We provide a qualitative analysis and report detailed corpus statistics. Most research on neural models for patents has been restricted to leveraging titles and abstracts. We compare strong neural and non-neural baselines, proposing a novel model that takes into account textual information from the patents' full texts as well as embeddings created based on the patents' CPC labels. We find that for PLS-oriented classification tasks, going beyond title and abstract is crucial, CPC labels are an effective source of information, and combining all features yields the best results.
Geographic information extraction from texts
A large volume of unstructured texts, containing valuable geographic information, is available online. This information, provided implicitly or explicitly, is useful not only for scientific studies (e.g., spatial humanities) but also for many practical applications (e.g., geographic information retrieval). Although great progress has been achieved in geographic information extraction from texts, there are still unsolved challenges and issues, ranging from methods, systems, and data, to applications and privacy. Therefore, this workshop will provide a timely opportunity to discuss recent advances, new ideas, and concepts, and also to identify research gaps in geographic information extraction.
Spoken conversational search: audio-only interactive information retrieval
Speech-based web search where no keyboard or screens are available to present search engine results is becoming ubiquitous, mainly through the use of mobile devices and intelligent assistants such as Apple's HomePod, Google Home, or Amazon Alexa. Currently, these intelligent assistants do not maintain a lengthy information exchange. They do not track context or present information suitable for an audio-only channel, and do not interact with the user in a multi-turn conversation. Understanding how users would interact with such an audio-only interaction system in multi-turn information seeking dialogues, and what users expect from these new systems, are unexplored in search settings. In particular, the knowledge on how to present search results over an audio-only channel and which interactions take place in this new search paradigm is crucial to incorporate while producing usable systems. Thus, constructing insight into the conversational structure of information seeking processes provides researchers and developers opportunities to build better systems while creating a research agenda and directions for future advancements in Spoken Conversational Search (SCS). Such insight has been identified as crucial in the growing SCS area. At the moment, limited understanding has been acquired for SCS, for example how the components interact, how information should be presented, or how task complexity impacts the interactivity or discourse behaviours. We aim to address these knowledge gaps. This thesis outlines the breadth of SCS and forms a manifesto advancing this highly interactive search paradigm with new research directions including prescriptive notions for implementing identified challenges. 
We investigate SCS through quantitative and qualitative designs: (i) log and crowdsourcing experiments investigating different interaction and results presentation styles, and (ii) the creation and analysis of the first SCS dataset and annotation schema through designing and conducting an observational study of information seeking dialogues. We propose new research directions and design recommendations based on the triangulation of three different datasets and methods: the log analysis to identify practical challenges and limitations of existing systems while informing our future observational study; the crowdsourcing experiment to validate a new experimental setup for future search engine results presentation investigations; and the observational study to establish the SCS dataset (SCSdata), form the first Spoken Conversational Search Annotation Schema (SCoSAS), and study interaction behaviours for different task complexities. Our principal contributions are based on our observational study, for which we developed a novel methodology utilising a qualitative design. We show that existing information seeking models may be insufficient for the new SCS search paradigm because they inadequately capture meta-discourse functions and the system's role as an active agent. Thus, the results indicate that SCS systems have to support the user through discourse functions and be actively involved in the users' search process. This suggests that interactivity between the user and system is necessary to overcome the increased complexity which has been imposed upon the user and system by the constraints of the audio-only communication channel. We then present the first schematic model for SCS, which is derived from the SCoSAS through the qualitative analysis of the SCSdata. In addition, we demonstrate the applicability of our dataset by investigating the effect of task complexity on interaction and discourse behaviour.
Lastly, we present SCS design recommendations and outline new research directions for SCS. The implications of our work are practical, conceptual, and methodological. The practical implications include the development of the SCSdata, the SCoSAS, and SCS design recommendations. The conceptual implications include the development of a schematic SCS model which identifies the need for increased interactivity and pro-activity to overcome the audio-imposed complexity in SCS. The methodological implications include the development of the crowdsourcing framework, and techniques for developing and analysing SCS datasets. In summary, we believe that our findings can guide researchers and developers to help improve existing interactive systems which are less constrained, such as mobile search, as well as more constrained systems such as SCS systems.