UV follow-up observations of five recently active novae in M31
We recently initiated the Transient UV Objects (TUVO) project, in which we search for serendipitous UV transients in near-real time in Swift/UVOT data using a purpose-built pipeline.
The user perspective in professional information search
Computer Systems, Imagery and Media
Is the search engine of the future a chatbot? (Is de zoekmachine van de toekomst een chatbot?)
Inaugural lecture delivered by Prof. Dr. Suzan Verberne upon accepting the chair of Professor of Natural Language Processing at Leiden University on Monday 3 June 2024. Text also available in English: "Is the search engine of the future a chatbot?"
Computer Systems, Imagery and Media
A Test Collection of Synthetic Documents for Training Rankers: ChatGPT vs. Human Experts
We investigate the usefulness of generative large language models (LLMs) in generating training data for cross-encoder re-rankers in a novel direction: generating synthetic documents instead of synthetic queries. We introduce a new dataset, ChatGPT-RetrievalQA, and compare the effectiveness of strong models fine-tuned on both LLM-generated and human-generated data. We build ChatGPT-RetrievalQA based on an existing dataset, the human ChatGPT comparison corpus (HC3), consisting of multiple public question collections featuring both human- and ChatGPT-generated responses. We fine-tune a range of cross-encoder re-rankers on either human-generated or ChatGPT-generated data. Our evaluation on MS MARCO DEV, TREC DL'19, and TREC DL'20 demonstrates that cross-encoder re-ranking models trained on LLM-generated responses are significantly more effective for out-of-domain re-ranking than those trained on human responses. For in-domain re-ranking, however, the human-trained re-rankers outperform the LLM-trained re-rankers. Our novel findings suggest that generative LLMs have high potential in generating training data for neural retrieval models and can be used to augment training data, especially in domains with less labeled data. ChatGPT-RetrievalQA presents various opportunities for analyzing and improving rankers with both human- and LLM-generated data. Our data, code, and model checkpoints are publicly available.
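As a rough illustration of the fine-tuning setup described above (not the authors' released code), the sketch below trains a cross-encoder re-ranker on (query, response, label) pairs. It assumes the sentence-transformers CrossEncoder API; the file names, field names, base model, and hyperparameters are placeholders.

# Sketch: fine-tune a cross-encoder re-ranker on (query, response) pairs,
# where responses come either from ChatGPT or from human experts.
# Data paths and the JSONL format below are hypothetical placeholders.
import json
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

def load_pairs(path):
    # Each line: {"query": ..., "response": ..., "label": 0 or 1}
    samples = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            samples.append(InputExample(texts=[rec["query"], rec["response"]],
                                        label=float(rec["label"])))
    return samples

# Train two re-rankers: one on LLM-generated, one on human-generated responses.
for source in ("chatgpt", "human"):
    train_samples = load_pairs(f"chatgpt_retrievalqa_{source}_train.jsonl")  # hypothetical path
    model = CrossEncoder("microsoft/MiniLM-L12-H384-uncased", num_labels=1, max_length=512)
    loader = DataLoader(train_samples, shuffle=True, batch_size=32)
    model.fit(train_dataloader=loader, epochs=1, warmup_steps=1000)
    model.save(f"reranker_{source}")

At inference time, each trained model re-scores the candidate passages returned by a first-stage retriever (e.g., BM25) and the list is re-ordered by those scores, the standard two-stage setup behind MS MARCO and TREC DL style comparisons.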
Using skipgrams and POS-based feature selection for patent classification
Contains fulltext: 116289.pdf (publisher's version, Open Access), 19 p.
Transfer Learning for Health-related Twitter Data
Algorithms and the Foundations of Software Technology
Citation Metrics for Legal Information Retrieval Systems
This paper examines citations in legal information retrieval. Citation metrics can be a factor of relevance in the ranking algorithms of legal information retrieval systems. We provide an overview of the Dutch legal publishing culture. To analyze citations in legal publications, we manually analyze a set of documents and record by what (types of) documents they are cited: document type, intended audience, actual audience, and author affiliations. An analysis of 9 cited documents and 217 citing documents shows no strict separation in citations between documents aimed at scholars and documents aimed at practitioners. Our results suggest that citations in legal documents do not measure the impact on scholarly publications and scholars alone, but a broader scope of impact, or relevance, for the legal field.
Computer Science
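To make the first claim concrete (citation metrics as one relevance factor in ranking), here is a minimal, purely illustrative sketch that blends a textual retrieval score with a log-scaled citation count; the weights and scaling are assumptions, not a description of any system analyzed in the paper.

# Sketch: blend a retrieval score with a citation-based prior.
# The 0.8/0.2 weighting and the log scaling are illustrative assumptions.
import math

def blended_score(text_score: float, citation_count: int,
                  w_text: float = 0.8, w_cite: float = 0.2) -> float:
    # Log scaling dampens the effect of very frequently cited documents.
    cite_score = math.log1p(citation_count)
    return w_text * text_score + w_cite * cite_score

# Example: a moderately matching but heavily cited judgment can outrank
# a slightly better textual match with no citations.
print(blended_score(text_score=2.1, citation_count=40))  # ~2.42
print(blended_score(text_score=2.3, citation_count=0))   # 1.84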
CLosER: Conversational Legal Longformer with Expertise-Aware Passage Response Ranker for Long Contexts
In this paper, we investigate the task of response ranking in conversational legal search. We propose a novel method for conversational passage response retrieval (ConvPR) for long conversations in domains with mixed levels of expertise. Conversational legal search is challenging because the domain includes long, multi-participant dialogues with domain-specific language. Furthermore, as opposed to other domains, there is typically a large knowledge gap between the questioner (a layperson) and the responders (lawyers) participating in the same conversation. We collect and release a large-scale real-world dataset called LegalConv with nearly one million legal conversations from a legal community question answering (CQA) platform. We address the particular challenges of processing legal conversations with our novel Conversational Legal Longformer with Expertise-Aware Response Ranker, called CLosER. The proposed method has two main innovations compared to state-of-the-art methods for ConvPR: (i) Expertise-Aware Post-Training, a learning objective that takes into account the knowledge gap between the participants in the conversation; and (ii) a simple but effective strategy for re-ordering the context utterances in long conversations to overcome the limitations of the sparse attention mechanism of the Longformer architecture. Evaluation on LegalConv shows that our proposed method substantially and significantly outperforms existing state-of-the-art models on the response selection task. Our analysis indicates that our Expertise-Aware Post-Training, i.e., continued pre-training or domain/task adaptation, plays an important role in the achieved effectiveness. Our proposed method is generalizable to other tasks with domain-specific challenges and can facilitate future research on conversational search in other domains.
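The abstract does not spell out the re-ordering strategy, so the sketch below is only a hypothetical illustration of innovation (ii): re-order context utterances before truncating the conversation to the encoder's input window, here by keeping expert turns and the most recent turns; CLosER's actual strategy may differ, and the tokenizer and separator token are assumptions.

# Hypothetical sketch of re-ordering long-conversation context before encoding.
# The ordering criterion (expert responses first, then most recent turns) is an
# assumption for illustration; CLosER's actual strategy may differ.
from dataclasses import dataclass

@dataclass
class Utterance:
    text: str
    turn: int          # position in the original conversation
    is_expert: bool    # e.g., the author is a lawyer on the CQA platform

def reorder_context(utterances, tokenizer, max_tokens=4096):
    # Put expert turns first, then the rest; within each group, most recent first,
    # so that truncation drops the oldest layperson turns.
    ordered = sorted(utterances, key=lambda u: (not u.is_expert, -u.turn))
    kept, used = [], 0
    for u in ordered:
        n = len(tokenizer.tokenize(u.text))
        if used + n > max_tokens:
            break
        kept.append(u)
        used += n
    # Restore chronological order among the kept turns for the final input string.
    kept.sort(key=lambda u: u.turn)
    return " </s> ".join(u.text for u in kept)

Sorting back to chronological order after truncation keeps dialogue order intact for the encoder, while the priority ordering ensures that the turns dropped by the length limit are the oldest layperson turns.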
- …