A Robust Logical and Computational Characterisation of Peer-to-Peer Database Systems
In this paper we give a robust logical and computational characterisation of peer-to-peer (p2p) database systems. We first define a precise model-theoretic semantics of a p2p system, which allows for local inconsistency handling. We then characterise the general computational properties of the problem of answering queries posed to such a p2p system. Finally, we devise tight complexity bounds and distributed procedures for the problem of answering queries in a few relevant special cases.
Causal Parrots: Large Language Models May Talk Causality But Are Not Causal
Some argue that scale is all that is needed to achieve AI, even for causal
models. We make it clear that large language models (LLMs) cannot be causal and
explain why we might sometimes feel otherwise. To this end, we define and
exemplify a new subgroup of Structural Causal Models (SCMs) that we call meta
SCMs, which encode causal facts about other SCMs within their variables. We
conjecture that in the cases where LLMs succeed at causal inference, the
underlying data was generated by a corresponding meta SCM that exposed
correlations between causal facts stated in natural language, on which the LLM
was ultimately trained. If our hypothesis holds true, this would imply that
LLMs are like parrots in that they simply recite the causal knowledge embedded
in their training data. Our empirical analysis provides evidence that current
LLMs are even weak 'causal parrots.'
Comment: Published in Transactions on Machine Learning Research (TMLR)
(08/2023). Main paper: 17 pages, References: 3 pages, Appendix: 7 pages.
Figures: 5 main, 3 appendix. Tables: 3 main.
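The meta-SCM idea can be illustrated with a minimal, entirely hypothetical sketch: an ordinary SCM generates object-level data, while a "meta-SCM" holds variables that are statements *about* causal facts, and an LLM-like answerer merely recites whichever facts appeared in its training text. All names, equations, and numbers below are illustrative, not taken from the paper.

```python
import random

random.seed(0)

# A toy object-level SCM: altitude -> temperature
# (higher altitude, lower temperature), with illustrative coefficients.
def sample_scm():
    altitude = random.uniform(0, 3000)                       # exogenous
    temperature = 20.0 - 0.006 * altitude + random.gauss(0, 1)
    return altitude, temperature

alt, temp = sample_scm()
print(f"object-level sample: altitude={alt:.0f}m, temperature={temp:.1f}C")

# A toy "meta-SCM" in the paper's sense: its variables encode causal facts
# about other SCMs, as those facts might appear in natural-language text.
meta_scm = {
    ("altitude", "temperature"): "altitude causes temperature",
    ("temperature", "altitude"): None,  # no such fact in the corpus
}

def llm_like_answer(cause, effect):
    """Recite the causal fact iff it was 'in the training data' --
    correct-looking answers without doing causal inference."""
    return meta_scm.get((cause, effect)) is not None

print(llm_like_answer("altitude", "temperature"))   # fact was in the text
print(llm_like_answer("temperature", "altitude"))   # fact was absent
```

The point of the sketch is the asymmetry: the answerer is right exactly when the fact was written down somewhere, and has no mechanism for deriving facts that were not.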
Unlock Multi-Modal Capability of Dense Retrieval via Visual Module Plugin
This paper proposes Multi-modAl Retrieval model via Visual modulE pLugin
(MARVEL) to learn an embedding space for queries and multi-modal documents to
conduct retrieval. MARVEL encodes queries and multi-modal documents with a
unified encoder model, which helps to alleviate the modality gap between images
and texts. Specifically, we enable the image understanding ability of a
well-trained dense retriever, T5-ANCE, by incorporating the image features
encoded by the visual module as its inputs. To facilitate the multi-modal
retrieval tasks, we build the ClueWeb22-MM dataset based on the ClueWeb22
dataset, which regards anchor texts as queries and extracts the related text
and image documents from the anchor-linked web pages. Our experiments show that
MARVEL significantly outperforms state-of-the-art methods on the multi-modal
retrieval datasets WebQA and ClueWeb22-MM. Our further analyses show that the
visual module plugin method is well suited to adding image understanding to an
existing dense retrieval model. We also show that the language model is able to
extract image semantics from image encoders and adapt the image features to the
input space of the language model. All code is available at
https://github.com/OpenMatch/MARVEL
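As a rough illustration of the plugin idea (not MARVEL's actual implementation; the dimensions and weights below are made up), image features can be projected into the retriever's input embedding space and prepended as extra "tokens", so a single text encoder processes both modalities:

```python
def matmul(A, B):
    """Multiply an (n x k) matrix by a (k x m) matrix (plain lists)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def plug_in_image(token_embeddings, image_features, W_proj):
    """Project image features into the text model's input space and
    prepend them as extra 'tokens' before the text token embeddings."""
    visual_tokens = matmul(image_features, W_proj)
    return visual_tokens + token_embeddings

d_model, d_vision = 4, 2                              # toy sizes
W_proj = [[0.1] * d_model for _ in range(d_vision)]   # illustrative weights
tokens = [[1.0] * d_model for _ in range(5)]          # 5 text-token embeddings
patches = [[1.0] * d_vision for _ in range(3)]        # 3 image-region features
seq = plug_in_image(tokens, patches, W_proj)
print(len(seq), len(seq[0]))  # 8 4: 3 visual + 5 text tokens, width d_model
```

In the real system the projection is learned and the encoder is a trained dense retriever (T5-ANCE); the sketch only shows the input-space composition that makes the unified encoding possible.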
The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems
This paper introduces the Ubuntu Dialogue Corpus, a dataset containing almost
1 million multi-turn dialogues, with a total of over 7 million utterances and
100 million words. This provides a unique resource for research into building
dialogue managers based on neural language models that can make use of large
amounts of unlabeled data. The dataset has both the multi-turn property of
conversations in the Dialog State Tracking Challenge datasets, and the
unstructured nature of interactions from microblog services such as Twitter. We
also describe two neural learning architectures suitable for analyzing this
dataset, and provide benchmark performance on the task of selecting the best
next response.
Comment: SIGDIAL 2015. 10 pages, 5 figures. Update includes a link to a new
version of the dataset, with some added features and bug fixes. See:
https://github.com/rkadlec/ubuntu-ranking-dataset-creato
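The best-next-response task can be sketched with a deliberately simple baseline: rank candidate responses by bag-of-words cosine similarity to the dialogue context. The paper's actual baselines are TF-IDF and neural dual encoders; the strings below are invented examples, not corpus data.

```python
from collections import Counter
from math import sqrt

def bow(text):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = sqrt(sum(c * c for c in u.values()))
    nv = sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def select_best_response(context, candidates):
    """Rank candidate responses by similarity to the dialogue context."""
    return max(candidates, key=lambda r: cosine(bow(context), bow(r)))

context = "how do i install the apache web server on ubuntu"
candidates = [
    "you can install the apache web server with sudo apt-get install apache2",
    "the weather is nice today",
    "reboot and check the kernel version",
]
print(select_best_response(context, candidates))
```

A neural model replaces the count vectors with learned context and response encodings, but the evaluation protocol (pick the best response from a candidate set) is the same.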
Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation
The Differentiable Search Index (DSI) is an emerging paradigm for information
retrieval. Unlike traditional retrieval architectures where index and retrieval
are two different and separate components, DSI uses a single transformer model
to perform both indexing and retrieval.
In this paper, we identify and tackle an important issue of current DSI
models: the data distribution mismatch that occurs between the DSI indexing and
retrieval processes. Specifically, we argue that, at indexing, current DSI
methods learn to build connections between the text of long documents and the
identifier of the documents, but then retrieval of document identifiers is
based on queries that are commonly much shorter than the indexed documents.
This problem is further exacerbated when using DSI for cross-lingual retrieval,
where document text and query text are in different languages.
To address this fundamental problem of current DSI models, we propose a
simple yet effective indexing framework for DSI, called DSI-QG. When indexing,
DSI-QG represents documents with a number of potentially relevant queries
generated by a query generation model and re-ranked and filtered by a
cross-encoder ranker. The presence of these queries at indexing allows the DSI
models to connect a document identifier to a set of queries, hence mitigating
data distribution mismatches present between the indexing and the retrieval
phases. Empirical results on popular mono-lingual and cross-lingual passage
retrieval datasets show that DSI-QG significantly outperforms the original DSI
model.
Comment: 11 pages.
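The DSI-QG indexing idea can be caricatured without any neural machinery: replace each document by a set of generated pseudo-queries and associate those with the document identifier. Below, a keyword-bigram "generator" stands in for the real query-generation model and cross-encoder filter; the corpus and docids are invented.

```python
def generate_queries(doc_text):
    """Hypothetical stand-in for a neural query generator:
    emit keyword bigrams from the document as pseudo-queries."""
    tokens = doc_text.lower().split()
    return {" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)} \
        or {doc_text}

def build_index(corpus):
    """Map each generated pseudo-query to its document identifier.
    This mirrors DSI-QG's idea: train on (query, docid) pairs instead of
    (full document text, docid) pairs, so indexing-time inputs look like
    retrieval-time queries."""
    index = {}
    for docid, text in corpus.items():
        for q in generate_queries(text):
            index.setdefault(q, set()).add(docid)
    return index

def retrieve(index, query):
    """Return the docid whose pseudo-queries best overlap the user query."""
    q_tokens = set(query.lower().split())
    scores = {}
    for pseudo_q, docids in index.items():
        overlap = len(q_tokens & set(pseudo_q.split()))
        for d in docids:
            scores[d] = max(scores.get(d, 0), overlap)
    return max(scores, key=scores.get)

corpus = {
    "d1": "the differentiable search index unifies indexing and retrieval",
    "d2": "query generation improves cross lingual passage retrieval",
}
print(retrieve(build_index(corpus), "differentiable search"))  # d1
```

For cross-lingual DSI-QG the same mechanism applies, except the generator produces queries in the target language, so the query-side distribution matches at both indexing and retrieval time.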