Learnings from Data Integration for Augmented Language Models
One of the limitations of large language models is that they do not have
access to up-to-date, proprietary or personal data. As a result, there are
multiple efforts to extend language models with techniques for accessing
external data. In that sense, LLMs share the vision of data integration systems
whose goal is to provide seamless access to a large collection of heterogeneous
data sources. While the details and the techniques of LLMs differ greatly from
those of data integration, this paper shows that some of the lessons learned
from research on data integration can elucidate the research path we are
conducting today on language models.
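The extension the abstract refers to, giving a language model access to external data, can be sketched as retrieval augmentation. The scoring rule, the `retrieve` and `build_prompt` names, and the toy documents below are illustrative assumptions, not details from the paper:

```python
# Minimal sketch of retrieval augmentation: a toy keyword retriever
# selects external documents to prepend to a language-model prompt.
# The overlap-based scoring and all names here are illustrative
# assumptions, not taken from the paper.

def retrieve(query, documents, k=2):
    """Rank documents by word overlap with the query; return the top k."""
    q = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query, documents):
    """Prepend retrieved context so the model can use up-to-date data."""
    context = "\n".join(retrieve(query, documents))
    return f"Context:\n{context}\n\nQuestion: {query}"

docs = [
    "The 2024 budget for the data team is 1.2M USD.",
    "Our cafeteria serves lunch from noon to 2pm.",
    "The data team hired three engineers in 2024.",
]
prompt = build_prompt("What is the data team budget for 2024?", docs)
```

In a real system the keyword retriever would be replaced by a learned dense retriever; the data-integration analogy in the abstract concerns exactly this step of selecting and mediating among heterogeneous sources.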
Harnessing the Deep Web: Present and Future
Over the past few years, we have built a system that has exposed large
volumes of Deep-Web content to Google.com users. The content that our system
exposes contributes to more than 1000 search queries per second and spans over
50 languages and hundreds of domains. The Deep Web has long been acknowledged
to be a major source of structured data on the web, and hence accessing
Deep-Web content has long been a problem of interest in the data management
community. In this paper, we report on where we believe the Deep Web provides
value and where it does not. We contrast two very different approaches to
exposing Deep-Web content -- the surfacing approach that we used, and the
virtual integration approach that has often been pursued in the data management
literature. We emphasize where the values of each of the two approaches lie and
caution against potential pitfalls. We outline important areas of future
research and, in particular, emphasize the value that can be derived from
analyzing large collections of potentially disparate structured data on the
web.
Comment: CIDR 200
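The two approaches the abstract contrasts can be sketched side by side. The simulated form interface, the index structure, and the query translation below are toy assumptions for exposition, not the paper's system:

```python
# Illustrative contrast of the two approaches to exposing Deep-Web
# content. FORM_RESULTS stands in for a form-based deep-web source;
# everything here is a toy assumption for exposition.

FORM_RESULTS = {  # simulated deep-web form: form inputs -> result records
    "used cars, seattle": ["2012 Civic", "2015 Corolla"],
    "used cars, portland": ["2010 Prius"],
}

# Surfacing: probe likely form inputs offline and index the result
# pages, so answers come from a precomputed index like ordinary pages.
surfaced_index = {}
for inputs, records in FORM_RESULTS.items():
    for rec in records:
        surfaced_index.setdefault(rec.lower(), inputs)

def surfaced_search(term):
    """Answer from the precomputed index; no form is queried at runtime."""
    return [rec for rec in surfaced_index if term.lower() in rec]

# Virtual integration: translate the user's query into a live form
# query against the source at runtime.
def virtual_search(city):
    return FORM_RESULTS.get(f"used cars, {city.lower()}", [])
```

The trade-off the paper discusses is visible even in this sketch: surfacing pays its cost up front and may miss unprobed inputs, while virtual integration needs a per-source query translation but always sees fresh data.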
An XML Query Engine for Network-Bound Data
XML has become the lingua franca for data exchange and integration across administrative and enterprise boundaries. Nearly all data providers are adding XML import or export capabilities, and standard XML Schemas and DTDs are being promoted for all types of data sharing. The ubiquity of XML has removed one of the major obstacles to integrating data from widely disparate sources -- namely, the heterogeneity of data formats.
However, general-purpose integration of data across the wide area also requires a query processor that can query data sources on demand, receive streamed XML data from them, and combine and restructure the data into new XML output -- while providing good performance for both batch-oriented and ad-hoc, interactive queries. This is the goal of the Tukwila data integration system, the first system that focuses on network-bound, dynamic XML data sources. In contrast to previous approaches, which must read, parse, and often store entire XML objects before querying them, Tukwila can return query results even as the data is streaming into the system. Tukwila is built with a new system architecture that extends adaptive query processing and relational-engine techniques into the XML realm, as facilitated by a pair of operators that incrementally evaluate a query's input path expressions as data is read. In this paper, we describe the Tukwila architecture and its novel aspects, and we experimentally demonstrate that Tukwila provides better overall query performance and faster initial answers than existing systems, and has excellent scalability.
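The key idea, evaluating a path expression incrementally while XML is still arriving, can be sketched with Python's pull parser. This is a minimal illustration in the spirit of Tukwila's streaming operators, not the actual Tukwila implementation; the path `/catalog/book/title` and the chunked input are invented:

```python
# Sketch of incremental evaluation over streamed XML: results for the
# path /catalog/book/title are emitted while data is still arriving,
# rather than after the whole document is parsed. Illustrative only;
# this is not Tukwila's actual operator implementation.
from xml.etree.ElementTree import XMLPullParser

def stream_titles(chunks):
    """Yield each <title> text as soon as its end tag has been parsed."""
    parser = XMLPullParser(events=("end",))
    for chunk in chunks:              # chunks arrive over the network
        parser.feed(chunk)
        for _, elem in parser.read_events():
            if elem.tag == "title":
                yield elem.text       # first answers before the stream ends

chunks = ["<catalog><book><title>Tukwila</tit",
          "le></book><book><title>Piazza</title></book></catalog>"]
titles = list(stream_titles(chunks))
```

Note that the first title is available after the second `feed` completes its end tag, well before the closing `</catalog>` is seen; this is what enables fast initial answers for interactive queries.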
Piazza: Data Management Infrastructure for Semantic Web Applications
The Semantic Web envisions a World Wide Web in which data is described with rich semantics and applications can pose complex queries. To this point, researchers have defined new languages for specifying meanings for concepts and developed techniques for reasoning about them, using RDF as the data model. To flourish, the Semantic Web needs to be able to accommodate the huge amounts of existing data and the applications operating on them. To achieve this, we are faced with two problems. First, most of the world's data is available not in RDF but in XML; XML and the applications consuming it rely not only on the domain structure of the data, but also on its document structure. Hence, to provide interoperability between such sources, we must map between both their domain structures and their document structures. Second, data management practitioners often prefer to exchange data through local point-to-point data translations, rather than mapping to common mediated schemas or ontologies. This paper describes the Piazza system, which addresses these challenges. Piazza offers a language for mediating between data sources on the Semantic Web, which maps both the domain structure and document structure. Piazza also enables interoperation of XML data with RDF data that is accompanied by rich OWL ontologies. Mappings in Piazza are provided at a local scale between small sets of nodes, and our query answering algorithm is able to chain sets of mappings together to obtain relevant data from across the Piazza network. We also describe an implemented scenario in Piazza and the lessons we learned from it.
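The chaining of local point-to-point mappings can be sketched as path search over a mapping graph. The node names, record shapes, and mapping functions below are invented for illustration; Piazza's actual mapping language operates over XML/RDF structure, not Python dictionaries:

```python
# Toy sketch of chaining point-to-point mappings: each mapping
# translates records between two neighboring nodes, and a query is
# answered by composing mappings along a path through the network.
# Nodes, schemas, and mapping functions are invented for illustration.
from collections import deque

# local point-to-point mappings: (src, dst) -> translation function
MAPPINGS = {
    ("A", "B"): lambda r: {"name": r["title"]},
    ("B", "C"): lambda r: {"label": r["name"].upper()},
}

def translate(record, src, dst):
    """BFS over the mapping graph, composing mappings along the path."""
    frontier = deque([(src, record)])
    seen = {src}
    while frontier:
        node, rec = frontier.popleft()
        if node == dst:
            return rec
        for (a, b), fn in MAPPINGS.items():
            if a == node and b not in seen:
                seen.add(b)
                frontier.append((b, fn(rec)))
    return None  # no mapping path connects src to dst

result = translate({"title": "piazza"}, "A", "C")
```

Even though no direct A-to-C mapping exists, the record reaches C by composing the A-to-B and B-to-C translations, which mirrors how Piazza answers queries across nodes connected only through intermediaries.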
Detecting Inspiring Content on Social Media
Inspiration moves a person to see new possibilities and transforms the way
they perceive their own potential. Inspiration has received little attention in
psychology, and has not been researched before in the NLP community. To the
best of our knowledge, this work is the first to study inspiration through
machine learning methods. We aim to automatically detect inspiring content from
social media data. To this end, we analyze social media posts to tease out what
makes a post inspiring and what topics are inspiring. We release a dataset of
unique IDs for 5,800 inspiring and 5,800 non-inspiring English-language public
posts, collected from a dump of Reddit public posts made available by a third
party, and we use linguistic heuristics to automatically detect which
English-language social media posts are inspiring.
Comment: accepted at ACII 202
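The kind of linguistic heuristic the abstract mentions can be sketched as a cue-word filter. The cue list and threshold below are invented for illustration and are not the heuristics used in the paper:

```python
# Minimal sketch of a linguistic heuristic for flagging candidate
# inspiring posts. The cue-word list and threshold are invented
# examples, not the heuristics the paper actually uses.
INSPIRING_CUES = {"dream", "overcome", "achieve", "hope", "possibility"}

def looks_inspiring(post, threshold=1):
    """Flag a post if it contains at least `threshold` cue words."""
    words = {w.strip(".,!?").lower() for w in post.split()}
    return len(words & INSPIRING_CUES) >= threshold

flagged = looks_inspiring(
    "Never stop chasing your dream, you can overcome anything!")
```

A heuristic like this would only produce noisy candidate labels; the paper's point is that such signals can bootstrap the study of a construct, inspiration, that lacks annotated data.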
Multimodal Neural Databases
The rise in loosely-structured data available through text, images, and other
modalities has called for new ways of querying them. Multimedia Information
Retrieval has filled this gap and has witnessed exciting progress in recent
years. Tasks such as search and retrieval of extensive multimedia archives have
undergone massive performance improvements, driven to a large extent by recent
developments in multimodal deep learning. However, methods in this field remain
limited in the kinds of queries they support; in particular, they cannot
answer database-like queries. For this reason, inspired by recent
work on neural databases, we propose a new framework, which we name Multimodal
Neural Databases (MMNDBs). MMNDBs can answer complex database-like queries that
involve reasoning over different input modalities, such as text and images, at
scale. In this paper, we present the first architecture able to fulfill this
set of requirements and test it with several baselines, showing the limitations
of currently available models. The results show the potential of these new
techniques to process unstructured data coming from different modalities,
paving the way for future research in the area. Code to replicate the
experiments will be released at
https://github.com/GiovanniTRA/MultimodalNeuralDatabase
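The "database-like queries over modalities" that an MMNDB targets can be made concrete with a toy aggregate. The records, the `image_labels` field, and the exact-match predicate below are stand-ins for learned multimodal scoring, invented here for illustration:

```python
# Sketch of the kind of database-like multimodal query an MMNDB
# targets: an aggregate whose predicate spans text and image evidence.
# The records and the matching rule are toy stand-ins for learned
# multimodal models.
RECORDS = [
    {"caption": "a red bicycle parked outside",
     "image_labels": {"bicycle", "street"}},
    {"caption": "a bowl of fruit on a table",
     "image_labels": {"apple", "table"}},
    {"caption": "a red car at a traffic light",
     "image_labels": {"car", "street"}},
]

def count_where(records, text_term, image_label):
    """COUNT(*) WHERE caption mentions term AND image shows label."""
    return sum(1 for r in records
               if text_term in r["caption"]
               and image_label in r["image_labels"])

n_red_street = count_where(RECORDS, "red", "street")
```

A retrieval system can return items matching either condition, but answering the count requires joining evidence across both modalities per record, which is the gap the MMNDB framework addresses.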
NormBank: A Knowledge Bank of Situational Social Norms
We present NormBank, a knowledge bank of 155k situational norms. This
resource is designed to ground flexible normative reasoning for interactive,
assistive, and collaborative AI systems. Unlike prior commonsense resources,
NormBank grounds each inference within a multivalent sociocultural frame, which
includes the setting (e.g., restaurant), the agents' contingent roles (waiter,
customer), their attributes (age, gender), and other physical, social, and
cultural constraints (e.g., the temperature or the country of operation). In
total, NormBank contains 63k unique constraints from a taxonomy that we
introduce and iteratively refine here. Constraints then apply in different
combinations to frame social norms. Under these manipulations, norms are
non-monotonic: one can cancel an inference by updating its frame even
slightly. Still, we find evidence that neural models can help reliably extend
the scope and coverage of NormBank. We further demonstrate the utility of this
resource with a series of transfer experiments.
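The non-monotonicity described above, where a small frame update cancels an inference, can be sketched with a single rule. The setting, constraints, and the rule itself are invented examples in the style NormBank describes, not entries from the resource:

```python
# Sketch of a non-monotonic situational norm: an inference holds under
# one sociocultural frame and is cancelled when the frame is updated.
# The rule and constraint names below are invented examples.
def norm_ok(action, frame):
    """Return whether `action` is expected under the frame's constraints."""
    if action == "wear shoes":
        # base expectation in a restaurant setting...
        if frame.get("setting") == "restaurant":
            # ...cancelled by even a small frame update
            if (frame.get("country") == "Japan"
                    and frame.get("area") == "tatami room"):
                return False
            return True
    return False

base = {"setting": "restaurant"}
updated = dict(base, country="Japan", area="tatami room")
```

Adding two constraints flips the inference without touching the base rule, which is the behavior that makes simple monotonic rule bases a poor fit for situational norms.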