5,963 research outputs found

    A Survey on Array Storage, Query Languages, and Systems

    Full text link
    Since scientific investigation is one of the most important providers of massive amounts of ordered data, there is a renewed interest in array data processing in the context of Big Data. To the best of our knowledge, a unified resource that summarizes and analyzes array processing research over its long existence is currently missing. In this survey, we provide a guide for past, present, and future research in array processing. The survey is organized along three main topics. Array storage discusses all the aspects related to array partitioning into chunks. The identification of a reduced set of array operators to form the foundation for an array query language is analyzed across multiple such proposals. Lastly, we survey real systems for array processing. The result is a thorough survey on array data storage and processing that should be consulted by anyone interested in this research topic, independent of experience level. The survey is not complete though. We greatly appreciate pointers towards any work we might have forgotten to mention.Comment: 44 page

    Learning to Map Natural Language to Executable Programs Over Databases

    Get PDF
    Natural language is a fundamental form of information and communication and is becoming the next frontier in computer interfaces. As the amount of data available online has increased exponentially, so has the need for Natural Language Interfaces (NLIs, which is not used for natural language inference in this thesis) to connect the data and the user by easily using natural language, significantly promoting the possibility and efficiency of information access for many users besides data experts. All consumer-facing software will one day have a dialogue interface, and this is the next vital leap in the evolution of search engines. Such intelligent dialogue systems should understand the meaning of language grounded in various contexts and generate effective language responses in different forms for information requests and human-computer communication.Developing these intelligent systems is challenging due to (1) limited benchmarks to drive advancements, (2) alignment mismatches between natural language and formal programs, (3) lack of trustworthiness and interpretability, (4) context dependencies in both human conversational interactions and the target programs, and (5) joint language understanding between dialog questions and NLI environments (e.g. databases and knowledge graphs). This dissertation presents several datasets, neural algorithms, and language models to address these challenges for developing deep learning technologies for conversational natural language interfaces (more specifically, NLIs to Databases or NLIDB). First, to drive advancements towards neural-based conversational NLIs, we design and propose several complex and cross-domain NLI benchmarks, along with introducing several datasets. These datasets enable training large, deep learning models. The evaluation is done on unseen databases. (e.g., about course arrangement). Systems must generalize well to not only new SQL queries but also to unseen database schemas to perform well on these tasks. Furthermore, in real-world applications, users often access information in a multi-turn interaction with the system by asking a sequence of related questions. The users may explicitly refer to or omit previously mentioned entities and constraints and may introduce refinements, additions, or substitutions to what has already been said. Therefore, some of them require systems to model dialog dynamics and generate natural language explanations for user verification. The full dialogue interaction with the systemā€™s responses is also important as this supports clarifying ambiguous questions, verifying returned results, and notifying users of unanswerable or unrelated questions. A robust dialogue-based NLI system that can engage with users by forming its responses has thus become an increasingly necessary component for the query process. Moreover, this thesis presents the development of scalable algorithms designed to parse complex and sequential questions to formal programs (e.g., mapping questions to SQL queries that can execute against databases). We propose a novel neural model that utilizes type information from knowledge graphs to better understand rare entities and numbers in natural language questions. We also introduce a neural model based on syntax tree neural networks, which was the first methodology proposed for generating complex programs from language. Finally, language modeling creates contextualized vector representations of words by training a model to predict the next word given context words, which are the basis of deep learning for NLP. Recently, pre-trained language models such as BERT and RoBERTa achieve tremendous success in many natural language processing tasks such as text understanding and reading comprehension. However, most language models are pre-trained only on free-text such as Wikipedia articles and Books. Given that language in semantic parsing is usually related to some formal representations such as logic forms and SQL queries and has to be grounded in structural environments (e.g., databases), we propose better language models for NLIs by enforcing such compositional interpolation in them. To show they could better jointly understand dialog questions and NLI environments (e.g. databases and knowledge graphs), we show that these language models achieve new state-of-the-art results for seven representative tasks on semantic parsing, dialogue state tracking, and question answering. Also, our proposed pre-training method is much more effective than other prior work

    Knowledge Rich Natural Language Queries over Structured Biological Databases

    Full text link
    Increasingly, keyword, natural language and NoSQL queries are being used for information retrieval from traditional as well as non-traditional databases such as web, document, image, GIS, legal, and health databases. While their popularity are undeniable for obvious reasons, their engineering is far from simple. In most part, semantics and intent preserving mapping of a well understood natural language query expressed over a structured database schema to a structured query language is still a difficult task, and research to tame the complexity is intense. In this paper, we propose a multi-level knowledge-based middleware to facilitate such mappings that separate the conceptual level from the physical level. We augment these multi-level abstractions with a concept reasoner and a query strategy engine to dynamically link arbitrary natural language querying to well defined structured queries. We demonstrate the feasibility of our approach by presenting a Datalog based prototype system, called BioSmart, that can compute responses to arbitrary natural language queries over arbitrary databases once a syntactic classification of the natural language query is made

    Semantics, Modelling, and the Problem of Representation of Meaning -- a Brief Survey of Recent Literature

    Full text link
    Over the past 50 years many have debated what representation should be used to capture the meaning of natural language utterances. Recently new needs of such representations have been raised in research. Here I survey some of the interesting representations suggested to answer for these new needs.Comment: 15 pages, no figure

    Rumble: Data Independence for Large Messy Data Sets

    Full text link
    This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects into, at least, the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of input, commonly found, is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way as SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables.Comment: Preprint, 9 page

    Four Lessons in Versatility or How Query Languages Adapt to the Web

    Get PDF
    Exposing not only human-centered information, but machine-processable data on the Web is one of the commonalities of recent Web trends. It has enabled a new kind of applications and businesses where the data is used in ways not foreseen by the data providers. Yet this exposition has fractured the Web into islands of data, each in different Web formats: Some providers choose XML, others RDF, again others JSON or OWL, for their data, even in similar domains. This fracturing stifles innovation as application builders have to cope not only with one Web stack (e.g., XML technology) but with several ones, each of considerable complexity. With Xcerpt we have developed a rule- and pattern based query language that aims to give shield application builders from much of this complexity: In a single query language XML and RDF data can be accessed, processed, combined, and re-published. Though the need for combined access to XML and RDF data has been recognized in previous work (including the W3Cā€™s GRDDL), our approach differs in four main aspects: (1) We provide a single language (rather than two separate or embedded languages), thus minimizing the conceptual overhead of dealing with disparate data formats. (2) Both the declarative (logic-based) and the operational semantics are unified in that they apply for querying XML and RDF in the same way. (3) We show that the resulting query language can be implemented reusing traditional database technology, if desirable. Nevertheless, we also give a unified evaluation approach based on interval labelings of graphs that is at least as fast as existing approaches for tree-shaped XML data, yet provides linear time and space querying also for many RDF graphs. We believe that Web query languages are the right tool for declarative data access in Web applications and that Xcerpt is a significant step towards a more convenient, yet highly efficient data access in a ā€œWeb of Dataā€
    • ā€¦
    corecore