Automatic Data Transformation Using Large Language Model: An Experimental Study on Building Energy Data
Existing approaches to automatic data transformation are insufficient to meet
the requirements in many real-world scenarios, such as the building sector.
First, there is no convenient interface for domain experts to provide domain
knowledge easily. Second, they require significant training data collection
overheads. Third, the accuracy suffers from complicated schema changes. To
bridge this gap, we present a novel approach that leverages the unique
capabilities of large language models (LLMs) in coding, complex reasoning, and
zero-shot learning to generate SQL code that transforms the source datasets
into the target datasets. We demonstrate the viability of this approach by
designing an LLM-based framework, termed SQLMorpher, which comprises a prompt
generator that integrates the initial prompt with optional domain knowledge and
historical patterns in external databases. It also implements an iterative
prompt optimization mechanism that automatically improves the prompt based on
flaw detection. The key contributions of this work include (1) pioneering an
end-to-end LLM-based solution for data transformation, (2) developing a
benchmark dataset of 105 real-world building energy data transformation
problems, and (3) conducting an extensive empirical evaluation where our
approach achieved 96% accuracy across all 105 problems. SQLMorpher demonstrates
the effectiveness of applying LLMs to complex, domain-specific challenges,
highlighting their potential to drive sustainable solutions.
Comment: 10 pages, 7 figures
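The iterative prompt-optimization loop the abstract describes can be sketched as below. This is a minimal illustration, not the paper's implementation: `call_llm` and `detect_flaws` are hypothetical stand-ins for the model call and the flaw-detection step, and the prompt layout is invented for the example.

```python
def build_prompt(source_schema, target_schema, domain_knowledge=None, feedback=None):
    """Assemble a transformation prompt from the schemas plus optional hints."""
    parts = [
        f"Source schema: {source_schema}",
        f"Target schema: {target_schema}",
        "Write a SQL query transforming the source into the target.",
    ]
    if domain_knowledge:
        parts.insert(2, f"Domain knowledge: {domain_knowledge}")
    if feedback:
        parts.append(f"Previous attempt failed because: {feedback}")
    return "\n".join(parts)

def refine_sql(source_schema, target_schema, call_llm, detect_flaws,
               domain_knowledge=None, max_rounds=3):
    """Ask the model for SQL, re-prompting with detected flaws until clean."""
    feedback = None
    sql = ""
    for _ in range(max_rounds):
        prompt = build_prompt(source_schema, target_schema, domain_knowledge, feedback)
        sql = call_llm(prompt)
        feedback = detect_flaws(sql)  # None means no flaw was found
        if feedback is None:
            return sql
    return sql  # best effort after max_rounds
```

The key design point is that flaw feedback is folded back into the next prompt rather than discarded, which is what lets zero-shot generation improve without any training data.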
A data driven semantic framework for clinical trial eligibility criteria
Title from PDF of title page, viewed on January 17, 2012. Thesis advisor: Deendayal Dinakarpandian. Vita. Includes bibliographic references (p. 90-93). Thesis (M.S.)--School of Computing and Engineering, University of Missouri--Kansas City, 2011.
An important step in the discovery of new treatments for medical conditions is the matching of potential subjects with appropriate clinical trials. Eligibility criteria for clinical trials are typically specified in free text as inclusion and exclusion criteria for each study. While this is sufficient for a human to guide a recruitment interview, it cannot be reliably parsed to identify potential subjects computationally. Standardizing the representation of eligibility criteria can help increase the efficiency and accuracy of this process. This thesis proposes a semantic framework for intelligent match-making that determines a minimal set of eligibility criteria with maximal coverage of clinical trials. In contrast to existing top-down manual standardization efforts, a bottom-up data-driven approach is presented that finds a canonical, non-redundant representation of an arbitrary collection of clinical trial criteria to facilitate intelligent match-making. The approach is based on semantic clustering. The methodology has been validated on a corpus of 708 clinical trials related to Generalized Anxiety Disorder containing 2760 inclusion and 4871 exclusion eligibility criteria. This corpus is represented by a relatively small number of clusters: 126 inclusion clusters and 175 exclusion clusters, each of which represents a semantically distinct criterion. Internal and external validation measures provide an objective evaluation of the method. Based on the clustering, an eligibility criteria ontology has been constructed. The resulting model has been incorporated into the development of the MindTrial clinical trial recruiting system.
The prototype for clinical trial recruitment illustrates the real-world effectiveness of the methodology in characterizing clinical trials and subjects, and in accurately matching between them.
Introduction -- Related work -- Data driven model for clinical trial eligibility criteria -- Creation of mock clinical trial subject database -- Ontology creation for clinical trials -- Case study on clinical trials -- Web interface for GAD eligibility criteria -- Validation -- Conclusion and future work -- Appendix
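The bottom-up clustering idea above can be illustrated with a toy sketch. The thesis clusters criteria semantically; here semantic similarity is approximated by Jaccard overlap of lowercased tokens, and the greedy first-fit assignment is an assumption chosen purely for brevity, not the thesis's actual algorithm.

```python
def jaccard(a, b):
    """Token-overlap similarity between two criterion strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cluster_criteria(criteria, threshold=0.5):
    """Greedy clustering: each criterion joins the first cluster whose
    representative (first member) is similar enough, else starts a new one."""
    clusters = []
    for c in criteria:
        for cl in clusters:
            if jaccard(c, cl[0]) >= threshold:
                cl.append(c)
                break
        else:
            clusters.append([c])
    return clusters
```

Each cluster's representative then stands in for all its members, which is how a few hundred clusters can cover thousands of raw criteria.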
Facilitating Information Access for Heterogeneous Data Across Many Languages
Information access, which enables people to identify, retrieve, and use information freely and effectively, has attracted interest from academia and industry. Systems for document retrieval and question answering have helped people access information in powerful and useful ways. Recently, natural language technologies based on neural networks have been applied to various information access tasks. Specifically, transformer-based pre-trained models have pushed tasks such as document and passage retrieval to new state-of-the-art effectiveness. (1) Most of the research has focused on helping people access passages and documents on the web. However, there is abundant information stored in other formats, such as semi-structured tables and domain-specific relational databases in companies. Developing models and frameworks that support accessing information in these data formats is also essential. (2) Moreover, most of the advances in information access research are based on English, leaving other languages less explored. It is insufficient and inequitable in our globalized and connected world to serve only speakers of English.
In this thesis, we explore and develop models and frameworks that could alleviate the aforementioned challenges. This dissertation consists of three parts. We begin with a discussion of models designed for accessing data in formats other than passages and documents, focusing on two data formats: semi-structured tables and relational databases. In the second part, we discuss methods that can enhance the user experience for non-English speakers when using information access systems. Specifically, we first introduce model development for multilingual knowledge graph integration, which can benefit many information access applications such as cross-lingual question answering systems and other knowledge-driven cross-lingual NLP applications. We further focus on multilingual dense document retrieval and reranking, which boost the effectiveness of search engines for non-English information access. Last but not least, we take a step further, building on the two preceding parts, by investigating models and frameworks that help non-English speakers access structured data. In detail, we present cross-lingual Text-to-SQL semantic parsing systems that enable non-English speakers to query relational databases in their own languages.
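A common baseline for the cross-lingual Text-to-SQL setting described above is translate-then-parse. The sketch below shows only that pipeline shape; `translate` and `parse_to_sql` are hypothetical stand-ins for a machine-translation system and a monolingual semantic parser, and this is not the dissertation's proposed system, which parses cross-lingually rather than via pivot translation.

```python
def cross_lingual_text_to_sql(query, source_lang, translate, parse_to_sql):
    """Translate a non-English query into English, then parse it into SQL.

    translate(text, src, dst) -> str and parse_to_sql(text) -> str are
    injected so the pipeline stays independent of any particular MT system
    or parser.
    """
    english = translate(query, source_lang, "en") if source_lang != "en" else query
    return parse_to_sql(english)
```

The weakness of this baseline, and one motivation for end-to-end cross-lingual parsers, is that translation errors on schema-specific terms propagate directly into the generated SQL.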
Distributed Conversion of RDF Data to the Relational Model
The Resource Description Framework (RDF) stores a growing volume of valuable information. However, relational databases still provide advantages in terms of performance, familiarity, and the number of supported tools. We present RDF2X, a tool for automatic distributed conversion of RDF datasets to the relational model. We provide a comparison of related approaches, report on the conversion of 8.4 billion RDF triples, and demonstrate the contribution of our tool in case studies from two different domains.
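The core of any RDF-to-relational conversion is pivoting subject-predicate-object triples into typed tables. The sketch below illustrates that idea in the simplest possible form: one table per `rdf:type`, predicates as columns, single-valued properties only. It is an illustration of the general technique, not RDF2X itself, which additionally handles multi-valued properties, entity relationships, and distributed execution.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"

def triples_to_tables(triples):
    """Pivot (subject, predicate, object) triples into one table per type.

    Returns {type: [row_dict, ...]} where each row has an 'id' column
    (the subject) plus one column per non-type predicate.
    """
    props = defaultdict(dict)   # subject -> {predicate: object}
    types = {}                  # subject -> rdf:type object
    for s, p, o in triples:
        if p == RDF_TYPE:
            types[s] = o
        else:
            props[s][p] = o     # last value wins: single-valued assumption
    tables = defaultdict(list)
    for s, t in types.items():
        row = {"id": s}
        row.update(props.get(s, {}))
        tables[t].append(row)
    return dict(tables)
```

Subjects without an `rdf:type` triple are dropped here; a production converter has to decide how to table such untyped entities and how to turn multi-valued predicates into junction tables.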
Is Neuro-Symbolic AI Meeting its Promise in Natural Language Processing? A Structured Review
Advocates for Neuro-Symbolic Artificial Intelligence (NeSy) assert that
combining deep learning with symbolic reasoning will lead to stronger AI than
either paradigm on its own. As successful as deep learning has been, it is
generally accepted that even our best deep learning systems are not very good
at abstract reasoning. And since reasoning is inextricably linked to language,
it makes intuitive sense that Natural Language Processing (NLP) would be a
particularly well-suited candidate for NeSy. We conduct a structured review of
studies implementing NeSy for NLP, with the aim of answering the question of
whether NeSy is indeed meeting its promises: reasoning, out-of-distribution
generalization, interpretability, learning and reasoning from small data, and
transferability to new domains. We examine the impact of knowledge
representation, such as rules and semantic networks, language structure and
relational structure, and whether implicit or explicit reasoning contributes to
higher promise scores. We find that systems where logic is compiled into the
neural network lead to the most NeSy goals being satisfied, while other factors
such as knowledge representation or type of neural architecture do not exhibit
a clear correlation with goals being met. We find many discrepancies in how
reasoning is defined, specifically in relation to human level reasoning, which
impact decisions about model architectures and drive conclusions which are not
always consistent across studies. Hence we advocate for a more methodical
approach to the application of theories of human reasoning as well as the
development of appropriate benchmarks, which we hope can lead to a better
understanding of progress in the field. We make our data and code available on
GitHub for further analysis.
Comment: Survey