Learning to Skim Text
Recurrent Neural Networks are showing much promise in many sub-areas of
natural language processing, ranging from document classification to machine
translation to automatic question answering. Despite their promise, many
recurrent models have to read the whole text word by word, making it slow to
handle long documents. For example, it is difficult to use a recurrent network
to read a book and answer questions about it. In this paper, we present an
approach of reading text while skipping irrelevant information if needed. The
underlying model is a recurrent network that learns how far to jump after
reading a few words of the input text. We employ a standard policy gradient
method to train the model to make discrete jumping decisions. In our benchmarks
on four different tasks, including number prediction, sentiment analysis, news
article classification and automatic Q&A, our proposed model, a modified LSTM
with jumping, is up to 6 times faster than the standard sequential LSTM, while
maintaining the same or even better accuracy.
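The read-then-jump control flow described in this abstract can be sketched as follows. This is only an illustration of the reading loop, not the paper's model: the function name `skim`, the `read_size` parameter, and the `choose_jump` callback are hypothetical, and the callback stands in for the trained recurrent network and its policy-gradient-learned jump decision.

```python
from typing import Callable, List, Sequence


def skim(
    tokens: Sequence[str],
    read_size: int,
    choose_jump: Callable[[List[str]], int],
) -> List[int]:
    """Return the indices of the tokens that were actually read.

    Assumption: `choose_jump` is a placeholder for the learned policy.
    It sees the window just read and returns how many tokens to skip.
    """
    read_indices: List[int] = []
    pos = 0
    while pos < len(tokens):
        # Read a few consecutive tokens.
        end = min(pos + read_size, len(tokens))
        window = list(tokens[pos:end])
        read_indices.extend(range(pos, end))
        # Decide how far to jump; negative values are clamped to zero.
        jump = max(0, choose_jump(window))
        pos = end + jump
    return read_indices


if __name__ == "__main__":
    words = "a b c d e f g h i j".split()
    # A fixed jump of 3 is a stand-in for a learned decision.
    print(skim(words, read_size=2, choose_jump=lambda window: 3))
```

With a fixed jump of 3 and a read size of 2, only four of the ten tokens are read, which is where the speed-up over a strictly sequential reader would come from.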
PURR: Efficiently Editing Language Model Hallucinations by Denoising Language Model Corruptions
The remarkable capabilities of large language models have been accompanied by
a persistent drawback: the generation of false and unsubstantiated claims
commonly known as "hallucinations". To combat this issue, recent research has
introduced approaches that involve editing and attributing the outputs of
language models, particularly through prompt-based editing. However, the
inference cost and speed of using large language models for editing currently
bottleneck prompt-based methods. These bottlenecks motivate the training of
compact editors, which is challenging due to the scarcity of training data for
this purpose. To overcome these challenges, we exploit the power of large
language models to introduce corruptions (i.e., noise) into text and
subsequently fine-tune compact editors to denoise the corruptions by
incorporating relevant evidence. Our methodology is entirely unsupervised and
provides us with faux hallucinations for training in any domain. Our Petite
Unsupervised Research and Revision model, PURR, not only improves attribution
over existing editing methods based on fine-tuning and prompting, but also
achieves faster execution times by orders of magnitude.
Design of exceptionally strong and conductive Cu alloys beyond the conventional speculation via the interfacial energy-controlled dispersion of gamma-Al2O3 nanoparticles
The development of Cu-based alloys with high mechanical properties (strength, ductility) and electrical conductivity plays a key role in a wide range of industrial applications. Successful design of such materials, however, has been rare because these properties are conventionally regarded as mutually exclusive. In this paper, we demonstrate that these contradictory material properties can be improved simultaneously if the interfacial energies of heterogeneous interfaces are carefully controlled. We uniformly disperse γ-Al2O3 nanoparticles over the Cu matrix, and then control the atomic-level morphology of the γ-Al2O3//Cu interface by adding Ti solutes. It is shown that the Ti dramatically drives the interfacial phase transformation from very irregular to homogeneous spherical morphologies, resulting in substantial enhancement of the mechanical properties of the Cu matrix. Furthermore, the Ti removes impurities (O and Al) in the Cu matrix by forming oxides, leading to recovery of the electrical conductivity of pure Cu. We validate experimental results using TEM and EDX combined with first-principles density functional theory (DFT) calculations, which all consistently indicate that our materials are suitable for industrial applications.
Extending QGrams to Estimate Selectivity of String Matching with Low Edit Distance
There are many emerging database applications that require accurate selectivity estimation of approximate string matching queries. Edit distance is one of the most commonly used string similarity measures. In this paper, we study the problem of estimating selectivity of string matching with low edit distance. Our framework is based on extending q-grams with wildcards. Based on the concepts of replacement semilattice, string hierarchy and a combinatorial analysis, we develop the formulas for selectivity estimation and provide the algorithm BasicEQ. We next develop the algorithm OptEQ by enhancing BasicEQ with two novel improvements. Finally we show a comprehensive set of experiments using three benchmarks comparing OptEQ with the state-of-the-art method SEPIA. Our experimental results show that OptEQ delivers more accurate selectivity estimations.
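To make the estimated quantity concrete, the sketch below computes the exact selectivity of an edit-distance predicate by brute force. It is not BasicEQ or OptEQ: it is the expensive ground truth that such estimators approximate. The function names and the sample strings are illustrative assumptions.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # delete ca
                    curr[j - 1] + 1,           # insert cb
                    prev[j - 1] + (ca != cb),  # substitute, or match for free
                )
            )
        prev = curr
    return prev[-1]


def exact_selectivity(query: str, strings: Iterable[str], max_distance: int) -> float:
    """Fraction of strings within `max_distance` edits of `query`."""
    strings = list(strings)
    if not strings:
        return 0.0
    matches = sum(1 for s in strings if edit_distance(query, s) <= max_distance)
    return matches / len(strings)


if __name__ == "__main__":
    names = ["smith", "smyth", "smithe", "jones"]  # made-up sample data
    print(exact_selectivity("smith", names, max_distance=1))
```

Scanning every string costs time proportional to the collection size per query, which is why a query optimizer would rely on a compact summary structure instead.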
Selectivity estimation of approximate predicates on text
This dissertation studies selectivity estimation of approximate predicates on text. Intuitively, we aim to count the number of strings that are similar to a given query string. This type of problem is crucial in handling text in RDBMSs in an error-tolerant way.
A common difficulty in handling textual data is that they may contain typographical errors, or use similar but different textual representations for the same real-world entity. To handle such data in databases, approximate text processing has gained extensive interest and commercial databases have begun to incorporate such functionalities. One of the key components in successful integration of approximate text processing in RDBMSs is the selectivity estimation module, which is central in optimizing queries involving such predicates. However, these developments are relatively new and ad-hoc approaches, e.g., using a constant, have been employed.
This dissertation studies reliable selectivity estimation techniques for approximate predicates on text. Among many possible predicates, we focus on two types of predicates which are fundamental building blocks of SQL queries: selections and joins. We study two different semantics for each type of operator. We propose a set of related summary structures and algorithms to estimate selectivity of selection and join operators with approximate matching. A common challenge is that there can be a huge number of variants to consider. The proposed data structures enable efficient counting by considering a group of similar variants together rather than each and every one separately. A lattice-based framework is proposed to consider overlapping counts among the groups.
We performed an extensive evaluation of the proposed techniques using real-world and synthetic data sets. Our techniques support popular similarity measures including edit distance, Jaccard similarity and cosine similarity, and we show how to extend the techniques to other measures. The proposed solutions are compared with state-of-the-art and baseline methods. Experimental results show that the proposed techniques are able to deliver accurate estimates with small space overhead.
Variance Aware Optimization of Parameterized Queries
Parameterized queries are commonly used in database applications. In a parameterized query, the same SQL statement is potentially executed multiple times with different parameter values. In today's DBMSs the query optimizer typically chooses a single execution plan that is reused for multiple instances of the same query. A key problem is that even if a plan with low average cost across instances is chosen, its variance can be high, which is undesirable in many production settings. In this paper, we describe techniques for selecting a plan that can better address the trade-off between the average and variance of cost across instances of a parameterized query. We show how to efficiently compute the skyline in the average-variance cost space. We have implemented our techniques on top of a commercial DBMS. We present experimental results on benchmark and real-world decision support queries.
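The notion of a skyline in the average-variance cost space can be illustrated with a naive sketch. It assumes each candidate plan has a list of observed or estimated costs, one per query instance, and it uses a simple quadratic dominance check rather than the efficient computation the paper describes. The plan names and cost values are made up.

```python
from statistics import fmean, pvariance
from typing import Dict, List, Sequence, Tuple


def cost_summary(costs: Sequence[float]) -> Tuple[float, float]:
    """Mean and population variance of a plan's cost across instances."""
    return fmean(costs), pvariance(costs)


def skyline(plan_costs: Dict[str, Sequence[float]]) -> List[str]:
    """Plans not dominated in (average, variance).

    A plan is dominated if another plan is no worse on both measures
    and strictly better on at least one.
    """
    summaries = {plan: cost_summary(costs) for plan, costs in plan_costs.items()}
    result: List[str] = []
    for name, (mean, var) in summaries.items():
        dominated = any(
            other_mean <= mean
            and other_var <= var
            and (other_mean < mean or other_var < var)
            for other, (other_mean, other_var) in summaries.items()
            if other != name
        )
        if not dominated:
            result.append(name)
    return result


if __name__ == "__main__":
    plans = {
        "A": [10, 10, 10, 10],  # steady cost
        "B": [2, 2, 2, 30],     # lower average, high variance
        "C": [12, 12, 12, 12],  # steady but slower than A
        "D": [5, 5, 5, 40],     # worse than A on both measures
    }
    print(skyline(plans))
```

Here plans A and B remain on the skyline: B has the lower average cost, while A has the lower variance, so neither dominates the other and the choice between them reflects the trade-off the abstract describes.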